Merge tag 'ovl-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs

Pull overlayfs updates from Miklos Szeredi:
 "This contains two new features:

   - Stack file operations: this allows removal of several hacks from
     the VFS, proper interaction of read-only open files with copy-up,
     possibility to implement fs modifying ioctls properly, and others.

   - Metadata only copy-up: when file is on lower layer and only
     metadata is modified (except size) then only copy up the metadata
     and continue to use the data from the lower file"

* tag 'ovl-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (66 commits)
  ovl: Enable metadata only feature
  ovl: Do not do metacopy only for ioctl modifying file attr
  ovl: Do not do metadata only copy-up for truncate operation
  ovl: add helper to force data copy-up
  ovl: Check redirect on index as well
  ovl: Set redirect on upper inode when it is linked
  ovl: Set redirect on metacopy files upon rename
  ovl: Do not set dentry type ORIGIN for broken hardlinks
  ovl: Add an inode flag OVL_CONST_INO
  ovl: Treat metacopy dentries as type OVL_PATH_MERGE
  ovl: Check redirects for metacopy files
  ovl: Move some dir related ovl_lookup_single() code in else block
  ovl: Do not expose metacopy only dentry from d_real()
  ovl: Open file with data except for the case of fsync
  ovl: Add helper ovl_inode_realdata()
  ovl: Store lower data inode in ovl_inode
  ovl: Fix ovl_getattr() to get number of blocks from lower
  ovl: Add helper ovl_dentry_lowerdata() to get lower data dentry
  ovl: Copy up meta inode data from lowest data inode
  ovl: Modify ovl_lookup() and friends to lookup metacopy dentry
  ...
Linus Torvalds committed Aug 22, 2018
2 parents c22fc16 + 989974c commit d9a185f
Showing 33 changed files with 1,619 additions and 595 deletions.
3 changes: 1 addition & 2 deletions Documentation/filesystems/Locking
@@ -21,8 +21,7 @@ prototypes:
char *(*d_dname)((struct dentry *dentry, char *buffer, int buflen);
struct vfsmount *(*d_automount)(struct path *path);
int (*d_manage)(const struct path *, bool);
struct dentry *(*d_real)(struct dentry *, const struct inode *,
unsigned int, unsigned int);
struct dentry *(*d_real)(struct dentry *, const struct inode *);

locking rules:
rename_lock ->d_lock may block rcu-walk
81 changes: 59 additions & 22 deletions Documentation/filesystems/overlayfs.txt
@@ -10,10 +10,6 @@ union-filesystems). An overlay-filesystem tries to present a
filesystem which is the result over overlaying one filesystem on top
of the other.

The result will inevitably fail to look exactly like a normal
filesystem for various technical reasons. The expectation is that
many use cases will be able to ignore these differences.


Overlay objects
---------------
@@ -266,6 +262,30 @@ rightmost one and going left. In the above example lower1 will be the
top, lower2 the middle and lower3 the bottom layer.


Metadata only copy up
--------------------

When metadata only copy up feature is enabled, overlayfs will only copy
up metadata (as opposed to whole file), when a metadata specific operation
like chown/chmod is performed. Full file will be copied up later when
file is opened for WRITE operation.

In other words, this is delayed data copy up operation and data is copied
up when there is a need to actually modify data.

There are multiple ways to enable/disable this feature. A config option
CONFIG_OVERLAY_FS_METACOPY can be set/unset to enable/disable this feature
by default. Or one can enable/disable it at module load time with module
parameter metacopy=on/off. Lastly, there is also a per mount option
metacopy=on/off to enable/disable this feature per mount.

Do not use metacopy=on with untrusted upper/lower directories. Otherwise
it is possible that an attacker can create a handcrafted file with
appropriate REDIRECT and METACOPY xattrs, and gain access to file on lower
pointed by REDIRECT. This should not be possible on local system as setting
"trusted." xattrs will require CAP_SYS_ADMIN. But it should be possible
for untrusted layers like from a pen drive.
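
A toy userspace model of the delayed data copy-up described in this section may help when reading the patches. It is a sketch, not overlayfs code: `struct meta_file`, `model_chmod()` and `model_open()` are hypothetical names, and the model leaves out truncate and attribute-changing ioctls, which this series excludes from metadata-only copy-up.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one overlay file; not a kernel structure. */
struct meta_file {
	bool metacopy_enabled;  /* metacopy=on for this mount */
	bool upper_meta;        /* metadata has been copied up */
	bool upper_data;        /* data has been copied up */
};

/* Full copy-up: the data goes up and the metadata goes with it. */
static void model_copy_up_data(struct meta_file *f)
{
	assert(f);
	f->upper_meta = true;
	f->upper_data = true;
}

/* Metadata-only change such as chmod or chown. */
static void model_chmod(struct meta_file *f)
{
	assert(f);
	if (f->upper_meta)
		return;                 /* already on upper */
	if (f->metacopy_enabled)
		f->upper_meta = true;   /* data stays on lower for now */
	else
		model_copy_up_data(f);  /* feature off: copy everything */
}

/* Opening for write is the point where the data must be on upper. */
static void model_open(struct meta_file *f, bool for_write)
{
	assert(f);
	if (for_write && !f->upper_data)
		model_copy_up_data(f);
}
```

With `metacopy_enabled` set, `model_chmod()` leaves `upper_data` false until the first `model_open(f, true)`; with it clear, the chmod already copies the data.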

Sharing and copying layers
--------------------------

@@ -284,7 +304,7 @@ though it will not result in a crash or deadlock.
Mounting an overlay using an upper layer path, where the upper layer path
was previously used by another mounted overlay in combination with a
different lower layer path, is allowed, unless the "inodes index" feature
is enabled.
or "metadata only copy up" feature is enabled.

With the "inodes index" feature, on the first time mount, an NFS file
handle of the lower layer root directory, along with the UUID of the lower
@@ -297,6 +317,10 @@ lower root origin, mount will fail with ESTALE. An overlayfs mount with
does not support NFS export, lower filesystem does not have a valid UUID or
if the upper filesystem does not support extended attributes.

For "metadata only copy up" feature there is no verification mechanism at
mount time. So if same upper is mounted with different set of lower, mount
probably will succeed but expect the unexpected later on. So don't do it.

It is quite a common practice to copy overlay layers to a different
directory tree on the same or different underlying filesystem, and even
to a different machine. With the "inodes index" feature, trying to mount
@@ -306,27 +330,40 @@ the copied layers will fail the verification of the lower root file handle.
Non-standard behavior
---------------------

The copy_up operation essentially creates a new, identical file and
moves it over to the old name. Any open files referring to this inode
will access the old data.
Overlayfs can now act as a POSIX compliant filesystem with the following
features turned on:

1) "redirect_dir"

Enabled with the mount option or module option: "redirect_dir=on" or with
the kernel config option CONFIG_OVERLAY_FS_REDIRECT_DIR=y.

If this feature is disabled, then rename(2) on a lower or merged directory
will fail with EXDEV ("Invalid cross-device link").

2) "inode index"

Enabled with the mount option or module option "index=on" or with the
kernel config option CONFIG_OVERLAY_FS_INDEX=y.

The new file may be on a different filesystem, so both st_dev and st_ino
of the real file may change. The values of st_dev and st_ino returned by
stat(2) on an overlay object are often not the same as the real file
stat(2) values to prevent the values from changing on copy_up.
If this feature is disabled and a file with multiple hard links is copied
up, then this will "break" the link. Changes will not be propagated to
other names referring to the same inode.

Unless "xino" feature is enabled, when overlay layers are not all on the
same underlying filesystem, the value of st_dev may be different for two
non-directory objects in the same overlay filesystem and the value of
st_ino for directory objects may be non persistent and could change even
while the overlay filesystem is still mounted.
3) "xino"

Unless "inode index" feature is enabled, if a file with multiple hard
links is copied up, then this will "break" the link. Changes will not be
propagated to other names referring to the same inode.
Enabled with the mount option "xino=auto" or "xino=on", with the module
option "xino_auto=on" or with the kernel config option
CONFIG_OVERLAY_FS_XINO_AUTO=y. Also implicitly enabled by using the same
underlying filesystem for all layers making up the overlay.

Unless "redirect_dir" feature is enabled, rename(2) on a lower or merged
directory will fail with EXDEV.
If this feature is disabled or the underlying filesystem doesn't have
enough free bits in the inode number, then overlayfs will not be able to
guarantee that the values of st_ino and st_dev returned by stat(2) and the
value of d_ino returned by readdir(3) will act like on a normal filesystem.
E.g. the value of st_dev may be different for two objects in the same
overlay filesystem and the value of st_ino for directory objects may not be
persistent and could change even while the overlay filesystem is mounted.


Changes to underlying filesystems
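
The mount options named in the overlayfs.txt changes above can be combined into a single option string. The helper below is a minimal sketch under stated assumptions: `overlay_opts()` is a hypothetical name, the three directory paths are placeholders supplied by the caller, and the sketch only formats the string; it does not call mount(2).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format an overlayfs option string that turns on
 * redirect_dir, index and xino, and sets metacopy as requested.
 * Returns 0 on success and -1 if a path is unusable or buf is too small.
 */
static int overlay_opts(char *buf, size_t len, const char *lower,
			const char *upper, const char *work, bool metacopy)
{
	int n;

	assert(buf && lower && upper && work);

	/* Options are comma separated, so this sketch rejects commas in paths. */
	if (strchr(lower, ',') || strchr(upper, ',') || strchr(work, ','))
		return -1;

	n = snprintf(buf, len,
		     "lowerdir=%s,upperdir=%s,workdir=%s,"
		     "redirect_dir=on,index=on,xino=on,metacopy=%s",
		     lower, upper, work, metacopy ? "on" : "off");

	/* snprintf() reports the untruncated length; treat truncation as failure. */
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

As the documentation text warns, `metacopy=on` should only be requested when the upper and lower directories are trusted.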
16 changes: 4 additions & 12 deletions Documentation/filesystems/vfs.txt
@@ -989,8 +989,7 @@ struct dentry_operations {
char *(*d_dname)(struct dentry *, char *, int);
struct vfsmount *(*d_automount)(struct path *);
int (*d_manage)(const struct path *, bool);
struct dentry *(*d_real)(struct dentry *, const struct inode *,
unsigned int, unsigned int);
struct dentry *(*d_real)(struct dentry *, const struct inode *);
};

d_revalidate: called when the VFS needs to revalidate a dentry. This
@@ -1124,22 +1123,15 @@ struct dentry_operations {
dentry being transited from.

d_real: overlay/union type filesystems implement this method to return one of
the underlying dentries hidden by the overlay. It is used in three
the underlying dentries hidden by the overlay. It is used in two
different modes:

Called from open it may need to copy-up the file depending on the
supplied open flags. This mode is selected with a non-zero flags
argument. In this mode the d_real method can return an error.

Called from file_dentry() it returns the real dentry matching the inode
argument. The real dentry may be from a lower layer already copied up,
but still referenced from the file. This mode is selected with a
non-NULL inode argument. This will always succeed.

With NULL inode and zero flags the topmost real underlying dentry is
returned. This will always succeed.
non-NULL inode argument.

This method is never called with both non-NULL inode and non-zero flags.
With NULL inode the topmost real underlying dentry is returned.
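
The two remaining d_real modes can be pictured with a small userspace model. Everything here is hypothetical (`struct model_inode`, `struct model_dentry`, `struct model_overlay`, `model_d_real()`), and the final `NULL` return is a choice made for the model only; the text above does not say what the kernel does when no layer matches.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel objects; not the real structs. */
struct model_inode { int id; };
struct model_dentry { struct model_inode *inode; };

struct model_overlay {
	struct model_dentry *upper;  /* NULL until the file is copied up */
	struct model_dentry *lower;
};

/* Models the two-argument d_real(dentry, inode). */
static struct model_dentry *model_d_real(struct model_overlay *ovl,
					 const struct model_inode *inode)
{
	assert(ovl && ovl->lower);

	/* NULL inode: return the topmost real underlying dentry. */
	if (!inode)
		return ovl->upper ? ovl->upper : ovl->lower;

	/* Non-NULL inode: return the real dentry matching that inode. */
	if (ovl->upper && ovl->upper->inode == inode)
		return ovl->upper;
	if (ovl->lower->inode == inode)
		return ovl->lower;

	return NULL;  /* model-only fallback */
}
```

The second mode is what lets a file opened before copy-up keep resolving to the lower dentry even after an upper dentry exists.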

Each dentry has a pointer to its parent dentry, as well as a hash list
of child dentries. Child dentries are basically like files in a
5 changes: 3 additions & 2 deletions fs/btrfs/ctree.h
@@ -3217,8 +3217,9 @@ void btrfs_get_block_group_info(struct list_head *groups_list,
struct btrfs_ioctl_space_info *space);
void btrfs_update_ioctl_balance_args(struct btrfs_fs_info *fs_info,
struct btrfs_ioctl_balance_args *bargs);
ssize_t btrfs_dedupe_file_range(struct file *src_file, u64 loff, u64 olen,
struct file *dst_file, u64 dst_loff);
int btrfs_dedupe_file_range(struct file *src_file, loff_t src_loff,
struct file *dst_file, loff_t dst_loff,
u64 olen);

/* file.c */
int __init btrfs_auto_defrag_init(void);
11 changes: 4 additions & 7 deletions fs/btrfs/ioctl.c
@@ -3592,13 +3592,13 @@ static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
return ret;
}

ssize_t btrfs_dedupe_file_range(struct file *src_file, u64 loff, u64 olen,
struct file *dst_file, u64 dst_loff)
int btrfs_dedupe_file_range(struct file *src_file, loff_t src_loff,
struct file *dst_file, loff_t dst_loff,
u64 olen)
{
struct inode *src = file_inode(src_file);
struct inode *dst = file_inode(dst_file);
u64 bs = BTRFS_I(src)->root->fs_info->sb->s_blocksize;
ssize_t res;

if (WARN_ON_ONCE(bs < PAGE_SIZE)) {
/*
@@ -3609,10 +3609,7 @@ ssize_t btrfs_dedupe_file_range(struct file *src_file, u64 loff, u64 olen,
return -EINVAL;
}

res = btrfs_extent_same(src, loff, olen, dst, dst_loff);
if (res)
return res;
return olen;
return btrfs_extent_same(src, src_loff, olen, dst, dst_loff);
}
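
The hunk above changes the return convention of `btrfs_dedupe_file_range()`: the old version returned `olen` on success, the new one returns whatever `btrfs_extent_same()` returns. A userspace sketch of the difference, with `model_extent_same()` as a hypothetical stand-in whose failure condition (`olen == 0`) is invented purely for illustration:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for btrfs_extent_same(): 0 or a negative errno. */
static int model_extent_same(uint64_t olen)
{
	return olen == 0 ? -EINVAL : 0;
}

/* Old convention: number of bytes deduped, or a negative errno. */
static int64_t model_dedupe_old(uint64_t olen)
{
	int res;

	assert(olen <= INT64_MAX);  /* the length must fit the signed return */
	res = model_extent_same(olen);
	if (res)
		return res;
	return (int64_t)olen;
}

/* New convention: 0 on success, or a negative errno. */
static int model_dedupe_new(uint64_t olen)
{
	return model_extent_same(olen);
}
```

Under the new convention a caller can no longer read the deduped length from the return value; it has to rely on the `olen` it passed in.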

static int clone_finish_inode_update(struct btrfs_trans_handle *trans,
69 changes: 48 additions & 21 deletions fs/file_table.c
@@ -52,7 +52,8 @@ static void file_free_rcu(struct rcu_head *head)
static inline void file_free(struct file *f)
{
security_file_free(f);
percpu_counter_dec(&nr_files);
if (!(f->f_mode & FMODE_NOACCOUNT))
percpu_counter_dec(&nr_files);
call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
}

@@ -91,6 +92,34 @@ int proc_nr_files(struct ctl_table *table, int write,
}
#endif

static struct file *__alloc_file(int flags, const struct cred *cred)
{
struct file *f;
int error;

f = kmem_cache_zalloc(filp_cachep, GFP_KERNEL);
if (unlikely(!f))
return ERR_PTR(-ENOMEM);

f->f_cred = get_cred(cred);
error = security_file_alloc(f);
if (unlikely(error)) {
file_free_rcu(&f->f_u.fu_rcuhead);
return ERR_PTR(error);
}

atomic_long_set(&f->f_count, 1);
rwlock_init(&f->f_owner.lock);
spin_lock_init(&f->f_lock);
mutex_init(&f->f_pos_lock);
eventpoll_init_file(f);
f->f_flags = flags;
f->f_mode = OPEN_FMODE(flags);
/* f->f_version: 0 */

return f;
}

/* Find an unused file structure and return a pointer to it.
* Returns an error pointer if some error happend e.g. we over file
* structures limit, run out of memory or operation is not permitted.
@@ -105,7 +134,6 @@ struct file *alloc_empty_file(int flags, const struct cred *cred)
{
static long old_max;
struct file *f;
int error;

/*
* Privileged users can go above max_files
@@ -119,26 +147,10 @@ struct file *alloc_empty_file(int flags, const struct cred *cred)
goto over;
}

f = kmem_cache_zalloc(filp_cachep, GFP_KERNEL);
if (unlikely(!f))
return ERR_PTR(-ENOMEM);

f->f_cred = get_cred(cred);
error = security_file_alloc(f);
if (unlikely(error)) {
file_free_rcu(&f->f_u.fu_rcuhead);
return ERR_PTR(error);
}
f = __alloc_file(flags, cred);
if (!IS_ERR(f))
percpu_counter_inc(&nr_files);

atomic_long_set(&f->f_count, 1);
rwlock_init(&f->f_owner.lock);
spin_lock_init(&f->f_lock);
mutex_init(&f->f_pos_lock);
eventpoll_init_file(f);
f->f_flags = flags;
f->f_mode = OPEN_FMODE(flags);
/* f->f_version: 0 */
percpu_counter_inc(&nr_files);
return f;

over:
@@ -150,6 +162,21 @@ struct file *alloc_empty_file(int flags, const struct cred *cred)
return ERR_PTR(-ENFILE);
}

/*
* Variant of alloc_empty_file() that doesn't check and modify nr_files.
*
* Should not be used unless there's a very good reason to do so.
*/
struct file *alloc_empty_file_noaccount(int flags, const struct cred *cred)
{
struct file *f = __alloc_file(flags, cred);

if (!IS_ERR(f))
f->f_mode |= FMODE_NOACCOUNT;

return f;
}
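
The pairing between `alloc_empty_file_noaccount()` and the new check in `file_free()` is easier to see in a reduced userspace model. This is a sketch with hypothetical names (`struct acct_file`, `model_alloc()`, `model_free()`, `model_nr_files`); `MODEL_NOACCOUNT` uses an illustrative bit value, not the kernel's `FMODE_NOACCOUNT`, and a plain `long` replaces the per-CPU counter.

```c
#include <assert.h>
#include <stdlib.h>

#define MODEL_NOACCOUNT 0x1u  /* illustrative value, not the kernel's */

static long model_nr_files;   /* stands in for the nr_files counter */

struct acct_file { unsigned int mode; };

/* account != 0 models alloc_empty_file(); 0 models the noaccount variant. */
static struct acct_file *model_alloc(int account)
{
	struct acct_file *f = calloc(1, sizeof(*f));

	if (!f)
		return NULL;
	if (account)
		model_nr_files++;
	else
		f->mode |= MODEL_NOACCOUNT;
	return f;
}

static void model_free(struct acct_file *f)
{
	assert(f);
	/* Only files that were counted on allocation are uncounted here. */
	if (!(f->mode & MODEL_NOACCOUNT))
		model_nr_files--;
	free(f);
}
```

Recording the choice in the file's mode bits is what keeps the counter balanced: the free path needs no extra argument to know whether the allocation was counted.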

/**
* alloc_file - allocate and initialize a 'struct file'
*
46 changes: 6 additions & 40 deletions fs/inode.c
@@ -1595,50 +1595,17 @@ sector_t bmap(struct inode *inode, sector_t block)
}
EXPORT_SYMBOL(bmap);

/*
* Update times in overlayed inode from underlying real inode
*/
static void update_ovl_inode_times(struct dentry *dentry, struct inode *inode,
bool rcu)
{
struct dentry *upperdentry;

/*
* Nothing to do if in rcu or if non-overlayfs
*/
if (rcu || likely(!(dentry->d_flags & DCACHE_OP_REAL)))
return;

upperdentry = d_real(dentry, NULL, 0, D_REAL_UPPER);

/*
* If file is on lower then we can't update atime, so no worries about
* stale mtime/ctime.
*/
if (upperdentry) {
struct inode *realinode = d_inode(upperdentry);

if ((!timespec64_equal(&inode->i_mtime, &realinode->i_mtime) ||
!timespec64_equal(&inode->i_ctime, &realinode->i_ctime))) {
inode->i_mtime = realinode->i_mtime;
inode->i_ctime = realinode->i_ctime;
}
}
}

/*
* With relative atime, only update atime if the previous atime is
* earlier than either the ctime or mtime or if at least a day has
* passed since the last atime update.
*/
static int relatime_need_update(const struct path *path, struct inode *inode,
struct timespec now, bool rcu)
static int relatime_need_update(struct vfsmount *mnt, struct inode *inode,
struct timespec now)
{

if (!(path->mnt->mnt_flags & MNT_RELATIME))
if (!(mnt->mnt_flags & MNT_RELATIME))
return 1;

update_ovl_inode_times(path->dentry, inode, rcu);
/*
* Is mtime younger than atime? If yes, update atime:
*/
@@ -1709,8 +1676,7 @@ static int update_time(struct inode *inode, struct timespec64 *time, int flags)
* This function automatically handles read only file systems and media,
* as well as the "noatime" flag and inode specific "noatime" markers.
*/
bool __atime_needs_update(const struct path *path, struct inode *inode,
bool rcu)
bool atime_needs_update(const struct path *path, struct inode *inode)
{
struct vfsmount *mnt = path->mnt;
struct timespec64 now;
@@ -1736,7 +1702,7 @@ bool __atime_needs_update(const struct path *path, struct inode *inode,

now = current_time(inode);

if (!relatime_need_update(path, inode, timespec64_to_timespec(now), rcu))
if (!relatime_need_update(mnt, inode, timespec64_to_timespec(now)))
return false;

if (timespec64_equal(&inode->i_atime, &now))
@@ -1751,7 +1717,7 @@ void touch_atime(const struct path *path)
struct inode *inode = d_inode(path->dentry);
struct timespec64 now;

if (!__atime_needs_update(path, inode, false))
if (!atime_needs_update(path, inode))
return;

if (!sb_start_write_trylock(inode->i_sb))
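
The relative atime rule stated in the comment above `relatime_need_update()` can be written out as a small userspace function. This is a sketch under assumptions: `struct model_times` and `model_relatime_need_update()` are hypothetical, timestamps are whole seconds instead of `struct timespec`, and the `>=` comparisons reflect my reading of the kernel source, which this diff does not show in full.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_DAY_SECONDS (24 * 60 * 60)

/* Hypothetical inode timestamps, in whole seconds. */
struct model_times { int64_t atime, mtime, ctime; };

static bool model_relatime_need_update(bool relatime,
				       const struct model_times *t,
				       int64_t now)
{
	assert(t);

	if (!relatime)
		return true;  /* mount without MNT_RELATIME: no filtering here */
	if (t->mtime >= t->atime)
		return true;  /* file modified since the last recorded access */
	if (t->ctime >= t->atime)
		return true;  /* inode changed since the last recorded access */
	if (now - t->atime >= MODEL_DAY_SECONDS)
		return true;  /* recorded atime is at least a day old */
	return false;
}
```

After this merge the function takes the `vfsmount` directly and no longer needs the dentry, because the overlay-specific `update_ovl_inode_times()` call is gone.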