Skip to content

Commit

Permalink
exec: Move unshare_files and guarantee files_struct.count is correct
Browse files Browse the repository at this point in the history
A while ago it was reported that posix file locking goes wrong when a
multi-threaded process calls exec.  I looked into the history and this
is definitely a regression, that should be fixed if we can.

This set of changes cleanups of the code in exec so hopefully this code
will not regress again.  Then it adds helpers and fixes the users of
files_struct so the reference count is only incremented if COPY_FILES is
passed to clone (or if io_uring takes a reference).  Then it removes
helpers (get_files_struct, __install_fd, __alloc_fd, __close_fd) that
are no longer needed and if used would encourage code that increments
the count of files_struct somewhere besides in clone when COPY_FILES is
passed.

In addition to fixing the bug in exec and simplifing the code this set
of changes by virtue of getting files_struct.count correct it optimizes
fdget.  With proc and other places not temporarily increasing the count
on files_struct __fget_light should succeed more often in being able to
return a struct file without touching it's reference count.

Fixing the count in files_struct was suggested by Oleg[1].

For those that are interested in the history of this issue I have
included as much of it as I could find in the first change.

Since v1:

- Renamed the functions
  __fcheck_files      -> files_lookup_fd_raw
  fcheck_files        -> files_lookup_fd_locked
  fcheck_files        -> files_lookup_fd_rcu
  fcheck_files        -> lookup_fd_rcu
  fcheck_task         -> task_lookup_fd_rcu
  fnext_task          -> task_lookup_next_fd_rcu
  __close_fd_get_file -> close_fd_get_file

- Simplified get_file_raw_ptr

- Removed ksys_close

- Examined the penalty for taking task_lock.  The helper
  task_lookup_next_fd_rcu takes task_lock each iteration.  Concern was
  expressed that this might be a problem.  The function tid_fd_mode
  isn called from tid_fd_revalidate which is called when ever a file
  descriptor file is stat'ed, opened, or otherwise accessed.  The
  function tid_fd_mode histrocally called get_files_struct which took
  and dropped task_lock.  So the volume of task_lock calls is already
  proportional to the number of file descriptors.  A micro benchmark
  did not see the move to task_lookup_next_fd_rcu making a difference
  in performance.  Which suggests that the change to taking the task
  lock for every file descriptor found in task_lookup_next_fd will not
  be a problem.

- Reviewed the code for conflicts with io_uring (especially the
  removal of get_files_struct).  To my surprise no conflicts were
  found as io_uring does not use standard helpers but instead rolls
  it's own version of get_files_struct by hand.

 Documentation/filesystems/files.rst          |   8 +-
 arch/powerpc/platforms/cell/spufs/coredump.c |   2 +-
 drivers/android/binder.c                     |   2 +-
 fs/autofs/dev-ioctl.c                        |   5 +-
 fs/coredump.c                                |   5 +-
 fs/exec.c                                    |  29 +++----
 fs/file.c                                    | 124 +++++++++++++--------------
 fs/io_uring.c                                |   2 +-
 fs/locks.c                                   |  14 +--
 fs/notify/dnotify/dnotify.c                  |   2 +-
 fs/open.c                                    |   2 +-
 fs/proc/fd.c                                 |  48 ++++-------
 include/linux/fdtable.h                      |  40 +++++----
 include/linux/syscalls.h                     |  12 ---
 kernel/bpf/syscall.c                         |  20 +----
 kernel/bpf/task_iter.c                       |  44 +++-------
 kernel/fork.c                                |  12 +--
 kernel/kcmp.c                                |  29 ++-----
 18 files changed, 153 insertions(+), 247 deletions(-)

Eric W. Biederman (25):
      exec: Don't open code get_close_on_exec
      exec: Move unshare_files to fix posix file locking during exec
      exec: Simplify unshare_files
      exec: Remove reset_files_struct
      kcmp: In kcmp_epoll_target use fget_task
      bpf: In bpf_task_fd_query use fget_task
      proc/fd: In proc_fd_link use fget_task
      file: Rename __fcheck_files to files_lookup_fd_raw
      file: Factor files_lookup_fd_locked out of fcheck_files
      file: Replace fcheck_files with files_lookup_fd_rcu
      file: Rename fcheck lookup_fd_rcu
      file: Implement task_lookup_fd_rcu
      proc/fd: In tid_fd_mode use task_lookup_fd_rcu
      kcmp: In get_file_raw_ptr use task_lookup_fd_rcu
      file: Implement task_lookup_next_fd_rcu
      proc/fd: In proc_readfd_common use task_lookup_next_fd_rcu
      bpf/task_iter: In task_file_seq_get_next use task_lookup_next_fd_rcu
      proc/fd: In fdinfo seq_show don't use get_files_struct
      file: Merge __fd_install into fd_install
      file: In f_dupfd read RLIMIT_NOFILE once.
      file: Merge __alloc_fd into alloc_fd
      file: Rename __close_fd to close_fd and remove the files parameter
      file: Replace ksys_close with close_fd
      file: Rename __close_fd_get_file close_fd_get_file
      file: Remove get_files_struct

[1] https://lkml.kernel.org/r/20180915160423.GA31461@redhat.com
v1: https://lkml.kernel.org/r/87ft8l6ic3.fsf@x220.int.ebiederm.org
Reported-by: Jeff Layton <jlayton@redhat.com>
Reported-by: Daniel P. Berrangé <berrange@redhat.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/87r1on1v62.fsf@x220.int.ebiederm.org
Link: https://lists.openvz.org/pipermail/criu/2020-November/045123.html
Link: https://marc.info/?l=openvz-criu&m=160591423214257
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  • Loading branch information
Eric W. Biederman committed Dec 10, 2020
2 parents 3650b22 + fa67bf8 commit 125c00a
Show file tree
Hide file tree
Showing 18 changed files with 153 additions and 247 deletions.
8 changes: 4 additions & 4 deletions Documentation/filesystems/files.rst
Original file line number Diff line number Diff line change
Expand Up @@ -62,15 +62,15 @@ the fdtable structure -
be held.

4. To look up the file structure given an fd, a reader
must use either fcheck() or fcheck_files() APIs. These
must use either lookup_fd_rcu() or files_lookup_fd_rcu() APIs. These
take care of barrier requirements due to lock-free lookup.

An example::

struct file *file;
rcu_read_lock();
file = fcheck(fd);
file = lookup_fd_rcu(fd);
if (file) {
...
}
Expand All @@ -84,7 +84,7 @@ the fdtable structure -
on ->f_count::

rcu_read_lock();
file = fcheck_files(files, fd);
file = files_lookup_fd_rcu(files, fd);
if (file) {
if (atomic_long_inc_not_zero(&file->f_count))
*fput_needed = 1;
Expand All @@ -104,7 +104,7 @@ the fdtable structure -
lock-free, they must be installed using rcu_assign_pointer()
API. If they are looked up lock-free, rcu_dereference()
must be used. However it is advisable to use files_fdtable()
and fcheck()/fcheck_files() which take care of these issues.
and lookup_fd_rcu()/files_lookup_fd_rcu() which take care of these issues.

7. While updating, the fdtable pointer must be looked up while
holding files->file_lock. If ->file_lock is dropped, then
Expand Down
2 changes: 1 addition & 1 deletion arch/powerpc/platforms/cell/spufs/coredump.c
Original file line number Diff line number Diff line change
Expand Up @@ -74,7 +74,7 @@ static struct spu_context *coredump_next_context(int *fd)
*fd = n - 1;

rcu_read_lock();
file = fcheck(*fd);
file = lookup_fd_rcu(*fd);
ctx = SPUFS_I(file_inode(file))->i_ctx;
get_spu_context(ctx);
rcu_read_unlock();
Expand Down
2 changes: 1 addition & 1 deletion drivers/android/binder.c
Original file line number Diff line number Diff line change
Expand Up @@ -2226,7 +2226,7 @@ static void binder_deferred_fd_close(int fd)
if (!twcb)
return;
init_task_work(&twcb->twork, binder_do_fd_close);
__close_fd_get_file(fd, &twcb->file);
close_fd_get_file(fd, &twcb->file);
if (twcb->file) {
filp_close(twcb->file, current->files);
task_work_add(current, &twcb->twork, TWA_RESUME);
Expand Down
5 changes: 3 additions & 2 deletions fs/autofs/dev-ioctl.c
Original file line number Diff line number Diff line change
Expand Up @@ -4,9 +4,10 @@
* Copyright 2008 Ian Kent <raven@themaw.net>
*/

#include <linux/module.h>
#include <linux/miscdevice.h>
#include <linux/compat.h>
#include <linux/syscalls.h>
#include <linux/fdtable.h>
#include <linux/magic.h>
#include <linux/nospec.h>

Expand Down Expand Up @@ -289,7 +290,7 @@ static int autofs_dev_ioctl_closemount(struct file *fp,
struct autofs_sb_info *sbi,
struct autofs_dev_ioctl *param)
{
return ksys_close(param->ioctlfd);
return close_fd(param->ioctlfd);
}

/*
Expand Down
5 changes: 1 addition & 4 deletions fs/coredump.c
Original file line number Diff line number Diff line change
Expand Up @@ -585,7 +585,6 @@ void do_coredump(const kernel_siginfo_t *siginfo)
int ispipe;
size_t *argv = NULL;
int argc = 0;
struct files_struct *displaced;
/* require nonrelative corefile path and be extra careful */
bool need_suid_safe = false;
bool core_dumped = false;
Expand Down Expand Up @@ -791,11 +790,9 @@ void do_coredump(const kernel_siginfo_t *siginfo)
}

/* get us an unshared descriptor table; almost always a no-op */
retval = unshare_files(&displaced);
retval = unshare_files();
if (retval)
goto close_fail;
if (displaced)
put_files_struct(displaced);
if (!dump_interrupted()) {
/*
* umh disabled with CONFIG_STATIC_USERMODEHELPER_PATH="" would
Expand Down
29 changes: 13 additions & 16 deletions fs/exec.c
Original file line number Diff line number Diff line change
Expand Up @@ -1257,6 +1257,11 @@ int begin_new_exec(struct linux_binprm * bprm)
if (retval)
goto out;

/* Ensure the files table is not shared. */
retval = unshare_files();
if (retval)
goto out;

/*
* Must be called _before_ exec_mmap() as bprm->mm is
* not visibile until then. This also enables the update
Expand Down Expand Up @@ -1776,21 +1781,16 @@ static int bprm_execve(struct linux_binprm *bprm,
int fd, struct filename *filename, int flags)
{
struct file *file;
struct files_struct *displaced;
int retval;

/*
* Cancel any io_uring activity across execve
*/
io_uring_task_cancel();

retval = unshare_files(&displaced);
if (retval)
return retval;

retval = prepare_bprm_creds(bprm);
if (retval)
goto out_files;
return retval;

check_unsafe_exec(bprm);
current->in_execve = 1;
Expand All @@ -1805,11 +1805,14 @@ static int bprm_execve(struct linux_binprm *bprm,
bprm->file = file;
/*
* Record that a name derived from an O_CLOEXEC fd will be
* inaccessible after exec. Relies on having exclusive access to
* current->files (due to unshare_files above).
* inaccessible after exec. This allows the code in exec to
* choose to fail when the executable is not mmaped into the
* interpreter and an open file descriptor is not passed to
* the interpreter. This makes for a better user experience
* than having the interpreter start and then immediately fail
* when it finds the executable is inaccessible.
*/
if (bprm->fdpath &&
close_on_exec(fd, rcu_dereference_raw(current->files->fdt)))
if (bprm->fdpath && get_close_on_exec(fd))
bprm->interp_flags |= BINPRM_FLAGS_PATH_INACCESSIBLE;

/* Set the unchanging part of bprm->cred */
Expand All @@ -1827,8 +1830,6 @@ static int bprm_execve(struct linux_binprm *bprm,
rseq_execve(current);
acct_update_integrals(current);
task_numa_free(current, false);
if (displaced)
put_files_struct(displaced);
return retval;

out:
Expand All @@ -1845,10 +1846,6 @@ static int bprm_execve(struct linux_binprm *bprm,
current->fs->in_exec = 0;
current->in_execve = 0;

out_files:
if (displaced)
reset_files_struct(displaced);

return retval;
}

Expand Down
Loading

0 comments on commit 125c00a

Please sign in to comment.