
timers-core-2023-04-24

  - Improve the VDSO build time checks to cover all dynamic relocations

    VDSO does not allow dynamic relocations, but the build time check is
    incomplete and fragile.

    It's based on architectures specifying the relocation types to search
    for and does not handle R_*_NONE relocation entries correctly.
    R_*_NONE relocations are injected by some GNU ld variants if they fail
    to determine the exact .rel[a].dyn size to cover trailing zeros.
    R_*_NONE relocations must be ignored by dynamic loaders, so they
    should be ignored in the build time check too.

    Remove the architecture specific relocation types to check for and
    strictly validate that no relocations other than R_*_NONE end up
    in the VDSO .so file.
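
    The strict check could look like the following minimal sketch (a
    hypothetical standalone tool, not the kernel's actual build rule):
    scan the 64-bit vDSO image's relocation sections and reject every
    entry whose type is not 0, which the common architectures all
    define as R_*_NONE:

      /* Hypothetical checker; 64-bit images only, error handling elided */
      #include <elf.h>
      #include <fcntl.h>
      #include <stdio.h>
      #include <sys/mman.h>
      #include <sys/stat.h>

      int main(int argc, char **argv)
      {
              struct stat st;
              int fd;

              if (argc != 2)
                      return 1;

              fd = open(argv[1], O_RDONLY);
              fstat(fd, &st);
              char *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
              Elf64_Ehdr *eh = (Elf64_Ehdr *)map;
              Elf64_Shdr *sh = (Elf64_Shdr *)(map + eh->e_shoff);

              for (int i = 0; i < eh->e_shnum; i++) {
                      if (sh[i].sh_type != SHT_RELA && sh[i].sh_type != SHT_REL)
                              continue;
                      for (size_t off = 0; off < sh[i].sh_size; off += sh[i].sh_entsize) {
                              /* Rel and Rela share the r_offset/r_info layout */
                              Elf64_Rela *r = (Elf64_Rela *)(map + sh[i].sh_offset + off);

                              if (ELF64_R_TYPE(r->r_info) != 0) {
                                      fprintf(stderr, "%s: dynamic relocation found\n", argv[1]);
                                      return 1;
                              }
                      }
              }
              return 0;
      }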

  - Prefer signal delivery to the current thread for
    CLOCK_PROCESS_CPUTIME_ID based posix-timers

    Such timers prefer to deliver the signal to the main thread of a
    process even if the context in which the timer expires is the current
    task. This has the downside that it might wake up an idle thread.

    As there is no requirement or guarantee that the signal has to be
    delivered to the main thread, avoid this by preferring the current
    task if it is part of the thread group which shares sighand.

    This not only avoids waking idle threads, it also distributes
    signal delivery better when multiple timers fire close to each
    other in the context of different threads.
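
    A minimal sketch of the preference logic (pick_delivery_target() is
    a hypothetical name, not the exact mainline function;
    same_thread_group() is the regular kernel helper): given the thread
    group leader the timer would normally signal, deliver to the
    expiring task when it belongs to that group:

      /* Sketch: prefer current over the main thread where permissible */
      static struct task_struct *pick_delivery_target(struct task_struct *leader)
      {
              /* current is running anyway, so no idle thread is woken */
              if (same_thread_group(current, leader))
                      return current;

              /* Otherwise deliver to the main thread as before */
              return leader;
      }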

  - Align the tick period properly (again)

    For a long time the tick was starting at CLOCK_MONOTONIC zero, which
    allowed user space applications to either align with the tick or to
    place a periodic computation so that it does not interfere with the
    tick. The alignment of the tick period was more by chance than by
    intention as the tick is set up before a high resolution clocksource is
    installed, i.e. timekeeping is still tick based and the tick period
    advances from there.

    The early enablement of sched_clock() broke this alignment as the time
    accumulated by sched_clock() is taken into account when timekeeping is
    initialized. So the base value now(CLOCK_MONOTONIC) is no longer a
    multiple of tick periods, which breaks applications which relied on
    that behaviour.

    Cure this by aligning the tick starting point to the next multiple of
    tick periods, i.e. 1000ms/CONFIG_HZ.
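
    A sketch of the alignment (align_tick_start() is a hypothetical
    name; TICK_NSEC is the real tick period constant): round the tick
    starting point up to the next period boundary:

      static ktime_t align_tick_start(ktime_t now)
      {
              /* TICK_NSEC is the tick period, i.e. 1000ms/CONFIG_HZ */
              u64 rem = ktime_to_ns(now) % TICK_NSEC;

              /* Advance to the next multiple of the tick period */
              if (rem)
                      now = ktime_add_ns(now, TICK_NSEC - rem);
              return now;
      }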

 - A set of NOHZ fixes and enhancements

   - Cure the concurrent writer race for idle and IO sleeptime statistics

     The statistics exposed via /proc/stat are updated from
     the CPU local idle exit and remotely by cpufreq, but that happens
     without any form of serialization. As a consequence sleeptimes can be
     accounted twice or worse.

     Prevent this by restricting the accumulation writeback to the CPU
     local idle exit and by letting the remote access compute the
     accumulated value.
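
     A sketch of the read side (get_sleeptime() is a hypothetical name;
     the struct tick_sched fields are simplified): a remote reader
     derives the current value on the fly instead of writing the
     accumulation back:

       static u64 get_sleeptime(struct tick_sched *ts, ktime_t now)
       {
               /* CPU still idle: add the currently running delta */
               if (ts->idle_active)
                       return ktime_to_ns(ktime_add(ts->idle_sleeptime,
                                                    ktime_sub(now, ts->idle_entrytime)));

               /* Otherwise the local idle exit already wrote the value back */
               return ktime_to_ns(ts->idle_sleeptime);
       }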

   - Protect idle/iowait sleep time with a sequence count

     Reading idle/iowait sleep time, e.g. from /proc/stat, can race with
     idle exit updates. As a consequence the readout may produce random
     and potentially backwards-going values.

     Protect this by a sequence count, which fixes the idle time
     statistics issue, but cannot fix the iowait time problem because
     iowait time accounting races with remote wake ups decrementing the
     remote runqueues nr_iowait counter. The latter is impossible to fix,
     so the only way to deal with that is to document it properly and to
     remove the assertion in the selftest which triggers occasionally due
     to that.
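
     A sketch of the seqcount usage with the regular kernel seqcount
     API (the field name idle_sleeptime_seq is assumed): the idle exit
     updates inside a write section, readers retry until they observe
     a consistent snapshot:

       static void add_sleeptime(struct tick_sched *ts, ktime_t delta)
       {
               write_seqcount_begin(&ts->idle_sleeptime_seq);
               ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
               write_seqcount_end(&ts->idle_sleeptime_seq);
       }

       static ktime_t read_sleeptime(struct tick_sched *ts)
       {
               unsigned int seq;
               ktime_t val;

               /* Retry until no writer was active during the read */
               do {
                       seq = read_seqcount_begin(&ts->idle_sleeptime_seq);
                       val = ts->idle_sleeptime;
               } while (read_seqcount_retry(&ts->idle_sleeptime_seq, seq));

               return val;
       }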

   - Restructure struct tick_sched for better cache layout, along with
     some small cleanups

 - Implement the missing timer_wait_running() callback for POSIX CPU timers

   For unknown reasons the introduction of the timer_wait_running()
   callback failed to fix up POSIX CPU timers, which went unnoticed for
   almost four years.

   While initially only targeted to prevent livelocks between a timer
   deletion and the timer expiry function on PREEMPT_RT enabled kernels, it
   turned out that fixing this for mainline is not as trivial as just
   implementing a stub similar to the hrtimer/timer callbacks.

   The reason is that for CONFIG_POSIX_CPU_TIMERS_TASK_WORK enabled systems
   there is a livelock issue independent of RT.

   CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y moves the expiry of POSIX CPU timers
   out from hard interrupt context to task work, which is handled before
   returning to user space or to a VM. The expiry mechanism moves the
   expired timers to a stack local list head with sighand lock held. Once
   sighand is dropped the task can be preempted and a task which wants to
   delete a timer will spin-wait until the expiry task is scheduled back
   in. In the worst case this will end up in a livelock when the preempting
   task and the expiry task are pinned on the same CPU.

   The timer wheel has a timer_wait_running() mechanism for RT, which uses
   a per CPU timer-base expiry lock which is held by the expiry code and the
   task waiting for the timer function to complete blocks on that lock.

   This does not work in the same way for POSIX CPU timers as there is no
   timer base and the expiry of process-wide timers can run on any task
   belonging to that process, but the concept of waiting on an expiry
   lock can be applied in a slightly different way.

   Add a per task mutex to struct posix_cputimers_work, let the expiry task
   hold it across the expiry function and let the deleting task which
   waits for the expiry to complete block on the mutex.

   In the non-contended case this results in an extra mutex_lock()/unlock()
   pair on both sides.

   This avoids spin-waiting on a task which is scheduled out, prevents the
   livelock and cures the problem for RT and !RT systems.
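
   The mechanism boils down to the following sketch (simplified from
   the CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y implementation; the wait
   side is reduced here to taking a task pointer directly):

     struct posix_cputimers_work {
             struct callback_head    work;
             struct mutex            mutex;  /* added by this change */
             unsigned int            scheduled;
     };

     /* Task work callback: expiry runs with the per task mutex held */
     static void posix_cpu_timers_work(struct callback_head *work)
     {
             struct posix_cputimers_work *cw =
                     container_of(work, struct posix_cputimers_work, work);

             mutex_lock(&cw->mutex);
             handle_posix_cpu_timers(current);
             mutex_unlock(&cw->mutex);
     }

     /* Deletion side: block until the expiry function has completed */
     static void posix_cpu_timer_wait_running(struct task_struct *tsk)
     {
             struct mutex *m = &tsk->posix_cputimers_work.mutex;

             mutex_lock(m);          /* sleeps instead of spin-waiting */
             mutex_unlock(m);
     }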