sched-core-2021-02-17

[ NOTE: unfortunately this tree had to be freshly rebased today,
        it's a same-content tree of 82891be90f3c (-next published)
        merged with v5.11.

        The main reason for the rebase was an authorship misattribution
        problem with a new commit, which we noticed at the last minute
        and which we didn't want to see merged upstream. The offending
        commit was deep in the tree, so dependent commits had to be
        rebased as well. ]

- Core scheduler updates:

  - Add CONFIG_PREEMPT_DYNAMIC: in its current form this adds the
    preempt=none/voluntary/full boot options (default: full), allowing
    distros to build a PREEMPT kernel but fall back to scheduling
    behavior close to PREEMPT_VOLUNTARY (or PREEMPT_NONE) via a
    boot-time selection.

    There's also the /debug/sched_debug switch to change this at runtime.

    This feature is implemented via runtime patching (a new variant of static calls).

    The scope of the runtime patching can be best reviewed by looking
    at the sched_dynamic_update() function in kernel/sched/core.c.

    ( Note that the dynamic none/voluntary modes aren't 100% identical
      to the build-time PREEMPT_NONE/PREEMPT_VOLUNTARY models: for
      example, preemptible RCU is available in all cases, and the
      preempt count is maintained in all models, which has some runtime
      overhead even with the code patching. )

    The PREEMPT_VOLUNTARY/PREEMPT_NONE models, used by the vast majority
    of distributions, are supposed to be unaffected.
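
    As a rough sketch of the mechanism (illustrative only, with invented
    names; the real switching logic lives in sched_dynamic_update() in
    kernel/sched/core.c), each preemption point is a static call whose
    target is repatched according to the selected model:

      #include <linux/static_call.h>

      extern int __cond_resched(void);                /* real resched point */
      static int cond_resched_nop(void) { return 0; } /* patched-out no-op */

      DEFINE_STATIC_CALL(my_cond_resched, __cond_resched);

      /* Simplified, hypothetical analogue of sched_dynamic_update(): */
      static void my_dynamic_update(bool full_preemption)
      {
              if (full_preemption)
                      /* full: preempt eagerly, cond_resched() is a no-op */
                      static_call_update(my_cond_resched, cond_resched_nop);
              else
                      /* none/voluntary: cond_resched() really reschedules */
                      static_call_update(my_cond_resched, __cond_resched);
      }

      /* Callsites invoke it via: static_call(my_cond_resched)(); */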

  - Fix ignored rescheduling after rcu_eqs_enter(). This is a bug that
    was found via rcutorture triggering a hang. The bug is that
    rcu_idle_enter() may wake up a NOCB kthread, but this happens after
    the last generic need_resched() check. Some cpuidle drivers fix it
    by chance but many others don't.

    In true 2020 fashion the original bug fix has grown into a 5-patch
    scheduler/RCU fix series plus another 16 RCU patches to address
    the underlying issue of missed preemption events. These are the
    initial fixes that should fix current incarnations of the bug.
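
    The problematic ordering, as an illustrative sketch (simplified
    pseudo-code of the idle entry path, not the actual kernel code):

      static void idle_entry_sketch(void)
      {
              if (need_resched())     /* last generic resched check ... */
                      return;

              rcu_idle_enter();       /* ... but this can wake a NOCB
                                       * kthread, making a task runnable
                                       * only after that check */

              arch_cpu_idle();        /* the CPU may now sleep with a
                                       * runnable task, unless the cpuidle
                                       * driver re-checks need_resched() */
      }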

  - Clean up rbtree usage in the scheduler, by providing & using the following
    consistent set of rbtree APIs:

     partial-order; less() based:
       - rb_add(): add a new entry to the rbtree
       - rb_add_cached(): like rb_add(), but for a rb_root_cached

     total-order; cmp() based:
       - rb_find(): find an entry in an rbtree
       - rb_find_add(): find an entry, and add if not found

       - rb_find_first(): find the first (leftmost) matching entry
       - rb_next_match(): continue from rb_find_first()
       - rb_for_each(): iterate a sub-tree using the previous two
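
    For example (a hedged usage sketch: 'struct item' and its comparison
    helpers are invented for illustration; rb_add()/rb_find() are the new
    helpers themselves):

      #include <linux/rbtree.h>

      struct item {
              struct rb_node node;
              u64 key;
      };

      /* less(): the partial order used for insertion */
      static bool item_less(struct rb_node *a, const struct rb_node *b)
      {
              return rb_entry(a, struct item, node)->key <
                     rb_entry(b, struct item, node)->key;
      }

      /* cmp(): the total order used for lookups */
      static int item_cmp(const void *key, const struct rb_node *node)
      {
              const struct item *it = rb_entry(node, struct item, node);
              u64 k = *(const u64 *)key;

              if (k < it->key)
                      return -1;
              return k > it->key;
      }

      static struct item *item_insert_and_find(struct rb_root *root,
                                               struct item *new, u64 key)
      {
              struct rb_node *node;

              rb_add(&new->node, root, item_less);

              node = rb_find(&key, root, item_cmp);
              return node ? rb_entry(node, struct item, node) : NULL;
      }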

  - Improve the SMP/NUMA load-balancer: scan for an idle sibling in a single pass.
    This is a 4-commit series where each commit improves one aspect of the idle
    sibling scan logic.
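
    The gist, as a hedged sketch (invented helper names; the real logic is
    in select_idle_cpu() & select_idle_core() in kernel/sched/fair.c): a
    single walk over the LLC CPUs looks for a fully idle core while
    remembering an idle CPU as fallback, instead of doing separate scans:

      static int scan_idle_sibling_sketch(const struct cpumask *llc_cpus)
      {
              int cpu, idle_cpu = -1;

              for_each_cpu(cpu, llc_cpus) {
                      if (whole_core_is_idle(cpu))    /* hypothetical helper */
                              return cpu;             /* best: fully idle core */
                      if (idle_cpu < 0 && available_idle_cpu(cpu))
                              idle_cpu = cpu;         /* fallback: idle CPU */
              }
              return idle_cpu;
      }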

  - Improve the cpufreq cooling driver by getting the effective CPU utilization
    metrics from the scheduler
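
    A hedged sketch of the consumer side (the percentage wrapper is
    invented; sched_cpu_util() as merged in this series takes the CPU and
    its maximum capacity):

      #include <linux/sched.h>
      #include <linux/sched/topology.h>

      static u32 cpu_load_percent(int cpu)
      {
              unsigned long max  = arch_scale_cpu_capacity(cpu);
              unsigned long util = sched_cpu_util(cpu, max);

              return (100 * util) / max;   /* utilization as a percentage */
      }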

  - Improve the fair scheduler's active load-balancing logic by reducing the
    number of active LB attempts & lengthening the load-balancing interval.
    This improves stress-ng mmapfork performance.

  - Fix a bug in CFS's estimated utilization (util_est) calculation that
    could result in overly high utilization values

- Misc updates & fixes:

   - Fix the HRTICK reprogramming & optimization feature
   - Fix SCHED_SOFTIRQ raising race & warning in the CPU offlining code
   - Reduce dl_add_task_root_domain() overhead
   - Fix uprobes refcount bug
   - Process pending softirqs in flush_smp_call_function_from_idle()
   - Clean up task priority related defines, remove *USER_*PRIO and
     USER_PRIO()
   - Simplify the sched_init_numa() deduplication sort
   - Documentation updates
   - Fix EAS bug in update_misfit_status(), which degraded the quality
     of energy-balancing
   - Smaller cleanups

Signed-off-by: Ingo Molnar <mingo@kernel.org>