Merge tag 'sched-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
 "Core & fair scheduler changes:

   - Cancel the slice protection of the idle entity (Zihan Zhou)
   - Reduce the default slice to avoid tasks getting an extra tick
     (Zihan Zhou)
   - Force propagating min_slice of cfs_rq when {en,de}queue tasks
     (Tianchen Ding)
   - Refactor can_migrate_task() to eliminate looping (I Hsin Cheng)
   - Add unlikely() branch hints to several system calls (Colin Ian King)
   - Optimize current_clr_polling() on certain architectures (Yujun
     Dong)

  Deadline scheduler: (Juri Lelli)
   - Remove redundant dl_clear_root_domain call
   - Move dl_rebuild_rd_accounting to cpuset.h

  Uclamp:
   - Use the uclamp_is_used() helper instead of open-coding it (Xuewen
     Yan)
   - Optimize sched_uclamp_used static key enabling (Xuewen Yan)

  Scheduler topology support: (Juri Lelli)
   - Ignore special tasks when rebuilding domains
   - Add wrappers for sched_domains_mutex
   - Generalize unique visiting of root domains
   - Rebuild root domain accounting after every update
   - Remove partition_and_rebuild_sched_domains
   - Stop exposing partition_sched_domains_locked

  RSEQ: (Michael Jeanson)
   - Update kernel fields in lockstep with CONFIG_DEBUG_RSEQ=y
   - Fix segfault on registration when rseq_cs is non-zero
   - selftests: Add rseq syscall errors test
   - selftests: Ensure the rseq ABI TLS is actually 1024 bytes

  Membarriers:
   - Fix redundant load of membarrier_state (Nysal Jan K.A.)

  Scheduler debugging:
   - Introduce and use preempt_model_str() (Sebastian Andrzej Siewior)
   - Make CONFIG_SCHED_DEBUG unconditional (Ingo Molnar)

  Fixes and cleanups:
   - Always save/restore x86 TSC sched_clock() on suspend/resume
     (Guilherme G. Piccoli)
   - Misc fixes and cleanups (Thorsten Blum, Juri Lelli, Sebastian
     Andrzej Siewior)"

* tag 'sched-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
  cpuidle, sched: Use smp_mb__after_atomic() in current_clr_polling()
  sched/debug: Remove CONFIG_SCHED_DEBUG
  sched/debug: Remove CONFIG_SCHED_DEBUG from self-test config files
  sched/debug, Documentation: Remove (most) CONFIG_SCHED_DEBUG references from documentation
  sched/debug: Make CONFIG_SCHED_DEBUG functionality unconditional
  sched/debug: Make 'const_debug' tunables unconditional __read_mostly
  sched/debug: Change SCHED_WARN_ON() to WARN_ON_ONCE()
  rseq/selftests: Fix namespace collision with rseq UAPI header
  include/{topology,cpuset}: Move dl_rebuild_rd_accounting to cpuset.h
  sched/topology: Stop exposing partition_sched_domains_locked
  cgroup/cpuset: Remove partition_and_rebuild_sched_domains
  sched/topology: Remove redundant dl_clear_root_domain call
  sched/deadline: Rebuild root domain accounting after every update
  sched/deadline: Generalize unique visiting of root domains
  sched/topology: Wrappers for sched_domains_mutex
  sched/deadline: Ignore special tasks when rebuilding domains
  tracing: Use preempt_model_str()
  xtensa: Rely on generic printing of preemption model
  x86: Rely on generic printing of preemption model
  s390: Rely on generic printing of preemption model
  ...
Linus Torvalds committed Mar 25, 2025
2 parents 5a658af + 3785c7d commit 32b2253
Showing 50 changed files with 589 additions and 441 deletions.
2 changes: 1 addition & 1 deletion Documentation/scheduler/sched-debug.rst
@@ -2,7 +2,7 @@
Scheduler debugfs
=================

Booting a kernel with CONFIG_SCHED_DEBUG=y will give access to
Booting a kernel with debugfs enabled will give access to
scheduler specific debug files under /sys/kernel/debug/sched. Some of
those files are described below.

2 changes: 1 addition & 1 deletion Documentation/scheduler/sched-design-CFS.rst
@@ -96,7 +96,7 @@ picked and the current task is preempted.
CFS uses nanosecond granularity accounting and does not rely on any jiffies or
other HZ detail. Thus the CFS scheduler has no notion of "timeslices" in the
way the previous scheduler had, and has no heuristics whatsoever. There is
only one central tunable (you have to switch on CONFIG_SCHED_DEBUG):
only one central tunable:

/sys/kernel/debug/sched/base_slice_ns

5 changes: 2 additions & 3 deletions Documentation/scheduler/sched-domains.rst
@@ -73,9 +73,8 @@ Architectures may override the generic domain builder and the default SD flags
for a given topology level by creating a sched_domain_topology_level array and
calling set_sched_topology() with this array as the parameter.

The sched-domains debugging infrastructure can be enabled by enabling
CONFIG_SCHED_DEBUG and adding 'sched_verbose' to your cmdline. If you
forgot to tweak your cmdline, you can also flip the
The sched-domains debugging infrastructure can be enabled by 'sched_verbose'
to your cmdline. If you forgot to tweak your cmdline, you can also flip the
/sys/kernel/debug/sched/verbose knob. This enables an error checking parse of
the sched domains which should catch most possible errors (described above). It
also prints out the domain structure in a visual format.
3 changes: 1 addition & 2 deletions Documentation/scheduler/sched-ext.rst
@@ -107,8 +107,7 @@ detailed information:
nr_rejected : 0
enable_seq : 1
If ``CONFIG_SCHED_DEBUG`` is set, whether a given task is on sched_ext can
be determined as follows:
Whether a given task is on sched_ext can be determined as follows:

.. code-block:: none
2 changes: 1 addition & 1 deletion Documentation/scheduler/sched-stats.rst
@@ -88,7 +88,7 @@ One of these is produced per domain for each cpu described. (Note that if
CONFIG_SMP is not defined, *no* domains are utilized and these lines
will not appear in the output. <name> is an extension to the domain field
that prints the name of the corresponding sched domain. It can appear in
schedstat version 17 and above, and requires CONFIG_SCHED_DEBUG.)
schedstat version 17 and above.

domain<N> <name> <cpumask> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45

@@ -112,7 +112,7 @@ CFS usa una granularidad de nanosegundos y no depende de ningún
jiffy o detalles como HZ. De este modo, el gestor de tareas CFS no tiene
noción de "ventanas de tiempo" de la forma en que tenía el gestor de
tareas previo, y tampoco tiene heurísticos. Únicamente hay un parámetro
central ajustable (se ha de cambiar en CONFIG_SCHED_DEBUG):
central ajustable:

/sys/kernel/debug/sched/base_slice_ns

11 changes: 2 additions & 9 deletions arch/arm/kernel/traps.c
@@ -258,13 +258,6 @@ void show_stack(struct task_struct *tsk, unsigned long *sp, const char *loglvl)
barrier();
}

#ifdef CONFIG_PREEMPT
#define S_PREEMPT " PREEMPT"
#elif defined(CONFIG_PREEMPT_RT)
#define S_PREEMPT " PREEMPT_RT"
#else
#define S_PREEMPT ""
#endif
#ifdef CONFIG_SMP
#define S_SMP " SMP"
#else
@@ -282,8 +275,8 @@ static int __die(const char *str, int err, struct pt_regs *regs)
static int die_counter;
int ret;

pr_emerg("Internal error: %s: %x [#%d]" S_PREEMPT S_SMP S_ISA "\n",
str, err, ++die_counter);
pr_emerg("Internal error: %s: %x [#%d]" S_SMP S_ISA "\n",
str, err, ++die_counter);

/* trap and error numbers are mostly meaningless on ARM */
ret = notify_die(DIE_OOPS, str, regs, err, tsk->thread.trap_no, SIGSEGV);
10 changes: 1 addition & 9 deletions arch/arm64/kernel/traps.c
@@ -172,22 +172,14 @@ static void dump_kernel_instr(const char *lvl, struct pt_regs *regs)
printk("%sCode: %s\n", lvl, str);
}

#ifdef CONFIG_PREEMPT
#define S_PREEMPT " PREEMPT"
#elif defined(CONFIG_PREEMPT_RT)
#define S_PREEMPT " PREEMPT_RT"
#else
#define S_PREEMPT ""
#endif

#define S_SMP " SMP"

static int __die(const char *str, long err, struct pt_regs *regs)
{
static int die_counter;
int ret;

pr_emerg("Internal error: %s: %016lx [#%d]" S_PREEMPT S_SMP "\n",
pr_emerg("Internal error: %s: %016lx [#%d] " S_SMP "\n",
str, err, ++die_counter);

/* trap and error numbers are mostly meaningless on ARM */
3 changes: 1 addition & 2 deletions arch/powerpc/kernel/traps.c
@@ -263,10 +263,9 @@ static int __die(const char *str, struct pt_regs *regs, long err)
{
printk("Oops: %s, sig: %ld [#%d]\n", str, err, ++die_counter);

printk("%s PAGE_SIZE=%luK%s%s%s%s%s%s %s\n",
printk("%s PAGE_SIZE=%luK%s %s%s%s%s %s\n",
IS_ENABLED(CONFIG_CPU_LITTLE_ENDIAN) ? "LE" : "BE",
PAGE_SIZE / 1024, get_mmu_str(),
IS_ENABLED(CONFIG_PREEMPT) ? " PREEMPT" : "",
IS_ENABLED(CONFIG_SMP) ? " SMP" : "",
IS_ENABLED(CONFIG_SMP) ? (" NR_CPUS=" __stringify(NR_CPUS)) : "",
debug_pagealloc_enabled() ? " DEBUG_PAGEALLOC" : "",
7 changes: 1 addition & 6 deletions arch/s390/kernel/dumpstack.c
@@ -198,13 +198,8 @@ void __noreturn die(struct pt_regs *regs, const char *str)
console_verbose();
spin_lock_irq(&die_lock);
bust_spinlocks(1);
printk("%s: %04x ilc:%d [#%d] ", str, regs->int_code & 0xffff,
printk("%s: %04x ilc:%d [#%d]", str, regs->int_code & 0xffff,
regs->int_code >> 17, ++die_counter);
#ifdef CONFIG_PREEMPT
pr_cont("PREEMPT ");
#elif defined(CONFIG_PREEMPT_RT)
pr_cont("PREEMPT_RT ");
#endif
pr_cont("SMP ");
if (debug_pagealloc_enabled())
pr_cont("DEBUG_PAGEALLOC");
9 changes: 2 additions & 7 deletions arch/x86/kernel/dumpstack.c
@@ -395,18 +395,13 @@ NOKPROBE_SYMBOL(oops_end);

static void __die_header(const char *str, struct pt_regs *regs, long err)
{
const char *pr = "";

/* Save the regs of the first oops for the executive summary later. */
if (!die_counter)
exec_summary_regs = *regs;

if (IS_ENABLED(CONFIG_PREEMPTION))
pr = IS_ENABLED(CONFIG_PREEMPT_RT) ? " PREEMPT_RT" : " PREEMPT";

printk(KERN_DEFAULT
"Oops: %s: %04lx [#%d]%s%s%s%s%s\n", str, err & 0xffff,
++die_counter, pr,
"Oops: %s: %04lx [#%d]%s%s%s%s\n", str, err & 0xffff,
++die_counter,
IS_ENABLED(CONFIG_SMP) ? " SMP" : "",
debug_pagealloc_enabled() ? " DEBUG_PAGEALLOC" : "",
IS_ENABLED(CONFIG_KASAN) ? " KASAN" : "",
4 changes: 2 additions & 2 deletions arch/x86/kernel/tsc.c
@@ -959,7 +959,7 @@ static unsigned long long cyc2ns_suspend;

void tsc_save_sched_clock_state(void)
{
if (!sched_clock_stable())
if (!static_branch_likely(&__use_tsc) && !sched_clock_stable())
return;

cyc2ns_suspend = sched_clock();
@@ -979,7 +979,7 @@ void tsc_restore_sched_clock_state(void)
unsigned long flags;
int cpu;

if (!sched_clock_stable())
if (!static_branch_likely(&__use_tsc) && !sched_clock_stable())
return;

local_irq_save(flags);
6 changes: 1 addition & 5 deletions arch/xtensa/kernel/traps.c
@@ -629,15 +629,11 @@ DEFINE_SPINLOCK(die_lock);
void __noreturn die(const char * str, struct pt_regs * regs, long err)
{
static int die_counter;
const char *pr = "";

if (IS_ENABLED(CONFIG_PREEMPTION))
pr = IS_ENABLED(CONFIG_PREEMPT_RT) ? " PREEMPT_RT" : " PREEMPT";

console_verbose();
spin_lock_irq(&die_lock);

pr_info("%s: sig: %ld [#%d]%s\n", str, err, ++die_counter, pr);
pr_info("%s: sig: %ld [#%d]\n", str, err, ++die_counter);
show_regs(regs);
if (!user_mode(regs))
show_stack(NULL, (unsigned long *)regs->areg[1], KERN_INFO);
7 changes: 0 additions & 7 deletions fs/proc/base.c
@@ -1489,7 +1489,6 @@ static const struct file_operations proc_fail_nth_operations = {
#endif


#ifdef CONFIG_SCHED_DEBUG
/*
* Print out various scheduling related per-task fields:
*/
@@ -1539,8 +1538,6 @@ static const struct file_operations proc_pid_sched_operations = {
.release = single_release,
};

#endif

#ifdef CONFIG_SCHED_AUTOGROUP
/*
* Print out autogroup related information:
@@ -3331,9 +3328,7 @@ static const struct pid_entry tgid_base_stuff[] = {
ONE("status", S_IRUGO, proc_pid_status),
ONE("personality", S_IRUSR, proc_pid_personality),
ONE("limits", S_IRUGO, proc_pid_limits),
#ifdef CONFIG_SCHED_DEBUG
REG("sched", S_IRUGO|S_IWUSR, proc_pid_sched_operations),
#endif
#ifdef CONFIG_SCHED_AUTOGROUP
REG("autogroup", S_IRUGO|S_IWUSR, proc_pid_sched_autogroup_operations),
#endif
@@ -3682,9 +3677,7 @@ static const struct pid_entry tid_base_stuff[] = {
ONE("status", S_IRUGO, proc_pid_status),
ONE("personality", S_IRUSR, proc_pid_personality),
ONE("limits", S_IRUGO, proc_pid_limits),
#ifdef CONFIG_SCHED_DEBUG
REG("sched", S_IRUGO|S_IWUSR, proc_pid_sched_operations),
#endif
NOD("comm", S_IFREG|S_IRUGO|S_IWUSR,
&proc_tid_comm_inode_operations,
&proc_pid_set_comm_operations, {}),
11 changes: 11 additions & 0 deletions include/linux/cpuset.h
@@ -125,9 +125,11 @@ static inline int cpuset_do_page_mem_spread(void)

extern bool current_cpuset_is_being_rebound(void);

extern void dl_rebuild_rd_accounting(void);
extern void rebuild_sched_domains(void);

extern void cpuset_print_current_mems_allowed(void);
extern void cpuset_reset_sched_domains(void);

/*
* read_mems_allowed_begin is required when making decisions involving
@@ -259,11 +261,20 @@ static inline bool current_cpuset_is_being_rebound(void)
return false;
}

static inline void dl_rebuild_rd_accounting(void)
{
}

static inline void rebuild_sched_domains(void)
{
partition_sched_domains(1, NULL, NULL);
}

static inline void cpuset_reset_sched_domains(void)
{
partition_sched_domains(1, NULL, NULL);
}

static inline void cpuset_print_current_mems_allowed(void)
{
}
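The !CONFIG_CPUSETS stub added above falls back to partition_sched_domains(1, NULL, NULL); the real cpuset_reset_sched_domains() in the cpuset code presumably also has to serialize against concurrent cpuset changes. A hedged sketch of that idea (the cpuset_mutex locking is an assumption, not the verbatim kernel implementation):

/*
 * Sketch only: collapse the scheduler domains back to a single default
 * domain, holding the cpuset lock so the rebuild cannot race with
 * concurrent cgroup/cpuset updates. The exact lock used is an assumption.
 */
void cpuset_reset_sched_domains(void)
{
	mutex_lock(&cpuset_mutex);
	partition_sched_domains(1, NULL, NULL);
	mutex_unlock(&cpuset_mutex);
}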
2 changes: 0 additions & 2 deletions include/linux/energy_model.h
@@ -240,9 +240,7 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
struct em_perf_state *ps;
int i;

#ifdef CONFIG_SCHED_DEBUG
WARN_ONCE(!rcu_read_lock_held(), "EM: rcu read lock needed\n");
#endif

if (!sum_util)
return 0;
2 changes: 2 additions & 0 deletions include/linux/preempt.h
@@ -515,6 +515,8 @@ static inline bool preempt_model_rt(void)
return IS_ENABLED(CONFIG_PREEMPT_RT);
}

extern const char *preempt_model_str(void);

/*
* Does the preemption model allow non-cooperative preemption?
*
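The new preempt_model_str() declared above centralizes the "PREEMPT"/"PREEMPT_RT" tags that the per-architecture die() handlers earlier in this diff used to assemble by hand. A deliberately simplified sketch of what such a helper can look like, built on the existing preempt_model_*() predicates (the real implementation also reports PREEMPT_DYNAMIC and lazy variants):

/*
 * Simplified sketch: map the active preemption model to a human-readable
 * string for oops and debug output.
 */
const char *preempt_model_str(void)
{
	if (preempt_model_rt())
		return "PREEMPT_RT";
	if (preempt_model_full())
		return "PREEMPT";
	if (preempt_model_voluntary())
		return "PREEMPT VOLUNTARY";
	return "NONE";
}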
5 changes: 5 additions & 0 deletions include/linux/sched.h
@@ -382,6 +382,11 @@ enum uclamp_id {
#ifdef CONFIG_SMP
extern struct root_domain def_root_domain;
extern struct mutex sched_domains_mutex;
extern void sched_domains_mutex_lock(void);
extern void sched_domains_mutex_unlock(void);
#else
static inline void sched_domains_mutex_lock(void) { }
static inline void sched_domains_mutex_unlock(void) { }
#endif

struct sched_param {
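On SMP kernels the sched_domains_mutex_lock()/sched_domains_mutex_unlock() helpers declared above are presumably thin wrappers around the existing sched_domains_mutex, so callers such as the deadline and cpuset rebuild paths no longer take the mutex directly. A minimal sketch of the SMP side (the !CONFIG_SMP stubs above are empty):

/* Minimal sketch of the SMP wrappers around sched_domains_mutex. */
void sched_domains_mutex_lock(void)
{
	mutex_lock(&sched_domains_mutex);
}

void sched_domains_mutex_unlock(void)
{
	mutex_unlock(&sched_domains_mutex);
}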
4 changes: 4 additions & 0 deletions include/linux/sched/deadline.h
@@ -34,7 +34,11 @@ static inline bool dl_time_before(u64 a, u64 b)
struct root_domain;
extern void dl_add_task_root_domain(struct task_struct *p);
extern void dl_clear_root_domain(struct root_domain *rd);
extern void dl_clear_root_domain_cpu(int cpu);

#endif /* CONFIG_SMP */

extern u64 dl_cookie;
extern bool dl_bw_visited(int cpu, u64 cookie);

#endif /* _LINUX_SCHED_DEADLINE_H */
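dl_cookie and dl_bw_visited() back the "generalize unique visiting of root domains" change: the caller bumps a global cookie once per update, and each root domain is then processed at most once for that cookie value. A hedged sketch of the mechanism, assuming a visit_cookie field in struct root_domain (the field name is an assumption):

/*
 * Sketch: return true if this CPU's root domain was already visited for
 * the current update identified by @cookie, otherwise mark it visited.
 */
bool dl_bw_visited(int cpu, u64 cookie)
{
	struct root_domain *rd = cpu_rq(cpu)->rd;

	if (rd->visit_cookie == cookie)
		return true;

	rd->visit_cookie = cookie;
	return false;
}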
2 changes: 0 additions & 2 deletions include/linux/sched/debug.h
@@ -35,12 +35,10 @@ extern void show_stack(struct task_struct *task, unsigned long *sp,

extern void sched_show_task(struct task_struct *p);

#ifdef CONFIG_SCHED_DEBUG
struct seq_file;
extern void proc_sched_show_task(struct task_struct *p,
struct pid_namespace *ns, struct seq_file *m);
extern void proc_sched_set_task(struct task_struct *p);
#endif

/* Attach to any functions which should be ignored in wchan output. */
#define __sched __section(".sched.text")
23 changes: 16 additions & 7 deletions include/linux/sched/idle.h
@@ -79,6 +79,21 @@ static __always_inline bool __must_check current_clr_polling_and_test(void)
return unlikely(tif_need_resched());
}

static __always_inline void current_clr_polling(void)
{
__current_clr_polling();

/*
* Ensure we check TIF_NEED_RESCHED after we clear the polling bit.
* Once the bit is cleared, we'll get IPIs with every new
* TIF_NEED_RESCHED and the IPI handler, scheduler_ipi(), will also
* fold.
*/
smp_mb__after_atomic(); /* paired with resched_curr() */

preempt_fold_need_resched();
}

#else
static inline void __current_set_polling(void) { }
static inline void __current_clr_polling(void) { }
@@ -91,21 +106,15 @@ static inline bool __must_check current_clr_polling_and_test(void)
{
return unlikely(tif_need_resched());
}
#endif

static __always_inline void current_clr_polling(void)
{
__current_clr_polling();

/*
* Ensure we check TIF_NEED_RESCHED after we clear the polling bit.
* Once the bit is cleared, we'll get IPIs with every new
* TIF_NEED_RESCHED and the IPI handler, scheduler_ipi(), will also
* fold.
*/
smp_mb(); /* paired with resched_curr() */

preempt_fold_need_resched();
}
#endif

#endif /* _LINUX_SCHED_IDLE_H */
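The reshuffle above lets the TIF_POLLING_NRFLAG variant of current_clr_polling() use smp_mb__after_atomic() instead of a full smp_mb(); on architectures where the atomic bit-clear already orders memory accesses, the barrier becomes essentially free. Roughly, the polling-idle pattern these helpers serve looks like the following toy loop (illustrative only, not the kernel's real do_idle()):

/* Toy polling-idle loop (hypothetical, for illustration only). */
static void toy_poll_idle(void)
{
	__current_set_polling();	/* remote CPUs may skip the resched IPI */
	while (!need_resched())
		cpu_relax();		/* wait for work with the polling bit set */
	current_clr_polling();		/* clear the bit, order against
					 * TIF_NEED_RESCHED, fold any pending
					 * preemption */
}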
7 changes: 7 additions & 0 deletions include/linux/sched/mm.h
@@ -531,6 +531,13 @@ enum {

static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
{
/*
* The atomic_read() below prevents CSE. The following should
* help the compiler generate more efficient code on architectures
* where sync_core_before_usermode() is a no-op.
*/
if (!IS_ENABLED(CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE))
return;
if (current->mm != mm)
return;
if (likely(!(atomic_read(&mm->membarrier_state) &