sched/numa: Allow a floating imbalance between NUMA nodes
Currently, an imbalance is only allowed when a destination node
is almost completely idle. This solved one basic class of problems
and was the cautious approach.

This patch revisits the possibility that NUMA nodes can be imbalanced
until 25% of the CPUs are occupied. The reasoning behind 25% is somewhat
superficial -- it's half the cores when HT is enabled.  At higher
utilisations, balancing should continue as normal and keep things even
until scheduler domains are fully busy or over utilised.
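As a rough illustration of the 25% rule, the standalone userspace sketch below computes the cutoff as a quarter of a node's CPU count, mirroring the `dst_weight >> 2` expression in the patch. It is not kernel code, and `numa_imbalance_cutoff` is a hypothetical helper name introduced only for this example.

```c
#include <assert.h>

/*
 * Hypothetical helper (not part of the kernel): the number of running
 * tasks below which a small NUMA imbalance is tolerated. A right shift
 * by 2 gives a quarter of the node's CPU count, rounded down.
 */
static int numa_imbalance_cutoff(int node_weight)
{
	assert(node_weight >= 0);
	return node_weight >> 2;
}
```

For a node with 40 logical CPUs (20 cores with HT enabled), the cutoff is 10, which is half the physical cores. Because the patch compares with a strict `<`, a small imbalance is tolerated only while fewer than 10 tasks run on the destination.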

Note that this is not expected to be a universal win. Any benchmark
that prefers spreading as wide as possible with limited communication
will favour the old behaviour as there is more memory bandwidth.
Workloads that communicate heavily in pairs such as netperf or tbench
benefit. For the tests I ran, the vast majority of workloads saw
a benefit so it seems to be a worthwhile trade-off.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20201120090630.3286-4-mgorman@techsingularity.net
Mel Gorman authored and Peter Zijlstra committed Nov 24, 2020
1 parent 5c33900 commit 7d2b5dd
Showing 1 changed file with 11 additions and 10 deletions.
21 changes: 11 additions & 10 deletions kernel/sched/fair.c
@@ -1559,7 +1559,8 @@ struct task_numa_env {
 static unsigned long cpu_load(struct rq *rq);
 static unsigned long cpu_runnable(struct rq *rq);
 static unsigned long cpu_util(int cpu);
-static inline long adjust_numa_imbalance(int imbalance, int dst_running);
+static inline long adjust_numa_imbalance(int imbalance,
+					int dst_running, int dst_weight);
 
 static inline enum
 numa_type numa_classify(unsigned int imbalance_pct,
@@ -1939,7 +1940,8 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 		src_running = env->src_stats.nr_running - 1;
 		dst_running = env->dst_stats.nr_running + 1;
 		imbalance = max(0, dst_running - src_running);
-		imbalance = adjust_numa_imbalance(imbalance, dst_running);
+		imbalance = adjust_numa_imbalance(imbalance, dst_running,
+							env->dst_stats.weight);
 
 		/* Use idle CPU if there is no imbalance */
 		if (!imbalance) {
@@ -8995,16 +8997,14 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 
 #define NUMA_IMBALANCE_MIN 2
 
-static inline long adjust_numa_imbalance(int imbalance, int dst_running)
+static inline long adjust_numa_imbalance(int imbalance,
+				int dst_running, int dst_weight)
 {
-	unsigned int imbalance_min;
-
 	/*
 	 * Allow a small imbalance based on a simple pair of communicating
-	 * tasks that remain local when the source domain is almost idle.
+	 * tasks that remain local when the destination is lightly loaded.
 	 */
-	imbalance_min = NUMA_IMBALANCE_MIN;
-	if (dst_running <= imbalance_min)
+	if (dst_running < (dst_weight >> 2) && imbalance <= NUMA_IMBALANCE_MIN)
 		return 0;
 
 	return imbalance;
@@ -9106,9 +9106,10 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		}
 
 		/* Consider allowing a small imbalance between NUMA groups */
-		if (env->sd->flags & SD_NUMA)
+		if (env->sd->flags & SD_NUMA) {
 			env->imbalance = adjust_numa_imbalance(env->imbalance,
-						busiest->sum_nr_running);
+				busiest->sum_nr_running, busiest->group_weight);
+		}
 
 		return;
 	}
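
To make the behavioural change in `adjust_numa_imbalance()` easier to see, here is a minimal userspace sketch that places the pre-patch and post-patch logic side by side. The `_old` and `_new` suffixes are hypothetical names added for this comparison, the `static inline` qualifiers are dropped, and the code assumes non-negative inputs; it is an approximation of the diff above, not the kernel source.

```c
#include <assert.h>

#define NUMA_IMBALANCE_MIN 2

/*
 * Pre-patch logic: any imbalance, whatever its size, is ignored while
 * the destination runs at most NUMA_IMBALANCE_MIN tasks.
 */
static long adjust_numa_imbalance_old(int imbalance, int dst_running)
{
	if (dst_running <= NUMA_IMBALANCE_MIN)
		return 0;

	return imbalance;
}

/*
 * Post-patch logic: an imbalance of at most NUMA_IMBALANCE_MIN tasks is
 * ignored while the destination runs fewer tasks than a quarter of its
 * CPU count (dst_weight >> 2).
 */
static long adjust_numa_imbalance_new(int imbalance, int dst_running,
				      int dst_weight)
{
	assert(dst_weight > 0);

	if (dst_running < (dst_weight >> 2) && imbalance <= NUMA_IMBALANCE_MIN)
		return 0;

	return imbalance;
}
```

On a hypothetical 40-CPU node with 6 tasks running, an imbalance of 2 is reported by the old logic but ignored by the new one, so a communicating pair can stay local. The new check also bounds the size of the tolerated imbalance, which the old check did not: with 2 tasks running and an imbalance of 4, the old logic returns 0 while the new logic returns 4.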