sched/numa: Avoid migrating task to CPU-less node
In a typical memory tiering system, there are no CPUs in the slow
(PMEM) NUMA nodes.  But if the number of hint page faults on a PMEM
node is the maximum for a task, the current NUMA balancing policy may
try to place the task on the PMEM node instead of a DRAM node.  This
is unreasonable, because tasks cannot run on CPU-less nodes.  To fix
this, this patch ignores CPU-less nodes when searching for a task's
migration target node.
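
The change is mechanical in most of these places: for_each_online_node() walks
every online node, including CPU-less memory-only nodes, while
for_each_node_state(nid, N_CPU) walks only the nodes in the node_states[N_CPU]
mask.  Both masks are exported through sysfs, so the difference can be
inspected from user space.  A minimal sketch (the sysfs paths are standard;
the output depends on the machine's topology):

/*
 * Hedged sketch: the kernel exports node_states[N_ONLINE] and
 * node_states[N_CPU] via sysfs, so the node sets visited by
 * for_each_online_node() vs. for_each_node_state(nid, N_CPU) can be
 * printed from user space.  On a memory tiering machine, "online"
 * typically includes the PMEM node(s) while "has_cpu" does not.
 */
#include <stdio.h>

static void show(const char *name, const char *path)
{
	char buf[256];
	FILE *f = fopen(path, "r");

	if (f && fgets(buf, sizeof(buf), f))
		printf("%-8s %s", name, buf);	/* buf keeps its newline */
	if (f)
		fclose(f);
}

int main(void)
{
	show("online:", "/sys/devices/system/node/online");	/* old iterator's set */
	show("has_cpu:", "/sys/devices/system/node/has_cpu");	/* new iterator's set */
	return 0;
}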

To test the patch, we run a workload that accesses more memory on the
PMEM node than on the DRAM node.  Without the patch, the PMEM node is
chosen as the preferred node in task_numa_placement(); with the patch,
the DRAM node is chosen instead.
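
For reference, a workload with that access pattern can be approximated from
user space with mbind(2).  This is a hedged sketch, not the authors' test
program; the node numbers (0 = DRAM, 1 = PMEM) and sizes are assumptions
about the test machine:

/* Build: gcc -O2 -o tier-repro tier-repro.c -lnuma */
#include <numaif.h>
#include <sys/mman.h>
#include <string.h>
#include <stdlib.h>

static void *bind_alloc(size_t len, int node)
{
	unsigned long mask = 1UL << node;
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		exit(1);
	/* Bind the range to a single node before first touch. */
	if (mbind(p, len, MPOL_BIND, &mask, sizeof(mask) * 8 + 1, 0))
		exit(1);
	return p;
}

int main(void)
{
	size_t dram_len = 1UL << 30;	/* 1 GiB on DRAM node 0 (assumed) */
	size_t pmem_len = 4UL << 30;	/* 4 GiB on PMEM node 1 (assumed) */
	void *dram = bind_alloc(dram_len, 0);
	void *pmem = bind_alloc(pmem_len, 1);

	for (;;) {			/* keep generating NUMA hint faults */
		memset(dram, 1, dram_len);
		memset(pmem, 1, pmem_len);
	}
}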

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220214121553.582248-2-ying.huang@intel.com
Huang Ying authored and Peter Zijlstra committed Feb 16, 2022
1 parent 0fb3978 commit 5c7b1aa
Showing 1 changed file with 20 additions and 5 deletions: kernel/sched/fair.c
@@ -1989,7 +1989,7 @@ static int task_numa_migrate(struct task_struct *p)
 	 */
 	ng = deref_curr_numa_group(p);
 	if (env.best_cpu == -1 || (ng && ng->active_nodes > 1)) {
-		for_each_online_node(nid) {
+		for_each_node_state(nid, N_CPU) {
 			if (nid == env.src_nid || nid == p->numa_preferred_nid)
 				continue;
 
@@ -2087,13 +2087,13 @@ static void numa_group_count_active_nodes(struct numa_group *numa_group)
 	unsigned long faults, max_faults = 0;
 	int nid, active_nodes = 0;
 
-	for_each_online_node(nid) {
+	for_each_node_state(nid, N_CPU) {
 		faults = group_faults_cpu(numa_group, nid);
 		if (faults > max_faults)
 			max_faults = faults;
 	}
 
-	for_each_online_node(nid) {
+	for_each_node_state(nid, N_CPU) {
 		faults = group_faults_cpu(numa_group, nid);
 		if (faults * ACTIVE_NODE_FRACTION > max_faults)
 			active_nodes++;
@@ -2247,7 +2247,7 @@ static int preferred_group_nid(struct task_struct *p, int nid)
 
 		dist = sched_max_numa_distance;
 
-		for_each_online_node(node) {
+		for_each_node_state(node, N_CPU) {
 			score = group_weight(p, node, dist);
 			if (score > max_score) {
 				max_score = score;
@@ -2266,7 +2266,7 @@ static int preferred_group_nid(struct task_struct *p, int nid)
 	 * inside the highest scoring group of nodes. The nodemask tricks
 	 * keep the complexity of the search down.
 	 */
-	nodes = node_online_map;
+	nodes = node_states[N_CPU];
 	for (dist = sched_max_numa_distance; dist > LOCAL_DISTANCE; dist--) {
 		unsigned long max_faults = 0;
 		nodemask_t max_group = NODE_MASK_NONE;
@@ -2405,6 +2405,21 @@ static void task_numa_placement(struct task_struct *p)
 		}
 	}
 
+	/* Cannot migrate task to CPU-less node */
+	if (!node_state(max_nid, N_CPU)) {
+		int near_nid = max_nid;
+		int distance, near_distance = INT_MAX;
+
+		for_each_node_state(nid, N_CPU) {
+			distance = node_distance(max_nid, nid);
+			if (distance < near_distance) {
+				near_nid = nid;
+				near_distance = distance;
+			}
+		}
+		max_nid = near_nid;
+	}
+
 	if (ng) {
 		numa_group_count_active_nodes(ng);
 		spin_unlock_irq(group_lock);
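
The new block in task_numa_placement() is the only behavioral addition: when
the node with the most hint faults has no CPUs, the task's preferred node is
redirected to the nearest node that does have CPUs, by minimum
node_distance().  A self-contained userspace re-implementation of that
selection illustrates the effect; the 3-node topology below (node 2 a CPU-less
PMEM node attached to node 0) is an assumed example, not from the patch:

#include <limits.h>
#include <stdio.h>

#define NR_NODES 3

/* Assumed SLIT-style distance table: node 2 is PMEM local to node 0. */
static const int node_distance[NR_NODES][NR_NODES] = {
	{ 10, 21, 17 },
	{ 21, 10, 28 },
	{ 17, 28, 10 },
};
static const int node_has_cpu[NR_NODES] = { 1, 1, 0 };

static int resolve_preferred_nid(int max_nid)
{
	int nid, near_nid = max_nid, near_distance = INT_MAX;

	if (node_has_cpu[max_nid])
		return max_nid;		/* no fixup needed */

	/* Mirror of the patch: pick the nearest node that has CPUs. */
	for (nid = 0; nid < NR_NODES; nid++) {
		if (!node_has_cpu[nid])
			continue;
		if (node_distance[max_nid][nid] < near_distance) {
			near_nid = nid;
			near_distance = node_distance[max_nid][nid];
		}
	}
	return near_nid;
}

int main(void)
{
	/* Hint faults peak on CPU-less node 2 -> task prefers node 0. */
	printf("preferred nid: %d\n", resolve_preferred_nid(2));
	return 0;
}

With the distances above, faults peaking on PMEM node 2 steer the task to
node 0 (distance 17) rather than node 1 (distance 28), matching the "nearest
node with CPUs" intent of the patch.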
