Skip to content

Commit

Permalink
---
Browse files Browse the repository at this point in the history
yaml
---
r: 23415
b: refs/heads/master
c: 825a46a
h: refs/heads/master
i:
  23413: 1d575ad
  23411: 7b49453
  23407: b4bcaaa
v: v3
  • Loading branch information
Paul Jackson authored and Linus Torvalds committed Mar 24, 2006
1 parent b70dd86 commit 4e729c6
Show file tree
Hide file tree
Showing 5 changed files with 204 additions and 10 deletions.
2 changes: 1 addition & 1 deletion [refs]
Original file line number Diff line number Diff line change
@@ -1,2 +1,2 @@
---
refs/heads/master: 8a39cc60bfa5a72f32d975729a354daca124f6de
refs/heads/master: 825a46af5ac171f9f41f794a0a00165588ba1589
76 changes: 74 additions & 2 deletions trunk/Documentation/cpusets.txt
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,8 @@ CONTENTS:
1.4 What are exclusive cpusets ?
1.5 What does notify_on_release do ?
1.6 What is memory_pressure ?
1.7 How do I use cpusets ?
1.7 What is memory spread ?
1.8 How do I use cpusets ?
2. Usage Examples and Syntax
2.1 Basic Usage
2.2 Adding/removing cpus
Expand Down Expand Up @@ -317,7 +318,78 @@ the tasks in the cpuset, in units of reclaims attempted per second,
times 1000.


1.7 How do I use cpusets ?
1.7 What is memory spread ?
---------------------------
There are two boolean flag files per cpuset that control where the
kernel allocates pages for the file system buffers and related in
kernel data structures. They are called 'memory_spread_page' and
'memory_spread_slab'.

If the per-cpuset boolean flag file 'memory_spread_page' is set, then
the kernel will spread the file system buffers (page cache) evenly
over all the nodes that the faulting task is allowed to use, instead
of preferring to put those pages on the node where the task is running.

If the per-cpuset boolean flag file 'memory_spread_slab' is set,
then the kernel will spread some file system related slab caches,
such as for inodes and dentries evenly over all the nodes that the
faulting task is allowed to use, instead of preferring to put those
pages on the node where the task is running.

The setting of these flags does not affect anonymous data segment or
stack segment pages of a task.

By default, both kinds of memory spreading are off, and memory
pages are allocated on the node local to where the task is running,
except perhaps as modified by the tasks NUMA mempolicy or cpuset
configuration, so long as sufficient free memory pages are available.

When new cpusets are created, they inherit the memory spread settings
of their parent.

Setting memory spreading causes allocations for the affected page
or slab caches to ignore the tasks NUMA mempolicy and be spread
instead. Tasks using mbind() or set_mempolicy() calls to set NUMA
mempolicies will not notice any change in these calls as a result of
their containing tasks memory spread settings. If memory spreading
is turned off, then the currently specified NUMA mempolicy once again
applies to memory page allocations.

Both 'memory_spread_page' and 'memory_spread_slab' are boolean flag
files. By default they contain "0", meaning that the feature is off
for that cpuset. If a "1" is written to that file, then that turns
the named feature on.

The implementation is simple.

Setting the flag 'memory_spread_page' turns on a per-process flag
PF_SPREAD_PAGE for each task that is in that cpuset or subsequently
joins that cpuset. The page allocation calls for the page cache
is modified to perform an inline check for this PF_SPREAD_PAGE task
flag, and if set, a call to a new routine cpuset_mem_spread_node()
returns the node to prefer for the allocation.

Similarly, setting 'memory_spread_cache' turns on the flag
PF_SPREAD_SLAB, and appropriately marked slab caches will allocate
pages from the node returned by cpuset_mem_spread_node().

The cpuset_mem_spread_node() routine is also simple. It uses the
value of a per-task rotor cpuset_mem_spread_rotor to select the next
node in the current tasks mems_allowed to prefer for the allocation.

This memory placement policy is also known (in other contexts) as
round-robin or interleave.

This policy can provide substantial improvements for jobs that need
to place thread local data on the corresponding node, but that need
to access large file system data sets that need to be spread across
the several nodes in the jobs cpuset in order to fit. Without this
policy, especially for jobs that might have one thread reading in the
data set, the memory allocation across the nodes in the jobs cpuset
can become very uneven.


1.8 How do I use cpusets ?
--------------------------

In order to minimize the impact of cpusets on critical kernel
Expand Down
29 changes: 28 additions & 1 deletion trunk/include/linux/cpuset.h
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@
* cpuset interface
*
* Copyright (C) 2003 BULL SA
* Copyright (C) 2004 Silicon Graphics, Inc.
* Copyright (C) 2004-2006 Silicon Graphics, Inc.
*
*/

Expand Down Expand Up @@ -51,6 +51,18 @@ extern char *cpuset_task_status_allowed(struct task_struct *task, char *buffer);
extern void cpuset_lock(void);
extern void cpuset_unlock(void);

extern int cpuset_mem_spread_node(void);

static inline int cpuset_do_page_mem_spread(void)
{
return current->flags & PF_SPREAD_PAGE;
}

static inline int cpuset_do_slab_mem_spread(void)
{
return current->flags & PF_SPREAD_SLAB;
}

#else /* !CONFIG_CPUSETS */

static inline int cpuset_init_early(void) { return 0; }
Expand Down Expand Up @@ -99,6 +111,21 @@ static inline char *cpuset_task_status_allowed(struct task_struct *task,
static inline void cpuset_lock(void) {}
static inline void cpuset_unlock(void) {}

static inline int cpuset_mem_spread_node(void)
{
return 0;
}

static inline int cpuset_do_page_mem_spread(void)
{
return 0;
}

static inline int cpuset_do_slab_mem_spread(void)
{
return 0;
}

#endif /* !CONFIG_CPUSETS */

#endif /* _LINUX_CPUSET_H */
3 changes: 3 additions & 0 deletions trunk/include/linux/sched.h
Original file line number Diff line number Diff line change
Expand Up @@ -869,6 +869,7 @@ struct task_struct {
struct cpuset *cpuset;
nodemask_t mems_allowed;
int cpuset_mems_generation;
int cpuset_mem_spread_rotor;
#endif
atomic_t fs_excl; /* holding fs exclusive resources */
struct rcu_head rcu;
Expand Down Expand Up @@ -929,6 +930,8 @@ static inline void put_task_struct(struct task_struct *t)
#define PF_BORROWED_MM 0x00400000 /* I am a kthread doing use_mm */
#define PF_RANDOMIZE 0x00800000 /* randomize virtual address space */
#define PF_SWAPWRITE 0x01000000 /* Allowed to write to swap */
#define PF_SPREAD_PAGE 0x04000000 /* Spread page cache over cpuset */
#define PF_SPREAD_SLAB 0x08000000 /* Spread some slab caches over cpuset */

/*
* Only the _current_ task can read/write to tsk->flags, but other
Expand Down
104 changes: 98 additions & 6 deletions trunk/kernel/cpuset.c
Original file line number Diff line number Diff line change
Expand Up @@ -4,15 +4,14 @@
* Processor and Memory placement constraints for sets of tasks.
*
* Copyright (C) 2003 BULL SA.
* Copyright (C) 2004 Silicon Graphics, Inc.
* Copyright (C) 2004-2006 Silicon Graphics, Inc.
*
* Portions derived from Patrick Mochel's sysfs code.
* sysfs is Copyright (c) 2001-3 Patrick Mochel
* Portions Copyright (c) 2004 Silicon Graphics, Inc.
*
* 2003-10-10 Written by Simon Derr <simon.derr@bull.net>
* 2003-10-10 Written by Simon Derr.
* 2003-10-22 Updates by Stephen Hemminger.
* 2004 May-July Rework by Paul Jackson <pj@sgi.com>
* 2004 May-July Rework by Paul Jackson.
*
* This file is subject to the terms and conditions of the GNU General Public
* License. See the file COPYING in the main directory of the Linux
Expand Down Expand Up @@ -108,7 +107,9 @@ typedef enum {
CS_MEM_EXCLUSIVE,
CS_MEMORY_MIGRATE,
CS_REMOVED,
CS_NOTIFY_ON_RELEASE
CS_NOTIFY_ON_RELEASE,
CS_SPREAD_PAGE,
CS_SPREAD_SLAB,
} cpuset_flagbits_t;

/* convenient tests for these bits */
Expand Down Expand Up @@ -137,6 +138,16 @@ static inline int is_memory_migrate(const struct cpuset *cs)
return test_bit(CS_MEMORY_MIGRATE, &cs->flags);
}

static inline int is_spread_page(const struct cpuset *cs)
{
return test_bit(CS_SPREAD_PAGE, &cs->flags);
}

static inline int is_spread_slab(const struct cpuset *cs)
{
return test_bit(CS_SPREAD_SLAB, &cs->flags);
}

/*
* Increment this atomic integer everytime any cpuset changes its
* mems_allowed value. Users of cpusets can track this generation
Expand Down Expand Up @@ -657,6 +668,14 @@ void cpuset_update_task_memory_state(void)
cs = tsk->cpuset; /* Maybe changed when task not locked */
guarantee_online_mems(cs, &tsk->mems_allowed);
tsk->cpuset_mems_generation = cs->mems_generation;
if (is_spread_page(cs))
tsk->flags |= PF_SPREAD_PAGE;
else
tsk->flags &= ~PF_SPREAD_PAGE;
if (is_spread_slab(cs))
tsk->flags |= PF_SPREAD_SLAB;
else
tsk->flags &= ~PF_SPREAD_SLAB;
task_unlock(tsk);
mutex_unlock(&callback_mutex);
mpol_rebind_task(tsk, &tsk->mems_allowed);
Expand Down Expand Up @@ -956,7 +975,8 @@ static int update_memory_pressure_enabled(struct cpuset *cs, char *buf)
/*
* update_flag - read a 0 or a 1 in a file and update associated flag
* bit: the bit to update (CS_CPU_EXCLUSIVE, CS_MEM_EXCLUSIVE,
* CS_NOTIFY_ON_RELEASE, CS_MEMORY_MIGRATE)
* CS_NOTIFY_ON_RELEASE, CS_MEMORY_MIGRATE,
* CS_SPREAD_PAGE, CS_SPREAD_SLAB)
* cs: the cpuset to update
* buf: the buffer where we read the 0 or 1
*
Expand Down Expand Up @@ -1187,6 +1207,8 @@ typedef enum {
FILE_NOTIFY_ON_RELEASE,
FILE_MEMORY_PRESSURE_ENABLED,
FILE_MEMORY_PRESSURE,
FILE_SPREAD_PAGE,
FILE_SPREAD_SLAB,
FILE_TASKLIST,
} cpuset_filetype_t;

Expand Down Expand Up @@ -1246,6 +1268,14 @@ static ssize_t cpuset_common_file_write(struct file *file, const char __user *us
case FILE_MEMORY_PRESSURE:
retval = -EACCES;
break;
case FILE_SPREAD_PAGE:
retval = update_flag(CS_SPREAD_PAGE, cs, buffer);
cs->mems_generation = atomic_inc_return(&cpuset_mems_generation);
break;
case FILE_SPREAD_SLAB:
retval = update_flag(CS_SPREAD_SLAB, cs, buffer);
cs->mems_generation = atomic_inc_return(&cpuset_mems_generation);
break;
case FILE_TASKLIST:
retval = attach_task(cs, buffer, &pathbuf);
break;
Expand Down Expand Up @@ -1355,6 +1385,12 @@ static ssize_t cpuset_common_file_read(struct file *file, char __user *buf,
case FILE_MEMORY_PRESSURE:
s += sprintf(s, "%d", fmeter_getrate(&cs->fmeter));
break;
case FILE_SPREAD_PAGE:
*s++ = is_spread_page(cs) ? '1' : '0';
break;
case FILE_SPREAD_SLAB:
*s++ = is_spread_slab(cs) ? '1' : '0';
break;
default:
retval = -EINVAL;
goto out;
Expand Down Expand Up @@ -1718,6 +1754,16 @@ static struct cftype cft_memory_pressure = {
.private = FILE_MEMORY_PRESSURE,
};

static struct cftype cft_spread_page = {
.name = "memory_spread_page",
.private = FILE_SPREAD_PAGE,
};

static struct cftype cft_spread_slab = {
.name = "memory_spread_slab",
.private = FILE_SPREAD_SLAB,
};

static int cpuset_populate_dir(struct dentry *cs_dentry)
{
int err;
Expand All @@ -1736,6 +1782,10 @@ static int cpuset_populate_dir(struct dentry *cs_dentry)
return err;
if ((err = cpuset_add_file(cs_dentry, &cft_memory_pressure)) < 0)
return err;
if ((err = cpuset_add_file(cs_dentry, &cft_spread_page)) < 0)
return err;
if ((err = cpuset_add_file(cs_dentry, &cft_spread_slab)) < 0)
return err;
if ((err = cpuset_add_file(cs_dentry, &cft_tasks)) < 0)
return err;
return 0;
Expand Down Expand Up @@ -1764,6 +1814,10 @@ static long cpuset_create(struct cpuset *parent, const char *name, int mode)
cs->flags = 0;
if (notify_on_release(parent))
set_bit(CS_NOTIFY_ON_RELEASE, &cs->flags);
if (is_spread_page(parent))
set_bit(CS_SPREAD_PAGE, &cs->flags);
if (is_spread_slab(parent))
set_bit(CS_SPREAD_SLAB, &cs->flags);
cs->cpus_allowed = CPU_MASK_NONE;
cs->mems_allowed = NODE_MASK_NONE;
atomic_set(&cs->count, 0);
Expand Down Expand Up @@ -2200,6 +2254,44 @@ void cpuset_unlock(void)
mutex_unlock(&callback_mutex);
}

/**
* cpuset_mem_spread_node() - On which node to begin search for a page
*
* If a task is marked PF_SPREAD_PAGE or PF_SPREAD_SLAB (as for
* tasks in a cpuset with is_spread_page or is_spread_slab set),
* and if the memory allocation used cpuset_mem_spread_node()
* to determine on which node to start looking, as it will for
* certain page cache or slab cache pages such as used for file
* system buffers and inode caches, then instead of starting on the
* local node to look for a free page, rather spread the starting
* node around the tasks mems_allowed nodes.
*
* We don't have to worry about the returned node being offline
* because "it can't happen", and even if it did, it would be ok.
*
* The routines calling guarantee_online_mems() are careful to
* only set nodes in task->mems_allowed that are online. So it
* should not be possible for the following code to return an
* offline node. But if it did, that would be ok, as this routine
* is not returning the node where the allocation must be, only
* the node where the search should start. The zonelist passed to
* __alloc_pages() will include all nodes. If the slab allocator
* is passed an offline node, it will fall back to the local node.
* See kmem_cache_alloc_node().
*/

int cpuset_mem_spread_node(void)
{
int node;

node = next_node(current->cpuset_mem_spread_rotor, current->mems_allowed);
if (node == MAX_NUMNODES)
node = first_node(current->mems_allowed);
current->cpuset_mem_spread_rotor = node;
return node;
}
EXPORT_SYMBOL_GPL(cpuset_mem_spread_node);

/**
* cpuset_excl_nodes_overlap - Do we overlap @p's mem_exclusive ancestors?
* @p: pointer to task_struct of some other task.
Expand Down

0 comments on commit 4e729c6

Please sign in to comment.