Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (49 commits)
  stop_machine: Move local variable closer to the usage site in cpu_stop_cpu_callback()
  sched, wait: Use wrapper functions
  sched: Remove a stale comment
  ondemand: Make the iowait-is-busy time a sysfs tunable
  ondemand: Solve a big performance issue by counting IOWAIT time as busy
  sched: Introduce get_cpu_iowait_time_us()
  sched: Eliminate the ts->idle_lastupdate field
  sched: Fold updating of the last_update_time_info into update_ts_time_stats()
  sched: Update the idle statistics in get_cpu_idle_time_us()
  sched: Introduce a function to update the idle statistics
  sched: Add a comment to get_cpu_idle_time_us()
  cpu_stop: add dummy implementation for UP
  sched: Remove rq argument to the tracepoints
  rcu: need barrier() in UP synchronize_sched_expedited()
  sched: correctly place paranoia memory barriers in synchronize_sched_expedited()
  sched: kill paranoia check in synchronize_sched_expedited()
  sched: replace migration_thread with cpu_stop
  stop_machine: reimplement using cpu_stop
  cpu_stop: implement stop_cpu[s]()
  sched: Fix select_idle_sibling() logic in select_task_rq_fair()
  ...
Linus Torvalds committed May 18, 2010
2 parents 4d7b4ac + 9c6f7e4 commit b8ae30e
Showing 39 changed files with 1,251 additions and 1,259 deletions.
10 changes: 0 additions & 10 deletions Documentation/RCU/torture.txt
@@ -182,16 +182,6 @@ Similarly, sched_expedited RCU provides the following:
sched_expedited-torture: Reader Pipe: 12660320201 95875 0 0 0 0 0 0 0 0 0
sched_expedited-torture: Reader Batch: 12660424885 0 0 0 0 0 0 0 0 0 0
sched_expedited-torture: Free-Block Circulation: 1090795 1090795 1090794 1090793 1090792 1090791 1090790 1090789 1090788 1090787 0
state: -1 / 0:0 3:0 4:0

As before, the first four lines are similar to those for RCU.
The last line shows the task-migration state. The first number is
-1 if synchronize_sched_expedited() is idle, -2 if in the process of
posting wakeups to the migration kthreads, and N when waiting on CPU N.
Each of the colon-separated fields following the "/" is a CPU:state pair.
Valid states are "0" for idle, "1" for waiting for quiescent state,
"2" for passed through quiescent state, and "3" when a race with a
CPU-hotplug event forces use of the synchronize_sched() primitive.


USAGE
54 changes: 3 additions & 51 deletions Documentation/scheduler/sched-design-CFS.txt
@@ -211,7 +211,7 @@ provide fair CPU time to each such task group. For example, it may be
desirable to first provide fair CPU time to each user on the system and then to
each task belonging to a user.

CONFIG_GROUP_SCHED strives to achieve exactly that. It lets tasks to be
CONFIG_CGROUP_SCHED strives to achieve exactly that. It lets tasks to be
grouped and divides CPU time fairly among such groups.

CONFIG_RT_GROUP_SCHED permits to group real-time (i.e., SCHED_FIFO and
@@ -220,38 +220,11 @@ SCHED_RR) tasks.
CONFIG_FAIR_GROUP_SCHED permits to group CFS (i.e., SCHED_NORMAL and
SCHED_BATCH) tasks.

At present, there are two (mutually exclusive) mechanisms to group tasks for
CPU bandwidth control purposes:

- Based on user id (CONFIG_USER_SCHED)

With this option, tasks are grouped according to their user id.

- Based on "cgroup" pseudo filesystem (CONFIG_CGROUP_SCHED)

This options needs CONFIG_CGROUPS to be defined, and lets the administrator
These options need CONFIG_CGROUPS to be defined, and let the administrator
create arbitrary groups of tasks, using the "cgroup" pseudo filesystem. See
Documentation/cgroups/cgroups.txt for more information about this filesystem.

Only one of these options to group tasks can be chosen and not both.

When CONFIG_USER_SCHED is defined, a directory is created in sysfs for each new
user and a "cpu_share" file is added in that directory.

# cd /sys/kernel/uids
# cat 512/cpu_share # Display user 512's CPU share
1024
# echo 2048 > 512/cpu_share # Modify user 512's CPU share
# cat 512/cpu_share # Display user 512's CPU share
2048
#

CPU bandwidth between two users is divided in the ratio of their CPU shares.
For example: if you would like user "root" to get twice the bandwidth of user
"guest," then set the cpu_share for both the users such that "root"'s cpu_share
is twice "guest"'s cpu_share.

When CONFIG_CGROUP_SCHED is defined, a "cpu.shares" file is created for each
When CONFIG_FAIR_GROUP_SCHED is defined, a "cpu.shares" file is created for each
group created using the pseudo filesystem. See example steps below to create
task groups and modify their CPU share using the "cgroups" pseudo filesystem.

@@ -273,24 +246,3 @@ task groups and modify their CPU share using the "cgroups" pseudo filesystem.

# #Launch gmplayer (or your favourite movie player)
# echo <movie_player_pid> > multimedia/tasks
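
The same cpu.shares interface can also be driven programmatically. The sketch below is illustrative only and is not part of this patch; the /cgroup mount point, the group names and the share values are assumptions chosen to mirror the steps above:

/* Illustrative only: give "multimedia" twice the cpu.shares of "browser",
 * so it receives roughly twice the CPU time when both groups are busy.
 * Assumes the cpu controller is mounted at /cgroup and both groups exist. */
#include <stdio.h>

static int write_shares(const char *group, unsigned int shares)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path), "/cgroup/%s/cpu.shares", group);
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%u\n", shares);
	return fclose(f);
}

int main(void)
{
	if (write_shares("multimedia", 2048) || write_shares("browser", 1024))
		return 1;
	return 0;
}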

8. Implementation note: user namespaces

User namespaces are intended to be hierarchical. But they are currently
only partially implemented. Each of those has ramifications for CFS.

First, since user namespaces are hierarchical, the /sys/kernel/uids
presentation is inadequate. Eventually we will likely want to use sysfs
tagging to provide private views of /sys/kernel/uids within each user
namespace.

Second, the hierarchical nature is intended to support completely
unprivileged use of user namespaces. So if using user groups, then
we want the users in a user namespace to be children of the user
who created it.

That is currently unimplemented. So instead, every user in a new
user namespace will receive 1024 shares just like any user in the
initial user namespace. Note that at the moment creation of a new
user namespace requires each of CAP_SYS_ADMIN, CAP_SETUID, and
CAP_SETGID.
20 changes: 4 additions & 16 deletions Documentation/scheduler/sched-rt-group.txt
@@ -126,23 +126,12 @@ priority!
2.3 Basis for grouping tasks
----------------------------

There are two compile-time settings for allocating CPU bandwidth. These are
configured using the "Basis for grouping tasks" multiple choice menu under
General setup > Group CPU Scheduler:

a. CONFIG_USER_SCHED (aka "Basis for grouping tasks" = "user id")

This lets you use the virtual files under
"/sys/kernel/uids/<uid>/cpu_rt_runtime_us" to control he CPU time reserved for
each user .

The other option is:

b. CONFIG_CGROUP_SCHED (aka "Basis for grouping tasks" = "Control groups")
Enabling CONFIG_RT_GROUP_SCHED lets you explicitly allocate real
CPU bandwidth to task groups.

This uses the /cgroup virtual file system and
"/cgroup/<cgroup>/cpu.rt_runtime_us" to control the CPU time reserved for each
control group instead.
control group.

For more information on working with control groups, you should read
Documentation/cgroups/cgroups.txt as well.
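
As a concrete illustration of the interface described above, the sketch below allots 100 ms of real-time runtime per period to one group by writing its cpu.rt_runtime_us file. It is not part of this patch; the /cgroup mount point, the group name and the 100 ms value are assumptions:

/* Illustrative only: allot the "rtgroup" control group 100ms of RT runtime
 * per period by writing to its cpu.rt_runtime_us file. */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/cgroup/rtgroup/cpu.rt_runtime_us", "w");

	if (!f) {
		perror("/cgroup/rtgroup/cpu.rt_runtime_us");
		return 1;
	}
	fprintf(f, "%d\n", 100000);	/* 100000 us = 100 ms of RT runtime */
	return fclose(f) ? 1 : 0;
}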
@@ -161,8 +150,7 @@ For now, this can be simplified to just the following (but see Future plans):
===============

There is work in progress to make the scheduling period for each group
("/sys/kernel/uids/<uid>/cpu_rt_period_us" or
"/cgroup/<cgroup>/cpu.rt_period_us" respectively) configurable as well.
("/cgroup/<cgroup>/cpu.rt_period_us") configurable as well.

The constraint on the period is that a subgroup must have a smaller or
equal period to its parent. But realistically it's not very useful _yet_
1 change: 0 additions & 1 deletion arch/s390/kernel/time.c
@@ -391,7 +391,6 @@ static void __init time_init_wq(void)
if (time_sync_wq)
return;
time_sync_wq = create_singlethread_workqueue("timesync");
stop_machine_create();
}

/*
75 changes: 73 additions & 2 deletions drivers/cpufreq/cpufreq_ondemand.c
@@ -73,6 +73,7 @@ enum {DBS_NORMAL_SAMPLE, DBS_SUB_SAMPLE};

struct cpu_dbs_info_s {
cputime64_t prev_cpu_idle;
cputime64_t prev_cpu_iowait;
cputime64_t prev_cpu_wall;
cputime64_t prev_cpu_nice;
struct cpufreq_policy *cur_policy;
@@ -108,6 +109,7 @@ static struct dbs_tuners {
unsigned int down_differential;
unsigned int ignore_nice;
unsigned int powersave_bias;
unsigned int io_is_busy;
} dbs_tuners_ins = {
.up_threshold = DEF_FREQUENCY_UP_THRESHOLD,
.down_differential = DEF_FREQUENCY_DOWN_DIFFERENTIAL,
@@ -148,6 +150,16 @@ static inline cputime64_t get_cpu_idle_time(unsigned int cpu, cputime64_t *wall)
return idle_time;
}

static inline cputime64_t get_cpu_iowait_time(unsigned int cpu, cputime64_t *wall)
{
u64 iowait_time = get_cpu_iowait_time_us(cpu, wall);

if (iowait_time == -1ULL)
return 0;

return iowait_time;
}

/*
* Find right freq to be set now with powersave_bias on.
* Returns the freq_hi to be used right now and will set freq_hi_jiffies,
@@ -249,6 +261,7 @@ static ssize_t show_##file_name \
return sprintf(buf, "%u\n", dbs_tuners_ins.object); \
}
show_one(sampling_rate, sampling_rate);
show_one(io_is_busy, io_is_busy);
show_one(up_threshold, up_threshold);
show_one(ignore_nice_load, ignore_nice);
show_one(powersave_bias, powersave_bias);
@@ -299,6 +312,23 @@ static ssize_t store_sampling_rate(struct kobject *a, struct attribute *b,
return count;
}

static ssize_t store_io_is_busy(struct kobject *a, struct attribute *b,
const char *buf, size_t count)
{
unsigned int input;
int ret;

ret = sscanf(buf, "%u", &input);
if (ret != 1)
return -EINVAL;

mutex_lock(&dbs_mutex);
dbs_tuners_ins.io_is_busy = !!input;
mutex_unlock(&dbs_mutex);

return count;
}

static ssize_t store_up_threshold(struct kobject *a, struct attribute *b,
const char *buf, size_t count)
{
@@ -381,6 +411,7 @@ static struct global_attr _name = \
__ATTR(_name, 0644, show_##_name, store_##_name)

define_one_rw(sampling_rate);
define_one_rw(io_is_busy);
define_one_rw(up_threshold);
define_one_rw(ignore_nice_load);
define_one_rw(powersave_bias);
@@ -392,6 +423,7 @@ static struct attribute *dbs_attributes[] = {
&up_threshold.attr,
&ignore_nice_load.attr,
&powersave_bias.attr,
&io_is_busy.attr,
NULL
};

@@ -470,14 +502,15 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)

for_each_cpu(j, policy->cpus) {
struct cpu_dbs_info_s *j_dbs_info;
cputime64_t cur_wall_time, cur_idle_time;
unsigned int idle_time, wall_time;
cputime64_t cur_wall_time, cur_idle_time, cur_iowait_time;
unsigned int idle_time, wall_time, iowait_time;
unsigned int load, load_freq;
int freq_avg;

j_dbs_info = &per_cpu(od_cpu_dbs_info, j);

cur_idle_time = get_cpu_idle_time(j, &cur_wall_time);
cur_iowait_time = get_cpu_iowait_time(j, &cur_wall_time);

wall_time = (unsigned int) cputime64_sub(cur_wall_time,
j_dbs_info->prev_cpu_wall);
@@ -487,6 +520,10 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
j_dbs_info->prev_cpu_idle);
j_dbs_info->prev_cpu_idle = cur_idle_time;

iowait_time = (unsigned int) cputime64_sub(cur_iowait_time,
j_dbs_info->prev_cpu_iowait);
j_dbs_info->prev_cpu_iowait = cur_iowait_time;

if (dbs_tuners_ins.ignore_nice) {
cputime64_t cur_nice;
unsigned long cur_nice_jiffies;
@@ -504,6 +541,16 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
idle_time += jiffies_to_usecs(cur_nice_jiffies);
}

/*
* For the purpose of ondemand, waiting for disk IO is an
* indication that you're performance critical, and not that
* the system is actually idle. So subtract the iowait time
* from the cpu idle time.
*/

if (dbs_tuners_ins.io_is_busy && idle_time >= iowait_time)
idle_time -= iowait_time;

if (unlikely(!wall_time || wall_time < idle_time))
continue;
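
/*
 * Illustrative sketch, not part of this patch: the net effect of the
 * io_is_busy handling above on the per-CPU load percentage that ondemand
 * derives a little further down in this function (the division itself is
 * outside the visible hunk). All names here are hypothetical.
 */
static unsigned int example_load_percent(unsigned int wall_time,
					 unsigned int idle_time,
					 unsigned int iowait_time,
					 int io_is_busy)
{
	if (io_is_busy && idle_time >= iowait_time)
		idle_time -= iowait_time;	/* count IO wait as busy time */

	if (!wall_time || wall_time < idle_time)
		return 0;			/* discard a bogus sample */

	return 100 * (wall_time - idle_time) / wall_time;
}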

@@ -617,6 +664,29 @@ static inline void dbs_timer_exit(struct cpu_dbs_info_s *dbs_info)
cancel_delayed_work_sync(&dbs_info->work);
}

/*
* Not all CPUs want IO time to be accounted as busy; this depends on how
* efficient idling at a higher frequency/voltage is.
* Pavel Machek says this is not so for various generations of AMD and old
* Intel systems.
* Mike Chan (androidlcom) claims this is also not true for ARM.
* Because of this, whitelist specific known (series) of CPUs by default, and
* leave all others up to the user.
*/
static int should_io_be_busy(void)
{
#if defined(CONFIG_X86)
/*
* For Intel, Core 2 (model 15) and later have an efficient idle.
*/
if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
boot_cpu_data.x86 == 6 &&
boot_cpu_data.x86_model >= 15)
return 1;
#endif
return 0;
}

static int cpufreq_governor_dbs(struct cpufreq_policy *policy,
unsigned int event)
{
@@ -679,6 +749,7 @@ static int cpufreq_governor_dbs(struct cpufreq_policy *policy,
dbs_tuners_ins.sampling_rate =
max(min_sampling_rate,
latency * LATENCY_MULTIPLIER);
dbs_tuners_ins.io_is_busy = should_io_be_busy();
}
mutex_unlock(&dbs_mutex);

14 changes: 2 additions & 12 deletions drivers/xen/manage.c
@@ -80,20 +80,14 @@ static void do_suspend(void)

shutting_down = SHUTDOWN_SUSPEND;

err = stop_machine_create();
if (err) {
printk(KERN_ERR "xen suspend: failed to setup stop_machine %d\n", err);
goto out;
}

#ifdef CONFIG_PREEMPT
/* If the kernel is preemptible, we need to freeze all the processes
to prevent them from being in the middle of a pagetable update
during suspend. */
err = freeze_processes();
if (err) {
printk(KERN_ERR "xen suspend: freeze failed %d\n", err);
goto out_destroy_sm;
goto out;
}
#endif

@@ -136,12 +130,8 @@ static void do_suspend(void)
out_thaw:
#ifdef CONFIG_PREEMPT
thaw_processes();

out_destroy_sm:
#endif
stop_machine_destroy();

out:
#endif
shutting_down = SHUTDOWN_INVALID;
}
#endif /* CONFIG_PM_SLEEP */
3 changes: 1 addition & 2 deletions fs/eventpoll.c
@@ -1140,8 +1140,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
* ep_poll_callback() when events will become available.
*/
init_waitqueue_entry(&wait, current);
wait.flags |= WQ_FLAG_EXCLUSIVE;
__add_wait_queue(&ep->wq, &wait);
__add_wait_queue_exclusive(&ep->wq, &wait);

for (;;) {
/*
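
/*
 * Illustrative note, not part of this patch: the single call above relies on
 * the __add_wait_queue_exclusive() helper used by the "sched, wait: Use
 * wrapper functions" commit in this merge. It is equivalent to the two
 * removed lines; a sketch of the wrapper (the in-tree definition lives in
 * include/linux/wait.h):
 */
static inline void __add_wait_queue_exclusive(wait_queue_head_t *q,
					      wait_queue_t *wait)
{
	/* mark the waiter exclusive so wake-one semantics apply, then queue it */
	wait->flags |= WQ_FLAG_EXCLUSIVE;
	__add_wait_queue(q, wait);
}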
16 changes: 5 additions & 11 deletions include/linux/cpuset.h
@@ -21,8 +21,7 @@ extern int number_of_cpusets; /* How many cpusets are defined in system? */
extern int cpuset_init(void);
extern void cpuset_init_smp(void);
extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
extern void cpuset_cpus_allowed_locked(struct task_struct *p,
struct cpumask *mask);
extern int cpuset_cpus_allowed_fallback(struct task_struct *p);
extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
#define cpuset_current_mems_allowed (current->mems_allowed)
void cpuset_init_current_mems_allowed(void);
@@ -69,9 +68,6 @@ struct seq_file;
extern void cpuset_task_status_allowed(struct seq_file *m,
struct task_struct *task);

extern void cpuset_lock(void);
extern void cpuset_unlock(void);

extern int cpuset_mem_spread_node(void);

static inline int cpuset_do_page_mem_spread(void)
@@ -105,10 +101,11 @@ static inline void cpuset_cpus_allowed(struct task_struct *p,
{
cpumask_copy(mask, cpu_possible_mask);
}
static inline void cpuset_cpus_allowed_locked(struct task_struct *p,
struct cpumask *mask)

static inline int cpuset_cpus_allowed_fallback(struct task_struct *p)
{
cpumask_copy(mask, cpu_possible_mask);
cpumask_copy(&p->cpus_allowed, cpu_possible_mask);
return cpumask_any(cpu_active_mask);
}

static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
@@ -157,9 +154,6 @@ static inline void cpuset_task_status_allowed(struct seq_file *m,
{
}

static inline void cpuset_lock(void) {}
static inline void cpuset_unlock(void) {}

static inline int cpuset_mem_spread_node(void)
{
return 0;
