Commit d41ab0e
---
r: 145944
b: refs/heads/master
c: 3f6280d
h: refs/heads/master
v: v3
Linus Torvalds committed Jun 10, 2009
1 parent f79c550 commit d41ab0e
Showing 309 changed files with 7,406 additions and 4,949 deletions.
2 changes: 1 addition & 1 deletion [refs]
@@ -1,2 +1,2 @@
---
-refs/heads/master: 92db1e6af747faa129e236d68386af26a0efc12b
+refs/heads/master: 3f6280ddf25fa656d0e17960588e52bee48a7547
18 changes: 18 additions & 0 deletions trunk/Documentation/ABI/testing/sysfs-devices-cache_disable
@@ -0,0 +1,18 @@
What: /sys/devices/system/cpu/cpu*/cache/index*/cache_disable_X
Date: August 2008
KernelVersion: 2.6.27
Contact: mark.langsdorf@amd.com
Description:    These files exist in every cpu's cache index directories.
                There are currently 2 cache_disable_# files in each
                directory. Reading from these files on a supported
                processor will return that cache disable index value
                for that processor and node. Writing to one of these
                files will cause the specified cache index to be disabled.

                Currently, only AMD Family 10h Processors support cache index
                disable, and only for their L3 caches. See the BIOS and
                Kernel Developer's Guide at
                http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/31116-Public-GH-BKDG_3.20_2-4-09.pdf
                for formatting information and other details on the
                cache index disable feature.
Users: joachim.deguara@amd.com
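
As a usage illustration, here is a minimal user-space sketch of poking this
interface; it is not part of the commit, and the cpu0/index3 path and the
index value 5 are hypothetical placeholders that must be chosen per the BKDG:

    /* Hedged sketch: disable one cache index via the sysfs file above.
     * Requires root and a supported AMD Family 10h processor. */
    #include <stdio.h>

    int main(void)
    {
        const char *path =
            "/sys/devices/system/cpu/cpu0/cache/index3/cache_disable_0";
        FILE *f = fopen(path, "w");

        if (!f) {
            perror("fopen");
            return 1;
        }
        fprintf(f, "5\n");	/* the written index is disabled */
        fclose(f);
        return 0;
    }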
131 changes: 131 additions & 0 deletions trunk/Documentation/futex-requeue-pi.txt
@@ -0,0 +1,131 @@
Futex Requeue PI
----------------

Requeueing of tasks from a non-PI futex to a PI futex requires
special handling in order to ensure the underlying rt_mutex is never
left without an owner if it has waiters; doing so would break the PI
boosting logic [see rt-mutex-design.txt]. For the purposes of
brevity, this action will be referred to as "requeue_pi" throughout
this document. Priority inheritance is abbreviated throughout as
"PI".

Motivation
----------

Without requeue_pi, the glibc implementation of
pthread_cond_broadcast() must resort to waking all the tasks waiting
on a pthread_condvar and letting them try to sort out which task
gets to run first in classic thundering-herd formation. An ideal
implementation would wake the highest-priority waiter, and leave the
rest to the natural wakeup inherent in unlocking the mutex
associated with the condvar.

Consider the simplified glibc calls:

    /* caller must lock mutex */
    pthread_cond_wait(cond, mutex)
    {
        lock(cond->__data.__lock);
        unlock(mutex);
        do {
            unlock(cond->__data.__lock);
            futex_wait(cond->__data.__futex);
            lock(cond->__data.__lock);
        } while(...)
        unlock(cond->__data.__lock);
        lock(mutex);
    }

    pthread_cond_broadcast(cond)
    {
        lock(cond->__data.__lock);
        unlock(cond->__data.__lock);
        futex_requeue(cond->__data.__futex, cond->mutex);
    }

Once pthread_cond_broadcast() requeues the tasks, the cond->mutex
has waiters. Note that pthread_cond_wait() attempts to lock the
mutex only after it has returned to user space. This will leave the
underlying rt_mutex with waiters, and no owner, breaking the
previously mentioned PI-boosting algorithms.

In order to support PI-aware pthread_condvar's, the kernel needs to
be able to requeue tasks to PI futexes. This support implies that
upon a successful futex_wait system call, the caller would return to
user space already holding the PI futex. The glibc implementation
would be modified as follows:


    /* caller must lock mutex */
    pthread_cond_wait_pi(cond, mutex)
    {
        lock(cond->__data.__lock);
        unlock(mutex);
        do {
            unlock(cond->__data.__lock);
            futex_wait_requeue_pi(cond->__data.__futex);
            lock(cond->__data.__lock);
        } while(...)
        unlock(cond->__data.__lock);
        /* the kernel acquired the mutex for us */
    }

    pthread_cond_broadcast_pi(cond)
    {
        lock(cond->__data.__lock);
        unlock(cond->__data.__lock);
        futex_requeue_pi(cond->__data.__futex, cond->mutex);
    }

The actual glibc implementation will likely test for PI and make the
necessary changes inside the existing calls rather than creating new
calls for the PI cases. Similar changes are needed for
pthread_cond_timedwait() and pthread_cond_signal().

Implementation
--------------

In order to ensure the rt_mutex has an owner if it has waiters, it
is necessary for both the requeue code, as well as the waiting code,
to be able to acquire the rt_mutex before returning to user space.
The requeue code cannot simply wake the waiter and leave it to
acquire the rt_mutex as it would open a race window between the
requeue call returning to user space and the waiter waking and
starting to run. This is especially true in the uncontended case.

The solution involves two new rt_mutex helper routines,
rt_mutex_start_proxy_lock() and rt_mutex_finish_proxy_lock(), which
allow the requeue code to acquire an uncontended rt_mutex on behalf
of the waiter and to enqueue the waiter on a contended rt_mutex.
Two new system calls provide the kernel<->user interface to
requeue_pi: FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI.

FUTEX_WAIT_REQUEUE_PI is called by the waiter (pthread_cond_wait()
and pthread_cond_timedwait()) to block on the initial futex and wait
to be requeued to a PI-aware futex. The implementation is the
result of a high-speed collision between futex_wait() and
futex_lock_pi(), with some extra logic to check for the additional
wake-up scenarios.

FUTEX_CMP_REQUEUE_PI is called by the waker
(pthread_cond_broadcast() and pthread_cond_signal()) to requeue and
possibly wake the waiting tasks. Internally, this system call is
still handled by futex_requeue (by passing requeue_pi=1). Before
requeueing, futex_requeue() attempts to acquire the requeue target
PI futex on behalf of the top waiter. If it can, this waiter is
woken. futex_requeue() then proceeds to requeue the remaining
nr_wake+nr_requeue tasks to the PI futex, calling
rt_mutex_start_proxy_lock() prior to each requeue to prepare the
task as a waiter on the underlying rt_mutex. It is possible that
the lock can be acquired at this stage as well; if so, the next
waiter is woken to finish the acquisition of the lock.

FUTEX_CMP_REQUEUE_PI accepts nr_wake and nr_requeue as arguments, but
their sum is all that really matters. futex_requeue() will wake or
requeue up to nr_wake + nr_requeue tasks. It will wake only as many
tasks as it can acquire the lock for, which in the majority of cases
should be 0 as good programming practice dictates that the caller of
either pthread_cond_broadcast() or pthread_cond_signal() acquire the
mutex prior to making the call. FUTEX_CMP_REQUEUE_PI requires that
nr_wake=1. nr_requeue should be INT_MAX for broadcast and 0 for
signal.
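
To make the new interface concrete, here is a hedged sketch of how the two
operations might be driven through the raw futex system call. The helper
names and futex words are illustrative only; the private-futex flag,
timeouts, and error handling are omitted; and the headers are assumed to
define the two new opcodes:

    /* Illustrative only; the real users are glibc's condvar internals. */
    #include <limits.h>
    #include <linux/futex.h>
    #include <stdint.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static uint32_t cond_futex;     /* non-PI futex backing the condvar */
    static uint32_t mutex_futex;    /* PI futex backing the mutex */

    /* Waiter: block on cond_futex, wake up owning mutex_futex. */
    static long wait_requeue_pi(uint32_t expected)
    {
        return syscall(SYS_futex, &cond_futex, FUTEX_WAIT_REQUEUE_PI,
                       expected, NULL /* no timeout */, &mutex_futex, 0);
    }

    /* Broadcaster: wake at most one waiter (nr_wake must be 1) and requeue
     * up to INT_MAX others onto the PI futex; nr_requeue travels in the
     * timeout argument slot, as it does for FUTEX_CMP_REQUEUE. */
    static long broadcast_requeue_pi(uint32_t expected)
    {
        return syscall(SYS_futex, &cond_futex, FUTEX_CMP_REQUEUE_PI,
                       1, (void *)(unsigned long)INT_MAX,
                       &mutex_futex, expected);
    }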
3 changes: 3 additions & 0 deletions trunk/Documentation/kernel-parameters.txt
@@ -1577,6 +1577,9 @@ and is between 256 and 4096 characters. It is defined in the file
noinitrd [RAM] Tells the kernel not to load any configured
initial RAM disk.

nointremap [X86-64, Intel-IOMMU] Do not enable interrupt
remapping.

nointroute [IA-64]

nojitter [IA64] Disables jitter checking for ITC timers.
129 changes: 128 additions & 1 deletion trunk/Documentation/memory-barriers.txt
@@ -31,6 +31,7 @@ Contents:

- Locking functions.
- Interrupt disabling functions.
- Sleep and wake-up functions.
- Miscellaneous functions.

(*) Inter-CPU locking barrier effects.
@@ -1217,6 +1218,132 @@ barriers are required in such a situation, they must be provided from some
other means.


SLEEP AND WAKE-UP FUNCTIONS
---------------------------

Sleeping and waking on an event flagged in global data can be viewed as an
interaction between two pieces of data: the task state of the task waiting for
the event and the global data used to indicate the event. To make sure that
these appear to happen in the right order, the primitives that begin the
process of going to sleep and the primitives that initiate a wake-up imply
certain barriers.

Firstly, the sleeper normally follows something like this sequence of events:

    for (;;) {
        set_current_state(TASK_UNINTERRUPTIBLE);
        if (event_indicated)
            break;
        schedule();
    }

A general memory barrier is interpolated automatically by set_current_state()
after it has altered the task state:

    CPU 1
    ===============================
    set_current_state();
      set_mb();
        STORE current->state
        <general barrier>
    LOAD event_indicated

set_current_state() may be wrapped by:

prepare_to_wait();
prepare_to_wait_exclusive();

which therefore also imply a general memory barrier after setting the state.
The whole sequence above is available in various canned forms, all of which
interpolate the memory barrier in the right place:

wait_event();
wait_event_interruptible();
wait_event_interruptible_exclusive();
wait_event_interruptible_timeout();
wait_event_killable();
wait_event_timeout();
wait_on_bit();
wait_on_bit_lock();


Secondly, code that performs a wake up normally follows something like this:

event_indicated = 1;
wake_up(&event_wait_queue);

or:

event_indicated = 1;
wake_up_process(event_daemon);

A write memory barrier is implied by wake_up() and co. if and only if they wake
something up. The barrier occurs before the task state is cleared, and so sits
between the STORE to indicate the event and the STORE to set TASK_RUNNING:

    CPU 1                           CPU 2
    =============================== ===============================
    set_current_state();            STORE event_indicated
      set_mb();                     wake_up();
        STORE current->state          <write barrier>
        <general barrier>             STORE current->state
    LOAD event_indicated

The available waker functions include:

complete();
wake_up();
wake_up_all();
wake_up_bit();
wake_up_interruptible();
wake_up_interruptible_all();
wake_up_interruptible_nr();
wake_up_interruptible_poll();
wake_up_interruptible_sync();
wake_up_interruptible_sync_poll();
wake_up_locked();
wake_up_locked_poll();
wake_up_nr();
wake_up_poll();
wake_up_process();


[!] Note that the memory barriers implied by the sleeper and the waker do _not_
order multiple stores before the wake-up with respect to loads of those stored
values after the sleeper has called set_current_state(). For instance, if the
sleeper does:

    set_current_state(TASK_INTERRUPTIBLE);
    if (event_indicated)
        break;
    __set_current_state(TASK_RUNNING);
    do_something(my_data);

and the waker does:

my_data = value;
event_indicated = 1;
wake_up(&event_wait_queue);

there's no guarantee that the change to event_indicated will be perceived by
the sleeper as coming after the change to my_data. In such a circumstance, the
code on both sides must interpolate its own memory barriers between the
separate data accesses. Thus the above sleeper ought to do:

    set_current_state(TASK_INTERRUPTIBLE);
    if (event_indicated) {
        smp_rmb();
        do_something(my_data);
    }

and the waker should do:

my_data = value;
smp_wmb();
event_indicated = 1;
wake_up(&event_wait_queue);
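
Pulling the corrected sleeper and waker together, a minimal sketch follows;
do_something(), my_data, event_indicated and event_wait_queue are the example
names from above, and enqueueing on the wait queue (prepare_to_wait() and
friends) is elided just as it is in the preceding fragments:

    #include <linux/sched.h>
    #include <linux/wait.h>

    static DECLARE_WAIT_QUEUE_HEAD(event_wait_queue);
    static int event_indicated;
    static int my_data;

    static void do_something(int data) { (void)data; /* consume payload */ }

    /* Sleeper: set_current_state() implies a general barrier between the
     * STORE of the task state and the LOAD of event_indicated, and
     * smp_rmb() orders that LOAD before the LOAD of my_data. */
    static void sleeper(void)
    {
        for (;;) {
            set_current_state(TASK_INTERRUPTIBLE);
            if (event_indicated) {
                smp_rmb();
                do_something(my_data);
                break;
            }
            schedule();
        }
        __set_current_state(TASK_RUNNING);
    }

    /* Waker: smp_wmb() orders the STORE to my_data before the STORE to
     * event_indicated; wake_up() implies a write barrier only if it
     * actually wakes a task. */
    static void waker(int value)
    {
        my_data = value;
        smp_wmb();
        event_indicated = 1;
        wake_up(&event_wait_queue);
    }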


MISCELLANEOUS FUNCTIONS
-----------------------

@@ -1366,7 +1493,7 @@ WHERE ARE MEMORY BARRIERS NEEDED?

Under normal operation, memory operation reordering is generally not going to
be a problem as a single-threaded linear piece of code will still appear to
-work correctly, even if it's in an SMP kernel. There are, however, three
+work correctly, even if it's in an SMP kernel. There are, however, four
circumstances in which reordering definitely _could_ be a problem:

(*) Interprocessor interaction.
20 changes: 19 additions & 1 deletion trunk/Documentation/scheduler/sched-rt-group.txt
@@ -4,6 +4,7 @@
CONTENTS
========

0. WARNING
1. Overview
1.1 The problem
1.2 The solution
@@ -14,6 +15,23 @@ CONTENTS
3. Future plans


0. WARNING
==========

Fiddling with these settings can result in an unstable system; the knobs are
root-only and assume that root knows what they are doing.

Most notable:

* very small values in sched_rt_period_us can result in an unstable
system when the period is smaller than either the available hrtimer
resolution or the time it takes to handle the budget refresh itself.

* very small values in sched_rt_runtime_us can result in an unstable
system when the runtime is so small the system has difficulty making
forward progress (NOTE: the migration thread and kstopmachine both
are real-time processes).
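
As a point of orientation, here is a small read-only sketch (not part of
this document) that prints the current values of these knobs from their
standard sysctl locations under /proc:

    #include <stdio.h>

    static void show(const char *path)
    {
        char buf[64];
        FILE *f = fopen(path, "r");

        if (f) {
            if (fgets(buf, sizeof(buf), f))
                printf("%s = %s", path, buf);
            fclose(f);
        }
    }

    int main(void)
    {
        show("/proc/sys/kernel/sched_rt_period_us");
        show("/proc/sys/kernel/sched_rt_runtime_us");
        return 0;
    }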

1. Overview
===========

@@ -169,7 +187,7 @@ get their allocated time.

Implementing SCHED_EDF might take a while to complete. Priority Inheritance is
the biggest challenge as the current linux PI infrastructure is geared towards
-the limited static priority levels 0-139. With deadline scheduling you need to
+the limited static priority levels 0-99. With deadline scheduling you need to
do deadline inheritance (since priority is inversely proportional to the
deadline delta (deadline - now)).

15 changes: 12 additions & 3 deletions trunk/Documentation/trace/ftrace.txt
@@ -518,9 +518,18 @@ priority with zero (0) being the highest priority and the nice
values starting at 100 (nice -20). Below is a quick chart to map
the kernel priority to user land priorities.

-Kernel priority: 0 to 99 ==> user RT priority 99 to 0
-Kernel priority: 100 to 139 ==> user nice -20 to 19
-Kernel priority: 140 ==> idle task priority
+  Kernel Space                      User Space
+ ===============================================================
+ 0(high) to 98(low)                user RT priority 99(high) to 1(low)
+                                   with SCHED_RR or SCHED_FIFO
+ ---------------------------------------------------------------
+ 99                                sched_priority is not used in scheduling
+                                   decisions (it must be specified as 0)
+ ---------------------------------------------------------------
+ 100(high) to 139(low)             user nice -20(high) to 19(low)
+ ---------------------------------------------------------------
+ 140                               idle task priority
+ ---------------------------------------------------------------

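The mapping in the table is simple arithmetic; a hedged sketch follows (the
helper is illustrative, not part of ftrace):

    #include <stdio.h>

    /* Translate a kernel priority (0-140) per the table above. */
    static void describe(int kprio)
    {
        if (kprio >= 0 && kprio <= 98)
            printf("kernel %d => user RT priority %d\n",
                   kprio, 99 - kprio);
        else if (kprio == 99)
            printf("kernel 99 => sched_priority unused\n");
        else if (kprio >= 100 && kprio <= 139)
            printf("kernel %d => nice %d\n", kprio, kprio - 120);
        else if (kprio == 140)
            printf("kernel 140 => idle task priority\n");
    }

    int main(void)
    {
        describe(0);    /* => RT priority 99 */
        describe(120);  /* => nice 0 */
        describe(140);  /* => idle task */
        return 0;
    }
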
The task states are:
