Skip to content

Commit

Permalink
---
Browse files Browse the repository at this point in the history
yaml
---
r: 116190
b: refs/heads/master
c: ed402af
h: refs/heads/master
v: v3
  • Loading branch information
Linus Torvalds committed Oct 20, 2008
1 parent 7d8637e commit 6b35dc8
Show file tree
Hide file tree
Showing 353 changed files with 13,210 additions and 4,410 deletions.
2 changes: 1 addition & 1 deletion [refs]
Original file line number Diff line number Diff line change
@@ -1,2 +1,2 @@
---
refs/heads/master: 40e24c403f325715f9c43b9fed2068641201ee0b
refs/heads/master: ed402af3c23a4804b3f8899263e8d0f97c62ab49
1 change: 1 addition & 0 deletions trunk/.mailmap
Original file line number Diff line number Diff line change
Expand Up @@ -66,6 +66,7 @@ Kenneth W Chen <kenneth.w.chen@intel.com>
Koushik <raghavendra.koushik@neterion.com>
Leonid I Ananiev <leonid.i.ananiev@intel.com>
Linas Vepstas <linas@austin.ibm.com>
Mark Brown <broonie@sirena.org.uk>
Matthieu CASTET <castet.matthieu@free.fr>
Michael Buesch <mb@bu3sch.de>
Michael Buesch <mbuesch@freenet.de>
Expand Down
File renamed without changes.
99 changes: 99 additions & 0 deletions trunk/Documentation/cgroups/freezer-subsystem.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,99 @@
The cgroup freezer is useful to batch job management system which start
and stop sets of tasks in order to schedule the resources of a machine
according to the desires of a system administrator. This sort of program
is often used on HPC clusters to schedule access to the cluster as a
whole. The cgroup freezer uses cgroups to describe the set of tasks to
be started/stopped by the batch job management system. It also provides
a means to start and stop the tasks composing the job.

The cgroup freezer will also be useful for checkpointing running groups
of tasks. The freezer allows the checkpoint code to obtain a consistent
image of the tasks by attempting to force the tasks in a cgroup into a
quiescent state. Once the tasks are quiescent another task can
walk /proc or invoke a kernel interface to gather information about the
quiesced tasks. Checkpointed tasks can be restarted later should a
recoverable error occur. This also allows the checkpointed tasks to be
migrated between nodes in a cluster by copying the gathered information
to another node and restarting the tasks there.

Sequences of SIGSTOP and SIGCONT are not always sufficient for stopping
and resuming tasks in userspace. Both of these signals are observable
from within the tasks we wish to freeze. While SIGSTOP cannot be caught,
blocked, or ignored it can be seen by waiting or ptracing parent tasks.
SIGCONT is especially unsuitable since it can be caught by the task. Any
programs designed to watch for SIGSTOP and SIGCONT could be broken by
attempting to use SIGSTOP and SIGCONT to stop and resume tasks. We can
demonstrate this problem using nested bash shells:

$ echo $$
16644
$ bash
$ echo $$
16690

From a second, unrelated bash shell:
$ kill -SIGSTOP 16690
$ kill -SIGCONT 16990

<at this point 16990 exits and causes 16644 to exit too>

This happens because bash can observe both signals and choose how it
responds to them.

Another example of a program which catches and responds to these
signals is gdb. In fact any program designed to use ptrace is likely to
have a problem with this method of stopping and resuming tasks.

In contrast, the cgroup freezer uses the kernel freezer code to
prevent the freeze/unfreeze cycle from becoming visible to the tasks
being frozen. This allows the bash example above and gdb to run as
expected.

The freezer subsystem in the container filesystem defines a file named
freezer.state. Writing "FROZEN" to the state file will freeze all tasks in the
cgroup. Subsequently writing "THAWED" will unfreeze the tasks in the cgroup.
Reading will return the current state.

* Examples of usage :

# mkdir /containers/freezer
# mount -t cgroup -ofreezer freezer /containers
# mkdir /containers/0
# echo $some_pid > /containers/0/tasks

to get status of the freezer subsystem :

# cat /containers/0/freezer.state
THAWED

to freeze all tasks in the container :

# echo FROZEN > /containers/0/freezer.state
# cat /containers/0/freezer.state
FREEZING
# cat /containers/0/freezer.state
FROZEN

to unfreeze all tasks in the container :

# echo THAWED > /containers/0/freezer.state
# cat /containers/0/freezer.state
THAWED

This is the basic mechanism which should do the right thing for user space task
in a simple scenario.

It's important to note that freezing can be incomplete. In that case we return
EBUSY. This means that some tasks in the cgroup are busy doing something that
prevents us from completely freezing the cgroup at this time. After EBUSY,
the cgroup will remain partially frozen -- reflected by freezer.state reporting
"FREEZING" when read. The state will remain "FREEZING" until one of these
things happens:

1) Userspace cancels the freezing operation by writing "THAWED" to
the freezer.state file
2) Userspace retries the freezing operation by writing "FROZEN" to
the freezer.state file (writing "FREEZING" is not legal
and returns EIO)
3) The tasks that blocked the cgroup from entering the "FROZEN"
state disappear from the cgroup's set of tasks.
24 changes: 16 additions & 8 deletions trunk/Documentation/controllers/memory.txt
Original file line number Diff line number Diff line change
Expand Up @@ -112,14 +112,22 @@ the per cgroup LRU.

2.2.1 Accounting details

All mapped pages (RSS) and unmapped user pages (Page Cache) are accounted.
RSS pages are accounted at the time of page_add_*_rmap() unless they've already
been accounted for earlier. A file page will be accounted for as Page Cache;
it's mapped into the page tables of a process, duplicate accounting is carefully
avoided. Page Cache pages are accounted at the time of add_to_page_cache().
The corresponding routines that remove a page from the page tables or removes
a page from Page Cache is used to decrement the accounting counters of the
cgroup.
All mapped anon pages (RSS) and cache pages (Page Cache) are accounted.
(some pages which never be reclaimable and will not be on global LRU
are not accounted. we just accounts pages under usual vm management.)

RSS pages are accounted at page_fault unless they've already been accounted
for earlier. A file page will be accounted for as Page Cache when it's
inserted into inode (radix-tree). While it's mapped into the page tables of
processes, duplicate accounting is carefully avoided.

A RSS page is unaccounted when it's fully unmapped. A PageCache page is
unaccounted when it's removed from radix-tree.

At page migration, accounting information is kept.

Note: we just account pages-on-lru because our purpose is to control amount
of used pages. not-on-lru pages are tend to be out-of-control from vm view.

2.3 Shared Page Accounting

Expand Down
2 changes: 1 addition & 1 deletion trunk/Documentation/cpusets.txt
Original file line number Diff line number Diff line change
Expand Up @@ -48,7 +48,7 @@ hooks, beyond what is already present, required to manage dynamic
job placement on large systems.

Cpusets use the generic cgroup subsystem described in
Documentation/cgroup.txt.
Documentation/cgroups/cgroups.txt.

Requests by a task, using the sched_setaffinity(2) system call to
include CPUs in its CPU affinity mask, and using the mbind(2) and
Expand Down
5 changes: 5 additions & 0 deletions trunk/Documentation/filesystems/ext3.txt
Original file line number Diff line number Diff line change
Expand Up @@ -96,6 +96,11 @@ errors=remount-ro(*) Remount the filesystem read-only on an error.
errors=continue Keep going on a filesystem error.
errors=panic Panic and halt the machine if an error occurs.

data_err=ignore(*) Just print an error message if an error occurs
in a file data buffer in ordered mode.
data_err=abort Abort the journal if an error occurs in a file
data buffer in ordered mode.

grpid Give objects the same group ID as their creator.
bsdgroups

Expand Down
28 changes: 18 additions & 10 deletions trunk/Documentation/filesystems/proc.txt
Original file line number Diff line number Diff line change
Expand Up @@ -1384,15 +1384,18 @@ causes the kernel to prefer to reclaim dentries and inodes.
dirty_background_ratio
----------------------

Contains, as a percentage of total system memory, the number of pages at which
the pdflush background writeback daemon will start writing out dirty data.
Contains, as a percentage of the dirtyable system memory (free pages + mapped
pages + file cache, not including locked pages and HugePages), the number of
pages at which the pdflush background writeback daemon will start writing out
dirty data.

dirty_ratio
-----------------

Contains, as a percentage of total system memory, the number of pages at which
a process which is generating disk writes will itself start writing out dirty
data.
Contains, as a percentage of the dirtyable system memory (free pages + mapped
pages + file cache, not including locked pages and HugePages), the number of
pages at which a process which is generating disk writes will itself start
writing out dirty data.

dirty_writeback_centisecs
-------------------------
Expand Down Expand Up @@ -2412,24 +2415,29 @@ will be dumped when the <pid> process is dumped. coredump_filter is a bitmask
of memory types. If a bit of the bitmask is set, memory segments of the
corresponding memory type are dumped, otherwise they are not dumped.

The following 4 memory types are supported:
The following 7 memory types are supported:
- (bit 0) anonymous private memory
- (bit 1) anonymous shared memory
- (bit 2) file-backed private memory
- (bit 3) file-backed shared memory
- (bit 4) ELF header pages in file-backed private memory areas (it is
effective only if the bit 2 is cleared)
- (bit 5) hugetlb private memory
- (bit 6) hugetlb shared memory

Note that MMIO pages such as frame buffer are never dumped and vDSO pages
are always dumped regardless of the bitmask status.

Default value of coredump_filter is 0x3; this means all anonymous memory
segments are dumped.
Note bit 0-4 doesn't effect any hugetlb memory. hugetlb memory are only
effected by bit 5-6.

Default value of coredump_filter is 0x23; this means all anonymous memory
segments and hugetlb private memory are dumped.

If you don't want to dump all shared memory segments attached to pid 1234,
write 1 to the process's proc file.
write 0x21 to the process's proc file.

$ echo 0x1 > /proc/1234/coredump_filter
$ echo 0x21 > /proc/1234/coredump_filter

When a new process is created, the process inherits the bitmask status from its
parent. It is useful to set up coredump_filter before the program runs.
Expand Down
2 changes: 1 addition & 1 deletion trunk/Documentation/kernel-parameters.txt
Original file line number Diff line number Diff line change
Expand Up @@ -690,7 +690,7 @@ and is between 256 and 4096 characters. It is defined in the file
See Documentation/block/as-iosched.txt and
Documentation/block/deadline-iosched.txt for details.

elfcorehdr= [X86-32, X86_64]
elfcorehdr= [IA64,PPC,SH,X86-32,X86_64]
Specifies physical address of start of kernel core
image elf header. Generally kexec loader will
pass this option to capture kernel.
Expand Down
Loading

0 comments on commit 6b35dc8

Please sign in to comment.