---

yaml --- r: 178790 b: refs/heads/master c: 2243b64 h: refs/heads/master v: v3
git-mirror · Jan 4, 2010 · 9a8e740 · 9a8e740
1 parent f234694
commit 9a8e740
Show file tree

Hide file tree

Showing 268 changed files with 2,221 additions and 2,777 deletions.
diff --git a/[refs] b/[refs]
@@ -1,2 +1,2 @@
 ---
-refs/heads/master: 1df4bb4af42459a4a10e7b21794a6f44463534e6
+refs/heads/master: 2243b649aa9e5669bea5413c5b520f333f96be07
diff --git a/trunk/Documentation/block/00-INDEX b/trunk/Documentation/block/00-INDEX
@@ -1,5 +1,7 @@
 00-INDEX
 	- This file
+as-iosched.txt
+	- Anticipatory IO scheduler
 barrier.txt
 	- I/O Barriers
 biodoc.txt

diff --git a/trunk/Documentation/block/as-iosched.txt b/trunk/Documentation/block/as-iosched.txt
@@ -0,0 +1,172 @@
+Anticipatory IO scheduler
+-------------------------
+Nick Piggin <piggin@cyberone.com.au>    13 Sep 2003
+
+Attention! Database servers, especially those using "TCQ" disks should
+investigate performance with the 'deadline' IO scheduler. Any system with high
+disk performance requirements should do so, in fact.
+
+If you see unusual performance characteristics of your disk systems, or you
+see big performance regressions versus the deadline scheduler, please email
+me. Database users don't bother unless you're willing to test a lot of patches
+from me ;) its a known issue.
+
+Also, users with hardware RAID controllers, doing striping, may find
+highly variable performance results with using the as-iosched. The
+as-iosched anticipatory implementation is based on the notion that a disk
+device has only one physical seeking head.  A striped RAID controller
+actually has a head for each physical device in the logical RAID device.
+
+However, setting the antic_expire (see tunable parameters below) produces
+very similar behavior to the deadline IO scheduler.
+
+Selecting IO schedulers
+-----------------------
+Refer to Documentation/block/switching-sched.txt for information on
+selecting an io scheduler on a per-device basis.
+
+Anticipatory IO scheduler Policies
+----------------------------------
+The as-iosched implementation implements several layers of policies
+to determine when an IO request is dispatched to the disk controller.
+Here are the policies outlined, in order of application.
+
+1. one-way Elevator algorithm.
+
+The elevator algorithm is similar to that used in deadline scheduler, with
+the addition that it allows limited backward movement of the elevator
+(i.e. seeks backwards).  A seek backwards can occur when choosing between
+two IO requests where one is behind the elevator's current position, and
+the other is in front of the elevator's position. If the seek distance to
+the request in back of the elevator is less than half the seek distance to
+the request in front of the elevator, then the request in back can be chosen.
+Backward seeks are also limited to a maximum of MAXBACK (1024*1024) sectors.
+This favors forward movement of the elevator, while allowing opportunistic
+"short" backward seeks.
+
+2. FIFO expiration times for reads and for writes.
+
+This is again very similar to the deadline IO scheduler.  The expiration
+times for requests on these lists is tunable using the parameters read_expire
+and write_expire discussed below.  When a read or a write expires in this way,
+the IO scheduler will interrupt its current elevator sweep or read anticipation
+to service the expired request.
+
+3. Read and write request batching
+
+A batch is a collection of read requests or a collection of write
+requests.  The as scheduler alternates dispatching read and write batches
+to the driver.  In the case a read batch, the scheduler submits read
+requests to the driver as long as there are read requests to submit, and
+the read batch time limit has not been exceeded (read_batch_expire).
+The read batch time limit begins counting down only when there are
+competing write requests pending.
+
+In the case of a write batch, the scheduler submits write requests to
+the driver as long as there are write requests available, and the
+write batch time limit has not been exceeded (write_batch_expire).
+However, the length of write batches will be gradually shortened
+when read batches frequently exceed their time limit.
+
+When changing between batch types, the scheduler waits for all requests
+from the previous batch to complete before scheduling requests for the
+next batch.
+
+The read and write fifo expiration times described in policy 2 above
+are checked only when in scheduling IO of a batch for the corresponding
+(read/write) type.  So for example, the read FIFO timeout values are
+tested only during read batches.  Likewise, the write FIFO timeout
+values are tested only during write batches.  For this reason,
+it is generally not recommended for the read batch time
+to be longer than the write expiration time, nor for the write batch
+time to exceed the read expiration time (see tunable parameters below).
+
+When the IO scheduler changes from a read to a write batch,
+it begins the elevator from the request that is on the head of the
+write expiration FIFO.  Likewise, when changing from a write batch to
+a read batch, scheduler begins the elevator from the first entry
+on the read expiration FIFO.
+
+4. Read anticipation.
+
+Read anticipation occurs only when scheduling a read batch.
+This implementation of read anticipation allows only one read request
+to be dispatched to the disk controller at a time.  In
+contrast, many write requests may be dispatched to the disk controller
+at a time during a write batch.  It is this characteristic that can make
+the anticipatory scheduler perform anomalously with controllers supporting
+TCQ, or with hardware striped RAID devices. Setting the antic_expire
+queue parameter (see below) to zero disables this behavior, and the 
+anticipatory scheduler behaves essentially like the deadline scheduler.
+
+When read anticipation is enabled (antic_expire is not zero), reads
+are dispatched to the disk controller one at a time.
+At the end of each read request, the IO scheduler examines its next
+candidate read request from its sorted read list.  If that next request
+is from the same process as the request that just completed,
+or if the next request in the queue is "very close" to the
+just completed request, it is dispatched immediately.  Otherwise,
+statistics (average think time, average seek distance) on the process
+that submitted the just completed request are examined.  If it seems
+likely that that process will submit another request soon, and that
+request is likely to be near the just completed request, then the IO
+scheduler will stop dispatching more read requests for up to (antic_expire)
+milliseconds, hoping that process will submit a new request near the one
+that just completed.  If such a request is made, then it is dispatched
+immediately.  If the antic_expire wait time expires, then the IO scheduler
+will dispatch the next read request from the sorted read queue.
+
+To decide whether an anticipatory wait is worthwhile, the scheduler
+maintains statistics for each process that can be used to compute
+mean "think time" (the time between read requests), and mean seek
+distance for that process.  One observation is that these statistics
+are associated with each process, but those statistics are not associated
+with a specific IO device.  So for example, if a process is doing IO
+on several file systems on separate devices, the statistics will be
+a combination of IO behavior from all those devices.
+
+
+Tuning the anticipatory IO scheduler
+------------------------------------
+When using 'as', the anticipatory IO scheduler there are 5 parameters under
+/sys/block/*/queue/iosched/. All are units of milliseconds.
+
+The parameters are:
+* read_expire
+    Controls how long until a read request becomes "expired". It also controls the
+    interval between which expired requests are served, so set to 50, a request
+    might take anywhere < 100ms to be serviced _if_ it is the next on the
+    expired list. Obviously request expiration strategies won't make the disk
+    go faster. The result basically equates to the timeslice a single reader
+    gets in the presence of other IO. 100*((seek time / read_expire) + 1) is
+    very roughly the % streaming read efficiency your disk should get with
+    multiple readers.
+
+* read_batch_expire
+    Controls how much time a batch of reads is given before pending writes are
+    served. A higher value is more efficient. This might be set below read_expire
+    if writes are to be given higher priority than reads, but reads are to be
+    as efficient as possible when there are no writes. Generally though, it
+    should be some multiple of read_expire.
+
+* write_expire, and
+* write_batch_expire are equivalent to the above, for writes.
+
+* antic_expire
+    Controls the maximum amount of time we can anticipate a good read (one
+    with a short seek distance from the most recently completed request) before
+    giving up. Many other factors may cause anticipation to be stopped early,
+    or some processes will not be "anticipated" at all. Should be a bit higher
+    for big seek time devices though not a linear correspondence - most
+    processes have only a few ms thinktime.
+
+In addition to the tunables above there is a read-only file named est_time
+which, when read, will show:
+
+    - The probability of a task exiting without a cooperating task
+      submitting an anticipated IO.
+
+    - The current mean think time.
+
+    - The seek distance used to determine if an incoming IO is better.
+
diff --git a/trunk/Documentation/filesystems/ext4.txt b/trunk/Documentation/filesystems/ext4.txt
@@ -196,7 +196,7 @@ nobarrier		This also requires an IO stack which can support
 			also be used to enable or disable barriers, for
 			consistency with other ext4 mount options.
 
-inode_readahead_blks=n	This tuning parameter controls the maximum
+inode_readahead=n	This tuning parameter controls the maximum
 			number of inode table blocks that ext4's inode
 			table readahead algorithm will pre-read into
 			the buffer cache.  The default value is 32 blocks.

diff --git a/trunk/Documentation/kernel-parameters.txt b/trunk/Documentation/kernel-parameters.txt
@@ -240,7 +240,7 @@ and is between 256 and 4096 characters. It is defined in the file
 
 	acpi_sleep=	[HW,ACPI] Sleep options
 			Format: { s3_bios, s3_mode, s3_beep, s4_nohwsig,
-				  old_ordering, s4_nonvs, sci_force_enable }
+				  old_ordering, s4_nonvs }
 			See Documentation/power/video.txt for information on
 			s3_bios and s3_mode.
 			s3_beep is for debugging; it makes the PC's speaker beep
@@ -253,9 +253,6 @@ and is between 256 and 4096 characters. It is defined in the file
 			of _PTS is used by default).
 			s4_nonvs prevents the kernel from saving/restoring the
 			ACPI NVS memory during hibernation.
-			sci_force_enable causes the kernel to set SCI_EN directly
-			on resume from S1/S3 (which is against the ACPI spec,
-			but some broken systems don't work without it).
 
 	acpi_use_timer_override [HW,ACPI]
 			Use timer override. For some broken Nvidia NF5 boards

diff --git a/trunk/Documentation/kvm/api.txt b/trunk/Documentation/kvm/api.txt
@@ -685,7 +685,7 @@ struct kvm_vcpu_events {
 		__u8 pad;
 	} nmi;
 	__u32 sipi_vector;
-	__u32 flags;
+	__u32 flags;   /* must be zero */
 };
 
 4.30 KVM_SET_VCPU_EVENTS
@@ -701,14 +701,6 @@ vcpu.
 
 See KVM_GET_VCPU_EVENTS for the data structure.
 
-Fields that may be modified asynchronously by running VCPUs can be excluded
-from the update. These fields are nmi.pending and sipi_vector. Keep the
-corresponding bits in the flags field cleared to suppress overwriting the
-current in-kernel state. The bits are:
-
-KVM_VCPUEVENT_VALID_NMI_PENDING - transfer nmi.pending to the kernel
-KVM_VCPUEVENT_VALID_SIPI_VECTOR - transfer sipi_vector
-
 
 5. The kvm_run structure
 

diff --git a/trunk/Documentation/laptops/thinkpad-acpi.txt b/trunk/Documentation/laptops/thinkpad-acpi.txt
@@ -1092,8 +1092,8 @@ WARNING:
     its level up and down at every change.
 
 
-Volume control (Console Audio control)
---------------------------------------
+Volume control
+--------------
 
 procfs: /proc/acpi/ibm/volume
 ALSA: "ThinkPad Console Audio Control", default ID: "ThinkPadEC"
@@ -1110,53 +1110,9 @@ the desktop environment to just provide on-screen-display feedback.
 Software volume control should be done only in the main AC97/HDA
 mixer.
 
-
-About the ThinkPad Console Audio control:
-
-ThinkPads have a built-in amplifier and muting circuit that drives the
-console headphone and speakers.  This circuit is after the main AC97
-or HDA mixer in the audio path, and under exclusive control of the
-firmware.
-
-ThinkPads have three special hotkeys to interact with the console
-audio control: volume up, volume down and mute.
-
-It is worth noting that the normal way the mute function works (on
-ThinkPads that do not have a "mute LED") is:
-
-1. Press mute to mute.  It will *always* mute, you can press it as
-   many times as you want, and the sound will remain mute.
-
-2. Press either volume key to unmute the ThinkPad (it will _not_
-   change the volume, it will just unmute).
-
-This is a very superior design when compared to the cheap software-only
-mute-toggle solution found on normal consumer laptops:  you can be
-absolutely sure the ThinkPad will not make noise if you press the mute
-button, no matter the previous state.
-
-The IBM ThinkPads, and the earlier Lenovo ThinkPads have variable-gain
-amplifiers driving the speakers and headphone output, and the firmware
-also handles volume control for the headphone and speakers on these
-ThinkPads without any help from the operating system (this volume
-control stage exists after the main AC97 or HDA mixer in the audio
-path).
-
-The newer Lenovo models only have firmware mute control, and depend on
-the main HDA mixer to do volume control (which is done by the operating
-system).  In this case, the volume keys are filtered out for unmute
-key press (there are some firmware bugs in this area) and delivered as
-normal key presses to the operating system (thinkpad-acpi is not
-involved).
-
-
-The ThinkPad-ACPI volume control:
-
-The preferred way to interact with the Console Audio control is the
-ALSA interface.
-
-The legacy procfs interface allows one to read the current state,
-and if volume control is enabled, accepts the following commands:
+This feature allows volume control on ThinkPad models with a digital
+volume knob (when available, not all models have it), as well as
+mute/unmute control.  The available commands are:
 
 	echo up   >/proc/acpi/ibm/volume
 	echo down >/proc/acpi/ibm/volume
@@ -1165,10 +1121,12 @@ and if volume control is enabled, accepts the following commands:
 	echo 'level <level>' >/proc/acpi/ibm/volume
 
 The <level> number range is 0 to 14 although not all of them may be
-distinct. To unmute the volume after the mute command, use either the
+distinct. The unmute the volume after the mute command, use either the
 up or down command (the level command will not unmute the volume), or
 the unmute command.
 
+The current volume level and mute state is shown in the file.
+
 You can use the volume_capabilities parameter to tell the driver
 whether your thinkpad has volume control or mute-only control:
 volume_capabilities=1 for mixers with mute and volume control,

diff --git a/trunk/Documentation/sound/alsa/Procfile.txt b/trunk/Documentation/sound/alsa/Procfile.txt
@@ -95,7 +95,7 @@ card*/pcm*/xrun_debug
 	It takes an integer value, can be changed by writing to this
 	file, such as
 
-		 # echo 5 > /proc/asound/card0/pcm0p/xrun_debug
+		 # cat 5 > /proc/asound/card0/pcm0p/xrun_debug
 
 	The value consists of the following bit flags:
 	  bit 0 = Enable XRUN/jiffies debug messages

diff --git a/trunk/Documentation/trace/ftrace-design.txt b/trunk/Documentation/trace/ftrace-design.txt
@@ -53,14 +53,14 @@ size of the mcount call that is embedded in the function).
 For example, if the function foo() calls bar(), when the bar() function calls
 mcount(), the arguments mcount() will pass to the tracer are:
 	"frompc" - the address bar() will use to return to foo()
-	"selfpc" - the address bar() (with mcount() size adjustment)
+	"selfpc" - the address bar() (with _mcount() size adjustment)
 
 Also keep in mind that this mcount function will be called *a lot*, so
 optimizing for the default case of no tracer will help the smooth running of
 your system when tracing is disabled.  So the start of the mcount function is
-typically the bare minimum with checking things before returning.  That also
-means the code flow should usually be kept linear (i.e. no branching in the nop
-case).  This is of course an optimization and not a hard requirement.
+typically the bare min with checking things before returning.  That also means
+the code flow should usually kept linear (i.e. no branching in the nop case).
+This is of course an optimization and not a hard requirement.
 
 Here is some pseudo code that should help (these functions should actually be
 implemented in assembly):
@@ -131,10 +131,10 @@ some functions to save (hijack) and restore the return address.
 
 The mcount function should check the function pointers ftrace_graph_return
 (compare to ftrace_stub) and ftrace_graph_entry (compare to
-ftrace_graph_entry_stub).  If either of those is not set to the relevant stub
+ftrace_graph_entry_stub).  If either of those are not set to the relevant stub
 function, call the arch-specific function ftrace_graph_caller which in turn
 calls the arch-specific function prepare_ftrace_return.  Neither of these
-function names is strictly required, but you should use them anyway to stay
+function names are strictly required, but you should use them anyways to stay
 consistent across the architecture ports -- easier to compare & contrast
 things.
 
@@ -144,7 +144,7 @@ but the first argument should be a pointer to the "frompc".  Typically this is
 located on the stack.  This allows the function to hijack the return address
 temporarily to have it point to the arch-specific function return_to_handler.
 That function will simply call the common ftrace_return_to_handler function and
-that will return the original return address with which you can return to the
+that will return the original return address with which, you can return to the
 original call site.
 
 Here is the updated mcount pseudo code: