Commit

yaml
---
r: 71620
b: refs/heads/master
c: bf0a40b
h: refs/heads/master
v: v3
Len Brown committed Oct 10, 2007
1 parent e40ca50 commit fd15eb1
Showing 184 changed files with 2,503 additions and 1,245 deletions.
2 changes: 1 addition & 1 deletion [refs]
Original file line number Diff line number Diff line change
@@ -1,2 +1,2 @@
---
-refs/heads/master: 32afbf07aa53120c0e3fe1881b948ded99f4fc35
+refs/heads/master: bf0a40b77ab1ce13fb6fa3bf7c8e413bac02873a
219 changes: 219 additions & 0 deletions trunk/Documentation/crypto/async-tx-api.txt
@@ -0,0 +1,219 @@
Asynchronous Transfers/Transforms API

1 INTRODUCTION

2 GENEALOGY

3 USAGE
3.1 General format of the API
3.2 Supported operations
3.3 Descriptor management
3.4 When does the operation execute?
3.5 When does the operation complete?
3.6 Constraints
3.7 Example

4 DRIVER DEVELOPER NOTES
4.1 Conformance points
4.2 "My application needs finer control of hardware channels"

5 SOURCE

---

1 INTRODUCTION

The async_tx API provides methods for describing a chain of asynchronous
bulk memory transfers/transforms with support for inter-transactional
dependencies. It is implemented as a dmaengine client that smooths over
the details of different hardware offload engine implementations. Code
that is written to the API can optimize for asynchronous operation and
the API will fit the chain of operations to the available offload
resources.

2 GENEALOGY

The API was initially designed to offload the memory copy and
xor-parity-calculations of the md-raid5 driver using the offload engines
present in the Intel(R) Xscale series of I/O processors. It also built
on the 'dmaengine' layer developed for offloading memory copies in the
network stack using Intel(R) I/OAT engines. The following design
features surfaced as a result:
1/ implicit synchronous path: users of the API do not need to know if
the platform they are running on has offload capabilities. The
operation will be offloaded when an engine is available and carried out
in software otherwise.
2/ cross channel dependency chains: the API allows a chain of dependent
operations to be submitted, like xor->copy->xor in the raid5 case. The
API automatically handles cases where the transition from one operation
to another implies a hardware channel switch.
3/ dmaengine extensions to support multiple clients and operation types
beyond 'memcpy'

3 USAGE

3.1 General format of the API:
struct dma_async_tx_descriptor *
async_<operation>(<op specific parameters>,
                  enum async_tx_flags flags,
                  struct dma_async_tx_descriptor *dependency,
                  dma_async_tx_callback callback_routine,
                  void *callback_parameter);

3.2 Supported operations:
memcpy       - memory copy between a source and a destination buffer
memset       - fill a destination buffer with a byte value
xor          - xor a series of source buffers and write the result to a
               destination buffer
xor_zero_sum - xor a series of source buffers and set a flag if the
               result is zero. The implementation attempts to prevent
               writes to memory
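
To make the xor and xor_zero_sum descriptions above concrete, here is a
minimal user-space sketch of the same semantics carried out in software.
It is an illustration only, not the kernel implementation: the names
soft_xor() and soft_xor_zero_sum() are hypothetical and are not part of
async_tx.

```c
#include <stddef.h>

/* Hypothetical helper: dest = srcs[0] ^ srcs[1] ^ ... ^ srcs[src_cnt - 1]. */
static void soft_xor(unsigned char *dest, unsigned char *const *srcs,
		     int src_cnt, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		unsigned char acc = 0;

		for (int s = 0; s < src_cnt; s++)
			acc ^= srcs[s][i];
		dest[i] = acc;
	}
}

/*
 * Hypothetical helper: returns 1 when the xor of all sources is zero in
 * every byte, 0 otherwise. Only the flag is produced; no buffer is
 * written.
 */
static int soft_xor_zero_sum(unsigned char *const *srcs, int src_cnt,
			     size_t len)
{
	for (size_t i = 0; i < len; i++) {
		unsigned char acc = 0;

		for (int s = 0; s < src_cnt; s++)
			acc ^= srcs[s][i];
		if (acc)
			return 0;
	}
	return 1;
}
```

If soft_xor() computes a parity buffer from a set of data buffers, then
soft_xor_zero_sum() over the data buffers plus that parity buffer returns
1 as long as none of them has changed.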

3.3 Descriptor management:
The return value is non-NULL and points to a 'descriptor' when the operation
has been queued to execute asynchronously. Descriptors are recycled
resources, under control of the offload engine driver, to be reused as
operations complete. When an application needs to submit a chain of
operations it must guarantee that the descriptor is not automatically recycled
before the dependency is submitted. This requires that all descriptors be
acknowledged by the application before the offload engine driver is allowed to
recycle (or free) the descriptor. A descriptor can be acked by one of the
following methods:
1/ setting the ASYNC_TX_ACK flag if no child operations are to be submitted
2/ setting the ASYNC_TX_DEP_ACK flag to acknowledge the parent
descriptor of a new operation.
3/ calling async_tx_ack() on the descriptor.
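
The acknowledgement rule above can be modelled in a few lines of
user-space C. This is a toy sketch, not kernel code: struct toy_desc,
the TOY_* flags and the toy_* functions are hypothetical stand-ins for
the descriptor, ASYNC_TX_ACK / ASYNC_TX_DEP_ACK and the driver's recycle
check. It assumes a driver may recycle a descriptor only once the
operation has completed and the application has acked it.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a descriptor; NOT struct dma_async_tx_descriptor. */
struct toy_desc {
	bool completed;	/* the operation has finished */
	bool acked;	/* the application no longer needs the descriptor */
};

/* Toy stand-ins for ASYNC_TX_ACK and ASYNC_TX_DEP_ACK. */
enum toy_flags {
	TOY_ACK     = 1 << 0,	/* ack the new descriptor itself */
	TOY_DEP_ACK = 1 << 1,	/* ack the parent of the new descriptor */
};

/* Models submitting an operation that may depend on 'parent'. */
static void toy_submit(struct toy_desc *d, struct toy_desc *parent,
		       unsigned int flags)
{
	d->completed = false;
	d->acked = (flags & TOY_ACK) != 0;
	if (parent && (flags & TOY_DEP_ACK))
		parent->acked = true;
}

/* Models an explicit async_tx_ack() call on a descriptor. */
static void toy_ack(struct toy_desc *d)
{
	d->acked = true;
}

/* Assumed driver-side rule: recycle only when completed and acked. */
static bool toy_can_recycle(const struct toy_desc *d)
{
	return d->completed && d->acked;
}
```

In this model a completed parent stays unavailable for recycling until a
child is submitted with TOY_DEP_ACK or toy_ack() is called on it, which
is the guarantee a chain of dependent operations relies on.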

3.4 When does the operation execute?
Operations do not immediately issue after return from the
async_<operation> call. Offload engine drivers batch operations to
improve performance by reducing the number of mmio cycles needed to
manage the channel. Once a driver-specific threshold is met the driver
automatically issues pending operations. An application can force this
event by calling async_tx_issue_pending_all(). This operates on all
channels since the application has no knowledge of channel to operation
mapping.
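
The batching behaviour described above can be sketched as a user-space
toy model. Everything here is hypothetical: struct toy_chan, the toy_*
functions and the threshold of 4 are illustrative assumptions, since the
real threshold is driver specific. The 'kicks' counter stands in for the
mmio writes that batching is meant to reduce.

```c
#include <stddef.h>

#define TOY_THRESHOLD 4	/* assumed value; real drivers choose their own */

/* Toy stand-in for a hardware channel; NOT struct dma_chan. */
struct toy_chan {
	unsigned int pending;	/* queued but not yet issued */
	unsigned int issued;	/* handed to the (imaginary) hardware */
	unsigned int kicks;	/* number of times the channel was kicked */
};

/* Issue everything pending on one channel with a single kick. */
static void toy_issue_pending(struct toy_chan *c)
{
	if (!c->pending)
		return;
	c->issued += c->pending;
	c->pending = 0;
	c->kicks++;
}

/* Queue one operation; issue automatically once the threshold is met. */
static void toy_queue(struct toy_chan *c)
{
	if (++c->pending >= TOY_THRESHOLD)
		toy_issue_pending(c);
}

/* Models async_tx_issue_pending_all(): flush every channel. */
static void toy_issue_pending_all(struct toy_chan *chans, size_t n)
{
	for (size_t i = 0; i < n; i++)
		toy_issue_pending(&chans[i]);
}
```

Queueing five operations on one channel in this model costs one kick for
the first four and leaves one operation pending until
toy_issue_pending_all() is called.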

3.5 When does the operation complete?
There are two methods for an application to learn about the completion
of an operation.
1/ Call dma_wait_for_async_tx(). This call causes the CPU to spin while
it polls for the completion of the operation. It handles dependency
chains and issuing pending operations.
2/ Specify a completion callback. The callback routine runs in tasklet
context if the offload engine driver supports interrupts, or it is
called in application context if the operation is carried out
synchronously in software. The callback can be set in the call to
async_<operation>, or when the application needs to submit a chain of
unknown length it can use the async_trigger_callback() routine to set a
completion interrupt/callback at the end of the chain.

3.6 Constraints:
1/ Calls to async_<operation> are not permitted in IRQ context. Other
contexts are permitted provided constraint #2 is not violated.
2/ Completion callback routines cannot submit new operations. This
results in recursion in the synchronous case and spin_locks being
acquired twice in the asynchronous case.
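
One way to live with constraint #2 is to have the completion callback
only record that a stage finished, and to submit the follow-up operation
later from the application's own context. The user-space sketch below
models that split. It is an assumption about how a client might be
structured, not code taken from the kernel, and all toy_* names are
hypothetical.

```c
#include <stddef.h>

/* Hypothetical per-request state shared by callback and submitter. */
struct toy_request {
	int stage_done;		/* set by the completion callback */
	int stages_left;	/* follow-up operations still to submit */
	int submitted;		/* operations submitted so far */
};

/* Completion callback: records the event and never submits work. */
static void toy_complete(void *param)
{
	struct toy_request *req = param;

	req->stage_done = 1;
}

/*
 * Runs later in the submitter's own context. Returns 1 if it submitted
 * a follow-up operation, 0 if there was nothing to do.
 */
static int toy_handle_request(struct toy_request *req)
{
	if (!req->stage_done || req->stages_left == 0)
		return 0;
	req->stage_done = 0;
	req->stages_left--;
	req->submitted++;	/* stands in for a new async_<operation> call */
	return 1;
}
```

Because toy_complete() never calls back into the submission path, the
recursion and double-locking problems named in constraint #2 cannot
arise in this arrangement.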

3.7 Example:
Perform a xor->copy->xor operation where each operation depends on the
result from the previous operation:

void complete_xor_copy_xor(void *param)
{
	printk("complete\n");
}

int run_xor_copy_xor(struct page **xor_srcs,
		     int xor_src_cnt,
		     struct page *xor_dest,
		     size_t xor_len,
		     struct page *copy_src,
		     struct page *copy_dest,
		     size_t copy_len)
{
	struct dma_async_tx_descriptor *tx;

	/* first xor: no dependency, no callback */
	tx = async_xor(xor_dest, xor_srcs, 0, xor_src_cnt, xor_len,
		       ASYNC_TX_XOR_DROP_DST, NULL, NULL, NULL);
	/* copy: depends on the xor and acks it */
	tx = async_memcpy(copy_dest, copy_src, 0, 0, copy_len,
			  ASYNC_TX_DEP_ACK, tx, NULL, NULL);
	/* final xor: acks the copy and itself, runs the callback when done */
	tx = async_xor(xor_dest, xor_srcs, 0, xor_src_cnt, xor_len,
		       ASYNC_TX_XOR_DROP_DST | ASYNC_TX_DEP_ACK | ASYNC_TX_ACK,
		       tx, complete_xor_copy_xor, NULL);

	async_tx_issue_pending_all();

	return 0;
}

See include/linux/async_tx.h for more information on the flags. See the
ops_run_* and ops_complete_* routines in drivers/md/raid5.c for more
implementation examples.

4 DRIVER DEVELOPER NOTES
4.1 Conformance points:
There are a few conformance points required in dmaengine drivers to
accommodate assumptions made by applications using the async_tx API:
1/ Completion callbacks are expected to happen in tasklet context
2/ dma_async_tx_descriptor fields are never manipulated in IRQ context
3/ Use async_tx_run_dependencies() in the descriptor clean up path to
handle submission of dependent operations

4.2 "My application needs finer control of hardware channels"
This requirement seems to arise from cases where a DMA engine driver is
trying to support device-to-memory DMA. The dmaengine and async_tx
implementations were designed for offloading memory-to-memory
operations; however, there are some capabilities of the dmaengine layer
that can be used for platform-specific channel management.
Platform-specific constraints can be handled by registering the
application as a 'dma_client' and implementing a 'dma_event_callback' to
apply a filter to the available channels in the system. Before showing
how to implement a custom dma_event callback, some background on
dmaengine's client support is required.

The following routines in dmaengine support multiple clients requesting
use of a channel:
- dma_async_client_register(struct dma_client *client)
- dma_async_client_chan_request(struct dma_client *client)

dma_async_client_register takes a pointer to an initialized dma_client
structure. It expects that the 'event_callback' and 'cap_mask' fields
are already initialized.

dma_async_client_chan_request triggers dmaengine to notify the client of
all channels that satisfy the capability mask. It is up to the client's
event_callback routine to track how many channels the client needs and
how many it is currently using. The dma_event_callback routine returns a
dma_state_client code to let dmaengine know the status of the
allocation.

Below is an example of how to extend this functionality for
platform-specific filtering of the available channels beyond the
standard capability mask:

static enum dma_state_client
my_dma_client_callback(struct dma_client *client,
		       struct dma_chan *chan, enum dma_state state)
{
	struct dma_device *dma_dev;
	struct my_platform_specific_dma *plat_dma_dev;

	dma_dev = chan->device;
	plat_dma_dev = container_of(dma_dev,
				    struct my_platform_specific_dma,
				    dma_dev);

	if (!plat_dma_dev->platform_specific_capability)
		return DMA_DUP;

	. . .
}

5 SOURCE
include/linux/dmaengine.h: core header file for DMA drivers and clients
drivers/dma/dmaengine.c: offload engine channel management routines
drivers/dma/: location for offload engine drivers
include/linux/async_tx.h: core header file for the async_tx api
crypto/async_tx/async_tx.c: async_tx interface to dmaengine and common code
crypto/async_tx/async_memcpy.c: copy offload
crypto/async_tx/async_memset.c: memory fill offload
crypto/async_tx/async_xor.c: xor and xor zero sum offload
2 changes: 2 additions & 0 deletions trunk/Documentation/devices.txt
@@ -94,6 +94,8 @@ Your cooperation is appreciated.
9 = /dev/urandom Faster, less secure random number gen.
10 = /dev/aio Asynchronous I/O notification interface
11 = /dev/kmsg Writes to this come out as printk's
+12 = /dev/oldmem Used by crashdump kernels to access
+the memory of the kernel that crashed.

1 block RAM disk
0 = /dev/ram0 First RAM disk
2 changes: 1 addition & 1 deletion trunk/Documentation/lguest/lguest.c
@@ -882,7 +882,7 @@ static u32 handle_block_output(int fd, const struct iovec *iov,
* of the block file (possibly extending it). */
if (off + len > device_len) {
/* Trim it back to the correct length */
-ftruncate(dev->fd, device_len);
+ftruncate64(dev->fd, device_len);
/* Die, bad Guest, die. */
errx(1, "Write past end %llu+%u", off, len);
}
120 changes: 120 additions & 0 deletions trunk/Documentation/lockstat.txt
@@ -0,0 +1,120 @@

LOCK STATISTICS

- WHAT

As the name suggests, it provides statistics on locks.

- WHY

Because things like lock contention can severely impact performance.

- HOW

Lockdep already has hooks in the lock functions and maps lock instances to
lock classes. We build on that. The graph below shows the relation between
the lock functions and the various hooks therein.

        __acquire
            |
           lock _____
            |        \
            |    __contended
            |         |
            |       <wait>
            | _______/
            |/
            |
       __acquired
            |
            .
          <hold>
            .
            |
       __release
            |
         unlock

lock, unlock - the regular lock functions
__* - the hooks
<> - states

With these hooks we provide the following statistics:

 con-bounces       - number of lock contentions that involved cross-CPU data
 contentions       - number of lock acquisitions that had to wait
 wait time min     - shortest (non-0) time we ever had to wait for a lock
           max     - longest time we ever had to wait for a lock
           total   - total time we spent waiting on this lock
 acq-bounces       - number of lock acquisitions that involved cross-CPU data
 acquisitions      - number of times we took the lock
 hold time min     - shortest (non-0) time we ever held the lock
           max     - longest time we ever held the lock
           total   - total time this lock was held

From these numbers, various other statistics can be derived, such as:

hold time average = hold time total / acquisitions

These numbers are gathered per lock class, per read/write state (when
applicable).

It also tracks 4 contention points per class. A contention point is a call site
that had to wait on lock acquisition.
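
The derived statistic mentioned above (hold time average = hold time
total / acquisitions) is simple enough to sketch in user-space C. The
helper name lock_stat_average() is hypothetical. Applying the same
division to the wait time total and the contention count is an
assumption made here by analogy; the text above only names the hold
time average.

```c
/*
 * Hypothetical helper: average time per event, e.g.
 *   hold time average = hold time total / acquisitions
 * Returns 0.0 when there were no events, to avoid dividing by zero.
 */
static double lock_stat_average(double total_time, unsigned long long count)
{
	if (count == 0)
		return 0.0;
	return total_time / (double)count;
}
```

For example, a class with a holdtime-total of 77387.24 over 243371
acquisitions would be passed as lock_stat_average(77387.24, 243371).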

- USAGE

Look at the current lock statistics:

( line numbers not part of actual output, done for clarity in the explanation
below )

# less /proc/lock_stat

01 lock_stat version 0.2
02 -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
03 class name con-bounces contentions waittime-min waittime-max waittime-total acq-bounces acquisitions holdtime-min holdtime-max holdtime-total
04 -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
05
06 &inode->i_data.tree_lock-W: 15 21657 0.18 1093295.30 11547131054.85 58 10415 0.16 87.51 6387.60
07 &inode->i_data.tree_lock-R: 0 0 0.00 0.00 0.00 23302 231198 0.25 8.45 98023.38
08 --------------------------
09 &inode->i_data.tree_lock 0 [<ffffffff8027c08f>] add_to_page_cache+0x5f/0x190
10
11 ...............................................................................................................................................................................................
12
13 dcache_lock: 1037 1161 0.38 45.32 774.51 6611 243371 0.15 306.48 77387.24
14 -----------
15 dcache_lock 180 [<ffffffff802c0d7e>] sys_getcwd+0x11e/0x230
16 dcache_lock 165 [<ffffffff802c002a>] d_alloc+0x15a/0x210
17 dcache_lock 33 [<ffffffff8035818d>] _atomic_dec_and_lock+0x4d/0x70
18 dcache_lock 1 [<ffffffff802beef8>] shrink_dcache_parent+0x18/0x130

This excerpt shows the first two lock class statistics. Line 01 shows the
output version - each time the format changes this will be updated. Lines 02-04
show the header with column descriptions. Lines 05-10 and 13-18 show the actual
statistics. These statistics come in two parts: the actual stats, then the
contention points, with a short separator (lines 08 and 14) between them.

The first lock (05-10) is a read/write lock, and shows two lines above the
short separator. The contention points don't match the column descriptors;
they have two columns: contentions and [<IP>] symbol.
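
As a rough illustration of the layout just described, the following
user-space C sketch parses one per-class statistics line: a class name
ending in ':' followed by the ten numeric columns from the header. The
struct and function names are hypothetical, and the sketch assumes the
version 0.2 format shown above; header, separator and contention-point
lines are rejected.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for one per-class line of /proc/lock_stat. */
struct lock_stat_row {
	char name[64];
	unsigned long long con_bounces, contentions;
	double wait_min, wait_max, wait_total;
	unsigned long long acq_bounces, acquisitions;
	double hold_min, hold_max, hold_total;
};

/* Returns 0 on success, -1 if the line is not a per-class stats line. */
static int parse_lock_stat_row(const char *line, struct lock_stat_row *row)
{
	const char *start = line;
	const char *colon = strrchr(line, ':');
	size_t len;

	if (!colon)
		return -1;	/* header, separator or contention point */
	while (*start == ' ' || *start == '\t')
		start++;
	if (start >= colon)
		return -1;	/* empty class name */
	len = (size_t)(colon - start);
	if (len >= sizeof(row->name))
		return -1;	/* name too long for this sketch */
	memcpy(row->name, start, len);
	row->name[len] = '\0';

	/* The ten columns follow the colon, separated by whitespace. */
	if (sscanf(colon + 1, "%llu %llu %lf %lf %lf %llu %llu %lf %lf %lf",
		   &row->con_bounces, &row->contentions,
		   &row->wait_min, &row->wait_max, &row->wait_total,
		   &row->acq_bounces, &row->acquisitions,
		   &row->hold_min, &row->hold_max, &row->hold_total) != 10)
		return -1;
	return 0;
}
```

Keying on the ':' mirrors the "grep :" filter used in the next command,
since in the excerpt only the per-class lines contain a colon.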


View the top contending locks:

# grep : /proc/lock_stat | head
&inode->i_data.tree_lock-W: 15 21657 0.18 1093295.30 11547131054.85 58 10415 0.16 87.51 6387.60
&inode->i_data.tree_lock-R: 0 0 0.00 0.00 0.00 23302 231198 0.25 8.45 98023.38
dcache_lock: 1037 1161 0.38 45.32 774.51 6611 243371 0.15 306.48 77387.24
&inode->i_mutex: 161 286 18446744073709 62882.54 1244614.55 3653 20598 18446744073709 62318.60 1693822.74
&zone->lru_lock: 94 94 0.53 7.33 92.10 4366 32690 0.29 59.81 16350.06
&inode->i_data.i_mmap_lock: 79 79 0.40 3.77 53.03 11779 87755 0.28 116.93 29898.44
&q->__queue_lock: 48 50 0.52 31.62 86.31 774 13131 0.17 113.08 12277.52
&rq->rq_lock_key: 43 47 0.74 68.50 170.63 3706 33929 0.22 107.99 17460.62
&rq->rq_lock_key#2: 39 46 0.75 6.68 49.03 2979 32292 0.17 125.17 17137.63
tasklist_lock-W: 15 15 1.45 10.87 32.70 1201 7390 0.58 62.55 13648.47

Clear the statistics:

# echo 0 > /proc/lock_stat
2 changes: 1 addition & 1 deletion trunk/Documentation/sysrq.txt
@@ -43,7 +43,7 @@ On x86 - You press the key combo 'ALT-SysRq-<command key>'. Note - Some
keyboards may not have a key labeled 'SysRq'. The 'SysRq' key is
also known as the 'Print Screen' key. Also some keyboards cannot
handle so many keys being pressed at the same time, so you might
-have better luck with "press Alt", "press SysRq", "release Alt",
+have better luck with "press Alt", "press SysRq", "release SysRq",
"press <command key>", release everything.

On SPARC - You press 'ALT-STOP-<command key>', I believe.
2 changes: 1 addition & 1 deletion trunk/Makefile
@@ -1,7 +1,7 @@
VERSION = 2
PATCHLEVEL = 6
SUBLEVEL = 23
-EXTRAVERSION =-rc7
+EXTRAVERSION =
NAME = Arr Matey! A Hairy Bilge Rat!

# *DOCUMENTATION*
4 changes: 2 additions & 2 deletions trunk/arch/arm/kernel/bios32.c
@@ -338,7 +338,7 @@ pbus_assign_bus_resources(struct pci_bus *bus, struct pci_sys_data *root)
* pcibios_fixup_bus - Called after each bus is probed,
* but before its children are examined.
*/
-void __devinit pcibios_fixup_bus(struct pci_bus *bus)
+void pcibios_fixup_bus(struct pci_bus *bus)
{
struct pci_sys_data *root = bus->sysdata;
struct pci_dev *dev;
@@ -419,7 +419,7 @@ void __devinit pcibios_fixup_bus(struct pci_bus *bus)
/*
* Convert from Linux-centric to bus-centric addresses for bridge devices.
*/
-void __devinit
+void
pcibios_resource_to_bus(struct pci_dev *dev, struct pci_bus_region *region,
struct resource *res)
{