Skip to content

Commit

Permalink
---
Browse files Browse the repository at this point in the history
yaml
---
r: 142026
b: refs/heads/master
c: 0221c81
h: refs/heads/master
v: v3
  • Loading branch information
Linus Torvalds committed Apr 5, 2009
1 parent 0e8438e commit b0725f7
Show file tree
Hide file tree
Showing 1,201 changed files with 340,170 additions and 74,009 deletions.
2 changes: 1 addition & 1 deletion [refs]
Original file line number Diff line number Diff line change
@@ -1,2 +1,2 @@
---
refs/heads/master: def57543418a5f47debae28a0a9dea2effc11692
refs/heads/master: 0221c81b1b8eb0cbb6b30a0ced52ead32d2b4e4c
71 changes: 71 additions & 0 deletions trunk/Documentation/ABI/testing/debugfs-kmemtrace
Original file line number Diff line number Diff line change
@@ -0,0 +1,71 @@
What: /sys/kernel/debug/kmemtrace/
Date: July 2008
Contact: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Description:

In kmemtrace-enabled kernels, the following files are created:

/sys/kernel/debug/kmemtrace/
cpu<n> (0400) Per-CPU tracing data, see below. (binary)
total_overruns (0400) Total number of bytes which were dropped from
cpu<n> files because of full buffer condition,
non-binary. (text)
abi_version (0400) Kernel's kmemtrace ABI version. (text)

Each per-CPU file should be read according to the relay interface. That is,
the reader should set affinity to that specific CPU and, as currently done by
the userspace application (though there are other methods), use poll() with
an infinite timeout before every read(). Otherwise, erroneous data may be
read. The binary data has the following _core_ format:

Event ID (1 byte) Unsigned integer, one of:
0 - represents an allocation (KMEMTRACE_EVENT_ALLOC)
1 - represents a freeing of previously allocated memory
(KMEMTRACE_EVENT_FREE)
Type ID (1 byte) Unsigned integer, one of:
0 - this is a kmalloc() / kfree()
1 - this is a kmem_cache_alloc() / kmem_cache_free()
2 - this is a __get_free_pages() et al.
Event size (2 bytes) Unsigned integer representing the
size of this event. Used to extend
kmemtrace. Discard the bytes you
don't know about.
Sequence number (4 bytes) Signed integer used to reorder data
logged on SMP machines. Wraparound
must be taken into account, although
it is unlikely.
Caller address (8 bytes) Return address to the caller.
Pointer to mem (8 bytes) Pointer to target memory area. Can be
NULL, but not all such calls might be
recorded.

In case of KMEMTRACE_EVENT_ALLOC events, the next fields follow:

Requested bytes (8 bytes) Total number of requested bytes,
unsigned, must not be zero.
Allocated bytes (8 bytes) Total number of actually allocated
bytes, unsigned, must not be lower
than requested bytes.
Requested flags (4 bytes) GFP flags supplied by the caller.
Target CPU (4 bytes) Signed integer, valid for event id 1.
If equal to -1, target CPU is the same
as origin CPU, but the reverse might
not be true.

The data is made available in the same endianness the machine has.

Other event ids and type ids may be defined and added. Other fields may be
added by increasing event size, but see below for details.
Every modification to the ABI, including new id definitions, are followed
by bumping the ABI version by one.

Adding new data to the packet (features) is done at the end of the mandatory
data:
Feature size (2 byte)
Feature ID (1 byte)
Feature data (Feature size - 3 bytes)


Users:
kmemtrace-user - git://repo.or.cz/kmemtrace-user.git

70 changes: 70 additions & 0 deletions trunk/Documentation/filesystems/pohmelfs/design_notes.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,70 @@
POHMELFS: Parallel Optimized Host Message Exchange Layered File System.

Evgeniy Polyakov <zbr@ioremap.net>

Homepage: http://www.ioremap.net/projects/pohmelfs

POHMELFS first began as a network filesystem with coherent local data and
metadata caches but is now evolving into a parallel distributed filesystem.

Main features of this FS include:
* Locally coherent cache for data and metadata with (potentially) byte-range locks.
Since all Linux filesystems lock the whole inode during writing, algorithm
is very simple and does not use byte-ranges, although they are sent in
locking messages.
* Completely async processing of all events except creation of hard and symbolic
links, and rename events.
Object creation and data reading and writing are processed asynchronously.
* Flexible object architecture optimized for network processing.
Ability to create long paths to objects and remove arbitrarily huge
directories with a single network command.
(like removing the whole kernel tree via a single network command).
* Very high performance.
* Fast and scalable multithreaded userspace server. Being in userspace it works
with any underlying filesystem and still is much faster than async in-kernel NFS one.
* Client is able to switch between different servers (if one goes down, client
automatically reconnects to second and so on).
* Transactions support. Full failover for all operations.
Resending transactions to different servers on timeout or error.
* Read request (data read, directory listing, lookup requests) balancing between multiple servers.
* Write requests are replicated to multiple servers and completed only when all of them are acked.
* Ability to add and/or remove servers from the working set at run-time.
* Strong authentification and possible data encryption in network channel.
* Extended attributes support.

POHMELFS is based on transactions, which are potentially long-standing objects that live
in the client's memory. Each transaction contains all the information needed to process a given
command (or set of commands, which is frequently used during data writing: single transactions
can contain creation and data writing commands). Transactions are committed by all the servers
to which they are sent and, in case of failures, are eventually resent or dropped with an error.
For example, reading will return an error if no servers are available.

POHMELFS uses a asynchronous approach to data processing. Courtesy of transactions, it is
possible to detach replies from requests and, if the command requires data to be received, the
caller sleeps waiting for it. Thus, it is possible to issue multiple read commands to different
servers and async threads will pick up replies in parallel, find appropriate transactions in the
system and put the data where it belongs (like the page or inode cache).

The main feature of POHMELFS is writeback data and the metadata cache.
Only a few non-performance critical operations use the write-through cache and
are synchronous: hard and symbolic link creation, and object rename. Creation,
removal of objects and data writing are asynchronous and are sent to
the server during system writeback. Only one writer at a time is allowed for any
given inode, which is guarded by an appropriate locking protocol.
Because of this feature, POHMELFS is extremely fast at metadata intensive
workloads and can fully utilize the bandwidth to the servers when doing bulk
data transfers.

POHMELFS clients operate with a working set of servers and are capable of balancing read-only
operations (like lookups or directory listings) between them.
Administrators can add or remove servers from the set at run-time via special commands (described
in Documentation/pohmelfs/info.txt file). Writes are replicated to all servers.

POHMELFS is capable of full data channel encryption and/or strong crypto hashing.
One can select any kernel supported cipher, encryption mode, hash type and operation mode
(hmac or digest). It is also possible to use both or neither (default). Crypto configuration
is checked during mount time and, if the server does not support it, appropriate capabilities
will be disabled or mount will fail (if 'crypto_fail_unsupported' mount option is specified).
Crypto performance heavily depends on the number of crypto threads, which asynchronously perform
crypto operations and send the resulting data to server or submit it up the stack. This number
can be controlled via a mount option.
86 changes: 86 additions & 0 deletions trunk/Documentation/filesystems/pohmelfs/info.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,86 @@
POHMELFS usage information.

Mount options:
idx=%u
Each mountpoint is associated with a special index via this option.
Administrator can add or remove servers from the given index, so all mounts,
which were attached to it, are updated.
Default it is 0.

trans_scan_timeout=%u
This timeout, expressed in milliseconds, specifies time to scan transaction
trees looking for stale requests, which have to be resent, or if number of
retries exceed specified limit, dropped with error.
Default is 5 seconds.

drop_scan_timeout=%u
Internal timeout, expressed in milliseconds, which specifies how frequently
inodes marked to be dropped are freed. It also specifies how frequently
the system checks that servers have to be added or removed from current working set.
Default is 1 second.

wait_on_page_timeout=%u
Number of milliseconds to wait for reply from remote server for data reading command.
If this timeout is exceeded, reading returns an error.
Default is 5 seconds.

trans_retries=%u
This is the number of times that a transaction will be resent to a server that did
not answer for the last @trans_scan_timeout milliseconds.
When the number of resends exceeds this limit, the transaction is completed with error.
Default is 5 resends.

crypto_thread_num=%u
Number of crypto processing threads. Threads are used both for RX and TX traffic.
Default is 2, or no threads if crypto operations are not supported.

trans_max_pages=%u
Maximum number of pages in a single transaction. This parameter also controls
the number of pages, allocated for crypto processing (each crypto thread has
pool of pages, the number of which is equal to 'trans_max_pages'.
Default is 100 pages.

crypto_fail_unsupported
If specified, mount will fail if the server does not support requested crypto operations.
By default mount will disable non-matching crypto operations.

mcache_timeout=%u
Maximum number of milliseconds to wait for the mcache objects to be processed.
Mcache includes locks (given lock should be granted by server), attributes (they should be
fully received in the given timeframe).
Default is 5 seconds.

Usage examples.

Add (or remove if it already exists) server server1.net:1025 into the working set with index $idx
with appropriate hash algorithm and key file and cipher algorithm, mode and key file:
$cfg -a server1.net -p 1025 -i $idx -K $hash_key -k $cipher_key

Mount filesystem with given index $idx to /mnt mountpoint.
Client will connect to all servers specified in the working set via previous command:
mount -t pohmel -o idx=$idx q /mnt

One can add or remove servers from working set after mounting too.


Server installation.

Creating a server, which listens at port 1025 and 0.0.0.0 address.
Working root directory (note, that server chroots there, so you have to have appropriate permissions)
is set to /mnt, server will negotiate hash/cipher with client, in case client requested it, there
are appropriate key files.
Number of working threads is set to 10.

# ./fserver -a 0.0.0.0 -p 1025 -r /mnt -w 10 -K hash_key -k cipher_key

-A 6 - listen on ipv6 address. Default: Disabled.
-r root - path to root directory. Default: /tmp.
-a addr - listen address. Default: 0.0.0.0.
-p port - listen port. Default: 1025.
-w workers - number of workers per connected client. Default: 1.
-K file - hash key size. Default: none.
-k file - cipher key size. Default: none.
-h - this help.

Number of worker threads specifies how many workers will be created for each client.
Bulk single-client transafers usually are better handled with smaller number (like 1-3).
Loading

0 comments on commit b0725f7

Please sign in to comment.