yaml
---
r: 101342
b: refs/heads/master
c: b9d2252
h: refs/heads/master
v: v3
Linus Torvalds committed Jul 15, 2008
1 parent d84a136 commit 2715150
Showing 369 changed files with 11,774 additions and 8,346 deletions.
2 changes: 1 addition & 1 deletion [refs]
@@ -1,2 +1,2 @@
---
refs/heads/master: e79aec291da55aa322ddb5d8f3bb04cdf69470d5
refs/heads/master: b9d2252c1e44fa83a4e65fdc9eb93db6297c55af
37 changes: 28 additions & 9 deletions trunk/Documentation/IRQ-affinity.txt
@@ -1,17 +1,26 @@
ChangeLog:
Started by Ingo Molnar <mingo@redhat.com>
Update by Max Krasnyansky <maxk@qualcomm.com>

SMP IRQ affinity, started by Ingo Molnar <mingo@redhat.com>

SMP IRQ affinity

/proc/irq/IRQ#/smp_affinity specifies which target CPUs are permitted
for a given IRQ source. It's a bitmask of allowed CPUs. It's not allowed
to turn off all CPUs, and if an IRQ controller does not support IRQ
affinity then the value will not change from the default 0xffffffff.

/proc/irq/default_smp_affinity specifies the default affinity mask that applies
to all non-active IRQs. Once an IRQ is allocated/activated, its affinity bitmask
will be set to the default mask. It can then be changed as described above.
The default mask is 0xffffffff.
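
The bitmask written to smp_affinity can be derived from a list of CPU
numbers: bit N of the mask corresponds to CPU N. The following is a minimal
sketch in Python, not part of the kernel documentation. The function names
are hypothetical, and the 8-digit zero padding simply mirrors the output
shown by "cat smp_affinity" in the example session.

```python
def cpus_to_affinity_mask(cpus, width=8):
    """Return the hex string for an smp_affinity mask allowing the given CPUs.

    Hypothetical helper: bit N of the mask is set when CPU N is allowed.
    """
    cpus = list(cpus)
    if not cpus:
        # The kernel does not allow turning off all CPUs.
        raise ValueError("at least one CPU must remain enabled")
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        mask |= 1 << cpu
    return format(mask, "0{}x".format(width))


def affinity_mask_to_cpus(mask_text):
    """Return the sorted list of CPU numbers enabled in a hex mask string."""
    # Masks wider than 32 bits are printed as comma-separated 32-bit groups.
    mask = int(mask_text.strip().replace(",", ""), 16)
    return [cpu for cpu in range(mask.bit_length()) if mask & (1 << cpu)]


print(cpus_to_affinity_mask(range(0, 4)))   # CPU0-3
print(cpus_to_affinity_mask(range(4, 8)))   # CPU4-7
print(affinity_mask_to_cpus("000000f0"))
```

The resulting string would be written to /proc/irq/IRQ#/smp_affinity as
root, in the same way as the echo commands in the session.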

Here is an example of restricting IRQ44 (eth1) to CPU0-3 then restricting
the IRQ to CPU4-7 (this is an 8-CPU SMP box):
it to CPU4-7 (this is an 8-CPU SMP box):

[root@moon 44]# cd /proc/irq/44
[root@moon 44]# cat smp_affinity
ffffffff

[root@moon 44]# echo 0f > smp_affinity
[root@moon 44]# cat smp_affinity
0000000f
@@ -21,17 +30,27 @@ PING hell (195.4.7.3): 56 data bytes
--- hell ping statistics ---
6029 packets transmitted, 6027 packets received, 0% packet loss
round-trip min/avg/max = 0.1/0.1/0.4 ms
[root@moon 44]# cat /proc/interrupts | grep 44:
44: 0 1785 1785 1783 1783 1
1 0 IO-APIC-level eth1
[root@moon 44]# cat /proc/interrupts | grep 'CPU\|44:'
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
44: 1068 1785 1785 1783 0 0 0 0 IO-APIC-level eth1

As can be seen from the line above IRQ44 was delivered only to the first four
processors (0-3).
Now let's restrict that IRQ to CPU4-7.

[root@moon 44]# echo f0 > smp_affinity
[root@moon 44]# cat smp_affinity
000000f0
[root@moon 44]# ping -f h
PING hell (195.4.7.3): 56 data bytes
..
--- hell ping statistics ---
2779 packets transmitted, 2777 packets received, 0% packet loss
round-trip min/avg/max = 0.1/0.5/585.4 ms
[root@moon 44]# cat /proc/interrupts | grep 44:
44: 1068 1785 1785 1784 1784 1069 1070 1069 IO-APIC-level eth1
[root@moon 44]#
[root@moon 44]# cat /proc/interrupts | grep 'CPU\|44:'
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
44: 1068 1785 1785 1783 1784 1069 1070 1069 IO-APIC-level eth1

This time around IRQ44 was delivered only to the last four processors,
i.e. the counters for CPU0-3 did not change.
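
The comparison made above by eye (which per-CPU counters moved between two
reads of /proc/interrupts) can also be done mechanically. The following is a
minimal sketch in Python, not part of the kernel documentation. The function
names are hypothetical, and the input lines are the two "44:" lines from the
session.

```python
def parse_irq_counts(header_line, irq_line):
    """Split one /proc/interrupts row into (irq, {cpu_name: count})."""
    cpus = header_line.split()
    fields = irq_line.split()
    irq = fields[0].rstrip(":")
    # The first len(cpus) fields after the IRQ number are per-CPU counters;
    # the remaining fields name the controller and the device.
    counts = [int(value) for value in fields[1:1 + len(cpus)]]
    return irq, dict(zip(cpus, counts))


def cpus_that_received(before, after):
    """Return the CPUs whose counter increased between two snapshots."""
    return [cpu for cpu in before if after[cpu] > before[cpu]]


header = "CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7"
first = "44: 1068 1785 1785 1783 0 0 0 0 IO-APIC-level eth1"
second = "44: 1068 1785 1785 1783 1784 1069 1070 1069 IO-APIC-level eth1"

_, before = parse_irq_counts(header, first)
_, after = parse_irq_counts(header, second)
print(cpus_that_received(before, after))
```

With the two snapshots from the session, only CPU4-7 are reported, which
matches the f0 mask written to smp_affinity.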

26 changes: 9 additions & 17 deletions trunk/Documentation/cputopology.txt
@@ -14,9 +14,8 @@ represent the thread siblings to cpu X in the same physical package;
To implement it in an architecture-neutral way, a new source file,
drivers/base/topology.c, is to export the 4 attributes.

If one architecture wants to support this feature, it just needs to
implement 4 defines, typically in file include/asm-XXX/topology.h.
The 4 defines are:
For an architecture to support this feature, it must define some of
these macros in include/asm-XXX/topology.h:
#define topology_physical_package_id(cpu)
#define topology_core_id(cpu)
#define topology_thread_siblings(cpu)
@@ -25,17 +24,10 @@ The 4 defines are:
The type of **_id is int.
The type of siblings is cpumask_t.

To be consistent on all architectures, the 4 attributes should have
default values if their values are unavailable. Below is the rule.
1) physical_package_id: If cpu has no physical package id, -1 is the
default value.
2) core_id: If cpu doesn't support multi-core, its core id is 0.
3) thread_siblings: Just include itself, if the cpu doesn't support
HT/multi-thread.
4) core_siblings: Just include itself, if the cpu doesn't support
multi-core and HT/Multi-thread.

So be careful when declaring the 4 defines in include/asm-XXX/topology.h.

If an attribute isn't defined on an architecture, it won't be exported.

To be consistent on all architectures, include/linux/topology.h
provides default definitions for any of the above macros that are
not defined by include/asm-XXX/topology.h:
1) physical_package_id: -1
2) core_id: 0
3) thread_siblings: just the given CPU
4) core_siblings: just the given CPU
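
The fallback rule above can be modelled as a small lookup: any attribute the
architecture does not supply takes the generic default. The following is a
minimal sketch in Python, not kernel code. The function name and the
dictionary representation are hypothetical, and a Python set stands in for
cpumask_t.

```python
def topology_with_defaults(cpu, arch_values=None):
    """Return the four topology attributes for a CPU.

    arch_values is a hypothetical stand-in for what an architecture's
    topology.h defines; anything missing falls back to the generic default.
    """
    attributes = {
        "physical_package_id": -1,   # no physical package id known
        "core_id": 0,                # no multi-core information
        "thread_siblings": {cpu},    # just the given CPU
        "core_siblings": {cpu},      # just the given CPU
    }
    attributes.update(arch_values or {})
    return attributes


# An architecture that only reports core ids:
print(topology_with_defaults(3, {"core_id": 1}))
```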
7 changes: 0 additions & 7 deletions trunk/Documentation/feature-removal-schedule.txt
@@ -222,13 +222,6 @@ Who: Thomas Gleixner <tglx@linutronix.de>

---------------------------

What: i2c-i810, i2c-prosavage and i2c-savage4
When: May 2008
Why: These drivers are superseded by i810fb, intelfb and savagefb.
Who: Jean Delvare <khali@linux-fr.org>

---------------------------

What (Why):
- include/linux/netfilter_ipv4/ipt_TOS.h ipt_tos.h header files
(superseded by xt_TOS/xt_tos target & match)
125 changes: 75 additions & 50 deletions trunk/Documentation/filesystems/ext4.txt
@@ -13,72 +13,93 @@ Mailing list: linux-ext4@vger.kernel.org
1. Quick usage instructions:
===========================

- Grab updated e2fsprogs from
ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs-interim/
This is a patchset on top of e2fsprogs-1.39, which can be found at
- Compile and install the latest version of e2fsprogs (as of this
writing version 1.41) from:

http://sourceforge.net/project/showfiles.php?group_id=2406

or

ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs/

- It's still mke2fs -j /dev/hda1
or grab the latest git repository from:

git://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git

- Create a new filesystem using the ext4dev filesystem type:

# mke2fs -t ext4dev /dev/hda1

Or configure an existing ext3 filesystem to support extents and set
the test_fs flag to indicate that it's ok for an in-development
filesystem to touch this filesystem:

- mount /dev/hda1 /wherever -t ext4dev
# tune2fs -O extents -E test_fs /dev/hda1

- To enable extents,
If the filesystem was created with 128 byte inodes, it can be
converted to use 256 byte inodes for greater efficiency via:

mount /dev/hda1 /wherever -t ext4dev -o extents
# tune2fs -I 256 /dev/hda1

- The filesystem is compatible with the ext3 driver until you add a file
which has extents (ie: `mount -o extents', then create a file).
(Note: we currently do not have tools to convert an ext4dev
filesystem back to ext3, so please do not try this on production
filesystems.)

NOTE: The "extents" mount flag is temporary. It will soon go away and
extents will be enabled by the "-o extents" flag to mke2fs or tune2fs
- Mounting:

# mount -t ext4dev /dev/hda1 /wherever

- When comparing performance with other filesystems, remember that
ext3/4 by default offers higher data integrity guarantees than most. So
when comparing with a metadata-only journalling filesystem, use `mount -o
data=writeback'. And you might as well use `mount -o nobh' too along
with it. Making the journal larger than the mke2fs default often helps
performance with metadata-intensive workloads.
ext3/4 by default offers higher data integrity guarantees than most.
So when comparing with a metadata-only journalling filesystem, such
as ext3, use `mount -o data=writeback'. And you might as well use
`mount -o nobh' too along with it. Making the journal larger than
the mke2fs default often helps performance with metadata-intensive
workloads.

2. Features
===========

2.1 Currently available

* ability to use filesystems > 16TB
* ability to use filesystems > 16TB (e2fsprogs support not available yet)
* extent format reduces metadata overhead (RAM, IO for access, transactions)
* extent format more robust in face of on-disk corruption due to magics,
  internal redundancy in tree

2.1 Previously available, soon to be enabled by default by "mkfs.ext4":

* dir_index and resize inode will be on by default
* large inodes will be used by default for fast EAs, nsec timestamps, etc
* improved file allocation (multi-block alloc)
* fix 32000 subdirectory limit
* nsec timestamps for mtime, atime, ctime, create time
* inode version field on disk (NFSv4, Lustre)
* reduced e2fsck time via uninit_bg feature
* journal checksumming for robustness, performance
* persistent file preallocation (e.g. for streaming media, databases)
* ability to pack bitmaps and inode tables into larger virtual groups via the
flex_bg feature
* large file support
* Inode allocation using large virtual block groups via flex_bg
* delayed allocation
* large block (up to pagesize) support
* efficient new ordered mode in JBD2 and ext4 (avoid using buffer head to force
the ordering)

2.2 Candidate features for future inclusion

There are several under discussion, whether they all make it in is
partly a function of how much time everyone has to work on them:
* Online defrag (patches available but not well tested)
* reduced mke2fs time via lazy itable initialization in conjunction with
the uninit_bg feature (capability to do this is available in e2fsprogs
but a kernel thread to do lazy zeroing of unused inode table blocks
after filesystem is first mounted is required for safety)

* improved file allocation (multi-block alloc, delayed alloc; basically done)
* fix 32000 subdirectory limit (patch exists, needs some e2fsck work)
* nsec timestamps for mtime, atime, ctime, create time (patch exists,
needs some e2fsck work)
* inode version field on disk (NFSv4, Lustre; prototype exists)
* reduced mke2fs/e2fsck time via uninitialized groups (prototype exists)
* journal checksumming for robustness, performance (prototype exists)
* persistent file preallocation (e.g. for streaming media, databases)
There are several others under discussion, whether they all make it in is
partly a function of how much time everyone has to work on them. Features like
metadata checksumming have been discussed and planned for a bit but no patches
exist yet so I'm not sure they're in the near-term roadmap.

Features like metadata checksumming have been discussed and planned for
a bit but no patches exist yet so I'm not sure they're in the near-term
roadmap.
The big performance win will come with mballoc, delalloc and flex_bg
grouping of bitmaps and inode tables. Some test results available here:

The big performance win will come with mballoc and delalloc. CFS has
been using mballoc for a few years already with Lustre, and IBM + Bull
did a lot of benchmarking on it. The reason it isn't in the first set of
patches is partly a manageability issue, and partly because it doesn't
directly affect the on-disk format (outside of much better allocation)
so it isn't critical to get into the first round of changes. I believe
Alex is working on a new set of patches right now.
- http://www.bullopensource.org/ext4/20080530/ffsb-write-2.6.26-rc2.html
- http://www.bullopensource.org/ext4/20080530/ffsb-readwrite-2.6.26-rc2.html

3. Options
==========
@@ -222,9 +243,11 @@ stripe=n Number of filesystem blocks that mballoc will try
to use for allocation size and alignment. For RAID5/6
systems this should be the number of data
disks * RAID chunk size in file system blocks.

delalloc (*) Deferring block allocation until write-out time.
nodelalloc Disable delayed allocation. Blocks are allocated
when data is copied from user to page cache.
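
The stripe=n value described above (number of data disks * RAID chunk size
in filesystem blocks) can be worked out as follows. This is a minimal sketch
in Python, not part of the kernel documentation. The function name is
hypothetical, and the 4096-byte default block size is an assumption that
should be replaced by the actual block size of the filesystem.

```python
def stripe_in_fs_blocks(data_disks, chunk_size_bytes, fs_block_size_bytes=4096):
    """Return a value for stripe=n: data disks * chunk size in fs blocks."""
    if data_disks < 1:
        raise ValueError("need at least one data disk")
    if chunk_size_bytes % fs_block_size_bytes:
        raise ValueError("chunk size must be a multiple of the block size")
    chunk_in_blocks = chunk_size_bytes // fs_block_size_bytes
    return data_disks * chunk_in_blocks


# RAID5 over 4 disks has 3 data disks; 64 KiB chunks, 4 KiB blocks.
print(stripe_in_fs_blocks(3, 64 * 1024))
# RAID6 over 6 disks has 4 data disks; 128 KiB chunks, 4 KiB blocks.
print(stripe_in_fs_blocks(4, 128 * 1024))
```

Parity disks are not counted: a RAID5 array has one fewer data disk than
member disks, and a RAID6 array has two fewer.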
Data Mode
---------
=========
There are 3 different data modes:

* writeback mode
@@ -236,18 +259,19 @@ typically provide the best ext4 performance.

* ordered mode
In data=ordered mode, ext4 only officially journals metadata, but it logically
groups metadata and data blocks into a single unit called a transaction. When
it's time to write the new metadata out to disk, the associated data blocks
are written first. In general, this mode performs slightly slower than
writeback but significantly faster than journal mode.
groups metadata information related to data changes with the data blocks into a
single unit called a transaction. When it's time to write the new metadata
out to disk, the associated data blocks are written first. In general,
this mode performs slightly slower than writeback but significantly faster than journal mode.

* journal mode
data=journal mode provides full data and metadata journaling. All new data is
written to the journal first, and then to its final location.
In the event of a crash, the journal can be replayed, bringing both data and
metadata into a consistent state. This mode is the slowest except when data
needs to be read from and written to disk at the same time where it
outperforms all other modes.
outperforms all other modes. Currently ext4 does not have delayed
allocation support if this data journalling mode is selected.

References
==========
@@ -256,7 +280,8 @@ kernel source: <file:fs/ext4/>
<file:fs/jbd2/>

programs: http://e2fsprogs.sourceforge.net/
http://ext2resize.sourceforge.net

useful links: http://fedoraproject.org/wiki/ext3-devel
http://www.bullopensource.org/ext4/
http://ext4.wiki.kernel.org/index.php/Main_Page
http://fedoraproject.org/wiki/Features/Ext4