---
yaml
---
r: 63810
b: refs/heads/master
c: e7bc15a
h: refs/heads/master
v: v3
Linus Torvalds committed Aug 9, 2007
1 parent 21c5d11 commit b8526c0
Showing 36 changed files with 552 additions and 511 deletions.
2 changes: 1 addition & 1 deletion [refs]
@@ -1,2 +1,2 @@
---
-refs/heads/master: b434e71933aa0519ee042c01419db76b7dcc058e
+refs/heads/master: e7bc15a9ad07269c48732e9a9874e5938b4700ee
4 changes: 3 additions & 1 deletion trunk/Documentation/lguest/Makefile
@@ -13,7 +13,9 @@ LGUEST_GUEST_TOP := ($(CONFIG_PAGE_OFFSET) - 0x08000000)

CFLAGS:=-Wall -Wmissing-declarations -Wmissing-prototypes -O3 -Wl,-T,lguest.lds
LDLIBS:=-lz
-
+# Removing this works for some versions of ld.so (eg. Ubuntu Feisty) and
+# not others (eg. FC7).
+LDFLAGS+=-static
all: lguest.lds lguest

# The linker script on x86 is so complex the only way of creating one
2 changes: 1 addition & 1 deletion trunk/Documentation/sched-design-CFS.txt
@@ -83,7 +83,7 @@ Some implementation details:
CFS uses nanosecond granularity accounting and does not rely on any
jiffies or other HZ detail. Thus the CFS scheduler has no notion of
'timeslices' and has no heuristics whatsoever. There is only one
-central tunable:
+central tunable (you have to switch on CONFIG_SCHED_DEBUG):

/proc/sys/kernel/sched_granularity_ns
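As a quick check, the tunable can be read from userspace; a minimal sketch, assuming a kernel of this era built with CONFIG_SCHED_DEBUG so that the file exists (the value is in nanoseconds):

    #include <stdio.h>

    int main(void)
    {
        unsigned long long ns;
        FILE *f = fopen("/proc/sys/kernel/sched_granularity_ns", "r");

        if (!f) {
            perror("sched_granularity_ns");  /* missing without CONFIG_SCHED_DEBUG */
            return 1;
        }
        if (fscanf(f, "%llu", &ns) == 1)
            printf("granularity: %llu ns (%.1f ms)\n", ns, ns / 1e6);
        fclose(f);
        return 0;
    }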

108 changes: 108 additions & 0 deletions trunk/Documentation/sched-nice-design.txt
@@ -0,0 +1,108 @@
This document explains the thinking about the revamped and streamlined
nice-levels implementation in the new Linux scheduler.

Nice levels were always pretty weak under Linux and people continuously
pestered us to make nice +19 tasks use up much less CPU time.

Unfortunately that was not that easy to implement under the old
scheduler (otherwise we'd have done it long ago), because nice level
support was historically coupled to timeslice length, and timeslice
units were driven by the HZ tick, so the smallest timeslice was 1/HZ.

In the O(1) scheduler (in 2003) we changed negative nice levels to be
much stronger than they were before in 2.4 (and people were happy about
that change), and we also intentionally calibrated the linear timeslice
rule so that nice +19 level would be _exactly_ 1 jiffy. To better
understand it, the timeslice graph went like this (cheesy ASCII art
alert!):


   A
    \     | [timeslice length]
     \    |
      \   |
       \  |
        \ |
         \|___100msecs
          |^ . _
          |      ^ . _
          |            ^ . _
 -*----------------------------------*-----> [nice level]
-20       |                         +19
          |
          |

So if someone really wanted to renice tasks, +19 would give a much
bigger hit than the normal linear rule would. (The solution of
changing the ABI to extend priorities was discarded early on.)

This approach worked to some degree for some time, but later on with
HZ=1000 it caused 1 jiffy to be 1 msec, which meant 0.1% CPU usage,
and that felt a bit excessive. Excessive _not_ because it's too small
a CPU utilization, but because it causes too frequent (once per
millisecond) rescheduling, and would thus trash the cache, etc.
(Remember, this was long ago, when hardware was weaker and caches were
smaller, and people were running number-crunching apps at nice +19.)

So for HZ=1000 we changed nice +19 to 5 msecs, because that felt like
the right minimal granularity, and this translates to 5% CPU
utilization. But the fundamental HZ-sensitive property for nice +19
still remained, and we never got a single complaint about nice +19
being too _weak_ in terms of CPU utilization; we only got complaints
about it (still) being too _strong_ :-)

To sum it up: we always wanted to make nice levels more consistent,
but within the constraints of HZ and jiffies, and their nasty
design-level coupling to timeslices and granularity, it was not really
viable.

The second (less frequent but still periodically occurring) complaint
about Linux's nice level support was its asymmetry around the origin
(which you can see demonstrated in the picture above), or more
accurately: the fact that nice level behavior depended on the
_absolute_ nice level as well, while the nice API itself is
fundamentally "relative":

int nice(int inc);

asmlinkage long sys_nice(int increment)

(the first one is the glibc API, the second one is the syscall API.)
Note that the 'inc' is relative to the current nice level. Tools like
bash's "nice" command mirror this relative API.

With the old scheduler, if you for example started a niced task with +1
and another task with +2, the CPU split between the two tasks would
depend on the nice level of the parent shell: if it was at nice -10 the
CPU split was different than if it was at +5 or +10.

A third complaint against Linux's nice level support was that negative
nice levels were not 'punchy enough', so lots of people had to resort to
running audio (and other multimedia) apps under RT priorities such as
SCHED_FIFO. But this caused other problems: SCHED_FIFO is not
starvation-proof, and a buggy SCHED_FIFO app can also lock up the
system for good.

The new scheduler in v2.6.23 addresses all three types of complaints:

To address the first complaint (of nice levels not being "punchy"
enough), the scheduler was decoupled from 'time slice' and HZ concepts
(and granularity was made a separate concept from nice levels), and
thus it was possible to implement better and more consistent nice +19
support: with the new scheduler, nice +19 tasks get a HZ-independent
1.5%, instead of the variable 3%-5%-9% range they got in the old
scheduler.

To address the second complaint (of nice levels not being consistent),
the new scheduler makes nice(1) have the same CPU utilization effect on
tasks, regardless of their absolute nice levels. So on the new
scheduler, running a nice +10 and a nice +11 task has the same CPU
utilization "split" between them as running a nice -5 and a nice -4
task. (one will get 55% of the CPU, the other 45%.) That is why nice
levels were changed to be "multiplicative" (or exponential): that way
it does not matter which nice level you start out from, the 'relative
result' will always be the same.
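A back-of-the-envelope sketch of what "multiplicative" means. The real
weight table lives inside the scheduler; this assumes only the commonly
cited ~1.25x weight step per nice level, and shows that two tasks one
nice level apart split the CPU about 55%/45% no matter where they sit:

    #include <math.h>
    #include <stdio.h>

    /* Assumed model: each nice level costs a factor of ~1.25 in weight. */
    static double weight(int nice)
    {
        return pow(1.25, -nice);    /* lower nice => heavier weight */
    }

    int main(void)
    {
        int pairs[][2] = { { -5, -4 }, { 0, 1 }, { 10, 11 } };

        for (int i = 0; i < 3; i++) {
            double a = weight(pairs[i][0]), b = weight(pairs[i][1]);
            printf("nice %+3d vs %+3d: %.1f%% / %.1f%%\n",
                   pairs[i][0], pairs[i][1],
                   100 * a / (a + b), 100 * b / (a + b));
        }
        return 0;                    /* each pair prints ~55.6% / 44.4% */
    }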

The third complaint (of negative nice levels not being "punchy" enough
and forcing audio apps to run under the more dangerous SCHED_FIFO
scheduling policy) is addressed by the new scheduler almost
automatically: stronger negative nice levels fall out naturally from
the recalibrated dynamic range of nice levels.
5 changes: 5 additions & 0 deletions trunk/drivers/lguest/core.c
@@ -453,6 +453,11 @@ static void run_guest_once(struct lguest *lg, struct lguest_pages *pages)
	 * lguest_pages". */
	copy_in_guest_info(lg, pages);

+	/* Set the trap number to 256 (impossible value). If we fault while
+	 * switching to the Guest (bad segment registers or bug), this will
+	 * cause us to abort the Guest. */
+	lg->regs->trapnum = 256;
+
	/* Now: we push the "eflags" register on the stack, then do an "lcall".
	 * This is how we change from using the kernel code segment to using
	 * the dedicated lguest code segment, as well as jumping into the
9 changes: 6 additions & 3 deletions trunk/drivers/lguest/interrupts_and_traps.c
@@ -195,13 +195,16 @@ static int has_err(unsigned int trap)
/* deliver_trap() returns true if it could deliver the trap. */
int deliver_trap(struct lguest *lg, unsigned int num)
{
-	u32 lo = lg->idt[num].a, hi = lg->idt[num].b;
+	/* Trap numbers are always 8 bit, but we set an impossible trap number
+	 * for traps inside the Switcher, so check that here. */
+	if (num >= ARRAY_SIZE(lg->idt))
+		return 0;

	/* Early on the Guest hasn't set the IDT entries (or maybe it put a
	 * bogus one in): if we fail here, the Guest will be killed. */
-	if (!idt_present(lo, hi))
+	if (!idt_present(lg->idt[num].a, lg->idt[num].b))
		return 0;
-	set_guest_interrupt(lg, lo, hi, has_err(num));
+	set_guest_interrupt(lg, lg->idt[num].a, lg->idt[num].b, has_err(num));
	return 1;
}

9 changes: 6 additions & 3 deletions trunk/drivers/lguest/lguest.c
@@ -323,9 +323,12 @@ static void lguest_write_gdt_entry(struct desc_struct *dt,
 * __thread variables). So we have a hypercall specifically for this case. */
static void lguest_load_tls(struct thread_struct *t, unsigned int cpu)
{
+	/* There's one problem which normal hardware doesn't have: the Host
+	 * can't handle us removing entries we're currently using. So we clear
+	 * the GS register here: if it's needed it'll be reloaded anyway. */
+	loadsegment(gs, 0);
	lazy_hcall(LHCALL_LOAD_TLS, __pa(&t->tls_array), cpu, 0);
}
/*:*/

/*G:038 That's enough excitement for now, back to ploughing through each of
 * the paravirt_ops (we're about 1/3 of the way through).
@@ -687,7 +690,8 @@ static struct clocksource lguest_clock = {
	.rating		= 400,
	.read		= lguest_clock_read,
	.mask		= CLOCKSOURCE_MASK(64),
-	.mult		= 1,
+	.mult		= 1 << 22,
+	.shift		= 22,
};

/* The "scheduler clock" is just our real clock, adjusted to start at zero */
@@ -770,7 +774,6 @@ static void lguest_time_init(void)
	 * way, the "rating" is initialized so high that it's always chosen
	 * over any other clocksource. */
	if (lguest_data.tsc_khz) {
-		lguest_clock.shift = 22;
		lguest_clock.mult = clocksource_khz2mult(lguest_data.tsc_khz,
							 lguest_clock.shift);
		lguest_clock.flags = CLOCK_SOURCE_IS_CONTINUOUS;
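The mult/shift pair above follows the standard clocksource conversion,
ns = (cycles * mult) >> shift: with mult = 1 << 22 and shift = 22 the
fallback clock maps one "cycle" to exactly one nanosecond, and
lguest_time_init() only has to overwrite .mult via clocksource_khz2mult()
when a real TSC frequency is known. A small sketch of the arithmetic
(the helper is illustrative, not kernel code):

    #include <stdint.h>
    #include <stdio.h>

    /* Mimics how a clocksource turns raw counter cycles into nanoseconds. */
    static uint64_t cyc2ns(uint64_t cycles, uint32_t mult, uint32_t shift)
    {
        return (cycles * mult) >> shift;
    }

    int main(void)
    {
        /* Identity mapping: 1000 cycles -> 1000 ns. */
        printf("%llu\n", (unsigned long long)cyc2ns(1000, 1 << 22, 22));
        return 0;
    }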
62 changes: 5 additions & 57 deletions trunk/drivers/lguest/segments.c
@@ -43,22 +43,6 @@
* begin.
*/

-/* Is the descriptor the Guest wants us to put in OK?
- *
- * The flag which Intel says must be zero: must be zero. The descriptor must
- * be present (this is actually checked earlier but is here for thoroughness),
- * and the descriptor type must be 1 (a memory segment). */
-static int desc_ok(const struct desc_struct *gdt)
-{
-	return ((gdt->b & 0x00209000) == 0x00009000);
-}
-
-/* Is the segment present? (Otherwise it can't be used by the Guest). */
-static int segment_present(const struct desc_struct *gdt)
-{
-	return gdt->b & 0x8000;
-}
-
/* There are several entries we don't let the Guest set. The TSS entry is the
* "Task State Segment" which controls all kinds of delicate things. The
* LGUEST_CS and LGUEST_DS entries are reserved for the Switcher, and the
@@ -71,37 +55,11 @@ static int ignored_gdt(unsigned int num)
		|| num == GDT_ENTRY_DOUBLEFAULT_TSS);
}

-/* If the Guest asks us to remove an entry from the GDT, we have to be careful.
- * If one of the segment registers is pointing at that entry the Switcher will
- * crash when it tries to reload the segment registers for the Guest.
- *
- * It doesn't make much sense for the Guest to try to remove its own code, data
- * or stack segments while they're in use: assume that's a Guest bug. If it's
- * one of the lesser segment registers using the removed entry, we simply set
- * that register to 0 (unusable). */
-static void check_segment_use(struct lguest *lg, unsigned int desc)
-{
-	/* GDT entries are 8 bytes long, so we divide to get the index and
-	 * ignore the bottom bits. */
-	if (lg->regs->gs / 8 == desc)
-		lg->regs->gs = 0;
-	if (lg->regs->fs / 8 == desc)
-		lg->regs->fs = 0;
-	if (lg->regs->es / 8 == desc)
-		lg->regs->es = 0;
-	if (lg->regs->ds / 8 == desc
-	    || lg->regs->cs / 8 == desc
-	    || lg->regs->ss / 8 == desc)
-		kill_guest(lg, "Removed live GDT entry %u", desc);
-}
-/*:*/
-/*M:009 We wouldn't need to check for removal of in-use segments if we handled
- * faults in the Switcher. However, it's probably not a worthwhile
- * optimization. :*/
-
-/*H:610 Once the GDT has been changed, we look through the changed entries and
- * see if they're OK. If not, we'll call kill_guest() and the Guest will never
- * get to use the invalid entries. */
+/*H:610 Once the GDT has been changed, we fix the new entries up a little. We
+ * don't care if they're invalid: the worst that can happen is a General
+ * Protection Fault in the Switcher when it restores a Guest segment register
+ * which tries to use that entry. Then we kill the Guest for causing such a
+ * mess: the message will be "unhandled trap 256". */
static void fixup_gdt_table(struct lguest *lg, unsigned start, unsigned end)
{
	unsigned int i;
@@ -112,16 +70,6 @@ static void fixup_gdt_table(struct lguest *lg, unsigned start, unsigned end)
		if (ignored_gdt(i))
			continue;

-		/* We could fault in switch_to_guest if they are using
-		 * a removed segment. */
-		if (!segment_present(&lg->gdt[i])) {
-			check_segment_use(lg, i);
-			continue;
-		}
-
-		if (!desc_ok(&lg->gdt[i]))
-			kill_guest(lg, "Bad GDT descriptor %i", i);
-
		/* Segment descriptors contain a privilege level: the Guest is
		 * sometimes careless and leaves this as 0, even though it's
		 * running at privilege level 1. If so, we fix it here. */
15 changes: 9 additions & 6 deletions trunk/drivers/lguest/switcher.S
@@ -47,6 +47,7 @@
// Down here in the depths of assembler code.
#include <linux/linkage.h>
#include <asm/asm-offsets.h>
+#include <asm/page.h>
#include "lg.h"

// We mark the start of the code to copy
@@ -182,13 +183,15 @@ ENTRY(switch_to_guest)
	movl	$(LGUEST_DS), %eax;			\
	movl	%eax, %ds;				\
	/* So where are we? Which CPU, which struct?	\
-	 * The stack is our clue: our TSS sets		\
-	 * It at the end of "struct lguest_pages"	\
-	 * And we then pushed and pushed and pushed Guest regs:	\
-	 * Now stack points atop the "struct lguest_regs".	\
-	 * Subtract that offset, and we find our struct. */	\
+	 * The stack is our clue: our TSS starts	\
+	 * It at the end of "struct lguest_pages".	\
+	 * Or we may have stumbled while restoring	\
+	 * Our Guest segment regs while in switch_to_guest,	\
+	 * The fault pushed atop that part-unwound stack.	\
+	 * If we round the stack down to the page start	\
+	 * We're at the start of "struct lguest_pages". */	\
	movl	%esp, %eax;				\
-	subl	$LGUEST_PAGES_regs, %eax;		\
+	andl	$(~(1 << PAGE_SHIFT - 1)), %eax;	\
	/* Save our trap number: the switch will obscure it	\
	 * (The Guest regs are not mapped here in the Host)	\
	 * %ebx holds it safe for deliver_to_host */	\
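One subtlety in the new andl: GNU as gives '<<' higher precedence than
'-', so $(~(1 << PAGE_SHIFT - 1)) assembles as ~((1 << PAGE_SHIFT) - 1),
which is the page-alignment mask the comment describes. C would parse
the same expression differently; a small sketch of the difference,
assuming PAGE_SHIFT is 12:

    #include <stdio.h>

    #define PAGE_SHIFT 12    /* assumed x86 value, for illustration only */

    int main(void)
    {
        unsigned c_parse   = 1 << (PAGE_SHIFT - 1);   /* C:   0x800 */
        unsigned gas_parse = (1 << PAGE_SHIFT) - 1;   /* GAS: 0xfff */

        printf("C parse: %#x, GAS parse: %#x, mask: %#x\n",
               c_parse, gas_parse, ~gas_parse);
        return 0;
    }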
14 changes: 7 additions & 7 deletions trunk/drivers/mmc/card/queue.c
@@ -117,7 +117,6 @@ int mmc_init_queue(struct mmc_queue *mq, struct mmc_card *card, spinlock_t *lock
	struct mmc_host *host = card->host;
	u64 limit = BLK_BOUNCE_HIGH;
	int ret;
-	unsigned int bouncesz;

	if (mmc_dev(host)->dma_mask && *mmc_dev(host)->dma_mask)
		limit = *mmc_dev(host)->dma_mask;
@@ -134,6 +133,8 @@ int mmc_init_queue(struct mmc_queue *mq, struct mmc_card *card, spinlock_t *lock

#ifdef CONFIG_MMC_BLOCK_BOUNCE
	if (host->max_hw_segs == 1) {
+		unsigned int bouncesz;
+
		bouncesz = MMC_QUEUE_BOUNCESZ;

		if (bouncesz > host->max_req_size)
@@ -156,14 +157,14 @@ int mmc_init_queue(struct mmc_queue *mq, struct mmc_card *card, spinlock_t *lock
						GFP_KERNEL);
			if (!mq->sg) {
				ret = -ENOMEM;
-				goto free_bounce_buf;
+				goto cleanup_queue;
			}

			mq->bounce_sg = kmalloc(sizeof(struct scatterlist) *
						bouncesz / 512, GFP_KERNEL);
			if (!mq->bounce_sg) {
				ret = -ENOMEM;
-				goto free_sg;
+				goto cleanup_queue;
			}
		}
	}
@@ -197,14 +198,13 @@ int mmc_init_queue(struct mmc_queue *mq, struct mmc_card *card, spinlock_t *lock
	if (mq->bounce_sg)
		kfree(mq->bounce_sg);
	mq->bounce_sg = NULL;
-free_sg:
-	kfree(mq->sg);
+cleanup_queue:
+	if (mq->sg)
+		kfree(mq->sg);
	mq->sg = NULL;
-free_bounce_buf:
	if (mq->bounce_buf)
		kfree(mq->bounce_buf);
	mq->bounce_buf = NULL;
-cleanup_queue:
	blk_cleanup_queue(mq->queue);
	return ret;
}
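The error-path rework above funnels every allocation failure through a
single cleanup_queue label and makes each step safe to run whether or
not the corresponding allocation happened. A stripped-down sketch of
that idiom in plain C (names are illustrative, not the driver's):

    #include <stdlib.h>

    struct ctx {
        void *buf;
        void *sg;
    };

    static int ctx_init(struct ctx *c)
    {
        c->buf = NULL;          /* start from a known state... */
        c->sg = NULL;           /* ...so cleanup is always safe */

        c->buf = malloc(64);
        if (!c->buf)
            goto cleanup;

        c->sg = malloc(128);
        if (!c->sg)
            goto cleanup;

        return 0;

    cleanup:
        free(c->buf);           /* free(NULL) is a no-op, like kfree(NULL) */
        c->buf = NULL;
        free(c->sg);
        c->sg = NULL;
        return -1;
    }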
8 changes: 4 additions & 4 deletions trunk/drivers/mmc/host/at91_mci.c
@@ -83,7 +83,7 @@

#define AT91_MCI_ERRORS	(AT91_MCI_RINDE | AT91_MCI_RDIRE | AT91_MCI_RCRCE \
			| AT91_MCI_RENDE | AT91_MCI_RTOE | AT91_MCI_DCRCE \
-			| AT91_MCI_DTOE | AT91_MCI_OVRE | AT91_MCI_UNRE)
+			| AT91_MCI_DTOE | AT91_MCI_OVRE | AT91_MCI_UNRE)

#define at91_mci_read(host, reg) __raw_readl((host)->baseaddr + (reg))
#define at91_mci_write(host, reg, val) __raw_writel((val), (host)->baseaddr + (reg))
@@ -676,15 +676,15 @@ static irqreturn_t at91_mci_irq(int irq, void *devid)

	int_status = at91_mci_read(host, AT91_MCI_SR);
	int_mask = at91_mci_read(host, AT91_MCI_IMR);

	pr_debug("MCI irq: status = %08X, %08X, %08X\n", int_status, int_mask,
		int_status & int_mask);

	int_status = int_status & int_mask;

	if (int_status & AT91_MCI_ERRORS) {
		completed = 1;

		if (int_status & AT91_MCI_UNRE)
			pr_debug("MMC: Underrun error\n");
		if (int_status & AT91_MCI_OVRE)