Skip to content

Commit

Permalink
Merge branch 'x86-mce-for-linus' of git://git.kernel.org/pub/scm/linu…
Browse files Browse the repository at this point in the history
…x/kernel/git/tip/linux-2.6-tip

* 'x86-mce-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (80 commits)
  x86, mce: Add boot options for corrected errors
  x86, mce: Fix mce printing
  x86, mce: fix for mce counters
  x86, mce: support action-optional machine checks
  x86, mce: define MCE_VECTOR
  x86, mce: rename mce_notify_user to mce_notify_irq
  x86: fix panic with interrupts off (needed for MCE)
  x86, mce: export MCE severities coverage via debugfs
  x86, mce: implement new status bits
  x86, mce: print header/footer only once for multiple MCEs
  x86, mce: default to panic timeout for machine checks
  x86, mce: improve mce_get_rip
  x86, mce: make non Monarch panic message "Fatal machine check" too
  x86, mce: switch x86 machine check handler to Monarch election.
  x86, mce: implement panic synchronization
  x86, mce: implement bootstrapping for machine check wakeups
  x86, mce: check early in exception handler if panic is needed
  x86, mce: add table driven machine check grading
  x86, mce: remove TSC print heuristic
  x86, mce: log corrected errors when panicing
  ...
  • Loading branch information
Linus Torvalds committed Jun 13, 2009
2 parents 7603ef0 + 0d59597 commit a2ee298
Show file tree
Hide file tree
Showing 39 changed files with 2,978 additions and 1,668 deletions.
15 changes: 15 additions & 0 deletions Documentation/Changes
Original file line number Diff line number Diff line change
Expand Up @@ -48,6 +48,7 @@ o procps 3.2.0 # ps --version
o oprofile 0.9 # oprofiled --version
o udev 081 # udevinfo -V
o grub 0.93 # grub --version
o mcelog 0.6

Kernel compilation
==================
Expand Down Expand Up @@ -276,6 +277,16 @@ before running exportfs or mountd. It is recommended that all NFS
services be protected from the internet-at-large by a firewall where
that is possible.

mcelog
------

In Linux 2.6.31+ the i386 kernel needs to run the mcelog utility
as a regular cronjob similar to the x86-64 kernel to process and log
machine check events when CONFIG_X86_NEW_MCE is enabled. Machine check
events are errors reported by the CPU. Processing them is strongly encouraged.
All x86-64 kernels since 2.6.4 require the mcelog utility to
process machine checks.

Getting updated software
========================

Expand Down Expand Up @@ -365,6 +376,10 @@ FUSE
----
o <http://sourceforge.net/projects/fuse>

mcelog
------
o <ftp://ftp.kernel.org/pub/linux/utils/cpu/mce/mcelog/>

Networking
**********

Expand Down
10 changes: 10 additions & 0 deletions Documentation/feature-removal-schedule.txt
Original file line number Diff line number Diff line change
Expand Up @@ -437,3 +437,13 @@ Why: Superseded by tdfxfb. I2C/DDC support used to live in a separate
driver but this caused driver conflicts.
Who: Jean Delvare <khali@linux-fr.org>
Krzysztof Helt <krzysztof.h1@wp.pl>

----------------------------

What: CONFIG_X86_OLD_MCE
When: 2.6.32
Why: Remove the old legacy 32bit machine check code. This has been
superseded by the newer machine check code from the 64bit port,
but the old version has been kept around for easier testing. Note this
doesn't impact the old P5 and WinChip machine check handlers.
Who: Andi Kleen <andi@firstfloor.org>
44 changes: 37 additions & 7 deletions Documentation/x86/x86_64/boot-options.txt
Original file line number Diff line number Diff line change
Expand Up @@ -5,21 +5,51 @@ only the AMD64 specific ones are listed here.

Machine check

mce=off disable machine check
mce=bootlog Enable logging of machine checks left over from booting.
Disabled by default on AMD because some BIOS leave bogus ones.
If your BIOS doesn't do that it's a good idea to enable though
to make sure you log even machine check events that result
in a reboot. On Intel systems it is enabled by default.
Please see Documentation/x86/x86_64/machinecheck for sysfs runtime tunables.

mce=off
Disable machine check
mce=no_cmci
Disable CMCI(Corrected Machine Check Interrupt) that
Intel processor supports. Usually this disablement is
not recommended, but it might be handy if your hardware
is misbehaving.
Note that you'll get more problems without CMCI than with
due to the shared banks, i.e. you might get duplicated
error logs.
mce=dont_log_ce
Don't make logs for corrected errors. All events reported
as corrected are silently cleared by OS.
This option will be useful if you have no interest in any
of corrected errors.
mce=ignore_ce
Disable features for corrected errors, e.g. polling timer
and CMCI. All events reported as corrected are not cleared
by OS and remained in its error banks.
Usually this disablement is not recommended, however if
there is an agent checking/clearing corrected errors
(e.g. BIOS or hardware monitoring applications), conflicting
with OS's error handling, and you cannot deactivate the agent,
then this option will be a help.
mce=bootlog
Enable logging of machine checks left over from booting.
Disabled by default on AMD because some BIOS leave bogus ones.
If your BIOS doesn't do that it's a good idea to enable though
to make sure you log even machine check events that result
in a reboot. On Intel systems it is enabled by default.
mce=nobootlog
Disable boot machine check logging.
mce=tolerancelevel (number)
mce=tolerancelevel[,monarchtimeout] (number,number)
tolerance levels:
0: always panic on uncorrected errors, log corrected errors
1: panic or SIGBUS on uncorrected errors, log corrected errors
2: SIGBUS or log uncorrected errors, log corrected errors
3: never panic or SIGBUS, log all errors (for testing only)
Default is 1
Can be also set using sysfs which is preferable.
monarchtimeout:
Sets the time in us to wait for other CPUs on machine checks. 0
to disable.

nomce (for compatibility with i386): same as mce=off

Expand Down
8 changes: 7 additions & 1 deletion Documentation/x86/x86_64/machinecheck
Original file line number Diff line number Diff line change
Expand Up @@ -41,7 +41,9 @@ check_interval
the polling interval. When the poller stops finding MCEs, it
triggers an exponential backoff (poll less often) on the polling
interval. The check_interval variable is both the initial and
maximum polling interval.
maximum polling interval. 0 means no polling for corrected machine
check errors (but some corrected errors might be still reported
in other ways)

tolerant
Tolerance level. When a machine check exception occurs for a non
Expand All @@ -67,6 +69,10 @@ trigger
Program to run when a machine check event is detected.
This is an alternative to running mcelog regularly from cron
and allows to detect events faster.
monarch_timeout
How long to wait for the other CPUs to machine check too on a
exception. 0 to disable waiting for other CPUs.
Unit: us

TBD document entries for AMD threshold interrupt configuration

Expand Down
45 changes: 41 additions & 4 deletions arch/x86/Kconfig
Original file line number Diff line number Diff line change
Expand Up @@ -789,30 +789,63 @@ config X86_MCE
to disable it. MCE support simply ignores non-MCE processors like
the 386 and 486, so nearly everyone can say Y here.

config X86_OLD_MCE
depends on X86_32 && X86_MCE
bool "Use legacy machine check code (will go away)"
default n
select X86_ANCIENT_MCE
---help---
Use the old i386 machine check code. This is merely intended for
testing in a transition period. Try this if you run into any machine
check related software problems, but report the problem to
linux-kernel. When in doubt say no.

config X86_NEW_MCE
depends on X86_MCE
bool
default y if (!X86_OLD_MCE && X86_32) || X86_64

config X86_MCE_INTEL
def_bool y
prompt "Intel MCE features"
depends on X86_64 && X86_MCE && X86_LOCAL_APIC
depends on X86_NEW_MCE && X86_LOCAL_APIC
---help---
Additional support for intel specific MCE features such as
the thermal monitor.

config X86_MCE_AMD
def_bool y
prompt "AMD MCE features"
depends on X86_64 && X86_MCE && X86_LOCAL_APIC
depends on X86_NEW_MCE && X86_LOCAL_APIC
---help---
Additional support for AMD specific MCE features such as
the DRAM Error Threshold.

config X86_ANCIENT_MCE
def_bool n
depends on X86_32
prompt "Support for old Pentium 5 / WinChip machine checks"
---help---
Include support for machine check handling on old Pentium 5 or WinChip
systems. These typically need to be enabled explicitely on the command
line.

config X86_MCE_THRESHOLD
depends on X86_MCE_AMD || X86_MCE_INTEL
bool
default y

config X86_MCE_INJECT
depends on X86_NEW_MCE
tristate "Machine check injector support"
---help---
Provide support for injecting machine checks for testing purposes.
If you don't know what a machine check is and you don't do kernel
QA it is safe to say n.

config X86_MCE_NONFATAL
tristate "Check for non-fatal errors on AMD Athlon/Duron / Intel Pentium 4"
depends on X86_32 && X86_MCE
depends on X86_OLD_MCE
---help---
Enabling this feature starts a timer that triggers every 5 seconds which
will look at the machine check registers to see if anything happened.
Expand All @@ -825,11 +858,15 @@ config X86_MCE_NONFATAL

config X86_MCE_P4THERMAL
bool "check for P4 thermal throttling interrupt."
depends on X86_32 && X86_MCE && (X86_UP_APIC || SMP)
depends on X86_OLD_MCE && X86_MCE && (X86_UP_APIC || SMP)
---help---
Enabling this feature will cause a message to be printed when the P4
enters thermal throttling.

config X86_THERMAL_VECTOR
def_bool y
depends on X86_MCE_P4THERMAL || X86_MCE_INTEL

config VM86
bool "Enable VM86 support" if EMBEDDED
default y
Expand Down
11 changes: 10 additions & 1 deletion arch/x86/include/asm/entry_arch.h
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,7 @@ BUILD_INTERRUPT(reschedule_interrupt,RESCHEDULE_VECTOR)
BUILD_INTERRUPT(call_function_interrupt,CALL_FUNCTION_VECTOR)
BUILD_INTERRUPT(call_function_single_interrupt,CALL_FUNCTION_SINGLE_VECTOR)
BUILD_INTERRUPT(irq_move_cleanup_interrupt,IRQ_MOVE_CLEANUP_VECTOR)
BUILD_INTERRUPT(reboot_interrupt,REBOOT_VECTOR)

BUILD_INTERRUPT3(invalidate_interrupt0,INVALIDATE_TLB_VECTOR_START+0,
smp_invalidate_interrupt)
Expand Down Expand Up @@ -52,8 +53,16 @@ BUILD_INTERRUPT(spurious_interrupt,SPURIOUS_APIC_VECTOR)
BUILD_INTERRUPT(perf_pending_interrupt, LOCAL_PENDING_VECTOR)
#endif

#ifdef CONFIG_X86_MCE_P4THERMAL
#ifdef CONFIG_X86_THERMAL_VECTOR
BUILD_INTERRUPT(thermal_interrupt,THERMAL_APIC_VECTOR)
#endif

#ifdef CONFIG_X86_MCE_THRESHOLD
BUILD_INTERRUPT(threshold_interrupt,THRESHOLD_APIC_VECTOR)
#endif

#ifdef CONFIG_X86_NEW_MCE
BUILD_INTERRUPT(mce_self_interrupt,MCE_SELF_VECTOR)
#endif

#endif
2 changes: 1 addition & 1 deletion arch/x86/include/asm/hardirq.h
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,7 @@ typedef struct {
#endif
#ifdef CONFIG_X86_MCE
unsigned int irq_thermal_count;
# ifdef CONFIG_X86_64
# ifdef CONFIG_X86_MCE_THRESHOLD
unsigned int irq_threshold_count;
# endif
#endif
Expand Down
2 changes: 2 additions & 0 deletions arch/x86/include/asm/hw_irq.h
Original file line number Diff line number Diff line change
Expand Up @@ -34,6 +34,7 @@ extern void perf_pending_interrupt(void);
extern void spurious_interrupt(void);
extern void thermal_interrupt(void);
extern void reschedule_interrupt(void);
extern void mce_self_interrupt(void);

extern void invalidate_interrupt(void);
extern void invalidate_interrupt0(void);
Expand All @@ -46,6 +47,7 @@ extern void invalidate_interrupt6(void);
extern void invalidate_interrupt7(void);

extern void irq_move_cleanup_interrupt(void);
extern void reboot_interrupt(void);
extern void threshold_interrupt(void);

extern void call_function_interrupt(void);
Expand Down
17 changes: 10 additions & 7 deletions arch/x86/include/asm/irq_vectors.h
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,7 @@
*/

#define NMI_VECTOR 0x02
#define MCE_VECTOR 0x12

/*
* IDT vectors usable for external interrupt sources start
Expand Down Expand Up @@ -87,13 +88,8 @@
#define CALL_FUNCTION_VECTOR 0xfc
#define CALL_FUNCTION_SINGLE_VECTOR 0xfb
#define THERMAL_APIC_VECTOR 0xfa

#ifdef CONFIG_X86_32
/* 0xf8 - 0xf9 : free */
#else
# define THRESHOLD_APIC_VECTOR 0xf9
# define UV_BAU_MESSAGE 0xf8
#endif
#define THRESHOLD_APIC_VECTOR 0xf9
#define REBOOT_VECTOR 0xf8

/* f0-f7 used for spreading out TLB flushes: */
#define INVALIDATE_TLB_VECTOR_END 0xf7
Expand All @@ -117,6 +113,13 @@
*/
#define LOCAL_PENDING_VECTOR 0xec

#define UV_BAU_MESSAGE 0xec

/*
* Self IPI vector for machine checks
*/
#define MCE_SELF_VECTOR 0xeb

/*
* First APIC vector available to drivers: (vectors 0x30-0xee) we
* start at 0x31(0x41) to spread out vectors evenly between priority
Expand Down
Loading

0 comments on commit a2ee298

Please sign in to comment.