Commit 55e5914

---
r: 288651
b: refs/heads/master
c: 73323f5
h: refs/heads/master
i:
  288649: 7627cce
  288647: 4c3f44a
v: v3
Stephane Eranian authored and Arnaldo Carvalho de Melo committed Feb 9, 2012
1 parent 9c54dc5 commit 55e5914
Showing 10 changed files with 199 additions and 140 deletions.
2 changes: 1 addition & 1 deletion [refs]
@@ -1,2 +1,2 @@
---
refs/heads/master: f8d98f1095210da708a59f3a0b6fd267ad8f3f03
refs/heads/master: 73323f541fe5f55a3b8a5c3d565bfc1efd64abf6
63 changes: 0 additions & 63 deletions trunk/Documentation/lockup-watchdogs.txt

This file was deleted.

83 changes: 83 additions & 0 deletions trunk/Documentation/nmi_watchdog.txt
@@ -0,0 +1,83 @@

[NMI watchdog is available for x86 and x86-64 architectures]

Is your system locking up unpredictably? No keyboard activity, just
a frustrating complete hard lockup? Do you want to help us debug
such lockups? If the answer is yes to all of these, then this document
is definitely for you.

On many x86/x86-64 machines there is a hardware feature that enables
us to generate 'watchdog NMI interrupts'. (NMI: Non-Maskable Interrupt,
which gets executed even if the system is otherwise locked up hard.)
This can be used to debug hard kernel lockups. By taking periodic
NMI interrupts, the kernel can monitor whether any CPU has locked up,
and print out debugging messages if so.

In order to use the NMI watchdog, you need to have APIC support in your
kernel. For SMP kernels, APIC support gets compiled in automatically. For
UP, enable either CONFIG_X86_UP_APIC (Processor type and features -> Local
APIC support on uniprocessors) or CONFIG_X86_UP_IOAPIC (Processor type and
features -> IO-APIC support on uniprocessors) in your kernel config.
CONFIG_X86_UP_APIC is for uniprocessor machines without an IO-APIC.
CONFIG_X86_UP_IOAPIC is for uniprocessor with an IO-APIC. [Note: certain
kernel debugging options, such as Kernel Stack Meter or Kernel Tracer,
may implicitly disable the NMI watchdog.]
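
For example -- a sketch only, since the exact option set varies between
kernel versions -- a UP config for a machine without an IO-APIC would
contain something like:

  CONFIG_X86_LOCAL_APIC=y
  CONFIG_X86_UP_APIC=y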

For x86-64, the needed APIC is always compiled in.

Using the local APIC (nmi_watchdog=2) needs the first performance
counter register, so you can't use that register for other purposes
(such as high-precision performance profiling). However, at least
oprofile and the perfctr driver disable the local APIC NMI watchdog
automatically.

To actually enable the NMI watchdog, use the 'nmi_watchdog=N' boot
parameter. Eg. the relevant lilo.conf entry:

append="nmi_watchdog=1"

For SMP machines and UP machines with an IO-APIC use nmi_watchdog=1.
For UP machines without an IO-APIC use nmi_watchdog=2; this only works
for some processor types. If in doubt, boot with nmi_watchdog=1 and
check the NMI count in /proc/interrupts; if the count is zero then
reboot with nmi_watchdog=2 and check the NMI count again. If it is still
zero then report the problem; you probably have a processor that needs
to be added to the nmi code.
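
For instance, to see whether the watchdog is ticking you can sample the
NMI line of /proc/interrupts (a sketch; the exact column layout differs
between kernel versions):

  grep NMI /proc/interrupts

Run it twice, a few seconds apart: if the per-CPU NMI counts are
increasing, the watchdog is generating its periodic NMIs.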

A 'lockup' is the following scenario: if any CPU in the system does not
execute the periodic local timer interrupt for more than 5 seconds, then
the NMI handler generates an oops and kills the process. This
'controlled crash' (and the resulting kernel messages) can be used to
debug the lockup. Thus whenever a lockup happens, wait 5 seconds and
the oops will show up automatically. If the kernel produces no messages
then the system has crashed so hard (e.g. hardware-wise) that either it
cannot even accept NMI interrupts, or the crash has made the kernel
unable to print messages.

Be aware that when using the local APIC, the frequency of the NMI
interrupts it generates depends on the system load. The local APIC NMI
watchdog, lacking a better source, uses the "cycles unhalted" event. As you
may guess, it doesn't tick when the CPU is in the halted state (which happens
when the system is idle), but if your system locks up on anything but the
"hlt" processor instruction, the watchdog will trigger very soon as the
"cycles unhalted" event will happen every clock tick. If it locks up on
"hlt", then you are out of luck -- the event will not happen at all and the
watchdog won't trigger. This is a shortcoming of the local APIC watchdog
-- unfortunately there is no "clock ticks" event that would work all the
time. The I/O APIC watchdog is driven externally and has no such shortcoming.
But its NMI frequency is much higher, resulting in a more significant hit
to the overall system performance.

On x86, nmi_watchdog is disabled by default, so you have to enable it
with a boot-time parameter.

It's possible to disable the NMI watchdog at run time by writing "0" to
/proc/sys/kernel/nmi_watchdog. Writing "1" to the same file will re-enable
the NMI watchdog. Note that you still need to use the "nmi_watchdog="
parameter at boot time.
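
E.g., assuming the watchdog was enabled at boot as described above, it
can be toggled at run time like this:

  echo 0 > /proc/sys/kernel/nmi_watchdog   # disable
  echo 1 > /proc/sys/kernel/nmi_watchdog   # re-enable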

NOTE: In kernels prior to 2.4.2-ac18 the NMI-oopser is enabled unconditionally
on x86 SMP boxes.

[ feel free to send bug reports, suggestions and patches to
Ingo Molnar <mingo@redhat.com> or the Linux SMP mailing
list at <linux-smp@vger.kernel.org> ]

5 changes: 2 additions & 3 deletions trunk/arch/x86/include/asm/inat.h
@@ -97,12 +97,11 @@

/* Attribute search APIs */
extern insn_attr_t inat_get_opcode_attribute(insn_byte_t opcode);
extern int inat_get_last_prefix_id(insn_byte_t last_pfx);
extern insn_attr_t inat_get_escape_attribute(insn_byte_t opcode,
int lpfx_id,
insn_byte_t last_pfx,
insn_attr_t esc_attr);
extern insn_attr_t inat_get_group_attribute(insn_byte_t modrm,
int lpfx_id,
insn_byte_t last_pfx,
insn_attr_t esc_attr);
extern insn_attr_t inat_get_avx_attribute(insn_byte_t opcode,
insn_byte_t vex_m,
18 changes: 6 additions & 12 deletions trunk/arch/x86/include/asm/insn.h
@@ -96,6 +96,12 @@ struct insn {
#define X86_VEX_P(vex) ((vex) & 0x03) /* VEX3 Byte2, VEX2 Byte1 */
#define X86_VEX_M_MAX 0x1f /* VEX3.M Maximum value */

/* The last prefix is needed for two-byte and three-byte opcodes */
static inline insn_byte_t insn_last_prefix(struct insn *insn)
{
return insn->prefixes.bytes[3];
}

extern void insn_init(struct insn *insn, const void *kaddr, int x86_64);
extern void insn_get_prefixes(struct insn *insn);
extern void insn_get_opcode(struct insn *insn);
@@ -154,18 +160,6 @@ static inline insn_byte_t insn_vex_p_bits(struct insn *insn)
return X86_VEX_P(insn->vex_prefix.bytes[2]);
}

/* Get the last prefix id from last prefix or VEX prefix */
static inline int insn_last_prefix_id(struct insn *insn)
{
if (insn_is_avx(insn))
return insn_vex_p_bits(insn); /* VEX_p is a SIMD prefix id */

if (insn->prefixes.bytes[3])
return inat_get_last_prefix_id(insn->prefixes.bytes[3]);

return 0;
}

/* Offset of each field from kaddr */
static inline int insn_offset_rex_prefix(struct insn *insn)
{
36 changes: 18 additions & 18 deletions trunk/arch/x86/lib/inat.c
@@ -29,46 +29,46 @@ insn_attr_t inat_get_opcode_attribute(insn_byte_t opcode)
return inat_primary_table[opcode];
}

int inat_get_last_prefix_id(insn_byte_t last_pfx)
{
insn_attr_t lpfx_attr;

lpfx_attr = inat_get_opcode_attribute(last_pfx);
return inat_last_prefix_id(lpfx_attr);
}

insn_attr_t inat_get_escape_attribute(insn_byte_t opcode, int lpfx_id,
insn_attr_t inat_get_escape_attribute(insn_byte_t opcode, insn_byte_t last_pfx,
insn_attr_t esc_attr)
{
const insn_attr_t *table;
int n;
insn_attr_t lpfx_attr;
int n, m = 0;

n = inat_escape_id(esc_attr);

if (last_pfx) {
lpfx_attr = inat_get_opcode_attribute(last_pfx);
m = inat_last_prefix_id(lpfx_attr);
}
table = inat_escape_tables[n][0];
if (!table)
return 0;
if (inat_has_variant(table[opcode]) && lpfx_id) {
table = inat_escape_tables[n][lpfx_id];
if (inat_has_variant(table[opcode]) && m) {
table = inat_escape_tables[n][m];
if (!table)
return 0;
}
return table[opcode];
}

insn_attr_t inat_get_group_attribute(insn_byte_t modrm, int lpfx_id,
insn_attr_t inat_get_group_attribute(insn_byte_t modrm, insn_byte_t last_pfx,
insn_attr_t grp_attr)
{
const insn_attr_t *table;
int n;
insn_attr_t lpfx_attr;
int n, m = 0;

n = inat_group_id(grp_attr);

if (last_pfx) {
lpfx_attr = inat_get_opcode_attribute(last_pfx);
m = inat_last_prefix_id(lpfx_attr);
}
table = inat_group_tables[n][0];
if (!table)
return inat_group_common_attribute(grp_attr);
if (inat_has_variant(table[X86_MODRM_REG(modrm)]) && lpfx_id) {
table = inat_group_tables[n][lpfx_id];
if (inat_has_variant(table[X86_MODRM_REG(modrm)]) && m) {
table = inat_group_tables[n][m];
if (!table)
return inat_group_common_attribute(grp_attr);
}
13 changes: 6 additions & 7 deletions trunk/arch/x86/lib/insn.c
@@ -185,8 +185,7 @@ void insn_get_prefixes(struct insn *insn)
void insn_get_opcode(struct insn *insn)
{
struct insn_field *opcode = &insn->opcode;
insn_byte_t op;
int pfx_id;
insn_byte_t op, pfx;
if (opcode->got)
return;
if (!insn->prefixes.got)
@@ -213,8 +212,8 @@ void insn_get_opcode(struct insn *insn)
/* Get escaped opcode */
op = get_next(insn_byte_t, insn);
opcode->bytes[opcode->nbytes++] = op;
pfx_id = insn_last_prefix_id(insn);
insn->attr = inat_get_escape_attribute(op, pfx_id, insn->attr);
pfx = insn_last_prefix(insn);
insn->attr = inat_get_escape_attribute(op, pfx, insn->attr);
}
if (inat_must_vex(insn->attr))
insn->attr = 0; /* This instruction is bad */
@@ -236,7 +235,7 @@ void insn_get_opcode(struct insn *insn)
void insn_get_modrm(struct insn *insn)
{
struct insn_field *modrm = &insn->modrm;
insn_byte_t pfx_id, mod;
insn_byte_t pfx, mod;
if (modrm->got)
return;
if (!insn->opcode.got)
@@ -247,8 +246,8 @@ void insn_get_modrm(struct insn *insn)
modrm->value = mod;
modrm->nbytes = 1;
if (inat_is_group(insn->attr)) {
pfx_id = insn_last_prefix_id(insn);
insn->attr = inat_get_group_attribute(mod, pfx_id,
pfx = insn_last_prefix(insn);
insn->attr = inat_get_group_attribute(mod, pfx,
insn->attr);
if (insn_is_avx(insn) && !inat_accept_vex(insn->attr))
insn->attr = 0; /* This is bad */
24 changes: 12 additions & 12 deletions trunk/kernel/watchdog.c
@@ -3,9 +3,12 @@
*
* started by Don Zickus, Copyright (C) 2010 Red Hat, Inc.
*
* Note: Most of this code is borrowed heavily from the original softlockup
* detector, so thanks to Ingo for the initial implementation.
* Some chunks also taken from the old x86-specific nmi watchdog code, thanks
* this code detects hard lockups: incidents where, on a CPU, the
* kernel does not respond to anything except NMI.
*
* Note: Most of this code is borrowed heavily from softlockup.c,
* so thanks to Ingo for the initial implementation.
* Some chunks also taken from arch/x86/kernel/apic/nmi.c, thanks
* to those contributors as well.
*/

@@ -114,10 +117,9 @@ static unsigned long get_sample_period(void)
{
/*
* convert watchdog_thresh from seconds to ns
* the divide by 5 is to give hrtimer several chances (two
* or three with the current relation between the soft
* and hard thresholds) to increment before the
* hardlockup detector generates a warning
* the divide by 5 is to give hrtimer 5 chances to
* increment before the hardlockup detector generates
* a warning
*/
return get_softlockup_thresh() * (NSEC_PER_SEC / 5);
}
@@ -334,11 +336,9 @@ static int watchdog(void *unused)

set_current_state(TASK_INTERRUPTIBLE);
/*
* Run briefly (kicked by the hrtimer callback function) once every
* get_sample_period() seconds (4 seconds by default) to reset the
* softlockup timestamp. If this gets delayed for more than
* 2*watchdog_thresh seconds then the debug-printout triggers in
* watchdog_timer_fn().
* Run briefly once per second to reset the softlockup timestamp.
* If this gets delayed for more than 60 seconds then the
* debug-printout triggers in watchdog_timer_fn().
*/
while (!kthread_should_stop()) {
__touch_watchdog();
18 changes: 7 additions & 11 deletions trunk/lib/Kconfig.debug
@@ -166,21 +166,18 @@ config LOCKUP_DETECTOR
hard and soft lockups.

Softlockups are bugs that cause the kernel to loop in kernel
mode for more than 20 seconds, without giving other tasks a
mode for more than 60 seconds, without giving other tasks a
chance to run. The current stack trace is displayed upon
detection and the system will stay locked up.

Hardlockups are bugs that cause the CPU to loop in kernel mode
for more than 10 seconds, without letting other interrupts have a
for more than 60 seconds, without letting other interrupts have a
chance to run. The current stack trace is displayed upon detection
and the system will stay locked up.

The overhead should be minimal. A periodic hrtimer runs to
generate interrupts and kick the watchdog task every 4 seconds.
An NMI is generated every 10 seconds or so to check for hardlockups.

The frequency of hrtimer and NMI events and the soft and hard lockup
thresholds can be controlled through the sysctl watchdog_thresh.
generate interrupts and kick the watchdog task every 10-12 seconds.
An NMI is generated every 60 seconds or so to check for hardlockups.

config HARDLOCKUP_DETECTOR
def_bool LOCKUP_DETECTOR && PERF_EVENTS && HAVE_PERF_EVENTS_NMI && \
@@ -192,8 +189,7 @@ config BOOTPARAM_HARDLOCKUP_PANIC
help
Say Y here to enable the kernel to panic on "hard lockups",
which are bugs that cause the kernel to loop in kernel
mode with interrupts disabled for more than 10 seconds (configurable
using the watchdog_thresh sysctl).
mode with interrupts disabled for more than 60 seconds.

Say N if unsure.

@@ -210,8 +206,8 @@ config BOOTPARAM_SOFTLOCKUP_PANIC
help
Say Y here to enable the kernel to panic on "soft lockups",
which are bugs that cause the kernel to loop in kernel
mode for more than 20 seconds (configurable using the watchdog_thresh
sysctl), without giving other tasks a chance to run.
mode for more than 60 seconds, without giving other tasks a
chance to run.

The panic can be used in combination with panic_timeout,
to cause the system to reboot automatically after a
[1 more changed file not shown]
