Add LTS Linux 5.4.46 #1832

Merged
merged 13 commits into from
Jul 28, 2020

pmenzel
Collaborator

@pmenzel pmenzel commented Jun 16, 2020

Tested on hypnotoad, sigchld, and sigfpe.

@hypnotoad:~$ sensors
amdgpu-pci-0600
Adapter: PCI adapter
fan1:           0 RPM
edge:         +43.0°C  (crit = +120.0°C, hyst = +90.0°C)

@hypnotoad:~$ sudo modprobe k10temp
@hypnotoad:~$ sudo modprobe k10temp
@hypnotoad:~$ sensors
k10temp-pci-00c3
Adapter: PCI adapter
Tdie:         +73.2°C  (high = +70.0°C)
Tctl:         +73.2°C  

amdgpu-pci-0600
Adapter: PCI adapter
fan1:           0 RPM
edge:         +43.0°C  (crit = +120.0°C, hyst = +90.0°C)

@pmenzel
Collaborator Author

pmenzel commented Jun 16, 2020

The Nvidia drivers still need to be built.

@donald
Collaborator

donald commented Jul 7, 2020

root@sigusr2:~# /usr/sbin/nvidiactl start
insmod: ERROR: could not insert module /usr/share/nvidia/kernel/5.4.46.mx64.337/current/nvidia.ko: Unknown symbol in module
mknod: /dev/nvidia0: File exists
mknod: /dev/nvidiactl: File exists
insmod: ERROR: could not insert module /usr/share/nvidia/kernel/5.4.46.mx64.337/current/nvidia-uvm.ko: Unknown symbol in module
insmod: ERROR: could not insert module /usr/share/nvidia/kernel/5.4.46.mx64.337/current/nvidia-modeset.ko: Unknown symbol in module
insmod: ERROR: could not insert module /usr/share/nvidia/kernel/5.4.46.mx64.337/current/nvidia-drm.ko: Unknown symbol in module

...

[    0.000000] Linux version 5.4.46.mx64.337 (root@invidia.molgen.mpg.de) (gcc version 7.5.0 (GCC)) #1 SMP Tue Jun 16 23:32:15 CEST 2020
[    4.426438] nvidia: loading out-of-tree module taints kernel.
[    4.428262] nvidia: module license 'NVIDIA' taints kernel.
[    4.436802] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[    4.439889] nvidia: Unknown symbol ipmi_create_user (err -2)
[    4.441823] nvidia: Unknown symbol ipmi_destroy_user (err -2)
[    4.443741] nvidia: Unknown symbol ipmi_validate_addr (err -2)
[    4.445646] nvidia: Unknown symbol ipmi_free_recv_msg (err -2)
[    4.447492] nvidia: Unknown symbol ipmi_set_my_address (err -2)
[    4.449328] nvidia: Unknown symbol ipmi_request_settime (err -2)
[    4.452651] nvidia: Unknown symbol ipmi_set_gets_events (err -2)
[    4.514629] nvidia_uvm: Unknown symbol nvUvmInterfaceDisableAccessCntr (err -2)
[    4.516934] nvidia_uvm: Unknown symbol nvUvmInterfaceChannelDestroy (err -2)
[    4.519161] nvidia_uvm: Unknown symbol nvUvmInterfaceQueryCaps (err -2)
[...]
[    4.665373] nvidia_modeset: Unknown symbol nvidia_register_module (err -2)
[    4.665410] nvidia_modeset: Unknown symbol nvidia_get_rm_ops (err -2)
[    4.665449] nvidia_modeset: Unknown symbol nvidia_unregister_module (err -2)
[    4.681113] nvidia_drm: Unknown symbol nvKmsKapiGetFunctionsTable (err -2)


@pmenzel
Collaborator Author

pmenzel commented Jul 7, 2020

Sorry for not setting the WIP label.

You can work around it by manually loading the module ipmi_msghandler:

sudo modprobe ipmi_msghandler

My plan is to build it into the Linux kernel again. Too much time wasted again thanks to the Nvidia driver.
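
The workaround can be scripted so that the module is present before the Nvidia modules are inserted. The following is only a minimal sketch, not the actual `nvidiactl` logic: the function names and the `SYSFS_MODULES` variable are hypothetical, and it assumes that a present module shows up as a directory under `/sys/module`, which is not guaranteed for every built-in driver.

```shell
# Hypothetical helper, not part of nvidiactl.
# SYSFS_MODULES is an assumed override so the check can be tried without root.
SYSFS_MODULES=${SYSFS_MODULES:-/sys/module}

# Succeed if the named module has an entry below $SYSFS_MODULES.
module_present() {
    [ -d "$SYSFS_MODULES/$1" ]
}

# Load ipmi_msghandler if it is missing, then start the Nvidia driver.
start_nvidia() {
    if ! module_present ipmi_msghandler; then
        modprobe ipmi_msghandler || return 1
    fi
    /usr/sbin/nvidiactl start
}
```

Run as root, `start_nvidia` would replace the manual `modprobe` step above. Building the handler into the kernel makes the check succeed without any module loading, provided the built-in code exposes a `/sys/module` entry.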

The IPMI drivers are not needed on all systems, and we try to avoid
that interface. This also resolves a conflict with other watchdog
timers.

    handsomejack:~$ dmesg --level=err
    [   11.618887] watchdog: iTCO_wdt: cannot register miscdev on minor=130 (err=-16).
    [   11.627956] watchdog: iTCO_wdt: a legacy watchdog module is probably present.
    handsomejack:~$ dmesg | grep -e iTCO -e watchdog
    [   11.603138] iTCO_wdt: Intel TCO WatchDog Timer Driver v1.11
    [   11.609888] iTCO_wdt: Found a Wellsburg TCO device (Version=2, TCOBASE=0x0460)
    [   11.618887] watchdog: iTCO_wdt: cannot register miscdev on minor=130 (err=-16).
    [   11.627956] watchdog: iTCO_wdt: a legacy watchdog module is probably present.
    [   11.636462] iTCO_wdt: initialized. heartbeat=30 sec (nowayout=0)
    [   11.643679] iTCO_vendor_support: vendor-support=0

The Linux warning when shutting down *sympathyforthedevil* – not in the logs,
only on the monitor or the serial console – is also gone now, as the drivers
are no longer loaded automatically.

    [  189.063113] reboot: Power down
    [  189.068549] IPMI poweroff: Powering down via IPMI chassis control command
    [  189.075498] ------------[ cut here ]------------
    [  189.080259] sched: Unexpected reschedule of offline CPU#8!
    [  189.085898] WARNING: CPU: 0 PID: 1 at arch/x86/kernel/apic/ipi.c:67 native_smp_send_reschedule+0x34/0x40
    [  189.095605] Modules linked in: 8021q garp stp mrp llc amd64_edac_mod edac_mce_amd kvm_amd kvm input_leds led_class irqbypass ixgbe crc32c_intel acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables x_tables unix ipv6 nf_defrag_ipv6 autofs4
    [  189.118332] CPU: 0 PID: 1 Comm: systemd-shutdow Not tainted 5.4.39.mx64.334 #1
    [  189.125774] Hardware name: Supermicro Super Server/H11DSU-iN, BIOS 1.3 01/30/2020
    [  189.133482] RIP: 0010:native_smp_send_reschedule+0x34/0x40
    [  189.139114] Code: 05 31 9c 52 01 73 15 48 8b 05 a8 7f 2d 01 be fd 00 00 00 48 8b 40 30 e9 6a 8b db 00 89 fe 48 c7 c7 20 9e 21 82 e8 5c 1d 02 00 <0f> 0b c3 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 05 74 7f
    [  189.158198] RSP: 0018:ffffc9001892fbc8 EFLAGS: 00010086
    [  189.163571] RAX: 0000000000000000 RBX: ffff889faa6f5200 RCX: ffffffff82454348
    [  189.170858] RDX: 0000000000000001 RSI: 0000000000000092 RDI: ffffffff82b2cbec
    [  189.178139] RBP: 0000000000028b00 R08: 0000000000000796 R09: 0000000000000000
    [  189.185420] R10: ffffc9001892fbb8 R11: 00000000000000f0 R12: 0000000000000008
    [  189.192706] R13: 0000000000000000 R14: ffff889faa6f589c R15: 0000000000000046
    [  189.199988] FS:  00007f7a26e6f800(0000) GS:ffff889faec00000(0000) knlGS:0000000000000000
    [  189.208299] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [  189.214192] CR2: 00007f950950c8a0 CR3: 000000ff9f3b4000 CR4: 00000000003406f0
    [  189.221473] Call Trace:
    [  189.224068]  try_to_wake_up+0x3bd/0x5a0
    [  189.228045]  check_start_timer_thread.part.12+0x2a/0x50
    [  189.233418]  sender+0x65/0x70
    [  189.236527]  i_ipmi_request+0x2de/0x9d0
    [  189.240507]  ipmi_request_supply_msgs+0x102/0x130
    [  189.245358]  ipmi_request_in_rc_mode+0x2f/0x80
    [  189.249944]  ipmi_poweroff_chassis+0xa0/0x110
    [  189.254452]  __do_sys_reboot+0x150/0x1e0
    [  189.258517]  ? do_writev+0xd8/0x120
    [  189.262146]  ? do_writev+0xd8/0x120
    [  189.265779]  do_syscall_64+0x48/0x130
    [  189.269586]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [  189.274782] RIP: 0033:0x7f7a2662a2a3
    [  189.278501] Code: 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 89 fa be 69 19 12 28 bf ad de e1 fe b8 a9 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 f3 c3 0f 1f 00 48 8b 15 b1 4b 2c 00 f7 d8
    [  189.297584] RSP: 002b:00007ffed7660078 EFLAGS: 00000206 ORIG_RAX: 00000000000000a9
    [  189.305376] RAX: ffffffffffffffda RBX: 000000004321fedc RCX: 00007f7a2662a2a3
    [  189.312663] RDX: 000000004321fedc RSI: 0000000028121969 RDI: 00000000fee1dead
    [  189.319944] RBP: 0000000000000000 R08: 0000000000000040 R09: 0000000000000005
    [  189.327224] R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000
    [  189.334512] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
    [  189.341794] ---[ end trace 4c38720b40d3b851 ]---

Building them into the Linux kernel causes resource conflicts.

Resolves: #1821
> linux CONFIG_BLK_DEV_NBD should be "m" not "y"

> This option enables access to the in-kernel headers that are generated during
> the build process. These can be used to build eBPF tracing programs,
> or similar programs.  If you build the headers as a module, a module called
> kheaders.ko is built which can be loaded on-demand to get access to headers.

On several AMD server and desktop systems, we observe NMI stalls, which
sometimes even require a reboot. Add a patch by the Linux maintainer to
print more information in these cases.

1.  `CONFIG_SENSORS_K8TEMP=m`

    > If you say yes here you get support for the temperature sensor(s) inside
    > your CPU. Supported is whole AMD K8 microarchitecture. Please note that
    > you will need at least lm-sensors 2.10.1 for proper userspace support.
    >
    > This driver can also be built as a module. If so, the module will be
    > called k8temp.

2.  `CONFIG_SENSORS_K10TEMP=m`

    > If you say yes here you get support for the temperature sensor(s)
    > inside your CPU. Supported are later revisions of the AMD Family 10h and
    > all revisions of the AMD Family 11h, 12h (Llano), 14h (Brazos), 15h
    > (Bulldozer/Trinity/Kaveri/Carrizo) and 16h (Kabini/Mullins)
    > microarchitectures.
    >
    > This driver can also be built as a module. If so, the module will be
    > called k10temp.

3.  `CONFIG_SENSORS_FAM15H_POWER=m`

    > If you say yes here you get support for processor power information
    > of your AMD family 15h CPU.
    >
    > This driver can also be built as a module. If so, the module will be
    > called fam15h_power.
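
Whether a kernel configuration really builds these three drivers as modules can be checked with a small helper. This is a sketch under assumptions: `config_state` is a hypothetical name, and it expects a plain-text configuration file such as the `.config` in the build tree.

```shell
# Hypothetical helper: print the state of a Kconfig option in a config file.
# Prints the configured value (y, m, or a string/number), or "n" if the
# option is not set. The default path .config is an assumption.
config_state() {
    option=$1
    config_file=${2:-.config}
    value=$(sed -n "s/^${option}=//p" "$config_file")
    printf '%s\n' "${value:-n}"
}

# Example: report the three sensor options listed above.
check_sensors() {
    for opt in CONFIG_SENSORS_K8TEMP CONFIG_SENSORS_K10TEMP \
               CONFIG_SENSORS_FAM15H_POWER; do
        printf '%s=%s\n' "$opt" "$(config_state "$opt" "$1")"
    done
}
```

On a running system, the configuration may be available as `/proc/config.gz` if the kernel was built with that support; decompress it to a file first and pass that path as the second argument. The same helper would apply to `CONFIG_BLK_DEV_NBD`.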

Building ipmi_msghandler as a module causes – as always – problems with
the proprietary Nvidia driver. For whatever reason, the driver depends on
functions from that module but is unable to load it itself –
probably because of our mxgfx indirection.

    2020-06-17T13:56:09.272068+02:00 sigchld kernel: [    0.000000] Linux version 5.4.46.mx64.337 (root@invidia.molgen.mpg.de) (gcc version 7.5.0 (GCC)) #1 SMP Tue Jun 16 23:32:15 CEST 2020
    […]
    2020-06-17T13:56:09.322119+02:00 sigchld kernel: [    3.907200] nvidia: loading out-of-tree module taints kernel.
    2020-06-17T13:56:09.322140+02:00 sigchld kernel: [    3.911716] nvidia: module license 'NVIDIA' taints kernel.
    2020-06-17T13:56:09.333611+02:00 sigchld kernel: [    3.923028] nvidia: module verification failed: signature and/or required key missing - tainting kernel
    2020-06-17T13:56:09.333620+02:00 sigchld kernel: [    3.926029] nvidia: Unknown symbol ipmi_create_user (err -2)
    2020-06-17T13:56:09.335472+02:00 sigchld kernel: [    3.927879] nvidia: Unknown symbol ipmi_destroy_user (err -2)
    2020-06-17T13:56:09.337338+02:00 sigchld kernel: [    3.929720] nvidia: Unknown symbol ipmi_validate_addr (err -2)
    2020-06-17T13:56:09.337342+02:00 sigchld kernel: [    3.931552] nvidia: Unknown symbol ipmi_free_recv_msg (err -2)
    2020-06-17T13:56:09.339180+02:00 sigchld kernel: [    3.933377] nvidia: Unknown symbol ipmi_set_my_address (err -2)
    2020-06-17T13:56:09.341000+02:00 sigchld kernel: [    3.935221] nvidia: Unknown symbol ipmi_request_settime (err -2)
    2020-06-17T13:56:09.342899+02:00 sigchld kernel: [    3.937102] nvidia: Unknown symbol ipmi_set_gets_events (err -2)
    2020-06-17T13:56:09.385602+02:00 sigchld kernel: [    3.975577] nvidia_uvm: Unknown symbol nvUvmInterfaceDisableAccessCntr (err -2)
    2020-06-17T13:56:09.385614+02:00 sigchld kernel: [    3.977740] nvidia_uvm: Unknown symbol nvUvmInterfaceChannelDestroy (err -2)
    2020-06-17T13:56:09.385615+02:00 sigchld kernel: [    3.979796] nvidia_uvm: Unknown symbol nvUvmInterfaceQueryCaps (err -2)
    2020-06-17T13:56:09.387549+02:00 sigchld kernel: [    3.981756] nvidia_uvm: Unknown symbol nvUvmInterfaceUnsetPageDirectory (err -2)
    2020-06-17T13:56:09.389361+02:00 sigchld kernel: [    3.983558] nvidia_uvm: Unknown symbol nvUvmInterfaceInitAccessCntrInfo (err -2)
    2020-06-17T13:56:09.391153+02:00 sigchld kernel: [    3.985352] nvidia_uvm: Unknown symbol nvUvmInterfaceReleaseChannel (err -2)
    2020-06-17T13:56:09.392781+02:00 sigchld kernel: [    3.986986] nvidia_uvm: Unknown symbol nvUvmInterfaceMemoryAllocSys (err -2)
    2020-06-17T13:56:09.394816+02:00 sigchld kernel: [    3.989018] nvidia_uvm: Unknown symbol nvUvmInterfaceMemoryCpuMap (err -2)
    2020-06-17T13:56:09.398324+02:00 sigchld kernel: [    3.992539] nvidia_uvm: Unknown symbol nvUvmInterfaceRetainChannelResources (err -2)
    2020-06-17T13:56:09.403240+02:00 sigchld kernel: [    3.997423] nvidia_uvm: Unknown symbol nvUvmInterfacePmaFreePages (err -2)
    […]

So partly revert commit 32c9443 (linux-5.4.46: Build IPMI drivers as
modules), and build ipmi_msghandler into the Linux kernel.
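
To trace `Unknown symbol` messages like the ones above back to the module that exports them, something along these lines may help. It is a sketch, not part of this change: both function names are hypothetical, and it assumes the `modules.symbols` index written by `depmod`, which contains lines of the form `alias symbol:<name> <module>` and only covers symbols exported by modules, not by built-in code.

```shell
# Hypothetical helper: read kernel log lines on stdin and print
# "<module> <symbol>" for every "Unknown symbol" message.
unknown_symbols() {
    sed -n 's/.*] \([A-Za-z0-9_]*\): Unknown symbol \([A-Za-z0-9_]*\) (err.*/\1 \2/p'
}

# Hypothetical helper: print the module that exports the given symbol,
# according to modules.symbols (path overridable as second argument).
symbol_provider() {
    symbols_file=${2:-/lib/modules/$(uname -r)/modules.symbols}
    awk -v sym="symbol:$1" '$1 == "alias" && $2 == sym { print $3 }' "$symbols_file"
}
```

For example, `dmesg | unknown_symbols` lists the pairs, and `symbol_provider ipmi_create_user` is expected to name `ipmi_msghandler` when that driver is built as a module.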

Fix a cosmetic issue where two lines that belong together are printed
as separate log messages.  The output below is now printed on one line.

1.  old:

        [    0.979142] pci 0000:00:00.2: AMD-Vi: Extended features (0xf77ef22294ada):
        [    0.979546]  PPR NX GT IA GA PC GA_vAPIC

2.  new:

        [    0.979142] pci 0000:00:00.2: AMD-Vi: Extended features (0xf77ef22294ada): PPR NX GT IA GA PC GA_vAPIC

This simplifies the interpretation of the values, as it is a bitmask.
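
As an illustration of reading such a bitmask, the sketch below decodes the low ten bits of the AMD-Vi extended features value from the log line above. The name table is an assumption recalled from the kernel's AMD IOMMU driver, not taken from this pull request, and `decode_features` is a hypothetical helper; `GA_vAPIC` comes from a separate field and is not covered.

```shell
# Hypothetical helper: print the names of the set bits among bits 0..9.
# The bit-to-name table below is an assumption, see the text above.
decode_features() {
    value=$(( $1 ))
    out=""
    bit=0
    for name in PreF PPR X2APIC NX GT '[5]' IA GA HE PC; do
        if [ $(( (value >> bit) & 1 )) -eq 1 ]; then
            out="${out:+$out }$name"
        fi
        bit=$(( bit + 1 ))
    done
    printf '%s\n' "$out"
}

decode_features 0xf77ef22294ada   # prints: PPR NX GT IA GA PC
```

The low twelve bits of the value are `0xada`, binary `1010 1101 1010`, so bits 1, 3, 4, 6, 7 and 9 are set within the decoded range, which matches the names in the log.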
@pmenzel pmenzel merged commit a3c2fb4 into master Jul 28, 2020
@pmenzel pmenzel deleted the add-linux-5.4.46 branch July 28, 2020 15:19