
mxgrub: Add --reboot #87

Merged
merged 1 commit into from
May 8, 2019

Conversation

donald
Collaborator

@donald donald commented May 8, 2019

Add a new command `mxgrub --reboot` which reboots to the selected kernel
via kexec.

Notes:

* Do a sync before triggering the reboot, so that services are still
available while the bulk of dirty pages is written, which can take
a considerable time on big machines.

* kexec (without -l, -p, -e, or -f) loads the kernel and then calls
shutdown, which is handled by systemd. So this is a full shutdown:
systemd will stop services, unmount filesystems and sync all disks
(again). It just recognizes the loaded kernel and, in the very last
step, calls the equivalent of `kexec -e` instead of `reboot -f`.
We only skip the EFI/BIOS initialization and grub and kernel
loading.

* When an NFS server restarts, clients currently get 90 seconds to
renew their leases. Some NFS file operations from the clients are
blocked during this time, which adds to the time a workstation
appears to be frozen to the user. We could consider reducing the
lease_time of our servers. The disadvantage is that the clients
need to send RENEW requests more often (after 2/3 of the lease
time, so every minute currently) and might lose lock state if the
network is interrupted for a longer time.

* Strictly speaking, kexec() does not have much to do with grub. However,
in our implementation of the grub-based boot it is related, because
the record of which kernel to boot next (the selected kernel)
is kept in the grub environment file. So if we want the kexec-based
reboot to start the same kernel as a reboot through grub
would, we need to read the grub environment file. This should be
enough of an excuse to implement the command in mxgrub.

* Loading the new kernel might fail on machines where not enough
contiguous physical memory is available because of fragmentation.
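
The flow described in the notes above (read the selected kernel from the grub environment file, sync, then let kexec load the kernel and hand over to shutdown) could be sketched roughly as follows. This is an illustration, not the code in this PR: the variable name `mxgrub_kernel`, the `/boot` layout, and the choice of `--reuse-cmdline` are assumptions made up for the example, and an initrd is assumed not to be needed.

```python
def parse_grubenv(text):
    """Parse the key=value entries of a GRUB environment block."""
    env = {}
    for line in text.splitlines():
        # The header line and the trailing padding consist of '#' characters.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            env[key] = value
    return env


def kexec_reboot_commands(env, kernel_var="mxgrub_kernel", boot_dir="/boot"):
    """Return the commands for a kexec-based reboot, in execution order.

    kernel_var and boot_dir are hypothetical; the real name of the
    variable holding the selected kernel is not stated in this PR.
    """
    kernel = env[kernel_var]
    return [
        # Flush dirty pages first, while services are still running.
        ["sync"],
        # Without -l/-p/-e/-f, kexec loads the kernel and then calls
        # shutdown. --reuse-cmdline (an assumption here) passes the
        # command line of the running kernel to the new one.
        ["kexec", "--reuse-cmdline", f"{boot_dir}/{kernel}"],
    ]


if __name__ == "__main__":
    # Print the commands instead of running them; running them reboots.
    with open("/boot/grub/grubenv") as f:  # path is an assumption
        for cmd in kexec_reboot_commands(parse_grubenv(f.read())):
            print(" ".join(cmd))
```

The commands are returned rather than executed so that the selection logic can be checked without rebooting the machine.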

Timings:

* reboot theinternet (to lightdm login prompt) :  47 ->  22 seconds
* reboot claptrap (to console login prompt)    : 160 ->  59 seconds
* reboot claptrap (to NFS server available)    : 165 ->  43 seconds
* reboot claptrap (to locks available)         : 262 -> 141 seconds
* reboot nsa: 25 seconds
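
For reference, the arithmetic behind these numbers can be worked out with a short script. It assumes that the value left of the arrow is a regular reboot and the value right of it the kexec reboot. It also computes the wait between "NFS server available" and "locks available", which comes out at 97 and 98 seconds, so it seems unaffected by kexec and is roughly consistent with the 90-second lease time mentioned in the notes. The function names are made up for this sketch.

```python
# (before, after) in seconds, copied from the timings above.
# Assumption: "before" is a regular reboot, "after" the kexec reboot.
TIMINGS = {
    "theinternet (lightdm login prompt)": (47, 22),
    "claptrap (console login prompt)": (160, 59),
    "claptrap (NFS server available)": (165, 43),
    "claptrap (locks available)": (262, 141),
}

LEASE_TIME = 90  # seconds, as stated in the notes


def saving(before, after):
    """Return (seconds saved, percent saved rounded to an integer)."""
    saved = before - after
    return saved, round(100 * saved / before)


def renew_interval(lease_time):
    """Clients renew after 2/3 of the lease time (per the notes)."""
    return lease_time * 2 // 3


def lock_wait(server_available, locks_available):
    """Seconds between the NFS server answering and locks being usable."""
    return locks_available - server_available


if __name__ == "__main__":
    for name, (before, after) in TIMINGS.items():
        saved, percent = saving(before, after)
        print(f"{name}: {saved} s saved ({percent} %)")
    print("renew interval:", renew_interval(LEASE_TIME), "s")
    print("lock wait before:", lock_wait(165, 262), "s")
    print("lock wait after:", lock_wait(43, 141), "s")
```

A shorter lease time would shrink the lock wait but also the renew interval, for example 20 seconds for a 30-second lease, which is the trade-off described in the notes.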

@pmenzel
Contributor

pmenzel commented May 8, 2019

  1. Please call the switch --kexec or --kexec-reboot to avoid ambiguity.
  2. Nice timings, but without comparison to a real reboot we do not know if it is good.

@donald
Collaborator Author

donald commented May 8, 2019

I'd say the high-level request is to reboot, and kexec is only the means to do so. So if we find kexec eats our children, we can replace "kexec" with other code internally, and callers of `mxgrub --reboot` can continue to do what they are used to. Also, if we find an even better method, we could expand to `mxgrub --reboot [--kexec | --kexec-v2]`.

Okay, you get your numbers. On theinternet it might be infinite, because from time to time the system hangs in the BIOS during reboot.

@wwwutz
Contributor

wwwutz commented May 8, 2019

We should count from

to

as 2 points.

@donald
Collaborator Author

donald commented May 8, 2019

Possible problems: systems with crazy memory architecture, strange hardware which requires a BIOS reset, failure in out-of-memory situations. We'll need to gain some experience. Otherwise, it might even be safer to reboot this way, because we avoid all problems with BIOS initialization or grub.

@donald
Collaborator Author

donald commented May 8, 2019

nsa : 25 seconds

@wwwutz
Contributor

wwwutz commented May 8, 2019

> Otherwise, it might even be safer to reboot this way, because we avoid all problems with BIOS initialization or grub.

I doubt that. I'd prefer to take kexec as a last resort. So when you fucked up your grub config in January, you will only notice that your system won't boot anymore in December... Murphy dictates a power outage on the 24th of December. That'll be fun.

@pmenzel pmenzel merged commit 86e970a into master May 8, 2019
@pmenzel
Contributor

pmenzel commented May 8, 2019

Which method should be used by default can be decided independently of having this option.
