
Limit user resources on *geniux* #126

Open · wants to merge 3 commits into master from limit-user-resources-on-geniux

Conversation

pmenzel
Contributor

@pmenzel pmenzel commented May 11, 2020

Tested on stitch, and geniux.

Out of ignorance or inattention, users run calculations on our gateway
server *geniux*, affecting all other users.

Prevent that technically by limiting the resources to one CPU and ten
percent of the memory. See systemd.resource-control(5) for more details.

The current resource limits for user id 133 can be checked with
`systemd-cgls` and `systemctl status user-133.slice`.

Users can still cripple the system with high IO and network load.
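The drop-in itself is not shown in the description. A minimal sketch of what such limits could look like via systemd.resource-control(5), assuming a drop-in on the catch-all `user-.slice` (the path and file name are illustrative, not taken from the actual commits):

```ini
# /etc/systemd/system/user-.slice.d/50-limits.conf (hypothetical path)
[Slice]
# CPUQuota=100% corresponds to one full CPU's worth of time
CPUQuota=100%
# MemoryMax accepts a percentage of physical memory
MemoryMax=10%
```

After `systemctl daemon-reload`, the effective limits show up per user in `systemctl status user-<uid>.slice`, as mentioned above.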
@pmenzel pmenzel force-pushed the limit-user-resources-on-geniux branch from 60ac109 to feef63b Compare May 11, 2020 12:53
@wwwutz
Contributor

wwwutz commented May 11, 2020

I'll object to any form of resource limiting by a "percentage" value. This is wrong.

@pmenzel
Contributor Author

pmenzel commented May 11, 2020

Care to elaborate?

Anyway, please suggest absolute values then.

@donald
Collaborator

donald commented May 11, 2020

Good idea. We should try.

@wwwutz
Contributor

wwwutz commented May 11, 2020

You already suggested values by setting it to 10%. 10% of "tested on X" or 10% of "tested on Y"... I don't really know what you would like to set it to. You had 3 GB on stitch and 6 GB on geniux, so which one would it be? If you had tested it on nomnomnom, it would be 200 GB... That is my objection to percentage values: they age like milk.

The solution would be to reserve memory and CPU for a certain user, not to restrict memory & CPU for all others. Then this user (root) should be able to fix the system, manually or with magic.

You still end up with an unresponsive system when several people go to their limits.

Do not apply the user resource limits to user *root*.
Absolute values are preferred by some, so arbitrarily choose 3 GB.
(Before, it would have been around 6 GB on *geniux*, which seems
excessive.)
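The root exemption from the commit message above could be sketched as a more specific drop-in on `user-0.slice`, which takes precedence over settings on the catch-all `user-.slice` (the path is hypothetical):

```ini
# /etc/systemd/system/user-0.slice.d/50-no-limits.conf (hypothetical path)
[Slice]
# an empty assignment resets any CPUQuota inherited from a broader drop-in
CPUQuota=
# lift the memory cap for root
MemoryMax=infinity
```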
@pmenzel
Contributor Author

pmenzel commented May 11, 2020

It depends on what your goals are. Limiting resources is like scheduling: it is only a heuristic.

Relative values also stay current.

The solution would be to reserve memory and CPU to a certain user, not restricting memory & CPU to all others.

How should that work?

Then this user ( root ) should be able to fix the system. manually or with magic.

Good point. I'll try to exclude that user.

you still end up in a unresponsive system when several people go to limits.

  1. Luckily that is not our experience.
  2. That will be possible with every heuristic (which tries to maintain some degree of usability).

The proposed solution will hopefully solve the majority of the issue we were seeing with our gateway server.

@donald
Collaborator

donald commented May 18, 2020

@wwwutz: is a 3 GB/user limit okay with you for a test run? Last time we had a berserk user process on geniux, I needed over 15 minutes to log in to geniux via bka, identify the process, and kill it. During that time nobody could work from home. So I think this is a problem we need to address with the options we have.

@donald
Collaborator

donald commented May 26, 2020

We found out that geniux still has a swap file, which might explain the laaaaaag when running out of physical memory. Without a swap file and with a restrictive overcommit policy, the problematic user jobs probably would have died right away. So, as an alternative, we might disable the swap on geniux.
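The alternative sketched here (no swap plus a restrictive overcommit policy) maps to two standard kernel knobs; this is a hypothetical sysctl fragment, not geniux's actual configuration:

```ini
# /etc/sysctl.d/90-no-overcommit.conf (hypothetical file name)
# 2 = strict accounting: allocations beyond the commit limit fail immediately
vm.overcommit_memory = 2
# with swap disabled, allow committing up to 100% of physical RAM
vm.overcommit_ratio = 100
```

Swap itself would be turned off with `swapoff -a` and the swap entry removed from /etc/fstab.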

@wwwutz
Contributor

wwwutz commented May 26, 2020

disable the swap on geniux

This.

Thanks, Grandpa

@pmenzel
Contributor Author

pmenzel commented Jul 3, 2020

Is the claim regarding swap still valid after reading the article *In defence of swap: common misconceptions*, which you shared on June 30th?

At least in June, there were still some out of memory situations.

[Mon Jun 22 20:31:53 2020] Tasks state (memory values in pages):
[Mon Jun 22 20:31:53 2020] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[Mon Jun 22 20:31:53 2020] [  39479]  4888 39479     8818     1259   102400        0             0 systemd
[Mon Jun 22 20:31:53 2020] [  39480]  4888 39480    55908     1237   196608        0             0 (sd-pam)
[Mon Jun 22 20:31:53 2020] [  43058]  4888 43058     7177      206    77824       15             0 tmux: server
[Mon Jun 22 20:31:53 2020] [  43128]  4888 43128     5586      146    77824        0             0 bash
[Mon Jun 22 20:31:53 2020] [  11029]     0 11029    14705     1570   155648        0             0 sshd
[Mon Jun 22 20:31:53 2020] [  11061]  4888 11061    14705     1118   151552        0             0 sshd
[Mon Jun 22 20:31:53 2020] [  11062]  4888 11062     5586      972    86016        0             0 bash
[Mon Jun 22 20:31:53 2020] [  13925]  4888 13925  5142592   786063  6782976        0             0 java
[Mon Jun 22 20:31:53 2020] Memory cgroup out of memory: Kill process 13925 (java) score 1001 or sacrifice child
[Mon Jun 22 20:31:53 2020] Killed process 13925 (java) total-vm:20570368kB, anon-rss:3127316kB, file-rss:16936kB, shmem-rss:0kB
[Mon Jun 22 20:31:53 2020] oom_reaper: reaped process 13925 (java), now anon-rss:0kB, file-rss:4kB, shmem-rss:0kB
[Mon Jun 22 20:35:05 2020] java invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=0
[Mon Jun 22 20:35:05 2020] java cpuset=/ mems_allowed=0
[Mon Jun 22 20:35:05 2020] CPU: 14 PID: 14442 Comm: java Kdump: loaded Not tainted 4.19.57.mx64.282 #1
[Mon Jun 22 20:35:05 2020] Hardware name: Dell Inc. PowerEdge R420/0CN7CM, BIOS 1.6.7 08/30/2013

@wwwutz
Contributor

wwwutz commented Jul 3, 2020

Claim 3 is wrong: if I have no swap, I prevent I/O from the swapper.

Claim 5 is a typical "we all have SSDs anyway" argument. We don't; we have slow disks.

Claim 6 refers to the OOM killer, which in my opinion has no business being in the system as long as you don't teach it which processes it must not touch.

@donald
Collaborator

donald commented Jul 3, 2020

Claim 3 is wrong: if I have no swap, I prevent I/O from the swapper.

The argument there is that the I/O happens anyway. Without swap, anonymous pages (those without a file backend) cannot be reclaimed, so file-backed pages would be evicted instead whenever free pages are needed. The I/O then goes through the filesystems rather than through swap, so it would still be there. In fact, the I/O would be higher: because the inactive, dirty anonymous pages are not available for reclaim, active pages would have to be freed, and those then have to be faulted back in.

@pmenzel
Contributor Author

pmenzel commented Jul 10, 2020

Maybe it got lost in the long discussion, but I claim limiting the resources is still useful and needed, as disabling swap did not help.

@donald
Collaborator

donald commented Jul 10, 2020

Was the system in an unusable state after the swap had been disabled?

@pmenzel
Contributor Author

pmenzel commented Jul 10, 2020

Yes, the OOM killer became active. (I only saw it in the logs, but having processes killed belongs to the unusable category for me.)

@wwwutz
Contributor

wwwutz commented Jul 13, 2020

"the OOM became active" and "belongs to the unusable category to me".

That is not what this was about. It was about no longer being able to maintain a system, because you cannot log in, because it became unresponsive (due to high I/O).

That has nothing to do with an OOM killer. The OOM killer is merely annoying, but it does not make the system unresponsive. And if it does, it is doing something wrong and has to go.

veto.

@pmenzel
Contributor Author

pmenzel commented Jul 13, 2020

This merge/pull request is about limiting the memory resources of processes, because they were making geniux unusable. The OOM killer often kicks in too late (which is why oomd was written as an alternative, for example). I have no idea what this has to do with I/O.

What am I missing? I have the feeling we are talking past each other.

@wwwutz
Contributor

wwwutz commented Jul 13, 2020

OK, then let me put it this way: No, I do not consider it sensible to limit the memory resources of individual users or processes on a machine, for whatever reason. No, I do not want that. No. No. No. I fear that this special configuration of a single server will cause us more problems than it solves. So: veto against this pull request.

@pmenzel
Contributor Author

pmenzel commented Oct 8, 2020

And another occurrence.

X  31997  0.0  0.0  28620  3560 ?        Ss   14:34   0:00 tmux new -s exprimacon
X  31998  0.0  0.0  23680  5016 pts/38   Ss   14:34   0:00  \_ -bash
X  32086  0.0  0.0  20088  3056 pts/38   S+   14:34   0:00      \_ /bin/bash /usr/bin/python3
X  32088 98.2 49.2 51078328 32334248 pts/38 Rl+ 14:34  82:42          \_ python3

I am missing an alternative proposal to fix this annoying problem.

@donald donald force-pushed the master branch 2 times, most recently from 91ee9fc to 647c337 Compare November 29, 2023 09:21

3 participants