job wait time unbound in certain conditions #125

donald · 2022-03-20T09:35:38Z

We had a job that can only run on one specific server because of its very high memory demand. The server had two other, smaller jobs running and correctly went into the "WAITING" state. However, the server kept accepting smaller jobs, delaying the big job for an unreasonably long time (until the queues of smaller jobs were exhausted).

This is what is currently believed to be the problem:

A WAITING server accepts jobs only for users who have not already used their "fair share" of the cluster, which is defined as number_of_slots_running/number_of_users. This might more or less work when we have lots of servers with equal resources. However, it is easy to show by example that this doesn't always work:

Assume we have only one server, or only one server capable of starting the jobs in question. Assume we have two users, A1 and A2, each with a long queue of jobs that require 50% of the server, and a user B with a single job that requires 100%. Assume jobs from A1 and A2 are running. Whenever a job of A1 or A2 completes, that user has zero jobs running in the cluster and is therefore below their "fair share". So the server will start another job for that A user, and the job for B will not get started as long as A1 and A2 have jobs queued.
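The A1/A2/B scenario can be reproduced with a small simulation. This is only a sketch of the rule as described above, not mxq's actual code: the function name `simulate`, the one-slot-per-job accounting, and the "oldest job finishes first" order are all assumptions made for illustration.

```python
from collections import deque


def simulate(rounds=10):
    """Return the users whose jobs get started on a single WAITING server."""
    capacity = 100  # percent of the one server that can run these jobs
    # Hypothetical model: each user has a queue of job sizes (percent of server).
    queues = {
        "A1": deque([50] * rounds),
        "A2": deque([50] * rounds),
        "B": deque([100]),
    }
    # Initial state from the example: one job each from A1 and A2 is running.
    running = [("A1", queues["A1"].popleft()), ("A2", queues["A2"].popleft())]
    started = []
    for _ in range(rounds):
        running.pop(0)  # assumption: the oldest running job completes
        used = sum(size for _, size in running)
        # "Fair share" as described: slots running / number of users
        # (one slot per job in this simplified model).
        fair_share = len(running) / len(queues)
        for user, queue in queues.items():
            mine = sum(1 for u, _ in running if u == user)
            fits = bool(queue) and used + queue[0] <= capacity
            if fits and mine < fair_share:
                running.append((user, queue.popleft()))
                started.append(user)
                break
    return started


if __name__ == "__main__":
    print(simulate(10))
```

In this model B is below the "fair share" at every step, but its job never fits into the free 50%, so the slot is handed back to whichever A user just finished.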

Another problematic case is two "big" users, B1 and B2, who each need 100% of the server. The server's user list is currently (stable-)sorted by the number of slots running in the cluster, so users with fewer running slots globally get precedence. However, when B1 and B2 can each run only one job at a time, and only on this single server, both have zero global slots running whenever the server is free. The two users therefore never change positions in the server's user list, and one of them is always preferred when the server is free.
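The effect of the stable sort can be shown in a few lines. Again this is a simplified sketch, not the scheduler's code: `pick_order` and its data layout are hypothetical, and it assumes the running job has finished each time the server picks the next user.

```python
def pick_order(queue_len=3):
    """Return the order in which B1 and B2 get the (single) free server."""
    pending = {"B1": queue_len, "B2": queue_len}
    users = ["B1", "B2"]  # the server's user list
    order = []
    while any(pending.values()):
        # The server is free, so nobody has any slots running in the cluster.
        running = {u: 0 for u in users}
        # list.sort() is stable: users with equal keys keep their positions.
        users.sort(key=lambda u: running[u])
        winner = next(u for u in users if pending[u])
        pending[winner] -= 1
        order.append(winner)
    return order


if __name__ == "__main__":
    print(pick_order(3))  # ['B1', 'B1', 'B1', 'B2', 'B2', 'B2']
```

B2 only gets the server once B1's queue is empty, because equal sort keys never reorder the list.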

When the server needs to free up resources, it should not blindly accept jobs from (other) users who are below their "fair share", because all users with zero running jobs are below that value.

It would be difficult to take queue/waiting time into account (because the server does not load the jobs, only the groups). But we might need to find a way to avoid the B1/B2 unfairness described above by changing the order of users with the same number of global jobs running.
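One conceivable way to change the order among tied users is to move a user to the end of the list whenever one of their jobs is started, so that the stable sort then favours whoever was served least recently. The sketch below only illustrates that idea under the same simplified model (one free server, two users with equal keys); `pick_order_rotating` is a hypothetical name, and this is not necessarily what the eventual fix does.

```python
def pick_order_rotating(queue_len=3):
    """Like a plain stable sort, but the last-served user goes to the back."""
    pending = {"B1": queue_len, "B2": queue_len}
    users = ["B1", "B2"]  # the server's user list
    order = []
    while any(pending.values()):
        # The server is free, so every user has zero global slots running.
        running = {u: 0 for u in users}
        users.sort(key=lambda u: running[u])  # stable: ties keep list order
        winner = next(u for u in users if pending[u])
        pending[winner] -= 1
        order.append(winner)
        # Assumed tie-break: rotate the served user behind the others.
        users.remove(winner)
        users.append(winner)
    return order


if __name__ == "__main__":
    print(pick_order_rotating(3))  # ['B1', 'B2', 'B1', 'B2', 'B1', 'B2']
```

This needs no knowledge of individual jobs or their waiting times, only of which user was served last.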

donald · 2022-03-22T09:11:34Z

Closed by #126

donald mentioned this issue Mar 20, 2022

0.30.3 #126

Merged

donald closed this as completed Mar 22, 2022

donald mentioned this issue Mar 22, 2022

mxq: Update version from 0.30.2 to 0.30.3 mariux64/bee-files#2600

Merged
