
job wait time unbound in certain conditions #125

Closed
donald opened this issue Mar 20, 2022 · 1 comment · Fixed by mariux64/bee-files#2600

Comments

Contributor

donald commented Mar 20, 2022

We had a job that can only run on one specific server because of its very high memory demand. That server had two other, smaller jobs running and correctly went into the "WAITING" state. However, the server kept accepting smaller jobs, delaying the big job for an unreasonably long time (until the queues of smaller jobs were exhausted).

This is what is currently believed to be the problem:

A WAITING server only accepts jobs from users who have not already used their "fair share" of the cluster, which is defined as number_of_slots_running/number_of_users. This might more or less work when we have lots of servers with equal resources. However, it is easy to show by example that this doesn't always work:

Assume we have only one server, or only one server capable of starting the jobs in question. Assume we have two users A1 and A2, each with a long queue of jobs that require 50% of the server, and a user B with a single job that requires 100%. Assume jobs from A1 and A2 are running. Whenever a job of A1 or A2 completes, that user has zero jobs running in the cluster and so has not used their "fair share". The server will therefore start another job for that A user, and the job for B will not get started for as long as A1 and A2 have jobs queued.
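The A1/A2/B scenario can be reproduced with a toy simulation. This is only a sketch under assumptions, not the actual scheduler code: the function name `simulate`, the treatment of "slots" as percent of one server, the `<=` comparison against the fair share, and the rule that the oldest running job completes first are all made up for illustration.

```python
from collections import deque


def simulate(a_queue_len, capacity=100):
    """Return the order in which users get a job started on one server."""
    # Hypothetical queues: sizes are percent of the single server.
    queues = {
        "A1": deque([50] * a_queue_len),
        "A2": deque([50] * a_queue_len),
        "B": deque([100]),
    }
    running = []  # (user, size) pairs, oldest first
    started = []

    def start(user):
        running.append((user, queues[user].popleft()))
        started.append(user)

    def schedule():
        # Keep starting jobs until no eligible job fits any more.
        while True:
            used = sum(size for _, size in running)
            fair_share = used / len(queues)  # slots_running / number_of_users
            for user, queue in queues.items():
                user_slots = sum(s for u, s in running if u == user)
                # Assumed admission rule: the user is not above the fair
                # share and the next job fits into the free capacity.
                if queue and user_slots <= fair_share and queue[0] <= capacity - used:
                    start(user)
                    break
            else:
                return

    # Initial state from the example: one job each of A1 and A2 is running.
    start("A1")
    start("A2")
    while running:
        running.pop(0)  # the oldest running job completes
        schedule()
    return started


print(simulate(3))
```

With these assumptions, B's job is started only after both A queues are empty, so its wait grows with the length of those queues.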

Another problematic case is two "big" users B1 and B2 who each need 100% of the server. The server's user list is currently (stable-)sorted by the number of slots running in the cluster, so users with fewer running slots globally get precedence. However, when B1 and B2 can only run one job at a time on a single server, they both have zero global slots running whenever the server is free. They therefore never change positions in the server's user list, and the same one of the two is always preferred.
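The effect of the stable sort can be shown in a few lines. This is a minimal sketch, not the real implementation: `pick_order`, `last_started`, and the `tie_break` flag are hypothetical names, and the tie-breaker (prefer the user whose last job start is oldest) is just one conceivable way to reorder users with equal slot counts, not necessarily what the eventual fix does.

```python
def pick_order(rounds, tie_break=False):
    """Return which user is preferred each time the server becomes free."""
    users = ["B1", "B2"]
    last_started = {u: -1 for u in users}  # hypothetical bookkeeping
    picks = []
    for round_no in range(rounds):
        # The server is free, so both users have zero slots running.
        slots = {u: 0 for u in users}
        if tie_break:
            # Assumed tie-breaker: among equal slot counts, prefer the
            # user who was started least recently.
            users.sort(key=lambda u: (slots[u], last_started[u]))
        else:
            # Stable sort on slots only: equal keys keep their old order.
            users.sort(key=lambda u: slots[u])
        winner = users[0]
        last_started[winner] = round_no
        picks.append(winner)
    return picks


print(pick_order(4))                  # same user every time
print(pick_order(4, tie_break=True))  # users alternate
```

Without a tie-breaker the list order never changes, because a stable sort leaves elements with equal keys where they are.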

When the server needs free resources, it should not blindly accept jobs from (other) users who are below their "fair share", because all users with zero running jobs are below that value.

It would be difficult to take queue/waiting time into account (because the server does not load the jobs, just the groups). But we might need to find a way to avoid the B1/B2 unfairness described above by changing the order of users who have the same number of jobs running globally.

Contributor Author

donald commented Mar 22, 2022

Closed by #126
