gpu-setup: job_release: job with 10882 has no GPU locked #139

Open
donald opened this issue Dec 22, 2022 · 0 comments

donald commented Dec 22, 2022

Three mxqd servers terminated with this error.

Log from jabberwocky:

2022-12-13 15:29:02 +0100 mxqd[1216]: mxqd - MXQ v0.30.8 (beta)
2022-12-13 15:29:02 +0100 mxqd[1216]:   by Marius Tolzmann <marius.tolzmann@molgen.mpg.de> 2013-2022
2022-12-13 15:29:02 +0100 mxqd[1216]:      and Donald Buczek <buczek@molgen.mpg.de> 2015-2022
2022-12-13 15:29:02 +0100 mxqd[1216]:   Max Planck Institute for Molecular Genetics - Berlin Dahlem
2022-12-13 15:29:02 +0100 mxqd[1216]: hostname=jabberwocky.molgen.mpg.de daemon_name=main daemon_id=11995 :: MXQ server started.
2022-12-13 15:29:02 +0100 mxqd[1216]:   host_id=b0cb98b1-7e0e-45f8-86d3-1b24381c6355-e24-4c0
2022-12-13 15:29:02 +0100 mxqd[1216]: slots=32 memory_total=350000 memory_avg_per_slot=10938 memory_limit_slot_soft=350000 memory_limit_slot_hard=350000 :: server initialized.
2022-12-13 15:29:02 +0100 mxqd[1216]: cpu set available: [0-31]
2022-12-13 15:29:02 +0100 mxqd[1216]: recover: 3 running groups loaded.
2022-12-13 15:29:02 +0100 mxqd[1216]: entering main loop
2022-12-14 21:59:56 +0100 mxqd[1216]: WARNING: MySQL mysql_ping(): ERROR 2013 (HY000): Lost connection to MySQL server during query
2022-12-14 21:59:56 +0100 mxqd[1216]: WARNING: MySQL mysql_ping(): ERROR 2013 (HY000): Lost connection to MySQL server during query
2022-12-14 21:59:56 +0100 mxqd[1216]: WARNING: mx_mysql_ping() failed: Resource temporarily unavailable - retrying again (forever) in 5 second(s).
2022-12-14 22:00:01 +0100 mxqd[1216]: mx_mysql_ping_forever() recovered from previous errors (1 tries). Yippieh! Back to work!
2022-12-19 19:40:55 +0100 mxqd[1216]:   group=haas(8009):517952 slots_to_start=32 slots_per_job=8 :: trying to start job for group.
2022-12-19 19:40:56 +0100 mxqd[1216]:    job=haas(8009):517952:42635158 :: started. pid=9705
2022-12-19 19:40:56 +0100 mxqd[1216]: Main loop started 8 slots.
2022-12-19 19:41:27 +0100 mxqd[1216]: job finished (via fspool) : job 42635158 pid 9705 status 256
2022-12-19 19:41:28 +0100 mxqd[1216]: Main loop freed 8 slots.
2022-12-19 19:45:14 +0100 mxqd[1216]:   group=haas(8009):517953 slots_to_start=32 slots_per_job=1 :: trying to start job for group.
2022-12-19 19:45:14 +0100 mxqd[1216]:    job=haas(8009):517953:42635174 :: started. pid=9934
2022-12-19 19:45:14 +0100 mxqd[1216]: Main loop started 1 slots.
2022-12-19 19:45:44 +0100 mxqd[1216]: job finished (via fspool) : job 42635174 pid 9934 status 256
2022-12-19 19:45:44 +0100 mxqd[1216]: Main loop freed 1 slots.
2022-12-19 19:47:27 +0100 mxqd[1216]:   group=haas(8009):517954 slots_to_start=32 slots_per_job=8 :: trying to start job for group.
2022-12-19 19:47:28 +0100 mxqd[1216]:    job=haas(8009):517954:42635175 :: started. pid=10039
2022-12-19 19:47:28 +0100 mxqd[1216]: Main loop started 8 slots.
2022-12-19 19:47:58 +0100 mxqd[1216]: job finished (via fspool) : job 42635175 pid 10039 status 256
2022-12-19 19:47:59 +0100 mxqd[1216]: Main loop freed 8 slots.
2022-12-19 19:57:12 +0100 mxqd[1216]:   group=haas(8009):517954 slots_to_start=32 slots_per_job=8 :: trying to start job for group.
2022-12-19 19:57:13 +0100 mxqd[1216]:    job=haas(8009):517954:42635211 :: started. pid=10329
2022-12-19 19:57:13 +0100 mxqd[1216]: Main loop started 8 slots.
2022-12-19 19:57:43 +0100 mxqd[1216]: job finished (via fspool) : job 42635211 pid 10329 status 256
2022-12-19 19:57:43 +0100 mxqd[1216]: Main loop freed 8 slots.
2022-12-20 12:19:06 +0100 mxqd[1216]:   group=haas(8009):517952 slots_to_start=32 slots_per_job=8 :: trying to start job for group.
2022-12-20 12:19:06 +0100 mxqd[1216]:    job=haas(8009):517952:42637224 :: started. pid=24694
2022-12-20 12:19:06 +0100 mxqd[1216]: Main loop started 8 slots.
2022-12-20 12:19:37 +0100 mxqd[1216]: job finished (via fspool) : job 42637224 pid 24694 status 256
2022-12-20 12:19:38 +0100 mxqd[1216]: Main loop freed 8 slots.
2022-12-20 17:07:38 +0100 mxqd[1216]:   group=haas(8009):517952 slots_to_start=32 slots_per_job=8 :: trying to start job for group.
2022-12-20 17:07:38 +0100 mxqd[1216]:    job=haas(8009):517952:42639017 :: started. pid=29216
2022-12-20 17:07:38 +0100 mxqd[1216]: Main loop started 8 slots.
2022-12-20 17:08:10 +0100 mxqd[1216]: job finished (via fspool) : job 42639017 pid 29216 status 256
2022-12-20 17:08:10 +0100 mxqd[1216]: Main loop freed 8 slots.
2022-12-21 08:59:32 +0100 mxqd[1216]:   group=haas(8009):517952 slots_to_start=32 slots_per_job=8 :: trying to start job for group.
2022-12-21 08:59:32 +0100 mxqd[1216]:    job=haas(8009):517952:42640917 :: started. pid=10882
2022-12-21 08:59:32 +0100 mxqd[1216]: Main loop started 8 slots.
2022-12-21 08:59:53 +0100 mxqd[1216]:   group=haas(8009):517971 slots_to_start=24 slots_per_job=8 :: trying to start job for group.
2022-12-21 08:59:53 +0100 mxqd[1216]:    job=haas(8009):517971:42640918 :: started. pid=11003
2022-12-21 08:59:53 +0100 mxqd[1216]: Main loop started 8 slots.
2022-12-21 09:00:03 +0100 mxqd[1216]: job finished (via fspool) : job 42640917 pid 10882 status 256
/usr/libexec/mxq/gpu-setup: job_release: job with 10882 has no GPU locked
2022-12-21 09:00:03 +0100 mxqd[1216]: ERROR: gpu-setup job-release: Protocol error

Absolute mystery.
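
To compare the three affected servers, it may help to pull the relevant lines out of each log mechanically. The following is a minimal sketch, not part of MXQ: the function name `correlate` and the regular expressions are made up for illustration and assume the log format shown above. It also assumes that the number in the `gpu-setup` message is the pid of the job, which matches this log (pid=10882) but is not confirmed anywhere else.

```python
import re

# Patterns derived from the log excerpt above (assumed format, not an MXQ API).
STARTED = re.compile(r"job=\S+:(\d+) :: started\. pid=(\d+)")
FINISHED = re.compile(
    r"job finished \(via fspool\) : job (\d+) pid (\d+) status (\d+)"
)
RELEASE_ERR = re.compile(
    r"gpu-setup: job_release: job with (\d+) has no GPU locked"
)


def correlate(lines):
    """Return (pid, job_id, (job_id, status)) for each failed GPU release.

    job_id or the (job_id, status) tuple is None when the log contains no
    matching "started" or "job finished" line for that pid.
    """
    started = {}   # pid -> job id, from "started. pid=" lines
    finished = {}  # pid -> (job id, exit status), from "job finished" lines
    failed = []
    for line in lines:
        if m := STARTED.search(line):
            started[int(m.group(2))] = int(m.group(1))
        elif m := FINISHED.search(line):
            finished[int(m.group(2))] = (int(m.group(1)), int(m.group(3)))
        elif m := RELEASE_ERR.search(line):
            pid = int(m.group(1))
            failed.append((pid, started.get(pid), finished.get(pid)))
    return failed


# Lines copied from the jabberwocky log above.
SAMPLE = [
    "2022-12-21 08:59:32 +0100 mxqd[1216]:    job=haas(8009):517952:42640917 :: started. pid=10882",
    "2022-12-21 08:59:53 +0100 mxqd[1216]:    job=haas(8009):517971:42640918 :: started. pid=11003",
    "2022-12-21 09:00:03 +0100 mxqd[1216]: job finished (via fspool) : job 42640917 pid 10882 status 256",
    "/usr/libexec/mxq/gpu-setup: job_release: job with 10882 has no GPU locked",
]

if __name__ == "__main__":
    for pid, job_id, result in correlate(SAMPLE):
        print(f"pid={pid} job={job_id} finished={result}")
```

Running the same function over the logs of the other two servers would show whether the failing release always belongs to a job that had just finished with status 256, as it does here.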
