mac pgo workers not working
Categories
(Infrastructure & Operations :: RelOps: Posix OS, defect)
Tracking
(Not tracked)
People
(Reporter: dhouse, Assigned: dhouse)
References
Details
Aryx reported problems with the mac pgo machines. It looks like one or more are stuck.
Aryx
06:42
hi, the OSX Bpgo(run) runs look backlogged and it seems there is only one running machine (and one which hasn't done anything in the past month): https://firefox-ci-tc.services.mozilla.com/provisioners/releng-hardware/worker-types/gecko-3-t-osx-1014
fubar are you around?
The pgo pool is:
t-mojave-r7-235.tier3.releng.mdc2.mozilla.com
t-mojave-r7-236.tier3.releng.mdc2.mozilla.com
t-mojave-r7-471.tier3.releng.mdc1.mozilla.com
t-mojave-r7-472.tier3.releng.mdc1.mozilla.com
235 was up but is quarantined. The last 5 tasks it ran, one month ago, were all failures. I need to investigate those to see whether it needs to be rebuilt or has hardware issues, or whether I can un-quarantine it and try it on a few tasks.
236 was up but did not show up in the TC queue as active. I restarted the machine and it is now taking tasks.
471 is not pingable (power was on; I cycled it and it did not come back on the network. I've powered it down for 5 minutes and will power it back on to test).
472 has been fine and is working through tasks successfully.
235's first failed task saw http-500 errors on the hg checkout (https://firefox-ci-tc.services.mozilla.com/tasks/Yf3-TmMEQr29YIj_gf4x3g/runs/0/logs/https%3A%2F%2Ffirefox-ci-tc.services.mozilla.com%2Fapi%2Fqueue%2Fv1%2Ftask%2FYf3-TmMEQr29YIj_gf4x3g%2Fruns%2F0%2Fartifacts%2Fpublic%2Flogs%2Flive_backing.log)
The subsequent tasks failed on the hg sparse checkout (the last 4 tasks all hit the same failure, e.g. https://firefox-ci-tc.services.mozilla.com/tasks/RblmCkekS32NMyvsm1nqog/runs/1/logs/https%3A%2F%2Ffirefox-ci-tc.services.mozilla.com%2Fapi%2Fqueue%2Fv1%2Ftask%2FRblmCkekS32NMyvsm1nqog%2Fruns%2F1%2Fartifacts%2Fpublic%2Flogs%2Flive_backing.log):
[vcs 2020-02-24T13:20:42.359Z] ensuring https://hg.mozilla.org/integration/autoland@75ac3d6900cf944e10b74e6be9848122ef7f8e75 is available at /Users/task_1582548376/checkouts/gecko
[...]
[vcs 2020-02-24T13:20:42.367Z] mercurial.error.Abort: cannot enable sparse profile on existing non-sparse checkout
So it seems the task that hit the http-500 errors left something in the worker's checkout state that caused the subsequent tasks to fail.
I un-quarantined it, and it failed the first task it took in the same way (same error about non-sparse checkout).
I'm checking the directories and will try clearing or resetting it to a clean state.
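For reference, a minimal sketch of the kind of check and cleanup this involves, assuming the leftover checkout lives under a path like the /Users/task_*/checkouts/gecko seen in the log and that Mercurial's sparse extension records the active profile in .hg/sparse (both are assumptions for illustration, not confirmed on the worker):

```python
#!/usr/bin/env python3
"""Illustrative cleanup: if a leftover gecko checkout is not a sparse
checkout, remove it so the next run-task clone starts from a clean state."""
import glob
import os
import shutil

# Hypothetical pattern matching the task-user checkout path seen in the log.
CHECKOUT_GLOB = "/Users/task_*/checkouts/gecko"

def is_sparse_checkout(repo):
    # Assumption: Mercurial's sparse extension keeps the active profile in
    # .hg/sparse; verify against the hg version installed on the worker.
    return os.path.isfile(os.path.join(repo, ".hg", "sparse"))

for repo in glob.glob(CHECKOUT_GLOB):
    if not is_sparse_checkout(repo):
        print(f"removing non-sparse leftover checkout: {repo}")
        shutil.rmtree(repo)
```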
Comment 3•5 years ago
Status update for autoland: with ~25 pushes having the macOS Bpgo(run) job pending, at ~10 minutes per task (9 minutes run time + 1 minute overhead) spread across 2 running machines, the jobs should all be complete or running within about 2 hours. As Bpgo(run) depends on Bpgo(instr), which runs ~40 minutes, autoland is scheduled to reopen at ~18:35 UTC.
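A quick back-of-the-envelope check of that estimate, using only the numbers quoted above:

```python
# Numbers from the status update above; nothing here is measured directly.
pending_pushes = 25       # ~25 autoland pushes with Bpgo(run) pending
minutes_per_task = 9 + 1  # ~9 minutes run time + ~1 minute overhead
running_machines = 2      # workers taking tasks at the time

drain_minutes = pending_pushes * minutes_per_task / running_machines
print(f"backlog clears in ~{drain_minutes:.0f} minutes (~{drain_minutes / 60:.1f} hours)")
# -> ~125 minutes, i.e. roughly 2 hours
```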
On 235, I cleared out the caches and updated the files recording the cache state. I successfully ran a copy of a recent pgo task through this worker (in a -beta workerType), and then switched it back into the production pool.
So we're up to three workers, and the queue was down to 1 task (I added a few copies to the production queue to make sure 235 runs a task there successfully).
471 has not come back onto the network after multiple power cycles and being left powered off for about an hour.
Comment 6•5 years ago
The tree was reopened at ~18:20. Leaving the bug open for dhouse's remaining work.
4 machines are a reasonable minimum for the pool: at ~10 minutes per task they allow processing of <=24 pushes/hour (e.g. when pushes land in waves after a backlog has built up due to a tree closure, sheriffs let up to 20 pushes land and then close the tree for an hour).
(In reply to Dave House [:dhouse] from comment #4)
> On 235, I cleared out the caches and updated the files recording the cache state. I successfully ran a copy of a recent pgo task through this worker (in a -beta workerType), and then switched it back into the production pool.
> So we're up to three workers, and the queue was down to 1 task (I added a few copies to the production queue to make sure 235 runs a task there successfully).
235 has run 6 tasks successfully.
I'm working with the datacenter remote-hands to recover #471. If necessary, I'll replace it with other hardware.
The queue has stayed under 10 for the last 24h (peaked at 8 one hour ago). So we're doing okay with 3 workers in the pool so far.
We recovered #471 and I finished rebuilding it this morning. I am running 3 recent pgo tasks on it. If it looks good, I will put it back into the production group.
Assignee
Comment 10•5 years ago
471 completed tasks successfully and I put it back into the production pool.
NI to myself to add monitoring that alerts when any machine in this small pool falls over in the future.
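As a rough sketch of that monitoring idea, something like the following could poll the Taskcluster queue's listWorkers endpoint for the pool and alert when too few workers look healthy (the URL pattern, the quarantineUntil response field, and the MIN_HEALTHY threshold are assumptions to check against the Taskcluster API docs):

```python
#!/usr/bin/env python3
"""Sketch: alert if fewer than MIN_HEALTHY workers in the mac pgo pool
(releng-hardware/gecko-3-t-osx-1014) are available (un-quarantined)."""
import requests

ROOT = "https://firefox-ci-tc.services.mozilla.com"
URL = (f"{ROOT}/api/queue/v1/provisioners/releng-hardware"
       f"/worker-types/gecko-3-t-osx-1014/workers")
MIN_HEALTHY = 3  # 4 machines in the pool; alert when more than one is out

workers = requests.get(URL, timeout=30).json().get("workers", [])
healthy = [w for w in workers if not w.get("quarantineUntil")]

if len(healthy) < MIN_HEALTHY:
    print(f"ALERT: only {len(healthy)} of {len(workers)} mac pgo workers are available")
```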
Assignee
Comment 11•5 years ago
(In reply to Dave House [:dhouse] from comment #4)
> On 235, I cleared out the caches and updated the files recording the cache state. I successfully ran a copy of a recent pgo task through this worker (in a -beta workerType), and then switched it back into the production pool.
The http 500 on clone left a bad cache again on #235. Perhaps it is a coincidence that it is the same worker. I'll clear the cache and see if it repeats.
Assignee
Comment 12•5 years ago
The datacenter remote-hands reimaged #235 and I successfully ran tasks on it yesterday. I've put it back into the production pool now and removed the quarantine.
Assignee
Comment 13•5 years ago
All 4 mac pgo workers are still up and active. There were no repeats of the problem over the weekend.