Closed Bug 1628333 (Opened 5 years ago, Closed 5 years ago)

mac pgo workers not working

Categories

(Infrastructure & Operations :: RelOps: Posix OS, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: dhouse, Assigned: dhouse)

References

Details

Aryx reported problems with the mac pgo machines. It looks like one or more are stuck.

Aryx
06:42
hi, OSX Bpgo(run) runs look backlogged and it seems there is only one running machine (and one which hasn't done anything in the past month): https://firefox-ci-tc.services.mozilla.com/provisioners/releng-hardware/worker-types/gecko-3-t-osx-1014
fubar are you around?

The pgo pool is:
t-mojave-r7-235.tier3.releng.mdc2.mozilla.com
t-mojave-r7-236.tier3.releng.mdc2.mozilla.com
t-mojave-r7-471.tier3.releng.mdc1.mozilla.com
t-mojave-r7-472.tier3.releng.mdc1.mozilla.com

235 was up but is quarantined. The last 5 tasks it ran, one month ago, were failures. I need to investigate these to see if it needs to be rebuilt or has hardware issues, or if I can un-quarantine it and try it on a few tasks.
236 was up but did not show up in the TC queue as active. I restarted the machine and it is now taking tasks.
471 is not pingable (power was on; I power-cycled it and it did not come back on the network. I've powered it down for 5 minutes and will power it back on to test).
472 has been fine and has been working through tasks successfully.
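
For reference, a quick reachability check over the pool (a minimal Python sketch using the hostnames listed above; the ping invocation is illustrative only and not the actual tooling used here):

import subprocess

# The four pgo workers from the pool list above.
POOL = [
    "t-mojave-r7-235.tier3.releng.mdc2.mozilla.com",
    "t-mojave-r7-236.tier3.releng.mdc2.mozilla.com",
    "t-mojave-r7-471.tier3.releng.mdc1.mozilla.com",
    "t-mojave-r7-472.tier3.releng.mdc1.mozilla.com",
]

def is_pingable(host, count=3):
    # True if the host answers ICMP echo; exact ping flags vary by OS.
    result = subprocess.run(
        ["ping", "-c", str(count), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

for host in POOL:
    print(host, "up" if is_pingable(host) else "NOT PINGABLE")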

235's first failed task hit HTTP 500 errors on the hg checkout (https://firefox-ci-tc.services.mozilla.com/tasks/Yf3-TmMEQr29YIj_gf4x3g/runs/0/logs/https%3A%2F%2Ffirefox-ci-tc.services.mozilla.com%2Fapi%2Fqueue%2Fv1%2Ftask%2FYf3-TmMEQr29YIj_gf4x3g%2Fruns%2F0%2Fartifacts%2Fpublic%2Flogs%2Flive_backing.log)

The tasks after that hit failures on the hg sparse checkout (the same failure on each of the last 4 tasks, e.g. https://firefox-ci-tc.services.mozilla.com/tasks/RblmCkekS32NMyvsm1nqog/runs/1/logs/https%3A%2F%2Ffirefox-ci-tc.services.mozilla.com%2Fapi%2Fqueue%2Fv1%2Ftask%2FRblmCkekS32NMyvsm1nqog%2Fruns%2F1%2Fartifacts%2Fpublic%2Flogs%2Flive_backing.log):

[vcs 2020-02-24T13:20:42.359Z] ensuring https://hg.mozilla.org/integration/autoland@75ac3d6900cf944e10b74e6be9848122ef7f8e75 is available at /Users/task_1582548376/checkouts/gecko
[...]
[vcs 2020-02-24T13:20:42.367Z] mercurial.error.Abort: cannot enable sparse profile on existing non-sparse checkout

So it seems the task that failed with HTTP 500 changed something in the worker's state, causing the following tasks to fail.
I un-quarantined it, and it failed the first task it took in the same way (the same error about a non-sparse checkout).
I'm checking the directories and will try clearing or resetting it to a clean state.
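
A minimal sketch of the kind of cleanup meant here (the checkout path pattern comes from the task log above; the worker's actual cache bookkeeping files are not named in this bug, so they are only mentioned in a comment):

import glob
import shutil

# Stale gecko checkouts left behind by the failed tasks; the path pattern
# "/Users/task_*/checkouts/gecko" comes from the task log above. Removing
# them forces the next task's sparse clone to start from a clean state.
for checkout in glob.glob("/Users/task_*/checkouts/gecko"):
    print("removing stale checkout:", checkout)
    shutil.rmtree(checkout, ignore_errors=True)

# The worker also keeps files recording its cache state (see the later
# comment about updating them); those files are not named in this bug,
# so they are not touched here.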

Status update for autoland: with ~25 pushes having the macOS Bpgo(run) job pending, ~10 minutes per task (9 minutes runtime + 1 minute overhead), and 2 running machines, the jobs should all be complete or running within ~2 hours. As Bpgo(run) depends on Bpgo(instr), which runs ~40 minutes, autoland is scheduled to reopen at ~18:35 UTC.
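
Worked out, that estimate is roughly:

pending_pushes = 25       # macOS Bpgo(run) jobs pending on autoland
minutes_per_task = 9 + 1  # ~9 minutes runtime + ~1 minute overhead
running_machines = 2

drain_minutes = pending_pushes * minutes_per_task / running_machines
print(drain_minutes)  # 125.0 minutes, i.e. roughly 2 hours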

On 235, I cleared out the caches and updated the files recording the cache state. I successfully ran a copy of a recent pgo task through this worker (in a -beta workerType), and then switched it back into the production pool.
So we're up to three workers, and the queue was down to 1 task (I added a few copies to the production queue to make sure 235 runs a task there successfully).

471 has not come back onto the network after multiple power cycles and leaving it powered off for about an hour.

The tree was reopened at ~18:20. Leaving the bug open for dhouse's remaining work.

4 machines are a reasonable minimum for the pool and allow processing of <=24 pushes/hour (e.g. when pushes land in waves after a backlog has built up during a tree closure, sheriffs let up to 20 pushes land and then close the tree for an hour).
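
Roughly, that capacity figure works out as:

machines = 4
minutes_per_task = 10  # ~9 minutes runtime + ~1 minute overhead, as above
pushes_per_hour = machines * (60 // minutes_per_task)
print(pushes_per_hour)  # 24, matching the <=24 pushes/hour figure above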

(In reply to Dave House [:dhouse] from comment #4)

On 235, I cleared out the caches and updated the files recording the cache state. I successfully ran a copy of a recent pgo task through this worker (in a -beta workerType), and then switched it back into the production pool.
So we're up to three workers, and the queue was down to 1 task (I added a few copies to the production queue to make sure 235 runs a task there successfully).

235 has run 6 tasks successfully.

I'm working with the datacenter remote-hands to recover #471. If necessary, I'll replace it with other hardware.

The queue has stayed under 10 for the last 24h (peaked at 8 one hour ago). So we're doing okay with 3 workers in the pool so far.

We recovered #471 and I finished rebuilding it this morning. I am running 3 recent pgo tasks on it. If it looks good, I will put it back into the production group.

471 completed tasks successfully and I put it back into the production pool.

NI to myself to add monitoring that alerts when any part of this small pool falls over in the future.
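
A minimal sketch of what such a check could look like, assuming the Taskcluster queue's listWorkers endpoint for this worker type (the same pool linked in comment 0) and its quarantineUntil field; the alerting hook is just a placeholder:

import datetime
import requests

QUEUE = "https://firefox-ci-tc.services.mozilla.com/api/queue/v1"
PROVISIONER = "releng-hardware"
WORKER_TYPE = "gecko-3-t-osx-1014"
MIN_HEALTHY = 3  # alert if fewer than 3 of the 4 workers look usable

def healthy_worker_count():
    url = f"{QUEUE}/provisioners/{PROVISIONER}/worker-types/{WORKER_TYPE}/workers"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    now = datetime.datetime.now(datetime.timezone.utc)
    healthy = 0
    for worker in resp.json().get("workers", []):
        quarantined = worker.get("quarantineUntil")
        if quarantined:
            until = datetime.datetime.fromisoformat(quarantined.replace("Z", "+00:00"))
            if until > now:
                continue  # skip quarantined workers
        healthy += 1
    return healthy

count = healthy_worker_count()
if count < MIN_HEALTHY:
    # Placeholder: wire this up to email/IRC paging as appropriate.
    print(f"ALERT: only {count} healthy {WORKER_TYPE} workers")
else:
    print(f"OK: {count} healthy workers")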

Status: NEW → RESOLVED
Closed: 5 years ago
Flags: needinfo?(dhouse)
Resolution: --- → FIXED

(In reply to Dave House [:dhouse] from comment #4)

On 235, I cleared out the caches and updated the files recording the cache state. I successfully ran a copy of a recent pgo task through this worker (in a -beta workerType), and then switched it back into the production pool.

The HTTP 500 on clone left a bad cache on #235 again. It may be a coincidence that it is the same worker; I'll clear the cache and see if it repeats.

The datacenter remote-hands reimaged #235 and I successfully ran tasks on it yesterday. I've put it back into the production pool now and removed the quarantine.

Flags: needinfo?(dhouse)

All 4 mac pgo workers are still up and active. There were no repeats of the problem over the weekend.

Status: RESOLVED → VERIFIED
Depends on: 1530732