Closed Bug 1628333 (Opened 5 years ago, Closed 5 years ago)

mac pgo workers not working

Categories

(Infrastructure & Operations :: RelOps: Posix OS, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: dhouse, Assigned: dhouse)

References

Details

Aryx reported problems with the mac pgo machines. It looks like one or more are stuck.

Aryx
06:42
hi, OSX Bpgo(run) runs look backlogged and it seems there is only one running machine (and one which hasn't done anything in the past month): https://firefox-ci-tc.services.mozilla.com/provisioners/releng-hardware/worker-types/gecko-3-t-osx-1014
fubar are you around?

The pgo pool is:
t-mojave-r7-235.tier3.releng.mdc2.mozilla.com
t-mojave-r7-236.tier3.releng.mdc2.mozilla.com
t-mojave-r7-471.tier3.releng.mdc1.mozilla.com
t-mojave-r7-472.tier3.releng.mdc1.mozilla.com

235 was up but is quarantined. The last 5 tasks it ran, one month ago, were failures. I need to investigate these to see if it needs to be rebuilt or has hardware issues, or if I can un-quarantine it and try it on a few tasks.
236 was up but did not show up in the TC queue as active. I restarted the machine and it is now taking tasks.
471 is not pingable (power was on; I power-cycled it and it did not come back on the network. I've powered it down for 5 minutes and will power it back on to test).
472 has been fine and has been working through tasks successfully.
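
For reference, a quick reachability check over the pool (a minimal Python sketch using the hostnames listed above; the ping invocation is illustrative only and not the actual tooling used here):

import subprocess

# The four pgo workers from the pool list above.
POOL = [
    "t-mojave-r7-235.tier3.releng.mdc2.mozilla.com",
    "t-mojave-r7-236.tier3.releng.mdc2.mozilla.com",
    "t-mojave-r7-471.tier3.releng.mdc1.mozilla.com",
    "t-mojave-r7-472.tier3.releng.mdc1.mozilla.com",
]

def is_pingable(host, count=3):
    # True if the host answers ICMP echo; exact ping flags vary by OS.
    result = subprocess.run(
        ["ping", "-c", str(count), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

for host in POOL:
    print(host, "up" if is_pingable(host) else "NOT PINGABLE")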

235's first failed task hit HTTP 500 errors on the hg checkout (https://firefox-ci-tc.services.mozilla.com/tasks/Yf3-TmMEQr29YIj_gf4x3g/runs/0/logs/https%3A%2F%2Ffirefox-ci-tc.services.mozilla.com%2Fapi%2Fqueue%2Fv1%2Ftask%2FYf3-TmMEQr29YIj_gf4x3g%2Fruns%2F0%2Fartifacts%2Fpublic%2Flogs%2Flive_backing.log)

The tasks after that hit failures on the hg sparse checkout (the same failure on each of the last 4 tasks, e.g. https://firefox-ci-tc.services.mozilla.com/tasks/RblmCkekS32NMyvsm1nqog/runs/1/logs/https%3A%2F%2Ffirefox-ci-tc.services.mozilla.com%2Fapi%2Fqueue%2Fv1%2Ftask%2FRblmCkekS32NMyvsm1nqog%2Fruns%2F1%2Fartifacts%2Fpublic%2Flogs%2Flive_backing.log):

[vcs 2020-02-24T13:20:42.359Z] ensuring https://hg.mozilla.org/integration/autoland@75ac3d6900cf944e10b74e6be9848122ef7f8e75 is available at /Users/task_1582548376/checkouts/gecko
[...]
[vcs 2020-02-24T13:20:42.367Z] mercurial.error.Abort: cannot enable sparse profile on existing non-sparse checkout

So it seems the task that failed with HTTP 500 changed something in the worker's state, causing the following tasks to fail.
I un-quarantined it, and it failed the first task it took in the same way (the same error about a non-sparse checkout).
I'm checking the directories and will try clearing or resetting it to a clean state.
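
A minimal sketch of the kind of cleanup meant here (the checkout path pattern comes from the task log above; the worker's actual cache bookkeeping files are not named in this bug, so they are only mentioned in a comment):

import glob
import shutil

# Stale gecko checkouts left behind by the failed tasks; the path pattern
# "/Users/task_*/checkouts/gecko" comes from the task log above. Removing
# them forces the next task's sparse clone to start from a clean state.
for checkout in glob.glob("/Users/task_*/checkouts/gecko"):
    print("removing stale checkout:", checkout)
    shutil.rmtree(checkout, ignore_errors=True)

# The worker also keeps files recording its cache state (see the later
# comment about updating them); those files are not named in this bug,
# so they are not touched here.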

Status update for autoland: with ~25 pushes having the macOS Bpgo(run) job pending, ~10 minutes per task (9 minutes runtime + 1 minute overhead), and 2 running machines, the jobs should all be complete or running within ~2 hours. As Bpgo(run) depends on Bpgo(instr), which runs ~40 minutes, autoland is scheduled to reopen at ~18:35 UTC.
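
Worked out, that estimate is roughly:

pending_pushes = 25       # macOS Bpgo(run) jobs pending on autoland
minutes_per_task = 9 + 1  # ~9 minutes runtime + ~1 minute overhead
running_machines = 2

drain_minutes = pending_pushes * minutes_per_task / running_machines
print(drain_minutes)  # 125.0 minutes, i.e. roughly 2 hours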

On 235, I cleared out the caches and updated the files recording the cache state. I successfully ran a copy of a recent pgo task through this worker (in a -beta workerType), and then switched it back into the production pool.
So we're up to three workers, and the queue was down to 1 task (I added a few copies to the production queue to make sure 235 runs a task there successfully).

471 has not come back onto the network after multiple power cycles and leaving it powered off for about an hour.

The tree was reopened at ~18:20. Leaving the bug open for dhouse's remaining work.

4 machines are a reasonable minimum for the pool and allow processing of <=24 pushes/hour (e.g. when pushes land in waves after a backlog has built up during a tree closure, sheriffs let up to 20 pushes land and then close the tree for an hour).
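
Roughly, that capacity figure works out as:

machines = 4
minutes_per_task = 10  # ~9 minutes runtime + ~1 minute overhead, as above
pushes_per_hour = machines * (60 // minutes_per_task)
print(pushes_per_hour)  # 24, matching the <=24 pushes/hour figure above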

(In reply to Dave House [:dhouse] from comment #4)

On 235, I cleared out the caches and updated the files recording the cache state. I successfully ran a copy of a recent pgo task through this worker (in a -beta workerType), and then switched it back into the production pool.
So we're up to three workers, and the queue was down to 1 task (I added a few copies to the production queue to make sure 235 runs a task there successfully).

235 has run 6 tasks successfully.

I'm working with the datacenter remote-hands to recover #471. If necessary, I'll replace it with other hardware.

The queue has stayed under 10 for the last 24h (peaked at 8 one hour ago). So we're doing okay with 3 workers in the pool so far.

We recovered #471 and I finished rebuilding it this morning. I am running 3 recent pgo tasks on it. If it looks good, I will put it back into the production group.

471 completed tasks successfully and I put it back into the production pool.

NI to myself to add monitoring that alerts when any part of this small pool falls over in the future.
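
A minimal sketch of what such a check could look like, assuming the Taskcluster queue's listWorkers endpoint for this worker type (the same pool linked in comment 0) and its quarantineUntil field; the alerting hook is just a placeholder:

import datetime
import requests

QUEUE = "https://firefox-ci-tc.services.mozilla.com/api/queue/v1"
PROVISIONER = "releng-hardware"
WORKER_TYPE = "gecko-3-t-osx-1014"
MIN_HEALTHY = 3  # alert if fewer than 3 of the 4 workers look usable

def healthy_worker_count():
    url = f"{QUEUE}/provisioners/{PROVISIONER}/worker-types/{WORKER_TYPE}/workers"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    now = datetime.datetime.now(datetime.timezone.utc)
    healthy = 0
    for worker in resp.json().get("workers", []):
        quarantined = worker.get("quarantineUntil")
        if quarantined:
            until = datetime.datetime.fromisoformat(quarantined.replace("Z", "+00:00"))
            if until > now:
                continue  # skip quarantined workers
        healthy += 1
    return healthy

count = healthy_worker_count()
if count < MIN_HEALTHY:
    # Placeholder: wire this up to email/IRC paging as appropriate.
    print(f"ALERT: only {count} healthy {WORKER_TYPE} workers")
else:
    print(f"OK: {count} healthy workers")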

Status: NEW → RESOLVED
Closed: 5 years ago
Flags: needinfo?(dhouse)
Resolution: --- → FIXED

(In reply to Dave House [:dhouse] from comment #4)

On 235, I cleared out the caches and updated the files recording the cache state. I successfully ran a copy of a recent pgo task through this worker (in a -beta workerType), and then switched it back into the production pool.

The HTTP 500 on clone left a bad cache on #235 again. It may be a coincidence that it is the same worker; I'll clear the cache and see if it repeats.

The datacenter remote-hands reimaged #235 and I successfully ran tasks on it yesterday. I've put it back into the production pool now and removed the quarantine.

Flags: needinfo?(dhouse)

All 4 mac pgo workers are still up and active. There were no repeats of the problem over the weekend.

Status: RESOLVED → VERIFIED
Depends on: 1530732