Closed Bug 1563827 Opened 5 years ago Closed 4 years ago

Upgrade mojave workers to 10.14.5

Categories

(Infrastructure & Operations :: RelOps: Posix OS, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dhouse, Unassigned)

References

Details

(Keywords: leave-open)

Attachments

(2 files)

We are updating to mojave 10.14.5 before ESR is pinned to 10.14 for tests.
I am first upgrading 176-229, then I will move to updating the others currently running mojave, and finally we can then expand the pool (as directed by qa).

I created a workflow in deploystudio (copy of V3 with the disk image changed to the 10.14.5 image (not the xcode included image)).
176 and 177 have successfully re-imaged to 10.14.5 with this workflow and are taking work:
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-osx-1014/workers/mdc2/t-mojave-r7-176
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-osx-1014/workers/mdc2/t-mojave-r7-177

I've upgraded about 15 of these. Seven in one series still have SIP enabled, so I'll ask QTS to turn off SIP on those. I had to power-cycle a few to reach them (I'm guessing they have SIP enabled and powered off because of their power-saving settings).
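
For reference, a minimal sketch of how the SIP state on a worker could be checked before attempting the upgrade. It assumes the check runs locally on the mac mini and relies on the stock `csrutil status` command; the function names are hypothetical and this is not the tooling actually used for these workers.

```python
import subprocess


def sip_enabled(status_output: str) -> bool:
    """Return True if `csrutil status` output reports SIP as enabled."""
    # csrutil prints a line such as:
    #   System Integrity Protection status: enabled.
    for line in status_output.splitlines():
        if "System Integrity Protection status:" in line:
            state = line.split(":", 1)[1].strip().rstrip(".").lower()
            return state.startswith("enabled")
    raise ValueError("no SIP status line found in csrutil output")


def local_sip_enabled() -> bool:
    """Run csrutil on the local macOS host and parse its output (hypothetical helper)."""
    result = subprocess.run(
        ["csrutil", "status"], capture_output=True, text=True, check=True
    )
    return sip_enabled(result.stdout)
```

Only the parsing step is independent of the host; `local_sip_enabled()` needs to run on macOS.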

I ended up with 16 successfully upgraded from the original mdc2 failed pool of 40. Of the other 24, 8 have SIP enabled, 4 are stuck at the upgrade (possibly SIP), and 12 are not reachable (I think those got stuck at the resume from hibernate into system setup). I'll work on recovering those with QTS remote-hands and troubleshooting.
I'm continuing on to upgrade the rest of the machines that have been running mojave 10.14.
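
As a quick sanity check on the counts above, a small tally (the category labels are my own shorthand, not names from any inventory system):

```python
# Counts taken from the comment above; labels are hypothetical shorthand.
original_pool = 40
upgraded = 16
not_upgraded = {
    "sip_enabled": 8,
    "stuck_at_upgrade": 4,
    "unreachable": 12,
}

remaining = sum(not_upgraded.values())

# The follow-up categories plus the upgraded machines should cover the whole pool.
assert upgraded + remaining == original_pool
print(f"{upgraded}/{original_pool} upgraded, {remaining} need follow-up")
```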

I've moved about 30 more to 10.14.5 and I'm working through the rest.
I also moved an additional 5 from yosemite/10.10 to mojave/10.14.5 because the queue has gone up this afternoon faster than the burn-down (and re: egao's note, I'll increase that 10 addition to 100 over the next day and keep updating this bug).

Yesterday, I finished going through the existing mojave workers and increased both the mdc1 and mdc2 pools by 10.
The mojave and yosemite pools were about the same size this morning, around 200 active workers each. I have about 50 machines that I've been retrying reimaging/updating on and troubleshooting today (some have SIP enabled, some needed to be power-cycled, etc.).
I'm currently adding another 20 to the mojave pool (from yosemite).

The burn-down rate (around 600 tasks/hour average) on the mojave queue is better now with more workers, but still it does not keep up with the queue through the pacific-coast work day.
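
To illustrate why the queue still grows during the work day, here is a rough sketch of the backlog arithmetic. The 600 tasks/hour burn-down is the average quoted above; the hourly arrival figures are hypothetical placeholders, not measured values.

```python
def backlog_over_time(start, arrivals_per_hour, burn_down_per_hour):
    """Return the pending-task count after each hour.

    The backlog grows whenever arrivals exceed the burn-down rate
    and can never drop below zero.
    """
    backlog = start
    history = []
    for arrivals in arrivals_per_hour:
        backlog = max(0, backlog + arrivals - burn_down_per_hour)
        history.append(backlog)
    return history


# Hypothetical arrivals per hour across a work day (not real queue data).
hypothetical_arrivals = [400, 700, 900, 900, 700, 400]
print(backlog_over_time(0, hypothetical_arrivals, burn_down_per_hour=600))
```

With these made-up arrivals the backlog only starts shrinking once arrivals fall back below 600/hour, which matches the pattern of the queue recovering after the peak hours.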

as 95% of the load is on 10.14, the more machines we can put there the better. we can balance that with 10.14.5 fixes though :)

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #6)

as 95% of the load is on 10.14, the more machines we can put there the better. we can balance that with 10.14.5 fixes though :)

Thanks!

We're up to 255 on 10.14 macos now and the queue stayed under 1k today.

Now we have 263 10.14 macos workers. The queue exceeded 1k mid-day yesterday, but it recovered within a few hours.

I am still fixing the workers that failed the update (because of SIP or unreachable/power). With all of those added, we'll have around 300 10.14(.5) workers.

we have 1 test suite left to convert to 10.14 which we could do this week or next week- then uplift to esr and beta- should allow us to only need 10 remaining 10.10 machines to account for esr60 and mozilla-release

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #9)

we have 1 test suite left to convert to 10.14 which we could do this week or next week- then uplift to esr and beta- should allow us to only need 10 remaining 10.10 machines to account for esr60 and mozilla-release

Thanks Joel! I'll continue moving macos workers from 10.10 to 10.14 through this week and next and aim for only 10 left after next week.

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #9)

we have 1 test suite left to convert to 10.14 which we could do this week or next week- then uplift to esr and beta- should allow us to only need 10 remaining 10.10 machines to account for esr60 and mozilla-release

We do have browser-screenshots in addition to mochitest-browser-chrome, both of which will need to run on macosx1010 for the time being. We are looking to migrate mochitest-browser-chrome to macosx1014 by end of this week, which leaves just the browser-screenshots test running on macosx1010 after that point.

The test execution issues with browser-screenshots are tracked in bug 1554821. That bug had some traction from :dividehex, who looked into the configuration of the macosx1014 machines, but not much work has been done since.

next week we can uplift marionette-media, gtest, browser-chrome fixes to esr68 and mozilla-beta- setting us up for success on September 2nd (or shortly thereafter) to have all branches running the full set of tests.

browser-screenshots is only run on trunk, so there is little concern with that taking a few extra weeks.

I've recovered a few more mac minis and we now have 310 machines running in the production queue for 10.14(.5)

Production 10.14 count is now up to 320 workers.

(In reply to Dave House [:dhouse] from comment #14)

Production 10.14 count is now up to 320 workers.

Now we have 327 10.14 workers up

FYI, with bug 1570775 I will be removing references to macosx1010 everywhere in-tree, as we no longer have any tests running on macosx1010 machines (last holdout, mochitest-browser-screenshots was migrated over to macosx1014 earlier this week).

So unless there are other obligations to have macosx1010 (eg. ESR, beta) it is safe to reduce the number of macosx1010 machines to a low count.

low count == 15 (for esr60 and mozilla-release branches)

I've brought the macosx1010 worker count down a few more this morning to 24 (12 in each data center).

I'll reduce it further to 15 over the next 2 days.

(In reply to Dave House [:dhouse] from comment #18)

I've brought the macosx1010 worker count down a few more this morning to 24 (12 in each data center).

I'll reduce it further to 15 over the next 2 days.

This +7 moved from 1010, combined with the machines we've recovered from a failed state, brings our macos worker count for macos 10.14 to 340 active

The yosemite/1010 worker pool usage has been low for the last month (at peak in the last week, 20 machines were used for 1-2 hours on Oct 3 and 7th around 9am pacific time). I'll plan to move half of the remaining yosemite machines over to mojave later this week. Then we'll be left with 12 yosemite workers (6 in each data center).

:aryx, do you know how many macos1010 workers we need to keep around this month? (I think we were keeping some for esr68?)

Flags: needinfo?(aryx.bugmail)

The yosemite worker queue has been zero

Blocks: 1570775

On trunk, two jobs still use gecko-t-osx-1010: webrender (WR) cargotest and wrench: https://sql.telemetry.mozilla.org/queries/65256/source

comm-esr68 and try-comm-central use the old worker (needinfoing Jorg from Thunderbird on that): https://sql.telemetry.mozilla.org/queries/65257

On Try, likely pushes based on an old code base. Only mochitest-browser-chrome, xpcshell, python and MnM tests were run: https://sql.telemetry.mozilla.org/queries/65258/source

Flags: needinfo?(aryx.bugmail) → needinfo?(jorgk)

Sounds like a build issue to me.

Flags: needinfo?(jorgk) → needinfo?(rob)

Thx, I'm okay keeping some yosemite workers around, but I'm eager to repurpose them if they're not needed :)

Normally I'd say we need to uplift bug 1570743, but that patch has other stuff in it (Marionette) that won't work on comm-esr68. These are the minimum changes needed to run tests on macOS 10.14.

The Try job looks good to me:
https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=6147af3596ffb288c8fe575bef57678a99b1682c
Attachment #9100025 - Flags: review?(jorgk)
Assignee: nobody → rob
Status: NEW → ASSIGNED
Comment on attachment 9100025 [details] [diff] [review]
macos1014_tests_commesr68.patch

Thanks, I'll push it now.
Flags: needinfo?(rob)
Attachment #9100025 - Flags: review?(jorgk) → review+

Is this bug only for TB? When I push, it will be closed.
EDIT: Oops, patch for TB 68.

Keywords: leave-open
Flags: needinfo?(jorgk)

It should "just" be a matter of uplifting bug 1570743. But that commit includes marionette stuff and other changes that didn't merge nicely.

Assignee: rob → nobody
Status: ASSIGNED → NEW
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED