Closed Bug 1786254 Opened 2 years ago Closed 2 years ago

Few Android/bitbar workers are available

Categories

(Taskcluster :: Workers, defect)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: CosminS, Assigned: aerickson)

References

Details

(Keywords: leave-open, Whiteboard: [relops-android],)

Attachments

(2 files, 1 obsolete file)

Summary: No Android/bitbar workers available → Few Android/bitbar workers are available

I've asked Bitbar to investigate.

Assignee: nobody → aerickson
Status: NEW → ASSIGNED
Flags: needinfo?(mcornmesser)
Flags: needinfo?(dhouse)
Flags: needinfo?(aerickson)
Whiteboard: [relops-android]

We're up to a 69.57% health rate; just 21 workers are left with issues, 6 Pixels and 15 A51s.
Later edit: up to 89.86%, only 6 workers overall with issues.

Bitbar found many device hosts hung. Restarting seems to have fixed the issue.

We're at 98.55% devices online (only one test g5 offline).

We're planning on upgrading Docker (to support Ubuntu 22.04 images) and I'm hoping it includes some other fixes that may help.

Things seem to have stabilized. Closing.

Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED

There are still exceptions on trees (especially autoland) with tasks getting to deadline exceeded before they're picked up by any worker, like these. A larger range to see how things look on tree is this one. Is there something that can be done about it?
Right now https://earthangel-b40313e5.influxcloud.net/d/wIJoZ4HWk/android-queues?orgId=1&from=now-24h&to=now&refresh=30s lists 8 Pixel workers having issues and one A51.
Reopening for now until things get better.

Status: RESOLVED → REOPENED
Flags: needinfo?(aerickson)
Resolution: FIXED → ---

(In reply to Cosmin Sabou [:CosminS] from comment #6)

There are still exceptions on trees (especially autoland) with tasks getting to deadline exceeded before they're picked up by any worker, like these.

These all say things about being cancelled (but I don't see a cancel-all task in the decision worker).

"[taskcluster 2022-08-28T01:18:35.757Z] Command ABORTED after 33m58.363490362s"

Are those really tasks that timed out?

A larger range to see how things look on tree is this one. Is there something that can be done about it?

I do see 'deadline exceeded' for these, but I'm unsure why they'd time out, as they're not try tasks (they should be higher priority) and we were doing well with devices last Friday. Why are autoland jobs testing a single change?

:jmaher, is it possible we may have scheduled too many tests on the P2s? The A51s don't seem to have this problem (we have more of them and they're faster).

Right now https://earthangel-b40313e5.influxcloud.net/d/wIJoZ4HWk/android-queues?orgId=1&from=now-24h&to=now&refresh=30s lists 8 Pixel workers having issues and one A51.
Reopening for now until things get better.

These devices go offline... they're mobile phones that require human intervention to fix (and Bitbar doesn't work on the weekend). I don't know that there's anything I can do to deliver a higher availability.

Flags: needinfo?(aerickson) → needinfo?(jmaher)

Good question, :aerickson. I looked at a 90-day window and you can see a trend:
https://earthangel-b40313e5.influxcloud.net/d/wIJoZ4HWk/android-queues?orgId=1&refresh=5m&from=now-90d&to=now

Since August 11th, things frequently have much larger queues and backlogs, and that is when Bug 1768558 landed (running fenix/android perf tests in BOTH fission and no-fission modes).

So this would fall in line with "we are overscheduling".

This is the same problem (maybe more of a problem) on the A51 phones as well.

I think the reason we are seeing issues on the P2s is that the phones are 3.5-4 years old and have a higher chance of failing/going offline. The A51s are brand new.

I know there is talk of only running perf on the A51 and not the P2; maybe we could at least turn off non-fission on the P2, or turn off all perf there and add some A51s. Alternatively, we could reduce the tests we run (maybe no chrome tests, no live tests, or only tp6m-essential).

:sparky, could you help get a conversation going within the perftools team regarding how to reduce the scheduled load on the P2 phones?

Flags: needinfo?(jmaher) → needinfo?(gmierz2)

The chrome tests are almost completely disabled right now because of permafailures there:
https://bugzilla.mozilla.org/show_bug.cgi?id=1781237
https://bugzilla.mozilla.org/show_bug.cgi?id=1780817

On mozilla-beta, tests are running on every push right now and in the next few weeks we'll convert it to a cron running nightly: https://bugzilla.mozilla.org/show_bug.cgi?id=1788026

I think we should wait to see what it looks like once mozilla-beta uses cron since that'll reduce the number of pushes we run daily there from an average of 3-4 to 1. If that doesn't help, then we can look into starting to disable live-site tests.

Flags: needinfo?(gmierz2)
Whiteboard: [relops-android] → [relops-android], [perftest:triage]
Depends on: 1788026
See Also: → 1781237, 1780817

A few weeks is a long time; can we bump up the priority of fixing the beta scheduling to this week or early next week?

(In reply to Joel Maher ( :jmaher ) (UTC -0800) from comment #10)

a few weeks is a long time, can we bump up the priority of fixing the beta scheduling to this week or early next week?

Kash will be working on that.

Flags: needinfo?(kshampur)

:jmaher, I was looking at the autoland pushes and I noticed on a couple occasions that the android perf tests were scheduled within 4-5 commits of each other (adding ~100 tasks to the queue within the span of an hour):

https://treeherder.mozilla.org/jobs?repo=autoland&searchStr=browsertime%2Candroid&fromchange=4c76664026b55d57999e109b5bc5429d986df9ab
A more direct link: https://treeherder.mozilla.org/jobs?repo=autoland&searchStr=browsertime%2Candroid&tochange=49f976dc75d595a5de73540b0ce666a435d60c65&fromchange=f23316f3486fdb33c535f7ae1131d12e91258022

Do you know if this is expected behaviour for the scheduling?

I've also noticed that Fenix is now scheduling the perf tests more than once a day on some occasions (I'm thinking this is because it's building more often): https://treeherder.mozilla.org/jobs?repo=fenix&searchStr=browsertime%2Candroid&fromchange=19831e5c79c27813a647bce1c24898ddebe60544

Flags: needinfo?(jmaher)

Oh! I found this change that happened on August 12th (shipping nightly twice a day), which is suspiciously close to when the queue issues started: https://github.com/mozilla-mobile/fenix/commit/5a8a7f549946fc8ad6ccf31f8c9c6bc2180aaed2

August 12 2021?

Ah darn thanks for catching that, that felt too good to be true! I still think we could reduce our frequency to once a day.

(In reply to Greg Mierzwinski [:sparky] from comment #15)

Ah darn thanks for catching that, that felt too good to be true! I still think we could reduce our frequency to once a day.

No disagreement on that for perf tests. Strong disagreement on the Nightly build frequency. If we're setting up a cron for Beta, can we use that on other branches/repos also?

Unfortunately we can't use the crons we have in mozilla-central in the fenix branch and we'll have to build a new one there: https://github.com/mozilla-mobile/fenix/blob/main/.cron.yml

I'm going to start working on building the cron for the fenix branch. One thing I should note, though, is that we'll be using a slightly "different" Fenix build from the nightlySim group, which is virtually the same as the current build we use: https://github.com/mozilla-mobile/fenix/blob/main/taskcluster/ci/build/kind.yml#L143
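To make the shape of that work concrete, a nightly cron entry in the fenix .cron.yml might look roughly like the sketch below. This is an assumption modeled on the general taskgraph cron conventions; the job name, treeherder symbol, target-tasks-method, and schedule are all placeholders, not what will actually land:

```yaml
# Hypothetical sketch of a nightly perf-test cron entry; every value
# here (name, symbol, target-tasks-method, time) is a placeholder.
jobs:
    - name: nightly-perf-tests
      job:
          type: decision-task
          treeherder-symbol: perf-nightly
          target-tasks-method: nightly-perf
      when:
          - {hour: 4, minute: 0}
```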

Thanks for pointing out specific pushes on autoland. These were originally scheduled via the decision task; in fact, it appears that we scheduled full pushes on those.

this push (03:46:10), id: 168880:
https://treeherder.mozilla.org/jobs?repo=autoland&selectedTaskRun=Mi7bA_OPTaaOQPxlXm4vNA.0&searchStr=decision&revision=49f976dc75d595a5de73540b0ce666a435d60c65

  • the next push where pushid % 20 == 0

is a backstop; the android tp6m tasks are scheduled via skip-unless-backstop.

this push (02:41:12) id: 168874:
https://treeherder.mozilla.org/jobs?repo=autoland&selectedTaskRun=Mi7bA_OPTaaOQPxlXm4vNA.0&searchStr=decision&revision=b739eda0419c1d0dbd9ffd40a5645aa1cf1b5e6c

  • the next push 4 hours after the previous backstop

is also a backstop, so the same scheduling logic applies.

the previous backstop (aug 30th, 21:50:44) id: 168866:
https://treeherder.mozilla.org/jobs?repo=autoland&selectedTaskRun=XIYtb7sySCOpoFwxH4eyeA.0&searchStr=decision&revision=11e997d3cf78eb6a4f31a1e13a2509f4181f4b0a

here is how we calculate a backstop:
https://searchfox.org/mozilla-central/source/taskcluster/gecko_taskgraph/util/backstop.py#19

  • every 20th push
  • every 4 hours
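For illustration only, the backstop rule described above (every 20th push, or 4 hours since the previous backstop) can be sketched like this; the function name and signature are hypothetical, and the real logic lives in the backstop.py linked above:

```python
# Rough sketch of the backstop rule, NOT the actual gecko_taskgraph
# implementation: a push is a backstop if its push id is a multiple
# of 20, or if 4+ hours have passed since the previous backstop.
PUSH_INTERVAL = 20            # every 20th push
TIME_INTERVAL = 4 * 60 * 60   # 4 hours, in seconds

def is_backstop(push_id, push_time, last_backstop_time):
    """Return True if this push should be treated as a backstop."""
    if push_id % PUSH_INTERVAL == 0:
        return True
    if push_time - last_backstop_time >= TIME_INTERVAL:
        return True
    return False

# Push id 168880 is a multiple of 20, so it is a backstop on its own.
print(is_backstop(168880, 0, 0))
```

Under this rule, push 168880 from the comment above is a backstop because 168880 % 20 == 0, while push 168874 only qualifies once 4 hours have elapsed since the previous backstop, matching the two pushes discussed.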

There was one patch (bug 1780278) in taskgraph that landed August 10th (similar timing to nofis + fis for android-hw):
https://hg.mozilla.org/mozilla-central/rev/1fc676162294e9cafb0e3963879e60193153407f

a lot of code was removed from optimize/__init__.py; I assume it was moved into the taskgraph module.

Going back in time to when the __init__.py changes were made, I don't see the logic changing: it is 4 hours || pushID % 20. There are edge cases (i.e. DONTBUILD, CLOSEDTREE, etc.), but in general, looking over many days, the pattern and frequency are the same.

This tells me that the increase is most likely related to fission, and not to optimization.

Flags: needinfo?(jmaher)
Depends on: 1788643

The PR in bug 1788643 is landing soon. It'll disable perf tests on Fenix for now until we get the nightly cron enabled.

Honestly, autoland is where most of the load is; I guess small adjustments could make a big difference on a day-to-day basis.

:jmaher, my plan is to start with the Fenix+Beta changes and if we still have issues after those fixes, then I'll look into disabling a subset of android tests on autoland.

Pushed by gmierz2@outlook.com:
https://hg.mozilla.org/integration/autoland/rev/6f8a27bf7142
Prevent android from being scheduled by the perf backfills. r=jmaher

:cpeterson, what fission tests are you ok with us disabling on autoland? I'm thinking of reducing the fission tests to only essential tests for geckoview (fenix would stay the same and have all tests running in the fenix branch). If we find that we have capacity left after that, we could slowly enable them one at a time.

Here's a list of the essential tests: https://searchfox.org/mozilla-central/source/taskcluster/ci/test/browsertime-mobile.yml#120-127

Flags: needinfo?(cpeterson)

(In reply to Greg Mierzwinski [:sparky] from comment #24)

:cpeterson, I'm not sure if it'll come to this yet, but if it does, what fission tests are you ok with us disabling on autoland? I was thinking of reducing the fission tests to only essential tests for geckoview (fenix would stay the same and have all tests running in the fenix branch).

If you will still be running Fenix tests on autoland and both Fenix and GeckoView on m-c, would GeckoView be available for perf backfills on autoland? Are you running the tests on both the Galaxy A51 and Pixel 2?

If so, then you could stop running all GeckoView-Fission tests on autoland. GeckoView gives us a performance baseline, but we ship Fenix so catching its performance regressions on autoland is most important.

Or we could just run the allrecipes test in GeckoView on autoland. allrecipes.com is not a major website, but its test results are very stable and clearly show the performance difference between Fenix and GeckoView.

Flags: needinfo?(cpeterson) → needinfo?(gmierz2)

Sorry if I caused some confusion. Fenix tests run on mozilla-central and the fenix branch. We don't run them on autoland because we don't have a build available there; the fenix branch would be the closest thing to this, but it doesn't have alerting since too few commits go through there. GeckoView on autoland has alerting and is available for backfilling. Yes, we're running the tests on both the A51 and the P2.

Does this information change your thoughts on what we should run? The essential tests I'm suggesting are those which we've found to catch the most regressions (allrecipes is included in this list).

Flags: needinfo?(gmierz2) → needinfo?(cpeterson)

:cpeterson, I've attached a patch of what I'm thinking of doing. The non-essential fission perf tests will be disabled on autoland, but they'll still run nightly on mozilla-central.

Attachment #9292836 - Attachment is obsolete: true

I'm going to land the patch so we can deal with the queue, and we can adjust as needed afterwards.

Attachment #9292861 - Attachment description: Bug 1786254 - Disable non-essential fission perf tests on android. r?#perftest → Bug 1786254 - Disable non-essential android fission perf tests on autoland. r?#perftest
Pushed by gmierz2@outlook.com:
https://hg.mozilla.org/integration/autoland/rev/781a227d991b
Disable non-essential android fission perf tests on autoland. r=perftest-reviewers,kshampur

(In reply to Greg Mierzwinski [:sparky] from comment #26)

Does this information change your thoughts on what we should run? The essential tests I'm suggesting are those which we've found to catch the most regressions (allrecipes is included in this list).

We caught recent Fenix Fission regressions in warm recorded page load on Wikipedia and Google Restaurants (and some other sites, but Wikipedia and Google Restaurants had clear regressions and show a clear difference between Fission and no-Fission perf).

So the essential perf tests for GeckoView on autoland could be:

  • Wikipedia cold recorded page load on A51 and P2
  • Wikipedia warm recorded page load on A51 and P2
  • Google Restaurants cold recorded page load on A51 and P2
  • Google Restaurants warm recorded page load on A51 and P2

If that is still too many tests, we could test Wikipedia on the P2 and Google Restaurants on the A51. (Those are the test+device pairings where we found the recent Fission regressions.) Or, if running different tests on each device is too complicated or not recommended, we could drop Google Restaurants from both devices.

Flags: needinfo?(cpeterson) → needinfo?(gmierz2)

:cpeterson, thanks, that sounds good to me. I'm going to let things settle over the weekend (we've reduced the beta branch frequency now as well) and see what it looks like then. If we find that we have some capacity available, I'll enable both of them; otherwise, I'll swap a couple of the other tests out for these two.

Flags: needinfo?(gmierz2)

Bug 1788026 has been merged & uplifted to beta; cancelling the needinfo.

Flags: needinfo?(kshampur)
Depends on: 1789948
See Also: → 1790987
Whiteboard: [relops-android], [perftest:triage] → [relops-android],

This seems resolved. Closing.

Status: REOPENED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
