Closed Bug 1786254 Opened 2 years ago Closed 2 years ago

Few Android/bitbar workers are available

Categories

(Taskcluster :: Workers, defect)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: CosminS, Assigned: aerickson)

References

Details

(Keywords: leave-open, Whiteboard: [relops-android],)

Attachments

(2 files, 1 obsolete file)

Summary: No Android/bitbar workers available → Few Android/bitbar workers are available

I've asked Bitbar to investigate.

Assignee: nobody → aerickson
Status: NEW → ASSIGNED
Flags: needinfo?(mcornmesser)
Flags: needinfo?(dhouse)
Flags: needinfo?(aerickson)
Whiteboard: [relops-android]

We're up to a 69.57% health rate; just 21 workers are left with issues, 6 Pixels and 15 A51s.
Later edit: up to 89.86%, only 6 workers overall with issues.

Bitbar found many device hosts hung. Restarting seems to have fixed the issue.

We're at 98.55% devices online (only one test g5 offline).

We're planning on upgrading Docker (to support Ubuntu 22.04 images) and I'm hoping it includes some other fixes that may help.

Things seem to have stabilized. Closing.

Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED

There are still exceptions on trees (especially autoland) with tasks getting to deadline exceeded before they're picked up by any worker, like these. A larger range to see how things look on tree is this one. Is there something that can be done about it?
Right now https://earthangel-b40313e5.influxcloud.net/d/wIJoZ4HWk/android-queues?orgId=1&from=now-24h&to=now&refresh=30s lists 8 Pixel workers having issues and one A51.
Reopening for now until things get better.

Status: RESOLVED → REOPENED
Flags: needinfo?(aerickson)
Resolution: FIXED → ---

(In reply to Cosmin Sabou [:CosminS] from comment #6)

There are still exceptions on trees (especially autoland) with tasks getting to deadline exceeded before they're picked up by any worker, like these.

These all say things about being cancelled (but I don't see a cancel-all task in the decision worker).

"[taskcluster 2022-08-28T01:18:35.757Z] Command ABORTED after 33m58.363490362s"

Are those really tasks that timed out?

A larger range to see how things look on tree is this one. Is there something that can be done about it?

I do see 'deadline exceeded' for these, but I'm unsure why they'd time out, as they're not try tasks (they should be higher priority) and we were doing well with devices last Friday. Why are autoland jobs testing a single change?

:jmaher, is it possible we may have scheduled too many tests on the P2s? The A51s don't seem to have this problem (we have more of them and they're faster).

Right now https://earthangel-b40313e5.influxcloud.net/d/wIJoZ4HWk/android-queues?orgId=1&from=now-24h&to=now&refresh=30s lists 8 Pixel workers having issues and one A51.
Reopening for now until things get better.

These devices go offline... they're mobile phones that require human intervention to fix (and Bitbar doesn't work on the weekend). I don't know that there's anything I can do to deliver a higher availability.

Flags: needinfo?(aerickson) → needinfo?(jmaher)

Good question, :aerickson. I looked at a 90-day window and you can see a trend:
https://earthangel-b40313e5.influxcloud.net/d/wIJoZ4HWk/android-queues?orgId=1&refresh=5m&from=now-90d&to=now

Since August 11th, things frequently have much larger queues and backlogs, and that is when Bug 1768558 landed (running fenix/android perf tests in BOTH fission and no-fission modes).

So this would fall in line with "we are overscheduling".

This is the same problem (maybe more of a problem) on the A51 phones as well.

I think the reason we are seeing issues on the P2s is that the phones are 3.5-4 years old and have a higher chance of failing/going offline. The A51s are brand new.

I know there is talk of only running perf on the A51 and not the P2; maybe we could at least turn off non-fission on the P2, or turn off all perf there and add some A51s. Alternatively, we could reduce the tests we run (maybe no chrome tests, no live tests, or only tp6m-essential).

:sparky, could you help get a conversation going within the perftools team regarding how to reduce the scheduled load on the P2 phones?

Flags: needinfo?(jmaher) → needinfo?(gmierz2)

The chrome tests are almost completely disabled right now because of permafailures there:
https://bugzilla.mozilla.org/show_bug.cgi?id=1781237
https://bugzilla.mozilla.org/show_bug.cgi?id=1780817

On mozilla-beta, tests are running on every push right now and in the next few weeks we'll convert it to a cron running nightly: https://bugzilla.mozilla.org/show_bug.cgi?id=1788026

I think we should wait to see what it looks like once mozilla-beta uses cron since that'll reduce the number of pushes we run daily there from an average of 3-4 to 1. If that doesn't help, then we can look into starting to disable live-site tests.

Flags: needinfo?(gmierz2)
Whiteboard: [relops-android] → [relops-android], [perftest:triage]
Depends on: 1788026
See Also: → 1781237, 1780817

A few weeks is a long time; can we bump up the priority of fixing the beta scheduling to this week or early next week?

(In reply to Joel Maher ( :jmaher ) (UTC -0800) from comment #10)

a few weeks is a long time, can we bump up the priority of fixing the beta scheduling to this week or early next week?

Kash will be working on that.

Flags: needinfo?(kshampur)

:jmaher, I was looking at the autoland pushes and I noticed on a couple occasions that the android perf tests were scheduled within 4-5 commits of each other (adding ~100 tasks to the queue within the span of an hour):

https://treeherder.mozilla.org/jobs?repo=autoland&searchStr=browsertime%2Candroid&fromchange=4c76664026b55d57999e109b5bc5429d986df9ab
A more direct link: https://treeherder.mozilla.org/jobs?repo=autoland&searchStr=browsertime%2Candroid&tochange=49f976dc75d595a5de73540b0ce666a435d60c65&fromchange=f23316f3486fdb33c535f7ae1131d12e91258022

Do you know if this is expected behaviour for the scheduling?

I've also noticed that Fenix is now scheduling the perf tests more than once a day on some occasions (I'm thinking this is because it's building more often): https://treeherder.mozilla.org/jobs?repo=fenix&searchStr=browsertime%2Candroid&fromchange=19831e5c79c27813a647bce1c24898ddebe60544

Flags: needinfo?(jmaher)

Oh! I found this change that happened on August 12th (shipping nightly twice a day), which is suspiciously close to when the queue issues started: https://github.com/mozilla-mobile/fenix/commit/5a8a7f549946fc8ad6ccf31f8c9c6bc2180aaed2

August 12 2021?

Ah darn thanks for catching that, that felt too good to be true! I still think we could reduce our frequency to once a day.

(In reply to Greg Mierzwinski [:sparky] from comment #15)

Ah darn thanks for catching that, that felt too good to be true! I still think we could reduce our frequency to once a day.

No disagreement on that for perf tests. Strong disagreement on the Nightly build frequency. If we're setting up a cron for Beta, can we use that on other branches/repos also?

Unfortunately we can't use the crons we have in mozilla-central in the fenix branch and we'll have to build a new one there: https://github.com/mozilla-mobile/fenix/blob/main/.cron.yml

I'm going to start working on building the cron for the fenix branch. One thing I should note, though, is that we'll be using a slightly "different" Fenix build from the nightlySim group, which is virtually the same as the current build we use: https://github.com/mozilla-mobile/fenix/blob/main/taskcluster/ci/build/kind.yml#L143
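To make the shape of that work concrete, a nightly cron entry in the fenix .cron.yml might look roughly like the sketch below. This is an assumption modeled on the general taskgraph cron conventions; the job name, treeherder symbol, target-tasks-method, and schedule are all placeholders, not what will actually land:

```yaml
# Hypothetical sketch of a nightly perf-test cron entry; every value
# here (name, symbol, target-tasks-method, time) is a placeholder.
jobs:
    - name: nightly-perf-tests
      job:
          type: decision-task
          treeherder-symbol: perf-nightly
          target-tasks-method: nightly-perf
      when:
          - {hour: 4, minute: 0}
```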

Thanks for pointing out specific pushes on autoland. These were originally scheduled via the decision task; in fact, it appears that we scheduled full pushes on those.

this push (03:46:10), id: 168880:
https://treeherder.mozilla.org/jobs?repo=autoland&selectedTaskRun=Mi7bA_OPTaaOQPxlXm4vNA.0&searchStr=decision&revision=49f976dc75d595a5de73540b0ce666a435d60c65

  • the next push where pushid % 20 == 0

is a backstop; the android tp6m tasks are scheduled via skip-unless-backstop.

this push (02:41:12) id: 168874:
https://treeherder.mozilla.org/jobs?repo=autoland&selectedTaskRun=Mi7bA_OPTaaOQPxlXm4vNA.0&searchStr=decision&revision=b739eda0419c1d0dbd9ffd40a5645aa1cf1b5e6c

  • the next push 4 hours after the previous backstop

is also a backstop, so the same scheduling logic applies.

the previous backstop (aug 30th, 21:50:44) id: 168866:
https://treeherder.mozilla.org/jobs?repo=autoland&selectedTaskRun=XIYtb7sySCOpoFwxH4eyeA.0&searchStr=decision&revision=11e997d3cf78eb6a4f31a1e13a2509f4181f4b0a

here is how we calculate a backstop:
https://searchfox.org/mozilla-central/source/taskcluster/gecko_taskgraph/util/backstop.py#19

  • every 20th push
  • every 4 hours
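For illustration only, the backstop rule described above (every 20th push, or 4 hours since the previous backstop) can be sketched like this; the function name and signature are hypothetical, and the real logic lives in the backstop.py linked above:

```python
# Rough sketch of the backstop rule, NOT the actual gecko_taskgraph
# implementation: a push is a backstop if its push id is a multiple
# of 20, or if 4+ hours have passed since the previous backstop.
PUSH_INTERVAL = 20            # every 20th push
TIME_INTERVAL = 4 * 60 * 60   # 4 hours, in seconds

def is_backstop(push_id, push_time, last_backstop_time):
    """Return True if this push should be treated as a backstop."""
    if push_id % PUSH_INTERVAL == 0:
        return True
    if push_time - last_backstop_time >= TIME_INTERVAL:
        return True
    return False

# Push id 168880 is a multiple of 20, so it is a backstop on its own.
print(is_backstop(168880, 0, 0))
```

Under this rule, push 168880 from the comment above is a backstop because 168880 % 20 == 0, while push 168874 only qualifies once 4 hours have elapsed since the previous backstop, matching the two pushes discussed.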

There was one patch (bug 1780278) in taskgraph that landed August 10th (similar timing to nofis + fis for android-hw):
https://hg.mozilla.org/mozilla-central/rev/1fc676162294e9cafb0e3963879e60193153407f

a lot of code was removed from optimize/__init__.py; I assume it was moved into the taskgraph module.

Going back in time to when the __init__.py changes were made, I don't see the logic changing: it is 4 hours || pushID % 20. There are edge cases (i.e. DONTBUILD, CLOSEDTREE, etc.), but in general, looking over many days, the pattern and frequency are the same.

This tells me that the increase is most likely related to fission, and not to optimization.

Flags: needinfo?(jmaher)
Depends on: 1788643

The PR in bug 1788643 is landing soon. It'll disable perf tests on Fenix for now until we get the nightly cron enabled.

Honestly, autoland is where most of the load is; I guess small adjustments could make a big difference on a day-to-day basis.

:jmaher, my plan is to start with the Fenix+Beta changes and if we still have issues after those fixes, then I'll look into disabling a subset of android tests on autoland.

Pushed by gmierz2@outlook.com:
https://hg.mozilla.org/integration/autoland/rev/6f8a27bf7142
Prevent android from being scheduled by the perf backfills. r=jmaher

:cpeterson, what fission tests are you ok with us disabling on autoland? I'm thinking of reducing the fission tests to only essential tests for geckoview (fenix would stay the same and have all tests running in the fenix branch). If we find that we have capacity left after that, we could slowly enable them one at a time.

Here's a list of the essential tests: https://searchfox.org/mozilla-central/source/taskcluster/ci/test/browsertime-mobile.yml#120-127

Flags: needinfo?(cpeterson)

(In reply to Greg Mierzwinski [:sparky] from comment #24)

:cpeterson, I'm not sure if it'll come to this yet, but if it does, what fission tests are you ok with us disabling on autoland? I was thinking of reducing the fission tests to only essential tests for geckoview (fenix would stay the same and have all tests running in the fenix branch).

If you will still be running Fenix tests on autoland and both Fenix and GeckoView on m-c, would GeckoView be available for perf backfills on autoland? Are you running the tests on both the Galaxy A51 and Pixel 2?

If so, then you could stop running all GeckoView-Fission tests on autoland. GeckoView gives us a performance baseline, but we ship Fenix so catching its performance regressions on autoland is most important.

Or we could just run the allrecipes test in GeckoView on autoland. allrecipes.com is not a major website, but its test results are very stable and clearly show the performance difference between Fenix and GeckoView.

Flags: needinfo?(cpeterson) → needinfo?(gmierz2)

Sorry if I caused some confusion. Fenix tests run on mozilla-central and the fenix branch. We don't run them on autoland because we don't have a build available there; the fenix branch would be the closest thing to this, but it doesn't have alerting since too few commits go through there. GeckoView on autoland has alerting and is available for backfilling. Yes, we're running the tests on both the A51 and the P2.

Does this information change your thoughts on what we should run? The essential tests I'm suggesting are those which we've found to catch the most regressions (allrecipes is included in this list).

Flags: needinfo?(gmierz2) → needinfo?(cpeterson)

:cpeterson, I've attached a patch of what I'm thinking of doing. The non-essential fission perf tests will be disabled on autoland, but they'll still run nightly on mozilla-central.

Attachment #9292836 - Attachment is obsolete: true

I'm going to land the patch so we can deal with the queue, and we can adjust as needed afterwards.

Attachment #9292861 - Attachment description: Bug 1786254 - Disable non-essential fission perf tests on android. r?#perftest → Bug 1786254 - Disable non-essential android fission perf tests on autoland. r?#perftest
Pushed by gmierz2@outlook.com:
https://hg.mozilla.org/integration/autoland/rev/781a227d991b
Disable non-essential android fission perf tests on autoland. r=perftest-reviewers,kshampur

(In reply to Greg Mierzwinski [:sparky] from comment #26)

Does this information change your thoughts on what we should run? The essential tests I'm suggesting are those which we've found to catch the most regressions (allrecipes is included in this list).

We caught recent Fenix Fission regressions in warm recorded page load on Wikipedia and Google Restaurants (and some other sites, but Wikipedia and Google Restaurants had clear regressions and show a clear difference between Fission and no-Fission perf).

So the essential perf tests for GeckoView on autoland could be:

  • Wikipedia cold recorded page load on A51 and P2
  • Wikipedia warm recorded page load on A51 and P2
  • Google Restaurants cold recorded page load on A51 and P2
  • Google Restaurants warm recorded page load on A51 and P2

If that is still too many tests, we could test Wikipedia on the P2 and Google Restaurants on the A51. (Those are the test+device pairings where we found the recent Fission regressions.) Or, if running different tests on each device is too complicated or not recommended, we could drop Google Restaurants from both devices.

Flags: needinfo?(cpeterson) → needinfo?(gmierz2)

:cpeterson, thanks, that sounds good to me. I'm going to let things settle over the weekend (we've reduced the beta branch frequency now as well) and see what it looks like then. If we find that we have some capacity available, I'll enable both of them; otherwise, I'll swap a couple of the other tests out for these two.

Flags: needinfo?(gmierz2)

Bug 1788026 has been merged & uplifted to beta; cancelling the needinfo.

Flags: needinfo?(kshampur)
Depends on: 1789948
See Also: → 1790987
Whiteboard: [relops-android], [perftest:triage] → [relops-android],

This seems resolved. Closing.

Status: REOPENED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
