not enough capacity in gecko-t-bitbar-gw-perf-p2 worker pool to satisfy demand from trunk trees
Categories
(Testing :: General, task, P3)
Tracking
(Not tracked)
People
(Reporter: aryx, Unassigned)
References
Details
gecko-t-bitbar-gw-perf-p2 workers run the Android 8.0 performance tasks (raptor, browsertime), and the queue of pending tasks almost never drains to zero.
From a time window during the last UTC night in which builds were busted (so few new tasks were being scheduled), the current maximum throughput of these machines can be approximated at ~100 tasks/hour.
https://sql.telemetry.mozilla.org/queries/66264/source shows 66 such tasks run per autoland push with a full task set. The full set gets scheduled only on every fifth push, but at least hourly, so possibly more than once per hour. Pushes to central schedule 160+ such tasks.
That means more tasks get scheduled than can be run during working days; waiting tasks either get dropped after 24h or coalesced.
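To make the mismatch concrete, here is a minimal back-of-the-envelope sketch; the capacity figure is the observed ceiling above, but the push rates are illustrative assumptions, not measurements:

# Back-of-the-envelope demand vs. capacity for gecko-t-bitbar-gw-perf-p2.
capacity_per_hour = 100            # observed ceiling from the build-bustage window

autoland_full_sets_per_hour = 1.5  # assumption: "at least hourly", more when busy
autoland_tasks_per_full_set = 66
central_pushes_per_day = 3         # assumption: a few mozilla-central pushes/day
central_tasks_per_push = 160

hourly_demand = (autoland_full_sets_per_hour * autoland_tasks_per_full_set
                 + central_pushes_per_day * central_tasks_per_push / 24)
print(f"demand ~{hourly_demand:.0f} tasks/hour vs. capacity {capacity_per_hour}")
# demand ~119 tasks/hour: the backlog grows until tasks expire after 24h.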
:davehunt, have you been scheduling more tests on the p2 devices? This summer (July/August) we had spare capacity.
:aerickson, what is the average number of devices online daily? Has the number of available devices been reduced since August?
Comment 2•5 years ago
:rwood are we running browsertime regularly against Android? That's the only potential increase that I can think of.
Comment 3•5 years ago
(In reply to Dave Hunt [:davehunt] [he/him] ⌚BST from comment #2)
:rwood are we running browsertime regularly against Android? That's the only potential increase that I can think of.
Yes, we are running 2 browsertime jobs (tp6m-1, tp6m-1-c) on the geckoview example app on central and the integration branches:
https://searchfox.org/mozilla-central/source/taskcluster/ci/test/browsertime-mobile.yml
Comment 4•5 years ago
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #1)
:aerickson, what is the average number of devices online daily? Has the number of available devices been reduced since August?
For the perf-p2 queue:
- We've had 35 devices configured for the last 90 days (in the past 6 months we've been as high as 40).
- When the queue is full, we've hit 34 concurrent devices working (averaging 30 active at a time due to reboots and cleaning up after tests).
- Over the last 90 days, we've had an average of 1 device offline at any given time.
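Cross-checking these device counts against the ~100 tasks/hour ceiling from comment 0 gives a rough implied per-task duration; a sketch using only the figures quoted above:

# What per-device task duration do these numbers imply?
tasks_per_hour = 100   # observed pool throughput (comment 0)
active_devices = 30    # average concurrently working devices (this comment)

tasks_per_device_hour = tasks_per_hour / active_devices
minutes_per_task = 60 / tasks_per_device_hour
print(f"~{tasks_per_device_hour:.1f} tasks/device/hour, "
      f"~{minutes_per_task:.0f} minutes per task")
# ~3.3 tasks/device/hour, ~18 minutes per task incl. reboot/cleanup overhead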
Thanks for the replies; it sounds like the 2 extra jobs pushed us past our capacity limit. :rwood, could those be run on mozilla-central only, or do they need to run on integration too? Maybe a single job would suffice?
I see that we have 2 jobs (tp6m-1, tp6m-1-c) for both p2-arm64 and p2-pgo, which is 4 jobs per m-c push. In addition we run on refbrow, so in aggregate we add 20 jobs on m-c (currently broken with 30+ minute timeouts); on autoland we run opt/pgo for both armv7 and arm64, so 8 jobs are scheduled for every full run (~20 times/day).
I would recommend reducing the ~180 daily jobs from automation to something lower; maybe drop opt and run only pgo? That would cut it roughly in half. Also, if we are running both browsertime and raptor, can we cut the duplication down to one browser and one config? Maybe keep the duplication only on m-c?
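For reference, a quick reconstruction of the ~180 jobs/day figure from the counts above; a sketch in which the mozilla-central push rate is an assumption:

# Reconstructing the ~180 daily jobs estimate.
autoland_jobs_per_full_run = 8    # opt+pgo for armv7 and arm64, 2 jobs each
autoland_full_runs_per_day = 20
central_jobs_per_push = 20        # incl. refbrow (currently timing out)
central_pushes_per_day = 1        # assumption; the m-c push rate varies

daily_jobs = (autoland_jobs_per_full_run * autoland_full_runs_per_day
              + central_jobs_per_push * central_pushes_per_day)
print(f"~{daily_jobs} jobs/day")  # ~180
# Dropping opt and keeping only pgo roughly halves the autoland contribution.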
Comment 6•5 years ago
Ah right, sorry, I missed that the browsertime Android jobs also run on fenix and refbrow. I believe they are running on both central and the integration branches for data comparison reasons (raptor vs. browsertime), but :sparky, could we just do the validation on try pushes instead?
I agree. To reduce the duplication, let's run the raptor-browsertime Android jobs only on central, on the geckoview example app, cold page load only; that will be enough to verify the automation, with the option to still run everything on try. Then, if/when we get the final go-ahead to switch over to browsertime, we can turn off the other raptor Android jobs as we enable more browsertime ones.
Also, good suggestion Joel: the raptor/browsertime Android jobs should run on pgo only.
:sparky, :davehunt, are you ok with that?
Comment 8•5 years ago
Sorry for the delay; yes, this sounds good to me. It looks like the dependency is now resolved. How is the capacity looking now?
Comment 9•5 years ago
We're currently slammed with 1200 jobs in the p2-perf queue, though it does look like jobs have decreased a bit since last week.
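For anyone wanting to watch the backlog, the pending count can be read straight from the Taskcluster queue API; a sketch using the taskcluster Python client, where the provisioner ID is an assumption and may differ in the Firefox CI deployment:

# Query the number of pending tasks for the bitbar perf-p2 worker type.
# Requires: pip install taskcluster
import taskcluster

queue = taskcluster.Queue(
    {"rootUrl": "https://firefox-ci-tc.services.mozilla.com"})
# "proj-autophone" is an assumption for where the bitbar devices are listed.
result = queue.pendingTasks("proj-autophone", "gecko-t-bitbar-gw-perf-p2")
print(result["pendingTasks"])  # e.g. 1200 when the queue is slammed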
Reporter
Comment 10•5 years ago
Autoland pushes are down to 58 jobs/push with all jobs, and central to 108. That still leaves little capacity for try pushes (e.g. the most recent try pushes, which request 600 jobs on these machines, might never see them run because tasks get discarded after 24h of waiting).
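To see why those try jobs can expire before running: if the trunk trees already consume most of the ~100 tasks/hour throughput, draining a 600-job try backlog takes longer than the 24h task expiry. A sketch, where the spare-capacity figure is an assumption:

# Why a large try push can expire before running.
pool_throughput_per_hour = 100   # observed ceiling (comment 0)
spare_for_try_per_hour = 20      # assumption: what the trunk trees leave over
try_backlog = 600                # jobs requested by the recent try pushes

hours_to_drain = try_backlog / spare_for_try_per_hour
print(f"~{hours_to_drain:.0f}h to drain")  # ~30h, past the 24h expiry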