Closed Bug 1562988 Opened 6 years ago Closed 6 years ago

Explain why bitbar Android devices (like Pixel 2) sometimes take ~20 minutes to pick up a new task from a backed up queue

Categories

(Taskcluster :: Operations and Service Requests, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nalexander, Assigned: bc)

References

Details

Looking at URLs like https://tools.taskcluster.net/provisioners/proj-autophone/worker-types/gecko-t-bitbar-gw-perf-p2/workers/bitbar/pixel2-53, I see that devices often take a long time, including up to 20 minutes, to pick up a new task. (This is from a queue with 100s of pending jobs.)

For example, the second row (currently), with Task ID as_5QhUDRsqJYYwhHsO39w, says it finished at 2019-07-02T18:30:40.678Z. But the next task (with Task ID Lj4weZwWTWC4Uo1X4wO4kA) started at 2019-07-02T18:50:41.791Z.
That's 20 minutes. We have ~500 tasks in the queue.

Where is that time going? Generally I see ~5-8 minutes visually, but sometimes it's 1 or 2 minutes and sometimes much more. Is it possible that we don’t release the device until some additional processing happens?

So: can we produce a distribution for each device, and across all devices, so that we can get more useful information about how fast the queue is actually servicing tasks? (Maybe these already exist -- that would be great!)
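As a starting point for the requested distribution, the gap between one task's resolution and the next task's start can be computed directly from the timestamps shown in the worker view. The sketch below is illustrative: it assumes the (finished, next-started) timestamp pairs have already been collected (e.g. from the Taskcluster queue), and the sample data is just the single 20-minute gap quoted above.

```python
# Hedged sketch: summarize per-device idle gaps between successive tasks.
# Assumes the task history has already been fetched (e.g. via the
# Taskcluster queue API); the sample below is the one gap quoted in
# this bug, not real bulk data.
from datetime import datetime
from statistics import mean, median

def parse(ts):
    # Taskcluster timestamps look like 2019-07-02T18:30:40.678Z
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")

def idle_gaps(task_times):
    """task_times: chronologically sorted (resolved, next_started)
    ISO-8601 pairs for one device; returns gaps in seconds."""
    return [(parse(started) - parse(resolved)).total_seconds()
            for resolved, started in task_times]

history = [
    # as_5QhUDRsqJYYwhHsO39w resolved -> Lj4weZwWTWC4Uo1X4wO4kA started
    ("2019-07-02T18:30:40.678Z", "2019-07-02T18:50:41.791Z"),
]
gaps = idle_gaps(history)
print("min/median/mean/max (s):",
      min(gaps), median(gaps), mean(gaps), max(gaps))
```

With real per-worker histories, the same `idle_gaps` list feeds straight into a histogram per device and across all devices.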

And can we explain where the bubbles in the pipeline are? 'cuz we're very, very resource constrained here and can't afford to lose even 5% of our throughput.

aerickson: are you the right person to do the initial investigation? Bitbar (on Slack) suggested that they turn devices around in ~1.5 minutes (which is great).

Flags: needinfo?(aerickson)

(In reply to Nick Alexander :nalexander [he/him] from comment #1)

aerickson: are you the right person to do the initial investigation? Bitbar (on Slack) suggested that they turn devices around in ~1.5 minutes (which is great).

Redirecting to bc.

From Slack:

Sakari Rautiainen [12:43 PM] @nalexander Here are the steps as they look from our end:

2. The device host where the device is connected loads configuration and input files, plus performs a few other initialization steps - this takes a few seconds
3. The device host starts a container, attaches the device to it, and dedicates an IP to the container.
4. "task cluster" part starts
...
5. "task cluster" part ends
6. The container is stopped/killed and the device is returned to host control. Test results are sent back. Takes about 1 minute
7. The device is cleaned - this takes about 1 min with a Pixel 2, about 3 min with a Moto G5
8. Device available -> jump to step #1
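Summing the per-step overheads Bitbar quotes above gives a rough lower bound on per-task turnaround for a Pixel 2, which can be compared against the ~1.5 minutes they claimed on Slack. The "a few seconds" figure is an assumption (~5 s); the rest are the quoted durations.

```python
# Back-of-envelope sketch: sum the per-step overheads quoted by Bitbar
# to estimate the minimum per-task turnaround on a Pixel 2. The ~5 s
# figure for host init is an assumption ("a few seconds" in the quote).
overhead_s = {
    "host init (config + input files)": 5,      # "a few seconds" (assumed)
    "results upload + container teardown": 60,  # "about 1 minute"
    "device clean (Pixel 2)": 60,               # "about 1 min with Pixel 2"
}
total = sum(overhead_s.values())
print(f"~{total} s (~{total / 60:.1f} min) overhead per task on a Pixel 2")
```

That lands in the ~2-minute range, roughly consistent with Bitbar's ~1.5-minute claim, and well short of the 20-minute gaps observed, so the missing time must be on the scheduling side.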
Flags: needinfo?(aerickson) → needinfo?(bob)

I've worked around the issue caused by the backlog and Taskcluster superseding jobs, which caused our test run manager to not start enough tests. However, I continue to see this issue where devices are not being scheduled. I've commented on Bitbar Slack and will follow up tomorrow when they are available.

Initially my workaround started too many Bitbar tests, which appeared to slow the Bitbar testdroid server to a crawl. I reduced the number of waiting tests I would create, deleted the backlog of waiting tests (Bitbar tests, not ours), and restarted. That appears to have worked for working through the queues, and we have had much better device utilization since then. We appear to no longer be superseding tasks, which means we are at a higher level of utilization, and I don't see stale devices with the exception of a couple of offline ones. More news after I talk to Bitbar.

The first issue is the short run times of superseded tasks, which result in tests finishing long before the project loop comes back around to process the workertype/project/devicegroup. I filed bug 1563307 to see if we can get generic-worker to keep consuming superseded tasks until it hits a good one.

The second issue is the long interval between successive checks for pending tasks for a workertype/project/devicegroup. Starting a test at Bitbar can take 10-13 seconds. For the large device groups for mozilla-gw-perftest-{g5,p2}, this can result in the observed idle time for some devices, especially now that I am pre-populating waiting Bitbar test runs. The solution for this is to make the test run manager multithreaded and check each workertype/project/devicegroup on its own thread so we don't have to wait for the others to be processed. I've filed bug 1563377 for multi-threading.
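The multithreading fix described above can be sketched roughly as follows. This is a minimal illustration, not the test run manager's actual code: `check_pending_and_start` is a hypothetical stand-in for the real per-group logic, and the poll interval is an assumption.

```python
# Hedged sketch of the proposed fix (bug 1563377): poll each
# workertype/project/devicegroup on its own thread so a slow test
# start (10-13 s at Bitbar) in one group doesn't stall the others.
# check_pending_and_start() is a hypothetical placeholder.
import threading
import time

POLL_INTERVAL_S = 30  # assumed polling interval, not from this bug
checked = []          # record of which groups each loop has serviced
lock = threading.Lock()

def check_pending_and_start(device_group):
    # Placeholder: query the Taskcluster queue for pending tasks in
    # this group and start Bitbar test runs as needed.
    with lock:
        checked.append(device_group)

def poll_loop(device_group, stop):
    # One loop per group; stop.wait() doubles as an interruptible sleep.
    while not stop.is_set():
        check_pending_and_start(device_group)
        stop.wait(POLL_INTERVAL_S)

stop = threading.Event()
groups = ["mozilla-gw-perftest-p2", "mozilla-gw-perftest-g5"]
threads = [threading.Thread(target=poll_loop, args=(g, stop), daemon=True)
           for g in groups]
for t in threads:
    t.start()
time.sleep(0.5)  # let every loop run at least once
stop.set()
for t in threads:
    t.join()
print(sorted(set(checked)))
```

The key property is that each group's 10-13 second test-start latency is now paid only on that group's thread instead of serializing the whole project loop.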

A final possible issue is with Bitbar's database and cloud server. We'll continue investigating that next week after the holiday.

Closing this out as I believe we have the answers we were looking for.

Assignee: nobody → bob
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(bob)
Resolution: --- → FIXED
See Also: → 1563307, 1563377