Closed Bug 1562988 Opened 6 years ago Closed 6 years ago

Explain why bitbar Android devices (like Pixel 2) sometimes take ~20 minutes to pick up a new task from a backed up queue

Categories

(Taskcluster :: Operations and Service Requests, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nalexander, Assigned: bc)

References

Details

Looking at URLs like https://tools.taskcluster.net/provisioners/proj-autophone/worker-types/gecko-t-bitbar-gw-perf-p2/workers/bitbar/pixel2-53, I see that devices often take a long time, including up to 20 minutes, to pick up a new task. (This is from a queue with 100s of pending jobs.)

For example, the second row (currently), with Task ID as_5QhUDRsqJYYwhHsO39w, says it finished at 2019-07-02T18:30:40.678Z. But the next task (with Task ID Lj4weZwWTWC4Uo1X4wO4kA) started at 2019-07-02T18:50:41.791Z.
That's 20 minutes. We have ~500 tasks in the queue.

Where is that time going? Generally I see ~5-8 minutes visually, but sometimes it's 1 or 2 minutes and sometimes much more. Is it possible that we don’t release the device until some additional processing happens?

So: can we produce a distribution for each device, and across all devices, so that we can get more useful information about how fast the queue is actually servicing tasks? (Maybe these already exist -- that would be great!)
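As a starting point for the requested distribution, the gap between one task's resolution and the next task's start can be computed directly from the timestamps shown in the worker view. The sketch below is illustrative: it assumes the (finished, next-started) timestamp pairs have already been collected (e.g. from the Taskcluster queue), and the sample data is just the single 20-minute gap quoted above.

```python
# Hedged sketch: summarize per-device idle gaps between successive tasks.
# Assumes the task history has already been fetched (e.g. via the
# Taskcluster queue API); the sample below is the one gap quoted in
# this bug, not real bulk data.
from datetime import datetime
from statistics import mean, median

def parse(ts):
    # Taskcluster timestamps look like 2019-07-02T18:30:40.678Z
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")

def idle_gaps(task_times):
    """task_times: chronologically sorted (resolved, next_started)
    ISO-8601 pairs for one device; returns gaps in seconds."""
    return [(parse(started) - parse(resolved)).total_seconds()
            for resolved, started in task_times]

history = [
    # as_5QhUDRsqJYYwhHsO39w resolved -> Lj4weZwWTWC4Uo1X4wO4kA started
    ("2019-07-02T18:30:40.678Z", "2019-07-02T18:50:41.791Z"),
]
gaps = idle_gaps(history)
print("min/median/mean/max (s):",
      min(gaps), median(gaps), mean(gaps), max(gaps))
```

With real per-worker histories, the same `idle_gaps` list feeds straight into a histogram per device and across all devices.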

And can we explain where the bubbles in the pipeline are? 'cuz we're very, very resource constrained here and can't afford to lose even 5% of our throughput.

aerickson: are you the right person to do the initial investigation? Bitbar (on Slack) suggested that they turn devices around in ~1.5 minutes (which is great).

Flags: needinfo?(aerickson)

(In reply to Nick Alexander :nalexander [he/him] from comment #1)

aerickson: are you the right person to do the initial investigation? Bitbar (on Slack) suggested that they turn devices around in ~1.5 minutes (which is great).

Redirecting to bc.

From Slack:

Sakari Rautiainen [12:43 PM] @nalexander Here are the steps as they look from our end:

2. The device host where the device is connected loads configuration and input files, plus performs a few other initialization steps - this takes a few seconds
3. The device host starts a container, attaches the device to it, and dedicates an IP to the container.
4. "task cluster" part starts
...
5. "task cluster" part ends
6. The container is stopped/killed and the device is returned to host control. Test results are sent back. Takes about 1 minute
7. The device is cleaned - this takes about 1 min with a Pixel 2, about 3 min with a Moto G5
8. Device available -> jump to step #1
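Summing the per-step overheads Bitbar quotes above gives a rough lower bound on per-task turnaround for a Pixel 2, which can be compared against the ~1.5 minutes they claimed on Slack. The "a few seconds" figure is an assumption (~5 s); the rest are the quoted durations.

```python
# Back-of-envelope sketch: sum the per-step overheads quoted by Bitbar
# to estimate the minimum per-task turnaround on a Pixel 2. The ~5 s
# figure for host init is an assumption ("a few seconds" in the quote).
overhead_s = {
    "host init (config + input files)": 5,      # "a few seconds" (assumed)
    "results upload + container teardown": 60,  # "about 1 minute"
    "device clean (Pixel 2)": 60,               # "about 1 min with Pixel 2"
}
total = sum(overhead_s.values())
print(f"~{total} s (~{total / 60:.1f} min) overhead per task on a Pixel 2")
```

That lands in the ~2-minute range, roughly consistent with Bitbar's ~1.5-minute claim, and well short of the 20-minute gaps observed, so the missing time must be on the scheduling side.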
Flags: needinfo?(aerickson) → needinfo?(bob)

I've worked around the issue caused by the backlog and Taskcluster superseding jobs, which caused our test run manager to not start enough tests. However, I continue to see this issue where devices are not being scheduled. I've commented on Bitbar Slack and will follow up tomorrow when they are available.

Initially my workaround started too many Bitbar tests, which appeared to slow the Bitbar testdroid server to a crawl. I reduced the number of waiting tests I would create, deleted the backlog of waiting tests (Bitbar tests, not ours), and restarted. That appears to have worked for working through the queues, and we have had much better device utilization since then. We appear to no longer be superseding tasks, which means we are at a higher level of utilization, and I don't see stale devices with the exception of a couple of offline ones. More news after I talk to Bitbar.

The first issue is the short run times of superseded tasks, which result in tests finishing long before the project loop comes back around to process the workertype/project/devicegroup. I filed bug 1563307 to see if we can get generic-worker to keep consuming superseded tasks until it hits a good one.

The second issue is the long interval between successive checks for pending tasks for a workertype/project/devicegroup. Starting a test at Bitbar can take 10-13 seconds. For the large device groups for mozilla-gw-perftest-{g5,p2}, this can result in the observed idle time for some devices, especially now that I am pre-populating waiting Bitbar test runs. The solution for this is to make the test run manager multithreaded and check each workertype/project/devicegroup on its own thread so we don't have to wait for the others to be processed. I've filed bug 1563377 for multi-threading.
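The multithreading fix described above can be sketched roughly as follows. This is a minimal illustration, not the test run manager's actual code: `check_pending_and_start` is a hypothetical stand-in for the real per-group logic, and the poll interval is an assumption.

```python
# Hedged sketch of the proposed fix (bug 1563377): poll each
# workertype/project/devicegroup on its own thread so a slow test
# start (10-13 s at Bitbar) in one group doesn't stall the others.
# check_pending_and_start() is a hypothetical placeholder.
import threading
import time

POLL_INTERVAL_S = 30  # assumed polling interval, not from this bug
checked = []          # record of which groups each loop has serviced
lock = threading.Lock()

def check_pending_and_start(device_group):
    # Placeholder: query the Taskcluster queue for pending tasks in
    # this group and start Bitbar test runs as needed.
    with lock:
        checked.append(device_group)

def poll_loop(device_group, stop):
    # One loop per group; stop.wait() doubles as an interruptible sleep.
    while not stop.is_set():
        check_pending_and_start(device_group)
        stop.wait(POLL_INTERVAL_S)

stop = threading.Event()
groups = ["mozilla-gw-perftest-p2", "mozilla-gw-perftest-g5"]
threads = [threading.Thread(target=poll_loop, args=(g, stop), daemon=True)
           for g in groups]
for t in threads:
    t.start()
time.sleep(0.5)  # let every loop run at least once
stop.set()
for t in threads:
    t.join()
print(sorted(set(checked)))
```

The key property is that each group's 10-13 second test-start latency is now paid only on that group's thread instead of serializing the whole project loop.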

A final possible issue is with Bitbar's database and cloud server. We'll continue investigating that next week after the holiday.

Closing this out as I believe we have the answers we were looking for.

Assignee: nobody → bob
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(bob)
Resolution: --- → FIXED
See Also: → 1563307, 1563377