Explain why Bitbar Android devices (like Pixel 2) sometimes take ~20 minutes to pick up a new task from a backed-up queue
Categories: Taskcluster :: Operations and Service Requests, task
Tracking: Not tracked
People: Reporter: nalexander, Assigned: bc
Looking at URLs like https://tools.taskcluster.net/provisioners/proj-autophone/worker-types/gecko-t-bitbar-gw-perf-p2/workers/bitbar/pixel2-53, I see that devices often take a long time, including up to 20 minutes, to pick up a new task. (This is from a queue with 100s of pending jobs.)
For example, the second row (currently), with Task ID as_5QhUDRsqJYYwhHsO39w, says it finished at 2019-07-02T18:30:40.678Z. But the next task (Task ID Lj4weZwWTWC4Uo1X4wO4kA) started at 2019-07-02T18:50:41.791Z.
That’s 20 minutes. We have ~500 tasks in the queue.
Where is that time going? Generally I see ~5-8 minutes visually, but sometimes it's 1 or 2 minutes and sometimes much more. Is it possible that we don’t release the device until some additional processing happens?
So: can we produce a distribution for each device, and across all devices, so that we can get more useful information about how fast the queue is actually servicing tasks? (Maybe these already exist -- that would be great!)
And can we explain where the bubbles in the pipeline are? 'cuz we're very, very resource-constrained here and can't afford to lose even 5% of our throughput.
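(A minimal sketch, not an existing tool, of how such a per-device distribution could be computed, assuming the (finished, next started) timestamp pairs for each worker have already been scraped from worker views like the one linked above; `tasks_by_worker` and its layout are illustrative.)

```python
from datetime import datetime
from statistics import mean, median

# Illustrative input: for each worker, chronological pairs of
# (previous task finished, next task started) timestamps, as shown
# in the worker view linked above.
tasks_by_worker = {
    "pixel2-53": [
        ("2019-07-02T18:30:40.678Z", "2019-07-02T18:50:41.791Z"),
        # ... more (finished, started) pairs ...
    ],
}

def parse(ts):
    # Taskcluster timestamps are ISO 8601 UTC with a trailing 'Z'.
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")

all_gaps = []
for worker, pairs in tasks_by_worker.items():
    gaps = [(parse(started) - parse(finished)).total_seconds() / 60
            for finished, started in pairs]
    all_gaps.extend(gaps)
    print(f"{worker}: min={min(gaps):.1f} median={median(gaps):.1f} "
          f"max={max(gaps):.1f} minutes idle between tasks")

print(f"all devices: mean={mean(all_gaps):.1f} "
      f"median={median(all_gaps):.1f} minutes")
```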
Comment 1 (Reporter) • 6 years ago
aerickson: are you the right person to do the initial investigation? Bitbar (on Slack) suggested that they turn devices around in ~1.5 minutes (which is great).
Comment 2 (Reporter) • 6 years ago
(In reply to Nick Alexander :nalexander [he/him] from comment #1)
> aerickson: are you the right person to do the initial investigation? Bitbar (on Slack) suggested that they turn devices around in ~1.5 minutes (which is great).
Redirecting to bc.
From Slack:
Sakari Rautiainen [12:43 PM] @nalexander Here are the steps as they look from our end:
2. The device host the device is connected to loads configuration and input files, plus performs a few other initialization steps - this takes a few seconds.
3. The device host starts a container, attaches the device to it, and dedicates an IP for the container.
4. The "task cluster" part starts.
...
5. The "task cluster" part ends.
6. The container is stopped/killed and the device is returned to host control. Test results are sent back. Takes about 1 minute.
7. The device is cleaned - this takes about 1 minute with a Pixel 2, about 3 minutes with a Moto G5.
8. Device available -> jump to step #1.
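(Taking the quoted step timings at face value, a rough back-of-the-envelope sum of the fixed per-task device overhead; my own arithmetic, not Bitbar's, with an assumed value for the unquoted container start in step 3.)

```python
# Rough per-task device overhead, in seconds, from Bitbar's step list
# above. Step 3's container start time is not quoted, so it is a guess.
overhead_s = {
    "load config / init (step 2)":            10,  # "a few seconds"
    "start container + attach (step 3)":      15,  # assumed, not quoted
    "stop container / send results (step 6)": 60,  # "about 1 minute"
    "clean device, Pixel 2 (step 7)":         60,  # "about 1 min with pixel 2"
}
total = sum(overhead_s.values())
print(f"~{total} s = ~{total / 60:.1f} min fixed overhead per Pixel 2 task")
# ~2.4 minutes of device-side overhead per task: more than the ~1.5 min
# Bitbar quoted on Slack, but nowhere near the ~20 min gaps observed,
# which points at scheduling rather than device handling.
```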
Comment 3 (Assignee) • 6 years ago
I've worked around the issue caused by the backlog and Taskcluster superseding jobs, which caused our test run manager to not start enough tests. However, I continue to see this issue where devices are not being scheduled. I've commented on the Bitbar Slack and will follow up tomorrow when they are available.
Comment 4 (Assignee) • 6 years ago
Initially my workaround started too many Bitbar tests, which appeared to make the Bitbar Testdroid server slow to a crawl. I reduced the number of waiting tests I would create, deleted the backlog of waiting tests (Bitbar tests, not ours), and restarted. That appears to have worked for getting through the queues. We have also had much better device utilization since then. We no longer appear to be superseding tasks, which means we are at a higher level of utilization, and I don't see stale devices, with the exception of a couple of offline ones. More news after I talk to Bitbar.
Comment 5 (Assignee) • 6 years ago
The first issue is the short run times of superseded tasks, which result in tests finishing long before the project loop comes back around to process the workertype/project/devicegroup. I filed bug 1563307 to see if we can get generic-worker to keep consuming superseded tasks until it hits a good one.
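(A sketch of the claim loop that bug 1563307 asks for, written as illustrative Python rather than generic-worker's actual code; `claim_next_task`, `is_superseded`, and `resolve_superseded` are hypothetical helpers.)

```python
def claim_next_task(queue):
    """Hypothetical helper: claim the next pending task, or None if empty."""
    ...

def is_superseded(task):
    """Hypothetical helper: ask the task's superseder service whether a
    newer task has replaced this one."""
    ...

def resolve_superseded(task):
    """Hypothetical helper: resolve the claimed task as superseded."""
    ...

def claim_until_live(queue):
    # Superseded tasks finish in seconds, so discarding them inline is
    # much cheaper than idling until the manager's next pass.
    while (task := claim_next_task(queue)) is not None:
        if is_superseded(task):
            resolve_superseded(task)
            continue  # immediately try the next pending task
        return task   # a live task worth running
    return None       # queue drained; fall back to normal polling
```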
The second issue is the long interval between successive checks for pending tasks for a workertype/project/devicegroup. Starting a test at Bitbar can take 10-13 seconds. For the large device groups for mozilla-gw-perftest-{g5,p2} this can result in the observed idle time for some devices, especially now that I am pre-populating waiting Bitbar test runs. The solution is to make the test run manager multithreaded and check each workertype/project/devicegroup on its own thread, so we don't have to wait for the others to be processed. I've filed bug 1563377 for multi-threading.
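(A minimal sketch of the proposed multi-threading from bug 1563377, one polling thread per workertype/project/devicegroup; `process_device_group` and the poll interval are hypothetical stand-ins for the real test run manager logic.)

```python
import threading
import time

POLL_INTERVAL = 60  # seconds between checks for one group; illustrative

def process_device_group(group):
    """Hypothetical stand-in for the manager's real per-group work:
    check pending tasks and start Bitbar test runs for one
    workertype/project/devicegroup."""
    ...

def group_loop(group):
    # Each group polls independently, so a 10-13 s Bitbar test start in
    # one group no longer delays the checks for every other group.
    while True:
        process_device_group(group)
        time.sleep(POLL_INTERVAL)

for group in ("mozilla-gw-perftest-g5", "mozilla-gw-perftest-p2"):
    threading.Thread(target=group_loop, args=(group,), daemon=True).start()
# (In the real manager, the main thread would supervise these workers.)
```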
A final possible issue is with Bitbar's database and cloud server. We'll continue investigating that next week, after the holiday.
Closing this out as I believe we have the answers we were looking for.