Treeherder currently misses jobs (at least for a frequently failing machine)
Categories
(Tree Management :: Treeherder: Data Ingestion, defect)
Tracking
(Not tracked)
People
(Reporter: aryx, Unassigned)
Details
The pixel2-02 device had many failures: https://tools.taskcluster.net/provisioners/proj-autophone/worker-types/gecko-t-bitbar-gw-unit-p2/workers/bitbar/pixel2-02
https://tools.taskcluster.net/groups/N2eUUQOJTzaZaCS8uqbcog/tasks/Q29tjsV0Qr-Uu1c7sVauHQ/runs/0 says there should be a failed 'test-android-hw-p2-8-0-arm7-api-16-qr/opt-geckoview-reftest-e10s-4' job, but the linked push doesn't show one: https://treeherder.mozilla.org/#/jobs?repo=try&resultStatus=superseded%2Ctestfailed%2Cbusted%2Cexception%2Csuccess%2Cretry%2Cusercancel%2Crunning%2Cpending%2Crunnable&tier=1%2C2%2C3&revision=5836c979b7a1b472288e095607d6524457a033e3&searchStr=geckoview%2Creftest
Querying the Treeherder database shows only 1 recent failure and 1 superseded job; the others are successful.
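For reference, a minimal sketch of how that database check could be reproduced from a local Treeherder checkout via ./manage.py shell; the model and field names below follow the Treeherder Django models but are assumptions that may differ between versions:

# Minimal sketch (assumption: Treeherder's Django models expose Job with a
# machine foreign key plus state/result fields); run via ./manage.py shell.
from django.db.models import Count
from treeherder.model.models import Job

# Count recent jobs for the worker by state/result to see whether the
# failures reported by Taskcluster actually made it into the database.
summary = (
    Job.objects.filter(machine__name="pixel2-02")
    .values("state", "result")
    .annotate(count=Count("id"))
    .order_by("-count")
)
for row in summary:
    print(row)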
Reporter
Updated•5 years ago

Reporter
Comment 1•5 years ago

Reporter
Comment 2•5 years ago
Treeherder knows about the jobs, but because it still regards them as running, they carry the start of the Unix epoch as their timestamp.
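If this comes up again, a quick way to confirm that symptom would be the following sketch, under the same schema assumptions as above and assuming stuck jobs stay in the 'running' state:

# Sketch: list jobs for the worker that Treeherder still considers running.
# Assumption: stuck jobs keep their default timestamps, so the Unix epoch
# (1970-01-01) is expected to show up in start_time/end_time below.
from treeherder.model.models import Job

stuck = Job.objects.filter(machine__name="pixel2-02", state="running")
for job in stuck.values("id", "state", "result", "submit_time", "start_time", "end_time"):
    print(job)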
Comment 3•5 years ago
I mentioned this to :Aryx on IRC, but I'm recording it here too.
All these failures happened on physical Android devices running in Bitbar. Each of these devices is controlled by a device host that acts as the intermediary between Taskcluster and the device, similar to a foopy for anyone old enough to remember those. The device host introduces another layer that could be intercepting or failing to propagate status.
If the devices are failing tests in rapid succession, it's possible the device host can't keep up.
Comment 4•5 years ago
Sounds like the failed task result isn't getting through to Treeherder. Either it wasn't sent to us, or we somehow dropped it during ingestion.
Armenzg: Is this something that would fall into your domain to investigate?
Aryx: Do you have a more recent example of this happening? The original links are now pointing to the TC instance that was shut down Nov 9th.
Updated•5 years ago
Reporter
Comment 5•5 years ago
I couldn't find a recent occurrence (I looked at the Android jobs from comment 0).
Comment 6•5 years ago
This is harder to look at since all the links are broken (post-migration).
If this happens again, the ingestion can be tested locally like this:
# One tab
docker-compose up
# Another tab
docker-compose run backend bash
./manage.py ingest_push_and_tasks task --task-id <id>
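A complementary cross-check (a sketch, not part of the original instructions) is to fetch the task's resolved status straight from the Taskcluster queue and compare it with what the ingestion loaded. The root URL below assumes the firefox-ci cluster, and the task id is just the example from comment 0 (which now points to the decommissioned instance), so substitute a current one:

# Sketch: query the Taskcluster queue's task status endpoint and print each
# run's state, so it can be compared against the job Treeherder ingested.
import json
from urllib.request import urlopen

ROOT_URL = "https://firefox-ci-tc.services.mozilla.com"  # assumption: firefox-ci cluster
TASK_ID = "Q29tjsV0Qr-Uu1c7sVauHQ"  # example task id from comment 0

with urlopen(f"{ROOT_URL}/api/queue/v1/task/{TASK_ID}/status") as resp:
    status = json.load(resp)

for run in status["status"]["runs"]:
    print(run["runId"], run.get("state"), run.get("resolved"))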
Please re-open if you have post-migration examples.
Comment 7•5 years ago
FYI, I also fixed bug 1595902 not long ago, which I believe could be related to this.