Closed Bug 1125476 Opened 9 years ago Closed 9 years ago

Jobs in the objectstore can get stuck in the 'loading' state if the worker dies

Categories

(Tree Management :: Treeherder: Data Ingestion, defect, P2)


Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: emorley, Unassigned)

References

Details

Query of jobs in the mozilla-inbound objectstore, grouped by state:

+----------+-----------------+
| count(*) | processed_state |
+----------+-----------------+
|     5279 | ready           |
|     4479 | loading         |
|  2397534 | complete        |
+----------+-----------------+
3 rows in set (1.23 sec)
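(For reference, output like the above comes from a simple grouping query along these lines - a sketch only, with the schema name assumed by analogy with try_objectstore_1 used later in this bug:)

SELECT count(*), processed_state
FROM mozilla_inbound_objectstore_1.objectstore
GROUP BY processed_state;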

Sheeri found jobs in the loading state dating as far back as June, as well as others from the last 30 days - so this is a pre-existing and ongoing problem. For example, there were 9000 jobs in this state for the Try repo.

When a worker takes a 'ready' job object, it marks it as loading:
https://github.com/mozilla/treeherder-service/blob/956c84e3e2d140da01369a7178d61acaac62d2d0/treeherder/model/derived/jobs.py#L2452

We then fetch all jobs with the loading state and that worker's CONNECTION_ID():
https://github.com/mozilla/treeherder-service/blob/3f762b296f86f014cfa34c33a916ae9c10f96e18/treeherder/model/sql/objectstore.json#L29

However, I'm presuming that if the worker has issues or gets killed, the job is orphaned - and unless a worker with the same CONNECTION_ID() ends up fetching it, it will stay in that state indefinitely.
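For context, the claim/fetch pattern is roughly the following - a sketch only, where the worker_id and json_blob column names and the batch size of 100 are assumptions for illustration rather than checked against the schema:

-- Claim a batch of pending rows for this connection
UPDATE objectstore
SET processed_state = 'loading', worker_id = CONNECTION_ID()
WHERE processed_state = 'ready'
LIMIT 100;

-- Fetch the rows this worker just claimed
SELECT id, json_blob
FROM objectstore
WHERE processed_state = 'loading'
AND worker_id = CONNECTION_ID();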

We should periodically reset these jobs back to the 'ready' state, so we actually ingest them.

The jobs in the objectstore have a loaded_timestamp - but it's the time the row was inserted, not the time the job was set to 'loading'. We could possibly still use that, and say that if a job has been in the queue for longer than N hours/minutes, mark it as 'ready' again (since most of the time we should be processing jobs in near realtime). However, the problem with that is that if we have infra issues or a backlog, we'll end up resetting lots of jobs that aren't actually orphaned. So we may need to add a taken_timestamp that records the time a job was marked as 'loading'.
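As a rough sketch, assuming we add the proposed taken_timestamp column (stored as a unix timestamp like loaded_timestamp, and assuming worker_id is the column holding the claiming CONNECTION_ID()), the periodic reset could look something like this, with N set to 1 hour:

-- Return orphaned rows to the queue if they've been 'loading' for over an hour
UPDATE objectstore
SET processed_state = 'ready', worker_id = NULL
WHERE processed_state = 'loading'
AND taken_timestamp < UNIX_TIMESTAMP(NOW()) - 3600;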
Example job in the objectstore:
https://www.irccloud.com/pastebin/U2bsN5Z8
Priority: P1 → P2
Looking at a week's worth of try jobs, this affected 7644 or ~4% of jobs :-(

Execute:

SELECT processed_state, count(processed_state) AS count
FROM try_objectstore_1.objectstore
WHERE loaded_timestamp
  BETWEEN UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 8 DAY))
      AND UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 1 DAY))
GROUP BY processed_state;

+-----------------+--------+
| processed_state | count  |
+-----------------+--------+
| loading         |   7644 |
| complete        | 180650 |
+-----------------+--------+
2 rows

There are many entries with the same worker id, giving credence to the idea that a worker is dying (or being killed during deploy) and leaving behind a bunch of stuck jobs:
https://emorley.pastebin.mozilla.org/8832749
The ratio of stuck to successful jobs appears to have regressed significantly since comment 0. That's particularly notable given that the comment 0 figures are "since the beginning of time", which you would expect to make the ratio there seem worse than it really was, since completed jobs are purged after 4 months whereas those in the loading state are not.
Depends on: 1140349
So Joel was seeing instances of this on a push today (https://treeherder.mozilla.org/#/jobs?repo=try&revision=5f9d4c921efa) - and we haven't deployed today.

Another theory for the spike might be these process-object exceptions I'm seeing on both prod and stage:
 exceptions:TypeError: 'NoneType' object has no attribute '__getitem__' 
https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors/3785087676

The worker could be dying and so all of the tasks it had claimed in that batch get stuck loading?
Depends on: 1162526
Depends on: 1162682
Ok, so the situation here is now significantly improved:
1) The recent chronic exception that was causing the worker to die has been fixed (bug 1162526).
2) An exception during the processing of one job no longer makes the worker lose all of the jobs it was handling (up to 100 jobs) - bug 1162682.
Summary: Jobs in the objectstore can get stuck in the 'loading' state and never get ingested → Jobs in the objectstore can get stuck in the 'loading' state if the worker dies
Depends on: 1163659
Depends on: 1163802
Depends on: 1163804
Wontfix given the objectstore is being removed in bug 1140349.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX