Closed Bug 1125476 Opened 9 years ago Closed 9 years ago

Jobs in the objectstore can get stuck in the 'loading' state if the worker dies

Categories

(Tree Management :: Treeherder: Data Ingestion, defect, P2)


Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: emorley, Unassigned)

References

Details

Query of jobs in the mozilla-inbound objectstore, grouped by state:

+----------+-----------------+
| count(*) | processed_state |
+----------+-----------------+
|     5279 | ready           |
|     4479 | loading         |
|  2397534 | complete        |
+----------+-----------------+
3 rows in set (1.23 sec)
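(For reference, output like the above comes from a simple grouping query along these lines - a sketch only, with the schema name assumed by analogy with try_objectstore_1 used later in this bug:)

SELECT count(*), processed_state
FROM mozilla_inbound_objectstore_1.objectstore
GROUP BY processed_state;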

Sheeri found jobs in the loading state dating as far back as June, as well as others from the last 30 days - so this is a pre-existing and ongoing problem. For example, there were 9000 jobs in this state for the Try repo.

When a worker takes a 'ready' job object, it marks it as loading:
https://github.com/mozilla/treeherder-service/blob/956c84e3e2d140da01369a7178d61acaac62d2d0/treeherder/model/derived/jobs.py#L2452

We then fetch all jobs with the loading state and that worker's CONNECTION_ID():
https://github.com/mozilla/treeherder-service/blob/3f762b296f86f014cfa34c33a916ae9c10f96e18/treeherder/model/sql/objectstore.json#L29

However, I'm presuming that if the worker has issues or gets killed, the job is orphaned - and unless a worker with the same CONNECTION_ID() ends up fetching it, it will stay in that state indefinitely.
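For context, the claim/fetch pattern is roughly the following - a sketch only, where the worker_id and json_blob column names and the batch size of 100 are assumptions for illustration rather than checked against the schema:

-- Claim a batch of pending rows for this connection
UPDATE objectstore
SET processed_state = 'loading', worker_id = CONNECTION_ID()
WHERE processed_state = 'ready'
LIMIT 100;

-- Fetch the rows this worker just claimed
SELECT id, json_blob
FROM objectstore
WHERE processed_state = 'loading'
AND worker_id = CONNECTION_ID();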

We should periodically reset these jobs back to the 'ready' state, so we actually ingest them.

The jobs in the objectstore have a loaded_timestamp - but it's the time the row was inserted, not the time the job was set to 'loading'. We could possibly still use that, and say that if a job has been in the queue for longer than N hours/minutes, mark it as 'ready' again (since most of the time we should be processing jobs in near realtime). However, the problem with that is that if we have infra issues or a backlog, we'll end up resetting lots of jobs that aren't actually orphaned. So we may need to add a taken_timestamp that records the time a job was marked as 'loading'.
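As a rough sketch, assuming we add the proposed taken_timestamp column (stored as a unix timestamp like loaded_timestamp, and assuming worker_id is the column holding the claiming CONNECTION_ID()), the periodic reset could look something like this, with N set to 1 hour:

-- Return orphaned rows to the queue if they've been 'loading' for over an hour
UPDATE objectstore
SET processed_state = 'ready', worker_id = NULL
WHERE processed_state = 'loading'
AND taken_timestamp < UNIX_TIMESTAMP(NOW()) - 3600;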
Example job in the objectstore:
https://www.irccloud.com/pastebin/U2bsN5Z8
Priority: P1 → P2
Looking at a week's worth of try jobs, this affected 7644 or ~4% of jobs :-(

Execute:

SELECT processed_state, count(processed_state) AS count
FROM try_objectstore_1.objectstore
WHERE loaded_timestamp
  BETWEEN UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 8 DAY))
      AND UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 1 DAY))
GROUP BY processed_state;

+-----------------+--------+
| processed_state | count  |
+-----------------+--------+
| loading         |   7644 |
| complete        | 180650 |
+-----------------+--------+
2 rows

There are many entries with the same worker id, giving credence to the idea that a worker is dying (or being killed during deploy) and leaving behind a bunch of stuck jobs:
https://emorley.pastebin.mozilla.org/8832749
The ratio of stuck to successful jobs appears to have regressed significantly since comment 0. That's particularly notable given that the comment 0 figures are "since the beginning of time", which you would expect to make the ratio there seem worse than it really was, since completed jobs are purged after 4 months whereas those in the loading state are not.
Depends on: 1140349
So Joel was seeing instances of this on a push today (https://treeherder.mozilla.org/#/jobs?repo=try&revision=5f9d4c921efa) - and we haven't deployed today.

Another theory for the spike might be these process-object exceptions I'm seeing on both prod and stage:
 exceptions:TypeError: 'NoneType' object has no attribute '__getitem__' 
https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors/3785087676

The worker could be dying and so all of the tasks it had claimed in that batch get stuck loading?
Depends on: 1162526
Depends on: 1162682
Ok, so the situation here is now significantly improved:
1) The recent chronic exception that was causing the worker to die has been fixed (bug 1162526).
2) An exception during the processing of one job no longer makes the worker lose all of the jobs it was handling (up to 100 jobs) - bug 1162682.
Summary: Jobs in the objectstore can get stuck in the 'loading' state and never get ingested → Jobs in the objectstore can get stuck in the 'loading' state if the worker dies
Depends on: 1163659
Depends on: 1163802
Depends on: 1163804
Wontfix given the objectstore is being removed in bug 1140349.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX