Closed
Bug 1125476
Opened 9 years ago
Closed 9 years ago
Jobs in the objectstore can get stuck in the 'loading' state if the worker dies
Categories
(Tree Management :: Treeherder: Data Ingestion, defect, P2)
Tree Management
Treeherder: Data Ingestion
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: emorley, Unassigned)
References
Details
Query of jobs in the mozilla-inbound objectstore, grouped by state:

+----------+-----------------+
| count(*) | processed_state |
+----------+-----------------+
|     5279 | ready           |
|     4479 | loading         |
|  2397534 | complete        |
+----------+-----------------+
3 rows in set (1.23 sec)

Sheeri found jobs as far back as June in the 'loading' state, as well as others from the last 30 days - so this is a pre-existing and ongoing problem. There were 9000 jobs in this state for the Try repo, for example.

When a worker takes a 'ready' job object, it marks it as 'loading':
https://github.com/mozilla/treeherder-service/blob/956c84e3e2d140da01369a7178d61acaac62d2d0/treeherder/model/derived/jobs.py#L2452

We then fetch all jobs with the 'loading' state and that worker's CONNECTION_ID():
https://github.com/mozilla/treeherder-service/blob/3f762b296f86f014cfa34c33a916ae9c10f96e18/treeherder/model/sql/objectstore.json#L29

However, I'm presuming that if the worker has issues or gets killed, the job is orphaned - and unless a worker with the same CONNECTION_ID() happens to fetch it later, it will stay in that state indefinitely.

We should periodically reset these jobs back to the 'ready' state, so that we actually ingest them.

The jobs in the objectstore have a loaded_timestamp - but it's the time the row was inserted, not the time the job was set to 'loading'. We could possibly still use that, and say that if a job has been in the queue for longer than N hours/minutes, we mark it as 'ready' again (since most of the time we should process jobs in near realtime). However, the problem with that is that if we have infra issues or a backlog, we'll end up resetting lots of jobs that are not actually orphaned. So we may need to add a taken_timestamp, recording the time each job was marked as 'loading'.
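The reset idea above could be sketched roughly as follows. This is a hypothetical illustration (using SQLite in place of MySQL, and an assumed cutoff of 2 hours), not actual Treeherder code; the table and column names mirror the objectstore described above, and taken_timestamp is the proposed new column.

```python
import sqlite3
import time

STUCK_CUTOFF_SECONDS = 2 * 60 * 60  # assumed N = 2 hours; tune for real traffic

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE objectstore (
        id INTEGER PRIMARY KEY,
        processed_state TEXT NOT NULL,   -- 'ready' / 'loading' / 'complete'
        loaded_timestamp INTEGER NOT NULL,  -- when the row was inserted
        taken_timestamp INTEGER             -- proposed: when a worker claimed it
    )
""")

now = int(time.time())
conn.executemany(
    "INSERT INTO objectstore (processed_state, loaded_timestamp, taken_timestamp) "
    "VALUES (?, ?, ?)",
    [
        ("loading", now - 90000, now - 86400),  # orphaned: claimed a day ago
        ("loading", now - 120, now - 60),       # recently claimed; leave alone
        ("ready", now - 30, None),              # not yet claimed
    ],
)

def reset_orphaned_jobs(conn, cutoff_seconds):
    """Return jobs stuck in 'loading' past the cutoff to the 'ready' pool."""
    cur = conn.execute(
        "UPDATE objectstore "
        "SET processed_state = 'ready', taken_timestamp = NULL "
        "WHERE processed_state = 'loading' AND taken_timestamp < ?",
        (int(time.time()) - cutoff_seconds,),
    )
    return cur.rowcount  # number of jobs rescued

n_reset = reset_orphaned_jobs(conn, STUCK_CUTOFF_SECONDS)
```

Keying the cutoff on taken_timestamp rather than loaded_timestamp is what avoids resetting jobs that merely sat in a backlog before being claimed.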
Reporter
Comment 1 • 9 years ago
Example job in the objectstore: https://www.irccloud.com/pastebin/U2bsN5Z8
Reporter
Updated • 9 years ago
Priority: P1 → P2
Reporter
Comment 3 • 9 years ago
Looking at a week's worth of try jobs, this affected 7644 or ~4% of jobs :-(

Execute:
> SELECT processed_state, count(processed_state) AS count
  FROM try_objectstore_1.objectstore
  WHERE loaded_timestamp BETWEEN UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 8 day))
                             AND UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 1 day))
  GROUP BY processed_state

+-----------------+--------+
| processed_state | count  |
+-----------------+--------+
| loading         |   7644 |
| complete        | 180650 |
+-----------------+--------+
2 rows

There are many entries with the same worker id, giving credence to the idea that a worker is dying (or being killed during a deploy) and leaving behind a bunch of stuck jobs:
https://emorley.pastebin.mozilla.org/8832749
Reporter
Comment 4 • 9 years ago
The ratio of stuck to successful jobs appears to have regressed significantly since comment 0. This is particularly notable because the comment 0 figures cover "the beginning of time", which you would expect to make the ratio look worse than it really is: completed jobs are purged after 4 months, whereas those in the 'loading' state are not.
Reporter
Comment 5 • 9 years ago
So Joel was seeing instances of this on a push today (https://treeherder.mozilla.org/#/jobs?repo=try&revision=5f9d4c921efa) - and we haven't deployed today.

Another theory for the spike might be these process-object exceptions I'm seeing on both prod and stage:

exceptions:TypeError: 'NoneType' object has no attribute '__getitem__'
https://rpm.newrelic.com/accounts/677903/applications/4180461/traced_errors/3785087676

Could the worker be dying on that exception, so that all of the tasks it had claimed in that batch get stuck in 'loading'?
Reporter
Comment 6 • 9 years ago
Ok, so the situation here is now significantly improved:

1) The recent chronic exception that was causing the worker to die has been fixed (bug 1162526).
2) An exception during the processing of one job no longer makes the worker lose all of the jobs it was handling (up to 100 jobs) - bug 1162682.
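Point 2 above can be sketched as per-job error isolation: catch and record a failure for each job individually, so one bad blob no longer orphans the rest of the batch. This is an illustrative sketch only (function and variable names are hypothetical, not the actual Treeherder process-objects implementation):

```python
def process_batch(jobs, process_one):
    """Process each job independently; return (succeeded, failed) lists.

    A failure in one job is recorded rather than propagated, so the
    remaining jobs in the batch still get processed and completed.
    """
    succeeded, failed = [], []
    for job_id, blob in jobs:
        try:
            process_one(blob)
        except Exception as exc:  # one bad job must not orphan the batch
            failed.append((job_id, repr(exc)))
        else:
            succeeded.append(job_id)
    return succeeded, failed

# Example: the second blob is None, reproducing the TypeError from comment 5;
# jobs 1 and 3 still complete instead of all three getting stuck in 'loading'.
jobs = [(1, {"state": "completed"}), (2, None), (3, {"state": "running"})]
ok, bad = process_batch(jobs, lambda blob: blob["state"])
```

Failed jobs would then be marked with an error state (or retried) rather than left claimed by a dead worker.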
Summary: Jobs in the objectstore can get stuck in the 'loading' state and never get ingested → Jobs in the objectstore can get stuck in the 'loading' state if the worker dies
Reporter
Comment 7 • 9 years ago
Wontfix, given that the objectstore is being removed in bug 1140349.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX