Bug 1163659 (Closed)
Opened 9 years ago
Closed 9 years ago
Rescue/clean up jobs that are stuck in the loading state in the objectstore
Categories: Tree Management :: Treeherder: Infrastructure, defect, P3
Status: RESOLVED FIXED
People: (Reporter: emorley, Assigned: emorley)
Due to bug 1125476, if a worker is killed, dies, or hits an exception, the objectstore rows it had claimed for processing get stuck in the 'loading' state. Until recently (bug 1162682), for exceptions this would include _all_ of the jobs in the batch of up to 100. Let's rescue jobs from the last couple of months on try, and delete any others.
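For context, the claim/complete cycle looks roughly like this (a sketch of the pattern, not the exact Treeherder SQL; using CONNECTION_ID() as the worker id is an assumption):

-- worker claims a batch of up to 100 rows
UPDATE objectstore
SET worker_id = CONNECTION_ID(), processed_state = 'loading'
WHERE processed_state = 'ready'
LIMIT 100;

-- after successfully ingesting the batch, the worker is expected to run:
UPDATE objectstore
SET processed_state = 'complete'
WHERE worker_id = CONNECTION_ID() AND processed_state = 'loading';

-- if the worker dies between the two UPDATEs, nothing ever resets the rows,
-- so they sit in 'loading' indefinitely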
Assignee
Comment 1 • 9 years ago
For Try, since the beginning of Treeherder:

Execute:
> SELECT COUNT(*) FROM try_objectstore_1.objectstore
  WHERE `processed_state` = 'loading'
  AND `loaded_timestamp` < UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 4 hour))

+ ------------- +
| COUNT(*)      |
+ ------------- +
| 56379         |
+ ------------- +
1 rows

For the last 3 months:

Execute:
> SELECT COUNT(*) FROM objectstore
  WHERE `processed_state` = 'loading'
  AND `loaded_timestamp` BETWEEN UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 3 month))
  AND UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 4 hour))

+ ------------- +
| COUNT(*)      |
+ ------------- +
| 51566         |
+ ------------- +
1 rows

...which gives credence to the idea that bug 1162526 was the main issue (it was introduced by bug 1113873 in ~March), and with that fixed, bug 1125476 should cause us far fewer problems.
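One way to sanity-check the "~March" attribution would be to bucket the stuck rows by month; a sketch (not run here) against the same table:

SELECT FROM_UNIXTIME(`loaded_timestamp`, '%Y-%m') AS month, COUNT(*)
FROM try_objectstore_1.objectstore
WHERE `processed_state` = 'loading'
AND `loaded_timestamp` < UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 4 hour))
GROUP BY month
ORDER BY month;

-- if bug 1162526 is the main culprit, the counts should jump sharply
-- from 2015-03 onwards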
Assignee
Comment 2 • 9 years ago
Execute:
> SELECT COUNT(*) FROM try_objectstore_1.objectstore
  WHERE `processed_state` = 'loading'
  AND `loaded_timestamp` BETWEEN UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 7 day))
  AND UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 1 hour))

+ ------------- +
| COUNT(*)      |
+ ------------- +
| 4671          |
+ ------------- +
1 rows

> UPDATE try_objectstore_1.objectstore
  SET `worker_id` = NULL, `processed_state` = 'ready'
  WHERE `processed_state` = 'loading'
  AND `loaded_timestamp` BETWEEN UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 7 day))
  AND UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 1 hour))

4671 row(s) affected
Rows matched: 4671  Changed: 4671  Warnings: 0

> DELETE FROM try_objectstore_1.objectstore
  WHERE `processed_state` = 'loading'
  AND `loaded_timestamp` < UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 8 day))

51613 row(s) affected
Assignee
Comment 3 • 9 years ago
So after waiting a while for all of the 'ready' rows to turn to 'loading' and then hopefully 'completed', we have:
Execute:
> SELECT processed_state, error, COUNT(processed_state) FROM try_objectstore_1.objectstore
WHERE `loaded_timestamp` BETWEEN UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 7 day))
AND UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 2 day))
GROUP BY processed_state, error
+ -------------------- + ---------- + --------------------------- +
| processed_state | error | COUNT(processed_state) |
+ -------------------- + ---------- + --------------------------- +
| loading | N | 2345 |
| complete | N | 1865 |
+ -------------------- + ---------- + --------------------------- +
2 rows
So a number of them did get imported OK, but 2345 are still getting stuck!? (The completed count hasn't changed for at least 20 minutes, so I doubt they're still in progress.)
I'm starting to get suspicious about the code that marks rows as errored :-/
Assignee
Comment 4 • 9 years ago
Execute:
> SELECT job_guid, error FROM try_objectstore_1.objectstore
  WHERE `loaded_timestamp` BETWEEN UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 7 day))
  AND UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 1 hour))
  AND processed_state = "loading"
  ORDER BY id LIMIT 5

+ ------------------------------------------ + ------- +
| job_guid                                   | error   |
+ ------------------------------------------ + ------- +
| 18f348cc29cd07a02e425bbd41187c17c70a9360   | N       |
| f9a589f4849d171beb4e67edf52787c67662008b   | N       |
| 783d2ac95b318746834bbbb0c6d4ba232dabbe64   | N       |
| 64f6faaa58260879164338dfafd88b084add64dd   | N       |
| ba1dc1dd59fe4c602b933ecca137c4ae3ae57426   | N       |
+ ------------------------------------------ + ------- +
5 rows

Execute:
> SELECT id, job_coalesced_to_guid, state, result FROM try_jobs_1.job
  WHERE job_guid IN ("18f348cc29cd07a02e425bbd41187c17c70a9360",
                     "f9a589f4849d171beb4e67edf52787c67662008b",
                     "783d2ac95b318746834bbbb0c6d4ba232dabbe64",
                     "64f6faaa58260879164338dfafd88b084add64dd",
                     "ba1dc1dd59fe4c602b933ecca137c4ae3ae57426")

+ --------- + ----------------------- + ----------- + ------------ +
| id        | job_coalesced_to_guid   | state       | result       |
+ --------- + ----------------------- + ----------- + ------------ +
| 7212045   |                         | completed   | usercancel   |
| 7212025   |                         | completed   | usercancel   |
| 7212026   |                         | completed   | usercancel   |
| 7212029   |                         | completed   | usercancel   |
| 7212027   |                         | completed   | usercancel   |
| NULL      | NULL                    | NULL        | NULL         |
+ --------- + ----------------------- + ----------- + ------------ +
6 rows

Ah ha! Two bugs:

1) In load_job_data() we use _remove_existing_jobs() to filter out jobs that already exist in the job table in the same state (i.e. filter out a completed job if we already have that same guid as completed in the job table):
https://github.com/mozilla/treeherder/blob/a8cd88aa4f590b5dad741121e1215a41db41e329/treeherder/model/derived/jobs.py#L1162
...however, by filtering out these jobs, we never mark their objectstore rows as done with processing at the end of load_job_data():
https://github.com/mozilla/treeherder/blob/a8cd88aa4f590b5dad741121e1215a41db41e329/treeherder/model/derived/jobs.py#L1359

2) When someone presses cancel on a job, we mark it as usercancel even before we get confirmation from builds-4hr. We have to do this because cancelling a pending job means the job just disappears from a builds-4hr point of view. However, it seems we also do it for running jobs, so we end up with a job marked as "completed" (with result usercancel) in the job table; when we finally see that job in builds-4hr, _remove_existing_jobs() decides it's a dupe and filters it out.
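Given bug 1), the stuck rows whose guids already exist in the job table could be rescued without re-ingesting them, by marking them complete directly. A sketch using the schemas from the queries above (the exact column semantics are assumed):

UPDATE try_objectstore_1.objectstore os
JOIN try_jobs_1.job j ON j.job_guid = os.job_guid
SET os.processed_state = 'complete'
WHERE os.processed_state = 'loading'
AND os.loaded_timestamp < UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 4 hour));

-- only touches rows whose job already exists in try_jobs_1.job;
-- genuinely unprocessed rows stay 'loading' for the reset-to-'ready' rescue
-- used in comment 2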
Assignee
Comment 5 • 9 years ago
The objectstore: the gift that keeps on giving. Filed bug 1163804, bug 1163802 for the issues found in comment 4. They've been there pretty much since the beginning - but I guess we just don't cancel that many running jobs (compared to the total number).
Assignee
Comment 6 • 9 years ago
I've deleted all entries older than 2 days from the main repos on stage+prod. I'll do another proper sweep once bug 1164824 and either bug 1163802 or bug 1163804 are fixed.
Assignee
Comment 7 • 9 years ago
This is lower priority now that (a) bug 1162682 & bug 1162526 are fixed, so fewer jobs are getting stuck, and (b) I purged/recovered many of the existing stuck jobs in comment 6.
Priority: P2 → P3
Assignee
Comment 8 • 9 years ago
The worst of this was fixed a while ago, and now with bug 1140349, we can stop worrying about the rest :-)
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED