Closed Bug 1163659 Opened 9 years ago Closed 9 years ago

Rescue/clean up jobs that are stuck in the loading state in the objectstore

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P3)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: emorley)

References

Details

Due to bug 1125476, if a worker is killed/dies/hits an exception, the rows in the objectstore that it had claimed for processing get stuck in the 'loading' state.

Until recently (bug 1162682), for exceptions this would include _all_ of the jobs in the batch of up to 100.

Let's rescue jobs from the last couple of months on try, and delete any others.
For Try, since the beginning of Treeherder:

Execute:
> SELECT COUNT(*) FROM try_objectstore_1.objectstore
WHERE `processed_state` = 'loading'
AND `loaded_timestamp` < UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 4 hour))

+ ------------- +
| COUNT(*)      |
+ ------------- +
| 56379         |
+ ------------- +
1 rows

For the last 3 months:

Execute:
> SELECT COUNT(*) FROM objectstore
WHERE `processed_state` = 'loading'
AND `loaded_timestamp` BETWEEN UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 3 month))
AND UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 4 hour))

+ ------------- +
| COUNT(*)      |
+ ------------- +
| 51566         |
+ ------------- +
1 rows

...which gives credence to the idea that bug 1162526 was the main issue (it was introduced by bug 1113873 in ~March), and that with it fixed, bug 1125476 should cause us far fewer problems.
Execute:
> SELECT COUNT(*) FROM try_objectstore_1.objectstore
WHERE `processed_state` = 'loading'
AND `loaded_timestamp` BETWEEN UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 7 day))
AND UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 1 hour))

+ ------------- +
| COUNT(*)      |
+ ------------- +
| 4671          |
+ ------------- +
1 rows

> UPDATE try_objectstore_1.objectstore SET `worker_id` = NULL, `processed_state` = 'ready' WHERE `processed_state` = 'loading' AND `loaded_timestamp` BETWEEN UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 7 day)) AND UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 1 hour))

4671 row(s) affected Rows matched: 4671  Changed: 4671  Warnings: 0

> DELETE FROM try_objectstore_1.objectstore WHERE `processed_state` = 'loading' AND `loaded_timestamp` < UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 8 day))

51613 row(s) affected
So after waiting a while for all of the 'ready' rows to turn to 'loading' and then hopefully 'completed', we have:

Execute:
> SELECT processed_state, error, COUNT(processed_state) FROM try_objectstore_1.objectstore
WHERE `loaded_timestamp` BETWEEN UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 7 day))
AND UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 2 day))
GROUP BY processed_state, error

+ -------------------- + ---------- + --------------------------- +
| processed_state      | error      | COUNT(processed_state)      |
+ -------------------- + ---------- + --------------------------- +
| loading              | N          | 2345                        |
| complete             | N          | 1865                        |
+ -------------------- + ---------- + --------------------------- +
2 rows

So a number of them did get imported OK, but 2345 are still getting stuck!? (The completed count hasn't changed for at least 20 minutes, so I doubt they are still in progress.)

I'm starting to get suspicious about the code that marks rows as errored :-/
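
A quick way to test that suspicion would be something along these lines (a sketch using the same table/columns as the queries above, and assuming the error flag is set to 'Y' when a row is marked as errored); it should show whether the error-marking path has fired at all in the last week:

SELECT COUNT(*) FROM try_objectstore_1.objectstore
WHERE `error` = 'Y'
AND `loaded_timestamp` > UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 7 day))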
Execute:
> SELECT job_guid, error FROM try_objectstore_1.objectstore
WHERE `loaded_timestamp` BETWEEN UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 7 day))
AND UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 1 hour))
AND processed_state = "loading"
ORDER BY id
LIMIT 5

+ ------------- + ---------- +
| job_guid      | error      |
+ ------------- + ---------- +
| 18f348cc29cd07a02e425bbd41187c17c70a9360 | N          |
| f9a589f4849d171beb4e67edf52787c67662008b | N          |
| 783d2ac95b318746834bbbb0c6d4ba232dabbe64 | N          |
| 64f6faaa58260879164338dfafd88b084add64dd | N          |
| ba1dc1dd59fe4c602b933ecca137c4ae3ae57426 | N          |
+ ------------- + ---------- +
5 rows

Execute:
> SELECT id, job_coalesced_to_guid, state, result FROM try_jobs_1.job
WHERE job_guid IN 
("18f348cc29cd07a02e425bbd41187c17c70a9360", "f9a589f4849d171beb4e67edf52787c67662008b",
"783d2ac95b318746834bbbb0c6d4ba232dabbe64", "64f6faaa58260879164338dfafd88b084add64dd",
"ba1dc1dd59fe4c602b933ecca137c4ae3ae57426")

+ ------- + -------------------------- + ---------- + ----------- +
| id      | job_coalesced_to_guid      | state      | result      |
+ ------- + -------------------------- + ---------- + ----------- +
| 7212045 |                            | completed  | usercancel  |
| 7212025 |                            | completed  | usercancel  |
| 7212026 |                            | completed  | usercancel  |
| 7212029 |                            | completed  | usercancel  |
| 7212027 |                            | completed  | usercancel  |
| NULL    | NULL                       | NULL       | NULL        |
+ ------- + -------------------------- + ---------- + ----------- +
6 rows

Ah ha!

Two bugs:

1) In load_job_data() we use _remove_existing_jobs() to filter out jobs that already exist in the job table in the same state (i.e. filter out a completed job if we already have that same guid as completed in the job table):
https://github.com/mozilla/treeherder/blob/a8cd88aa4f590b5dad741121e1215a41db41e329/treeherder/model/derived/jobs.py#L1162
...however, by filtering out these jobs, we never mark the corresponding objectstore rows as done with processing at the end of load_job_data():
https://github.com/mozilla/treeherder/blob/a8cd88aa4f590b5dad741121e1215a41db41e329/treeherder/model/derived/jobs.py#L1359

2) When people press cancel on a job, we mark the job as usercancel even before we get confirmation from builds-4hr. We have to do this because cancelling a pending job means the job simply disappears from builds-4hr's point of view. However, it seems we're doing this for running jobs too, so we end up with a job marked as "completed" (result "usercancel") in the job table; when that job finally shows up in builds-4hr, _remove_existing_jobs() treats it as a duplicate and filters it out.
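
A rough way to surface the rows hit by the second bug (a sketch against the tables/columns used in the queries above, not something run as part of the numbers in this bug) would be to join the stuck objectstore rows to the job table and look for usercancel:

SELECT os.id, os.job_guid, j.state, j.result
FROM try_objectstore_1.objectstore os
JOIN try_jobs_1.job j ON j.job_guid = os.job_guid
WHERE os.processed_state = 'loading'
AND j.state = 'completed'
AND j.result = 'usercancel'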
Depends on: 1163802
Depends on: 1163804
The objectstore: the gift that keeps on giving. 
Filed bug 1163804, bug 1163802 for the issues found in comment 4.
They've been there pretty much since the beginning - but I guess we just don't cancel that many running jobs (compared to the total number).
Depends on: 1163996
Depends on: 1164824
I've deleted all entries older than 2 days from the main repos on stage+prod. I'll do another proper sweep once bug 1164824 and either of bug 1163802 or bug 1163804 are fixed.
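
That per-repo sweep follows the same shape as the Try DELETE above, i.e. something along these lines for each repo (the database name here is just an illustrative example, and the actual sweep may not have been restricted to 'loading' rows):

DELETE FROM mozilla_inbound_objectstore_1.objectstore
WHERE `processed_state` = 'loading'
AND `loaded_timestamp` < UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 2 day))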
This is lower priority now that (a) bug 1162682 & bug 1162526 are fixed, so fewer jobs are getting stuck, and (b) I purged/recovered many of the existing stuck jobs in comment 6.
Priority: P2 → P3
The worst of this was fixed a while ago, and now with bug 1140349, we can stop worrying about the rest :-)
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED