Rescue/clean up jobs that are stuck in the loading state in the objectstore

Status: RESOLVED FIXED
Product: Tree Management
Component: Treeherder: Infrastructure
Priority: P3
Severity: normal
Opened: 3 years ago
Last modified: 3 years ago

People

(Reporter: emorley, Assigned: emorley)


(Assignee)

Description

3 years ago
Due to bug 1125476, if a worker is killed, dies or hits an exception, the objectstore rows it had claimed for processing get stuck in the 'loading' state.

Until recently (bug 1162682), an exception would leave _all_ of the jobs in the batch of up to 100 stuck like this.

Let's rescue the jobs from the last couple of months on try, and delete any others.
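
In outline, the rescue amounts to putting the recently-stuck rows back into the queue so workers retry them, and deleting the rest. A minimal sketch only (the column names come from the queries in the comments below; the cut-offs here are illustrative, not the exact ones used):

-- Rescue: requeue stuck rows from roughly the last couple of months, but leave
-- anything claimed within the last hour alone in case a worker is still on it.
UPDATE try_objectstore_1.objectstore
SET `worker_id` = NULL, `processed_state` = 'ready'
WHERE `processed_state` = 'loading'
  AND `loaded_timestamp` BETWEEN UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 2 month))
                             AND UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 1 hour));

-- Clean up: anything older than that is not worth re-importing.
DELETE FROM try_objectstore_1.objectstore
WHERE `processed_state` = 'loading'
  AND `loaded_timestamp` < UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 2 month));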
(Assignee)

Comment 1

3 years ago
For Try, since the beginning of Treeherder:

Execute:
> SELECT COUNT(*) FROM try_objectstore_1.objectstore
WHERE `processed_state` = 'loading'
AND `loaded_timestamp` < UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 4 hour))

+ ------------- +
| COUNT(*)      |
+ ------------- +
| 56379         |
+ ------------- +
1 rows

For the last 3 months:

Execute:
> SELECT COUNT(*) FROM objectstore
WHERE `processed_state` = 'loading'
AND `loaded_timestamp` BETWEEN UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 3 month))
AND UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 4 hour))

+ ------------- +
| COUNT(*)      |
+ ------------- +
| 51566         |
+ ------------- +
1 rows

...which gives credence to the idea that bug 1162526 was the main issue (it was introduced by bug 1113873 in ~March), and that with it fixed, bug 1125476 should cause us far fewer problems.
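
The "~March" theory could be checked more directly by bucketing the stuck rows by month (a sketch, using the same table and columns as above):

-- Count the stuck rows per calendar month; a jump around March would line up
-- with bug 1113873 landing.
SELECT DATE_FORMAT(FROM_UNIXTIME(`loaded_timestamp`), '%Y-%m') AS month,
       COUNT(*)
FROM try_objectstore_1.objectstore
WHERE `processed_state` = 'loading'
  AND `loaded_timestamp` < UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 4 hour))
GROUP BY month
ORDER BY month;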
(Assignee)

Comment 2

3 years ago
Execute:
> SELECT COUNT(*) FROM try_objectstore_1.objectstore
WHERE `processed_state` = 'loading'
AND `loaded_timestamp` BETWEEN UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 7 day))
AND UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 1 hour))

+ ------------- +
| COUNT(*)      |
+ ------------- +
| 4671          |
+ ------------- +
1 rows

> UPDATE try_objectstore_1.objectstore SET `worker_id` = NULL, `processed_state` = 'ready' WHERE `processed_state` = 'loading' AND `loaded_timestamp` BETWEEN UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 7 day)) AND UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 1 hour))

4671 row(s) affected Rows matched: 4671  Changed: 4671  Warnings: 0

> DELETE FROM try_objectstore_1.objectstore WHERE `processed_state` = 'loading' AND `loaded_timestamp` < UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 8 day))

51613 row(s) affected
(Assignee)

Comment 3

3 years ago
So after waiting a while for all of the 'ready' rows to turn to 'loading' and then hopefully 'complete', we have:

Execute:
> SELECT processed_state, error, COUNT(processed_state) FROM try_objectstore_1.objectstore
WHERE `loaded_timestamp` BETWEEN UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 7 day))
AND UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 2 day))
GROUP BY processed_state, error

+ -------------------- + ---------- + --------------------------- +
| processed_state      | error      | COUNT(processed_state)      |
+ -------------------- + ---------- + --------------------------- +
| loading              | N          | 2345                        |
| complete             | N          | 1865                        |
+ -------------------- + ---------- + --------------------------- +
2 rows

So a number of them did get imported OK, but 2345 are still getting stuck!? (The 'complete' count hasn't changed for at least 20 minutes, so I doubt they are still in progress.)

I'm starting to get suspicious about the code that marks rows as errored :-/
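
One quick thing worth checking before digging into the loader code (a sketch; worker_id is the column that was reset in comment 2) is whether the stuck rows are still claimed by live workers or are orphaned:

-- Group the still-'loading' rows by the worker that claimed them. A handful of
-- worker ids each holding a large share of the rows would point at dead/killed
-- workers rather than imports that are genuinely still in progress.
SELECT worker_id, COUNT(*)
FROM try_objectstore_1.objectstore
WHERE `processed_state` = 'loading'
  AND `loaded_timestamp` BETWEEN UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 7 day))
                             AND UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 2 day))
GROUP BY worker_id
ORDER BY COUNT(*) DESC;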
(Assignee)

Comment 4

3 years ago
Execute:
> SELECT job_guid, error FROM try_objectstore_1.objectstore
WHERE `loaded_timestamp` BETWEEN UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 7 day))
AND UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 1 hour))
AND processed_state = "loading"
ORDER BY id
LIMIT 5

+ ---------------------------------------- + ---------- +
| job_guid                                 | error      |
+ ---------------------------------------- + ---------- +
| 18f348cc29cd07a02e425bbd41187c17c70a9360 | N          |
| f9a589f4849d171beb4e67edf52787c67662008b | N          |
| 783d2ac95b318746834bbbb0c6d4ba232dabbe64 | N          |
| 64f6faaa58260879164338dfafd88b084add64dd | N          |
| ba1dc1dd59fe4c602b933ecca137c4ae3ae57426 | N          |
+ ---------------------------------------- + ---------- +
5 rows

Execute:
> SELECT id, job_coalesced_to_guid, state, result FROM try_jobs_1.job
WHERE job_guid IN 
("18f348cc29cd07a02e425bbd41187c17c70a9360", "f9a589f4849d171beb4e67edf52787c67662008b",
"783d2ac95b318746834bbbb0c6d4ba232dabbe64", "64f6faaa58260879164338dfafd88b084add64dd",
"ba1dc1dd59fe4c602b933ecca137c4ae3ae57426")

+ ------- + -------------------------- + ---------- + ----------- +
| id      | job_coalesced_to_guid      | state      | result      |
+ ------- + -------------------------- + ---------- + ----------- +
| 7212045 |                            | completed  | usercancel  |
| 7212025 |                            | completed  | usercancel  |
| 7212026 |                            | completed  | usercancel  |
| 7212029 |                            | completed  | usercancel  |
| 7212027 |                            | completed  | usercancel  |
| NULL    | NULL                       | NULL       | NULL        |
+ ------- + -------------------------- + ---------- + ----------- +
6 rows

Ah ha!

Two bugs:

1) In load_job_data() we use _remove_existing_jobs() to filter out jobs that already exist in the job table in the same state (i.e. filter out a completed job if we already have that same guid marked completed in the job table):
https://github.com/mozilla/treeherder/blob/a8cd88aa4f590b5dad741121e1215a41db41e329/treeherder/model/derived/jobs.py#L1162
...however, by filtering these jobs out, we never mark the corresponding objectstore rows as done with processing at the end of load_job_data():
https://github.com/mozilla/treeherder/blob/a8cd88aa4f590b5dad741121e1215a41db41e329/treeherder/model/derived/jobs.py#L1359

2) When someone presses cancel on a job, we mark the job as usercancel even before we get confirmation from builds-4hr. We have to do this because cancelling a pending job means it simply disappears from builds-4hr's point of view. However, it seems we also do this for running jobs, so we end up with a job marked "completed" (result usercancel) in the job table; when we finally see that job in builds-4hr, _remove_existing_jobs() treats it as a dupe and filters it out.
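
To see how many of the stuck rows fall into that bucket, a join along these lines would do it (a sketch, using the same two databases and columns as the queries above):

-- Stuck objectstore rows whose job already exists in the job table as a
-- completed/usercancel job: exactly the ones _remove_existing_jobs() filters
-- out without the objectstore row ever being marked as processed.
SELECT COUNT(*)
FROM try_objectstore_1.objectstore AS o
JOIN try_jobs_1.job AS j ON j.job_guid = o.job_guid
WHERE o.`processed_state` = 'loading'
  AND j.`state` = 'completed'
  AND j.`result` = 'usercancel';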
(Assignee)

Updated

3 years ago
Depends on: 1163802
(Assignee)

Updated

3 years ago
Depends on: 1163804
(Assignee)

Comment 5

3 years ago
The objectstore: the gift that keeps on giving. 
Filed bug 1163804, bug 1163802 for the issues found in comment 4.
They've been there pretty much since the beginning - but I guess we just don't cancel that many running jobs (compared to the total number).
(Assignee)

Updated

3 years ago
Depends on: 1163996
(Assignee)

Updated

3 years ago
Depends on: 1164824
(Assignee)

Comment 6

3 years ago
I've deleted all entries older than 2 days from the main repos on stage+prod. I'll do another proper sweep once bug 1164824 and either of bug 1163802 or bug 1163804 are fixed.
(Assignee)

Comment 7

3 years ago
This is lower priority now that (a) bug 1162682 & bug 1162526 are fixed, so fewer jobs are getting stuck, and (b) I purged/recovered many of the existing stuck jobs in comment 6.
Priority: P2 → P3
(Assignee)

Comment 8

3 years ago
The worst of this was fixed a while ago, and now with bug 1140349, we can stop worrying about the rest :-)
Status: ASSIGNED → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED