Closed Bug 1163659 Opened 9 years ago Closed 9 years ago

Rescue/clean up jobs that are stuck in the loading state in the objectstore

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P3)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: emorley)

References

Details

Due to bug 1125476, if a worker is killed/dies/hits an exception, the rows in the objectstore that it had claimed for processing get stuck in the 'loading' state.

Until recently (bug 1162682), for exceptions this would include _all_ of the jobs in the batch of up to 100.

Let's rescue jobs from the last couple of months on try, and delete any others.
For Try, since the beginning of Treeherder:

Execute:
> SELECT COUNT(*) FROM try_objectstore_1.objectstore
WHERE `processed_state` = 'loading'
AND `loaded_timestamp` < UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 4 hour))

+ ------------- +
| COUNT(*)      |
+ ------------- +
| 56379         |
+ ------------- +
1 rows

For the last 3 months:

Execute:
> SELECT COUNT(*) FROM objectstore
WHERE `processed_state` = 'loading'
AND `loaded_timestamp` BETWEEN UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 3 month))
AND UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 4 hour))

+ ------------- +
| COUNT(*)      |
+ ------------- +
| 51566         |
+ ------------- +
1 rows

...which gives credence to the idea that bug 1162526 was the main issue (it was introduced by bug 1113873 in ~March), and that with it fixed, bug 1125476 should cause us far fewer problems.
Execute:
> SELECT COUNT(*) FROM try_objectstore_1.objectstore
WHERE `processed_state` = 'loading'
AND `loaded_timestamp` BETWEEN UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 7 day))
AND UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 1 hour))

+ ------------- +
| COUNT(*)      |
+ ------------- +
| 4671          |
+ ------------- +
1 rows

> UPDATE try_objectstore_1.objectstore SET `worker_id` = NULL, `processed_state` = 'ready' WHERE `processed_state` = 'loading' AND `loaded_timestamp` BETWEEN UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 7 day)) AND UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 1 hour))

4671 row(s) affected Rows matched: 4671  Changed: 4671  Warnings: 0

> DELETE FROM try_objectstore_1.objectstore WHERE `processed_state` = 'loading' AND `loaded_timestamp` < UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 8 day))

51613 row(s) affected
So after waiting a while for all of the 'ready' rows to turn to 'loading' and then hopefully 'completed', we have:

Execute:
> SELECT processed_state, error, COUNT(processed_state) FROM try_objectstore_1.objectstore
WHERE `loaded_timestamp` BETWEEN UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 7 day))
AND UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 2 day))
GROUP BY processed_state, error

+ -------------------- + ---------- + --------------------------- +
| processed_state      | error      | COUNT(processed_state)      |
+ -------------------- + ---------- + --------------------------- +
| loading              | N          | 2345                        |
| complete             | N          | 1865                        |
+ -------------------- + ---------- + --------------------------- +
2 rows

So a number of them did get imported OK, but 2345 are still getting stuck!? (The completed count hasn't changed for at least 20 minutes, so I doubt they are still in progress.)

I'm starting to get suspicious about the code that marks rows as errored :-/
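
A quick way to test that suspicion would be something along these lines (a sketch using the same table/columns as the queries above, and assuming the error flag is set to 'Y' when a row is marked as errored); it should show whether the error-marking path has fired at all in the last week:

SELECT COUNT(*) FROM try_objectstore_1.objectstore
WHERE `error` = 'Y'
AND `loaded_timestamp` > UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 7 day))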
Execute:
> SELECT job_guid, error FROM try_objectstore_1.objectstore
WHERE `loaded_timestamp` BETWEEN UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 7 day))
AND UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 1 hour))
AND processed_state = "loading"
ORDER BY id
LIMIT 5

+ ------------- + ---------- +
| job_guid      | error      |
+ ------------- + ---------- +
| 18f348cc29cd07a02e425bbd41187c17c70a9360 | N          |
| f9a589f4849d171beb4e67edf52787c67662008b | N          |
| 783d2ac95b318746834bbbb0c6d4ba232dabbe64 | N          |
| 64f6faaa58260879164338dfafd88b084add64dd | N          |
| ba1dc1dd59fe4c602b933ecca137c4ae3ae57426 | N          |
+ ------------- + ---------- +
5 rows

Execute:
> SELECT id, job_coalesced_to_guid, state, result FROM try_jobs_1.job
WHERE job_guid IN 
("18f348cc29cd07a02e425bbd41187c17c70a9360", "f9a589f4849d171beb4e67edf52787c67662008b",
"783d2ac95b318746834bbbb0c6d4ba232dabbe64", "64f6faaa58260879164338dfafd88b084add64dd",
"ba1dc1dd59fe4c602b933ecca137c4ae3ae57426")

+ ------- + -------------------------- + ---------- + ----------- +
| id      | job_coalesced_to_guid      | state      | result      |
+ ------- + -------------------------- + ---------- + ----------- +
| 7212045 |                            | completed  | usercancel  |
| 7212025 |                            | completed  | usercancel  |
| 7212026 |                            | completed  | usercancel  |
| 7212029 |                            | completed  | usercancel  |
| 7212027 |                            | completed  | usercancel  |
| NULL    | NULL                       | NULL       | NULL        |
+ ------- + -------------------------- + ---------- + ----------- +
6 rows

Ah ha!

Two bugs:

1) In load_job_data() we use _remove_existing_jobs() to filter out jobs that already exist in the job table in the same state (i.e. filter out a completed job if we already have that same guid as completed in the job table):
https://github.com/mozilla/treeherder/blob/a8cd88aa4f590b5dad741121e1215a41db41e329/treeherder/model/derived/jobs.py#L1162
...however, by filtering out these jobs, we never mark the corresponding objectstore rows as done with processing at the end of load_job_data():
https://github.com/mozilla/treeherder/blob/a8cd88aa4f590b5dad741121e1215a41db41e329/treeherder/model/derived/jobs.py#L1359

2) When people press cancel on a job, we mark the job as usercancel even before we get confirmation from builds-4hr. We have to do this because cancelling a pending job means the job simply disappears from builds-4hr's point of view. However, it seems we're doing this for running jobs too, so we end up with a job marked as "completed" (result "usercancel") in the job table; when that job finally shows up in builds-4hr, _remove_existing_jobs() treats it as a duplicate and filters it out.
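
A rough way to surface the rows hit by the second bug (a sketch against the tables/columns used in the queries above, not something run as part of the numbers in this bug) would be to join the stuck objectstore rows to the job table and look for usercancel:

SELECT os.id, os.job_guid, j.state, j.result
FROM try_objectstore_1.objectstore os
JOIN try_jobs_1.job j ON j.job_guid = os.job_guid
WHERE os.processed_state = 'loading'
AND j.state = 'completed'
AND j.result = 'usercancel'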
Depends on: 1163802
Depends on: 1163804
The objectstore: the gift that keeps on giving. 
Filed bug 1163804, bug 1163802 for the issues found in comment 4.
They've been there pretty much since the beginning - but I guess we just don't cancel that many running jobs (compared to the total number).
Depends on: 1163996
Depends on: 1164824
I've deleted all entries older than 2 days from the main repos on stage+prod. I'll do another proper sweep once bug 1164824 and either of bug 1163802 or bug 1163804 are fixed.
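
That per-repo sweep follows the same shape as the Try DELETE above, i.e. something along these lines for each repo (the database name here is just an illustrative example, and the actual sweep may not have been restricted to 'loading' rows):

DELETE FROM mozilla_inbound_objectstore_1.objectstore
WHERE `processed_state` = 'loading'
AND `loaded_timestamp` < UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 2 day))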
This is lower priority now that (a) bug 1162682 & bug 1162526 are fixed, so fewer jobs are getting stuck, and (b) I purged/recovered many of the existing stuck jobs in comment 6.
Priority: P2 → P3
The worst of this was fixed a while ago, and now with bug 1140349, we can stop worrying about the rest :-)
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED