Closed Bug 1226565 Opened 9 years ago Closed 9 years ago

DB replication lag causes update_parse_status exceptions during log parsing

Categories

(Tree Management :: Treeherder: API, defect, P2)

defect

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1273231

People

(Reporter: emorley, Unassigned)

References

Details

Attachments

(2 files)

On occasions where we're suffering from database replication lag, one of the most notable/annoying symptoms are hundreds if not thousands of log parsing exceptions, similar to: [2015-11-20 04:04:10,249] ERROR [treeherder.log_parser.utils:129] Failed to upload parsed artifact for mozilla-inbound 9460cbbfaf51fd9ad80b9d886c43471c0624794b (http://ftp.mozilla.org/pub/mozilla.org/b2g/tinderbox-builds/mozilla-inbound-emulator-debug/1448018435/mozilla-inbound_ubuntu64_vm-b2g-emulator-debug_test-cppunit-bm53-tests1-linux64-build638.txt.gz): 404 Client Error: NOT FOUND [2015-11-20 04:04:09,623] ERROR [treeherder.webapp.api.exceptions:24] ObjectNotFoundException: For table 'job_log_url': {'id': 99999999} Whilst the UPDATE query [1] that is called via the update_parse_status API [2] is indeed run on the master, it's then followed by a SELECT [3] to retrieve the job_log_url info that's run against the read_only host [4]. In the case of replication lag, the job_log_url row being updated might not yet exist on the read_only host, meaning the SELECT will return a ObjectNotFoundException, resulting in a 404. This only occurs on stage for now, since prod has both the master host and read_only host VIPs pointing at the same machine due to lag issues. Possible solutions: 1) Make the get_job_log_url_detail SELECT use the master host. However this same query is used for the job_log_url retrieve, which doesn't need to use the master. 2) Use a separate get_job_log_url_detail query just for update_parse_status that uses the master instead. 3) Drop the select entirely, and just do the update - since the response (eg id, job_id, url, name, parse_status) is pretty much just repeating back to the client what they already know. However this no longer 404s if the id doesn't exist on the master either, since there's no way to get the UPDATE run via datasource to indicate if it changed any rows. 4) Same as #3, except add support to datasource for returning the number of affected rows, and use that to determine whether to return a 404 or not. I prefer #4, since: * adding support to datasource is actually going to be pretty easy * it doesn't make sense for update_parse_status to return the job_log_url record, and using a SELECT just to confirm the UPDATE succeeded is doubling the number of queries unnecessarily ...unless anyone has alternative suggestions? [1] https://github.com/mozilla/treeherder/blob/af9932c877f7a1ee64e720e7d161a7cfe8e508f3/treeherder/model/sql/jobs.json#L284-L291 [2] https://github.com/mozilla/treeherder/blob/cd79f9a96a8e0a0b51722f8df0d7d6f6cfd2c904/treeherder/webapp/api/job_log_url.py#L57 [3] https://github.com/mozilla/treeherder/blob/cd79f9a96a8e0a0b51722f8df0d7d6f6cfd2c904/treeherder/webapp/api/job_log_url.py#L58 [4] https://github.com/mozilla/treeherder/blob/af9932c877f7a1ee64e720e7d161a7cfe8e508f3/treeherder/model/sql/jobs.json#L562-L567
Attachment #8690048 - Flags: review?(mdoglio) → review+
Attachment #8690064 - Flags: review?(mdoglio) → review+
Comment on attachment 8690048 [details] [review] Datasource: Add support for a return_type of `rowcount` Thank you for the reviews :-) https://github.com/jeads/datasource/commit/124076b55a70550ae8d6a20dfd718cdf4619ae0d
Attachment #8690048 - Flags: checkin+
Commits pushed to master at https://github.com/mozilla/treeherder https://github.com/mozilla/treeherder/commit/d90775a22773e0f803dee3e6f64e50c2fba7ec48 Bug 1226565 - Test that update-parse-status 404s for a non-existent id We're about to change the implementation of update_parse_status; this test will ensure trying to update the status for a non-existent job_log_url record will still result in a 404 response. https://github.com/mozilla/treeherder/commit/83214f5edfeed572b031da0a41170f96cb79ec61 Bug 1226565 - Clean up the update_parse_status() try-except https://github.com/mozilla/treeherder/commit/9cda039e8be719addddeda674172e8eb6fe56e4d Bug 1226565 - Update datasource to v0.10.0 (adds 'rowcount' return_type) Adds support for a `return_type` of `rowcount`: https://github.com/jeads/datasource/pull/39 https://github.com/jeads/datasource/compare/v0.9.1...v0.10.0 https://github.com/mozilla/treeherder/commit/08d70961c811a6d7f7077b8e08de44abdd1aac8b Bug 1226565 - Stop update-parse-status from doing an additional SELECT Previously it was performing both an UPDATE and then a SELECT, where the SELECT was being performed on the read_only DB host. This meant in cases of replication lag, the update-parse-status would return a 404, since `get_job_log_url_detail` raises `ObjectNotFoundException` if zero rows rows are found on the read_only host. The SELECT was performing two tasks: 1) Allow the update-parse-status API to respond with the job_log_url record. However this is of little use to consumers of that endpoint. 2) Ensuring the API would return 404 if the job_log_url id in question actually did not exist at all (even on the master). We can now perform (2) using datasource's new `return_type` of `rowcount`, which means we only have to perform the UPDATE and not also the SELECT. This not only avoids the race when we're suffering from replication lag, but also halves the number of queries run for each log parsed.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
So it turns out that MySQL tries to be smart and doesn't call a row updated if the new values matched the old (ie rowcount is zero even if the row was valid). This means that if for whatever reason we try to mark a log as parsed twice in a row (which ideally we shouldn't be doing - sounds like bug 1183246), we now get a 404. To fix, we can add the SELECT back in, but only do it when rowcount is zero, which will avoid the double query in the common case). I'll revert this in the meantime.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Commit pushed to master at https://github.com/mozilla/treeherder https://github.com/mozilla/treeherder/commit/8668c872090c903a8132e2fb3df70ba7ceb8e148 Revert "Bug 1226565 - Stop update-parse-status from doing an additional SELECT" This reverts commit 08d70961c811a6d7f7077b8e08de44abdd1aac8b.
Assignee: emorley → nobody
Attachment #8690064 - Flags: checkin-
Comment on attachment 8690064 [details] [review] Treeherder: Prevent update-parse-status 404s when there's DB replication lag (Trying to get Bugzilla todos to stop displaying this bug in "to check in")
Attachment #8690064 - Flags: review+
The API to update the log parse status was removed in: https://github.com/mozilla/treeherder/pull/1510 And then everything rewritten to use the ORM in: https://github.com/mozilla/treeherder/pull/1496 So this is now fixed :-)
Status: REOPENED → RESOLVED
Closed: 9 years ago9 years ago
Resolution: --- → DUPLICATE
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: