Closed Bug 1495714 Opened 6 years ago Closed 6 years ago

502 error, corrupted release blob: "Caught non-fatal exception finding fileUrl for partial update"

Categories

(Release Engineering Graveyard :: Applications: Balrog (backend), defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jlorenzo, Assigned: asilva)

Attachments

(2 files)

:jcristau manually modified the release "Firefox-62.0.3-build1" to include the WNP entry. Sadly, even though the blob was correct, it ended up in a corrupt state. Julien and I were unable to view the data (from the WebUI), modify it, schedule a modification, or revert a change. Every time we tried to make a change, we got a 502 error out of nginx without any additional detail. This was the only release affected. "Firefox-62.0.3-build1-No-WNP" was fine.

I looked at the logs. nginx read:
> {
>   "cid": 2582515,
>   "level": "error",
>   "pid": 2066,
>   "time": "2018-10-02 08:31:09",
>   "tid": 0,
>   "message": "upstream prematurely closed connection while reading response header from upstream, client: 63.245.210.133, server: , request: \"GET /api/releases/Firefox-62.0.3-build1 HTTP/1.1\", upstream: \"http://127.0.0.1:8000/api/releases/Firefox-62.0.3-build1\", host: \"aus4-admin.mozilla.org\", referrer: \"https://aus4-admin.mozilla.org/releases\""
> }

The balrog logs themselves read:
> {
>   "EnvVersion": "2.0",
>   "Severity": 3,
>   "Timestamp": "2018-10-02 08:44:44",
>   "Hostname": "ip-172-31-27-40.us-west-2.compute.internal",
>   "Pid": 1649,
>   "Fields": {
>     "message": "Caught non-fatal exception finding fileUrl for partial update:",
>     "traceback": "Uncaught exception:\n File \"./auslib/blobs/apprelease.py\", line 106, in _getSpecificPatchXML\n url = self._getUrl(updateQuery, patchKey, patch, specialForceHosts)\n File \"./auslib/blobs/apprelease.py\", line 698, in _getUrl\n raise ValueError(\"Couldn't find fileUrl\")\n<type 'exceptions.ValueError'>\nValueError(\"Couldn't find fileUrl\",)\n",
>     "requestid": 139773391128336,
>     "error": "ValueError(\"Couldn't find fileUrl\",)"
>   },
>   "Logger": "Balrog",
>   "Type": "auslib.blobs.apprelease.ReleaseBlobV4"
> }

I'm not sure how the database ended up in this state, nor why the No-WNP blob isn't affected. Julien and I found a workaround:
1. Download the blob from https://aus4-admin.mozilla.org/api/releases/Firefox-62.0.3-build1
2. Temporarily repoint the rules that use this blob to the No-WNP one
3. Delete the broken release
4. Recreate it from the blob downloaded in step 1
5. Try to view the data => it works
6. Revert step 2

Filing this issue for posterity.
Scheduling a change to that blob actually worked.
True. Thank you for the clarification. I'd like to add, though, that the blob still stayed in a broken state.
Nick, any ideas what happened here? Can we inspect the old blob?
Flags: needinfo?(nthomas)
Notable from postmortem, this probably won't happen if we do "automate WNP in-tree"
I've tried to look at the blob but the admin UI is not being helpful, and it's easy to make the admin server/API unresponsive (see bug 1497117, and bug 1497108 to add limits). Will look again when admin is more helpful, but may have to ask for data directly from the DB. Are those logs from Sentry? They don't look familiar to me.
Logs are from an S3 bucket[1]. It's dedicated to storing logs for balrog prod. I'm not sure what generates them. [1] https://361527076523.signin.aws.amazon.com/console
oremj, is it possible to get a copy of the prod db? RDS backups look like snapshots rather than db dumps, so perhaps it's infeasible to do that. Alternatively, could you run these queries instead:
> select * from releases_history where name="Firefox-62.0.3-build1" order by change_id desc limit 10;
> select * from releases_scheduled_changes where base_name="Firefox-62.0.3-build1" order by sc_id desc limit 10;
The output will be long because the data column contains a lot of JSON data.
Flags: needinfo?(oremj)
The data in the releases_history query appears to be fine. Judging by the "auslib.blobs.apprelease.ReleaseBlobV4" in comment #0, which looks like it comes from an update.xml request rather than something on the admin side, the app thought it was a V4 blob instead of V9. This is very strange because the schema_version value of '9' doesn't change at all. Perhaps there was some corruption in the blob cache, which was cleared by the deletion and re-add of the release. I need to regain my access to the logs bucket to tell more. Allan, can you think of anything else that might have happened?
Flags: needinfo?(nthomas) → needinfo?(allan.tavares)
I also think it was a corrupted blob in cache. I tried to reproduce this through the following steps:
1 - Create a new release using the blob from comment #0;
2 - Change the WNP link directly in DB:
> update releases set `data` = REPLACE(`data`, 'firefox/62.0.2', 'firefox/62.0.3') where name = 'Firefox-62.0.3-build1'
3 - Stop/Start container.
No problem happens.

Corrupting the blob:
1 - Create a new release using the blob from comment #0;
2 - Force missing "fileUrls" directly in DB:
> update releases set `data` = REPLACE(`data`, 'fileUrls', 'corrupted') where name = 'Firefox-62.0.3-build1'
3 - Stop/Start container.
When requesting from the update endpoint, I got:
> 2018-10-11 20:03:51,828 - DEBUG - PID: 8 - Request: 140616380570896 - auslib.blobs.apprelease.ReleaseBlobV9._getUrl#697: Couldn't find fileUrl
No problems when requested from the admin API.
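For anyone repeating the corruption experiment above, a quick offline check on a downloaded blob can confirm whether the "fileUrls" key survived. This is only a minimal sketch: the helper name `contains_key` and the toy blob are hypothetical, and the search is deliberately generic so it does not assume where "fileUrls" sits inside a real release blob.

```python
import json


def contains_key(node, key):
    """Return True if `key` is used as a dict key anywhere in a nested JSON value."""
    if isinstance(node, dict):
        return key in node or any(contains_key(v, key) for v in node.values())
    if isinstance(node, list):
        return any(contains_key(item, key) for item in node)
    return False


# Toy stand-in for a release blob (assumption: real blobs are far larger).
healthy = json.loads(
    '{"name": "Firefox-62.0.3-build1", "schema_version": 9, "fileUrls": {}}'
)

# Mimic the SQL REPLACE from the experiment on the serialized JSON text.
corrupted = json.loads(json.dumps(healthy).replace("fileUrls", "corrupted"))

print(contains_key(healthy, "fileUrls"))
print(contains_key(corrupted, "fileUrls"))
```

The same function can be pointed at a blob saved from the admin API by loading the file with `json.load` first.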
Flags: needinfo?(allan.tavares)
We hit this again with 63.0.1 (build4). :nthomas is able to reproduce in dev and is fixing up production right now.
I think this happens when a scheduled change is made to a release, but not when the 'Update' button is used (which you can still do when it is only on the test channels). To reproduce:
* download the Firefox-63.0.1-build4-No-WNP and Firefox-63.0.1-build4 blobs from prod
* modify Firefox-63.0.1-build4-No-WNP.json to have internal name Firefox-63.0.1-build4, but be stored in Firefox-63.0.1-build4-before-WNP.json
* upload Firefox-63.0.1-build4-before-WNP.json to balrog
* schedule a change using Firefox-63.0.1-build4.json
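The blob-renaming step in the reproduction can be scripted. The sketch below is hedged: it assumes the blob's internal name lives in a top-level "name" field, and the helper `rename_blob` is hypothetical rather than part of any Balrog tooling.

```python
import json
from pathlib import Path


def rename_blob(src_path, dest_path, internal_name):
    """Copy a release blob to dest_path with a different internal name."""
    blob = json.loads(Path(src_path).read_text(encoding="utf-8"))
    # Assumption: the internal name is the top-level "name" field.
    blob["name"] = internal_name
    Path(dest_path).write_text(
        json.dumps(blob, indent=2, sort_keys=True), encoding="utf-8"
    )
    return blob
```

For the steps above, the call would be `rename_blob("Firefox-63.0.1-build4-No-WNP.json", "Firefox-63.0.1-build4-before-WNP.json", "Firefox-63.0.1-build4")`; the source file is left untouched.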
Commit pushed to master at https://github.com/mozilla/balrog
https://github.com/mozilla/balrog/commit/86bd3be7e531d6aebad7b9b0e8095de6322f2b92
Bug 1495714 - Compare data_version with integer_types(int, long) (#837)
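The commit title suggests the fix widens a type check on data_version so that Python 2 `long` values are accepted alongside `int`. On Python 2 an integer can be a `long`, in which case `isinstance(value, int)` is False. Below is a minimal sketch of that kind of check, not the actual Balrog code; the helper name `is_integer_data_version` is hypothetical.

```python
import sys

# Build the tuple of integer types for the running interpreter.
# On Python 3 there is only `int`; on Python 2 `long` is a separate type.
if sys.version_info[0] >= 3:
    integer_types = (int,)
else:
    integer_types = (int, long)  # noqa: F821 (only evaluated on Python 2)


def is_integer_data_version(value):
    """Return True if value is an integer of any width, False otherwise."""
    return isinstance(value, integer_types)


print(is_integer_data_version(3))
print(is_integer_data_version(2 ** 70))
print(is_integer_data_version("3"))
```

The `six` library exposes the same tuple as `six.integer_types`, which may be what the commit title refers to.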
Allan tracked down the problem with scheduled changes and created a fix over in https://github.com/mozilla/balrog/pull/837, which I've landed in master. We'll do a release sometime soon to get it to prod. Thanks Allan!
Assignee: nobody → allan.tavares
Deployed today.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Product: Release Engineering → Release Engineering Graveyard