HTTP 503 Service Unavailable: https://treeherder
[12:57] < bc> known issue on staging with 503 Service Unavailable posting to https://treeherder.allizom.org/api/project/mozilla-central/jobs/ ?
[13:54] <& wlach> bc: no, let me take a look
[13:54] <& wlach> thanks for bringing it up
[14:02] <& wlach> bc: could you file a bug? I need to run but I see the problem -- you are submitting a new reference data signature, so it's trying to update the exclusion profiles, but that operation is timing out
[14:03] <& wlach> I think we might be able to get around that by switching to identifying exclusion profiles by id instead of hash
Looking at New Relic, it seems like it's timing out in update_flat_exclusions() in treeherder/model/models.py (around line 575). I don't think this is a new problem, but we evidently need to make this operation faster somehow. Perhaps switching from matching on hash strings to integers would help. I'll investigate tomorrow.
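For illustration, here is a minimal sketch of why matching on integer ids instead of hash strings can help. The table and column names below are hypothetical (not Treeherder's actual schema); the point is just that an integer primary-key lookup uses smaller index keys and cheaper comparisons than a 40-character hash string.

```python
import sqlite3

# Hypothetical schema, for illustration only -- not Treeherder's real tables.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE reference_data_signature (
           id INTEGER PRIMARY KEY,
           signature_hash TEXT UNIQUE,
           name TEXT
       )"""
)
conn.execute(
    "INSERT INTO reference_data_signature VALUES (1, 'a3f0...sha1...', 'autophone-s1s2')"
)

# Current approach: filter on the long hash string.
by_hash = conn.execute(
    "SELECT id, name FROM reference_data_signature WHERE signature_hash = ?",
    ("a3f0...sha1...",),
).fetchone()

# Proposed approach: filter on the integer id instead.
by_id = conn.execute(
    "SELECT id, name FROM reference_data_signature WHERE id = ?", (1,)
).fetchone()

# Either lookup returns the same row; the integer one is cheaper to index.
assert by_hash == by_id
```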
Assignee: nobody → wlachance
I think bc is submitting to both stage and prod, so if this wasn't a new problem, it would be affecting both?
I am currently burning in some new Pixel devices and checking out their behavior on all of the tests I have defined. See: https://treeherder.allizom.org/#/jobs?repo=mozilla-central&filter-searchStr=autophone&tochange=28eb27fd7d141a8f9f2a6dbfae892ad78ecb40a5&fromchange=05328d3102efd4d5fc0696489734d7771d24459f I haven't run the full Mochitest or Reftest suites in some time.
Created attachment 8815875 [details] Another failure I did some more investigation and it looks like it failed in another spot here, though I doubt that has anything to do with this specific operation; more likely it just happened to time out here because the overall operation takes too long. It looks like we're basically rewriting the flat exclusion profile information for *every* repository, even if a new type of job is only added to one. I think we ought to just do away with storing "flat exclusion profiles" in the database altogether. If you know the project + exclusion profile name you want, this information can be retrieved fairly quickly as part of another get operation.
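The shape of the proposed change can be sketched as follows. All names here are illustrative stand-ins, not Treeherder's real API: instead of eagerly rewriting a flat-exclusion cache for every repository whenever one job type changes, the per-project list is derived only when somebody asks for it.

```python
# Toy data standing in for exclusion profiles defined per project
# (hypothetical structure, for illustration only).
EXCLUSION_PROFILES = {
    "mozilla-central": {"default": {"excluded_jobs": {"autophone-s1s2"}}},
    "mozilla-inbound": {"default": {"excluded_jobs": set()}},
}

def get_flat_exclusions(project, profile_name):
    """Compute the flat exclusion list for one project/profile on demand.

    Only this project's data is touched; other repositories are left
    alone, unlike the eager approach that rewrote the cache for all of
    them on every new job submission.
    """
    profile = EXCLUSION_PROFILES.get(project, {}).get(profile_name, {})
    return sorted(profile.get("excluded_jobs", set()))

print(get_flat_exclusions("mozilla-central", "default"))  # → ['autophone-s1s2']
```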
Comment on attachment 8815876 [details] [review] [treeherder] wlach:1320805 > mozilla:master I'm pretty sure this should fix the problem. See PR for details. We probably want to land this in stages -- the db migration should wait until we're sure we don't want to back out.
Attachment #8815876 - Flags: review?(emorley)
Comment on attachment 8815876 [details] [review] [treeherder] wlach:1320805 > mozilla:master Sorry for the delay. This looks fine to me, though I'm less familiar with the nuances of the exclusion profile handling than Cameron. The PR will need rebasing to update the migration name/dependency, since there are new migrations on master since.
Attachment #8815876 - Flags: review?(emorley) → review+
Commits pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/f9a04c816f828d1272c4c8f9bf39fb19cbdb1faa
Bug 1320805 - Create per-project exclusion lists on-demand

The way we had written things, the "flat exclusion" information (which is essentially just a cache) had to be rewritten for *every* repository on submission of a new job, even though the new job would apply to only one repository. Let's fix this by just calculating this information on demand (most of the time this should be very fast, as we already store the final data in memcache).

https://github.com/mozilla/treeherder/commit/0070e189b003420f676280293ad6b5b092eba1d3
Bug 1320805 - Remove flat exclusion column from db
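The "calculate on demand, keep the final data cached" idea from the first commit can be sketched like this. functools.lru_cache stands in for memcache here, and all names are illustrative, not the actual Treeherder code:

```python
from functools import lru_cache

# Hypothetical per-project profile data, for illustration only.
EXCLUSION_PROFILES = {
    "mozilla-central": {"default": frozenset({"autophone-s1s2"})},
}

@lru_cache(maxsize=None)
def flat_exclusions(project, profile_name):
    # The derivation runs at most once per (project, profile); later
    # callers hit the cache, so job submissions no longer pay the cost
    # of rewriting every repository's flat exclusions up front.
    profile = EXCLUSION_PROFILES.get(project, {})
    return tuple(sorted(profile.get(profile_name, frozenset())))

flat_exclusions("mozilla-central", "default")  # computed on first request
flat_exclusions("mozilla-central", "default")  # served from cache
assert flat_exclusions.cache_info().hits == 1
```

In the real fix the cache lives in memcache rather than in-process, so it is shared across web workers, and dropping the stored flat-exclusion column is what the second commit's migration does.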
Seems to be working well on stage, will make it to production on next deploy.
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED