Closed
Bug 1303904
Opened 8 years ago
Closed 7 years ago
Decision task failing trying to fetch json-automationrelevance on merge to mozilla-aurora
Categories
(Developer Services :: Mercurial: hg.mozilla.org, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: philor, Assigned: gps)
References
Details
Attachments
(6 files, 1 obsolete file)
59 bytes, text/x-review-board-request | glob : review+ | Details
<philor>: tools.taskcluster.net/task-inspector/#ayJi2q5jSK2RAb7bhOga6Q/0 is the decision task failing on the merge to aurora, timing out fetching automationrelevance from hg.m.o
<gps>: philor, dustin: that /may/ be a regression from the recent refactor. previous merge days went fine.
<gps>: a 5s timeout is really short
<gps>: the API is likely CPU bound on the server
<gps>: you can test that by running `time curl mozilla-aurora/json-automationrele…df759f09b1411f33fa0920835b919fc81` a few dozen times. i bet the variance in execution time is really low
<gps>: looks like it is timing out after 60s on the server :/

Bottom line: if we do this regularly on merges, then we'll need to keep the tree closed after the merge (I can't believe we don't just do that as a matter of policy) and then have something innocuous around to push to the closed tree in order to get our first set of Taskcluster jobs on the merge to see how busted they are.
Assignee
Comment 1 • 8 years ago
I don't think this is a regression: the server-side process is taking 60+ seconds to generate output. The bottleneck is reading the pushlog.

There's no easy workaround. We'll have to wait for the next push.
Component: General → Mercurial: hg.mozilla.org
Product: Taskcluster → Developer Services
QA Contact: hwine
Comment 3 • 8 years ago
(In reply to Gregory Szorc [:gps] from comment #1)
> I don't think this is a regression: the server-side process is taking 60+s
> to generate output. Bottleneck is reading the pushlog.
>
> There's no easy workaround. We'll have to wait for the next push.

Yeah, this seems to work here.
Updated • 7 years ago
QA Contact: hwine → klibby
Comment 4 • 7 years ago
Seems this hit us again on the beta-to-release merge day.
Assignee
Comment 5 • 7 years ago
Ugh. I feel bad about having a bug impacting merge day activities. So I decided to spend a few minutes to dig deeper.

The automationrelevant() revset is not the problem: I can execute it on the command line and it completes very quickly.

The bottleneck is calling webutil.changelistentry() thousands of times. Specifically, the pushlog extension's code that runs during changelistentry() opens a new connection to the SQLite database and then runs 2 SQL queries. I suspect reusing the DB connection will make the perf issue go away. But that change isn't exactly trivial :/
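For illustration, the connection-reuse idea might look like the sketch below. PushlogReader, its schema, and push_for_node() are all hypothetical stand-ins, not the actual pushlog extension code:

```python
import sqlite3

class PushlogReader:
    """Sketch only: cache the SQLite connection so each changeset
    lookup does not reopen the database (the pattern described above
    opens a fresh connection per changelistentry() call)."""

    def __init__(self, path):
        self.path = path
        self._conn = None

    @property
    def conn(self):
        # Open lazily on first use, then reuse for every lookup.
        if self._conn is None:
            self._conn = sqlite3.connect(self.path)
        return self._conn

    def push_for_node(self, node):
        # A per-changeset query, but issued on the shared connection
        # instead of a new one each time. Schema is illustrative.
        cur = self.conn.execute(
            'SELECT p.pushid, p.who, p.date FROM changesets c '
            'JOIN pushes p ON p.pushid = c.pushid WHERE c.node = ?',
            (node,))
        return cur.fetchone()
```

With thousands of changesets per request, amortizing the connection setup (and SQLite's per-open page-cache warmup) across all lookups is where the win would come from.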
Comment 6 • 7 years ago
Given that the endpoint is specific to automation, it would not be unreasonable to return some "lots of things changed!" response when the revset exceeds a certain size, with corresponding handling of that response in-tree.
Comment 8 • 7 years ago
If we can just ignore automationrelevant() and let the decision task finish, that would be good.
Comment 9 • 7 years ago
We hit this again today (merge day).

+1: if the number of commits hits a certain size, we should return true. I'm hesitant to treat timeouts as "relevant", since that means we might run all automation whenever hg.m.o gets really slammed, which would exacerbate things. I'm equally hesitant to treat timeouts as "not relevant", since that logic breaks on merge day. Setting a threshold for the number of commits on the server side makes the most sense to me.
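A server-side threshold along these lines could be sketched as follows. The function name, the response shape, and the truncation flag are all hypothetical, not the code that eventually landed:

```python
def automation_relevance(nodes, cap=500):
    """Sketch of the proposed threshold: above `cap` changesets,
    declare everything relevant and skip per-changeset rendering,
    so huge merge pushes return quickly instead of timing out."""
    if len(nodes) > cap:
        return {'relevant': True, 'truncated': True, 'changesets': []}
    return {'relevant': len(nodes) > 0,
            'truncated': False,
            'changesets': list(nodes)}
```

The in-tree consumer would then need corresponding handling for the truncated response, e.g. scheduling the full task graph when `truncated` is set.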
Comment on attachment 8844279 [details]
hgmo: Cap at 500 revisions in automationrelevancewebcommand (bug 1303904)

I'm unsure if this will work like I think it will (I can't get my test environment working for some reason), but thought it'd be worth a shot. If it works as I expect, it will just stop looking at csets after we reach 500, then carry on with the rest of the function using that trimmed list. Unless I read gps's comment wrong and I need to move this into the automationrelevant() revset itself?
Attachment #8844279 -
Flags: feedback?(aki)
Comment 12 • 7 years ago
Comment on attachment 8844279 [details]
hgmo: Cap at 500 revisions in automationrelevancewebcommand (bug 1303904)

I'm not sure what the "yes, this is relevant; ignore the details" response should be. It looks like we populate a JSON blob normally. I'm also not sure if 500 is the right threshold, but it's definitely better than an unbounded maximum. This patch probably requires more knowledgeable eyes than mine.
Attachment #8844279 -
Flags: feedback?(aki) → feedback+
Happy to wait for gps to return; we'll still have a month after that before the next thousand+ cset merge day.
Assignee
Updated • 7 years ago
Flags: needinfo?(gps)
Updated • 7 years ago
Attachment #8844279 -
Flags: feedback?(gps)
Assignee
Comment 14 • 7 years ago
Comment on attachment 8844279 [details]
hgmo: Cap at 500 revisions in automationrelevancewebcommand (bug 1303904)

I'll come up with a hack to fix the perf issues. The reason I was reluctant to do this before is that I preferred to work in an upstream API so we wouldn't need a hack. But on further reflection, that is a lot of work. So the hacky route we will go.
Flags: needinfo?(gps)
Attachment #8844279 -
Flags: feedback?(gps) → feedback-
Assignee
Updated • 7 years ago
Assignee: nobody → gps
Status: NEW → ASSIGNED
Assignee
Comment 21 • 7 years ago
The abridged version of the commit messages: this series drops time to first byte for an example uplift push from ~120s to ~1.5s, and total size of the JSON response from 2348 MB to 10.8 MB. I listened to Entry of the Gladiators as I was writing the commit messages.
Attachment #8844279 -
Attachment is obsolete: true
Comment 22 • 7 years ago
mozreview-review
Comment on attachment 8875868 [details]
hgmo: stream JSON responses (bug 1303904)

https://reviewboard.mozilla.org/r/147270/#review152956

::: hgext/hgmo/__init__.py:319
(Diff revision 1)
>      # Round trip to ensure we have valid JSON.
>      try:
>          d = json.loads(stdout)
> -        return json.dumps(d, indent=2, sort_keys=True)
> +        return stream_json(d)
>      except Exception:
>          return json.dumps({'error': 'invalid JSON returned; report this error'},
>                            indent=2)

Could we assume it's stable and stream its output directly to the requester without a JSON decode/encode round trip?

As per our chat, feel free to drop this issue and leave as-is; this isn't in the critical path for this performance issue.
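The streaming approach under review can be sketched with the standard library's json.JSONEncoder.iterencode(), which yields the serialization in pieces instead of materializing one giant string. This is an illustration of the idea, not the actual hgmo stream_json() implementation:

```python
import json

def stream_json(obj, chunk_size=8192):
    """Sketch: generate the JSON serialization of `obj` in chunks,
    so a multi-GB response never has to exist in memory at once."""
    encoder = json.JSONEncoder(indent=2, sort_keys=True)
    buf, size = [], 0
    for part in encoder.iterencode(obj):
        buf.append(part)
        size += len(part)
        if size >= chunk_size:
            # Flush accumulated fragments as one chunk.
            yield ''.join(buf)
            buf, size = [], 0
    if buf:
        yield ''.join(buf)
```

A WSGI/hgweb-style server can send each yielded chunk as it is produced, which is what drives down time to first byte.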
Attachment #8875868 -
Flags: review?(glob) → review+
Comment 23 • 7 years ago
mozreview-review
Comment on attachment 8875869 [details]
pushlog: make repo.pushlog a cached property (bug 1303904)

https://reviewboard.mozilla.org/r/147272/#review154254

::: commit-message-0f52a:6
(Diff revision 1)
> +pushlog: make repo.pushlog a cached property (bug 1303904); r?glob
> +
> +Currently, repo.pushlog creates a new pushlog instance on every
> +access. This creates overhead and prevents some forms of caching.
> +
> +The "propertyfilecache" decorator cached the property value, but only

I think you mean "repofilecache" here, as propertyfilecache creates a new object on every call.
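The cached-property idea can be illustrated with functools.cached_property from the standard library. The real patch uses Mercurial's repofilecache decorator, which additionally invalidates the cache when the backing pushlog file changes; FakeRepo and its counter below are illustrative only:

```python
import functools

class FakeRepo:
    """Sketch: compute pushlog once per repo instance instead of
    constructing a new one on every attribute access."""

    def __init__(self):
        self.opens = 0  # counts how many times construction runs

    @functools.cached_property
    def pushlog(self):
        # Expensive construction happens on first access only; the
        # result is memoized on the instance afterwards.
        self.opens += 1
        return object()
```

Returning the same instance on repeated access is what makes downstream caching (like the lookup cache in a later patch in this series) possible at all.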
Attachment #8875869 -
Flags: review?(glob) → review+
Comment 24 • 7 years ago
mozreview-review
Comment on attachment 8875870 [details]
pushlog: factor out code to obtain a push from a node (bug 1303904)

https://reviewboard.mozilla.org/r/147274/#review154258
Attachment #8875870 -
Flags: review?(glob) → review+
Comment 25 • 7 years ago
mozreview-review
Comment on attachment 8875871 [details]
hgmo: refactor loop iterating over revs (bug 1303904)

https://reviewboard.mozilla.org/r/147276/#review154260
Attachment #8875871 -
Flags: review?(glob) → review+
Comment 26 • 7 years ago
mozreview-review
Comment on attachment 8875872 [details]
pushlog: API to cache pushlog lookups (bug 1303904)

https://reviewboard.mozilla.org/r/147278/#review154262

Yeah, this is somewhat hacky, but it looks sane given the approach.
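A minimal sketch of such a lookup cache, with a resolver callable standing in for the real SQLite query (all names hypothetical, not the actual pushlog API):

```python
class CachedPushlog:
    """Sketch: memoize node -> push resolution so rendering thousands
    of changesets from the same request hits the database once per
    node instead of once per render."""

    def __init__(self, resolve):
        self._resolve = resolve  # e.g. a SQLite-backed lookup
        self._cache = {}

    def push_for_node(self, node):
        if node not in self._cache:
            self._cache[node] = self._resolve(node)
        return self._cache[node]
```

The "hacky" part glob mentions is presumably cache invalidation: such a cache is only safe for the lifetime of one request, since pushes can land at any time.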
Attachment #8875872 -
Flags: review?(glob) → review+
Comment 27 • 7 years ago
mozreview-review
Comment on attachment 8875873 [details]
hgmo: don't send list of push nodes with automationrelevant response (bug 1303904)

https://reviewboard.mozilla.org/r/147280/#review154264
Attachment #8875873 -
Flags: review?(glob) → review+
Assignee
Comment 28 • 7 years ago
mozreview-review-reply
Comment on attachment 8875868 [details]
hgmo: stream JSON responses (bug 1303904)

https://reviewboard.mozilla.org/r/147270/#review152956

> could we assume it's stable and stream its output directly to the requester without a json decode/encode round trip?
>
> as per our chat feel free to drop this issue and leave as-is; this isn't in the critical path for this performance issue.

Yeah, we can improve this code. But not in this bug.
Comment 35 • 7 years ago
Pushed by gszorc@mozilla.com:

https://hg.mozilla.org/hgcustom/version-control-tools/rev/2ba557b29106
hgmo: stream JSON responses; r=glob

https://hg.mozilla.org/hgcustom/version-control-tools/rev/fa9313510326
pushlog: make repo.pushlog a cached property; r=glob

https://hg.mozilla.org/hgcustom/version-control-tools/rev/01497dd90af6
pushlog: factor out code to obtain a push from a node; r=glob

https://hg.mozilla.org/hgcustom/version-control-tools/rev/cbe2a5a5ae20
hgmo: refactor loop iterating over revs; r=glob

https://hg.mozilla.org/hgcustom/version-control-tools/rev/af4420a6680e
pushlog: API to cache pushlog lookups; r=glob

https://hg.mozilla.org/hgcustom/version-control-tools/rev/c23d48384dba
hgmo: don't send list of push nodes with automationrelevant response; r=glob
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Assignee
Comment 36 • 7 years ago
This has been deployed. HTTP requests like https://hg.mozilla.org/releases/mozilla-aurora/json-automationrelevance/5dddbefdf759f09b1411f33fa0920835b919fc81 that previously timed out now complete in ~6s. This should no longer impact uplift day activities.