Closed Bug 1303904 Opened 8 years ago Closed 7 years ago

Decision task failing trying to fetch json-automationrelevance on merge to mozilla-aurora

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Assigned: gps)

References

Details

Attachments

(6 files, 1 obsolete file)

<philor>: tools.taskcluster.net/task-inspector/#ayJi2q5jSK2RAb7bhOga6Q/0 is the decision task failing on the merge to aurora, timing out fetching automationrelevance from hg.m.o
<gps>: philor, dustin: that /may/ be a regression from the recent refactor. previous merge days went fine.
<gps>: a 5s timeout is really short
<gps>: the API is likely CPU bound on the server
<gps>: you can test that by running `time curl mozilla-aurora/json-automationrele…df759f09b1411f33fa0920835b919fc81` a few dozen times. i bet the variance in execution time is really low
<gps>: looks like it is timing out after 60s on the server :/
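For reference, a minimal client-side timing sketch along the lines of gps's curl suggestion (the URL is the full one quoted later in this bug; the sample count is arbitrary):

    # Time repeated requests to the json-automationrelevance endpoint to
    # check how much the response time varies.
    import time
    import urllib.request

    URL = ('https://hg.mozilla.org/releases/mozilla-aurora/'
           'json-automationrelevance/'
           '5dddbefdf759f09b1411f33fa0920835b919fc81')

    samples = []
    for _ in range(10):
        start = time.time()
        with urllib.request.urlopen(URL, timeout=120) as resp:
            resp.read()
        samples.append(time.time() - start)

    print('min=%.1fs max=%.1fs avg=%.1fs'
          % (min(samples), max(samples), sum(samples) / len(samples)))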

Bottom line: if this happens regularly on merges, we'll need to keep the tree closed after the merge (I can't believe we don't just do that as a matter of policy) and keep something innocuous around to push to the closed tree, so we get our first set of Taskcluster jobs on the merge and can see how busted they are.
I don't think this is a regression: the server-side process is taking 60+s to generate output. Bottleneck is reading the pushlog.

There's no easy workaround. We'll have to wait for the next push.
Component: General → Mercurial: hg.mozilla.org
Product: Taskcluster → Developer Services
QA Contact: hwine
Blocks: 1317333
No longer blocks: 1317333
(In reply to Gregory Szorc [:gps] from comment #1)
> I don't think this is a regression: the server-side process is taking 60+s
> to generate output. Bottleneck is reading the pushlog.
> 
> There's no easy workaround. We'll have to wait for the next push.

Yeah, that seems to have worked here.
QA Contact: hwine → klibby
Seems this hit us again on the beta-to-release merge day.
See Also: → 1305492
Ugh. I feel bad about having a bug impacting merge day activities. So I decided to spend a few minutes to dig deeper.

The automationrelevant() revset is not the problem: I can execute it on the command line and it completes very quickly. The bottleneck is calling webutil.changelistentry() thousands of times. Specifically, the pushlog extension code that runs during changelistentry() opens a new connection to the SQLite database and then runs 2 SQL queries.

I suspect reusing the DB connection will make the perf issue go away. But that change isn't exactly trivial :/
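A rough illustration of the connection-reuse idea (class, table, and column names here are made up, not the actual pushlog extension code):

    # Illustrative only: keep a single SQLite connection on the pushlog
    # object instead of reconnecting for every changeset entry.
    import sqlite3

    class pushlog(object):
        def __init__(self, db_path):
            self._db_path = db_path
            self._conn = None

        @property
        def conn(self):
            # Open the database once and reuse the connection for all
            # subsequent lookups.
            if self._conn is None:
                self._conn = sqlite3.connect(self._db_path)
            return self._conn

        def push_for_node(self, node):
            # Table and column names are hypothetical.
            res = self.conn.execute(
                'SELECT pushid FROM changesets WHERE node = ?', (node,))
            return res.fetchone()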
Given that the endpoint is specific to automation, it would not be unreasonable to return some "lots of things changed!" response when the revset exceeds a certain size, with corresponding handling of that response in-tree.
If we can just ignore automationrelevant() and let the decision task finish, that would be good.
We hit this again today (merge day).
+1. If the number of commits exceeds a certain size, we should just return true.

I'm hesitant to treat a timeout as meaning everything is relevant, since that means we might run all automation whenever hg.m.o gets really slammed, which would exacerbate things. I'm equally hesitant to treat a timeout as meaning nothing is relevant, since that logic breaks on merge day. Setting a threshold for the number of commits on the server side makes the most sense to me.
Comment on attachment 8844279 [details]
hgmo: Cap at 500 revisions in automationrelevancewebcommand (bug 1303904)

I'm unsure whether this will work the way I think it will (I can't get my test environment working for some reason), but it seemed worth a shot.

If it does work as intended, it will just stop looking at csets once we reach 500, then carry on with the rest of the function using that trimmed list.

Unless I read gps's comment wrong and I need to move this into the automationrelevant() revset itself?
Attachment #8844279 - Flags: feedback?(aki)
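A rough sketch of the capping approach under discussion (illustrative only, not the actual attachment; the real webcommand wiring is omitted):

    # Illustrative only: cap how many changesets the relevance endpoint
    # enumerates, and tell the caller when the list was truncated so it
    # can treat the push as "everything is relevant".
    MAX_REVS = 500

    def cap_relevant_revs(revs, max_revs=MAX_REVS):
        revs = list(revs)
        if len(revs) <= max_revs:
            return revs, False
        # Trimmed: the consumer should schedule everything rather than
        # trusting the partial list.
        return revs[:max_revs], True

    # e.g. cap_relevant_revs(range(1200)) -> (first 500 revs, True)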
Comment on attachment 8844279 [details]
hgmo: Cap at 500 revisions in automationrelevancewebcommand (bug 1303904)

I'm not sure what the "yes, this is relevant; ignore the details" response should be.  It looks like we populate a json blob normally.  I'm also not sure if 500 is the right threshold, but it's definitely better than an unbounded maximum.  This patch probably requires more knowledgeable eyes than mine.
Attachment #8844279 - Flags: feedback?(aki) → feedback+
Happy to wait for gps to return, we'll still have a month after that before the next thousand+ cset merge day.
Flags: needinfo?(gps)
Attachment #8844279 - Flags: feedback?(gps)
Comment on attachment 8844279 [details]
hgmo: Cap at 500 revisions in automationrelevancewebcommand (bug 1303904)

I'll come up with a hack to fix the perf issues.

The reason I was reluctant to do this before is that I preferred to work on an upstream API so we wouldn't need a hack. But on further reflection, that is a lot of work. So the hacky route it is.
Flags: needinfo?(gps)
Attachment #8844279 - Flags: feedback?(gps) → feedback-
Assignee: nobody → gps
Status: NEW → ASSIGNED
The abridged version of the commit messages: this series drops the time to first byte for an example uplift push from ~120s to ~1.5s, and the total size of the JSON response from 2348 MB to 10.8 MB. I listened to Entry of the Gladiators while writing the commit messages.
Attachment #8844279 - Attachment is obsolete: true
Comment on attachment 8875868 [details]
hgmo: stream JSON responses (bug 1303904);

https://reviewboard.mozilla.org/r/147270/#review152956

::: hgext/hgmo/__init__.py:319
(Diff revision 1)
>      # Round trip to ensure we have valid JSON.
>      try:
>          d = json.loads(stdout)
> -        return json.dumps(d, indent=2, sort_keys=True)
> +        return stream_json(d)
>      except Exception:
>          return json.dumps({'error': 'invalid JSON returned; report this error'},
>                            indent=2)

could we assume it's stable and stream its output directly to the requester without a json decode/encode round trip?

as per our chat feel free to drop this issue and leave as-is; this isn't in the critical path for this performance issue.
Attachment #8875868 - Flags: review?(glob) → review+
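For context, a minimal sketch of what streaming the JSON output can look like; the real stream_json() in the hgmo extension may be implemented differently:

    import json

    def stream_json(obj, chunk_size=65536):
        """Yield a JSON document in chunks instead of one big string.

        json.JSONEncoder.iterencode() produces output incrementally, so
        the first bytes can go out to the client before the whole
        document has been serialized.
        """
        encoder = json.JSONEncoder(indent=2, sort_keys=True)
        buf, size = [], 0
        for piece in encoder.iterencode(obj):
            buf.append(piece)
            size += len(piece)
            if size >= chunk_size:
                yield ''.join(buf)
                buf, size = [], 0
        if buf:
            yield ''.join(buf)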
Comment on attachment 8875869 [details]
pushlog: make repo.pushlog a cached property (bug 1303904);

https://reviewboard.mozilla.org/r/147272/#review154254

::: commit-message-0f52a:6
(Diff revision 1)
> +pushlog: make repo.pushlog a cached property (bug 1303904); r?glob
> +
> +Currently, repo.pushlog creates a new pushlog instance on every
> +access. This creates overhead and prevents some forms of caching.
> +
> +The "propertyfilecache" decorator cached the property value, but only

i think you mean "repofilecache" here, as propertyfilecache creates a new object on every call.
Attachment #8875869 - Flags: review?(glob) → review+
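Plain-Python illustration of the cached-property behavior (the actual change uses Mercurial's own caching decorators, not this class):

    class cachedproperty(object):
        # Non-data descriptor: the first access computes the value and
        # stores it in the instance __dict__, so later accesses bypass
        # __get__ entirely.
        def __init__(self, func):
            self.func = func

        def __get__(self, obj, objtype=None):
            if obj is None:
                return self
            value = self.func(obj)
            obj.__dict__[self.func.__name__] = value
            return value

    class repo(object):
        @cachedproperty
        def pushlog(self):
            print('creating pushlog instance')
            return object()  # stand-in for the real pushlog object

    # r = repo(); r.pushlog; r.pushlog  -> prints only once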
Comment on attachment 8875870 [details]
pushlog: factor out code to obtain a push from a node (bug 1303904);

https://reviewboard.mozilla.org/r/147274/#review154258
Attachment #8875870 - Flags: review?(glob) → review+
Comment on attachment 8875871 [details]
hgmo: refactor loop iterating over revs (bug 1303904);

https://reviewboard.mozilla.org/r/147276/#review154260
Attachment #8875871 - Flags: review?(glob) → review+
Comment on attachment 8875872 [details]
pushlog: API to cache pushlog lookups (bug 1303904);

https://reviewboard.mozilla.org/r/147278/#review154262

yeah, this is somewhat hacky, but it looks sane given the approach.
Attachment #8875872 - Flags: review?(glob) → review+
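Hypothetical sketch of what a pushlog lookup cache can look like; the actual API added in this series may differ:

    class pushcache(object):
        # Resolve pushes for many nodes up front so the per-changeset
        # code path is a dict lookup rather than a SQLite query.
        def __init__(self, pushlog):
            self._pushlog = pushlog
            self._node_to_push = {}

        def precache(self, nodes):
            for node in nodes:
                # pushfromnode() is a stand-in name for whatever the
                # pushlog exposes to map a node to its push.
                self._node_to_push[node] = self._pushlog.pushfromnode(node)

        def pushfromnode(self, node):
            try:
                return self._node_to_push[node]
            except KeyError:
                push = self._pushlog.pushfromnode(node)
                self._node_to_push[node] = push
                return push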
Comment on attachment 8875873 [details]
hgmo: don't send list of push nodes with automationrelevant response (bug 1303904);

https://reviewboard.mozilla.org/r/147280/#review154264
Attachment #8875873 - Flags: review?(glob) → review+
Comment on attachment 8875868 [details]
hgmo: stream JSON responses (bug 1303904);

https://reviewboard.mozilla.org/r/147270/#review152956

> could we assume it's stable and stream its output directly to the requester without a json decode/encode round trip?
> 
> as per our chat feel free to drop this issue and leave as-is; this isn't in the critical path for this performance issue.

Yeah, we can improve this code. But not in this bug.
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/2ba557b29106
hgmo: stream JSON responses ; r=glob
https://hg.mozilla.org/hgcustom/version-control-tools/rev/fa9313510326
pushlog: make repo.pushlog a cached property ; r=glob
https://hg.mozilla.org/hgcustom/version-control-tools/rev/01497dd90af6
pushlog: factor out code to obtain a push from a node ; r=glob
https://hg.mozilla.org/hgcustom/version-control-tools/rev/cbe2a5a5ae20
hgmo: refactor loop iterating over revs ; r=glob
https://hg.mozilla.org/hgcustom/version-control-tools/rev/af4420a6680e
pushlog: API to cache pushlog lookups ; r=glob
https://hg.mozilla.org/hgcustom/version-control-tools/rev/c23d48384dba
hgmo: don't send list of push nodes with automationrelevant response ; r=glob
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
This has been deployed. HTTP requests like https://hg.mozilla.org/releases/mozilla-aurora/json-automationrelevance/5dddbefdf759f09b1411f33fa0920835b919fc81 that previously timed out now complete in ~6s. This should no longer impact uplift day activities.