Closed Bug 1303904 Opened 8 years ago Closed 7 years ago

Decision task failing trying to fetch json-automationrelevance on merge to mozilla-aurora

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Assigned: gps)

References

Details

Attachments

(6 files, 1 obsolete file)

<philor>: tools.taskcluster.net/task-inspector/#ayJi2q5jSK2RAb7bhOga6Q/0 is the decision task failing on the merge to aurora, timing out fetching automationrelevance from hg.m.o
<gps>: philor, dustin: that /may/ be a regression from the recent refactor. previous merge days went fine.
<gps>: a 5s timeout is really short
<gps>: the API is likely CPU bound on the server
<gps>: you can test that by running `time curl mozilla-aurora/json-automationrele…df759f09b1411f33fa0920835b919fc81` a few dozen times. i bet the variance in execution time is really low
<gps>: looks like it is timing out after 60s on the server :/
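For reference, a minimal client-side timing sketch along the lines of gps's curl suggestion (the URL is the full one quoted later in this bug; the sample count is arbitrary):

    # Time repeated requests to the json-automationrelevance endpoint to
    # check how much the response time varies.
    import time
    import urllib.request

    URL = ('https://hg.mozilla.org/releases/mozilla-aurora/'
           'json-automationrelevance/'
           '5dddbefdf759f09b1411f33fa0920835b919fc81')

    samples = []
    for _ in range(10):
        start = time.time()
        with urllib.request.urlopen(URL, timeout=120) as resp:
            resp.read()
        samples.append(time.time() - start)

    print('min=%.1fs max=%.1fs avg=%.1fs'
          % (min(samples), max(samples), sum(samples) / len(samples)))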

Bottom line: if this happens regularly on merges, we'll need to keep the tree closed after the merge (I can't believe we don't just do that as a matter of policy) and keep something innocuous around to push to the closed tree, so we get our first set of Taskcluster jobs on the merge and can see how busted they are.
I don't think this is a regression: the server-side process is taking 60+s to generate output. Bottleneck is reading the pushlog.

There's no easy workaround. We'll have to wait for the next push.
Component: General → Mercurial: hg.mozilla.org
Product: Taskcluster → Developer Services
QA Contact: hwine
Blocks: 1317333
No longer blocks: 1317333
(In reply to Gregory Szorc [:gps] from comment #1)
> I don't think this is a regression: the server-side process is taking 60+s
> to generate output. Bottleneck is reading the pushlog.
> 
> There's no easy workaround. We'll have to wait for the next push.

Yeah, that seems to have worked here.
QA Contact: hwine → klibby
Seems this hit us again on the beta-to-release merge day.
See Also: → 1305492
Ugh. I feel bad about having a bug impacting merge day activities. So I decided to spend a few minutes to dig deeper.

The automationrelevant() revset is not the problem: I can execute it on the command line and it completes very quickly. The bottleneck is calling webutil.changelistentry() thousands of times. Specifically, the pushlog extension code that runs during changelistentry() opens a new connection to the SQLite database and then runs 2 SQL queries.

I suspect reusing the DB connection will make the perf issue go away. But that change isn't exactly trivial :/
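A rough illustration of the connection-reuse idea (class, table, and column names here are made up, not the actual pushlog extension code):

    # Illustrative only: keep a single SQLite connection on the pushlog
    # object instead of reconnecting for every changeset entry.
    import sqlite3

    class pushlog(object):
        def __init__(self, db_path):
            self._db_path = db_path
            self._conn = None

        @property
        def conn(self):
            # Open the database once and reuse the connection for all
            # subsequent lookups.
            if self._conn is None:
                self._conn = sqlite3.connect(self._db_path)
            return self._conn

        def push_for_node(self, node):
            # Table and column names are hypothetical.
            res = self.conn.execute(
                'SELECT pushid FROM changesets WHERE node = ?', (node,))
            return res.fetchone()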
Given that the endpoint is specific to automation, it would not be unreasonable to return some "lots of things changed!" response when the revset exceeds a certain size, with corresponding handling of that response in-tree.
If we can just ignore automationrelevant() and let the decision task finish, that would be good.
We hit this again today (merge day).
+1. If the number of commits exceeds a certain size, we should just return true.

I'm hesitant to treat a timeout as meaning everything is relevant, since that means we might run all automation whenever hg.m.o gets really slammed, which would exacerbate things. I'm equally hesitant to treat a timeout as meaning nothing is relevant, since that logic breaks on merge day. Setting a threshold for the number of commits on the server side makes the most sense to me.
Comment on attachment 8844279 [details]
hgmo: Cap at 500 revisions in automationrelevancewebcommand (bug 1303904)

I'm unsure whether this will work the way I think it will (I can't get my test environment working for some reason), but it seemed worth a shot.

If it does work as intended, it will just stop looking at csets once we reach 500, then carry on with the rest of the function using that trimmed list.

Unless I read gps's comment wrong and I need to move this into the automationrelevant() revset itself?
Attachment #8844279 - Flags: feedback?(aki)
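A rough sketch of the capping approach under discussion (illustrative only, not the actual attachment; the real webcommand wiring is omitted):

    # Illustrative only: cap how many changesets the relevance endpoint
    # enumerates, and tell the caller when the list was truncated so it
    # can treat the push as "everything is relevant".
    MAX_REVS = 500

    def cap_relevant_revs(revs, max_revs=MAX_REVS):
        revs = list(revs)
        if len(revs) <= max_revs:
            return revs, False
        # Trimmed: the consumer should schedule everything rather than
        # trusting the partial list.
        return revs[:max_revs], True

    # e.g. cap_relevant_revs(range(1200)) -> (first 500 revs, True)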
Comment on attachment 8844279 [details]
hgmo: Cap at 500 revisions in automationrelevancewebcommand (bug 1303904)

I'm not sure what the "yes, this is relevant; ignore the details" response should be.  It looks like we populate a json blob normally.  I'm also not sure if 500 is the right threshold, but it's definitely better than an unbounded maximum.  This patch probably requires more knowledgeable eyes than mine.
Attachment #8844279 - Flags: feedback?(aki) → feedback+
Happy to wait for gps to return, we'll still have a month after that before the next thousand+ cset merge day.
Flags: needinfo?(gps)
Attachment #8844279 - Flags: feedback?(gps)
Comment on attachment 8844279 [details]
hgmo: Cap at 500 revisions in automationrelevancewebcommand (bug 1303904)

I'll come up with a hack to fix the perf issues.

The reason I was reluctant to do this before is that I preferred to work on an upstream API so we wouldn't need a hack. But on further reflection, that is a lot of work. So the hacky route it is.
Flags: needinfo?(gps)
Attachment #8844279 - Flags: feedback?(gps) → feedback-
Assignee: nobody → gps
Status: NEW → ASSIGNED
The abridged version of the commit messages: this series drops the time to first byte for an example uplift push from ~120s to ~1.5s, and the total size of the JSON response from 2348 MB to 10.8 MB. I listened to Entry of the Gladiators while writing the commit messages.
Attachment #8844279 - Attachment is obsolete: true
Comment on attachment 8875868 [details]
hgmo: stream JSON responses (bug 1303904);

https://reviewboard.mozilla.org/r/147270/#review152956

::: hgext/hgmo/__init__.py:319
(Diff revision 1)
>      # Round trip to ensure we have valid JSON.
>      try:
>          d = json.loads(stdout)
> -        return json.dumps(d, indent=2, sort_keys=True)
> +        return stream_json(d)
>      except Exception:
>          return json.dumps({'error': 'invalid JSON returned; report this error'},
>                            indent=2)

could we assume it's stable and stream its output directly to the requester without a json decode/encode round trip?

as per our chat feel free to drop this issue and leave as-is; this isn't in the critical path for this performance issue.
Attachment #8875868 - Flags: review?(glob) → review+
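For context, a minimal sketch of what streaming the JSON output can look like; the real stream_json() in the hgmo extension may be implemented differently:

    import json

    def stream_json(obj, chunk_size=65536):
        """Yield a JSON document in chunks instead of one big string.

        json.JSONEncoder.iterencode() produces output incrementally, so
        the first bytes can go out to the client before the whole
        document has been serialized.
        """
        encoder = json.JSONEncoder(indent=2, sort_keys=True)
        buf, size = [], 0
        for piece in encoder.iterencode(obj):
            buf.append(piece)
            size += len(piece)
            if size >= chunk_size:
                yield ''.join(buf)
                buf, size = [], 0
        if buf:
            yield ''.join(buf)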
Comment on attachment 8875869 [details]
pushlog: make repo.pushlog a cached property (bug 1303904);

https://reviewboard.mozilla.org/r/147272/#review154254

::: commit-message-0f52a:6
(Diff revision 1)
> +pushlog: make repo.pushlog a cached property (bug 1303904); r?glob
> +
> +Currently, repo.pushlog creates a new pushlog instance on every
> +access. This creates overhead and prevents some forms of caching.
> +
> +The "propertyfilecache" decorator cached the property value, but only

i think you mean "repofilecache" here, as propertyfilecache creates a new object on every call.
Attachment #8875869 - Flags: review?(glob) → review+
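Plain-Python illustration of the cached-property behavior (the actual change uses Mercurial's own caching decorators, not this class):

    class cachedproperty(object):
        # Non-data descriptor: the first access computes the value and
        # stores it in the instance __dict__, so later accesses bypass
        # __get__ entirely.
        def __init__(self, func):
            self.func = func

        def __get__(self, obj, objtype=None):
            if obj is None:
                return self
            value = self.func(obj)
            obj.__dict__[self.func.__name__] = value
            return value

    class repo(object):
        @cachedproperty
        def pushlog(self):
            print('creating pushlog instance')
            return object()  # stand-in for the real pushlog object

    # r = repo(); r.pushlog; r.pushlog  -> prints only once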
Comment on attachment 8875870 [details]
pushlog: factor out code to obtain a push from a node (bug 1303904);

https://reviewboard.mozilla.org/r/147274/#review154258
Attachment #8875870 - Flags: review?(glob) → review+
Comment on attachment 8875871 [details]
hgmo: refactor loop iterating over revs (bug 1303904);

https://reviewboard.mozilla.org/r/147276/#review154260
Attachment #8875871 - Flags: review?(glob) → review+
Comment on attachment 8875872 [details]
pushlog: API to cache pushlog lookups (bug 1303904);

https://reviewboard.mozilla.org/r/147278/#review154262

yeah, this is somewhat hacky, but it looks sane given the approach.
Attachment #8875872 - Flags: review?(glob) → review+
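Hypothetical sketch of what a pushlog lookup cache can look like; the actual API added in this series may differ:

    class pushcache(object):
        # Resolve pushes for many nodes up front so the per-changeset
        # code path is a dict lookup rather than a SQLite query.
        def __init__(self, pushlog):
            self._pushlog = pushlog
            self._node_to_push = {}

        def precache(self, nodes):
            for node in nodes:
                # pushfromnode() is a stand-in name for whatever the
                # pushlog exposes to map a node to its push.
                self._node_to_push[node] = self._pushlog.pushfromnode(node)

        def pushfromnode(self, node):
            try:
                return self._node_to_push[node]
            except KeyError:
                push = self._pushlog.pushfromnode(node)
                self._node_to_push[node] = push
                return push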
Comment on attachment 8875873 [details]
hgmo: don't send list of push nodes with automationrelevant response (bug 1303904);

https://reviewboard.mozilla.org/r/147280/#review154264
Attachment #8875873 - Flags: review?(glob) → review+
Comment on attachment 8875868 [details]
hgmo: stream JSON responses (bug 1303904);

https://reviewboard.mozilla.org/r/147270/#review152956

> could we assume it's stable and stream its output directly to the requester without a json decode/encode round trip?
> 
> as per our chat feel free to drop this issue and leave as-is; this isn't in the critical path for this performance issue.

Yeah, we can improve this code. But not in this bug.
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/2ba557b29106
hgmo: stream JSON responses ; r=glob
https://hg.mozilla.org/hgcustom/version-control-tools/rev/fa9313510326
pushlog: make repo.pushlog a cached property ; r=glob
https://hg.mozilla.org/hgcustom/version-control-tools/rev/01497dd90af6
pushlog: factor out code to obtain a push from a node ; r=glob
https://hg.mozilla.org/hgcustom/version-control-tools/rev/cbe2a5a5ae20
hgmo: refactor loop iterating over revs ; r=glob
https://hg.mozilla.org/hgcustom/version-control-tools/rev/af4420a6680e
pushlog: API to cache pushlog lookups ; r=glob
https://hg.mozilla.org/hgcustom/version-control-tools/rev/c23d48384dba
hgmo: don't send list of push nodes with automationrelevant response ; r=glob
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
This has been deployed. HTTP requests like https://hg.mozilla.org/releases/mozilla-aurora/json-automationrelevance/5dddbefdf759f09b1411f33fa0920835b919fc81 that previously timed out now complete in ~6s. This should no longer impact uplift day activities.