Closed Bug 1603115 Opened 4 years ago Closed 4 years ago

2.06 - 2.67% JS (linux64-shippable, windows10-64, windows10-64-qr, windows7-32, windows7-32-shippable) regression on push 6c3b5dd001803c3e140d9b0b148aa7c25fd0a047 (Mon December 2 2019)

Categories

(Testing :: Performance, defect)

Version: 3
Type: defect
Priority: Not set
Severity: normal

Tracking

(firefox-esr68 wontfix, firefox71 unaffected, firefox72 wontfix, firefox73 fixed)

RESOLVED FIXED
mozilla73

People

(Reporter: Bebe, Assigned: Gijs)

References

Details

(Keywords: perf, perf-alert, regression)

Attachments

(1 file)

We have detected an awsy regression from push:

https://hg.mozilla.org/integration/autoland/pushloghtml?changeset=6c3b5dd001803c3e140d9b0b148aa7c25fd0a047

As you are the author of one of the patches included in that push, we need your help to address this regression.

Regressions:

3% JS windows7-32 opt 83,856,559.52 -> 86,092,146.86
3% JS windows7-32-shippable opt 84,042,200.33 -> 86,182,011.38
3% JS windows7-32-shippable opt 84,078,171.34 -> 86,217,091.44
2% JS linux64-shippable opt 109,946,261.60 -> 112,227,923.26
2% JS windows10-64 opt 113,376,618.66 -> 115,712,603.64
2% JS windows10-64-qr opt 114,217,775.63 -> 116,568,997.78

You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=24440

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/AWSY/Tests

This might have been caused by 2 commits:

1. No bug:
2. Bug 1599614 - Make some DOM code less nsresult-happy

RyanVM, emilio, can you take a look at this alert and maybe find the culprit?

Thanks

Flags: needinfo?(ryanvm)
Flags: needinfo?(emilio)

Interestingly, we also saw some tests become more timeout-prone after that pinning update landed (i.e. bug 1592498). I guess increased GC pressure could be a reason for that seemingly-inexplicable behavior change?

That said, I'm looking at the patch and struggling to understand what could possibly have made things so much worse. The only obvious change I can see from that patch is that blocklist.xml went from 372KB to 489KB (and there's one particularly-big new line). It looks like addons.json had a similar increase (699KB vs. 823KB). I have no idea what's happening under the hood with that data, though. Gijs, do you have any idea how we might debug this more? I'm worried that this is quite likely affecting real-world users now too since these are remotely-updated files.

Flags: needinfo?(ryanvm) → needinfo?(gijskruitbosch+bugs)

(In reply to Florin Strugariu [:Bebe] (needinfo me) from comment #1)

  2. Bug 1599614 - Make some DOM code less nsresult-happy

I'm pretty sure the relevant patch from that bug (https://hg.mozilla.org/integration/autoland/rev/bb7d90522676c6594f0fd2338beb976672c0dc29) is not to blame for this performance regression.

Flags: needinfo?(emilio)

The commit just before that is https://hg.mozilla.org/integration/autoland/rev/9dea219937cd181806ec08f5f6e55a6d4685fef4, which is a compiler update, and could reasonably have caused it? Though it was backed out later.

(In reply to Ryan VanderMeulen [:RyanVM] from comment #2)

Interestingly, we also saw some tests become more timeout-prone after that pinning update landed (i.e. bug 1592498). I guess increased GC pressure could be a reason for that seemingly-inexplicable behavior change?

Yeah, it's possible.

That said, I'm looking at the patch and struggling to understand what could possibly have made things so much worse. The only obvious change I can see from that patch is that blocklist.xml went from 372KB to 489KB (and there's one particularly-big new line). It looks like addons.json had a similar increase (699KB vs. 823KB). I have no idea what's happening under the hood with that data, though. Gijs, do you have any idea how we might debug this more?

For memory regressions you can download the memory reports off treeherder and do an about:memory diff. That shows +1.56mb of regexps-shared, +0.24mb for the long strings, and +0.5mb for the heap overhead, which neatly covers the 2.something mb increase. So yes, the blocklist increase is at fault here.
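As an illustration of that workflow, here is a minimal Node.js sketch of diffing two saved memory reports outside the about:memory UI. It assumes the reports are in the usual saved format (gzipped JSON with a top-level `reports` array of `{ process, path, kind, units, amount, description }` entries); the file names are placeholders, and this is not the diff implementation about:memory itself uses.

```js
// diff-memory-reports.js: rough sketch, not the about:memory implementation.
// Assumes saved reports are gzipped JSON with a top-level `reports` array of
// { process, path, kind, units, amount, description } entries.
const fs = require("fs");
const zlib = require("zlib");

function loadTotals(file) {
  const report = JSON.parse(zlib.gunzipSync(fs.readFileSync(file)));
  const totals = new Map();
  for (const r of report.reports) {
    // Aggregate by report path across processes (amounts are bytes for byte-unit reports).
    totals.set(r.path, (totals.get(r.path) || 0) + r.amount);
  }
  return totals;
}

// Placeholder file names for the two memory reports downloaded off treeherder.
const before = loadTotals("before.json.gz");
const after = loadTotals("after.json.gz");

const deltas = [];
for (const [path, amount] of after) {
  const delta = amount - (before.get(path) || 0);
  if (Math.abs(delta) > 100 * 1024) {
    deltas.push([path, delta]);
  }
}
deltas.sort((a, b) => Math.abs(b[1]) - Math.abs(a[1]));
for (const [path, delta] of deltas) {
  console.log(`${(delta / (1024 * 1024)).toFixed(2)} MB  ${path}`);
}
```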

I don't know that there's much we can do except that:

  1. The regexps look very inefficient to me. There's a lot of overlap between items in the huge regex that could be represented more concisely (a toy example follows this list). But I also know we're planning on getting rid of those regexes, and manually improving the regex might get in the way of doing that. :TheOne, do you know what our options are there?
  2. 1.56mb of compiled representation for a ~100KB regex string seems like a large (15-fold) increase. :mccr8, do you know (who might know) why that is and/or if there's anything that we could do to the regex to make that less terrible?
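As a toy example of the overlap in point 1 (the IDs below are made up, not taken from the real blocklist): a flat alternation repeats the shared parts of every entry, while a factored pattern matches the same strings far more concisely.

```js
// Hypothetical add-on IDs, purely for illustration; not real blocklist entries.
// Flat alternation: every entry repeats the same "@example\.com" suffix.
const flat = /^(foo1@example\.com|foo2@example\.com|foo3@example\.com)$/;

// Factored form matches exactly the same three strings with far fewer
// characters, which is roughly what "represented more concisely" means above.
const factored = /^foo[123]@example\.com$/;

console.log(flat.test("foo2@example.com"));     // true
console.log(factored.test("foo2@example.com")); // true
```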

I'm worried that this is quite likely affecting real-world users now too since these are remotely-updated files.

Do these updates not also run on the beta and/or release repo? Did we see the same effect there?

Flags: needinfo?(ryanvm)
Flags: needinfo?(gijskruitbosch+bugs)
Flags: needinfo?(continuation)
Flags: needinfo?(awagner)

We saw similar timeout issues on Beta (though I don't remember the bug # offhand - Aryx maybe remembers?). I haven't seen any reports of this on Beta yet, but it also might have gotten mixed in with the m-c merge which happened around the same time.

Flags: needinfo?(ryanvm) → needinfo?(aryx.bugmail)

Iain, it looks like you've been looking at SpiderMonkey's regexp engine recently. Do you have any answers for Gijs's questions in comment 5? Thanks.

Flags: needinfo?(continuation) → needinfo?(iireland)

Sorry, where is this regex?

A 15x increase doesn't seem implausible to me. Regex syntax is pretty terse. I keyboard-hammered a ~400 character regex and checked the size of the code we generated for it, and the generated instructions are indeed 15-20x the length of the regex string.

If you can point me at the specific regex, I can take a less generic peek at what we're generating under the covers.

Flags: needinfo?(iireland)

(In reply to Iain Ireland [:iain] from comment #8)

Sorry, where is this regex?

A 15x increase doesn't seem implausible to me. Regex syntax is pretty terse. I keyboard-hammered a ~400 character regex and checked the size of the code we generated for it, and the generated instructions are indeed 15-20x the length of the regex string.

If you can point me at the specific regex, I can take a less generic peek at what we're generating under the covers.

https://hg.mozilla.org/integration/autoland/rev/6c3b5dd001803c3e140d9b0b148aa7c25fd0a047#l1.28

If this is expected overhead, I'm not 100% sure whether it's useful to investigate further; it doesn't sound like we're likely to get order-of-magnitude improvements there in the near future. That's somewhat understandable - this regex is pretty crazy - but then I guess we should come up with an alternative.

Flags: needinfo?(gijskruitbosch+bugs)

kmag pointed me at bug 1345334 which he filed a few years ago that has some ideas for reducing memory usage.

(In reply to Andrew McCreight [:mccr8] from comment #10)

kmag pointed me at bug 1345334 which he filed a few years ago that has some ideas for reducing memory usage.

We do now already use indexeddb (on nightly/beta; we're hoping to roll out on release after 72 using normandy). The problem is that these regexes mean you can't use them to do a hash/ID-based lookup; you have to fetch all the entries out of the DB and then iterate. There are plans to fix that (i.e. stop using the long regexes and start having per-add-on block items in the DB) in Q1, AIUI, but that's not happening right now.

Right now we cache the entries in memory. We could throw away the in-memory copies of the regexes whenever we're done with establishing whether a given add-on is blocked, but that's liable to significantly increase the indexeddb traffic / CPU usage (which is already non-trivial because of the size of the data that's being structured-cloned across threads by the indexeddb implementation, AFAICT), which is also not helpful.

If we are going to attempt to stopgap this, I'd sooner look at whether we could manually split the massive regexes (which are really just of the form /^(addonID1|addonID2|addonID3|...|addonID10000)$/) into a list of IDs so we can do ID-based lookup on the client side and don't end up wasting the regex space. I don't know whether that's going to help much / enough here, though, and I likely won't have time to try it until next week at the earliest.
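A minimal sketch of that stopgap idea, under the assumption that the relevant patterns really are plain anchored alternations of literal IDs. The helper names are hypothetical and this is not the patch that eventually landed:

```js
// Sketch only: hypothetical helpers, not the actual blocklist code.
// Detects patterns of the form ^(id1|id2|...|idN)$ that contain no regex
// syntax beyond backslash escapes inside the IDs.
function tryExtractIdSet(pattern) {
  const m = /^\^\(([^()]*)\)\$$/.exec(pattern);
  if (!m) {
    return null; // Not a plain ID alternation; caller keeps using a RegExp.
  }
  // Crude unescaping for illustration (turns "example\." back into "example.").
  const ids = m[1].split("|").map(id => id.replace(/\\/g, ""));
  return new Set(ids);
}

// Fall back to a real RegExp only when the fast path doesn't apply, so
// behaviour is unchanged for any entry that isn't a simple ID list.
function makeMatcher(pattern) {
  const idSet = tryExtractIdSet(pattern);
  if (idSet) {
    return addonId => idSet.has(addonId);
  }
  const re = new RegExp(pattern);
  return addonId => re.test(addonId);
}

const matches = makeMatcher("^(addon1@example\\.com|addon2@example\\.com)$");
console.log(matches("addon2@example.com")); // true
console.log(matches("addon3@example.com")); // false
```

The point of the fallback is that any entry that doesn't fit the simple shape keeps its existing RegExp behaviour, so only the huge ID-only patterns stop paying the compiled-regex memory cost.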

(In reply to :Gijs (he/him) from comment #11)

If we are going to attempt to stopgap this, I'd sooner look at whether we could manually split the massive regexes (which are really just of the form /^(addonID1|addonID2|addonID3|...|addonID10000)$/) into a list of IDs so we can do ID-based lookup on the client side and don't end up wasting the regex space. I don't know whether that's going to help much / enough here, though, and I likely won't have time to try it until next week at the earliest.

We started doing blocks with more IDs per block to reduce the number of blocks, which I was told adds a lot of overhead in the XML as well.

There are indeed plans to move to a different format (one that doesn't use regexps). Jorge can speak more about the timeline. Please NI him if needed.

Flags: needinfo?(awagner)

Shot in the dark: it looks like the change to blocklist.xml has gone back and forth a few times. We've been seeing spikes in OOM crashes in regexp code over the last month or so: specifically, one spike starting November 16, and another on December 2. (See bug 1598331.) Do those dates line up with anything going on here? I had been working under the assumption that the change had to be external, because it didn't line up with a release, but if blocklist.xml is updated out-of-band then that might be an alternate explanation.

December 2 would definitely fit the timeline of this bug. Not sure about November 16, though.

Didn't find the bug requested in comment 6.

Flags: needinfo?(aryx.bugmail)

OK, I have some local changes to, wait for it, use regular expressions to convert the giant regular expressions into something else that doesn't have a 15x size overhead (or the not-great runtime characteristics, cf. bug 1602099).

Before: https://perfht.ml/38HIRUt

Spending about 155ms in frames containing "blockl", and 88ms in frames containing "regex".

After: https://perfht.ml/38NHK5D

Spending about 67ms in frames containing "blockl" and about 6ms in frames containing "regex".

Trypush: https://treeherder.mozilla.org/#/jobs?repo=try&revision=bc9b166a3961ce60106e7629b4304d7a42e5c6cf . Compare (not ready yet): https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=08a9513a376776863a1ee13dc036021b87976cb2&newProject=try&newRevision=bc9b166a3961ce60106e7629b4304d7a42e5c6cf&framework=4

There's still some GC churn when this processing happens, and I'm not 100% sure that's avoidable - I'm basically having to extract a data structure out of the giant string, and I don't think that's possible without creating temporary copies that then get thrown away by the GC.

On the other hand, no more red jank bar in the profile!
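On the GC-churn point: the per-ID strings probably have to exist at some point, but one rough way to avoid materializing a single huge intermediate array is to walk the alternation body with a sticky regex and add IDs to the Set as they are found. A sketch under the same hypothetical naming as above, not necessarily what the landed patch does:

```js
// Sketch: pull IDs out of a big "id1|id2|...|idN" alternation body one at a
// time instead of String.split()-ing the whole thing into a temporary array.
// The per-ID substrings still get created, but the peak amount of short-lived
// garbage is smaller.
function idsFromAlternation(body) {
  const idRe = /[^|]+/y; // sticky: each exec() resumes at idRe.lastIndex
  const ids = new Set();
  let m;
  while ((m = idRe.exec(body)) !== null) {
    ids.add(m[0].replace(/\\/g, "")); // same crude unescaping as before
    idRe.lastIndex++; // step over the "|" separator
  }
  return ids;
}

console.log(idsFromAlternation("addon1@example\\.com|addon2@example\\.com").size); // 2
```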

Flags: needinfo?(gijskruitbosch+bugs)
See Also: → 1602099
Assignee: nobody → gijskruitbosch+bugs
Status: NEW → ASSIGNED
See Also: → 1604655

The blocklist changes from bug 1604655 have landed across all branches. Did we see any improvement from that?

Flags: needinfo?(fstrugariu)

(In reply to Ryan VanderMeulen [:RyanVM] from comment #18)

The blocklist changes from bug 1604655 have landed across all branches. Did we see any improvement from that?

FWIW, I don't really expect so - I'm just hoping it'll reduce OOM crashes.

I need to polish up the patch I have for this - the comparison in comment #17 shows that it addresses almost all (if not all) of the regression here (enough to put it within noise of the "before" state). I just haven't gotten around to it yet.

The only perf change I see in that period is:
https://treeherder.mozilla.org/perf.html#/alerts?id=24581

Flags: needinfo?(fstrugariu)
Pushed by gijskruitbosch@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/b75dc26cb153
make blocklist not use regular expressions locally for ID-or'd items, r=mconley,zombie
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla73

Did you want to nominate this for Beta & ESR uplift?

Flags: needinfo?(gijskruitbosch+bugs)

(In reply to Ryan VanderMeulen [:RyanVM][PTO Dec 21 - Jan 1] from comment #24)

Did you want to nominate this for Beta & ESR uplift?

I originally planned to, but by now I don't know.

The lack of automated test coverage of some of the specifics here (despite reasonable test coverage overall) that became obvious through some of the review comments and my digging into it, the fact that I'm literally using regular expressions to parse regular expressions, and the fact that we're basically out of betas for 72, mean that I'm not convinced about uplifting to 72. It also helps that in bug 1604655 we reduced some of the block sizes, so we're left with maybe some of the perf issue (see below) but hopefully not the increased crashes.

So I'm tempted to let this ride the train with 73 for now and evaluate ESR again for that cycle. That'll also give us time to confirm this sufficiently addresses the perf regression. That sound OK?

Having just taken a peek at the graphs here, there are improvements on the 19th on https://treeherder.mozilla.org/perf.html#/graphs?timerange=2592000&series=autoland,1959387,1,4 (and, to a less noticeable/alert-y degree, on the other graphs) that look like they fit nicely with bug 1604655 anyway, so perhaps that also helped bring memory usage down. I don't see significant jumps in the graph for this patch landing, though it's been very quiet over Christmas, so it's hard to tell with the noise levels - the trypush comparison in comment #17 was fairly convincing, but it was done before the patches from bug 1604655 landed...

Flags: needinfo?(gijskruitbosch+bugs) → needinfo?(ryanvm)

Sounds good to me.

Flags: needinfo?(ryanvm)

Given that this didn't seem to lead to any noticeable change when it hit autoland, I think we can just leave ESR68 as-is at this point. Feel free to nominate for uplift if you disagree, though.
