Closed Bug 1559878 Opened 5 years ago Closed 5 years ago

1.98 - 5.12% a11yr / tp5n main_startup_fileio (windows7-32-shippable) regression on push 8b110b9889c34d70465d2a2b98b1747ae03f3c3f

Categories

(Firefox Build System :: General, defect)

x86_64
Windows 7
defect
Not set
normal

Tracking

(firefox-esr60 unaffected, firefox67 unaffected, firefox68 unaffected, firefox69+ fixed)

RESOLVED FIXED
mozilla69

People

(Reporter: igoldan, Assigned: Gijs)

References

(Regression)

Details

(4 keywords)

Talos has detected a Firefox performance regression from push:

https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=26625ccc272ce893c72ccd0dab2d5ef6b727ddae&tochange=8b110b9889c34d70465d2a2b98b1747ae03f3c3f

Since you are the author of one of the patches included in that push, we need your help to address this regression.

Regressions:

5% a11yr windows7-32-shippable opt e10s stylo 144.46 -> 151.86
2% tp5n main_startup_fileio windows7-32-shippable opt e10s stylo 101,528,879.00 -> 103,543,734.33
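The percentages in the summary line follow directly from these before/after values; a minimal sketch of that arithmetic (numbers copied from the two regression lines above):

    # Regression percentages from the Talos before/after values above.
    regressions = {
        "a11yr": (144.46, 151.86),
        "tp5n main_startup_fileio": (101_528_879.00, 103_543_734.33),
    }

    for test, (before, after) in regressions.items():
        pct = (after - before) / before * 100
        print(f"{test}: {pct:.2f}% regression")  # ~5.12% and ~1.98%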

You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=21297

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Performance_sheriffing/Talos/Tests

For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Performance_sheriffing/Talos/Running

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Performance_sheriffing/Talos/RegressionBugsHandling

Product: Testing → Firefox Build System
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → DUPLICATE

This is unexpected, whereas the build time increases weren't unexpected in the same way, so I don't think we should dupe this over. We should understand this better and/or revert the PGO e10s change on Windows, and/or see whether fixing bug 1557762 or bug 1557785 fixes this.

Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---

https://treeherder.mozilla.org/#/jobs?repo=try&revision=b34fcfdbed777cc6edaa0058619b6fc1fa32bb67&selectedJob=252487460 seems to suggest fixing bug 1557762 fixes the a11yr stuff, but not the startup IO thing.

Florian, it looks like there are some markers dumped into the logfile for the xperf job that produces the I/O numbers, which I could use to find out what the difference before/after the change was - how would I go about doing that?

Flags: needinfo?(florian)
See Also: → 1555311

(In reply to :Gijs (he/him) from comment #3)

Sorry for not noticing this needinfo earlier.

> Florian, it looks like there are some markers dumped into the logfile for the xperf job that produces the I/O numbers, which I could use to find out what the difference before/after the change was - how would I go about doing that?

I don't know much about this xperf test unfortunately. The JSON line in the uploaded xperf_raw.log artifact that contains PoisonIOInterposer several times seems to contain information from the IOInterposer. Are you asking how to compare these pieces of JSON? If so, I would parse the JSON, dump the file paths, sort them, and then diff the before/after lists. I'm not sure that'll help understand the regression though; I would need a better understanding of how the main_startup_fileio value is computed.
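A minimal sketch of that comparison, assuming each relevant line in xperf_raw.log is a JSON object (or list of objects) carrying a "filename" field - the field name and layout are assumptions, so adjust to whatever the log actually contains:

    import json

    def file_paths(log_path):
        """Collect file paths from the JSON lines in an xperf_raw.log dump.

        Assumes each relevant line is a JSON object (or list of objects)
        with a "filename" field; the real field name may differ.
        """
        paths = set()
        with open(log_path) as log:
            for line in log:
                line = line.strip()
                if not line.startswith(("{", "[")):
                    continue  # skip plain-text log lines
                try:
                    data = json.loads(line)
                except ValueError:
                    continue  # not actually JSON after all
                records = data if isinstance(data, list) else [data]
                for record in records:
                    if isinstance(record, dict) and "filename" in record:
                        paths.add(record["filename"])
        return paths

    before = file_paths("xperf_raw_before.log")
    after = file_paths("xperf_raw_after.log")
    print("only after: ", sorted(after - before))
    print("only before:", sorted(before - after))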

Flags: needinfo?(florian)

OK. The a11yr thing has come back down as a result of bug 1557762. Looking at the IO thing some more: it looks like after this regression it was bimodal, and it now seems to have settled on the high end of that bimodal range. :-(

Assignee: nobody → gijskruitbosch+bugs
Status: REOPENED → ASSIGNED
Depends on: 1557762

The poison IO stuff is basically identical, as far as I can tell.

The readcount/readbytes data in the log is more illuminating. Specifically, in later builds, the following entries appear that did not before:

msctf.dll
C:/windows/Prefetch/FIREFOX.EXE-[somehash].pf

and it also catches us reading xul.dll 40 more times (2988 after the switch, 2948 before) and nss3.dll 8 more times (84 vs 76).

Bytewise, this accounts for:

file                    bytes before    bytes after       diff
msctf.dll                          0          65536      65536
prefetch firefox.exe               0         311952     311952
xul.dll                    193200128      195821568    2621440
nss3.dll                     4980736        5505024     524288
Total                              -              -  3,523,216

which is actually slightly more than the regression recorded here, though it matches up rather neatly with the total regression we're now looking at on the talos graph. (FWIW, I compared numbers from https://treeherder.mozilla.org/#/jobs?repo=autoland&selectedJob=249402118&group_state=expanded&revision=5367531b598c04c6478605832fbafc2df0184c40&searchStr=xperf%2C7 and https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=da7de725195576687e821ecffd29c7886e2edc3d&searchStr=7%2Cxperf&selectedJob=250098278 )
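The totals above are just the per-file readbytes deltas between the two logs; a minimal sketch of that arithmetic, with the byte counts hard-coded from the table:

    # before/after readbytes per file, taken from the two xperf logs compared above
    read_bytes = {
        "msctf.dll": (0, 65536),
        "prefetch firefox.exe": (0, 311952),
        "xul.dll": (193200128, 195821568),
        "nss3.dll": (4980736, 5505024),
    }

    total = 0
    for name, (before, after) in read_bytes.items():
        diff = after - before
        total += diff
        print(f"{name:22} {before:>11} {after:>11} {diff:>9}")
    print(f"total extra bytes read: {total:,}")  # 3,523,216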

I have a bunch of questions at this point that I'm not sure how to answer:

  1. why was there any change as a result of the PGO change here?
  2. why wasn't there a similar change on other platforms?
  3. why are the numbers for this test on win7 about 100x as high as on win10? (918831 is a representative value for win10, ie about 915kb, whereas win7 reports over 100mb)
  4. the ctf dll is from MS's text services framework, which makes me suspect bug 1415270 of triggering loading it in the parent on win7. That dll alone is comparatively a tiny part of the impact here, but it does make me less sure about whether the pgo changes are actually at fault here. I'll do a trypush that backs out my changes to verify.

Joel/Ionuț, who could help with questions 1-3, ie who knows about xperf?

Flags: needinfo?(jmaher)
Flags: needinfo?(igoldan)

File I/O often correlates with binary size. Looking at the "Windows 2012 shippable opt" builds in your comparison:

> (FWIW, I compared numbers from
> https://treeherder.mozilla.org/#/jobs?repo=autoland&selectedJob=249402118&group_state=expanded&revision=5367531b598c04c6478605832fbafc2df0184c40&searchStr=xperf%2C7 and
> https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=da7de725195576687e821ecffd29c7886e2edc3d&searchStr=7%2Cxperf&selectedJob=250098278 )

Before:

    NSS section sizes opt nightly: 2370010
    XUL section sizes opt nightly: 94344820

After:

    NSS section sizes opt nightly: 2618210
    XUL section sizes opt nightly: 95536532

So libxul gained more than a megabyte, and NSS gained a pretty alarming 10%. Was the difference in these pushes just PGO settings? Or did some feature code land in between or something?
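As a quick check on those two claims, a minimal sketch of the delta arithmetic using the section sizes quoted above:

    # "section sizes" build metrics from the before/after builds quoted above, in bytes
    sizes = {
        "NSS": (2370010, 2618210),
        "XUL": (94344820, 95536532),
    }

    for lib, (before, after) in sizes.items():
        delta = after - before
        print(f"{lib}: +{delta:,} bytes ({delta / before * 100:.1f}%)")
    # NSS: +248,200 bytes (10.5%)
    # XUL: +1,191,712 bytes (1.3%)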

For more details about the xperf test machinery, aklotz would be a good starting point.

Thanks! I guess I'll see from a trypush if the binary size changes are the result of the e10s thing or not...

(In reply to :Gijs (he/him) from comment #6)

> <snip>
> I have a bunch of questions at this point that I'm not sure how to answer:
>
>   1. why was there any change as a result of the PGO change here?
>   2. why wasn't there a similar change on other platforms?
>   3. why are the numbers for this test on win7 about 100x as high as on win10? (918831 is a representative value for win10, ie about 915kb, whereas win7 reports over 100mb)
>   4. the ctf dll is from MS's text services framework, which makes me suspect bug 1415270 of triggering loading it in the parent on win7. That dll alone is comparatively a tiny part of the impact here, but it does make me less sure about whether the pgo changes are actually at fault here. I'll do a trypush that backs out my changes to verify.

:aklotz, any recommended reading / advice / answers for the first 3 questions here? :-)

Flags: needinfo?(jmaher)
Flags: needinfo?(igoldan)
Flags: needinfo?(aklotz)

(In reply to David Major [:dmajor] from comment #7)

> Before:
>
>     NSS section sizes opt nightly: 2370010
>     XUL section sizes opt nightly: 94344820
>
> After:
>
>     NSS section sizes opt nightly: 2618210
>     XUL section sizes opt nightly: 95536532
>
> So libxul gained more than a megabyte, and NSS gained a pretty alarming 10%. Was the difference in these pushes just PGO settings? Or did some feature code land in between or something?

Looks like it's just the PGO settings:

Empty trypush (https://treeherder.mozilla.org/#/jobs?repo=try&revision=f1e6f9b8d9663f092e0f27a3c88e49aec2646536&selectedJob=254434385 ):

installer size opt nightly: 71145736
NSS section sizes opt nightly: 2630178
XUL section sizes opt nightly: 96357452

Backout (so e10s turned off; https://treeherder.mozilla.org/#/jobs?repo=try&revision=83f72cd9f01390fc93eb72bbf03f10b47d3097a5&selectedJob=254433998 ):

installer size opt nightly: 70097095
NSS section sizes opt nightly: 2360946
XUL section sizes opt nightly: 94241652

This isn't really my area of expertise - do we have a way of inspecting/diffing the resulting binaries and finding out what the cause of the difference is?

I'll note that there was a similar increase on the 64-bit builds, which somehow wasn't flagged in the same way; that comes back to my earlier question of why only win7 complained here...

Flags: needinfo?(dmajor)

I can try to take a look with some binary size tools tomorrow.

So, in nss3.dll, 90%+ of the size change is attributable to increased inlining in sqlite3.c. A bunch of sqlite functions disappeared from the binary and their callers got larger (presumably multiple callers per callee). I don't see how that relates to e10s though.

In xul.dll, the size changes are less clear-cut, but a large chunk of the delta is in ipc and ipdl. I guess that makes sense. Suddenly we have profile data that never existed before for e10s codepaths, and now the compiler can apply optimizations there.

One would hope that we'd get something in return for these optimizations, like better perf in sqlite or IPC paths, but I guess nothing in our suite is sensitive enough to those to have shown an improvement? My gut feeling is that the increased optimization is probably still a good thing though.
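The kind of before/after comparison described above (sqlite functions disappearing while their callers grow) can be approximated by diffing per-symbol size lists from the two binaries; a minimal sketch, assuming "name size" lines per symbol have already been exported from each build with whatever symbol-dumping tool is at hand (that input format is an assumption):

    def load_symbols(path):
        """Parse 'name size' lines exported from a symbol-dumping tool."""
        symbols = {}
        with open(path) as f:
            for line in f:
                name, _, size = line.strip().rpartition(" ")
                if name and size.isdigit():
                    symbols[name] = int(size)
        return symbols

    before = load_symbols("nss3_before_symbols.txt")
    after = load_symbols("nss3_after_symbols.txt")

    gone = sorted(set(before) - set(after))  # likely inlined away
    print(f"{len(gone)} symbols disappeared, e.g.", gone[:10])

    # callers that got larger, biggest growth first
    grown = sorted(
        ((after[s] - before[s], s) for s in before.keys() & after.keys()
         if after[s] > before[s]),
        reverse=True,
    )
    for delta, sym in grown[:10]:
        print(f"+{delta:>7} {sym}")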

Flags: needinfo?(dmajor)

OK, then I'm going to go ahead and mark this fixed for fixing the (bigger) a11yr regression, and accept that we can't do much about the binary size here. Perhaps there's scope to improve the binary size / runtime speed trade-off we're making here in PGO more generally, but that seems like something we'd want to consider more holistically (ie not specific to this particular regression) anyway.

Status: ASSIGNED → RESOLVED
Closed: 5 years ago → 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla69

sgtm

Flags: needinfo?(aklotz)

(In reply to David Major [:dmajor] from comment #11)

> So, in nss3.dll, 90%+ of the size change is attributable to increased inlining in sqlite3.c. A bunch of sqlite functions disappeared from the binary and their callers got larger (presumably multiple callers per callee). I don't see how that relates to e10s though.

That's...weird? Surely we're not making more calls to NSS in the e10s case? ni? to keeler for input there.

> One would hope that we'd get something in return for these optimizations, like better perf in sqlite or IPC paths, but I guess nothing in our suite is sensitive enough to those to have shown an improvement? My gut feeling is that the increased optimization is probably still a good thing though.

I guess sqlite paths are probably dominated by the disk I/O, to some extent. IPC is probably in the same boat: serialization performance matters, but at some point, you still have to go into the kernel and copy data around, plus copy data into/out of your IPC buffers...

Flags: needinfo?(dkeeler)

e10s means we have an instance of NSS in the parent process and an instance in each content process. The instances in the content processes should be in "memory-only" mode, which means they shouldn't be calling any sqlite functions, so my guess would be that there are the same number or fewer calls to NSS that result in sqlite calls with e10s enabled, but apparently that's not what's happening? That said, I think it would be safe to say that overall the number of NSS calls in the e10s case is larger, because e.g. certificates we encounter get decoded in both the parent process and whichever content processes encounter them.

Flags: needinfo?(dkeeler)

(In reply to Dana Keeler (she/her) (use needinfo) (:keeler for reviews) from comment #15)

> e10s means we have an instance of NSS in the parent process and an instance in each content process. The instances in the content processes should be in "memory-only" mode, which means they shouldn't be calling any sqlite functions, so my guess would be that there are the same number or fewer calls to NSS that result in sqlite calls with e10s enabled, but apparently that's not what's happening?

The perf regression here was about I/O on the actual DLLs, because the DLLs are bigger now. This might be a result of more inlining in the e10s case, maybe if some inlined functions now get called through the memory-only paths when they wouldn't be in non-e10s. I think the hypothesis in comment #12 / comment #14 is that we've traded a larger filesize for better runtime performance here, where the runtime perf improvement is too small to notice in our test frameworks; one reason for that could be that the codepaths where we got runtime perf improvements involve sqlite or other I/O, the cost of which would dominate the CPU/in-memory work.

> That said, I think it would be safe to say that overall the number of NSS calls in the e10s case is larger, because e.g. certificates we encounter get decoded in both the parent process and whichever content processes encounter them.

I find this confusing though. Is it cheaper to do this work twice than to send the decoded cert info to the content process from the parent, or is there some reason that the latter is very difficult? If we could do the latter, could we avoid loading nss things in the child completely? (I'd take this to a different bug but atm I'm operating on the assumption that I'm just wildly misunderstanding what's going on so it's not worth filing another bug, plus I'd like to keep some context for my questions for now.)

Flags: needinfo?(dkeeler)

(In reply to :Gijs (he/him) from comment #16)

> I find this confusing though. Is it cheaper to do this work twice than to send the decoded cert info to the content process from the parent, or is there some reason that the latter is very difficult? If we could do the latter, could we avoid loading nss things in the child completely? (I'd take this to a different bug but atm I'm operating on the assumption that I'm just wildly misunderstanding what's going on so it's not worth filing another bug, plus I'd like to keep some context for my questions for now.)

I think webcrypto happens in the content process, so unless we have a completely different cryptography implementation (or remote it to the parent, which seems like the wrong thing to do from a safety point of view), we can't not have NSS in the content process. That said, avoiding decoding certificates in the content process would actually solve other issues we've encountered, such as bug 1544511.

Flags: needinfo?(dkeeler)

I filed bug 1566191 for the PSM content process stuff.

Has Regression Range: --- → yes