Closed Bug 1413570 Opened 2 years ago Closed 2 years ago

16.78% installer size (linux32) regression on push da60fbc995312439ae99a50f28816da360da787b (Tue Oct 31 2017)

Categories

(Core :: Memory Allocator, defect)

53 Branch
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla58
Tracking Status
firefox-esr52 --- unaffected
firefox56 --- unaffected
firefox57 --- unaffected
firefox58 --- fixed

People

(Reporter: jmaher, Assigned: glandium)

References

Details

(Keywords: regression)

Attachments

(1 file)

We have detected a build metrics regression from push:

https://hg.mozilla.org/integration/autoland/pushloghtml?changeset=da60fbc995312439ae99a50f28816da360da787b

As author of one of the patches included in that push, we need your help to address this regression.

Regressions:

 17%  installer size summary linux32 pgo      64,298,163.50 -> 75,089,979.33


You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=10282

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://developer.mozilla.org/en-US/docs/Mozilla/Performance/Automated_Performance_Testing_and_Sheriffing/Build_Metrics
this isn't representing a single push as the build was broken for 7 pushes:
https://treeherder.mozilla.org/#/jobs?repo=autoland&group_state=expanded&filter-searchStr=linux%20build%20pgo&tochange=da60fbc995312439ae99a50f28816da360da787b&fromchange=607922c730a149eb229cd3011c846c547f8004f4

:glandium as you broke the build, can you help figure out which is the root cause of the installer size regression which appears to be on linux32 pgo only
Blocks: 1413096
Flags: needinfo?(mh+mozilla)
Component: Untriaged → Memory Allocator
Product: Firefox → Core
Blocks: 1252556
So there are 2 problems:
- Running firefox during the PGO profile can fail unnoticed. Unsurprisingly, there's a bug on file about this: bug 1252556. This is what happens and why the installer size grew, because optimization levels changed (presumably, it compiled everything at -O3).
- https://hg.mozilla.org/mozilla-central/rev/09890a442cf8 indirectly caused a crash. More details below.

Note that the first phase of PGO generates code differently from normal builds or from final PGO code generation, which is why the produced build still works, and why no problem appears on normal builds either.

Now, as for why Firefox crashes during the PGO profile. First, how it crashes:

Program received signal SIGSEGV, Segmentation fault.
0x08067ce7 in BaseAllocator::calloc(unsigned int, unsigned int) ()
    at /builds/worker/workspace/build/src/obj-firefox/dist/include/mozilla/CheckedInt.h:741
741     MOZ_CHECKEDINT_BASIC_BINARY_OPERATOR(Mul, *)
(gdb) bt
#0  0x08067ce7 in BaseAllocator::calloc(unsigned int, unsigned int) ()
    at /builds/worker/workspace/build/src/obj-firefox/dist/include/mozilla/CheckedInt.h:741
#1  0x0806805c in Allocator<MozJemallocBase>::calloc(unsigned int, unsigned int) ()
    at /builds/worker/workspace/build/src/memory/build/malloc_decls.h:49
#2  0x0805db5e in calloc () at /builds/worker/workspace/build/src/memory/build/malloc_decls.h:49
#3  0xf7fe8343 in _dl_new_object () from /lib/ld-linux.so.2
#4  0xf7fe460f in _dl_map_object_from_fd () from /lib/ld-linux.so.2
#5  0xf7fe5a26 in _dl_map_object () from /lib/ld-linux.so.2
#6  0xf7ff05ab in dl_open_worker () from /lib/ld-linux.so.2
#7  0xf7fec016 in _dl_catch_error () from /lib/ld-linux.so.2

This is what points to https://hg.mozilla.org/mozilla-central/rev/09890a442cf8 , which added the use of CheckedInt in calloc.

At the assembly level, this is where we crash:
   0x08067cd0 <+304>:   imul   %edi,%esi
   0x08067cd3 <+307>:   movdqa -0xbc04(%ebx),%xmm1
   0x08067cdb <+315>:   movdqa 0xd8cc(%ebx),%xmm0
   0x08067ce3 <+323>:   paddq  %xmm1,%xmm0
=> 0x08067ce7 <+327>:   movdqa %xmm1,-0x28(%ebp)
   0x08067cec <+332>:   movdqa %xmm0,0xd8cc(%ebx)

(gdb) print $ebp
$1 = (void *) 0xffffadc0

The crashing instruction is movqda which is Move Aligned Double Quadword. The keyword here is "aligned". Intel docs say: "When the source or destination operand is a memory operand, the operand must be aligned on a 16-byte boundary or a general-protection exception (#GP) will be generated."

and while ebp is 16-byte aligned, ebp - 0x28 is ... not.

At this point, you could think the problem lies in the generated code. But it's actually not. That ebp value is just a copy of the stack pointer after the frame pointer is pushed on the stack at function entry. That value is expected to be 16-byte aligned per the x86 ABI that GCC >= 4.5 produces. So breaking upper the stack, in calloc itself, here is what we can see:

Breakpoint 1, calloc () at /builds/worker/workspace/build/src/memory/build/malloc_decls.h:49
49      MALLOC_DECL(calloc, void *, size_t, size_t)
(gdb) print $esp
$1 = (void *) 0xffffb4d0

This is the first call to calloc in the process... and it doesn't crash.

Breakpoint 1, calloc () at /builds/worker/workspace/build/src/memory/build/malloc_decls.h:49
49      MALLOC_DECL(calloc, void *, size_t, size_t)
(gdb) print $esp
$2 = (void *) 0xffffadf8

The second call, however, is not, and is the crashing one, coming directly from glibc's ld-linux.so.

Which either means a bug in that version of the glibc, or that it was built with the old ABI that didn't align the stack to 16 bytes.

Either way, the movdqa instruction doesn't like that, and the crash happens. Interestingly, that instruction looks very much like it's part of the PGO profiling code itself (coverage code), and not the result of compiling our code. Note the irony.
Assignee: nobody → mh+mozilla
Flags: needinfo?(mh+mozilla)
Comment on attachment 8924376 [details]
Bug 1413570 - Disable SSE2 when building mozjemalloc on x86 during PGO profile gen phase.

https://reviewboard.mozilla.org/r/195634/#review200798
Attachment #8924376 - Flags: review?(nfroyd) → review+
Pushed by mh@glandium.org:
https://hg.mozilla.org/integration/autoland/rev/fa3be8644d73
Disable SSE2 when building mozjemalloc on x86 during PGO profile gen phase. r=froydnj
https://hg.mozilla.org/mozilla-central/rev/fa3be8644d73
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla58
This got fixed:

== Change summary for alert #10307 (as of Wed, 01 Nov 2017 23:57:36 GMT) ==

Improvements:

 14%  installer size summary linux32 pgo      75,174,776.17 -> 64,326,347.42

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=10307
You need to log in before you can comment on or make changes to this bug.