Closed Bug 1413570 Opened 2 years ago Closed 2 years ago
.78% installer size (linux32) regression on push da60fbc995312439ae99a50f28816da360da787b (Tue Oct 31 2017)
59 bytes, text/x-review-board-request
We have detected a build metrics regression from push: https://hg.mozilla.org/integration/autoland/pushloghtml?changeset=da60fbc995312439ae99a50f28816da360da787b As author of one of the patches included in that push, we need your help to address this regression. Regressions: 17% installer size summary linux32 pgo 64,298,163.50 -> 75,089,979.33 You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=10282 On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the jobs in a pushlog format. To learn more about the regressing test(s), please see: https://developer.mozilla.org/en-US/docs/Mozilla/Performance/Automated_Performance_Testing_and_Sheriffing/Build_Metrics
this isn't representing a single push as the build was broken for 7 pushes: https://treeherder.mozilla.org/#/jobs?repo=autoland&group_state=expanded&filter-searchStr=linux%20build%20pgo&tochange=da60fbc995312439ae99a50f28816da360da787b&fromchange=607922c730a149eb229cd3011c846c547f8004f4 :glandium as you broke the build, can you help figure out which is the root cause of the installer size regression which appears to be on linux32 pgo only
2 years ago
Component: Untriaged → Memory Allocator
Product: Firefox → Core
So there are 2 problems: - Running firefox during the PGO profile can fail unnoticed. Unsurprisingly, there's a bug on file about this: bug 1252556. This is what happens and why the installer size grew, because optimization levels changed (presumably, it compiled everything at -O3). - https://hg.mozilla.org/mozilla-central/rev/09890a442cf8 indirectly caused a crash. More details below. Note that the first phase of PGO generates code differently from normal builds or from final PGO code generation, which is why the produced build still works, and why no problem appears on normal builds either. Now, as for why Firefox crashes during the PGO profile. First, how it crashes: Program received signal SIGSEGV, Segmentation fault. 0x08067ce7 in BaseAllocator::calloc(unsigned int, unsigned int) () at /builds/worker/workspace/build/src/obj-firefox/dist/include/mozilla/CheckedInt.h:741 741 MOZ_CHECKEDINT_BASIC_BINARY_OPERATOR(Mul, *) (gdb) bt #0 0x08067ce7 in BaseAllocator::calloc(unsigned int, unsigned int) () at /builds/worker/workspace/build/src/obj-firefox/dist/include/mozilla/CheckedInt.h:741 #1 0x0806805c in Allocator<MozJemallocBase>::calloc(unsigned int, unsigned int) () at /builds/worker/workspace/build/src/memory/build/malloc_decls.h:49 #2 0x0805db5e in calloc () at /builds/worker/workspace/build/src/memory/build/malloc_decls.h:49 #3 0xf7fe8343 in _dl_new_object () from /lib/ld-linux.so.2 #4 0xf7fe460f in _dl_map_object_from_fd () from /lib/ld-linux.so.2 #5 0xf7fe5a26 in _dl_map_object () from /lib/ld-linux.so.2 #6 0xf7ff05ab in dl_open_worker () from /lib/ld-linux.so.2 #7 0xf7fec016 in _dl_catch_error () from /lib/ld-linux.so.2 This is what points to https://hg.mozilla.org/mozilla-central/rev/09890a442cf8 , which added the use of CheckedInt in calloc. At the assembly level, this is where we crash: 0x08067cd0 <+304>: imul %edi,%esi 0x08067cd3 <+307>: movdqa -0xbc04(%ebx),%xmm1 0x08067cdb <+315>: movdqa 0xd8cc(%ebx),%xmm0 0x08067ce3 <+323>: paddq %xmm1,%xmm0 => 0x08067ce7 <+327>: movdqa %xmm1,-0x28(%ebp) 0x08067cec <+332>: movdqa %xmm0,0xd8cc(%ebx) (gdb) print $ebp $1 = (void *) 0xffffadc0 The crashing instruction is movqda which is Move Aligned Double Quadword. The keyword here is "aligned". Intel docs say: "When the source or destination operand is a memory operand, the operand must be aligned on a 16-byte boundary or a general-protection exception (#GP) will be generated." and while ebp is 16-byte aligned, ebp - 0x28 is ... not. At this point, you could think the problem lies in the generated code. But it's actually not. That ebp value is just a copy of the stack pointer after the frame pointer is pushed on the stack at function entry. That value is expected to be 16-byte aligned per the x86 ABI that GCC >= 4.5 produces. So breaking upper the stack, in calloc itself, here is what we can see: Breakpoint 1, calloc () at /builds/worker/workspace/build/src/memory/build/malloc_decls.h:49 49 MALLOC_DECL(calloc, void *, size_t, size_t) (gdb) print $esp $1 = (void *) 0xffffb4d0 This is the first call to calloc in the process... and it doesn't crash. Breakpoint 1, calloc () at /builds/worker/workspace/build/src/memory/build/malloc_decls.h:49 49 MALLOC_DECL(calloc, void *, size_t, size_t) (gdb) print $esp $2 = (void *) 0xffffadf8 The second call, however, is not, and is the crashing one, coming directly from glibc's ld-linux.so. Which either means a bug in that version of the glibc, or that it was built with the old ABI that didn't align the stack to 16 bytes. Either way, the movdqa instruction doesn't like that, and the crash happens. Interestingly, that instruction looks very much like it's part of the PGO profiling code itself (coverage code), and not the result of compiling our code. Note the irony.
Assignee: nobody → mh+mozilla
Comment on attachment 8924376 [details] Bug 1413570 - Disable SSE2 when building mozjemalloc on x86 during PGO profile gen phase. https://reviewboard.mozilla.org/r/195634/#review200798
Attachment #8924376 - Flags: review?(nfroyd) → review+
Pushed by firstname.lastname@example.org: https://hg.mozilla.org/integration/autoland/rev/fa3be8644d73 Disable SSE2 when building mozjemalloc on x86 during PGO profile gen phase. r=froydnj
This got fixed: == Change summary for alert #10307 (as of Wed, 01 Nov 2017 23:57:36 GMT) == Improvements: 14% installer size summary linux32 pgo 75,174,776.17 -> 64,326,347.42 For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=10307
You need to log in before you can comment on or make changes to this bug.