Closed
Bug 1338651
Opened 8 years ago
Closed 7 years ago
taskcluster cross-compiled OS X builds create perf problems when not stripped (Talos performance regressions vs. buildbot builds)
Categories
(Firefox Build System :: General, defect)
Tracking
(firefox-esr52 wontfix, firefox54 wontfix, firefox55 wontfix, firefox56 fixed, firefox57 fixed)
RESOLVED
FIXED
mozilla56
People
(Reporter: jmaher, Assigned: wcosta)
References
Details
(Keywords: perf, regression, talos-regression)
User Story
Taskcluster OS X builds are cross-compiled from a Linux Docker image. The build is done in a path under '/home/', which shows up in the symbol table of the resulting binary as STAB entries referencing the object files. Some system libraries on OS X will attempt to stat the files in those entries, which can cause noticeable performance issues. This has shown up in two ways: the sandboxing system stats those paths while reporting violations, which caused the Talos regressions here, and the system crash reporter does the same during GTest death tests, causing timeouts.
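For reference, a rough sketch of how those STAB entries can be inspected and removed from a local build; the binary name and the exact nm/strip flags here are illustrative assumptions, not the fix that eventually landed:

# show STAB debugging entries; the OSO entries carry the /home/... object file paths
nm -pa XUL | grep ' OSO '

# remove the debugging symbol table (STAB) entries from the shipped binary
strip -S XUL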
Attachments
(4 files, 3 obsolete files)
4.24 MB, patch | Details | Diff | Splinter Review
675.26 KB, patch | Details | Diff | Splinter Review
59 bytes, text/x-review-board-request | mshal: review+ | Details
59 bytes, text/x-review-board-request | jseward: review+ | Details
Talos has detected a Firefox performance regression from push 9f402afa3864572af68f27eb3e16295da02841b4. As author of one of the patches included in that push, we need your help to address this regression.

Regressions:

 34% sessionrestore_no_auto_restore osx-10-10 opt e10s 968.42 -> 1293.5
 32% sessionrestore osx-10-10 opt e10s 928.83 -> 1230.25
 31% damp summary osx-10-10 opt e10s 300.83 -> 395.58
 31% tsvgr_opacity summary osx-10-10 opt e10s 351.47 -> 460.9
 29% a11yr summary osx-10-10 opt e10s 469.52 -> 607.59
 28% ts_paint osx-10-10 opt e10s 1223 -> 1570.58
 27% tart summary osx-10-10 opt e10s 11.87 -> 15.07
 23% cart summary osx-10-10 opt e10s 36.84 -> 45.37
 22% kraken summary osx-10-10 opt e10s 1604.32 -> 1963
 19% tscrollx summary osx-10-10 opt e10s 3.16 -> 3.74
 17% tsvgx summary osx-10-10 opt e10s 445.94 -> 522.67
 17% tabpaint summary osx-10-10 opt e10s 62.76 -> 73.33
 17% tsvg_static summary osx-10-10 opt e10s 69.25 -> 80.72
 16% tps summary osx-10-10 opt e10s 25.65 -> 29.71
 15% tcanvasmark summary osx-10-10 opt e10s 6273.62 -> 5335.83
 12% tpaint osx-10-10 opt e10s 333.73 -> 374.12
 12% tresize osx-10-10 opt e10s 29.28 -> 32.65

Improvements:

  2% tp5o Main_RSS osx-10-10 opt 398745663.13 -> 388868909.48

You can find links to graphs and comparison views for each of the above tests at:
https://treeherder.mozilla.org/perf.html#/alerts?id=5075

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Buildbot/Talos/Tests

For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Buildbot/Talos/Running

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
Reporter | ||
Comment 1•8 years ago
|
||
oh boy, running talos tests via BBB against a Taskcluster OSX build yields a large volume of performance regressions. What is scarier is that the tests are now very noisy, which makes them less useful.
Reporter | ||
Comment 2•8 years ago
|
||
Ted, do you have thoughts on how we could track down the perf regression by comparing a taskcluster build against a buildbot build?
Flags: needinfo?(ted)
Comment 3•8 years ago
|
||
This is unexpected and unfortunate! Here are the obvious things I think we should sanity check first:

* Verify that we're using the same compiler/linker flags in both builds
* Verify that we're using the same compiler/linker version in both builds
* Verify that other mozconfig build options are the same between both builds

If those all check out then my next thought would be to take one of those tests that shows a large difference and have someone run a pair of builds through it on their local machine, and ensure they can reproduce the results. If so, they can profile both builds running the test and see what jumps out.
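A rough sketch of one way to start on the first two checks, assuming both object directories are available locally (the directory names and the grep pattern are assumptions):

grep -E 'C(XX)?FLAGS|LDFLAGS' bb-objdir/config.status > bb-flags.txt
grep -E 'C(XX)?FLAGS|LDFLAGS' tc-objdir/config.status > tc-flags.txt
diff -u bb-flags.txt tc-flags.txt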
Flags: needinfo?(ted)
Reporter | ||
Comment 4•8 years ago
|
||
one thing to think about... these are cross-compiled builds on Linux vs. building natively on OS X - will we realistically be able to compare flags/toolchains?
Comment 5•8 years ago
|
||
Yes. We're using the same compiler (clang) and same binutils (cctools) for both builds.
Comment 6•8 years ago
|
||
Do we need to revisit bug 1314154 in light of this?
Reporter | ||
Comment 7•8 years ago
|
||
possibly, that difference will turn up when we start working on this.
Assignee | ||
Comment 8•8 years ago
|
||
Can we get a native build with the startup cache disabled, to measure its influence?
Flags: needinfo?(jmaher)
Reporter | ||
Comment 9•8 years ago
|
||
great idea, I have no clue how we even messed up the startup cache. Ted, do you know how to do this?

I would do a baseline build:
try -b o -p macosx64 -u none -t all --rebuild 5

then a startup cache change and repeat:
try -b o -p macosx64 -u none -t all --rebuild 5

finally compare both.
Flags: needinfo?(jmaher) → needinfo?(ted)
Comment 10•8 years ago
|
||
We just chatted about this on IRC, but the thing that the TC builds don't do is *pre-generate* the startupcache during packaging, because they can't run the xpcshell that was built. To force a buildbot build to act that way you could just change this conditional to `if False:` in a try build: https://dxr.mozilla.org/mozilla-central/rev/e1135c6fdc9bcd80d38f7285b269e030716dcb72/toolkit/mozapps/installer/packager.py#375
Flags: needinfo?(ted)
Comment 11•8 years ago
|
||
:jmaher, what is the special try syntax to run talos via BBB against a Taskcluster OSX build (instead of the native buildbot build/job)?
Flags: needinfo?(jmaher)
Reporter | ||
Comment 12•8 years ago
|
||
I believe you need a special patch, possibly wcosta has prior art to share?
Flags: needinfo?(jmaher) → needinfo?(wcosta)
Assignee | ||
Comment 13•8 years ago
|
||
(In reply to Joel Maher ( :jmaher) from comment #12)
> I believe you need a special patch, possibly wcosta has prior art to share?

You need to uncomment macosx64/opt inside test-platforms.yml
Flags: needinfo?(wcosta)
Reporter | ||
Comment 14•8 years ago
|
||
thanks to :wcosta, we have a push that disables startup-cache for buildbot, and we have talos data we can compare against the baseline with it enabled:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&originalRevision=34c6c2f302e7b48e3ad2cec575cbd34d423a9d32&newProject=try&newRevision=1b40ec069ce9226e107a4a3eda505bc9b08db6ac&framework=1&filter=osx&showOnlyImportant=0

there are no changes here - that indicates the startup cache is not responsible for the talos issues we are seeing.
Comment 15•8 years ago
|
||
That is good to know, thanks wcosta! I think we need to work through my checklist in comment 3.
Assignee | ||
Comment 16•8 years ago
|
||
One more thing that came to mind: I added a cross-built libc++ in bug 1273981; assuming it links statically, maybe it is not an optimal build. One more thing to keep an eye on.
Assignee | ||
Comment 17•8 years ago
|
||
I ran talos tests with gecko profiling enabled to help investigate what's going on.

buildbot: https://treeherder.mozilla.org/#/jobs?repo=try&revision=845f636356fff1ff1beb6b4539d8826982fecf63&selected
buildbot-bridge (aka x-build): https://treeherder.mozilla.org/#/jobs?repo=try&revision=aab1e349b9d4f1f867eb5fe3250ceb39ce884e00&selected
Reporter | ||
Comment 18•8 years ago
|
||
keep in mind this is only an e10s issue, not non-e10s; not sure why - so we need to look at the e10s tests. (And a note: we can reduce try usage in the future and focus on specific tests for the time being: try: -b o -p macosx64 -u none -t chromez-e10s,other-e10s)

There are still issues with the profiler and talos, but the other-e10s job results in a tabpaint.zip profile (ignore all the others, it is invalid).

bb: http://mozilla-releng-blobs.s3.amazonaws.com/blobs/Try/sha512/8ad17a929b724b4dbee024d15460cc84d41871c646b095413e4cc8fbc69f8a2a3a155e132ae397c1a56a89c05d16775a531b4b0992c2ccdd91010f49c55bbe3e
tc: http://mozilla-releng-blobs.s3.amazonaws.com/blobs/Try/sha512/419d390d79e1bddd271eeb5d5165d15a4077a5d9374abf1cddebc4e19fc24cd0b64338df833f72c027b20e926d172396045d08917c91f55f1d24a4a2c7df45c0

I downloaded these locally, unpacked them, then loaded the cycle_0.profile file into https://cleopatra.io/

Looking at this, I see that BB is 1100ms faster than TC builds in starting the DOM work on the child process. That seems odd.

In the child process, I see:
BB: 383 calls to mozilla::ipc::messagepump::run(), 1 call to nsTimer::cancel
TC: 440 calls to mozilla::ipc::messagepump::run()
odd that we call nsTimer::cancel for BB, maybe this is a side effect of sampled profiling.
BB: 1 call in the nsAppShell::Run() to mozilla::dom::power::powermanagerservice::getInstance()

Most time is spent in:
BB: mach__msg_trap 93.8%
TC: mach__msg_trap 91.8%

I am not sure this helps at all, but possibly someone else familiar with geckoprofiler can use this data and find interesting bits.
Comment 19•8 years ago
|
||
`mach__msg_trap` is basically the syscall interface to the kernel, it just means the process is waiting on something, which is pretty normal for Firefox (it usually means we're waiting in the event loop). If we're spending slightly less % time there in the TC builds it means we're spending more time using the CPU, so there's something going wrong there. The 1100ms thing is interesting--is it taking longer to start the content process in the TC build? If so that would explain a lot of this.
Assignee | ||
Comment 20•8 years ago
|
||
I put on a Platform Engineer hat and started to investigate tsvgr_opacity; it looks like there is no single place causing the performance loss. The most penalized part seems to be the JS engine. My guess is that the cross-compiler is indeed generating slower code, and the JS engine suffers more just because it is a top CPU-bound component. I will look at tabpaint too.
Reporter | ||
Comment 21•8 years ago
|
||
that might point to compiler options
Comment 22•8 years ago
|
||
(In reply to Joel Maher ( :jmaher) from comment #21)
> that might point to compiler options

Agreed. Maybe we should verify Ted's assumption in comment #5?
Assignee | ||
Comment 23•8 years ago
|
||
This attachment contains the list of functions present in the native-build libXUL but not found in the x-compiled libXUL. One possibility is that these functions implement high-performance operations using special processor instructions, and the build system queries the environment for the availability of such instructions; or perhaps the cross build just lacks some definitions to enable them. Just speculating. I just generated the file and will investigate further.
Assignee | ||
Comment 24•8 years ago
|
||
Another bit of info: the native build uses clang 3.8.0, the x-build uses clang 3.9.0. Yes, the slower build uses the newer compiler...
Assignee | ||
Comment 25•8 years ago
|
||
Ah, I looked at some samples of the generated assembly code and it feels like both compilers generate the same code.
Comment 26•8 years ago
|
||
How did you generate the diff in comment 23? There are things in there that seem like the build would be entirely broken if they were actually missing.
Comment 27•8 years ago
|
||
But also, yeah, we should make both builds use the same version of clang or we're not going to have a fair comparison *at all*.
Assignee | ||
Comment 28•8 years ago
|
||
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #26)
> How did you generate the diff in comment 23? There are things in there that
> seem like the build would be entirely broken if they were actually missing.

# Generate the function list for the x-build
objdump -disassemble XUL > XUL_bbb.s
sed -e '/[^:]$/d' XUL_bbb.s | sed -e '/^$/d' -e 's/://' | sort | uniq > bbb_functions.txt

# Generate the function list for the native build
objdump -disassemble XUL > XUL_bb.s
sed -e '/[^:]$/d' XUL_bb.s | sed -e '/^$/d' -e 's/://' | sort | uniq > bb_functions.txt

diff bbb_functions.txt bb_functions.txt > functions.diff

Does it sound right?
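As a cross-check, a sketch of a symbol-table-based comparison that avoids parsing disassembly output; the flags below are llvm-nm spellings and the file names (the two XUL binaries renamed per build) are assumptions:

nm --defined-only --just-symbol-name XUL_bb | sort -u > bb_functions.txt
nm --defined-only --just-symbol-name XUL_bbb | sort -u > bbb_functions.txt
diff -u bbb_functions.txt bb_functions.txt > functions.diff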
Assignee | ||
Comment 29•8 years ago
|
||
I have been investigating build differences, this is what I found:

* MOZ_AUTOMATION_L10N_CHECK is 0 on cross and 1 on native build.
* native uses clang 3.8.0, cross uses 3.9.0
* native has the build option --enable-dtrace
* cross has the linker flag -Wl,-dead_strip while native has -Wl,-no_data_in_code_info

These are related to bug 403132
Comment 30•8 years ago
|
||
I'm comparing the mozharness talos output logs for talos runs on native vs tc-build-x builds.

14:26:29 INFO - Using buildbot properties:
14:26:29 INFO - {
...

The native build has:
14:26:29 INFO -     "pgo_build": "False",

But the tc-bbb-build-x doesn't have that flag listed. Probably doesn't make a difference but just noting it just in case...
Reporter | ||
Comment 31•8 years ago
|
||
from irc we were going to run damp in a reduced test locally to see if we can reproduce. just about all the damp subtests regress:
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=autoland&originalRevision=9f23fb946723b47b9a1e1b1c82c6c3556e0354c7&newProject=autoland&newRevision=9f402afa3864572af68f27eb3e16295da02841b4&originalSignature=9663a154caf9b713fb951c8db24518c7e0ceddce&newSignature=9663a154caf9b713fb951c8db24518c7e0ceddce&framework=1

so we should be able to narrow it down nicely.

to run damp locally, visit:
https://hg.mozilla.org/mozilla-central/file/tip/testing/talos/talos/tests/devtools/addon/devtools-signed.xpi
(go to raw and it will install) then restart the browser. Finally browse to chrome://damp/content/damp.html and you can run it in the browser. Otherwise, hack up the damp manifest to reduce the runtime and push to try :)

---------------------

from comment 29, I am curious about:
* native has the build option --enable-dtrace
* cross has the linker flag -Wl,-dead_strip while native has -Wl,-no_data_in_code_info
Comment 32•8 years ago
|
||
(In reply to Joel Maher ( :jmaher) from comment #31)
> Otherwise, hack up damp manifest to reduce the runtime and push to try :)

Can't we just dry it off? ;)
Comment 33•8 years ago
|
||
In order to try to reduce the regression to a single test case, I downloaded a native buildbot osx10.10 build from a try run, and a build from taskcluster/buildbotbridge. I ran the talos DAMP test several times locally (with 25 repeats) on my Mac OSX 10.11.6. I looked at all the DAMP subtests each time, and I can't seem to reproduce the regression locally consistently. Moving back to TRY.
Comment 34•8 years ago
|
||
Whereas on try, the regression shows up very consistently on talos damp between the native buildbot and the tc bbb build, with multiple runs: https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&originalRevision=05006e5df6243f119fce1a72c58b1f2f2514bd4f&newProject=try&newRevision=659b3477addf13b0987e013a2b8f959beabcbe3b&originalSignature=9663a154caf9b713fb951c8db24518c7e0ceddce&newSignature=9663a154caf9b713fb951c8db24518c7e0ceddce&framework=1
Comment 35•8 years ago
|
||
Reducing down to Talos DAMP simple inspector test cases only doesn't reproduce the regression on try: https://treeherder.mozilla.org/#/jobs?repo=try&author=rwood@mozilla.com&fromchange=909b9c6364ccb5a3fa7e61a495d67f9484845654
Assignee | ||
Comment 36•8 years ago
|
||
This is my last run. I disabled the "--enable-dtrace" option, both builds use the -Wl,-dead_strip linker option, and I downgraded the compiler to 3.8.0 in TC.
Assignee | ||
Comment 37•8 years ago
|
||
And forgot to paste results... https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&originalRevision=28d1feb7a4915ef6a025ed23e6880fccd4cc3ebc&newProject=try&newRevision=e1ca5d69f837cd76833f0c0e6ce4550824a158d6&originalSignature=9663a154caf9b713fb951c8db24518c7e0ceddce&newSignature=9663a154caf9b713fb951c8db24518c7e0ceddce&framework=1
Updated•8 years ago
|
Summary: 11.5 - 33.57% * (osx-10-10) regression on push 9f402afa3864572af68f27eb3e16295da02841b4 (Fri Feb 10 2017) → taskcluster cross-compiled OS X builds show Talos performance regressions vs. buildbot builds
Comment 38•8 years ago
|
||
(In reply to Wander Lairson Costa [:wcosta] from comment #29) > native uses clang 3.8.0, cross uses 3.9.0 Before we dive into anything else we should get both of these builds on the same exact version of clang and compare *that*. It's entirely possible that we're chasing a regression that's completely due to using a newer version of clang. If that's so, we can just downgrade the taskcluster builds to the version the buildbot builds are using and let someone on platform sort out the performance issues with clang 3.9.
Assignee | ||
Comment 39•8 years ago
|
||
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #38)
> (In reply to Wander Lairson Costa [:wcosta] from comment #29)
> > native uses clang 3.8.0, cross uses 3.9.0
>
> Before we dive into anything else we should get both of these builds on the
> same exact version of clang and compare *that*. It's entirely possible that
> we're chasing a regression that's completely due to using a newer version of
> clang. If that's so, we can just downgrade the taskcluster builds to the
> version the buildbot builds are using and let someone on platform sort out
> the performance issues with clang 3.9.

In comment 37 I have results with clang 3.8.
Reporter | ||
Comment 40•8 years ago
|
||
ok, so that isn't helping us then :(
Comment 41•8 years ago
|
||
Narrowing down the Talos DAMP test to a single test case (inspectorOpen, openToolboxAndLog only) does reproduce the regression.

Try push 1 (native build):
https://treeherder.mozilla.org/#/jobs?repo=try&revision=befe8a08b8e8e0bfa3597336512b6df2e10279c2

Try push 2 (tc x-build):
https://treeherder.mozilla.org/#/jobs?repo=try&revision=24653d00863a640273afc2ab7c6474b86a756e11

Resulting Perfherder comparison (shows a 22% perf regression):
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=befe8a08b8e8e0bfa3597336512b6df2e10279c2&newProject=try&newRevision=24653d00863a640273afc2ab7c6474b86a756e11&framework=1&showOnlyImportant=0
Reporter | ||
Comment 42•8 years ago
|
||
for the raw perfherder data,

native:
1900.75, 1290.795, 1887.9100000000035, 1290.505000000001, 1146.380000000001, 1129.3899999999994, 1249.8050000000003, 1306.6449999999968, 1249.010000000002, 1179.9249999999956, 1124.1549999999988, 1295.449999999997, 1195.9100000000035, 1301.760000000002, 1269.8700000000026, 1195.2900000000009, 1448.510000000002, 1312.195000000007, 1306.125, 1322.9550000000017, 1132.9349999999977, 1461.8300000000017, 1141.2600000000093, 1316.804999999993, 1145.7550000000047

x-compile:
[2151.83, 1287.2200000000012, 1348.9799999999996, 1486.625, 1320.5149999999994, 1456.5900000000001, 1255.3250000000007, 1573.0049999999974, 1497.7099999999991, 1399.1199999999953, 1381.4800000000032, 1421.4649999999965, 1460.8299999999945, 1373.7699999999968, 1699.6200000000026, 1526.064999999995, 1405.7249999999985, 1386.4249999999884, 1362.8650000000052, 1524.2599999999948, 1490.074999999997, 1597.7350000000006, 1526.175000000003, 1500.3399999999965, 1472.3600000000006]

Somehow x-compile continues to be 200+ms slower than native for raw data.
Assignee | ||
Comment 43•8 years ago
|
||
Investigating the logs, the tc build has a much higher occurrence rate of the following log message:

1251 INFO - PID 1965 | ###!!! [Child][MessageChannel] Error: (msgtype=0x8E0025,name=PNecko::Msg_RemoveRequestContext) |Channel closing: too late to send/recv, messages will be lost

And this is causing delays in the g2 test.
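A trivial way to quantify that difference, assuming both talos logs have been downloaded locally (the file names are made up):

grep -c 'Channel closing: too late to send/recv' talos-bb.log talos-tc.log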
Assignee | ||
Comment 44•8 years ago
|
||
:dvander could you help figure out what's happening here? (I randomly picked you from git blame MessageChannel.cpp :) )
Flags: needinfo?(dvander)
Comment 45•8 years ago

This usually means one endpoint of an IPC channel has sent a "shutdown" message to the other side. It is processed on the I/O thread, and then the main thread, and in between there is a window where sending messages will yield this error. It's fairly harmless, and appears on actors that don't hand-shake their shutdown order.
Flags: needinfo?(dvander)
Comment 46•8 years ago
|
||
(In reply to Robert Wood [:rwood] from comment #41)
> Narrowing down the Talos DAMP test to a single test case (inspectorOpen,
> openToolboxAndLog only) does reproduce the regression.

Thanks, that's useful for digging in!

(In reply to Wander Lairson Costa [:wcosta] from comment #43)
> Investigating the logs, the tc build has a much higher occurrence rate of
> the following log message:
>
> 1251 INFO - PID 1965 | ###!!! [Child][MessageChannel] Error:
> (msgtype=0x8E0025,name=PNecko::Msg_RemoveRequestContext)
> |Channel closing: too late to send/recv, messages will be lost
>
> And this is causing delays in g2 test.

This is also useful info! It sounds like maybe something is not working properly at the IPC layer in the taskcluster builds.

Looking at the DAMP tests (which I only just now learned are devtools tests), the code that gets run during `openToolbox` does seem to do a lot of IPC messaging in the e10s case, so if IPC is somehow slower or broken I could absolutely believe that it would show up as a performance regression in these tests. I'll try to dig around in this area and see what I can find.
Assignee | ||
Comment 47•8 years ago
|
||
(In reply to David Anderson [:dvander] from comment #45)
> This usually means one endpoint of an IPC channel has sent a "shutdown"
> message to the other side. It is processed on the I/O thread, and then the
> main thread, and in between there is a window where sending messages will
> yield this error.
>
> It's fairly harmless, and appears on actors that don't hand-shake their
> shutdown order.

More recent update: I see those messages in both logs, but in BB only one message shows up, while many show up in the TC build. My guess is that for some reason the TC-built binary has a higher IPC latency than BB.
Assignee | ||
Comment 48•8 years ago
|
||
I just updated the m-c branch, added rwood's patch to narrow down the problem, and the error message is gone, although the performance regression still holds.
Reporter | ||
Comment 49•8 years ago
|
||
Given that this is an e10s only regression and we see a lot of ipc related messages, maybe we should look at OSX specific IPC code as a starting point? I wouldn't know how to find out if there is osx specific code- it would be good to know either way.
Comment 50•8 years ago
|
||
There's definitely OSX-specific IPC code, like: https://dxr.mozilla.org/mozilla-central/rev/ff04d410e74b69acfab17ef7e73e7397602d5a68/ipc/glue/moz.build#107 I don't think anything is obviously *broken* with it, though, or the builds wouldn't work. We've got green debug tests, so it's clearly working. It's possible there's just some difference here that's making it *slower*. I ran it by froydnj on IRC and he suggested that it could be something odd like using a different version of libc++ which could cause a perf difference in low-level code like IPC.
Assignee | ||
Comment 51•8 years ago
|
||
Assignee: nobody → wcosta
Assignee | ||
Updated•8 years ago
|
Attachment #8849596 -
Attachment is obsolete: true
Assignee | ||
Comment 52•8 years ago
|
||
Ok, this is what I did so far: * Remove enable-dtrace flag in BB since it is not enabled on cross. * Disable startup-cache in BB as well, as cross can't generate it. These two items were just to narrow the gap between BB and cross build flags. * Added -Wl,-no_data_in_code_info to cross build. * Target cross builds to darwin 11.2.0 to match BB target. * Use yasm 1.3.0 on cross build, same version as in BB. * Removed AC_FUNC_MEMCMP from configure, since it can't be determined at runtime in cross build and docs say it is obsolete [1] * Set ac_cv_cpp_dynamic_cast_void_ptr to 'yes' in cross build. * Downgrade clang to 3.8.0 in cross build (same version as BB). None of these had any impact on performance, reminding that the performance regression only happens on e10s. Looking at assembly code, I see the compiler generates different binary code between cross and BB builds. For example, this function in libXUL has different generated code, although the compiler command line, except by cross build flags, are identical to both BB and cross builds. cross: __ZN11mp4_demuxer4Adts13ConvertSampleEtaaPN7mozilla12MediaRawDataE pushq %rbp movq %rsp, %rbp pushq %r15 pushq %r14 pushq %r12 pushq %rbx subq $32, %rsp movq %rcx, %r12 movq 104(%r12), %rax addq $7, %rax cmpq $8191, %rax jbe 8 <__ZN11mp4_demuxer4Adts13ConvertSampleEtaaPN7mozilla12MediaRawDataE+2B> xorl %r14d, %r14d jmp 345 <__ZN11mp4_demuxer4Adts13ConvertSampleEtaaPN7mozilla12MediaRawDataE+184> movzwl %di, %ecx xorl %r14d, %r14d cmpl $15, %ecx ja 330 <__ZN11mp4_demuxer4Adts13ConvertSampleEtaaPN7mozilla12MediaRawDataE+184> testb %sil, %sil js 321 <__ZN11mp4_demuxer4Adts13ConvertSampleEtaaPN7mozilla12MediaRawDataE+184> movb %dl, %bl decb %bl movzbl %bl, %edi cmpl $3, %edi ja 305 <__ZN11mp4_demuxer4Adts13ConvertSampleEtaaPN7mozilla12MediaRawDataE+184> movw $61951, -48(%rbp) shlb $6, %dl movzbl %dl, %edx shlb $2, %sil movzbl %sil, %esi movl %ecx, %edi shrl $2, %edi addl %esi, %edi leal 192(%rdx,%rdi), %edx movb %dl, -46(%rbp) shll $6, %ecx movl %eax, %edx shrl $11, %edx addl %ecx, %edx movb %dl, -45(%rbp) movq %rax, %rcx shrq $3, %rcx movb %cl, -44(%rbp) shlq $5, %rax orl $31, %eax movb %al, -43(%rbp) movb $-4, -42(%rbp) movq (%r12), %rax movq %r12, %rdi callq *24(%rax) movq %rax, -64(%rbp) movq %rax, -40(%rbp) movq %rax, -56(%rbp) movq -56(%rbp), %rdi leaq -48(%rbp), %rsi movl $7, %edx callq 26175531 <__ZN7mozilla18MediaRawDataWriter7PrependEPKhm> testb %al, %al je 56 <__ZN11mp4_demuxer4Adts13ConvertSampleEtaaPN7mozilla12MediaRawDataE+101> movq 64(%r12), %rax movb $1, %r14b cmpb $0, (%rax) je 156 <__ZN11mp4_demuxer4Adts13ConvertSampleEtaaPN7mozilla12MediaRawDataE+176> movq 24(%rax), %rax movq -56(%rbp), %rcx movq (%rcx), %r15 addq $24, %r15 cmpl $0, (%rax) je 24 <__ZN11mp4_demuxer4Adts13ConvertSampleEtaaPN7mozilla12MediaRawDataE+106> movq (%r15), %rax cmpl $0, (%rax) je 187 <__ZN11mp4_demuxer4Adts13ConvertSampleEtaaPN7mozilla12MediaRawDataE+1B5> addw $7, 8(%rax) jmp 117 <__ZN11mp4_demuxer4Adts13ConvertSampleEtaaPN7mozilla12MediaRawDataE+176> xorl %r14d, %r14d jmp 112 <__ZN11mp4_demuxer4Adts13ConvertSampleEtaaPN7mozilla12MediaRawDataE+176> movq (%r15), %rax movl (%rax), %esi incq %rsi movl $2, %edx movq %r15, %rdi callq 126149 <__ZN13nsTArray_baseI27nsTArrayInfallibleAllocator25nsTArray_CopyWithMemutilsE14EnsureCapacityIS0_EENT_15ResultTypeProxyEmm> movq (%r15), %rax movl (%rax), %ecx movw $7, 8(%rax,%rcx,2) movq (%r15), %rax cmpq 75996351(%rip), %rax je 97 <__ZN11mp4_demuxer4Adts13ConvertSampleEtaaPN7mozilla12MediaRawDataE+194> incl (%rax) movq 
-56(%rbp), %rax movq (%rax), %rbx leaq 32(%rbx), %rdi movl 104(%r12), %r15d addl $-7, %r15d movq 32(%rbx), %rax movl (%rax), %esi incq %rsi movl $4, %edx callq 126084 <__ZN13nsTArray_baseI27nsTArrayInfallibleAllocator25nsTArray_CopyWithMemutilsE14EnsureCapacityIS0_EENT_15ResultTypeProxyEmm> movq 32(%rbx), %rax movl (%rax), %ecx movl %r15d, 8(%rax,%rcx,4) movq 32(%rbx), %rax cmpq 75996286(%rip), %rax je 32 <__ZN11mp4_demuxer4Adts13ConvertSampleEtaaPN7mozilla12MediaRawDataE+194> incl (%rax) movq -56(%rbp), %rdi testq %rdi, %rdi je 5 <__ZN11mp4_demuxer4Adts13ConvertSampleEtaaPN7mozilla12MediaRawDataE+184> callq 61442842 movb %r14b, %al addq $32, %rsp popq %rbx popq %r12 popq %r14 popq %r15 popq %rbp retq leaq 64693937(%rip), %rax movq 75999454(%rip), %rcx movq %rax, (%rcx) movl $451, 0 callq 61438023 xorl %edi, %edi xorl %esi, %esi callq 549516 <__Z23InvalidArrayIndex_CRASHmm> nop bb: __ZN11mp4_demuxer4Adts13ConvertSampleEtaaPN7mozilla12MediaRawDataE pushq %rbp movq %rsp, %rbp pushq %r15 pushq %r14 pushq %rbx pushq %rax movq %rcx, %r15 movq 104(%r15), %rax addq $7, %rax cmpq $8191, %rax jbe 7 <__ZN11mp4_demuxer4Adts13ConvertSampleEtaaPN7mozilla12MediaRawDataE+24> xorl %ebx, %ebx jmp 318 <__ZN11mp4_demuxer4Adts13ConvertSampleEtaaPN7mozilla12MediaRawDataE+162> movzwl %di, %edi xorl %ebx, %ebx cmpl $15, %edi ja 304 <__ZN11mp4_demuxer4Adts13ConvertSampleEtaaPN7mozilla12MediaRawDataE+162> testb %sil, %sil js 295 <__ZN11mp4_demuxer4Adts13ConvertSampleEtaaPN7mozilla12MediaRawDataE+162> movb %dl, %cl decb %cl movzbl %cl, %ecx cmpl $3, %ecx ja 279 <__ZN11mp4_demuxer4Adts13ConvertSampleEtaaPN7mozilla12MediaRawDataE+162> movw $61951, -32(%rbp) shlb $6, %dl movzbl %dl, %ecx shlb $2, %sil movzbl %sil, %edx movl %edi, %esi shrl $2, %esi addl %edx, %esi leal 192(%rcx,%rsi), %ecx movb %cl, -30(%rbp) shll $6, %edi movl %eax, %ecx shrl $11, %ecx addl %edi, %ecx movb %cl, -29(%rbp) movq %rax, %rcx shrq $3, %rcx movb %cl, -28(%rbp) shlq $5, %rax orl $31, %eax movb %al, -27(%rbp) movb $-4, -26(%rbp) movq (%r15), %rax movq %r15, %rdi callq *24(%rax) movq %rax, %r14 leaq -32(%rbp), %rsi movl $7, %edx movq %r14, %rdi callq 25502366 <__ZN7mozilla18MediaRawDataWriter7PrependEPKhm> testb %al, %al je 46 <__ZN11mp4_demuxer4Adts13ConvertSampleEtaaPN7mozilla12MediaRawDataE+E4> movq 64(%r15), %rax movb $1, %bl cmpb $0, (%rax) je 37 <__ZN11mp4_demuxer4Adts13ConvertSampleEtaaPN7mozilla12MediaRawDataE+E6> movq 24(%rax), %rax movq (%r14), %rbx addq $24, %rbx cmpl $0, (%rax) je 28 <__ZN11mp4_demuxer4Adts13ConvertSampleEtaaPN7mozilla12MediaRawDataE+ED> movq (%rbx), %rax cmpl $0, (%rax) je 179 <__ZN11mp4_demuxer4Adts13ConvertSampleEtaaPN7mozilla12MediaRawDataE+190> addw $7, 8(%rax) jmp 116 <__ZN11mp4_demuxer4Adts13ConvertSampleEtaaPN7mozilla12MediaRawDataE+158> xorl %ebx, %ebx testq %r14, %r14 jne 111 <__ZN11mp4_demuxer4Adts13ConvertSampleEtaaPN7mozilla12MediaRawDataE+15A> jmp 117 <__ZN11mp4_demuxer4Adts13ConvertSampleEtaaPN7mozilla12MediaRawDataE+162> movq (%rbx), %rax movl (%rax), %esi incq %rsi movl $2, %edx movq %rbx, %rdi callq 125662 <__ZN13nsTArray_baseI27nsTArrayInfallibleAllocator25nsTArray_CopyWithMemutilsE14EnsureCapacityIS0_EENT_15ResultTypeProxyEmm> movq (%rbx), %rax movl (%rax), %ecx movw $7, 8(%rax,%rcx,2) movq (%rbx), %rax cmpq 73332912(%rip), %rax je 85 <__ZN11mp4_demuxer4Adts13ConvertSampleEtaaPN7mozilla12MediaRawDataE+16F> incl (%rax) movq (%r14), %rbx leaq 32(%rbx), %rdi movl 104(%r15), %r15d addl $-7, %r15d movq 32(%rbx), %rax movl (%rax), %esi incq %rsi movl $4, %edx callq 125602 
<__ZN13nsTArray_baseI27nsTArrayInfallibleAllocator25nsTArray_CopyWithMemutilsE14EnsureCapacityIS0_EENT_15ResultTypeProxyEmm> movq 32(%rbx), %rax movl (%rax), %ecx movl %r15d, 8(%rax,%rcx,4) movq 32(%rbx), %rax cmpq 73332852(%rip), %rax je 25 <__ZN11mp4_demuxer4Adts13ConvertSampleEtaaPN7mozilla12MediaRawDataE+16F> incl (%rax) movb $1, %bl movq %r14, %rdi callq 59100812 movb %bl, %al addq $8, %rsp popq %rbx popq %r14 popq %r15 popq %rbp retq leaq 62152870(%rip), %rax movq 73336107(%rip), %rcx movq %rax, (%rcx) movl $451, 0 callq 59096014 xorl %edi, %edi xorl %esi, %esi callq 543777 <__Z23InvalidArrayIndex_CRASHmm> nopl (%rax) If you want to look by yourself, download the dmg image for both BB and taskcluster to a local directory and run this script [2] there. My patches can be found here [3] (not all patches are fixes, some are just for tests). NI people that I think might have some ideas. [1] https://www.gnu.org/software/autoconf/manual/autoconf-2.60/html_node/Particular-Functions.html [2] https://gist.github.com/walac/11b9579378ee83f1eedd40b8d402768c [3] https://github.com/walac/gecko-dev/commits/mac-talos
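For anyone who wants to repeat the comparison without the gist, a rough sketch of the same idea; the mount points and .app layout below are assumptions:

hdiutil attach firefox-bb.dmg -mountpoint /Volumes/bb
hdiutil attach firefox-tc.dmg -mountpoint /Volumes/tc
objdump -disassemble /Volumes/bb/*.app/Contents/MacOS/XUL > XUL_bb.s
objdump -disassemble /Volumes/tc/*.app/Contents/MacOS/XUL > XUL_bbb.s
diff -u XUL_bb.s XUL_bbb.s | less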
Flags: needinfo?(nfroyd)
Flags: needinfo?(jmathies)
Comment 53•8 years ago
|
||
Total shot in the dark but considering you pasted a piece of the mp4_demuxer code, I wonder if this is related to rust using a different linker in cross builds vs BB builds? I don't know why it would only affect e10s though. There are details on the linker difference in bug 1342503 comment 8 and in bug 1329737.
Comment 54•8 years ago
|
||
There are some minor register allocation differences, but those shouldn't be significant enough to affect talos results. I do see obvious cases where the cross code is much worse:

bb:
        callq *24(%rax)
        movq %rax, %r14
        leaq -32(%rbp), %rsi
        movl $7, %edx
        movq %r14, %rdi

cross:
        callq *24(%rax)
        movq %rax, -64(%rbp)
        movq %rax, -40(%rbp)
        movq %rax, -56(%rbp)
        movq -56(%rbp), %rdi
        leaq -48(%rbp), %rsi
        movl $7, %edx

What are all those movq's doing in there?! That extra memory traffic can't be helpful for performance.

My next suggestion would be to try and compare the preprocessed source files wherever possible; the file containing the above code:

http://dxr.mozilla.org/mozilla-central/source/media/libstagefright/binding/Adts.cpp

might be a decent place to start. Then you can carefully determine if differences in the preprocessed code (ruling out things like #line directives that are going to obviously differ due to path differences) are affecting the code(s) in question. That would confirm or rule out hypotheses like comment 50.

OTOH, it's possible that clang/LLVM just generate different code in native vs. cross configurations due to Reasons. Obviously you don't want your compiler to do that, but it's not easy to test for...
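A minimal sketch of that comparison for the file above, assuming the relevant compile flags have been captured from the build logs (the variable names and the line-marker cleanup are assumptions):

# preprocess the same source with each build's flags
clang++ -E $NATIVE_CXXFLAGS media/libstagefright/binding/Adts.cpp > adts.native.i
clang++ -E $CROSS_CXXFLAGS media/libstagefright/binding/Adts.cpp > adts.cross.i

# drop the '# <line> "<path>"' markers so path differences don't drown the diff
sed '/^# [0-9][0-9]* "/d' adts.native.i > adts.native.clean.i
sed '/^# [0-9][0-9]* "/d' adts.cross.i > adts.cross.clean.i

diff -u adts.native.clean.i adts.cross.clean.i | less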
Comment 55•8 years ago
|
||
Oh, and possibly easier than staring at preprocessed code...given that all of the problems are e10s related, maybe there's something about the binary layout of our libraries/executables that the OS X loader doesn't like, or something? Should check whether Mach-O headers, etc. etc. look reasonable when comparing the two builds.
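A sketch of that kind of header comparison, assuming the two XUL binaries are available locally under the names used earlier:

otool -hv XUL_bb > bb-macho.txt && otool -l XUL_bb >> bb-macho.txt
otool -hv XUL_bbb > tc-macho.txt && otool -l XUL_bbb >> tc-macho.txt
diff -u bb-macho.txt tc-macho.txt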
Flags: needinfo?(nfroyd)
Comment 56•8 years ago
|
||
Or given that e10s is a problem, perhaps trawling through preprocessed sources for code under ipc/ would yield some interesting results.
Assignee | ||
Comment 57•8 years ago
|
||
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #53)
> Total shot in the dark but considering you pasted a piece of the mp4_demuxer
> code, I wonder if this is related to rust using a different linker in cross
> builds vs BB builds? I don't know why it would only affect e10s though.
> There are details on the linker difference in bug 1342503 comment 8 and in
> bug 1329737.

Is there an easy way to verify that in a loaner?
Flags: needinfo?(bugmail)
Assignee | ||
Comment 58•8 years ago
|
||
(In reply to Nathan Froyd [:froydnj] from comment #56)
> Or given that e10s is a problem, perhaps trawling through preprocessed
> sources for code under ipc/ would yield some interesting results.

I looked at the preprocessed files under libstagefright and ipc; they are equal. As for Mach-O, I have taken a look: there are some different values in some fields, but nothing I can tell would lead to a regression. However, I am not a Mach-O expert.
Comment 59•8 years ago
|
||
(In reply to Wander Lairson Costa [:wcosta] from comment #57)
> Is there an easy way to verify that in a loaner?

On the buildbot loaner you can force rust/cargo to use a non-system-linker using the workaround in bug 1342503 comment 12. However you might need to do something similar on the TC loaner to make *it* also use a known linker, I'm not sure. I think that on both builds it will try to use the system linker, and the buildbot slave just has an older system linker.
Flags: needinfo?(bugmail)
Comment 60•8 years ago
|
||
That shouldn't matter--if it was using the wrong linker the build would just fail because the system linker is an ELF linker and it needs a Mach-O linker. The code in libxul gets linked by the Mozilla build system, not cargo.
Comment 61•8 years ago
|
||
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #46)
> (In reply to Robert Wood [:rwood] from comment #41)
> > Narrowing down the Talos DAMP test to a single test case (inspectorOpen,
> > openToolboxAndLog only) does reproduce the regression.
>
> Thanks, that's useful for digging in!
>
> (In reply to Wander Lairson Costa [:wcosta] from comment #43)
> > Investigating the logs, the tc build has a much higher occurrence rate of
> > the following log message:
> >
> > 1251 INFO - PID 1965 | ###!!! [Child][MessageChannel] Error:
> > (msgtype=0x8E0025,name=PNecko::Msg_RemoveRequestContext)
> > |Channel closing: too late to send/recv, messages will be lost
> >
> > And this is causing delays in g2 test.
>
> This is also useful info! It sounds like maybe something is not working
> properly at the IPC layer in the taskcluster builds.
>
> Looking at the DAMP tests (which I only just now learned are devtools
> tests), the code that gets run during `openToolbox` does seem to do a lot of
> IPC messaging in the e10s case, so if IPC is somehow slower or broken I
> could absolutely believe that it would show up as a performance regression
> in these tests. I'll try to dig around in this area and see what I can find.

I hate to say it but differences in performance can trigger these. For example you tend to experience more IPC actor tear down errors in debug builds due to the changes in execution timing of communicating processes.
Flags: needinfo?(jmathies)
Comment 62•8 years ago
|
||
Given that all evidence here points to IPC, let's move this to a better component. Has anyone tried running make *.i on some of the source files in the IPC code in a cross build vs a normal build to make sure that we're actually compiling the same code in both configurations? I would probably start my investigations there...
Component: Untriaged → IPC
Product: Firefox → Core
Assignee | ||
Comment 63•8 years ago
|
||
(In reply to :Ehsan Akhgari (super long backlog, slow to respond) from comment #62)
> Given that all evidence here points to IPC, let's move this to a better
> component.
>
> Has anyone tried running make *.i on some of the source files in the IPC
> code in a cross build vs a normal build to make sure that we're actually
> compiling the same code in both configurations? I would probably start my
> investigations there...

If you mean comparing the generated source code: yes, I did, and they are pretty much the same.
Assignee | ||
Comment 64•8 years ago
|
||
As requested by :catlee, here are the steps I followed to compare pre-compiled source code:

1. Create a special build that uploads the obj-dir as an artifact, both for TC and BB builds.
2. Download and unpack both obj-dirs.
3. Remove all object/binary files.
4. Strip file paths specific to the build machine.
5. Perform a recursive diff between both trees.

No practical differences found, apart from some newlines.
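A rough sketch of steps 3-5, assuming the unpacked trees are named bb-objdir and tc-objdir and that the machine-specific prefixes are /builds/slave and /home/worker (all of these names are assumptions):

# 3. drop object/binary files from both trees
find bb-objdir tc-objdir \( -name '*.o' -o -name '*.a' -o -name '*.dylib' \) -delete

# 4. normalize machine-specific build paths in the remaining text files
grep -rl '/builds/slave' bb-objdir | xargs sed -i.bak 's|/builds/slave[^ "]*|SRCDIR|g'
grep -rl '/home/worker' tc-objdir | xargs sed -i.bak 's|/home/worker[^ "]*|SRCDIR|g'

# 5. recursive unified diff of what is left
diff -u -r bb-objdir tc-objdir > objdir.diff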
Comment 65•8 years ago
|
||
(In reply to Wander Lairson Costa [:wcosta] from comment #64)
> No practical differences found, apart from some newlines.

There should be *no* differences at all in the source that we compiled (sans whitespace that is). Does diff -w show anything?
Assignee | ||
Comment 66•8 years ago
|
||
(In reply to :Ehsan Akhgari (super long backlog, slow to respond) from comment #65)
> There should be *no* differences at all in the source that we compiled (sans
> whitespace that is). Does diff -w show anything?

I didn't try that, only a manual check.
Assignee | ||
Comment 67•8 years ago
|
||
This is the diff of obj-dir/ipc directory for tc and bb builds.
Assignee | ||
Comment 68•8 years ago
|
||
(In reply to :Ehsan Akhgari (super long backlog, slow to respond) from comment #65)
> There should be *no* differences at all in the source that we compiled (sans
> whitespace that is). Does diff -w show anything?

I attached the diff file for the ipc subdir. The object directories themselves are quite big, but I can find a way to upload them if you want.
Flags: needinfo?(ehsan)
Assignee | ||
Comment 70•8 years ago
|
||
I assume that by "unified diff" you mean the whole obj-dir tree. Please find it attached.
Comment 71•8 years ago
|
||
(In reply to Wander Lairson Costa [:wcosta] from comment #70)
> Created attachment 8862822 [details]
> obj.diff.gz
>
> I assume that by "unified diff" you mean the whole obj-dir tree. Please find
> it attached.

By "unified diff", he means including the -u option when you run |diff|, which results in much more readable diffs.
Updated•8 years ago
|
Flags: needinfo?(wcosta)
Assignee | ||
Updated•8 years ago
|
Attachment #8862822 -
Attachment is obsolete: true
Flags: needinfo?(wcosta)
Assignee | ||
Updated•8 years ago
|
Attachment #8862474 -
Attachment is obsolete: true
Assignee | ||
Comment 72•8 years ago
|
||
Assignee | ||
Comment 73•8 years ago
|
||
(In reply to Nathan Froyd [:froydnj] from comment #71)
> (In reply to Wander Lairson Costa [:wcosta] from comment #70)
> > I assume that by "unified diff" you mean the whole obj-dir tree. Please find
> > it attached.
>
> By "unified diff", he means including the -u option when you run |diff|,
> which results in much more readable diffs.

TIL unified diff
Comment 74•8 years ago
|
||
This diff isn't actually comparing the preprocessed sources, which is what comment 62 is asking for. It's just comparing any .cpp files that might be generated as part of the build process and the dependency files generated by the compiler. We already know the former shouldn't differ, and the latter do indeed have trivial differences. Please try building with -save-temps in the ipc/ directory and all of its subdirectories (this will generate a lot of files), and diffing the .i files generated. Don't bother with diffing the .s files; we already know they differ.
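A sketch of what that could look like, assuming the flag is injected through the mozconfig and both objdirs end up available locally (paths and names are assumptions):

# in the mozconfig used for the try pushes
export CFLAGS="$CFLAGS -save-temps=obj"
export CXXFLAGS="$CXXFLAGS -save-temps=obj"

# afterwards, diff only the preprocessed files under ipc/
(cd bb-objdir && find ipc -name '*.i') | while read f; do
  diff -u "bb-objdir/$f" "tc-objdir/$f"
done > ipc-i.diff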
Flags: needinfo?(wcosta)
Comment 75•8 years ago
|
||
Note that you may have to modify parts of paths in the resulting .i files in e.g. #line directives so you don't get *massive* amounts of differences, depending on how the build machines are set up. That's OK, so long as you're systematic about it and you explain how you got to the results you did. If different #line directives are including different files entirely (e.g. libstdc++ vs. libc++ or something like that), those sorts of differences should *not* be modified away.
Comment 76•7 years ago
|
||
Wander: :mshal is going to try to help you out with this next week. I know it's been a while since you had these builds setup for diffing, but if you can try to resurrect that env for next week, it will help you two make quicker progress.
Assignee | ||
Comment 77•7 years ago
|
||
(In reply to Nathan Froyd [:froydnj] from comment #74)
> This diff isn't actually comparing the preprocessed sources, which is what
> comment 62 is asking for. It's just comparing any .cpp files that might be
> generated as part of the build process and the dependency files generated by
> the compiler. We already know the former shouldn't differ, and the latter
> do indeed have trivial differences.
>
> Please try building with -save-temps in the ipc/ directory and all of its
> subdirectories (this will generate a lot of files), and diffing the .i files
> generated. Don't bother with diffing the .s files; we already know they
> differ.

Sorry for the extremely long delay. When I try to build to generate intermediate files, I get errors both on BB and TC. I am going to look into it, but if anybody has an idea what's going on, it would be much appreciated.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=b30bea36b764&selectedJob=102903839
Flags: needinfo?(wcosta)
Comment 78•7 years ago
|
||
(In reply to Wander Lairson Costa [:wcosta] from comment #77) > Sorry for the extremely long delay. When I try to build to generate > intermediate files, I get errors both on BB and TC. I am going to look into, > but if anybody has an idea what's going on, would be much appreciated. > > https://treeherder.mozilla.org/#/ > jobs?repo=try&revision=b30bea36b764&selectedJob=102903839 Can't say I understand why, but building with '-save-temps' seems to flag additional warnings that aren't present without the flag. I can reproduce an example locally: $ clang ... Unified_cpp_test_testplugin0.cpp (no warnings) $ clang ... -save-temps Unified_cpp_test_testplugin0.cpp In file included from /Users/marf/mozilla-central-git/obj-x86_64-apple-darwin15.6.0/dom/plugins/test/testplugin/Unified_cpp_test_testplugin0.cpp:2: /Users/marf/mozilla-central-git/dom/plugins/test/testplugin/nptest.cpp:2305:31: warning: equality comparison with extraneous parentheses [-Wparentheses-equality] if (((objectVariant).type == NPVariantType_Object)) { ~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~ /Users/marf/mozilla-central-git/dom/plugins/test/testplugin/nptest.cpp:2305:31: note: remove extraneous parentheses around the comparison to silence this warning if (((objectVariant).type == NPVariantType_Object)) { ~ ^ ~ /Users/marf/mozilla-central-git/dom/plugins/test/testplugin/nptest.cpp:2305:31: note: use '=' to turn this equality comparison into an assignment if (((objectVariant).type == NPVariantType_Object)) { ^~ = /Users/marf/mozilla-central-git/dom/plugins/test/testplugin/nptest.cpp:2963:26: warning: equality comparison with extraneous parentheses [-Wparentheses-equality] if (((result).type == NPVariantType_String)) { ~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~ /Users/marf/mozilla-central-git/dom/plugins/test/testplugin/nptest.cpp:2963:26: note: remove extraneous parentheses around the comparison to silence this warning if (((result).type == NPVariantType_String)) { ~ ^ ~ /Users/marf/mozilla-central-git/dom/plugins/test/testplugin/nptest.cpp:2963:26: note: use '=' to turn this equality comparison into an assignment if (((result).type == NPVariantType_String)) { ^~ = In file included from /Users/marf/mozilla-central-git/obj-x86_64-apple-darwin15.6.0/dom/plugins/test/testplugin/Unified_cpp_test_testplugin0.cpp:20: /Users/marf/mozilla-central-git/dom/plugins/test/testplugin/nptest_utils.cpp:50:23: warning: equality comparison with extraneous parentheses [-Wparentheses-equality] if (((variant).type == NPVariantType_String)) ~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~ /Users/marf/mozilla-central-git/dom/plugins/test/testplugin/nptest_utils.cpp:50:23: note: remove extraneous parentheses around the comparison to silence this warning if (((variant).type == NPVariantType_String)) ~ ^ ~ /Users/marf/mozilla-central-git/dom/plugins/test/testplugin/nptest_utils.cpp:50:23: note: use '=' to turn this equality comparison into an assignment if (((variant).type == NPVariantType_String)) ^~ = /Users/marf/mozilla-central-git/dom/plugins/test/testplugin/nptest_utils.cpp:52:28: warning: equality comparison with extraneous parentheses [-Wparentheses-equality] else if (((variant).type == NPVariantType_Int32)) ~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~ /Users/marf/mozilla-central-git/dom/plugins/test/testplugin/nptest_utils.cpp:52:28: note: remove extraneous parentheses around the comparison to silence this warning else if (((variant).type == NPVariantType_Int32)) ~ ^ ~ /Users/marf/mozilla-central-git/dom/plugins/test/testplugin/nptest_utils.cpp:52:28: note: use '=' 
to turn this equality comparison into an assignment else if (((variant).type == NPVariantType_Int32)) ^~ = /Users/marf/mozilla-central-git/dom/plugins/test/testplugin/nptest_utils.cpp:54:28: warning: equality comparison with extraneous parentheses [-Wparentheses-equality] else if (((variant).type == NPVariantType_Double)) ~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~ /Users/marf/mozilla-central-git/dom/plugins/test/testplugin/nptest_utils.cpp:54:28: note: remove extraneous parentheses around the comparison to silence this warning else if (((variant).type == NPVariantType_Double)) ~ ^ ~ /Users/marf/mozilla-central-git/dom/plugins/test/testplugin/nptest_utils.cpp:54:28: note: use '=' to turn this equality comparison into an assignment else if (((variant).type == NPVariantType_Double)) ^~ = 5 warnings generated. Since we build with -Werror in automation, these warnings become errors and the build fails. I'd guess the best thing to try is to disable Werror in your try push and see if that helps. Though due to the difference in behavior with and without -save-temps, I don't have the best confidence in the output.
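For the -Werror part, a one-line sketch of what turning it off in the try push's mozconfig could look like (assuming the option is still spelled this way):

ac_add_options --disable-warnings-as-errors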
Comment 79•7 years ago
|
||
You could also try to use our built-in .i rules instead of -save-temps with a patch:

diff --git a/config/rules.mk b/config/rules.mk
index 2551b07..c32580e 100644
--- a/config/rules.mk
+++ b/config/rules.mk
@@ -1088,7 +1088,6 @@ $(filter %.s,$(CSRCS:%.c=%.s)): %.s: %.c
 	$(call mkdir_deps,$(MDDEPDIR))
 	$(REPORT_BUILD_VERBOSE)
 	$(CC) -S $(COMPILE_CFLAGS) $($(notdir $<)_FLAGS) $(_VPATH_SRCS)
-ifneq (,$(filter %.i,$(MAKECMDGOALS)))
 # Call as $(call _group_srcs,extension,$(SRCS)) - this will create a list
 # of the full sources, as well as the $(notdir) version. So:
 #   foo.cpp sub/bar.cpp
@@ -1115,6 +1114,7 @@ $$(_PREPROCESSED_$1_FILES): %.i: %.$1
 	$$(addprefix $$(MKDIR) -p ,$$(filter-out .,$$(@D)))
 	$$($3) -C $$(PREPROCESS_OPTION)$$@ $(foreach var,$4,$$($(var))) $$($$(notdir $$<)_FLAGS) $$(_VPATH_SRCS)
 
+target:: $$(_PREPROCESSED_$1_FILES)
 endef
 
 $(eval $(call PREPROCESS_RULES,cpp,CPPSRCS,CCC,COMPILE_CXXFLAGS))
@@ -1173,8 +1173,6 @@ endif
 endif
 
-endif
-
 $(RESFILE): %.res: %.rc
 	$(REPORT_BUILD)
 	@echo Creating Resource file: $@

Then you should get all the .i files and can skip -save-temps (which seems to generate a lot of extra data - some of the .s files are huge). You would still need to turn off -Werror, though.
Assignee | ||
Comment 80•7 years ago
|
||
Disabling -Werror and applying :mshal's patch (yay, thanks) improved things a lot, but I have now reached another level of weird issues [1], especially on BB.

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=c53a72985b41&selectedJob=103327939
Comment 81•7 years ago
|
||
(In reply to Wander Lairson Costa [:wcosta] from comment #80)
> Disabling -Werror and applying :mshal's patch (yay, thanks), it just
> improved a lot, but I now reached another level of weird issues [1],
> specially on BB.
>
> [1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=c53a72985b41&selectedJob=103327939

The TC one looks like a clang plugin issue; you may want to disable the static analysis on mac jobs for the purposes of this comparison:

http://dxr.mozilla.org/mozilla-central/source/build/macosx/cross-mozconfig.common#48
http://dxr.mozilla.org/mozilla-central/source/browser/config/mozconfigs/macosx64/opt-static-analysis#11
http://dxr.mozilla.org/mozilla-central/source/browser/config/mozconfigs/macosx64/debug-static-analysis#10

The BB one is a weird error; maybe we're using our own ranlib/ar for C++ compilation, but we're not passing that through to Rust builds, and we should be?
Comment 82•7 years ago
|
||
Could just disable geckodriver for the purposes of this comparison, since we're not running any tests with these builds.
Assignee | ||
Comment 83•7 years ago
|
||
Latest try push:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=29e31e6f30ed&selectedJob=103483632

At least we now have the same error for both BB and TC. I suspect the problem arises from file unification.
Comment 84•7 years ago
|
||
(In reply to Michael Shal [:mshal] from comment #78)
> Can't say I understand why, but building with '-save-temps' seems to flag
> additional warnings that aren't present without the flag. I can reproduce an
> example locally:

The temp files have been processed by the C preprocessor so C macros have been expanded and clang is reporting warnings in code that it would have ignored in the original source files.

If these new warnings only affect Unified_cpp_test_testplugin0.cpp, we can just add `if CONFIG['CLANG_CXX']: CXXFLAGS += ['-Wno-error=parentheses-equality']` to the bottom of dom/plugins/test/testplugin/testplugin.mozbuild.
Blocks: stylo-nightly-build
Comment 85•7 years ago
|
||
(In reply to Wander Lairson Costa [:wcosta] from comment #83)
> Latest try push:
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=29e31e6f30ed&selectedJob=103483632
>
> At least we now have the same error for both BB and TC. I suspect the
> problem arises from file unification.

Building without unification is going to pose its own set of problems, unfortunately. If you try the same push with sccache disabled does it change anything?
Comment 86•7 years ago
|
||
(In reply to Chris Peterson [:cpeterson] from comment #84)
> (In reply to Michael Shal [:mshal] from comment #78)
> > Can't say I understand why, but building with '-save-temps' seems to flag
> > additional warnings that aren't present without the flag. I can reproduce an
> > example locally:
>
> The temp files have been processed by the C preprocessor so C macros have
> been expanded and clang is reporting warnings in code that it would have
> ignored in the original source files.

Hmm, I don't follow - all I'd expect -save-temps to do is to save the output of the C preprocessor and compiler that it has to run anyway instead of discarding them as intermediate files. Why would saving the output of the preprocessor change the behavior? The compiler should see the same preprocessed code regardless of whether it is also copied to a .i file.

> If these new warnings only affect Unified_cpp_test_testplugin0.cpp, we can
> just add `if CONFIG['CLANG_CXX']: CXXFLAGS +=
> ['-Wno-error=parentheses-equality']` to the bottom of
> dom/plugins/test/testplugin/testplugin.mozbuild.

Good idea, but unfortunately this isn't the only file. Using a separate rule for the .i files seems to work around the issue, though I'm not certain yet if that's generating .i's for each corresponding .o in all cases.
Comment 87•7 years ago
|
||
(In reply to Michael Shal [:mshal] from comment #86)
> (In reply to Chris Peterson [:cpeterson] from comment #84)
> > The temp files have been processed by the C preprocessor so C macros have
> > been expanded and clang is reporting warnings in code that it would have
> > ignored in the original source files.
>
> Hmm, I don't follow - all I'd expect -save-temps to do is to save the output
> of the C preprocessor and compiler that it has to run anyway instead of
> discarding them as intermediate files. Why would saving the output of the
> preprocessor change the behavior? The compiler should see the same
> preprocessed code regardless of whether it is also copied to a .i file.

Sadly it doesn't work that way in clang: http://lists.llvm.org/pipermail/cfe-dev/2016-November/051421.html

"-svae-temps will run the parser on the preprocessed output, so all macro-expansion knowledge is missing. That changes the impact of a number of warnings like redundant braces."

We've hit an almost identical issue with -Werror with clang+sccache. sccache runs the preprocessor to hash the output as part of the cache key, and then attempts to compile the already-preprocessed output. Unfortunately that fails sometimes with -Werror for the same reason.
Assignee | ||
Comment 88•7 years ago
|
||
:mshal got a build with intermediate files [1]. After path and line number cleanup, there was no difference at all:

diff -u -r tc-objdir/browser/app/nsBrowserApp.i bb-objdir/browser/app/nsBrowserApp.i
--- tc-objdir/browser/app/nsBrowserApp.i	2017-06-02 12:56:28.000000000 -0300
+++ bb-objdir/browser/app/nsBrowserApp.i	2017-06-02 12:56:06.000000000 -0300
@@ -73505,13 +73505,13 @@
   "Firefox",
   "firefox",
   "55.0a1",
-  "20170602030417",
+  "20170601200700",
   "{ec8030f7-c20a-464f-9b0e-13a3a9e97384}",
   __null, // copyright
   (1 << 3) | (1 << 1),
   "55.0a1",
   "55.0a1",
-  "https://crash-reports.mozilla.com/submit?id={ec8030f7-c20a-464f-9b0e-13a3a9e97384}&version=55.0a1&buildid=20170602030417",
+  "https://crash-reports.mozilla.com/submit?id={ec8030f7-c20a-464f-9b0e-13a3a9e97384}&version=55.0a1&buildid=20170601200700",
   __null
 };
 # "/src/browser/app/nsBrowserApp.cpp" 2

Only in tc-objdir/memory/replace: dummy
Binary files tc-objdir/objdir.tar.bz2 and bb-objdir/objdir.tar.bz2 differ

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=5a550aef3e6b07f8c47420cdad8999455590d7d3&selectedJob=103938641
Assignee | ||
Comment 89•7 years ago
|
||
Hi there, I just ran a performance comparison against the latest m-c code (with my patches that make the builds identical) [1]. Cross builds are still behind native builds.

[1] https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=389d3866a6c26b7613ee0d04eb710209cab5e5f9&newProject=try&newRevision=1636956c80ce26b5757a2b8260f7f5523c1aa550&framework=1&showOnlyImportant=0
Comment 90•7 years ago

If there's no diff between the *.i files -- is there a diff between the *.s files?
Comment 91•7 years ago

(That would likely be the case only if the compilers being used weren't actually the same -- I don't know what the chance of that is -- although it certainly wouldn't surprise me if Apple patches clang in various ways...)
Comment 92•7 years ago
|
||
(In reply to David Baron :dbaron: ⌚️UTC-7 from comment #90)
> If there's no diff between the *.i files -- is there a diff between the *.s
> files?

We have diffed the disassembly of XUL.dylib, and there are no significant changes. There are differences in offsets (e.g. jumps) in some places, but those seem pretty trivial. The cross-compiled compiler seems to annotate jump table offsets in the .text section (so they disassemble as data, rather than code), but examining the actual bytes confirms that there are no differences in the stored offsets in the tables.

The native (buildbot) builds are codesigned, and the cross (taskcluster) builds are not; this does not appear to produce any performance differences.

The linker invocations are virtually identical besides the obvious differences in framework names and such due to the filesystem. The native build passes in libclang_rt.osx.a on the link line due to differences in how clang sets up linking for native Darwin vs. cross (who'd be crazy enough to cross-compile to *x86-64* Darwin, eh?), but fixing that difference does not appear to change performance.

mshal observed that testing talos performance locally on his machine produces *no* appreciable difference between the builds. This obviously does not match up with what we're seeing on infra, but we haven't figured out why that would be the case.
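(The diffing itself was roughly of this shape; a sketch with assumed paths, dropping the first output line because it contains the binary's path:)

$ otool -tvV native/XUL | tail -n +2 > native-xul.asm   # disassemble __TEXT,__text with symbolic operands
$ otool -tvV cross/XUL  | tail -n +2 > cross-xul.asm
$ diff -u native-xul.asm cross-xul.asm | less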
Comment 93•7 years ago
|
||
(In reply to David Baron :dbaron: ⌚️UTC-7 from comment #91)
> (That would likely be the case only if the compilers being used weren't
> actually the same -- I don't know what the chance of that is -- although it
> certainly wouldn't surprise me if Apple patches clang in various ways...)

Forgot to mention: we're not using Apple's clang; we're using home-built clang, and I *believe* that our native and cross clangs are both built from the same SVN revision. So the compilers should be identical...and extra kudos to the clang developers for trying to ensure that identical targets produce the same code regardless of host, that's a non-trivial thing to do!
Comment 94•7 years ago
|
||
(In reply to Nathan Froyd [:froydnj] from comment #92)
> (In reply to David Baron :dbaron: ⌚️UTC-7 from comment #90)
> > If there's no diff between the *.i files -- is there a diff between the *.s
> > files?
>
> We have diffed the disassembly of XUL.dylib, and there are no significant
> changes. There are differences in offsets (e.g. jumps) in some places, but
> those seem pretty trivial.

There were code differences between mov and ldr, are those gone now?

Has libmozglue.dylib been compared too?
Comment 95•7 years ago
|
||
(In reply to Mike Hommey [:glandium] from comment #94)
> There were code differences between mov and ldr, are those gone now?

s/ldr/lea/
Comment 96•7 years ago
|
||
Can we ship cross-compiled builds to the Nightly channel for a day or two and see if Telemetry corroborates mshal or Talos? In doing so, we may also get data that points us in the direction of the root cause, such as specific configurations that do or don't exhibit the slowdown.
Comment 97•7 years ago
|
||
(In reply to Gregory Szorc [:gps] from comment #96)
> Can we ship cross-compiled builds to the Nightly channel for a day or two
> and see if Telemetry corroborates mshal or Talos? In doing so, we may also
> get data that points us in the direction of the root cause, such as specific
> configurations that do or don't exhibit the slowdown.

That question is outside of my pay grade; mshal might know more here, but I will also ni? him to ensure that I didn't completely misrepresent the results he saw.
Flags: needinfo?(mshal)
Comment 98•7 years ago
|
||
(In reply to Mike Hommey [:glandium] from comment #95)
> (In reply to Mike Hommey [:glandium] from comment #94)
> > There were code differences between mov and ldr, are those gone now?
>
> s/ldr/lea/

Those differences were in libmozglue; the same phenomenon did not occur in XUL (assuming the assembly diffing was done correctly). Perhaps we should be verifying linker command lines etc. for libmozglue as well, since that is compiled very differently?
Comment 99•7 years ago
|
||
(In reply to Nathan Froyd [:froydnj] from comment #98)
> Those differences were in libmozglue; the same phenomenon did not occur in
> XUL (assuming the assembly diffing was done correctly). Perhaps we should
> be verifying linker command lines etc. for libmozglue as well, since that is
> compiled very differently?

Has it been checked again after bug 1369866?
Comment 100•7 years ago
|
||
(In reply to Mike Hommey [:glandium] from comment #99)
> (In reply to Nathan Froyd [:froydnj] from comment #98)
> > Those differences were in libmozglue; the same phenomenon did not occur in
> > XUL (assuming the assembly diffing was done correctly). Perhaps we should
> > be verifying linker command lines etc. for libmozglue as well, since that is
> > compiled very differently?
>
> Has it been checked again after bug 1369866?

No idea. mshal and ted have been doing builds with various bits of logging enabled, and mshal is already needinfo'd, so ni'ing ted too!
Flags: needinfo?(ted)
Comment 101•7 years ago
|
||
I did a try push atop mshal+wcosta's patch stack (that's been steadily removing differences they've found) that added -v to clang invocations: https://treeherder.mozilla.org/#/jobs?repo=try&revision=f17aec55c0fc41e9e42d7a3f69a3583286804a3f&selectedJob=104981194
Flags: needinfo?(ted)
Comment 102•7 years ago
|
||
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #101)
> I did a try push atop mshal+wcosta's patch stack (that's been steadily
> removing differences they've found) that added -v to clang invocations:
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=f17aec55c0fc41e9e42d7a3f69a3583286804a3f&selectedJob=104981194

Looks like the only significant difference in linking mozglue is the libclang_rt.osx.a issue discussed earlier.
Comment 103•7 years ago
|
||
Comment 102 is just command-line options, though...are we *super* sure that both builds are picking up the same libc++? I think we had diffed the symbols from nm and discovered basically no differences there, but...
Comment 104•7 years ago
|
||
(In reply to Nathan Froyd [:froydnj] from comment #97)
> That question is outside of my pay grade; mshal might know more here, but I
> will also ni? him to ensure that I didn't completely misrepresent the
> results he saw.

That's right - when I run the talos tests on my mac, the results are consistent. Here's just a11yr for simplicity:

--- a11yr-mshal-cross.log ---
a11yr: 477.958766492
--- a11yr-mshal-native.log ---
a11yr: 471.046078259
--- a11yr-try-cross.log ---
a11yr: 481.338055724
--- a11yr-try-native.log ---
a11yr: 476.459945964

mshal-cross/native are my local builds, while try-cross/native are from my try push. Notice how they're all around 470-480 - no significant slowdown for the cross build. (These numbers are all slower than what we'd expect in a try push, but I assume that's just because my local machine is not the same hardware.)

I'm also running the same tests on a t-yosemite-r7 machine. Here are the same exact builds:

--- a11yr-try-cross.log ---
a11yr: 573.53111317
--- a11yr-try-native.log ---
a11yr: 422.414690345
--- a11yr-mshal-cross.log ---
a11yr: 420.767708579
--- a11yr-mshal-native.log ---
a11yr: 433.032256866

Here, the "good" runs are ~420, while the try cross build is clearly worse at 570. These numbers are fairly consistent with what the actual try push shows - lower 400's for a native OSX build and upper 400's into 500's for a cross build.

One curious thing here is that a cross build built from my local machine falls into the "good" category. I've been trying to find out what's different between my local build and the try build, as well as the try cross & try native pushes. All I've discovered so far is that replacing XUL in the try-cross build with the XUL either from the native try push or from my local cross build is sufficient for improving the performance results.

If anyone else with a mac can run these locally to confirm, that would be helpful. Here's a try push with a native & cross build:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=bdd3f1125b0c7c0edc81f5d8a42a81ae7c8436fd

Alternatively, if someone has the cross-build tools handy and wants to generate a .dmg from their local machine, I can try to run it manually on the loaner and see how it compares.
Flags: needinfo?(mshal)
Comment 105•7 years ago
|
||
wow, clearly comparing the mshal-cross binaries with the try-cross binaries would be a step forward. Could you put your mshal-cross build somewhere we can download it? (I assume the try push you linked to has a build that corresponds to your try-cross results.)
Comment 106•7 years ago
|
||
(In reply to Nathan Froyd [:froydnj] from comment #103)
> Comment 102 is just command-line options, though...are we *super* sure that
> both builds are picking up the same libc++? I think we had diffed the
> symbols from nm and discovered basically no differences there, but...

Oh, duh, this is assured by zero differences between .i files. So that's good.
Comment 107•7 years ago
|
||
(In reply to Nathan Froyd [:froydnj] from comment #106)
> Oh, duh, this is assured by zero differences between .i files. So that's
> good.

Well, almost. We'd want some assurances that we're picking up the same binary files from the linker, or at least knowing what files the linker is picking so we can inspect those. The man page for OS X ld, http://www.manpages.info/macosx/ld.1.html , says that the environment variables RC_TRACE_{ARCHIVES,DYLIBS} should be set to provide that logging output. mshal, do you want to mix setting those in with all the other logging you have?
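(Concretely, per that man page, something like this in the build environment should do it; the '1' values are arbitrary since the variables only need to be set, and where exactly they get exported is up to you:)

$ export RC_TRACE_ARCHIVES=1   # ld reports each static archive member it loads
$ export RC_TRACE_DYLIBS=1     # ld reports each dylib it links against
$ ./mach build binaries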
Flags: needinfo?(mshal)
Comment 108•7 years ago
|
||
(In reply to Nathan Froyd [:froydnj] from comment #107)
> Well, almost. We'd want some assurances that we're picking up the same
> binary files from the linker, or at least knowing what files the linker is
> picking so we can inspect those. The man page for OS X ld,
> http://www.manpages.info/macosx/ld.1.html , says that the environment
> variables RC_TRACE_{ARCHIVES,DYLIBS} should be set to provide that logging
> output. mshal, do you want to mix setting those in with all the other
> logging you have?

Sure, here it is: https://treeherder.mozilla.org/#/jobs?repo=try&revision=1dc2a3670ee0e8ff791cb0a992da24a731190b40
Flags: needinfo?(mshal)
Comment 109•7 years ago
|
||
Here are the .dmgs for the runs mentioned in #c104: https://people-mozilla.org/~mshal/cross/

I'm not sure how relevant a direct comparison will be between my local cross build and the try build, however. For one thing, I realized my toolchain was slightly out of date wrt tooltool. I've updated that locally now with no speed difference (I was on the previous tooltool build of clang 3.9.0).

Another difference is I'm not using the nightly mozconfig, but just the equivalent of build/macosx/cross-mozconfig.common. If I switch my local mozconfig to pull in browser/config/mozconfigs/macosx64/nightly, I see the similar bad performance that the try builds get. I tried a push where I commented out the majority of the nightly and common-opt mozconfigs so that it is essentially what I build locally:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=80d680b94e2182c27ce33dcab1f1a14b8961b2f3

The talos tests don't actually run correctly through mozharness, but I downloaded the build onto the loaner so I could run it through mach, and the speed is improved. It's possible one of these mozconfig options will offer a clue, though I suspect that building without, say, '--enable-profiling' speeds up the result by roughly the same amount that the cross build degrades it, so they cancel out. Perhaps this is why my local cross build looked to be the same as native when I was building without the nightly mozconfig.

I'm out tomorrow, but I'll try to dig in more next week to see what specifically was causing the change.
Comment 110•7 years ago
|
||
None of the changes in #c109 alter the fact that I still don't see performance differences when I run the tests on my local mac, so I still think it would be worthwhile for someone else with an OSX machine to try the comparison, and/or to look into running a test on Nightly users as per #c96.
Comment 111•7 years ago
|
||
I've narrowed down the mozconfig flag from #c109 that was making my local cross builds appear fast to --disable-install-strip. Our macosx64 nightly mozconfig contains the flag (so it is used by both the native & cross builds on try), while my local cross build did not. On a try push where this flag is removed, the cross-build talos performance is now in-line with native: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=f18d021a32a5&newProject=try&newRevision=7ec5a7fda7079bdbb4927c8e9af76a03abbfc82f&framework=1&showOnlyImportant=0 Curiously, removing this flag doesn't change the performance of the native build. Anyone have any ideas why removing the flag would only affect cross builds? And does anyone know why we needed this flag on OSX in the first place? (blame will tell you that I added the flag, but that was when I was bringing the macosx/* mozconfigs in-line with the old macosx-universal/* mozconfigs. The --disable-install-strip flag was present in the first revision of macosx-universal/nightly from bug 558180.)
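(For context, the flag as it appears in the nightly mozconfig; this is an excerpt quoted from memory, so treat it as illustrative. The try push above simply drops this line:)

# browser/config/mozconfigs/macosx64/nightly (excerpt)
ac_add_options --disable-install-strip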
Comment 112•7 years ago
|
||
We use --disable-install-strip on Nightly so users can profile Nightly builds.
Comment 113•7 years ago
|
||
I want to formally summarize some points I just made in #build.

Currently, Talos on Nightly is measuring unstripped binaries. Talos on other release channels (Beta/Release/etc) is measuring stripped binaries. This means we have no continuous CI coverage for Talos on stripped binaries against the development tip of Firefox. Instead, we get one Talos signal for stripped binary performance every 6 weeks - after uplift.

I'm willing to wager that people looking at Talos changes to Beta, etc. after uplift are assuming that every Talos change was previously triaged on central/Nightly and that Talos changes on non-Nightly channels are mostly rubber stamped. If we're lucky, a significant regression will cause someone to dig deeper. But there's no guarantee of this. Since we're not actively testing stripped binaries on central, it is entirely possible we will regress the performance of stripped binaries and have no clue which commit caused it (any commit from that 6 week cycle is a candidate); we may even fail to spot the regression at all! I think this is problematic and I would lobby for running Talos against stripped binaries for central/inbound/autoland. But that's a discussion for another bug.

Now, the good news in all of this is that I think it may be acceptable to enable OS X cross builds today. We're still not sure why performance of unstripped cross builds is slower than native unstripped builds. But *this is not a configuration we ship outside of the Nightly channel*. So, the regression only occurs for Nightly users and is not something that will ride the trains. Furthermore, I would argue that the performance regressions don't invalidate the value of the Talos measurements: they just shift the values. If someone makes an operation faster or slower, it will (likely) impact the Talos data regardless of whether the binary is stripped or built natively vs cross.

Because this regression is blocking such important work, I think it is acceptable to take the Talos regression (knowing it won't impact release channels nor significantly invalidate ongoing Talos measurements on Nightly) and leave the cross build unstripped binary performance issue as a P1 bug for the build team to resolve with high urgency.
Reporter | ||
Comment 114•7 years ago
|
||
this is great information. We do look at beta vs aurora vs trunk on just about every merge day when results come in, and we try to match up all regressions on beta with where they were originally introduced on trunk.

Looking at this bug, there is a 34% sessionrestore regression, but mozilla-beta shows parity, if not an improvement, in sessionrestore over trunk for the last 90 days:
https://treeherder.mozilla.org/perf.html#/graphs?timerange=7776000&series=%5Bmozilla-inbound,e9484a55bdf595b7b86aea582bb9016cdd1ba956,1,1%5D&series=%5Bmozilla-beta,e9484a55bdf595b7b86aea582bb9016cdd1ba956,1,1%5D

I would expect this to be much worse if the stripped-vs-unstripped difference proves to be the cause of the large measured regression in performance. Ideally we would be shipping the faster builds to users, not the slower ones.

tart, which does not measure startup, has a 27% regression for cross-compiled builds, but is effectively identical when beta is compared to trunk:
https://treeherder.mozilla.org/perf.html#/graphs?timerange=7776000&series=%5Bmozilla-central,c849608509c2d340e4065b7cf9d7d88b8cd0fc8d,1,1%5D&series=%5Bmozilla-beta,c849608509c2d340e4065b7cf9d7d88b8cd0fc8d,1,1%5D
Reporter | ||
Comment 115•7 years ago
|
||
I hadn't seen comment 111 - so stripping has no effect for native builds, but does for cross-compiled builds. Looking at :mshal's try link, the numbers seem to confirm that we will see little to no change for what we ship, but will see a large regression on mozilla-beta.

The only thing this makes hard is that when we merge trunk->beta, matching up the regressions that were not fixed on trunk with the corresponding bugs on mozilla-beta will be difficult. It is probably ok to just ignore osx regressions there if the answer isn't obvious.
Comment 116•7 years ago
|
||
(In reply to Joel Maher ( :jmaher) from comment #115) > I didn't see comment 111, so the stripped builds do not have any effect for > native builds, but do for cross compiled builds, looking at :mshal's try > link, the numbers seem to confirm that we will see little to no change for > what we ship, but will see a large regression on mozilla-beta. Sorry, I'm not quite following that--why will we see a regression on mozilla-beta?
Reporter | ||
Comment 117•7 years ago
|
||
I assume beta will be stripped, therefore it will have much higher performance than nightly according to talos. So if a regression shows up on nightly and then merges to beta, we validate it and link it to the appropriate bug for tracking purposes; in this case it will not be as simple as seeing where the graphs started out the same and looking for changes.
Comment 118•7 years ago
|
||
(In reply to Joel Maher ( :jmaher) from comment #117) > I assume beta will be stripped, therefore it will have much higher > performance than nightly according to talos. so if there is a regression > that shows up on nightly, when that merges to beta, we validate it and link > it to the appropriate bug for tracking purposes, in this case it will not be > as simple as seeing where the graphs started out the same and looking for > changes. Oh, gotcha! Yes, this will make detecting regressions between nightly and beta more difficult. I'm hopeful that we'll be able to figure out the root cause here at some point.
Comment 119•7 years ago
|
||
Performance regressions in cross builds of Mac Nightly do not need to block enabling Stylo on Nightly.
Comment 120•7 years ago
|
||
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #118)
> (In reply to Joel Maher ( :jmaher) from comment #117)
> > I assume beta will be stripped, therefore it will have much higher
> > performance than nightly according to talos. So if a regression shows up
> > on nightly and then merges to beta, we validate it and link it to the
> > appropriate bug for tracking purposes; in this case it will not be as
> > simple as seeing where the graphs started out the same and looking for
> > changes.
>
> Oh, gotcha! Yes, this will make detecting regressions between nightly and
> beta more difficult. I'm hopeful that we'll be able to figure out the root
> cause here at some point.

As we discussed in the Quantum Dev mtg last week, we have two options here to move forward:

1) Set up a new, stripped nightly build that runs in parallel to the current unstripped nightly; or,
2) Change the current nightly to be stripped (like beta), and publish a script that downloads the symbols when required (or better still, integrate that functionality into mach)

Do others have a preference here? I lean towards #2 because it brings the nightly config closer to beta and avoids introducing a new build variant.

None of this precludes mshal from continuing to investigate the discrepancy. He plans to spend the rest of the month on that, barring a breakthrough.
Comment 121•7 years ago
|
||
I like the idea of having more convergence between builds (so #2). However, I'm worried about negative developer impacts of stripping Nightly. My understanding is things like the Gecko Profiler will break as well. If we do this, I'd like to know what all will need retrofitting to handle symbols. And that will likely require a post to dev-platform to see what unknown tools leveraging symbols are in use. I think it will be a non-trivial amount of work to add symbol fetching into every tool that needs it. Is that work worth doing? Probably. But depending on the scope bloat, it may not be realistic as a quick and easy stop-gap.
Comment 122•7 years ago
|
||
IPC has been exonerated. This is the realm of the build system.
Component: IPC → Build Config
Comment 123•7 years ago
|
||
(In reply to Chris Cooper [:coop] from comment #120) > 2) Change the current nightly to be stripped (like beta), and publish a > script that downloads the symbols when required (or better still, integrate > that functionality into mach) I think I mentioned offhand in a meeting, but I wrote such a script years ago: https://hg.mozilla.org/users/jwatt_jwatt.org/fetch-symbols/file/tip/fetch-symbols.py We can take that code and use it as needed. The Gecko profiler should actually already be OK, since it can symbolicate profiles using symbolapi.mozilla.org. The bigger issue is other local native profilers, which would need local symbols to do anything useful. The fetch-symbols script should suffice for that use case, but it does mean people will have to wind up downloading a few hundred MB of symbol files to do profiling.
Comment 124•7 years ago
|
||
(In reply to Gregory Szorc [:gps] from comment #121) > I like the idea of having more convergence between builds (so #2). However, > I'm worried about negative developer impacts of stripping Nightly. My > understanding is things like the Gecko Profiler will break as well. If we do > this, I'd like to know what all will need retrofitting to handle symbols. > And that will likely require a post to dev-platform to see what unknown > tools leveraging symbols are in use. I posted to dev.platform: https://groups.google.com/d/msg/mozilla.dev.platform/7pPlfH0a7Ro/evWkUDjHAQAJ Given Ted's comments, I'm less worried about fetching symbols.
Comment 125•7 years ago
|
||
As I understand it the situation is:
- Native build (fast)
- Native build - stripped (fast)
- Cross build (slow)
- Cross build - stripped (fast)

For those of us just catching up here, can someone post the performance results and links to the builds for these 4 configurations, as well as the clang svn revision and compiler binaries?

Also, from comment 92 it seems like there are some differences in the generated code for cross builds. Do we have any idea where these differences are coming from?
Comment 126•7 years ago
|
||
sewardj asked for CPU specs for the machine.
http://www.everymac.com/systems/apple/mac_mini/specs/mac-mini-core-i7-3.0-late-2014-specs.html

Model Identifier: Macmini7,1
Processor Name: Intel Core i7
Processor Speed: 3 GHz
Number of Processors: 1
Total Number of Cores: 2
L2 Cache (per Core): 256 KB
L3 Cache: 4 MB
Memory: 16 GB
APPLE SSD SM0256G
Comment 127•7 years ago
|
||
(In reply to Jeff Muizelaar [:jrmuizel] from comment #125)
> For those of us just catching up here can someone post the performance
> results and links to the builds for these 4 configurations as well as the
> clang svn revision and compiler binaries?

Native vs cross: https://treeherder.mozilla.org/#/jobs?repo=try&revision=f18d021a32a5
Native vs cross, stripped: https://treeherder.mozilla.org/#/jobs?repo=try&revision=7ec5a7fda7079bdbb4927c8e9af76a03abbfc82f
Perfherder comparison of the above: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=f18d021a32a5&newProject=try&newRevision=7ec5a7fda7079bdbb4927c8e9af76a03abbfc82f&framework=1&showOnlyImportant=0
Compiler info, native: https://hg.mozilla.org/try/file/f18d021a32a526f02f4188197fd848f9819a51b7/browser/config/tooltool-manifests/macosx64/releng.manifest
Compiler info, cross: https://hg.mozilla.org/try/file/f18d021a32a526f02f4188197fd848f9819a51b7/browser/config/tooltool-manifests/macosx64/cross-releng.manifest

> Also, from comment 92 it seems like there are some differences in the
> generated code for cross builds. Do we have any idea where these differences
> are coming from?

The tagging of the jump tables comes about because we explicitly strip that information in the linker on native builds:
http://dxr.mozilla.org/mozilla-central/source/build/macosx/local-mozconfig.common#34

The jump offsets come about because of small (immaterial) differences in the Mach-O object headers, I believe.
Comment 128•7 years ago
|
||
(In reply to Nathan Froyd [:froydnj] from comment #127)

Thanks. It seems like we could turn this on in both builds to further minimize the difference.

Are we using the same version of the linker in both places?

One guess I have as to the performance difference is differences in section offsets between the builds. They're different between the native and cross builds, and they change during the stripping process.
Comment 129•7 years ago
|
||
(In reply to Jeff Muizelaar [:jrmuizel] from comment #128)
> It seems like we could turn this on in both builds to further minimize the
> difference.

I've been building with -Wl,-no_data_in_code_info in the cross builds recently, though it doesn't change the performance. I believe wcosta is working on that (and other patches) in order to make sure everything is identical going forward.

With this flag in place I compared *.o with otool -t (for the text segment) and otool -d (for the data segment); a sketch of the comparison loop is at the end of this comment. Almost all files matched, except the following:

text section differences (otool -t):
- ./db/sqlite3/src/sqlite3.o
- ./gfx/skia/GrGLGpu.o
- ./gfx/skia/SkMatrix.o
- ./gfx/skia/Unified_cpp_gfx_skia21.o
- ./layout/forms/Unified_cpp_layout_forms0.o
- ./media/libjpeg/jchuff.o
- ./media/libvpx/vp9_firstpass.o
- ./security/nss/lib/certdb/certdb_certdb/crl.o - if AcquireDPCache is commented out, diff is exact (even if just the return statement is commented out!)

data section differences (otool -d):
- ./layout/forms/Unified_cpp_layout_forms0.o
- ./security/nss/lib/certdb/certdb_certdb/crl.o

I couldn't exactly figure out why these files were generated differently - the .i files are identical in all cases. For sqlite3.c I found I could make them produce the same output with just this diff:

$ cat sqlite3-diff.patch
diff --git a/db/sqlite3/src/sqlite3.c b/db/sqlite3/src/sqlite3.c
index 657a54e..443fdfb 100644
--- a/db/sqlite3/src/sqlite3.c
+++ b/db/sqlite3/src/sqlite3.c
@@ -123826,6 +123826,7 @@ SQLITE_PRIVATE void sqlite3ColumnDefault(Vdbe *v, Table *pTab, int i, int iReg){
 **     \_______/ \________/    \______/      \________________/
 *       onError   pTabList     pChanges          pWhere
 */
+extern int foo;
 SQLITE_PRIVATE void sqlite3Update(
   Parse *pParse,         /* The parser context */
   SrcList *pTabList,     /* The table in which we should change things */
@@ -124435,7 +124436,7 @@ SQLITE_PRIVATE void sqlite3Update(
   }else if( eOnePass==ONEPASS_MULTI ){
     sqlite3VdbeResolveLabel(v, labelContinue);
     sqlite3WhereEnd(pWInfo);
-  }else if( pPk ){
+  }else if( foo ){
     sqlite3VdbeResolveLabel(v, labelContinue);
     sqlite3VdbeAddOp2(v, OP_Next, iEph, addrTop); VdbeCoverage(v);
   }else{

But I don't know if that tells us anything. In any case, swapping out these cross .o's for natively-compiled versions resulted in no performance difference. So I suspect that even if we figured out why they were generated differently, it wouldn't help here.

> Are we using the same version of the linker in both places?

cctools/bin/{x86_64-apple-darwin11-}ld reports version 274.2 in both the cross and native toolchains. As a test I tried copying the objdir from a cross build onto a mac and linking XUL natively, then injected it back into the cross build. Unfortunately that had no effect on the performance.

> One guess I have as to the performance difference is differences in section
> offsets between the builds. They're different between the native and cross
> builds, and they change during the stripping process.

Would this be affected by the CWD of the build? I've been using a mac loaner to more directly compare the output of the compiler on a native build and a cross build (see .o files above). In order to make things easier to diff I built the native build using Taskcluster-style paths (/home/worker/workspace/build/src instead of /builds/slave/try-m64-0000000000000000000000/build/src). Curiously, when I ran talos tests on this native build, the performance was bad. When I checked the same tree and toolchain out into the /builds/slave/... style path, the performance was good.

This is the same machine, same revision, same tooltool manifest, just a different CWD. So I tried to hack taskcluster/scripts/builder/build-linux.sh to use /builds/slave/... instead of /home/worker/workspace, and the cross performance is now good (without stripping).
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=1bbe1db6f9349f24ae7d188a40c6a577f7cf39a0&newProject=try&newRevision=464c24f7ec9eaeb4a01bce6b0b2c5bdaf5e43ba3&framework=1&showOnlyImportant=0

This is the only difference between the two:
https://hg.mozilla.org/try/rev/a70be9c6dc1401061adc4b30858120c8cde1c256
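(For reference, the .o comparison mentioned above was a loop along these lines; the objdir names are placeholders, and the first output line of otool is skipped because it embeds the file path:)

$ cd tc-objdir
$ find . -name '*.o' | while read o; do
>   cmp -s <(otool -t "$o" | tail -n +2) <(otool -t "../bb-objdir/$o" | tail -n +2) || echo "text differs: $o"
>   cmp -s <(otool -d "$o" | tail -n +2) <(otool -d "../bb-objdir/$o" | tail -n +2) || echo "data differs: $o"
> done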
Comment 130•7 years ago
|
||
So... comment 129 is confirming the last paragraph of comment 128...
Comment 131•7 years ago
|
||
The interesting part is that the path is longer on buildbot vs. taskcluster. It seems it's sheer luck that it's faster with buildbot.
Comment 132•7 years ago
|
||
So it sounds like some performance sensitive code is being negatively impacted by page or cache line alignment and that the existing build path length just so happens to result in alignment with optimal performance? Fun. If so, yeah, we'll want to change how we invoke the linker to ensure optimal alignment going forward. mshal: one caveat to moving things to /builds/slave is that in TC that path is an AUFS volume whereas /home/worker/workspace is an EXT4 volume (at least outside of Try). This is relevant because AUFS has horrible performance for concurrent I/O. See bug 1291940. It doesn't need to block the transition, but at some point we'll want to be sure we're building from a TC cache volume using EXT4. Perhaps if you pad the path in /home/worker/workspace so it has the same length as /builds/slave/... it will just work?
Comment 133•7 years ago
|
||
Yeah, I say we just make a longer path name under the existing `/home/worker/workspace` directory to get paths that are equally as long as the buildbot paths, and then file a followup to figure out how we can make this less sucky, maybe with explicit linker instructions or something.
Comment 134•7 years ago
|
||
It seems it's not the length of the path, but whether or not /home is in the path. I tried just bumping the path length in a TC build, but the performance is still slow:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=acb807794ae8a4e53988abf52194c9fead4a0c84

-: WORKSPACE ${WORKSPACE:=/home/worker/workspace}
+: WORKSPACE ${WORKSPACE:=/home/worker/workspace0000000000000000000000}

After fiddling around with various paths, it seems that if we have /home in the path, it's slow, otherwise it's fast. On the OSX machines we have /home mounted as auto_home (something about autofs?):

$ mount
map auto_home on /home (autofs, automounted, nobrowse)
...

With this mounted, I get bad performance on the cross build and good performance on the buildbot build. If I disable it in /etc/auto_master and unmount it, I get good performance in both:

--- with-autohome-try-cross-f18d021a32a526f02f4188197fd848f9819a51b7.log ---
a11yr: 576.89531924
--- with-autohome-try-native-f18d021a32a526f02f4188197fd848f9819a51b7.log ---
a11yr: 420.733328064
--- no-autohome-try-cross-f18d021a32a526f02f4188197fd848f9819a51b7.log ---
a11yr: 421.343462125
--- no-autohome-try-native-f18d021a32a526f02f4188197fd848f9819a51b7.log ---
a11yr: 418.669810685

(The with-autohome-* logs match the configuration we have in automation, while no-autohome* is with /home unmounted.)

You can see a slight performance difference in accessing /home when it's mounted that way (here's 1M stats in a loop):

/Users: 0m0.880s
/home no auto_home: 0m0.867s
/home with auto_home: 0m1.004s

I guess with the debug information in, it is trying to check all the source files or something? I tried to check this with dtruss but couldn't find any evidence that that is what was happening.

In any case, I think we should try to remove this line from /etc/auto_master in the OSX tester image and see if that fixes it:

/home auto_home -nobrowse,hidefromfinder

(I'm assuming we don't actually need autofs for anything on the test machines.)

I don't have /home mounted this way on my local mac, which I guess is why I don't see any performance difference running these talos tests locally.
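(Something along these lines reproduces that kind of measurement; the path is hypothetical, and it calls stat(2) directly in a loop rather than spawning /usr/bin/stat so that process startup doesn't dominate the timing:)

$ time python -c "
import os
for _ in range(1000000):
    try:
        os.stat('/home/worker/workspace/foo.o')   # nonexistent path under /home still triggers the automount lookup
    except OSError:
        pass
"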
Comment 135•7 years ago
|
||
Ah, that makes a lot more sense! I'm not sure exactly what it's looking for, but it's trying to stat *something* from the original build location and failing, clearly. FWIW, my Mac has /home mounted (I don't remember ever fiddling with that, so it must be the default?), so it would probably behoove us to fix this, because it might actually show up on users' machines. On TC Windows builds we build from a symlink dir because the default task directory has the Task ID encoded in it, and we wanted consistent paths so we could get sccache cache hits. Maybe we can just make a /builds -> /home/worker/workspace symlink, and build with $PWD in /builds?
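(In other words, something like this in the task setup; the paths are assumed:)

$ ln -s /home/worker/workspace /builds   # stable alias that doesn't start with /home
$ cd /builds/build/src                   # build here so $PWD, and thus the recorded object paths, start with /builds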
Comment 136•7 years ago
|
||
Here's what we do for Windows generic worker builds: https://dxr.mozilla.org/mozilla-central/rev/f8bb96fb5c4f9960c5f754f877eceb677df18ddc/taskcluster/taskgraph/transforms/job/mozharness.py#238 We symlink z:\build to the task dir, then `cd /d z:\build` and build from there.
Comment 137•7 years ago
|
||
I tried the symlink approach, but it seems somewhere along the way we're normalizing the path back to /home/worker, so the same paths end up in the final executables. I think we should just be able to change /home/worker to /builds/worker (or something) everywhere in taskcluster and get the performance boost, though I'm having trouble sorting out what to change in order to get the base image built with the new paths. Just changing /home/worker to /builds/worker in the whole tree did not work :). wcosta, would you be able to take a look at what would be involved to do this?
Flags: needinfo?(wcosta)
Assignee | ||
Comment 138•7 years ago
|
||
(In reply to Michael Shal [:mshal] from comment #137)
> I tried the symlink approach, but it seems somewhere along the way we're
> normalizing the path back to /home/worker, so the same paths end up in the
> final executables.
>
> I think we should just be able to change /home/worker to /builds/worker (or
> something) everywhere in taskcluster and get the performance boost, though
> I'm having trouble sorting out what to change in order to get the base image
> built with the new paths. Just changing /home/worker to /builds/worker in
> the whole tree did not work :). wcosta, would you be able to take a look at
> what would be involved to do this?

I am working on a patch, but it is trickier than expected. It seems nobody predicted we might want to change the home directory someday.
Flags: needinfo?(wcosta)
Comment hidden (mozreview-request) |
Comment 141•7 years ago
|
||
mozreview-review |
Comment on attachment 8881632 [details]
Bug 1338651 part 1: Change docker image home dir to /build.

https://reviewboard.mozilla.org/r/152796/#review157944

This is looking awesome! I'd like to re-review afterward to make sure we didn't miss anything. I think the only main issue is making sure we don't have both a /build and a /builds. Were you able to get an actual build task completed with this? If so I can give it a quick test on my loaner as a sanity check.

::: taskcluster/ci/toolchain/linux.yml:23
(Diff revision 1)
>  run:
>      using: toolchain-script
>      script: build-clang-linux.sh
>      tooltool-downloads: public
>      resources:
> -        - 'build/build-clang/**'
> +        - '/build/build-clang/**'

Did you need to change these for some reason? I believe these are supposed to be tree-relative paths (not full paths), so 'build/build-clang/\*\*' is correct here. (Same with the rest of this file, and macosx.yml and windows.yml)

::: taskcluster/docker/android-gradle-build/Dockerfile:39
(Diff revision 1)
>  # the directory where mozharness is run (not its --work-dir)
> -ADD buildprops.json /home/worker/
> +ADD buildprops.json /build/worker/
>
>  # install tooltool directly from github where tooltool_wrapper.sh et al. expect
>  # to find it
>  RUN wget -O /builds/tooltool.py https://raw.githubusercontent.com/mozilla/build-tooltool/master/tooltool.py

Does this mean we have both /builds and /build now? If so we should just pick one and put everything in there.

::: taskcluster/taskgraph/transforms/job/mozharness_test.py:67
(Diff revision 1)
>  worker['max-run-time'] = test['max-run-time']
>  worker['retry-exit-status'] = test['retry-exit-status']
>
>  artifacts = [
>      # (artifact name prefix, in-image path)
> -    ("public/logs/", "/home/worker/workspace/build/upload/logs/"),
> +    ("public/logs/", "build/worker/workspace/build/upload/logs/"),

Do these need to have a '/' in the front?

::: taskcluster/taskgraph/transforms/job/mozharness_test.py:86
(Diff revision 1)
>  worker['caches'] = [{
>      'type': 'persistent',
>      'name': 'level-{}-{}-test-workspace'.format(
>          config.params['level'], config.params['project']),
> -    'mount-point': "/home/worker/workspace",
> +    'mount-point': "build/worker/workspace",

Also here for a '/' in front.

::: taskcluster/taskgraph/transforms/job/mozharness_test.py:202
(Diff revision 1)
>  # jittest doesn't have blob_upload_dir
>  if test['test-name'] != 'jittest':
>      artifacts.append({
>          'name': 'public/test_info',
> -        'path': 'build/blobber_upload_dir',
> +        'path': '/build/blobber_upload_dir',

I don't think this is supposed to be a full path, but rather correspond to the 'build' directory that mozharness uses (eg: /build/worker/workspace/build/blobber_upload_dir), so this change should be removed.

::: taskcluster/taskgraph/transforms/job/spidermonkey.py:81
(Diff revision 1)
> -    '--chown-recursive', '/home/worker/tooltool-cache',
> -    '--vcs-checkout', '/home/worker/workspace/build/src',
> +    '--chown-recursive', '/build/worker/tooltool-cache',
> +    '--vcs-checkout', '/build/worker/workspace/build/src',
>      '--',
>      '/bin/bash',
>      '-c',
> -    'cd /home/worker && workspace/build/src/taskcluster/scripts/builder/%s' % script
> +    'cd build/worker && workspace/build/src/taskcluster/scripts/builder/%s' % script

'cd /build/worker' instead of 'cd build/worker'

::: testing/mozharness/configs/single_locale/linux32.py:1
(Diff revision 1)
> -linux.py
> +import os

I think something got messed up with the diff here, or maybe mozreview is doing something funky. What was actually changed here?

::: testing/mozharness/configs/single_locale/linux32_devedition.py:1
(Diff revision 1)
> -linux_devedition.py
> +import os

This file seems weird too. The old version is just the filename, and the new version is the entire file.
Attachment #8881632 -
Flags: review?(mshal)
Assignee | ||
Comment 142•7 years ago
|
||
My build try push: https://treeherder.mozilla.org/#/jobs?repo=try&revision=adeb2d9777fe386c08db116378a6e6b411bb75fc&selectedJob=110176395

When the backlog gets more manageable I will do a full try push.
Comment 143•7 years ago
|
||
(In reply to Wander Lairson Costa [:wcosta] from comment #142) > My build try push: > https://treeherder.mozilla.org/#/ > jobs?repo=try&revision=adeb2d9777fe386c08db116378a6e6b411bb75fc&selectedJob=1 > 10176395 I grabbed the opt TC build and ran a11yr and sessionrestore on my loaner - the numbers look good to me! Confirmation from an actual talos job would be nice too, of course :). Nice work!
Comment hidden (mozreview-request) |
Comment 145•7 years ago
|
||
mozreview-review |
Comment on attachment 8881632 [details]
Bug 1338651 part 1: Change docker image home dir to /build.

https://reviewboard.mozilla.org/r/152796/#review158290

::: commit-message-8f80d:1
(Diff revision 2)
> +Bug 1338651: Change docker image home dir to /build. r=mshal,dustin

Can you double-check with gps or TC members to make sure that https://bugzilla.mozilla.org/show_bug.cgi?id=1338651#c132 won't be a problem?

::: taskcluster/taskgraph/transforms/job/common.py:31
(Diff revision 2)
>      'name': 'level-{}-{}-build-{}-{}-workspace'.format(
>          config.params['level'], config.params['project'],
>          taskdesc['attributes']['build_platform'],
>          taskdesc['attributes']['build_type'],
>      ),
> -    'mount-point': "/home/worker/workspace",
> +    'mount-point': "build/worker/workspace",

This should be "/builds..." instead of "build..."

::: taskcluster/taskgraph/transforms/job/common.py:44
(Diff revision 2)
>  def docker_worker_add_tc_vcs_cache(config, job, taskdesc):
>      taskdesc['worker'].setdefault('caches', []).append({
>          'type': 'persistent',
>          'name': 'level-{}-{}-tc-vcs'.format(
>              config.params['level'], config.params['project']),
> -        'mount-point': "/home/worker/.tc-vcs",
> +        'mount-point': "build/worker/.tc-vcs",

/builds

::: taskcluster/taskgraph/transforms/job/mozharness.py:170
(Diff revision 2)
> -    '--chown-recursive', '/home/worker/tooltool-cache',
> -    '--vcs-checkout', '/home/worker/workspace/build/src',
> -    '--tools-checkout', '/home/worker/workspace/build/tools',
> +    '--chown-recursive', '/builds/worker/tooltool-cache',
> +    '--vcs-checkout', '/builds/worker/workspace/build/src',
> +    '--tools-checkout', '/builds/worker/workspace/build/tools',
>      '--',
>  ]
> -command.append("/home/worker/workspace/build/src/{}".format(
> +command.append("build/worker/workspace/build/src/{}".format(

/builds

::: taskcluster/taskgraph/transforms/job/run_task.py:22
(Diff revision 2)
>
>  # if true, add a cache at ~worker/.cache, which is where things like pip
>  # tend to hide their caches. This cache is never added for level-1 jobs.
>  Required('cache-dotcache', default=False): bool,
>
> -# if true (the default), perform a checkout in /home/worker/checkouts/gecko
> +# if true (the default), perform a checkout in build/worker/checkouts/gecko

/builds

::: taskcluster/taskgraph/transforms/job/spidermonkey.py:43
(Diff revision 2)
>  if int(config.params['level']) > 1:
>      worker['caches'].append({
>          'type': 'persistent',
>          'name': 'level-{}-{}-build-spidermonkey-workspace'.format(
>              config.params['level'], config.params['project']),
> -        'mount-point': "/home/worker/workspace",
> +        'mount-point': "build/worker/workspace",

/builds

::: taskcluster/taskgraph/transforms/job/toolchain.py:124
(Diff revision 2)
>  worker['command'] = [
> -    '/home/worker/bin/run-task',
> +    '/builds/worker/bin/run-task',
>      # Various caches/volumes are default owned by root:root.
> -    '--chown-recursive', '/home/worker/workspace',
> -    '--chown-recursive', '/home/worker/tooltool-cache',
> -    '--vcs-checkout=/home/worker/workspace/build/src',
> +    '--chown-recursive', '/builds/worker/workspace',
> +    '--chown-recursive', '/builds/worker/tooltool-cache',
> +    '--vcs-checkout=build/worker/workspace/build/src',

/build

::: taskcluster/taskgraph/transforms/job/toolchain.py:128
(Diff revision 2)
> -    '--chown-recursive', '/home/worker/tooltool-cache',
> -    '--vcs-checkout=/home/worker/workspace/build/src',
> +    '--chown-recursive', '/builds/worker/tooltool-cache',
> +    '--vcs-checkout=build/worker/workspace/build/src',
>      '--',
>      'bash',
>      '-c',
> -    'cd /home/worker && '
> +    'cd build/worker && '

/builds/worker

::: taskcluster/taskgraph/transforms/marionette_harness.py:34
(Diff revision 2)
>  task['worker']['caches'] = [{
>      'type': 'persistent',
>      'name': 'level-{}-{}-tc-vcs'.format(
>          config.params['level'], config.params['project']),
> -    'mount-point': "/home/worker/.tc-vcs",
> +    'mount-point': "build/worker/.tc-vcs",

/builds

::: taskcluster/taskgraph/transforms/repackage.py:146
(Diff revision 2)
>      'type': 'file',
> -    'path': '/home/worker/workspace/build/artifacts/target.dmg',
> +    'path': '/builds/worker/workspace/build/artifacts/target.dmg',
>      'name': 'public/build/{}target.dmg'.format(locale_output_path),
>  }, {
>      'type': 'file',
>      'path': '/home/worker/workspace/build/artifacts/target.complete.mar',

Looks like we'll need to update this too since it was just added.

::: testing/mozharness/configs/single_locale/linux32.py:1
(Diff revision 2)
> -linux.py
> +import os

Ohhhh, I think the problem is that this file is in the tree as a symlink. Maybe your script changed it from a symlink to a normal file, so it shows up as a weird diff here? If so I think you can just undo that part of the patch and leave it as a symlink.

::: testing/mozharness/configs/single_locale/linux32_devedition.py:1
(Diff revision 2)
> -linux_devedition.py
> +import os

Similarly here, this file should just stay as a symlink.
Attachment #8881632 -
Flags: review?(mshal)
Assignee | ||
Comment 146•7 years ago
|
||
mozreview-review-reply |
Comment on attachment 8881632 [details]
Bug 1338651 part 1: Change docker image home dir to /build.

https://reviewboard.mozilla.org/r/152796/#review158290

> Can you double-check with gps or TC members to make sure that https://bugzilla.mozilla.org/show_bug.cgi?id=1338651#c132 won't be a problem?

I actually logged in to a docker image with a one-click loaner and confirmed /builds/worker is mounted on an ext4 FS.
Comment 147•7 years ago
|
||
We were poking around a bit more and found where the accesses are coming from using dtrace on the whole system (suggested by ted):

dtrace -n 'syscall::open*:entry { printf("%s %s",execname,copyinstr(arg0)); }'

On a /builds build, the results look like this:

0 151 open:entry sandboxd /builds/slave/try-m64-0000000000000000000000/build/src/obj-firefox/mozglue/build/../misc/TimeStamp.o
0 151 open:entry sandboxd /builds/slave/try-m64-0000000000000000000000/build/src/obj-firefox/mozglue/build/../misc/TimeStamp.o
0 151 open:entry sandboxd /builds/slave/try-m64-0000000000000000000000/build/src/obj-firefox/mozglue/build/../misc/TimeStamp.o
0 151 open:entry sandboxd /builds/slave/try-m64-0000000000000000000000/build/src/obj-firefox/mozglue/build/../misc/TimeStamp.o
1 151 open:entry sandboxd /builds/slave/try-m64-0000000000000000000000/build/src/obj-firefox/mozglue/build/../misc/TimeStamp_darwin.o
0 151 open:entry sandboxd /builds/slave/try-m64-0000000000000000000000/build/src/obj-firefox/mozglue/build/../misc/TimeStamp_darwin.o
0 151 open:entry sandboxd /builds/slave/try-m64-0000000000000000000000/build/src/obj-firefox/mozglue/build/../misc/TimeStamp_darwin.o
0 151 open:entry sandboxd /builds/slave/try-m64-0000000000000000000000/build/src/obj-firefox/mozglue/build/../misc/TimeStamp_darwin.o

So sandboxd is doing 4 opens on each of the ~5000 .o files in XUL.

On a /home build, the sandboxd lookup relies on automountd and opendirectoryd, which is why those two processes spike (bug 1376143). Each lookup looks like this:

0 151 open:entry sandboxd /home/worker/workspace/build/src/obj-firefox/mozglue/build/../misc/TimeStamp.o
0 937 open_nocancel:entry automountd /etc/auto_home
1 937 open_nocancel:entry automountd /etc/auto_home
1 937 open_nocancel:entry automountd /usr/libexec/od_user_homes
3 151 open:entry od_user_homes /dev/dtracehelper
2 151 open:entry od_user_homes /System/Library/OpenDirectory/request-schema.plist
2 937 open_nocancel:entry od_user_homes /usr/libexec
2 937 open_nocancel:entry od_user_homes /usr/libexec
2 151 open:entry od_user_homes /usr/libexec/od_user_homes

sandboxd tries to do the open, which automountd picks up because of the auto_home mount, and automountd uses opendirectory to do some sort of lookup:

automountd[895]: od_search for worker in auto_home failed
automountd[895]: od_search for * in auto_home failed

(That's from syslog with AUTOMOUNTD_VERBOSE=TRUE set in /etc/autofs.conf)

So far I haven't found much info on why sandboxd would be trying to open every file listed in the debug info when the process isn't being debugged, but I wanted to document it here for future reference.
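(A handy variant of that one-liner aggregates counts instead of printing every event; the sandboxd predicate below is just illustrative:)

dtrace -n 'syscall::open*:entry /execname == "sandboxd"/ { @opens[copyinstr(arg0)] = count(); }'

Hitting Ctrl-C prints a count per path, which makes the 4-opens-per-object-file pattern easy to spot.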
Comment hidden (mozreview-request) |
Assignee | ||
Comment 149•7 years ago
|
||
Hopefully the last patch catches everything. Here is my try push:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=641ce431e0c5&selectedJob=110866341

The macosx tests probably won't finish before tomorrow morning.
Assignee | ||
Updated•7 years ago
|
Status: NEW → ASSIGNED
Comment 150•7 years ago
|
||
mozreview-review |
Comment on attachment 8881632 [details] Bug 1338651 part 1: Change docker image home dir to /build. https://reviewboard.mozilla.org/r/152796/#review158708
Attachment #8881632 -
Flags: review?(dustin) → review+
Comment 151•7 years ago
|
||
mozreview-review |
Comment on attachment 8881632 [details] Bug 1338651 part 1: Change docker image home dir to /build. https://reviewboard.mozilla.org/r/152796/#review158730 Awesome!
Attachment #8881632 -
Flags: review?(mshal) → review+
Comment 152•7 years ago
|
||
Pushed by wcosta@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/feedcde68c2a Change docker image home dir to /build. r=dustin,mshal
Comment 153•7 years ago
|
||
(In reply to Michael Shal [:mshal] from comment #147)
> So far I haven't found much info on why sandboxd would be trying to open
> every file listed in the debug info when the process isn't being debugged,
> but I wanted to document it here for future reference.

It looks like sandboxd tries to symbolicate sandbox violations. If that is the case, we should probably stop triggering violations in the first place.
Comment 154•7 years ago
|
||
backed out for failures like https://treeherder.mozilla.org/logviewer.html#?job_id=111051404&repo=autoland
Flags: needinfo?(wcosta)
Comment 155•7 years ago
|
||
Backout by cbook@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/c7c5ffd2eb39 Backed out changeset feedcde68c2a for bustage
Assignee | ||
Comment 156•7 years ago
|
||
(In reply to Carsten Book [:Tomcat] from comment #154)
> backed out for failures like
> https://treeherder.mozilla.org/logviewer.html#?job_id=111051404&repo=autoland

I don't think this is a bug in this patch. Re-triggering the failed jobs made them work. From the errors I see, I suspect that these failed jobs were scheduled on already-running instances that didn't download the newer docker image, causing the job to try to access a nonexistent directory. I am investigating to see if this is really the case.
Flags: needinfo?(wcosta)
Assignee | ||
Comment 157•7 years ago
|
||
(In reply to Wander Lairson Costa [:wcosta] from comment #156)
> I don't think this is a bug in this patch. Re-triggering the failed jobs made
> them work. From the errors I see, I suspect that these failed jobs were
> scheduled on already-running instances that didn't download the newer docker
> image, causing the job to try to access a nonexistent directory. I am
> investigating to see if this is really the case.

So, it feels like something cache related. I compared a failed job to a successful one. In the failed job, it finds a repo cache:

[vcs 2017-06-30T06:41:17.576319Z] executing ['hg', 'robustcheckout', '--sharebase', '/builds/worker/checkouts/hg-store', '--purge', '--branch', 'default', 'https://hg.mozilla.org/build/tools', '/builds/worker/workspace/build/tools']
[vcs 2017-06-30T06:41:17.640157Z] ensuring https://hg.mozilla.org/build/tools@default is available at /builds/worker/workspace/build/tools
[vcs 2017-06-30T06:41:17.640347Z] (existing repository shared store: /home/worker/checkouts/hg-store/7ae7fb134bf7aec6ec96a062ff47a69053dd2973/.hg)
[vcs 2017-06-30T06:41:17.640375Z] (shared store does not exist; deleting)
[vcs 2017-06-30T06:41:18.213912Z] (sharing from existing pooled repository 7ae7fb134bf7aec6ec96a062ff47a69053dd2973)
[vcs 2017-06-30T06:41:18.415607Z] searching for changes
[vcs 2017-06-30T06:41:18.416449Z] no changes found
[vcs 2017-06-30T06:41:18.609292Z] (pulling to obtain default)
[vcs 2017-06-30T06:41:19.190337Z] (remote resolved default to d5a286aff22cc87361797f2aefc08c1ad1d85889; result is not deterministic)
[vcs 2017-06-30T06:41:19.192146Z] (revision already present locally; not pulling)
[vcs 2017-06-30T06:41:19.353057Z] 1054 files updated, 0 files merged, 0 files removed, 0 files unresolved
[vcs 2017-06-30T06:41:19.353862Z] updated to d5a286aff22cc87361797f2aefc08c1ad1d85889

While in the successful job, no repo cache is found:

[vcs 2017-06-30T14:47:53.519313Z] executing ['hg', 'robustcheckout', '--sharebase', '/builds/worker/checkouts/hg-store', '--purge', '--upstream', 'https://hg.mozilla.org/mozilla-unified', '--revision', '0c16e7a1a10f61010eb9e5cb35c1fe1dc4c95fb5', 'https://hg.mozilla.org/integration/autoland/', '/builds/worker/workspace/build/src']
[vcs 2017-06-30T14:47:53.592334Z] ensuring https://hg.mozilla.org/integration/autoland/@0c16e7a1a10f61010eb9e5cb35c1fe1dc4c95fb5 is available at /builds/worker/workspace/build/src
[vcs 2017-06-30T14:47:53.592501Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
[vcs 2017-06-30T14:47:54.231734Z] (sharing from existing pooled repository 8ba995b74e18334ab3707f27e9eb8f4e37ba3d29)

Later, the failed job detects a python executable inside the venv directory and "thinks" the virtualenv has already been created:

[task 2017-06-30T06:41:19.604118Z] 06:41:19 INFO - Creating virtualenv /builds/worker/workspace/build/venv
[task 2017-06-30T06:41:19.606797Z] 06:41:19 INFO - Virtualenv /builds/worker/workspace/build/venv/bin/python appears to already exist; skipping virtualenv creation.
[task 2017-06-30T06:41:19.606984Z] 06:41:19 INFO - Getting output from command: ['/builds/worker/workspace/build/venv/bin/pip', '--version']
[task 2017-06-30T06:41:19.607126Z] 06:41:19 INFO - Copy/paste: /builds/worker/workspace/build/venv/bin/pip --version
[task 2017-06-30T06:41:19.609710Z] Traceback (most recent call last):
[task 2017-06-30T06:41:19.609755Z] File "/builds/worker/workspace/build/src/testing/mozharness/scripts/fx_desktop_build.py", line 258, in <module>
[task 2017-06-30T06:41:19.609775Z] fx_desktop_build = FxDesktopBuild()
[task 2017-06-30T06:41:19.609804Z] File "/builds/worker/workspace/build/src/testing/mozharness/scripts/fx_desktop_build.py", line 106, in __init__
[task 2017-06-30T06:41:19.609826Z] super(FxDesktopBuild, self).__init__(**buildscript_kwargs)
[task 2017-06-30T06:41:19.609856Z] File "/builds/worker/workspace/build/src/testing/mozharness/mozharness/mozilla/building/buildbase.py", line 647, in __init__
[task 2017-06-30T06:41:19.609905Z] self.create_virtualenv()
[task 2017-06-30T06:41:19.609939Z] File "/builds/worker/workspace/build/src/testing/mozharness/mozharness/base/python.py", line 437, in create_virtualenv
[task 2017-06-30T06:41:19.609979Z] halt_on_failure=True)
[task 2017-06-30T06:41:19.610014Z] File "/builds/worker/workspace/build/src/testing/mozharness/mozharness/base/script.py", line 1564, in get_output_from_command
[task 2017-06-30T06:41:19.610180Z] cwd=cwd, stderr=tmp_stderr, env=env)
[task 2017-06-30T06:41:19.610204Z] File "/usr/lib/python2.7/subprocess.py", line 710, in __init__
[task 2017-06-30T06:41:19.610284Z] errread, errwrite)
[task 2017-06-30T06:41:19.610309Z] File "/usr/lib/python2.7/subprocess.py", line 1335, in _execute_child
[task 2017-06-30T06:41:19.610457Z] raise child_exception
[task 2017-06-30T06:41:19.610477Z] OSError: [Errno 2] No such file or directory

The successful job just creates the virtualenv from scratch.

:gps, any clue whether robustcheckout has anything to do with this?
Flags: needinfo?(gps)
Comment 158•7 years ago
I think this has to do with virtualenv interaction in mozharness, not robustcheckout.
Flags: needinfo?(gps)
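For context on the virtualenv interaction mentioned above: mozharness decides whether to (re)create the virtualenv essentially by checking whether a python binary already exists at the venv path, so a stale cached workspace can pass that check even when the interpreter it references no longer exists. A minimal sketch of that pattern — illustrative only, with made-up helper names, not the actual mozharness code:

```python
import os
import subprocess

def ensure_virtualenv(venv_dir):
    """Create venv_dir unless it already looks populated (hypothetical helper)."""
    venv_python = os.path.join(venv_dir, "bin", "python")
    if os.path.exists(venv_python):
        # A leftover venv in a cached workspace passes this check even if its
        # interpreter references a path that no longer exists (e.g. the old
        # /home/worker layout), so later pip invocations fail with ENOENT.
        print("Virtualenv %s appears to already exist; skipping creation." % venv_python)
        return
    subprocess.check_call(["virtualenv", venv_dir])
```

If that is what happened here, it would explain why re-triggering on a fresh worker (with no cached workspace) succeeds, and why a cache-busting change such as the CLOBBER touch mentioned below helps.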
Comment 159•7 years ago
Looks like we already have bug 1306663 for these specific sandbox violations. In any case, we still need to play nicer when we do have violations.
See Also: → 1306663
Assignee
Comment 160•7 years ago
:Tomcat, may we try to land it again? It seems like a transient problem (it is a bug, but it is cache related and seems to have gone away) unrelated to the patch.
Flags: needinfo?(cbook)
Comment 161•7 years ago
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/485d1af7805a
Change docker image home dir to /build. r=dustin,mshal
Comment 162•7 years ago
I added a CLOBBER touch to the commit. We'll see if that helps.
Flags: needinfo?(cbook)
Comment 163•7 years ago
Pushed by kwierso@gmail.com:
https://hg.mozilla.org/mozilla-central/rev/f8a62747c51c
Change docker image home dir to /build. r=dustin,mshal a=hopefullyavoidingmergebustagelater
Grafted this over to m-c so we can hopefully get all trunk branches using this, on the theory that having workers shared between different trees might cause issues if things are cached differently.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
status-firefox56:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla56
Comment 165•7 years ago
:njn, would we need to update the valgrind suppression files now that we've changed the path builds happen in on Taskcluster? The Valgrind jobs seem to be failing since the patch in this bug landed (which changes /home/worker to /builds/worker): https://public-artifacts.taskcluster.net/WXpSmrWYSBmXHM9pJO52Aw/0/public/logs/live_backing.log
Flags: needinfo?(n.nethercote)
Comment 166•7 years ago
The Valgrind errors are leaks that look like this:

> ==42858== 16 bytes in 1 blocks are definitely lost in loss record 46 of 256
> ==42858==    at 0x4C2B525: operator new(unsigned long) (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
> ==42858==    by 0x17509424: ???
> ==42858==    by 0x1779BBA5: ???
> ==42858==    by 0x16ADCCAA: ???
> ==42858==    by 0xFFF0001B2: ???
> ==42858==    by 0x772F72656B726F76: ???
> ==42858==    by 0x65636170736B726E: ???
> ==42858==    by 0x732F646C6975622E: ???

Only the first stack frame has a function name. It's quite possible that this is a leak for which we have a suppression, but without the other function names in the stack trace the suppression won't apply.

There are other (unimportant) Valgrind warnings that *do* have stack traces, like this one:

> --42853-- WARNING: unhandled syscall: 318
> ==42853==    at 0x5CEE419: syscall (in /lib64/libc-2.12.so)
> ==42853==    by 0x11360270: GenerateRandomSeed (jsmath.cpp:743)
> ==42853==    by 0x11360270: GenerateXorShift128PlusSeed (jsmath.cpp:767)
> ==42853==    by 0x11360270: JSCompartment::ensureRandomNumberGenerator() (jsmath.cpp:777)
> ==42853==    by 0x10F4E250: JSCompartment::randomHashCode() (jscompartment.cpp:1408)
> ==42853==    by 0x111321BC: JS::Symbol::new_(JSContext*, JS::SymbolCode, JSString*)

So it appears the debuginfo isn't completely messed up for the process, just partly messed up. It's possible that the missing symbols are for system libraries, while symbols for Firefox itself are ok.

In short, could this change have affected symbol access somehow, particularly for system libraries?
Flags: needinfo?(n.nethercote)
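One aside on the bogus frames in comment 166: "addresses" like 0x772F72656B726F76 consist entirely of printable ASCII byte values, which usually suggests the unwinder walked into string data on the stack rather than real return addresses. A throwaway sketch to eyeball that (purely illustrative, not part of any tooling in this bug):

```python
import struct

# "Addresses" copied from the Valgrind stack trace above.
frames = [0x772F72656B726F76, 0x65636170736B726E, 0x732F646C6975622E]

for value in frames:
    # Show the 8 little-endian bytes each 64-bit value occupies in memory,
    # replacing non-printable bytes with '.'.
    raw = struct.pack("<Q", value)
    print(hex(value), "->", "".join(chr(b) if 32 <= b < 127 else "." for b in raw))
```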
Backed out from autoland in https://hg.mozilla.org/integration/autoland/rev/68e430f20dfd7c6984ef0de3a8203a0862970445 and m-c in https://hg.mozilla.org/mozilla-central/rev/587daa4bdc4b40b7053f4ca3b74723ca747f3b52 for the valgrind issues: https://treeherder.mozilla.org/logviewer.html#?job_id=111185089&repo=autoland
Flags: needinfo?(wcosta)
Comment 168•7 years ago
The Valgrind failures /might/ be due to rebuilding the Docker images and picking up newer package versions. This was first observed in bug 1272629 comment 34. Basically, we can't currently touch certain aspects of the desktop-build image: if we do a rebuild from scratch and pick up the latest yum packages, we inherit some yet-to-be-triaged Valgrind bustage. The proper fix is to run our own package server, configured in such a way as to allow deterministic operations over time. Bug 1289812 tracks this and has more details.
Updated•7 years ago
Assignee
Comment 169•7 years ago
Reopened to fix valgrind errors.
Status: RESOLVED → REOPENED
Flags: needinfo?(wcosta)
Resolution: FIXED → ---
Assignee
Comment 170•7 years ago
So, it feels like it is just a single leak case. If I add a suppression for it [1], the task succeeds [2].

I ran valgrind under strace to track all "open" calls, for a current good build [3] and the failed one [4], but I couldn't spot a difference between the two. I also tried to map which libs the stack trace comes from, but it seems valgrind doesn't provide something like a /proc maps file.

:njn :jseward, ideas?

[1] https://hg.mozilla.org/try/rev/6cf065d95bef6c9c9b966b7fbb8bd6aa4ebfdce8
[2] https://treeherder.mozilla.org/#/jobs?repo=try&revision=43362462339c99c5ba8e00aef5ee051ad9d77051&selectedJob=113957343
[3] https://treeherder.mozilla.org/#/jobs?repo=try&revision=27a0847991eabb1ee85e90b73bbef17c7a860771&selectedJob=113798780
[4] https://treeherder.mozilla.org/#/jobs?repo=try&revision=220d84752e367d10fa38ec723883e014ebd9909d&selectedJob=113909336
Flags: needinfo?(n.nethercote)
Flags: needinfo?(jseward)
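For reference, one way to compare the open() traffic of a good and a failed run, as in the strace experiment above, is to diff the sets of opened paths from the two logs. A small sketch, assuming each log was captured with something like `strace -f -e trace=open` and using hypothetical file names:

```python
import re
import sys

OPEN_RE = re.compile(r'open\("([^"]+)"')

def opened_paths(log_path):
    """Collect the set of paths passed to open() in an strace log."""
    with open(log_path) as f:
        return {m.group(1) for line in f for m in OPEN_RE.finditer(line)}

if __name__ == "__main__":
    # e.g. python diff_opens.py good_run.strace bad_run.strace (hypothetical names)
    good, bad = opened_paths(sys.argv[1]), opened_paths(sys.argv[2])
    print("only in good run:", sorted(good - bad))
    print("only in bad run:", sorted(bad - good))
```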
Comment 171•7 years ago
My observations in comment 166 still apply. This appears to be an issue with symbols and/or debuginfo.
Flags: needinfo?(n.nethercote)
Comment 173•7 years ago
Does the current fix result in the browser not trying to open all the object files using the Linux build paths, or will they still be opened but the paths are moved out of /home? If they're still opened, that might cause lots of sandbox violations, depending on the path and whether or not we do the open before the sandbox is enabled in plugin-container. If we can't stop them from being opened, we can have the sandbox allow reads from those paths, but we'll have to come up with a path scheme that makes sense keeping security in mind.
Comment 174•7 years ago
(In reply to Haik Aftandilian [:haik] from comment #173)
> Does the current fix result in the browser not trying to open all the object
> files using the Linux build paths, or will they still be opened but the
> paths are moved out of /home? If they're still opened, that might cause lots
> of sandbox violations, depending on the path and whether or not we do the
> open before the sandbox is enabled in plugin-container. If we can't stop
> them from being opened, we can have the sandbox allow reads from those
> paths, but we'll have to come up with a path scheme that makes sense keeping
> security in mind.

That's the other way around, actually. It's *because* there are sandbox violations (and they are old, from bug 1306663) that those paths are touched: the sandbox daemon is trying to symbolicate the violations.
Comment 175•7 years ago
(In reply to Mike Hommey [:glandium] from comment #174)
> (In reply to Haik Aftandilian [:haik] from comment #173)
> > Does the current fix result in the browser not trying to open all the object
> > files using the Linux build paths, or will they still be opened but the
> > paths are moved out of /home? [...]
>
> That's the other way around, actually. It's *because* there are sandbox
> violations (and they are old, from bug 1306663) that those paths are
> touched: the sandbox daemon is trying to symbolicate the violations.

Thanks for the clarification. If that's the case, I don't think the paths will make a difference, as they're being opened by sandboxd and not by plugin-container. Disabling logging of those specific violations may avoid the symbolication. That's something we can look into until we can avoid the violations entirely.
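As background for the path discussion in the last few comments: the object files being opened are the ones referenced from the build's debugger-only (OSO) symbol entries, so one way to see which build paths a given binary embeds is to dump those entries. A rough sketch, assuming the macOS `nm` tool is on PATH and keeping in mind that the exact output format can vary:

```python
import subprocess
import sys

def oso_paths(binary):
    """List object-file paths recorded in the binary's OSO (debugger-only) entries."""
    # `nm -ap` prints all symbol-table entries, including debugger-only ones,
    # in file order; OSO entries name the object files referenced at link time.
    out = subprocess.check_output(["nm", "-ap", binary], text=True)
    return [line.split()[-1] for line in out.splitlines() if " OSO " in line]

if __name__ == "__main__":
    for path in oso_paths(sys.argv[1]):
        print(path)
```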
Assignee
Comment 176•7 years ago
Should we close this as it was already fixed by bug 1383841?
Flags: needinfo?(ted)
Comment 177•7 years ago
(In reply to Wander Lairson Costa [:wcosta] from comment #176)
> Should we close this as it was already fixed by bug 1383841?

It's less critical now, but we still need to change the /home paths despite bug 1383841 being fixed.

This problem also appears to cause Activity Monitor to take a long time to sample a Nightly content process built in automation. For a local build, sampling takes about 10 seconds on my 2015 MacBook, compared to over a minute with my Nightly channel install. During that time, opendirectoryd and automountd's CPU utilization spikes, like what is reported in bug 1376143.

This probably affects the Mac crash reporter tool too, and there may be other tools that do symbolication that will trigger it as well.

Less importantly, if violation logging is turned on for debugging, we'll hit the problem. Although violations shouldn't be logged now due to bug 1383841, it will be some time before our sandboxed processes run without triggering any violations. And it's always possible that an OS X update will change the sandboxing implementation in an unexpected way that triggers a violation to be logged.
Comment 178•7 years ago
I wonder if this could also be why the vp9 playback benchmark was always under-reporting MacBook performance.
Updated•7 years ago
User Story: (updated)
Flags: needinfo?(ted)
Summary: taskcluster cross-compiled OS X builds show Talos performance regressions vs. buildbot builds → taskcluster cross-compiled OS X builds create perf problems when not stripped (Talos performance regressions vs. buildbot builds)
Comment 179•7 years ago
The User Story seems like as good a place as any to summarize the actual problem here.
Updated•7 years ago
Flags: needinfo?(jseward)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment 185•7 years ago
mozreview-review
Comment on attachment 8896446 [details]
Bug 1338651 part 2: Add suppression for leaks in libllvm-3.6-mesa.

https://reviewboard.mozilla.org/r/167694/#review176928

::: build/valgrind/x86_64-redhat-linux-gnu.sup:644
(Diff revision 2)
> +{
> +   static-object-leaks-in-libLLVM-3.6-mesa.so. See bug 1338651.
> +   Memcheck:Leak
> +   match-leak-kinds: definite
> +   fun:_Znwm
> +   ...

I think you should remove the ... here. That way we just specify two frames of -mesa.so and then a call to new. This at least rules out the possibility of suppressing a leak where -mesa.so calls some other object which then calls new.
Attachment #8896446 - Flags: review?(jseward) → review+
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Assignee
Comment 188•7 years ago
jseward, I updated the patch, could you please take another look?
Flags: needinfo?(jseward)
Comment 189•7 years ago
mozreview-review
Comment on attachment 8896446 [details]
Bug 1338651 part 2: Add suppression for leaks in libllvm-3.6-mesa.

https://reviewboard.mozilla.org/r/167694/#review177232

::: build/valgrind/x86_64-redhat-linux-gnu.sup:648
(Diff revision 3)
> +   fun:_Znwm
> +   obj:/*/lib*/libLLVM-3.6-mesa.so
> +   obj:/*/lib*/libLLVM-3.6-mesa.so
> +}
> +
> +{

AFAICS you don't need this one. The one above is strictly more general: _ZN4llvm14object_creator* and _ZNK4llvm17ManagedStaticBase* are both functions in libLLVM-3.6-mesa.so, so the general one should match whenever this one does.

That said, I think the existing suppression ("apparent leaks in libLLVM-3.6-mesa.so, August 2017. See bug 1382280.") is also made redundant by the general one. The best outcome would be to remove both of the specific ones and check that the new, general one catches all of the leaks.

Sorry -- I should have been clearer about this earlier on.
Attachment #8896446 - Flags: review+
Updated•7 years ago
Flags: needinfo?(jseward)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment 192•7 years ago
mozreview-review
Comment on attachment 8896446 [details]
Bug 1338651 part 2: Add suppression for leaks in libllvm-3.6-mesa.

https://reviewboard.mozilla.org/r/167694/#review178490
Attachment #8896446 - Flags: review?(jseward) → review+
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment 195•7 years ago
Pushed by wcosta@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/8cf125b4fa04
part 1: Change docker image home dir to /build. r=dustin,mshal
https://hg.mozilla.org/integration/autoland/rev/2f07b13f4517
part 2: Add suppression for leaks in libllvm-3.6-mesa. r=jseward
Comment 196•7 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/8cf125b4fa04
https://hg.mozilla.org/mozilla-central/rev/2f07b13f4517
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
status-firefox57:
--- → fixed
Resolution: --- → FIXED
Comment 197•7 years ago
This improved some of our platform microbenchmark tests. On OS X it turned them from bi-modal to uni-modal:

== Change summary for alert #9128 (as of August 31 2017 09:30 UTC) ==

Improvements:

27% Strings PerfStripCRLF osx-10-10 opt           209,566.33 -> 152,491.00
24% Strings PerfCompressWhitespace osx-10-10 opt  324,455.04 -> 246,625.42

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=9128
Updated•7 years ago
Product: Core → Firefox Build System