check_binary.py is a major bottleneck for incremental builds
Categories
(Firefox Build System :: General, defect)
Tracking
(firefox-esr60 fixed, firefox64 fixed)
People
(Reporter: kmag, Assigned: padenot)
References
Details
Attachments
(2 files, 1 obsolete file)
My libxul.so link times have been pretty unmanageable for incremental builds lately. I thought the linker was the bottleneck, but when I looked closer, I realized that lld takes about 3 seconds, and then check_binary.py takes at least 25 seconds. For debug builds, it takes 2-3x as long. It would probably help to split the various checks into parallel tasks, but according to the Python profiler, the biggest bottlenecks are the helper apps we call, and the per-call overhead for those is ~8s, so it would still be pretty slow:

```
2738706 function calls (2738670 primitive calls) in 24.352 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   24.352   24.352 <string>:1(<module>)
        1    0.000    0.000   24.352   24.352 /home/kris/code/mozilla-central/python/mozbuild/mozbuild/action/check_binary.py:318(main)
        1    0.000    0.000   24.351   24.351 /home/kris/code/mozilla-central/python/mozbuild/mozbuild/action/check_binary.py:287(checks)
        6    0.000    0.000   22.940    3.823 /home/kris/code/mozilla-central/python/mozbuild/mozbuild/util.py:933(__call__)
        3    0.004    0.001   22.940    7.647 /home/kris/code/mozilla-central/python/mozbuild/mozbuild/action/check_binary.py:52(get_output)
        3    0.000    0.000   22.818    7.606 /usr/lib64/python2.7/subprocess.py:194(check_output)
        9    0.000    0.000   22.813    2.535 /usr/lib64/python2.7/subprocess.py:122(_eintr_retry_call)
        3    0.000    0.000   22.803    7.601 /usr/lib64/python2.7/subprocess.py:452(communicate)
        5   22.803    4.561   22.803    4.561 {method 'read' of 'file' objects}
        1    0.000    0.000   11.077   11.077 /home/kris/code/mozilla-central/python/mozbuild/mozbuild/action/check_binary.py:139(check_textrel)
       74    0.000    0.000   11.077    0.150 /home/kris/code/mozilla-central/python/mozbuild/mozbuild/action/check_binary.py:67(at_least_one)
       74    0.000    0.000   11.076    0.150 /home/kris/code/mozilla-central/python/mozbuild/mozbuild/action/check_binary.py:101(iter_readelf_dynamic)
        1    0.000    0.000   11.044   11.044 /home/kris/code/mozilla-central/python/mozbuild/mozbuild/action/check_binary.py:251(check_pt_load)
        1    0.492    0.492    2.229    2.229 /home/kris/code/mozilla-central/python/mozbuild/mozbuild/action/check_binary.py:165(check_nsmodules)
  1089718    0.720    0.000    0.720    0.000 {method 'endswith' of 'str' objects}
   547616    0.141    0.000    0.141    0.000 {method 'split' of 'str' objects}
        3    0.118    0.039    0.118    0.039 {method 'splitlines' of 'str' objects}
1095186/1095176    0.056    0.000    0.056    0.000 {len}
        3    0.000    0.000    0.014    0.005 /usr/lib64/python2.7/subprocess.py:334(__init__)
        3    0.000    0.000    0.014    0.005 /usr/lib64/python2.7/subprocess.py:900(_execute_child)
        3    0.010    0.003    0.010    0.003 {posix.read}
        3    0.003    0.001    0.003    0.001 {posix.fork}
```

The best option I can think of at this point is to add a build flag to skip those checks when not running in automation.
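(A profile like the one above can be collected with cProfile; here is a minimal sketch. The exact command-line arguments check_binary.py expects, `--target` plus a path to libxul.so, are assumptions for illustration.)

```python
# Minimal sketch of collecting the profile above; the arguments passed
# to check_binary.main() are an assumption, not the real CLI.
import cProfile
import pstats

from mozbuild.action import check_binary

cProfile.run(
    "check_binary.main(['--target', 'obj-dir/dist/bin/libxul.so'])",
    'check_binary.prof')
# Print the 25 most expensive calls, sorted by cumulative time.
pstats.Stats('check_binary.prof').sort_stats('cumulative').print_stats(25)
```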
Comment 1•3 years ago
I guess a corollary of the per-call overhead numbers is that it's not really worth turning off just the "expensive" checks, because they're all relatively expensive when measured against lld link times? We've always done these checks (well, not quite as completely as the Python script does); I wonder whether the Python version is just significantly more expensive, or what? Making them dependent on --enable-release seems like a reasonable option, if an unfortunate one.
Reporter
Comment 2•3 years ago
I'm not sure how much the migration to a Python script affected the run times, but they definitely seem to have gotten slower fairly recently, to the point that I decided to track down the cause. Either way, I don't think adding at least 25 seconds to each incremental build is good for developer productivity. Maybe we could move the binary checks to a separate task/build stage, so they would still run in full builds, but not in `build binaries`?
Comment 3•3 years ago
One difference bug 1471132 introduced is that we're now looking at the .symtab symbol table rather than the .dynsym one, and the former is much larger than the latter. There's not really a way around that, because the start/stop symbols are not exported, so they only appear in .symtab. I guess two questions follow: is nm particularly slow to display that list, and would something else (e.g. readelf or objdump) be faster? And is Python slow to read the list, and would pre-filtering with e.g. grep help?
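(As a rough illustration of the size difference described above, the two tables can be compared by counting nm output lines; the libxul.so path below is an assumption.)

```python
# Rough sketch comparing the sizes of the two symbol tables. Plain `nm`
# reads .symtab; `nm -D` reads .dynsym (dynamic/exported symbols only).
import subprocess

binary = 'obj-dir/dist/bin/libxul.so'  # assumed path
symtab = subprocess.check_output(['nm', '--format=posix', binary])
dynsym = subprocess.check_output(['nm', '-D', '--format=posix', binary])
print('.symtab: %d symbols' % len(symtab.splitlines()))
print('.dynsym: %d symbols' % len(dynsym.splitlines()))
```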
Comment 4•3 years ago
Kris, can you check the timings on your machine with bug 1471132 backed out?
Comment 5•3 years ago
So, on my machine this is actually even slower, and it's due to readelf stalling while spinning at 100% CPU for some reason; that's fixed in the next version of binutils. Past that, it takes less than 2s.
Comment 6•3 years ago
Assuming this doesn't break often, let's make this test conditional on --enable-release. If the checks are made faster (a few seconds), we can enable them by default again, but that can be done in a follow-up. Taking the bug to author a patch making the checks conditional on --enable-release.
Comment hidden (mozreview-request)
Comment 8•3 years ago
Here are some facts to help make an informed decision:

- While the changes in bug 1471132 made the setup less fragile than it was, when things go wrong, the result is either a crash at startup (best case), or no direct crash but random XPCOM components not being registered and initialized, meaning that things relying on them fail. Examples of components that could fail to exist because of this: nsPrefModule, nsUConvModule, nsComposerModule, nsCookieModule, etc.
- The only substantial change, from both bug 1471132 and bug 1470127, that could explain a *change* in behavior and thus performance is the flags change on nm, making it output all symbols rather than only those exported.
- The only thing that happened to be slow when I tried to reproduce this was *readelf*, and I'm not even sure it would happen with an actual build (I was downloading a build from automation, so using a different toolchain than what I have locally). readelf has been run as part of these post-link checks for years.
- The actually important test doesn't use readelf; it uses nm.
- Actually, looking again at comment 0, it's clear :kmag's problem is readelf too. See how check_textrel and check_pt_load, the two tests using readelf, take 11s each, while check_nsmodules, the one using nm, only takes 2s.

So, if we touch this at all, rather than disabling the whole thing, I'd rather disable the tests that use readelf on non --enable-release builds. But I'm also not sure we should work around what is, ultimately, a problem in one specific version of binutils by disabling tests for everybody.
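(For context, the readelf-based checks boil down to scanning `readelf -d` output; here is a simplified sketch of the TEXTREL variant. This is similar in spirit to the in-tree check_textrel, not the exact code.)

```python
# Simplified sketch of a readelf-based TEXTREL check; the in-tree
# check_textrel is similar in spirit but not identical.
import subprocess

def has_textrel(binary):
    # `readelf -d` dumps the .dynamic section; a TEXTREL entry (or a
    # FLAGS entry listing TEXTREL) means the binary has text relocations.
    output = subprocess.check_output(['readelf', '-d', binary])
    return any('TEXTREL' in line for line in output.splitlines())
```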
Reporter
Comment 9•3 years ago
(In reply to Mike Hommey [:glandium] from comment #5)
> So, on my machine this is actually even slower, and it's due to readelf
> stalling while sucking 100% CPU for some reason, and that's fixed in next
> version of binutils. Past that, it takes less than 2s.

That would probably explain why it seems to have gotten slower, yeah.

(In reply to Mike Hommey [:glandium] from comment #8)
> So, if touching this at all, rather than disable the whole thing, I'd rather
> disable the tests that use readelf on non --enable-release.
>
> But I'm also not sure we should work around what, ultimately, is a problem
> in one specific version of binutils, by disabling tests for everybody.

I think that's probably reasonable, but I also suspect that this affects most Linux users at the moment, so perhaps we should at least disable it for users who have a problematic version.

That said, I also think that skipping this check when we don't specify --enable-release is probably reasonable, at least for `build binaries` rather than a full build. Even with a fixed version of readelf, this step will still be much slower than the actual linking. And the latest version of binutils available for Arch (which is known for its aggressive updates) still has this issue, which means it will probably keep affecting most developers for a long time to come.
Comment 10•3 years ago
Comment on attachment 8989513 [details]
Bug 1472496 - Disable check_binary in developer builds;

https://reviewboard.mozilla.org/r/254550/#review262640

::: config/rules.mk:568
(Diff revision 1)
> # 2-second granularity
> touch -t `date +%Y%m%d%H%M.%S -d 'now+5seconds'` pgo.relink
> endif
> else # !WINNT || GNU_CC
> $(call EXPAND_CC_OR_CXX,$@) -o $@ $(COMPUTED_CXX_LDFLAGS) $(PGO_CFLAGS) $($(notdir $@)_$(OBJS_VAR_SUFFIX)) $(RESFILE) $(WIN32_EXE_LDFLAGS) $(LDFLAGS) $(STATIC_LIBS) $(MOZ_PROGRAM_LDFLAGS) $(SHARED_LIBS) $(OS_LIBS)
> +ifndef DEVELOPER_OPTIONS

Per #c8, I think we should put this inside check_binary.py and check buildconfig.substs['DEVELOPER_OPTIONS'] to enable check_stdcxx, check_glibc, and check_textrel. Though if it's not too far off to just make them all fast enough to run all the time, I'm also fine landing this for now, since we can back it out then.
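(A sketch of the gating suggested above, done inside check_binary.py rather than rules.mk. buildconfig.substs is the real mozbuild API; the exact wiring and the truthiness of the DEVELOPER_OPTIONS subst are assumptions here.)

```python
# Sketch of gating the expensive checks inside check_binary.py on the
# build configuration, per the review suggestion; the wiring here is
# an assumption, not the landed patch.
from mozbuild import buildconfig

def should_run_expensive_checks():
    # DEVELOPER_OPTIONS is set for local (non --enable-release) builds,
    # matching the `ifndef DEVELOPER_OPTIONS` guard in the rules.mk diff.
    return not buildconfig.substs.get('DEVELOPER_OPTIONS')
```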
Reporter
Comment 11•3 years ago
The longer this goes without being fixed, the higher the chance of me accidentally pushing a patch with a hack to disable check_binary.py...
Comment 12•3 years ago
I've been preempted by other work.
Comment 13•3 years ago
Is switching to llvm-readelf a viable solution? Also note that the readelf bug affects dependentlibs.py as well, not just check_binary.py. So even if we disable the tests, we still have to contend with much slower dependentlibs.list generation (on my machine, this goes from <200ms to 8 seconds with the buggy readelf).
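(If llvm-readelf is viable, one low-risk way to adopt it would be to prefer it when present and fall back to GNU readelf; a hypothetical sketch, since llvm-readelf accepts the same `-d` flag.)

```python
# Hypothetical sketch: prefer llvm-readelf when it's on PATH and fall
# back to GNU readelf otherwise; this is not in-tree code.
import subprocess
from distutils.spawn import find_executable

READELF = find_executable('llvm-readelf') or 'readelf'

def iter_dynamic(binary):
    # Yield the .dynamic section lines, whichever readelf we found.
    for line in subprocess.check_output([READELF, '-d', binary]).splitlines():
        yield line
```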
Assignee
Comment 14•3 years ago
Assignee
Comment 15•3 years ago
Assignee
Comment 16•3 years ago
We can pick either of the two patches here; the numbers are in the review comments.
Comment 17•2 years ago
Comment on attachment 9012110 [details]
Bug 1472496 - Don't call out to readelf in check_binary.py. r?mshal

Michael Shal [:mshal] has approved the revision.
Comment 18•2 years ago
Pushed by padenot@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/1b9e4b90413f
Don't call out to readelf in check_binary.py. r=mshal
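(The landed patch drops the readelf subprocess calls in favor of reading the ELF structures directly. The following is only a simplified illustration of that general idea, 64-bit little-endian ELF only; it is not the actual in-tree implementation.)

```python
# Simplified illustration of enumerating PT_LOAD program headers
# without shelling out to readelf; not the actual in-tree code.
import struct

PT_LOAD = 1

def pt_load_vaddrs(path):
    with open(path, 'rb') as f:
        ehdr = f.read(64)  # ELF64 header is 64 bytes
        if ehdr[:4] != b'\x7fELF':
            raise ValueError('%s is not an ELF file' % path)
        # e_phoff at offset 32; e_phentsize and e_phnum at 54 and 56.
        e_phoff, = struct.unpack_from('<Q', ehdr, 32)
        e_phentsize, e_phnum = struct.unpack_from('<HH', ehdr, 54)
        f.seek(e_phoff)
        for _ in range(e_phnum):
            phdr = f.read(e_phentsize)
            p_type, _flags, _offset, p_vaddr = struct.unpack_from('<IIQQ', phdr)
            if p_type == PT_LOAD:
                yield p_vaddr
```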
Comment 19•2 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/1b9e4b90413f
Comment 20•2 years ago
Comment on attachment 9012110 [details]
Bug 1472496 - Don't call out to readelf in check_binary.py. r?mshal
This is one of a series of patches I am requesting uplift to esr60. Please don't uplift any if the entire series won't go; the whole series will need to go in one push.
This try run (applied on tip-of-esr60 as of an hour ago, and beginning with 'Bug 1491901 - move MK*SHLIB to moz.configure') represents the patch series; it must be applied in that order: https://treeherder.mozilla.org/#/jobs?repo=try&revision=0659d6e265f3b624ad6fbac0c9cd7ce246094596 (If this try run doesn't complete successfully, I will investigate and figure out why.)
The uplift request form is the same for all of the patch series; see https://bugzilla.mozilla.org/show_bug.cgi?id=1491901#c10
Comment 21•2 years ago
Comment on attachment 9012110 [details]
Bug 1472496 - Don't call out to readelf in check_binary.py. r?mshal
approved for 60.5esr
Comment 22•2 years ago
bugherder uplift