Closed Bug 1472496 — Opened: Last year; Closed: Last year

check_binary.py is a major bottleneck for incremental builds

Categories

(Firefox Build System :: General, defect)


Tracking

(firefox-esr60 fixed, firefox64 fixed)

RESOLVED FIXED
mozilla64

People

(Reporter: kmag, Assigned: padenot)

References

Details

Attachments

(2 files, 1 obsolete file)

My libxul.so link times have been pretty unmanageable for incremental builds lately. I thought the linker was the bottleneck, but when I looked closer, I realized that lld takes about 3 seconds, and then check_binary.py takes at least 25 seconds. For debug builds, it takes 2-3x as long.

It would probably help to split the various checks into parallel tasks, but according to the python profiler, the biggest bottlenecks are the helper apps we call, and the per-call overhead for those is ~8s, so it would still be pretty slow:

         2738706 function calls (2738670 primitive calls) in 24.352 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   24.352   24.352 <string>:1(<module>)
        1    0.000    0.000   24.352   24.352 /home/kris/code/mozilla-central/python/mozbuild/mozbuild/action/check_binary.py:318(main)
        1    0.000    0.000   24.351   24.351 /home/kris/code/mozilla-central/python/mozbuild/mozbuild/action/check_binary.py:287(checks)
        6    0.000    0.000   22.940    3.823 /home/kris/code/mozilla-central/python/mozbuild/mozbuild/util.py:933(__call__)
        3    0.004    0.001   22.940    7.647 /home/kris/code/mozilla-central/python/mozbuild/mozbuild/action/check_binary.py:52(get_output)
        3    0.000    0.000   22.818    7.606 /usr/lib64/python2.7/subprocess.py:194(check_output)
        9    0.000    0.000   22.813    2.535 /usr/lib64/python2.7/subprocess.py:122(_eintr_retry_call)
        3    0.000    0.000   22.803    7.601 /usr/lib64/python2.7/subprocess.py:452(communicate)
        5   22.803    4.561   22.803    4.561 {method 'read' of 'file' objects}
        1    0.000    0.000   11.077   11.077 /home/kris/code/mozilla-central/python/mozbuild/mozbuild/action/check_binary.py:139(check_textrel)
       74    0.000    0.000   11.077    0.150 /home/kris/code/mozilla-central/python/mozbuild/mozbuild/action/check_binary.py:67(at_least_one)
       74    0.000    0.000   11.076    0.150 /home/kris/code/mozilla-central/python/mozbuild/mozbuild/action/check_binary.py:101(iter_readelf_dynamic)
        1    0.000    0.000   11.044   11.044 /home/kris/code/mozilla-central/python/mozbuild/mozbuild/action/check_binary.py:251(check_pt_load)
        1    0.492    0.492    2.229    2.229 /home/kris/code/mozilla-central/python/mozbuild/mozbuild/action/check_binary.py:165(check_nsmodules)
  1089718    0.720    0.000    0.720    0.000 {method 'endswith' of 'str' objects}
   547616    0.141    0.000    0.141    0.000 {method 'split' of 'str' objects}
        3    0.118    0.039    0.118    0.039 {method 'splitlines' of 'str' objects}
1095186/1095176    0.056    0.000    0.056    0.000 {len}
        3    0.000    0.000    0.014    0.005 /usr/lib64/python2.7/subprocess.py:334(__init__)
        3    0.000    0.000    0.014    0.005 /usr/lib64/python2.7/subprocess.py:900(_execute_child)
        3    0.010    0.003    0.010    0.003 {posix.read}
        3    0.003    0.001    0.003    0.001 {posix.fork}
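One way the "split the various checks into parallel tasks" idea from the opening comment could be sketched: the real checks block on reading nm/readelf output, so a thread pool can overlap those waits. The check names below mirror check_binary.py, but the bodies are stand-ins, not the actual implementations.

```python
# Sketch only: the real checks shell out to nm and readelf and block on
# reading their output, so a thread pool overlaps the waits. The check
# bodies here are stand-ins for the real check_binary.py functions.
from concurrent.futures import ThreadPoolExecutor

def check_textrel(binary):
    return "TEXTREL ok"

def check_pt_load(binary):
    return "PT_LOAD ok"

def check_nsmodules(binary):
    return "NSModules ok"

def run_checks(binary):
    checks = (check_textrel, check_pt_load, check_nsmodules)
    with ThreadPoolExecutor(max_workers=len(checks)) as pool:
        # Submit everything first, then collect, so the subprocess
        # waits run concurrently rather than back to back.
        futures = [pool.submit(check, binary) for check in checks]
        return [f.result() for f in futures]

print(run_checks("libxul.so"))
```

Per the profile above, though, the wall-clock win is bounded by the slowest single check (~11s), which is why this alone wouldn't make the step cheap.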


The best option I can think of at this point is to add a build flag to skip those checks when not running in automation.
I guess a corollary of the per-call overhead numbers is that it's not really worth turning off only the "expensive" checks: measured against lld link times, they're all relatively expensive?

We've always done these checks (well, not quite as completely as the Python script does). I wonder whether the Python version is just significantly more expensive, or what?

Making them dependent on --enable-release seems like a reasonable option, if an unfortunate one.
Blocks: 1470127
I'm not sure how much the migration to a python script affected the run times, but they definitely do seem to have gotten slower fairly recently, to the point that I decided to track down the cause.

Either way, I don't think that adding at least 25 seconds to each incremental build is good for developer productivity.

Maybe we could move the binary checks to a separate task/build stage, so they would still run in full builds, but not in `build binaries`?
One change that bug 1471132 made is that it now looks at the .symtab symbol table rather than the .dynsym one, and the former is much larger than the latter. And there's not really a way around that, because the start/stop symbols are not exported, so they only appear in .symtab. I guess there are two questions that follow: is nm particularly slow to display that list, and would something else (e.g. readelf or objdump) be faster? And is Python slow to read the list, and would pre-filtering with e.g. grep help?
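As a rough illustration of the pre-filtering question: keep only the lines naming the start/stop markers with a cheap substring test before doing any per-line splitting. The sample nm output and symbol names below are made up for the sketch, not taken from an actual libxul.

```python
# Hypothetical pre-filter sketch: instead of splitting every nm output
# line in Python, first discard everything but the start/stop marker
# lines. The sample output and marker names are illustrative only.
SAMPLE_NM_OUTPUT = """\
00000000032a0000 D __start_kPStaticModules
00000000032a4000 D __stop_kPStaticModules
00000000001a2b30 t some_local_symbol
00000000001a2c10 T NS_InitXPCOM2
"""

def filter_markers(nm_output, markers=("__start_", "__stop_")):
    # A cheap substring test per line avoids str.split() on the
    # hundreds of thousands of symbols a debug .symtab can contain.
    return [line for line in nm_output.splitlines()
            if any(m in line for m in markers)]

print(len(filter_markers(SAMPLE_NM_OUTPUT)))
```

This is the same effect as piping nm through grep before the Python side ever sees the list.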
Kris, can you check the timings on your machine with bug 1471132 backed out?
Flags: needinfo?(kmaglione+bmo)
So, on my machine this is actually even slower, and it's due to readelf stalling while sucking 100% CPU for some reason, which is fixed in the next version of binutils. Past that, it takes less than 2s.
Assuming this doesn't break often, let's make this test conditional on --enable-release. If the checks are made faster (a few seconds), we can make it enabled by default. But that can be done in a follow-up.

Taking bug to author patch to make conditional on --enable-release.
Assignee: nobody → gps
Status: NEW → ASSIGNED
Here are facts to help make an informed decision:
- While the changes in bug 1471132 made the setup less fragile than it was, when things go wrong the result is either a crash at startup (best case), or no direct crash but random XPCOM components that aren't registered and initialized, meaning that things relying on them fail. Examples of components that could fail to exist because of this: nsPrefModule, nsUConvModule, nsComposerModule, nsCookieModule, etc.
- The only substantial change, from both bug 1471132 and bug 1470127, that could explain a *change* in behavior and thus performance, is the flags change on nm, making it output all symbols rather than only those exported.
- The only thing that happened to be slow when I tried to reproduce this was *readelf*, and I'm not even sure it would happen with an actual build (because I was downloading a build from automation, so using a different toolchain than what I had locally). It has been run as part of those post-linkage checks for years.
- The actually important test doesn't use readelf, it uses nm.
- Actually, now looking again at comment 0, it's clear :kmag's problem is readelf too. See how check_textrel and check_pt_load, which are the two tests using readelf, take 11s each. check_nsmodules, which is the one using nm, only takes 2s.

So, if touching this at all, rather than disable the whole thing, I'd rather disable the tests that use readelf on non --enable-release.

But I'm also not sure we should work around what, ultimately, is a problem in one specific version of binutils, by disabling tests for everybody.
(In reply to Mike Hommey [:glandium] from comment #5)
> So, on my machine this is actually even slower, and it's due to readelf
> stalling while sucking 100% CPU for some reason, and that's fixed in next
> version of binutils. Past that, it takes less than 2s.

That would probably explain why it seems to have gotten slower, yeah.

(In reply to Mike Hommey [:glandium] from comment #8)
> So, if touching this at all, rather than disable the whole thing, I'd rather
> disable the tests that use readelf on non --enable-release.
> 
> But I'm also not sure we should work around what, ultimately, is a problem
> in one specific version of binutils, by disabling tests for everybody.

I think that's probably reasonable, but I also suspect that this affects most Linux users at the moment, so perhaps we should at least disable it for users who have a problematic version.

That said, I also think that skipping this check when we don't specify --enable-release is probably reasonable, at least for `build binaries` rather than a full build. Even with a fixed version of readelf, it seems like this step will still be much slower than the actual linking.

And the latest version of binutils available for Arch (which is known for its aggressive updates) still has this issue, which means it will probably still be affecting most developers for a long time to come.
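The "disable it for users who have a problematic version" idea could be sketched by parsing the first line of `readelf --version` output. The 2.32 cutoff below is purely illustrative; the actual binutils release containing the fix would need to be confirmed.

```python
import re

def readelf_is_suspect(version_line, fixed=(2, 32)):
    # Treat anything older than the (illustrative) fixed release as
    # suspect; if the version can't be parsed at all, err on the side
    # of skipping the readelf-based checks.
    m = re.search(r"(\d+)\.(\d+)", version_line)
    if not m:
        return True
    return (int(m.group(1)), int(m.group(2))) < fixed

print(readelf_is_suspect("GNU readelf (GNU Binutils) 2.30"))
print(readelf_is_suspect("GNU readelf (GNU Binutils) 2.32"))
```

In practice the version line would come from running `readelf --version` once and caching the result, rather than parsing a literal string.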
Flags: needinfo?(kmaglione+bmo)
Comment on attachment 8989513 [details]
Bug 1472496 - Disable check_binary in developer builds;

https://reviewboard.mozilla.org/r/254550/#review262640

::: config/rules.mk:568
(Diff revision 1)
>  # 2-second granularity
>  	touch -t `date +%Y%m%d%H%M.%S -d 'now+5seconds'` pgo.relink
>  endif
>  else # !WINNT || GNU_CC
>  	$(call EXPAND_CC_OR_CXX,$@) -o $@ $(COMPUTED_CXX_LDFLAGS) $(PGO_CFLAGS) $($(notdir $@)_$(OBJS_VAR_SUFFIX)) $(RESFILE) $(WIN32_EXE_LDFLAGS) $(LDFLAGS) $(STATIC_LIBS) $(MOZ_PROGRAM_LDFLAGS) $(SHARED_LIBS) $(OS_LIBS)
> +ifndef DEVELOPER_OPTIONS

Per #c8, I think we should put this inside check_binary.py and check buildconfig.substs['DEVELOPER_OPTIONS'] to enable check_stdcxx, check_glibc, and check_textrel. Though if it's not too far off to just make them all fast enough to run all the time, I'm also fine landing this for now since we can back it out then.
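A minimal sketch of that suggestion, hedged: the check names mirror check_binary.py, but the substs dict and the wiring are assumptions standing in for buildconfig.substs, not the actual code.

```python
# Sketch of gating inside check_binary.py: run the nm-based check
# always, and the readelf-based ones only when DEVELOPER_OPTIONS is
# unset (i.e. an --enable-release style build). The substs dict stands
# in for buildconfig.substs.
def checks(target, binary, substs):
    to_run = ["check_nsmodules"]  # nm-based, cheap enough to keep
    if not substs.get("DEVELOPER_OPTIONS"):
        # readelf-based checks, only for release-style builds
        to_run += ["check_textrel", "check_pt_load"]
    return to_run

print(checks("linux", "libxul.so", {"DEVELOPER_OPTIONS": "1"}))
print(checks("linux", "libxul.so", {}))
```

The benefit over the rules.mk approach is that the policy lives next to the checks themselves rather than in the build glue.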
Attachment #8989513 - Flags: review+
Attachment #8989513 - Flags: review?(core-build-config-reviews)
The longer this goes without being fixed, the higher the chance of me accidentally pushing a patch with a hack to disable check_binary.py...
I've been preempted by other work.
Assignee: gps → nobody
Status: ASSIGNED → NEW
Is switching to llvm-readelf a viable solution?

Also note that the readelf bug affects dependentlibs.py as well, not just check_binary.py. So even if we disable the tests we still have to contend with a much slower dependentlibs.list generation (on my machine, this goes from <200ms to 8 seconds with the buggy readelf).
Attachment #9006304 - Attachment description: Bug 1472496 - Don't do the checks in check_binary.py in developer builds. r?mshal → Bug 1472496 - Don't call out to check_binary.py in developer builds. r?mshal
We can pick one of the two patches here; the numbers are in the review comments.
Comment on attachment 9012110 [details]
Bug 1472496 - Don't call out to readelf in check_binary.py. r?mshal

Michael Shal [:mshal] has approved the revision.
Attachment #9012110 - Flags: review+
Attachment #9006304 - Attachment is obsolete: true
Pushed by padenot@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/1b9e4b90413f
Don't call out to readelf in check_binary.py. r=mshal
https://hg.mozilla.org/mozilla-central/rev/1b9e4b90413f
Status: NEW → RESOLVED
Closed: Last year
Resolution: --- → FIXED
Target Milestone: --- → mozilla64
Assignee: nobody → padenot

Comment on attachment 9012110 [details]
Bug 1472496 - Don't call out to readelf in check_binary.py. r?mshal

This is one of series of patches I am requesting uplift to esr60. Please don't uplift any if the entire series won't go. The whole series will need to go in one push.

This try run (applied on tip-of-esr60 as of an hour ago; and beginning with 'Bug 1491901 - move MK*SHLIB to moz.configure') represents the patch series. It must be applied in that order: https://treeherder.mozilla.org/#/jobs?repo=try&revision=0659d6e265f3b624ad6fbac0c9cd7ce246094596 (If this try run doesn't complete successfully, I will investigate and figure out why)

The uplift request form is the same for all of the patch series; see https://bugzilla.mozilla.org/show_bug.cgi?id=1491901#c10

Attachment #9012110 - Flags: approval-mozilla-esr60?

Comment on attachment 9012110 [details]
Bug 1472496 - Don't call out to readelf in check_binary.py. r?mshal

approved for 60.5esr

Attachment #9012110 - Flags: approval-mozilla-esr60? → approval-mozilla-esr60+