Figure out the differences between normal Windows builds and cross builds
Categories
(Firefox Build System :: General, task, P3)
Tracking
(Not tracked)
People
(Reporter: glandium, Unassigned)
References
Details
So far, what I have identified:
- missing manifests in binaries (bug 1618766)
- differences in generated STL wrappers (bug 1618760)
- differences in python preprocessor output (bug 1618775)
- differences in line terminations in generated files, but it probably doesn't matter
- differences in order in dependentlibs.list, but I'm actually wondering if the MSVC runtime libraries should appear at all in there.
- some differences in order in telemetry related headers that need some further investigation.
- difference in file modes in xpis and omni.ja, presumably due to https://searchfox.org/mozilla-central/rev/b2ccce862ef38d0d150fcac2b597f7f20091a0c7/python/mozbuild/mozpack/copier.py#586,598 and I'm not 100% sure what to do about it yet, or whether to do anything.
- differences in buildconfig.html, duh
- missing signature files for NSS, required for FIPS, and we'll have to decide what to do about this (for reference, we decided to not care on Mac, so Mac builds can't enable FIPS mode)
- other differences in .dlls and .exes, that don't go away with -Brepro
David, can you take a look at the last item?
This try has reduced differences, and -Brepro set (although I forgot to add it to JS, but that only matter for xul.dll and the others dlls/exes have differences): https://treeherder.mozilla.org/#/jobs?repo=try&revision=6df39670221e96794b1a4af8831337871716fed9
Reporter | ||
Comment 1•3 years ago
|
||
Oh I forgot to add:
- one #include difference in a C file generated by midl, which is weird:
diff -ruNw a/accessible/ipc/win/handler/HandlerData_c.c b/accessible/ipc/win/handler/HandlerData_c.c
--- a/accessible/ipc/win/handler/HandlerData_c.c 2016-01-01 09:00:00.000000000 +0900
+++ b/accessible/ipc/win/handler/HandlerData_c.c 2016-01-01 09:00:00.000000000 +0900
@@ -30,6 +30,7 @@
#include <string.h>
+#include <malloc.h>
#include "HandlerData.h"
#define TYPE_FORMAT_STRING_SIZE 1259
- differences in rust hashes for crates, which will impact symbols, presumably, and makes it harder to check the differences between the generated files from rust, which I haven't looked at yet.
Comparing dumpbin /headers
on AccessibleHandler.dll:
Debug Directories
Time Type Size RVA Pointer
-------- ------- -------- -------- --------
- 724C1E60 cv 68 000211B6 205B6 Format: RSDS, {6AC21CE1-1D39-9641-4C4C-44205044422E}, 1, z:\build\build\src\obj-firefox\accessible\ipc\win\handler\AccessibleHandler.pdb
+ A1CA305B cv 78 000211B6 205B6 Format: RSDS, {69CFE670-87BF-E3C0-4C4C-44205044422E}, 1, /builds/worker/workspace/build/src/obj-firefox/accessible/ipc/win/handler/AccessibleHandler.pdb
Also the header for .rdata says it's 0x10 bytes larger in the cross build. Since the pdb path is exactly 16 chars longer, I wonder if that's why.
(I'll keep looking at more files.)
Reporter | ||
Comment 3•3 years ago
|
||
Mmmm I'll try to do a try build in /builds/worker/src
instead of /builds/worker/workspace/build/src
. That will make them the same length even if different. That should make other causes of differences, if there are any left, more apparent.
I ran Chromium's ShowGlobals
on AccessibleHandler.dll, and both builds have the same globals and functions, in the same order, with the same size -- so the .text is looking really good!
I think at this point I'll pause and wait for your next build, that ought to cut down a lot of noise in the diffs.
Some of the tips at http://blog.llvm.org/2019/11/deterministic-builds-with-clang-and-lld.html might be helpful. We might be able to cook up some relative PDB paths resolved against a fixed root, although there's still the matter of the slashes.
Reporter | ||
Comment 5•3 years ago
|
||
https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=291038154&revision=7b27fdd39508e58c5debba7865e826cd1d1af684
(windows native build is not finished yet)
(In reply to Mike Hommey [:glandium] from comment #5)
https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=291038154&revision=7b27fdd39508e58c5debba7865e826cd1d1af684
(windows native build is not finished yet)
This still shows
- F5E13421 cv 68 000211B6 205B6 Format: RSDS, {47253FEE-FEA0-98CC-4C4C-44205044422E}, 1, z:\build\build\src\obj-firefox\accessible\ipc\win\handler\AccessibleHandler.pdb
+ 16DFD631 cv 78 000211B6 205B6 Format: RSDS, {A1C6D171-E64B-9330-4C4C-44205044422E}, 1, /builds/worker/workspace/build/src/obj-firefox/accessible/ipc/win/handler/AccessibleHandler.pdb
Was that push supposed to fix that?
Reporter | ||
Comment 7•3 years ago
|
||
gah, apparently mozharness is trying to be clever.
Reporter | ||
Comment 8•3 years ago
|
||
Updated•3 years ago
|
firefox.exe has paths appear in __FILE__
s of chromium CHECK
s. And AccessibleHandler.dll seems to have some differences in .pdata, but I haven't investigated yet.
Reporter | ||
Comment 10•3 years ago
|
||
Here's another with -Brepro set on js too:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=634d8e6d0ab288364054d76fb38f29c0747d6fd2
As noted on Matrix, comparing two independent cross builds on the same changeset yields only differences in uninstall/helper.exe
: https://firefoxci.taskcluster-artifacts.net/UT7nz8GQQU2JB2E9usWAOA/0/public/diff.html (and that was without -Brepro in js/)
Reporter | ||
Updated•3 years ago
|
Reporter | ||
Comment 11•3 years ago
|
||
Fresh try push now that windows opt builds are cross builds, re-adding the native ones: https://treeherder.mozilla.org/#/jobs?repo=try&revision=fa4440c7078176cd7b6a734f8ed60a09ec611cec
The telemetry differences are gone, presumably thanks to bug 1620140.
Reporter | ||
Comment 12•3 years ago
|
||
makes it harder to check the differences between the generated files from rust, which I haven't looked at yet.
Now looked at them. The only differences are:
CROSS_COMPILE
being defined in the bindings generated by neqo-crypto, presumably from mozilla-config.h.- baldrdash has bindings generated twice instead of once, presumably once for the host and once for the target.
- style has obvious paths differences in the
#[path]
,include!
andinclude_str!
it contains.
![]() |
||
Comment 13•3 years ago
|
||
Fresh try push now that windows opt builds are cross builds, re-adding the native ones: https://treeherder.mozilla.org/#/jobs?repo=try&revision=fa4440c7078176cd7b6a734f8ed60a09ec611cec
In that push, the target.crashreporter.symbols-full.zip from the cross build is missing symbols for the .exe files.
Reporter | ||
Comment 14•3 years ago
|
||
(In reply to :dmajor from comment #13)
Fresh try push now that windows opt builds are cross builds, re-adding the native ones: https://treeherder.mozilla.org/#/jobs?repo=try&revision=fa4440c7078176cd7b6a734f8ed60a09ec611cec
In that push, the target.crashreporter.symbols-full.zip from the cross build is missing symbols for the .exe files.
It's not limited to that push. Filed bug 1621488.
Reporter | ||
Comment 15•3 years ago
|
||
Here's another try with bug 1621488 fixed: https://treeherder.mozilla.org/#/jobs?repo=try&revision=cd15b5259a896dc81ca915027ed0ae15bfe61cb7
![]() |
||
Comment 16•3 years ago
|
||
Here's another try with bug 1621488 fixed: https://treeherder.mozilla.org/#/jobs?repo=try&revision=cd15b5259a896dc81ca915027ed0ae15bfe61cb7
For that push, here are the differences in size of xul.dll functions:
< 0 0 19762 1 celt_encode_with_ec xul.pdb
> 0 0 19778 1 celt_encode_with_ec xul.pdb
< 0 0 11713 1 av1_predict_intra_block xul.pdb
> 0 0 11641 1 av1_predict_intra_block xul.pdb
> 0 0 5311 1 vorbis_synthesis_blockin xul.pdb
< 0 0 5271 1 vorbis_synthesis_blockin xul.pdb
< 0 0 4292 1 spirv_cross::CompilerGLSL::to_texture_op xul.pdb
> 0 0 4220 1 spirv_cross::CompilerGLSL::to_texture_op xul.pdb
< 0 0 2106 1 mozilla::dom::SVGRectElement::GetGeometryBounds xul.pdb
> 0 0 2101 1 mozilla::dom::SVGRectElement::GetGeometryBounds xul.pdb
< 0 0 967 1 compute_intersection xul.pdb
> 0 0 961 1 compute_intersection xul.pdb
< 0 0 801 1 GetUnretargetedOffsetsFor xul.pdb
> 0 0 799 1 GetUnretargetedOffsetsFor xul.pdb
< 0 0 792 1 core::ptr::real_drop_in_place<std::sync::mutex::Mutex<webrender::platform::windows::font::FontContext>> xul.pdb
> 0 0 707 1 SkMatrix::Poly4Proc xul.pdb
< 0 0 705 1 SkMatrix::Poly4Proc xul.pdb
> 0 0 696 1 core::ptr::real_drop_in_place<webrender::platform::windows::font::FontContext> xul.pdb
> 0 0 534 1 SkBaseShadowTessellator::handleLine xul.pdb
< 0 0 530 1 SkBaseShadowTessellator::handleLine xul.pdb
> 0 0 516 1 alloc::sync::Arc<webrender::glyph_rasterizer::FontContexts>::drop_slow<webrender::glyph_rasterizer::FontContexts> xul.pdb
< 0 0 279 1 alloc::sync::Arc<webrender::glyph_rasterizer::FontContexts>::drop_slow<webrender::glyph_rasterizer::FontContexts> xul.pdb
I looked at a few of these, and they are mostly innocuous things like
xul!celt_encode_with_ec+0x2d9d [z:\build\build\src\media\libopus\celt\celt_encoder.c @ 2040]:
2040 00000001`82eb94bd 83bd8000000000 cmp dword ptr [rbp+80h],0
xul!celt_encode_with_ec+0x2d9d [/builds/worker/src/media/libopus/celt/celt_encoder.c @ 2040]:
2040 00000001`82eb949d 4585ed test r13d,r13d
where r13d already equals [rbp+80h] in both cases, but the compiler only chose to make use of that fact in one case, which led to a shorter instruction.
The only suspicious thing was the last pair, where a function went from 279 bytes to 516, but it looks like just a single inlining decision.
All in all, I don't see any of the blatant differences that would be expected if we forgot to include some component altogether. These builds look good to me.
![]() |
||
Comment 17•3 years ago
|
||
If you were so inclined, the LLVM folks would be interested in a bug report about significantly shorter encodings being chosen based on the host system for compilation.
![]() |
||
Comment 18•3 years ago
|
||
I think host system is a red herring, I suspect this was nondeterminism that could have happened just as well in multiple runs on the same OS. Maybe we could test that with some rebuilds.
Reporter | ||
Comment 19•3 years ago
•
|
||
I think host system is not a red herring. I have 11 cross builds on this try that only differ in uninstall/helper.exe (which comes from nsis): https://treeherder.mozilla.org/#/jobs?repo=try&revision=0ceaca09a0af18c62ecac1fdb14aab8da7cacaf5
(and that's with less extra flags than the try pushes with cross vs. native)
Reporter | ||
Comment 20•3 years ago
|
||
I wonder if some things are implicitly -march=native, and if the difference could come from different host CPUs rather than host OS. On try, all Linux builds happen on the same AWS instance type.
Reporter | ||
Comment 21•3 years ago
|
||
So, when doing the same on Windows, different retries get different set of differences. That doesn't seem to correlate to different AWS instance types.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=1a207fc50da0a6c890a511cf8287aa6ffa5b9520
Reporter | ||
Comment 22•3 years ago
|
||
Although I now realize I forgot to disable sccache, so that factors in...
Reporter | ||
Comment 23•3 years ago
|
||
This is a try with 11 native Windows builds with minimized differences and sccache disabled:
https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=293447249&revision=76092018c3a3d461e5d1cd987a47cd5fa5324d90
There seems to be two different outcomes:
- One where the differences are small: https://firefoxci.taskcluster-artifacts.net/eCA0mldhRRq_hcU2RmNFew/0/public/diff.html
- One where they aren't: https://firefoxci.taskcluster-artifacts.net/QY6fsm6RT6yBxlm9Jtlwfg/0/public/diff.html
There doesn't seem to be any correlation with the AWS instance type.
![]() |
||
Comment 24•3 years ago
|
||
In the case where the differences are small, the first difference is the TimeDateStamp
field of the COFF header (and again a couple times later in the file) and the last difference is in the PDB70 signature -- although I'm kind of surprised lld is in that codepath, does it think we're building mingw? And, you'd think the hashes would be reproducible, too...
![]() |
||
Comment 25•3 years ago
|
||
In the case of large differences, going by the "XUL section sizes opt native-repr", libxul is always 880 bytes larger.
In one pair of builds that I looked at, there is an extra copy of a Rust function:
0:000> x /v xul!core::iter::adapters*spirv_cross_internal::spirv::EntryPoint,spirv_cross_internal::ErrorCode>
prv func 00000001`83f05bb0 33b xul!core::iter::adapters::{{impl}}::next<core::iter::adapters::Map<core::ops::range::Range<usize>, closure-0>,spirv_cross_internal::spirv::EntryPoint,spirv_cross_internal::ErrorCode> (struct core::iter::adapters::ResultShunt<core::iter::adapters::Map<core::ops::range::Range<usize>, closure-0>, spirv_cross_internal::ErrorCode> *)
vs
0:000> x /v xul!core::iter::adapters*spirv_cross_internal::spirv::EntryPoint,spirv_cross_internal::ErrorCode>
prv func 00000001`83f05bb0 33b xul!core::iter::adapters::{{impl}}::next<core::iter::adapters::Map<core::ops::range::Range<usize>, closure-0>,spirv_cross_internal::spirv::EntryPoint,spirv_cross_internal::ErrorCode> (struct core::iter::adapters::ResultShunt<core::iter::adapters::Map<core::ops::range::Range<usize>, closure-0>, spirv_cross_internal::ErrorCode> *)
prv func 00000001`83f10140 33b xul!core::iter::adapters::{{impl}}::next<core::iter::adapters::Map<core::ops::range::Range<usize>, closure-0>,spirv_cross_internal::spirv::EntryPoint,spirv_cross_internal::ErrorCode> (struct core::iter::adapters::ResultShunt<core::iter::adapters::Map<core::ops::range::Range<usize>, closure-0>, spirv_cross_internal::ErrorCode> *)
I only looked at that one pair, but based on sizes I assume it's always the same function duplicated.
All three copies of the function (the one in the base binary, and the two in the larger binary) have identical code, modulo jump offsets. So this looks like an incomplete ICF in the larger binary.
![]() |
||
Comment 26•3 years ago
|
||
Maybe we could try setting -linkrepro
to the artifact upload directory to see if we can reproduce the sometimes-ICF in multiple runs on a local machine.
![]() |
||
Comment 27•3 years ago
|
||
![]() |
||
Comment 28•3 years ago
|
||
(In reply to :dmajor from comment #27)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=5ece95aae8ea41cab244b6f1b8e896edf5360c3b
In 15 rebuilds of the repro.tar from the Bn
task, I always got the same size xul.dll. But building the repro.tar from the Bnr
task (I only tried once), I got a slightly larger xul.dll. It appears that the lld-link step is fine from a determinism standpoint; the differences are already present in the inputs to the linker. I do see a difference in size in the two gkrust.lib
s.
Updated•8 months ago
|
Description
•