Open Bug 1618784 Opened 1 month ago Updated 13 days ago

Figure out the differences between normal Windows builds and cross builds

Categories

(Firefox Build System :: General, task, P3)

task

Tracking

(Not tracked)

People

(Reporter: glandium, Unassigned)

References

Details

So far, what I have identified:

  • missing manifests in binaries (bug 1618766)
  • differences in generated STL wrappers (bug 1618760)
  • differences in python preprocessor output (bug 1618775)
  • differences in line terminations in generated files, but it probably doesn't matter
  • differences in order in dependentlibs.list, but I'm actually wondering if the MSVC runtime libraries should appear at all in there.
  • some differences in order in telemetry related headers that need some further investigation.
  • difference in file modes in xpis and omni.ja, presumably due to https://searchfox.org/mozilla-central/rev/b2ccce862ef38d0d150fcac2b597f7f20091a0c7/python/mozbuild/mozpack/copier.py#586,598 and I'm not 100% sure what to do about it yet, or whether to do anything.
  • differences in buildconfig.html, duh
  • missing signature files for NSS, required for FIPS, and we'll have to decide what to do about this (for reference, we decided to not care on Mac, so Mac builds can't enable FIPS mode)
  • other differences in .dlls and .exes, that don't go away with -Brepro

David, can you take a look at the last item?
This try has reduced differences, and -Brepro set (although I forgot to add it to JS, but that only matter for xul.dll and the others dlls/exes have differences): https://treeherder.mozilla.org/#/jobs?repo=try&revision=6df39670221e96794b1a4af8831337871716fed9

Flags: needinfo?(dmajor)

Oh I forgot to add:

  • one #include difference in a C file generated by midl, which is weird:
diff -ruNw a/accessible/ipc/win/handler/HandlerData_c.c b/accessible/ipc/win/handler/HandlerData_c.c
--- a/accessible/ipc/win/handler/HandlerData_c.c        2016-01-01 09:00:00.000000000 +0900
+++ b/accessible/ipc/win/handler/HandlerData_c.c        2016-01-01 09:00:00.000000000 +0900
@@ -30,6 +30,7 @@
 
 #include <string.h>
 
+#include <malloc.h>
 #include "HandlerData.h"
 
 #define TYPE_FORMAT_STRING_SIZE   1259                              
  • differences in rust hashes for crates, which will impact symbols, presumably, and makes it harder to check the differences between the generated files from rust, which I haven't looked at yet.

Comparing dumpbin /headers on AccessibleHandler.dll:

   Debug Directories
 
         Time Type        Size      RVA  Pointer
     -------- ------- -------- -------- --------
-    724C1E60 cv            68 000211B6    205B6    Format: RSDS, {6AC21CE1-1D39-9641-4C4C-44205044422E}, 1, z:\build\build\src\obj-firefox\accessible\ipc\win\handler\AccessibleHandler.pdb
+    A1CA305B cv            78 000211B6    205B6    Format: RSDS, {69CFE670-87BF-E3C0-4C4C-44205044422E}, 1, /builds/worker/workspace/build/src/obj-firefox/accessible/ipc/win/handler/AccessibleHandler.pdb

Also the header for .rdata says it's 0x10 bytes larger in the cross build. Since the pdb path is exactly 16 chars longer, I wonder if that's why.

(I'll keep looking at more files.)

Mmmm I'll try to do a try build in /builds/worker/src instead of /builds/worker/workspace/build/src. That will make them the same length even if different. That should make other causes of differences, if there are any left, more apparent.

I ran Chromium's ShowGlobals on AccessibleHandler.dll, and both builds have the same globals and functions, in the same order, with the same size -- so the .text is looking really good!

I think at this point I'll pause and wait for your next build, that ought to cut down a lot of noise in the diffs.

Some of the tips at http://blog.llvm.org/2019/11/deterministic-builds-with-clang-and-lld.html might be helpful. We might be able to cook up some relative PDB paths resolved against a fixed root, although there's still the matter of the slashes.

Flags: needinfo?(dmajor)

(In reply to Mike Hommey [:glandium] from comment #5)

https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=291038154&revision=7b27fdd39508e58c5debba7865e826cd1d1af684
(windows native build is not finished yet)

This still shows

-    F5E13421 cv            68 000211B6    205B6    Format: RSDS, {47253FEE-FEA0-98CC-4C4C-44205044422E}, 1, z:\build\build\src\obj-firefox\accessible\ipc\win\handler\AccessibleHandler.pdb
+    16DFD631 cv            78 000211B6    205B6    Format: RSDS, {A1C6D171-E64B-9330-4C4C-44205044422E}, 1, /builds/worker/workspace/build/src/obj-firefox/accessible/ipc/win/handler/AccessibleHandler.pdb

Was that push supposed to fix that?

gah, apparently mozharness is trying to be clever.

Priority: -- → P3

firefox.exe has paths appear in __FILE__s of chromium CHECKs. And AccessibleHandler.dll seems to have some differences in .pdata, but I haven't investigated yet.

Here's another with -Brepro set on js too:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=634d8e6d0ab288364054d76fb38f29c0747d6fd2

As noted on Matrix, comparing two independent cross builds on the same changeset yields only differences in uninstall/helper.exe: https://firefoxci.taskcluster-artifacts.net/UT7nz8GQQU2JB2E9usWAOA/0/public/diff.html (and that was without -Brepro in js/)

Fresh try push now that windows opt builds are cross builds, re-adding the native ones: https://treeherder.mozilla.org/#/jobs?repo=try&revision=fa4440c7078176cd7b6a734f8ed60a09ec611cec

The telemetry differences are gone, presumably thanks to bug 1620140.

makes it harder to check the differences between the generated files from rust, which I haven't looked at yet.

Now looked at them. The only differences are:

  • CROSS_COMPILE being defined in the bindings generated by neqo-crypto, presumably from mozilla-config.h.
  • baldrdash has bindings generated twice instead of once, presumably once for the host and once for the target.
  • style has obvious paths differences in the #[path], include! and include_str! it contains.

Fresh try push now that windows opt builds are cross builds, re-adding the native ones: https://treeherder.mozilla.org/#/jobs?repo=try&revision=fa4440c7078176cd7b6a734f8ed60a09ec611cec

In that push, the target.crashreporter.symbols-full.zip from the cross build is missing symbols for the .exe files.

(In reply to :dmajor from comment #13)

Fresh try push now that windows opt builds are cross builds, re-adding the native ones: https://treeherder.mozilla.org/#/jobs?repo=try&revision=fa4440c7078176cd7b6a734f8ed60a09ec611cec

In that push, the target.crashreporter.symbols-full.zip from the cross build is missing symbols for the .exe files.

It's not limited to that push. Filed bug 1621488.

Here's another try with bug 1621488 fixed: https://treeherder.mozilla.org/#/jobs?repo=try&revision=cd15b5259a896dc81ca915027ed0ae15bfe61cb7

For that push, here are the differences in size of xul.dll functions:

< 0	0	 19762	1	celt_encode_with_ec	xul.pdb
> 0	0	 19778	1	celt_encode_with_ec	xul.pdb
< 0	0	 11713	1	av1_predict_intra_block	xul.pdb
> 0	0	 11641	1	av1_predict_intra_block	xul.pdb
> 0	0	  5311	1	vorbis_synthesis_blockin	xul.pdb
< 0	0	  5271	1	vorbis_synthesis_blockin	xul.pdb
< 0	0	  4292	1	spirv_cross::CompilerGLSL::to_texture_op	xul.pdb
> 0	0	  4220	1	spirv_cross::CompilerGLSL::to_texture_op	xul.pdb
< 0	0	  2106	1	mozilla::dom::SVGRectElement::GetGeometryBounds	xul.pdb
> 0	0	  2101	1	mozilla::dom::SVGRectElement::GetGeometryBounds	xul.pdb
< 0	0	   967	1	compute_intersection	xul.pdb
> 0	0	   961	1	compute_intersection	xul.pdb
< 0	0	   801	1	GetUnretargetedOffsetsFor	xul.pdb
> 0	0	   799	1	GetUnretargetedOffsetsFor	xul.pdb
< 0	0	   792	1	core::ptr::real_drop_in_place<std::sync::mutex::Mutex<webrender::platform::windows::font::FontContext>>	xul.pdb
> 0	0	   707	1	SkMatrix::Poly4Proc	xul.pdb
< 0	0	   705	1	SkMatrix::Poly4Proc	xul.pdb
> 0	0	   696	1	core::ptr::real_drop_in_place<webrender::platform::windows::font::FontContext>	xul.pdb
> 0	0	   534	1	SkBaseShadowTessellator::handleLine	xul.pdb
< 0	0	   530	1	SkBaseShadowTessellator::handleLine	xul.pdb
> 0	0	   516	1	alloc::sync::Arc<webrender::glyph_rasterizer::FontContexts>::drop_slow<webrender::glyph_rasterizer::FontContexts>	xul.pdb
< 0	0	   279	1	alloc::sync::Arc<webrender::glyph_rasterizer::FontContexts>::drop_slow<webrender::glyph_rasterizer::FontContexts>	xul.pdb

I looked at a few of these, and they are mostly innocuous things like

xul!celt_encode_with_ec+0x2d9d [z:\build\build\src\media\libopus\celt\celt_encoder.c @ 2040]:
 2040 00000001`82eb94bd 83bd8000000000  cmp     dword ptr [rbp+80h],0

xul!celt_encode_with_ec+0x2d9d [/builds/worker/src/media/libopus/celt/celt_encoder.c @ 2040]:
 2040 00000001`82eb949d 4585ed          test    r13d,r13d

where r13d already equals [rbp+80h] in both cases, but the compiler only chose to make use of that fact in one case, which led to a shorter instruction.

The only suspicious thing was the last pair, where a function went from 279 bytes to 516, but it looks like just a single inlining decision.

All in all, I don't see any of the blatant differences that would be expected if we forgot to include some component altogether. These builds look good to me.

If you were so inclined, the LLVM folks would be interested in a bug report about significantly shorter encodings being chosen based on the host system for compilation.

I think host system is a red herring, I suspect this was nondeterminism that could have happened just as well in multiple runs on the same OS. Maybe we could test that with some rebuilds.

I think host system is not a red herring. I have 11 cross builds on this try that only differ in uninstall/helper.exe (which comes from nsis): https://treeherder.mozilla.org/#/jobs?repo=try&revision=0ceaca09a0af18c62ecac1fdb14aab8da7cacaf5
(and that's with less extra flags than the try pushes with cross vs. native)

I wonder if some things are implicitly -march=native, and if the difference could come from different host CPUs rather than host OS. On try, all Linux builds happen on the same AWS instance type.

So, when doing the same on Windows, different retries get different set of differences. That doesn't seem to correlate to different AWS instance types.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=1a207fc50da0a6c890a511cf8287aa6ffa5b9520

Although I now realize I forgot to disable sccache, so that factors in...

This is a try with 11 native Windows builds with minimized differences and sccache disabled:
https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=293447249&revision=76092018c3a3d461e5d1cd987a47cd5fa5324d90
There seems to be two different outcomes:

There doesn't seem to be any correlation with the AWS instance type.

In the case where the differences are small, the first difference is the TimeDateStamp field of the COFF header (and again a couple times later in the file) and the last difference is in the PDB70 signature -- although I'm kind of surprised lld is in that codepath, does it think we're building mingw? And, you'd think the hashes would be reproducible, too...

In the case of large differences, going by the "XUL section sizes opt native-repr", libxul is always 880 bytes larger.

In one pair of builds that I looked at, there is an extra copy of a Rust function:

0:000> x /v xul!core::iter::adapters*spirv_cross_internal::spirv::EntryPoint,spirv_cross_internal::ErrorCode>
prv func   00000001`83f05bb0  33b xul!core::iter::adapters::{{impl}}::next<core::iter::adapters::Map<core::ops::range::Range<usize>, closure-0>,spirv_cross_internal::spirv::EntryPoint,spirv_cross_internal::ErrorCode> (struct core::iter::adapters::ResultShunt<core::iter::adapters::Map<core::ops::range::Range<usize>, closure-0>, spirv_cross_internal::ErrorCode> *)

vs

0:000> x /v xul!core::iter::adapters*spirv_cross_internal::spirv::EntryPoint,spirv_cross_internal::ErrorCode>
prv func   00000001`83f05bb0  33b xul!core::iter::adapters::{{impl}}::next<core::iter::adapters::Map<core::ops::range::Range<usize>, closure-0>,spirv_cross_internal::spirv::EntryPoint,spirv_cross_internal::ErrorCode> (struct core::iter::adapters::ResultShunt<core::iter::adapters::Map<core::ops::range::Range<usize>, closure-0>, spirv_cross_internal::ErrorCode> *)
prv func   00000001`83f10140  33b xul!core::iter::adapters::{{impl}}::next<core::iter::adapters::Map<core::ops::range::Range<usize>, closure-0>,spirv_cross_internal::spirv::EntryPoint,spirv_cross_internal::ErrorCode> (struct core::iter::adapters::ResultShunt<core::iter::adapters::Map<core::ops::range::Range<usize>, closure-0>, spirv_cross_internal::ErrorCode> *)

I only looked at that one pair, but based on sizes I assume it's always the same function duplicated.

All three copies of the function (the one in the base binary, and the two in the larger binary) have identical code, modulo jump offsets. So this looks like an incomplete ICF in the larger binary.

Maybe we could try setting -linkrepro to the artifact upload directory to see if we can reproduce the sometimes-ICF in multiple runs on a local machine.

(In reply to :dmajor from comment #27)

https://treeherder.mozilla.org/#/jobs?repo=try&revision=5ece95aae8ea41cab244b6f1b8e896edf5360c3b

In 15 rebuilds of the repro.tar from the Bn task, I always got the same size xul.dll. But building the repro.tar from the Bnr task (I only tried once), I got a slightly larger xul.dll. It appears that the lld-link step is fine from a determinism standpoint; the differences are already present in the inputs to the linker. I do see a difference in size in the two gkrust.libs.

You need to log in before you can comment on or make changes to this bug.