Closed Bug 1350001 Opened 8 years ago Closed 8 years ago

Build failure without --disable-webrender (32-bit builds on 64-bit windows machines)

Categories

(Core :: Graphics: WebRender, defect)

All
Windows
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla55
Tracking Status
firefox55 --- fixed

People

(Reporter: padenot, Assigned: kats)

References

Details

(Whiteboard: [workaround in comment 2])

Attachments

(3 files)

Attached file log-build-failure.txt
I can't build the tree anymore on Windows 10, 64bits, today's mozilla-central, Visual Studio 2015. Relevant logs attached. I've run ./mach bootstrap, and rebuilt, same result. --disable-webrender allows me to build again.

$ rustc --version
rustc 1.15.1 (021bd294c 2017-02-08)
Blocks: 1342450
Flags: needinfo?(bugmail)
I'm fairly sure doing a "rustup default i686-pc-windows-msvc" to switch to a 32-bit rust toolchain will fix this, but I haven't verified it locally. I'll do that later today, assigning bug to myself.
Assignee: nobody → bugmail
Flags: needinfo?(bugmail)
Summary: Build failure without --disbale-webrender → Build failure without --disable-webrender (32-bit builds on 64-bit windows machines)
So doing "rustup default stable-i686-pc-windows-msvc" will install the 32-bit rust toolchain as the default, and then the build works.

However I guess this does seem like a workaround, since the build system already requires having the correct target installed. That is, if you're on a 64-bit machine, you'll have the 64-bit rust toolchain installed by default. And if you're doing a 32-bit firefox build, the build system will use the 32-bit *target* of the 64-bit toolchain, and complain if you don't have it. So I guess really the problem is that something inside one of webrender's dependencies is not respecting the target properly, and is emitting 64-bit stuff which then fails later.
See Also: → 1320741
Whiteboard: [workaround in comment 2]
(Also, if you're on a 64-bit machine, you could do 64-bit firefox builds, which works now, and shouldn't have this problem).
It looks like the kernel32-sys crate fails cross-compilation, which causes heapsize to fail cross-compilation, which causes the firefox build to fail. I filed https://github.com/retep998/winapi-rs/issues/411 for the kernel32-sys cross-compilation problem although getting that fixed and having the fix propagate back to firefox will likely take a while.
(Cross-posted from https://github.com/retep998/winapi-rs/issues/411#issuecomment-288875058):

I talked to @retep998 on IRC about this. He explained that this happens because mozillabuild uses vcvars.bat to set up the build environment. Basically the 32-bit mozillabuild shell set up the MSVC environment to always produce 32-bit output. However rustc, when cross-compiling, uses the "host" environment for the build scripts. Therefore rustc is producing 64-bit artifacts and MSVC is trying to link with 32-bit artifacts, and this causes the breakage. I verified that trying to build kernel32-sys without vcvars (i.e. from a plain windows cmd prompt) works fine. So this is not really a bug in kernel32-sys per se but interaction between how rust does builds and the mozillabuild system using vcvars.

So for now using the 32-bit rust toolchain when doing 32-bit builds is the way to go. So what we should do is update mozillabuild to automatically use the 32-bit rust toolchain when starting the 32-bit shell and the 64-bit rust toolchain when starting the 64-bit shell.
I don't think fixing this in MozillaBuild is a great idea. We haven't done a MozillaBuild release in a long time, and getting people to update MozillaBuild is a huge hassle. We should find some workaround in the build system. Can we force cargo to do the right thing here?

I'm still not entirely sure what the exact bug is--the build script is using the 64-bit toolchain because that's the default, but it winds up trying to link with the 32-bit linker? That feels like it could be a cargo bug if so.
The problem occurs when doing a 32-bit build on a 64-bit machine. In this configuration, the default rust toolchain is the 64-bit one. That means the "host" parts of the build (e.g. build.rs scripts) get compiled to 64-bit object files regardless of what --target option is passed to cargo/rustc. However, because we are using a 32-bit MozillaBuild shell, it has set up vcvars to always link as 32-bit. Therefore the 64-bit object files fail linking.

Note that this error doesn't happen on the C++ side because the build doesn't allow you to cross-compile. If you set a --host that's different from a --target the configure step fails entirely as that is not supported. Yet on the rust side it goes ahead and tries to do it, which fails. So maybe the right fix here is to just fail the build during configure if we detect the default rust toolchain does not match the build shell, like we do for --host and --target mismatches.
Thanks for the cc Ted! So as background we have strived to make the rustc compiler usable outside of the MSVC shells, basically without running vcvars.bat. That means that we have a ton of inference [1] for finding the right MSVC compiler when rustc in a bare command line situations. Additionally, as you've found, cargo will compile build scripts for the native host architecture (e.g. it won't pass `--target` to the compiler). That means that if you have a 64-bit rustc toolchain and you're generating code for a 32-bit target, then the toolchain will actually be producing binaries for both 32-bit (the final artifact) and 64-bit (intermediate artifacts like build scripts).

So what's happening here I believe is that you've run vcvars.bat which means that all the relevant VS env vars are configured when you're running rustc. It just so happens that rustc detects this [2] and bypasses its inference entirely to assume the user has configured everything necessary. Unfortunately any one instance of vcvars.bat env vars are not compatible with cross compilation (they're only usable to build one target) so when Cargo attempts to do cross compilation then everything goes haywire as rustc uses the user-provided 32-bit compiler to produce a 64-bit executable, generating the error you're seeing.

I could imagine a few possible fixes to this:

* Gecko could require that rustc has the right host architecture and no cross-compilation happens. Basically it'd reject x86_64-pc-windows-msvc host compilers when producing an i686-pc-windows-msvc target.
* Cargo could support explicitly specifying a host target (or build target I guess is the autotools name) where it "cross compiles" build scripts. 
* Rustc could detect clearly erroneous situations such as this (e.g. vcvars.bat says to compile to 32-bit but the target is 64-bit) and avoid using the env vars and fall back to its own inference (which should work in this case I believe, maybe less so if there's a bunch of native libraries in play)
* Gecko could remove the env vars when executing rustc. So long as the user only has one VS installation and rustc can find it then the env vars aren't necessary for rustc itself.

I'm sort of ambivalent about which path to take here. I'd be hesitant to implement the second only because it's a new feature where the others are arguably just bug fixes, but I feels like the cleanest option here out of all.

[1]: https://github.com/rust-lang/rust/blob/master/src/librustc_trans/back/msvc/mod.rs
[2]: https://github.com/rust-lang/rust/blob/ccce2c6eb914a66571f60fa0afe8a46faa9fb3bd/src/librustc_trans/back/msvc/mod.rs#L81
Consensus seems to be to implement Alex's fourth proposal above to strip the env vars which are confusing cargo when the gecko build system calls it. The third proposal is essentially to do the same thing in cargo, so it's similarly valid, but a slower way to fix the issue for gecko.
How is that going to interact with CI builds, where we don't have MSVC installed and we set all the env vars to point to the unpacked tooltool package?
So clearing out the env vars can be done by adding a -i option to the env wrapper at [1]. This fixes the problem for me locally. I kicked off a try push to see what breaks in CI [2]. I see two possible solutions for comment 11: one is to redefine the necessary env vars inside the `env -i` wrapper, just like we explicitly set CARGO_TARGET_DIR and RUSTC and those other variables. The other option is to only add the -i argument outside of automation (e.g. if MOZ_AUTOMATION is not set, or if the MOZ_MSVCBITS var is set, which indicates we're in MozillaBuild, or some other appropriate condition).

[1] http://searchfox.org/mozilla-central/rev/381a7b8f8a78b481336cfa384cb0a5536e217e4a/config/rules.mk#949
[2] https://treeherder.mozilla.org/#/jobs?repo=try&revision=48e508903d5fb21417ffe4701ba1ad7cfa3a2829
Unsurprisingly that broke everything on CI. Here's one where we do it conditionally based on MOZ_MSVCBITS: https://treeherder.mozilla.org/#/jobs?repo=try&revision=ee77673d2a90f975bb118fc123357cd96293c05c
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #14)
> *** Bug 1353932 has been marked as a duplicate of this bug. ***

Is this default toolchain setting stored outside of the shell environment? (I often have two shells open, one 64-bit and one 32-bit, both building off the same repo with separate obj dir targets.)
(In reply to Jim Mathies [:jimm] from comment #15)
> Is this default toolchain setting stored outside of the shell environment?

Yes, it is stored by rustup.

> (I often have two shells open, one 64-bit and one 32-bit, both building off
> the same repo with separate obj dir targets.)

Yeah in that case the workaround from comment 2 isn't going to work for you. The patch in comment 13 will probably do it. I'll put that up for review.
Comment on attachment 8855308 [details]
Bug 1350001 - Run cargo in a clean environment when running in a MozillaBuild shell.

https://reviewboard.mozilla.org/r/127188/#review130022

This seems like a reasonable workaround given the constraints. Thanks!
Attachment #8855308 - Flags: review?(ted) → review+
Pushed by kgupta@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/4c06dea25ef7
Run cargo in a clean environment when running in a MozillaBuild shell. r=ted
https://hg.mozilla.org/mozilla-central/rev/4c06dea25ef7
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
This patch breaks Servo build on Windows because the build script can no longer locate Python executable.
(In reply to Xidorn Quan [:xidorn] UTC+10 (less responsive 15/Apr-3/May) from comment #22)
> This patch breaks Servo build on Windows because the build script can no
> longer locate Python executable.

I'm assuming you mean the stylo build (or whatever it is you're calling stylo-enabled gecko), rather than the servo build. Are there instructions somewhere on how to do that build locally? I'd like to reproduce and see if I can find a workaround.
Yes, Stylo is broken on Windows with this. Instructions for building are here: https://public.etherpad-mozilla.org/p/stylo
For Windows you need to install LLVM 3.9 or else disable bindgen. I think you can probably use something like the following in your mozconfig:

ac_add_options --enable-stylo
ac_add_options --disable-stylo-build-bindgen
ac_add_options --target=x86_64-pc-mingw32
ac_add_options --host=x86_64-pc-mingw32
This fixes it for me locally. Xidorn/Brian, can you verify this patch fixes it for you guys as well?

For reference the bit of the stylo build that looks for the python executable is at http://searchfox.org/mozilla-central/rev/eace920a0372051a11d8d275bd9b8f14f3024ecd/servo/components/style/build.rs#67
Attachment #8856510 - Flags: review?(ted)
Attachment #8856510 - Flags: feedback?(xidorn+moz)
Attachment #8856510 - Flags: feedback?(bbirtles)
That one looks promising. I don't have a Windows machine at the moment, but am happy to check tomorrow.
Comment on attachment 8856510 [details] [diff] [review]
Let stylo know where python is

Works for me.
Attachment #8856510 - Flags: feedback?(bbirtles) → feedback+
Comment on attachment 8856510 [details] [diff] [review]
Let stylo know where python is

Sorry, I only tried an incremental build when I first checked. After doing a full build with this patch applied, the Stylo bindgen build script crashed because we are resetting some environment that means it fails to find certain Win 10 SDK files. It seems like we actually depend on the environment setup by vcvars.
Attachment #8856510 - Flags: feedback+ → feedback-
Comment on attachment 8856510 [details] [diff] [review]
Let stylo know where python is

Ok, dropping review on this patch then. I'm going to spin the stylo bustage off into a new bug and we can discuss there.
Attachment #8856510 - Flags: review?(ted)
Attachment #8856510 - Flags: feedback?(xidorn+moz)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: