Closed Bug 1329858 Opened 7 years ago Closed 7 years ago

Intermittent Taskcluster Mac build gmake[7]: *** [stage-package] Error 1 after a "No such file or directory" for some large part of the browser

Categories

(Firefox Build System :: General, defect)

defect
Not set
normal

Tracking

(firefox52 unaffected, firefox53 fixed, firefox54 fixed)

RESOLVED FIXED
mozilla54
Tracking Status
firefox52 --- unaffected
firefox53 --- fixed
firefox54 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: chmanchester)

References

Details

(Keywords: intermittent-failure)

Filed by: philringnalda [at] gmail.com

https://treeherder.mozilla.org/logviewer.html#?job_id=67266442&repo=mozilla-inbound

https://queue.taskcluster.net/v1/task/U6BboOTaSSuEPRUBwZLAQw/runs/0/artifacts/public/logs/live_backing.log

Starting suspiciously close together and on different trees, one on a mozilla-inbound job that started at 07:13 Pacific and one on an autoland job that started (on a different instance) at 07:29. So far, the large parts of the browser I've seen missing have been XUL and libmozglue.dylib, and the only case where I've seen two failures on one instance was  i-0e8c06663ffd26d62 failing to find XUL on autoland, then later failing to find mozglue on mozilla-inbound.
Also,

[task 2017-01-09T15:31:06.737725Z] 15:31:06     INFO -  IOError: [Errno 2] No such file or directory: '../../dist/Nightly.app/Contents/MacOS/XUL'

I wonder if this is premature garbage collection?
Component: General → Docker-Worker
This looks like build system bustage of some sort. From that log I can see XUL being linked here:
[task 2017-01-09T15:26:54.731809Z] 15:26:54     INFO -  XUL
[task 2017-01-09T15:26:54.731840Z] 15:26:54     INFO -  rm -f XUL
<really long link command for XUL>

but then later in the log (right after the error from stage-package, actually), I see:
[task 2017-01-09T15:31:18.004119Z] 15:31:18     INFO -  gmake[8]: Entering directory '/home/worker/workspace/build/src/obj-firefox/toolkit/library'
[task 2017-01-09T15:31:18.004282Z] 15:31:18     INFO -  XUL
[task 2017-01-09T15:31:18.004426Z] 15:31:18     INFO -  rm -f XUL
<XUL link again>

and then below that we link the gtest XUL:
[task 2017-01-09T15:31:32.328829Z] 15:31:32     INFO -  gmake[8]: Entering directory '/home/worker/workspace/build/src/obj-firefox/toolkit/library/gtest'
[task 2017-01-09T15:31:32.329058Z] 15:31:32     INFO -  XUL
[task 2017-01-09T15:31:32.329274Z] 15:31:32     INFO -  rm -f XUL

We definitely shouldn't be re-linking libxul a second time there! I'd guess that's happening during the "link gtest libxul" bit of test packaging, but I don't know why.
My best guess would be that this is a regression from bug 1323581. Looking at a random OS X opt taskcluster inbound build I can see that it is in fact linking XUL twice (three hits if you grep for '-o XUL', one of which is gtest):
https://public-artifacts.taskcluster.net/U6A94nVhS6WN6ZZC1qirjg/0/public/logs/live_backing.log
Blocks: 1323581
Component: Docker-Worker → Build Config
Product: Taskcluster → Core
Looking at a random older build job, I think we have been double-linking libxul on the taskcluster mac builds for an unknown amount of time:
https://public-artifacts.taskcluster.net/AyDUqH4RT4ycFyKUPF2Ywg/0/public/logs/live_backing.log

[task 2016-12-23T22:20:31.239434Z] 22:20:31     INFO -  XUL
[task 2016-12-23T22:20:31.240523Z] 22:20:31     INFO -  rm -f XUL
...
[task 2016-12-23T22:21:22.839805Z] 22:21:22     INFO -  XUL
[task 2016-12-23T22:21:22.840058Z] 22:21:22     INFO -  rm -f XUL

chmanchester's patch just turned this into a race with packaging.
Well, I don't know why we end up re-compiling random parts of the browser when we build the gtest xul on mac cross builds, but we could probably sidestep this by building the gtest xul as part of the main build rather than test packaging, which we should probably be doing anyway.
Note this is not limited to XUL, this has just happened to me for libmozglue.dylib.
So, for dynamic libraries, what's happening is that building gtest recurses through the build system kind of mach build binaries, and things like this happen:

[task 2017-01-19T10:04:01.996976Z] 10:04:01     INFO -  Rebuilding libcrmf.a (1484819437.000000) because respcmn.o (1484819437.194101), servget.o (1484819437.506101) changed

(this is the output from REBUILD_CHECK, with a modification to print out the mtime as well)

libcrmf.a is built with:

/home/worker/workspace/build/src/obj-firefox/_virtualenv/bin/python /home/worker/workspace/build/src/config/expandlibs_exec.py --extract -- /home/worker/workspace/build/src/cctools/bin/x86_64-apple-darwin10-ar crs libcrmf.a asn1cmn.o challcli.o cmmfasn1.o cmmfchal.o cmmfrec.o cmmfresp.o crmfcont.o crmfdec.o crmfenc.o crmfget.o crmfpop.o crmfreq.o crmftmpl.o encutil.o respcli.o respcmn.o servget.o

which suggests x86_64-apple-darwin10-ar is truncating the mtime of the files is creates.

I bet adding a $(TOUCH) after that command would "fix" the problem.
And for all the non dynamic libraries files, the problem is the

$(MAKE) -C $(DEPTH)/browser/app repackage

in the gtest rule in testing/gtest/Makefile.in, after $(MAKE) -C $(DEPTH)/toolkit/library gtestxul.

That repackage rule redoes the rsync between dist/bin and dist/Firefox.app, and that races with stage-package, when both stage-package and gtest are running in parallel.

I think the repackage is only there to copy the gtest XUL library into the dist/Firefox.app directory. So doing that instead of a full rsync should fix the remaining issue.
The patch in bug 1200311 should take care of both of those issues.
(In reply to Mike Hommey [:glandium] from comment #14)
> FWIW:
> 
> unpatched:
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=e107fb954d6b6d5c86ad6acfde7206ad26121eb3
> 19 reds for 21 builds
> 
> with a TOUCH for libraries:
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=23b6f642747c322aaa239fc93f4798f9e86e8f52
> 0 reds for 50 builds

Can we go ahead and make this change now to stop the bleeding?
(In reply to Chris Manchester (:chmanchester) from comment #13)
> The patch in bug 1200311 should take care of both of those issues.

Yeah, those commands happen in the testing/gtest Makefile, and that patch makes us not invoke that for automation builds. It'd still be a problem for local developer builds. I'll file a separate bug to clean that up.
Looking good on branches bug 1200311 has merged around to.
Assignee: nobody → cmanchester
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
Product: Core → Firefox Build System
You need to log in before you can comment on or make changes to this bug.