Closed Bug 670951 Opened 9 years ago Closed 8 years ago

Mac XULRunner builds fail intermittently during unification

Categories

(Firefox Build System :: General, defect, critical)

Type: defect
Priority: Not set
Severity: critical

Tracking

(firefox7+ wontfix)

Status: RESOLVED FIXED
Target Milestone: mozilla9
Tracking Status: firefox7 + wontfix

People

(Reporter: khuey, Assigned: espindola)

Attachments

(3 files, 4 obsolete files)

The Mac XULRunner builds are failing intermittently during unify with:

sent 72078642 bytes  received 23648 bytes  6866884.76 bytes/sec
total size is 71997252  speedup is 1.00
Linking XPT files...
/tools/buildbot/bin/python2.6 /builds/slave/m-cen-osx64-xr/build/config/optimizejars.py --optimize /builds/slave/m-cen-osx64-xr/build/obj-firefox/x86_64/xulrunner/installer/mac/../../../jarlog//en-US ../../../dist/bin/chrome ../../../dist/xulrunner/XUL.framework/Versions/Current/chrome
Removing unpackaged files...
cd ../../../dist/xulrunner/XUL.framework/Versions/Current; rm -rf xulrunner-config regchrome* regxpcom* xpcshell* xpidl* xpt_dump* xpt_link*  core bsdecho gtscc js js-config jscpucfg nsinstall viewer TestGtkEmbed codesighs* elf-dynstr-gc mangle* maptsv* mfc* mkdepend* msdump* msmap* nm2tsv* nsinstall* res/samples res/throbber shlibsign* ssltunnel* certutil* pk12util* winEmbed.exe chrome/chrome.rdf chrome/app-chrome.manifest chrome/overlayinfo components/compreg.dat components/xpti.dat content_unit_tests necko_unit_tests *.dSYM 
Packaging JavaScript Shell...
rm -f ../../../dist/jsshell-mac64.zip
/usr/bin/zip -9j ../../../dist/jsshell-mac64.zip ../../../dist/bin/js  ../../../dist/bin/libnspr4.dylib ../../../dist/bin/libplds4.dylib ../../../dist/bin/libplc4.dylib 
  adding: js (deflated 67%)
  adding: libnspr4.dylib (deflated 65%)
  adding: libplds4.dylib (deflated 70%)
  adding: libplc4.dylib (deflated 70%)
rm -f obj-firefox/i386/dist/xulrunner/XUL.framework/Versions/Current/*.chk \
	      obj-firefox/x86_64/dist/xulrunner/XUL.framework/Versions/Current/*.chk
/builds/slave/m-cen-osx64-xr/build/build/macosx/universal/fix-buildconfig file \
	  obj-firefox/i386/dist/xulrunner/XUL.framework/Versions/Current/chrome/toolkit/ \
	  obj-firefox/x86_64/dist/xulrunner/XUL.framework/Versions/Current/chrome/toolkit/
mkdir -p obj-firefox/i386/dist/universal/xulrunner
rm -f obj-firefox/x86_64/dist/universal
ln -s obj-firefox/i386/dist/universal obj-firefox/x86_64/dist/universal
rm -rf obj-firefox/i386/dist/universal/xulrunner/XUL.framework
/builds/slave/m-cen-osx64-xr/build/build/macosx/universal/unify \
          --unify-with-sort "\.manifest$" \
          --unify-with-sort "components\.list$" \
	  obj-firefox/i386/dist/xulrunner/XUL.framework \
	  obj-firefox/x86_64/dist/xulrunner/XUL.framework \
	  obj-firefox/i386/dist/universal/xulrunner/XUL.framework
/builds/slave/m-cen-osx64-xr/build/build/macosx/universal/unify: warning: makeUniversalDirectory: only in x86 obj-firefox/x86_64/dist/xulrunner/XUL.framework/Versions/8.0a1:
  xulrunner
Can't call method "path" on an undefined value at /builds/slave/m-cen-osx64-xr/build/build/macosx/universal/unify line 1092.
make[2]: *** [postflight_all] Error 255
make[1]: *** [realbuild] Error 2
make: *** [build] Error 2
program finished with exit code 2
I really don't know what's going on here. Maybe some bad ordering of packaging targets?
rs said that we ran into a similar problem for Firefox builds in the past, and that it was fixed by running some step twice.
Definitely looks like a build system problem. The m-c builds on 2011-07-17 through 19 were all on moz2-darwin10-slave06, but the first two succeeded and the last one failed. Also happening on Aurora.

The message
 only in x86 obj-firefox/x86_64/dist/xulrunner/XUL.framework/Versions/8.0a1:
is a bit odd; it seems to be looking for the 32-bit file in the 64-bit dir.
I think that's a red herring. `unify` was originally written to splice together x86+ppc halves, and I think when we switched it to splice x86+x86_64 it retained a few strings explicitly referring to "x86" and "ppc":
http://mxr.mozilla.org/mozilla-central/source/build/macosx/universal/unify#724
Duplicate of this bug: 679766
Bumping severity as this prevented us from shipping XULRunner 7.0b1.
Severity: normal → critical
Can't you just rerun the build?
That failed too.
How many times should I rerun it?
Sorry if it is a silly question, but how do I reproduce this?
|make -f client.mk build| in mozilla-beta, with http://hg.mozilla.org/build/buildbot-configs/file/de8369b8cd85/mozilla2/macosx64/mozilla-beta/xulrunner/mozconfig as your mozconfig should probably do it.
Nominating for tracking even though no patch is ready yet, because this prevented us from generating xulrunner builds for 7.0beta1, and will be a problem for 7.0betaN and the 7.0 release.
Rafael - any luck reproducing this or thoughts on how to approach it (or thoughts on who we should punt it towards)?
joey, ted: any suggestions?

espindola: does comment #10 answer your question?
(In reply to Johnathan Nightingale [:johnath] from comment #12)
> Rafael - any luck reproducing this or thoughts on how to approach it (or
> thoughts on who we should punt it towards)?

Not yet, sorry. Will try to take a look today. The analysis of bug 678607 and opening the related rdars took some time.
Failed for 7.0b3
Do we have a recent xulrunner for Firefox 7 beta working?
Between it sometimes working the first time and rebuilding when it doesn't, we've had Mac SDKs for 7.0b2 through b5, e.g.
  https://ftp.mozilla.org/pub/mozilla.org/xulrunner/releases/7.0b5/sdk/
xulrunner failed to build and the rebuild also failed for Firefox 7.0b6
what OS X version is used for the build, 10.6?
System Version: Mac OS X 10.6.2 (10C540)

Darwin moz2-darwin10-slave09.build.mozilla.org 10.2.0 Darwin Kernel Version 10.2.0: Tue Nov  3 10:37:10 PST 2009; root:xnu-1486.2.11~1/RELEASE_I386 i386
Attached patch test patch (obsolete)
Looking at the code, I noticed that the script would fail if only the x86 file was defined. It is really strange that it fails only sometimes.

I am trying to reproduce the bug with this patch. Should at least give a bit more information.
Attached patch s/else if/elsif/ (obsolete)
Attachment #561237 - Attachment is obsolete: true
Attached patch new patch
We already had a warning for a file existing in only one of the arches.
Attachment #561250 - Attachment is obsolete: true
summary so far:

The unify script has a bug: it crashes if a file exists only on the X86 side (X86 and PPC being the internal names the script uses for the two trees).
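The failure mode described above can be illustrated with a small Python analogue. `unify` itself is a Perl script whose code is not quoted in this bug, so the `Entry` type, the `pair_entries` function and the per-tree dictionaries keyed by file name are hypothetical stand-ins. Looking up a name that exists in only one tree yields nothing, and calling `.path` on that missing value is the kind of mistake behind `Can't call method "path" on an undefined value`. The sketch checks for the missing side first and reports one-sided names instead of crashing. As the warning in the log suggests, the i386 tree sits in the slot the script calls PPC and the x86_64 tree in the slot it calls X86.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Entry:
    """Hypothetical stand-in for one file in an architecture's tree."""
    path: str


def pair_entries(
    ppc: Dict[str, Entry], x86: Dict[str, Entry]
) -> Tuple[List[Tuple[str, str]], List[str], List[str]]:
    """Pair same-named entries from the two trees (a sketch, not unify's code).

    Returns (pairs, only_ppc, only_x86). A naive version that evaluated
    ppc.get(name).path for every name would raise AttributeError on None
    for a name that exists only in the x86 tree.
    """
    pairs: List[Tuple[str, str]] = []
    only_ppc: List[str] = []
    only_x86: List[str] = []
    for name in sorted(set(ppc) | set(x86)):
        left: Optional[Entry] = ppc.get(name)
        right: Optional[Entry] = x86.get(name)
        if left is None:
            only_x86.append(name)  # guard: never dereference the missing side
        elif right is None:
            only_ppc.append(name)
        else:
            pairs.append((left.path, right.path))
    return pairs, only_ppc, only_x86


if __name__ == "__main__":
    # Mirrors the failing log: xulrunner exists only in the x86_64 ("x86") tree.
    ppc_tree = {"XUL": Entry("i386/XUL")}
    x86_tree = {"XUL": Entry("x86_64/XUL"), "xulrunner": Entry("x86_64/xulrunner")}
    pairs, only_ppc, only_x86 = pair_entries(ppc_tree, x86_tree)
    print(pairs)
    print(only_ppc, only_x86)
```

The sketch only reports one-sided files; whether such a file should merely warn or abort the build is a separate policy choice.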

The attached patch fixes the crash, but the log of a failed build shows

/builds/slave/rel-m-beta-xr-osx64-bld/build/build/macosx/universal/unify: warning: makeUniversalDirectory: only in x86 obj-firefox/x86_64/dist/xulrunner/XUL.framework/Versions/7.0:
  xulrunner

We probably still have to figure out why xulrunner was created only for x86_64.
In packager.mk we run:

 rsync -auv --copy-unsafe-links $(_APPNAME) $(MOZ_PKG_DIR)

Doing a diff of the x86_64 and i386 rsync output shows:

< XUL.framework/Versions/7.0/xpidl
< XUL.framework/Versions/7.0/xulrunner

But we do run nsinstall for xulrunner on both x86_64 and i386. Trying to figure out what went wrong in one build but not the other.
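The diff of the two rsync logs can also be done directly against the two staged directory trees. Below is a minimal Python sketch using the standard library's `filecmp.dircmp`. The function name `one_sided_paths` and the command-line usage are illustrative assumptions, not part of the build system.

```python
import filecmp
import os
import sys
from typing import List, Tuple


def _collect(dcmp: filecmp.dircmp, prefix: str,
             left_only: List[str], right_only: List[str]) -> None:
    # Record names present on only one side at this directory level.
    left_only.extend(os.path.join(prefix, n) for n in dcmp.left_only)
    right_only.extend(os.path.join(prefix, n) for n in dcmp.right_only)
    # Recurse into directories that exist on both sides.
    for name, sub in dcmp.subdirs.items():
        _collect(sub, os.path.join(prefix, name), left_only, right_only)


def one_sided_paths(left: str, right: str) -> Tuple[List[str], List[str]]:
    """Return (only_in_left, only_in_right) as sorted relative paths.

    Note: dircmp skips its default ignore list (e.g. CVS, .hg, .git), and a
    name that is a file on one side but a directory on the other is not
    reported here.
    """
    left_only: List[str] = []
    right_only: List[str] = []
    _collect(filecmp.dircmp(left, right), "", left_only, right_only)
    return sorted(left_only), sorted(right_only)


if __name__ == "__main__" and len(sys.argv) == 3:
    only_left, only_right = one_sided_paths(sys.argv[1], sys.argv[2])
    for p in only_left:
        print("<", p)  # diff-style marker: only in the first tree
    for p in only_right:
        print(">", p)  # only in the second tree
```

Run with the x86_64 framework directory as the first argument and the i386 one as the second. The `<` lines then correspond to the entries shown in the diff above.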

I cannot reproduce this on my laptop; trying to build on a bot.
Attached patch logging
This patch makes an rsync invocation verbose to try to bisect where the problem is.
Attached patch proposed patch (obsolete)
I have not been able to reproduce the problem, but I think this missing dependency could be the cause.
Attachment #561311 - Flags: review?(ted.mielczarek)
Comment on attachment 561282 [details] [diff] [review]
logging

Making this more verbose is probably a good thing anyway.
Attachment #561282 - Flags: review?(ted.mielczarek)
Comment on attachment 561252 [details] [diff] [review]
new patch

I am not so sure about this one. While this patch fixes a bug, this bug is probably what saved us from shipping a package with a non-universal xulrunner.

Can we just drop support for having files in one dir but not in the other?
Attachment #561252 - Flags: review?(ted.mielczarek)
Attached patch proposed patch (obsolete)
Attachment #561311 - Attachment is obsolete: true
Attachment #561334 - Flags: review?(ted.mielczarek)
Attachment #561311 - Flags: review?(ted.mielczarek)
Comment on attachment 561334 [details] [diff] [review]
proposed patch

The joys of recursive makefiles :-(

I now realize that all this patch does is move the failure earlier in the build. If $(DIST)/bin/xulrunner already exists, everything works. If it doesn't, this particular makefile process has no idea how to create it:

make[7]: *** No rule to make target `../../dist/bin/xulrunner'.  Stop.
Attachment #561334 - Flags: review?(ted.mielczarek)
Attached patch proposed patch
I think the correct way to fix this is to install the files directly in the location we will use and drop the use of rsync, but that is a fairly big change. We would have to change $(DIST)/bin to $(DEST_BIN) everywhere and have xulrunner on OS X set it to $(DIST)/$(FRAMEWORK_NAME).framework/Versions/$(FRAMEWORK_VERSION).

This patch is a really small step in that direction (but also a hack). The Makefile responsible for building xulrunner now also puts it in the framework directory. It may or may not be overwritten by rsync, but it will be there in the end.
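The idea behind this patch can be sketched outside make: the freshly built program is copied straight into the framework's versioned directory, in addition to `$(DIST)/bin`. This is a Python analogue under stated assumptions. `os.makedirs` stands in for `$(NSINSTALL) -D` and `shutil.copy` for the plain `$(NSINSTALL)` copy; the function names are hypothetical, and the real change is a Makefile rule.

```python
import os
import shutil


def framework_dir(dist: str, framework_name: str, framework_version: str) -> str:
    # Mirrors $(DIST)/$(FRAMEWORK_NAME).framework/Versions/$(FRAMEWORK_VERSION)
    return os.path.join(dist, framework_name + ".framework",
                        "Versions", framework_version)


def install_into_framework(program: str, dist: str,
                           framework_name: str, framework_version: str) -> str:
    """Copy program into the framework dir, creating the dir first.

    Returns the path of the installed copy.
    """
    dest_dir = framework_dir(dist, framework_name, framework_version)
    os.makedirs(dest_dir, exist_ok=True)  # stand-in for $(NSINSTALL) -D
    # shutil.copy copies the file contents and permission bits, so an
    # executable stays executable; it returns the new file's path.
    return shutil.copy(program, dest_dir)
```

The copy now happens in the same makefile that builds the program, so it no longer depends on a separate rsync step running in the right order. Whether or not a later rsync overwrites the file, it ends up in place.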
Assignee: nobody → respindola
Attachment #561334 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Attachment #561635 - Flags: review?(ted.mielczarek)
Comment on attachment 561252 [details] [diff] [review]
new patch

Review of attachment 561252 [details] [diff] [review]:
-----------------------------------------------------------------

This is fine, but it doesn't really solve the underlying problem, so maybe we should not take this patch?
Attachment #561252 - Flags: review?(ted.mielczarek) → review+
Comment on attachment 561635 [details] [diff] [review]
proposed patch

Review of attachment 561635 [details] [diff] [review]:
-----------------------------------------------------------------

::: xulrunner/app/Makefile.in
@@ +208,5 @@
> +FRAMEWORK_DIR = \
> +   $(DIST)/$(FRAMEWORK_NAME).framework/Versions/$(FRAMEWORK_VERSION)
> +
> +$(FRAMEWORK_DIR)/Resources:
> +	$(NSINSTALL) -D $@

n.b.: this could use the work from bug 680246 when that lands.

::: xulrunner/stub/Makefile.in
@@ +114,5 @@
> +$(FRAMEWORK_DIR):
> +	$(NSINSTALL) -D $@
> +
> +$(FRAMEWORK_DIR)/$(PROGRAM): $(PROGRAM) $(FRAMEWORK_DIR)
> +	$(NSINSTALL) $(PROGRAM) $(FRAMEWORK_DIR)

You could just write "$(NSINSTALL) $?" in the rule here, but perhaps that's not as clear.
Attachment #561635 - Flags: review?(ted.mielczarek) → review+
https://hg.mozilla.org/mozilla-central/rev/cf051f97c093
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla9
One more patch to go.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
https://hg.mozilla.org/mozilla-central/rev/3c6a26f33adf
Status: REOPENED → RESOLVED
Closed: 8 years ago → 8 years ago
Resolution: --- → FIXED
Just hit this on 7.0.1 macosx64 xulrunner

sent 72125488 bytes  received 23648 bytes  5344380.44 bytes/sec
total size is 72043998  speedup is 1.00
Linking XPT files...
/tools/buildbot/bin/python2.6 /builds/slave/rel-m-rel-xr-osx64-bld/build/config/optimizejars.py --optimize /builds/slave/rel-m-rel-xr-osx64-bld/build/obj-firefox/x86_64/xulrunner/installer/mac/../../../jarlog//en-US ../../../dist/bin/chrome ../../../dist/xulrunner/XUL.framework/Versions/Current/chrome
Removing unpackaged files...
cd ../../../dist/xulrunner/XUL.framework/Versions/Current; rm -rf xulrunner-config regchrome* regxpcom* xpcshell* xpidl* xpt_dump* xpt_link*  core bsdecho gtscc js js-config jscpucfg nsinstall viewer TestGtkEmbed codesighs* elf-dynstr-gc mangle* maptsv* mfc* mkdepend* msdump* msmap* nm2tsv* nsinstall* res/samples res/throbber shlibsign* ssltunnel* certutil* pk12util* winEmbed.exe chrome/chrome.rdf chrome/app-chrome.manifest chrome/overlayinfo components/compreg.dat components/xpti.dat content_unit_tests necko_unit_tests *.dSYM 
Packaging JavaScript Shell...
rm -f ../../../dist/jsshell-mac64.zip
/usr/bin/zip -9j ../../../dist/jsshell-mac64.zip ../../../dist/bin/js  ../../../dist/bin/libnspr4.dylib ../../../dist/bin/libplds4.dylib ../../../dist/bin/libplc4.dylib 
  adding: js (deflated 67%)
  adding: libnspr4.dylib (deflated 65%)
  adding: libplds4.dylib (deflated 70%)
  adding: libplc4.dylib (deflated 70%)
rm -f obj-firefox/i386/dist/xulrunner/XUL.framework/Versions/Current/*.chk \
	      obj-firefox/x86_64/dist/xulrunner/XUL.framework/Versions/Current/*.chk
/builds/slave/rel-m-rel-xr-osx64-bld/build/build/macosx/universal/fix-buildconfig file \
	  obj-firefox/i386/dist/xulrunner/XUL.framework/Versions/Current/chrome/toolkit/ \
	  obj-firefox/x86_64/dist/xulrunner/XUL.framework/Versions/Current/chrome/toolkit/
mkdir -p obj-firefox/i386/dist/universal/xulrunner
rm -f obj-firefox/x86_64/dist/universal
ln -s obj-firefox/i386/dist/universal obj-firefox/x86_64/dist/universal
rm -rf obj-firefox/i386/dist/universal/xulrunner/XUL.framework
/builds/slave/rel-m-rel-xr-osx64-bld/build/build/macosx/universal/unify \
          --unify-with-sort "\.manifest$" \
          --unify-with-sort "components\.list$" \
	  obj-firefox/i386/dist/xulrunner/XUL.framework \
	  obj-firefox/x86_64/dist/xulrunner/XUL.framework \
	  obj-firefox/i386/dist/universal/xulrunner/XUL.framework
/builds/slave/rel-m-rel-xr-osx64-bld/build/build/macosx/universal/unify: warning: makeUniversalDirectory: only in x86 obj-firefox/x86_64/dist/xulrunner/XUL.framework/Versions/7.0.1:
  xulrunner
Can't call method "path" on an undefined value at /builds/slave/rel-m-rel-xr-osx64-bld/build/build/macosx/universal/unify line 1092.
make[2]: *** [postflight_all] Error 255
make[1]: *** [realbuild] Error 2
make: *** [build] Error 2
program finished with exit code 2
elapsedTime=8399.688690
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Right, because this only landed in time for Firefox 9. I'm not sure why this got marked status-firefox7: fixed.
Status: REOPENED → RESOLVED
Closed: 8 years ago → 8 years ago
Resolution: --- → FIXED
Blocks: 700688
Product: Core → Firefox Build System