Closed Bug 898642 Opened 11 years ago Closed 8 years ago

Release repacks compound in size after errors

Categories

(Release Engineering :: Release Automation: Other, defect, P3)

x86
Windows Server 2008
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jhopkins, Unassigned)

Details

Attachments

(1 file)

Similar to bug 647259 which affected nightly repacks, the Firefox 23.0b9 win32 repack 10/10 has been creating very large firefox*.exe files, doubling in size with each locale and eventually being rejected by the signing server with "error uploading file for signing: File too large".
ftr, this tripped up FF23.0b9 today. I know of at least bug#898587.
I've stopped the buildbot slave on mw32-ix-slave07 and disabled it in slavealloc so we can investigate further.  The problem did not manifest itself on mw32-ix-slave03.
(In reply to John Hopkins (:jhopkins) from comment #2)
> I've stopped the buildbot slave on mw32-ix-slave07 and disabled it in
> slavealloc so we can investigate further.  The problem did not manifest
> itself on mw32-ix-slave03.

The fact that this doesn't manifest everywhere is very, very strange...it makes me think that mw32-ix-slave07 had some weird state on it. If it was a pure build system issue we should see it everywhere.
There are a few other bugs about this exact symptom for nightly builds, and a bug for the underlying issue (which I can't find).

Something in the build system fails to clean up under some pretty rare conditions, and then triggers this chain reaction on the same slave.

If we still have the logs from this slave, I would recommend we go over all the repack logs very carefully to see if we can spot something.
The first failed repack build was at: http://buildbot-master58.srv.releng.usw2.mozilla.com:8001/builders/release-mozilla-beta-win32_repack_10%2F10/builds/2

I've saved the logs and build directory in jhopkins@dev-master01:/builds/buildbot/jhopkins/fx23.0b9/  (note: the logs are the 2-* files; ignore the others)
Aki hit this in staging on w64-ix-slave04. See https://bugzilla.mozilla.org/show_bug.cgi?id=903116#c15
Product: mozilla.org → Release Engineering
Can someone go through these logs?
Priority: -- → P3
The errors that led to this bug (23.0b9) can be seen in http://ftp.mozilla.org/pub/mozilla.org/firefox/candidates/archived/23.0b9-candidates/build1/release-mozilla-beta-win32_repack_10-bm58-build1-build2.txt.gz. There's something wrong with that particular slave, and it fails to create the complete mar file for the first locale, ta-LK (search for 'make', 'installers-ta-LK']' returned non-zero exit status 2). The release automation then tries to repack the next locale, which fails for the same reason. Eventually enough content from previous locales has accumulated that we can't even repack the full installer, because it's bigger than the limit we set for signing.

We hit a similar instance of this in bug 938075 - the full log is at http://ftp.mozilla.org/pub/mozilla.org/firefox/candidates/25.0.1-candidates/build1/logs/release-mozilla-release-win32_repack_8-bm86-build1-build0.txt.gz.

That should be repacking the locales or,pa-IN,pl,pt-BR,pt-PT,rm,ro,ru,si in sequence. It gets as far as pt-PT fine, and then in rm it tries to create the complete mar and bash crashes:
     remove: searchplugins/wikipedia-it.src
     remove: searchplugins/wikipedia-it.png
      0 [main] bash 3040 handle_exceptions: Exception: STATUS_ACCESS_VIOLATION
  46061 [main] bash 3040 open_stackdumpfile: Dumping stack trace to bash.exe.stackdump

The next locale doesn't clean up the working directory, so rm content is included when repacking the ro locale. The problem (partly) persists on all later locales too, based on these details of what was uploaded:
caf1cf54848e66f36dfdce9426250b98 md5 23339520 win32/or/Firefox Setup 25.0.1.exe
21353c5fb71db47f9a8c47f66b9c7475 md5 23160056 win32/pa-IN/Firefox Setup 25.0.1.exe
ba810908e20b1f66e58eb47d46ac7c13 md5 24053488 win32/pl/Firefox Setup 25.0.1.exe
20ffb290db1928cdc42182389a62fa1c md5 23266720 win32/pt-BR/Firefox Setup 25.0.1.exe
6d36e7a384f4ddf1e715b3d2b1c95147 md5 23303408 win32/pt-PT/Firefox Setup 25.0.1.exe
< no rm here because it failed >
084a11002c14a86815b139c17affdaf2 md5 75826888 win32/ro/Firefox Setup 25.0.1.exe
48d0ed56c9ecd381531b5e3e9828f149 md5 46955752 win32/ru/Firefox Setup 25.0.1.exe
012b1332c25456e7c1054ccd21185ed4 md5 46586096 win32/si/Firefox Setup 25.0.1.exe
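
A quick way to spot this kind of contamination from an upload listing is to compare each installer against the median size. The following is only a sketch: the function names and the 1.5x threshold are made up for illustration and are not part of the release automation.

```python
from statistics import median

# Lines copied from the upload listing above: "<md5> md5 <size> <path>".
LISTING = """\
caf1cf54848e66f36dfdce9426250b98 md5 23339520 win32/or/Firefox Setup 25.0.1.exe
21353c5fb71db47f9a8c47f66b9c7475 md5 23160056 win32/pa-IN/Firefox Setup 25.0.1.exe
ba810908e20b1f66e58eb47d46ac7c13 md5 24053488 win32/pl/Firefox Setup 25.0.1.exe
20ffb290db1928cdc42182389a62fa1c md5 23266720 win32/pt-BR/Firefox Setup 25.0.1.exe
6d36e7a384f4ddf1e715b3d2b1c95147 md5 23303408 win32/pt-PT/Firefox Setup 25.0.1.exe
< no rm here because it failed >
084a11002c14a86815b139c17affdaf2 md5 75826888 win32/ro/Firefox Setup 25.0.1.exe
48d0ed56c9ecd381531b5e3e9828f149 md5 46955752 win32/ru/Firefox Setup 25.0.1.exe
012b1332c25456e7c1054ccd21185ed4 md5 46586096 win32/si/Firefox Setup 25.0.1.exe
"""


def parse_listing(text):
    """Return {locale: size_in_bytes} from '<hash> md5 <size> <path>' lines."""
    sizes = {}
    for line in text.splitlines():
        parts = line.split(None, 3)
        if len(parts) != 4 or parts[1] != "md5":
            continue  # skip blank lines and notes that are not listing entries
        locale = parts[3].split("/")[1]  # path looks like win32/<locale>/<file>
        sizes[locale] = int(parts[2])
    return sizes


def oversized(sizes, factor=1.5):
    """Locales whose installer is more than `factor` times the median size.

    The factor is an arbitrary illustrative threshold, not a project setting.
    """
    limit = factor * median(sizes.values())
    return sorted(loc for loc, size in sizes.items() if size > limit)


print(oversized(parse_listing(LISTING)))  # → ['ro', 'ru', 'si']
```

The median only works as a baseline while most installers in the batch are still clean; comparing against the en-US installer size would be more robust.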

I'm guessing something is wrong in http://mxr.mozilla.org/mozilla-release/source/toolkit/locales/l10n.mk, but I'm not sure how to find what that is. glandium, I'm hoping you might have this code already swapped into your head.
Flags: needinfo?(mh+mozilla)
Summary: Release repacks compounding in size → Release repacks compound in size after errors
Relevant part of the log:

Everything is Ok
MAR=e:/builds/moz2_slave/rel-m-beta-w32_rpk_10-00000000/mozilla-beta/obj-l10n/dist/host/bin/mar.exe \
	  /e/builds/moz2_slave/rel-m-beta-w32_rpk_10-00000000/mozilla-beta/tools/update-packaging/make_full_update.sh \
	  "/e/builds/moz2_slave/rel-m-beta-w32_rpk_10-00000000/mozilla-beta/obj-l10n/browser/locales/../../dist/update/win32/ta-LK//firefox-23.0b9.complete.mar" \
	  "/e/builds/moz2_slave/rel-m-beta-w32_rpk_10-00000000/mozilla-beta/obj-l10n/browser/locales/../../dist/l10n-stage/firefox"
python /e/builds/moz2_slave/rel-m-beta-w32_rpk_10-00000000/scripts/release/signing/signtool.py --cachedir /e/builds/moz2_slave/rel-m-beta-w32_rpk_10-00000000/signing_cache -t /e/builds/moz2_slave/rel-m-beta-w32_rpk_10-00000000/token -n /e/builds/moz2_slave/rel-m-beta-w32_rpk_10-00000000/nonce -c /e/builds/moz2_slave/rel-m-beta-w32_rpk_10-00000000/scripts/release/signing/host.cert -H signing4.srv.releng.scl3.mozilla.com:9120 -H signing5.srv.releng.scl3.mozilla.com:9120 -H signing6.srv.releng.scl3.mozilla.com:9120 -f mar "/e/builds/moz2_slave/rel-m-beta-w32_rpk_10-00000000/mozilla-beta/obj-l10n/browser/locales/../../dist/update/win32/ta-LK//firefox-23.0b9.complete.mar"
Traceback (most recent call last):
  File "e:/builds/moz2_slave/rel-m-beta-w32_rpk_10-00000000/scripts/release/signing/signtool.py", line 194, in <module>
    main()
  File "e:/builds/moz2_slave/rel-m-beta-w32_rpk_10-00000000/scripts/release/signing/signtool.py", line 182, in main
    if not remote_signfile(options, urls, f, fmt, token, dest):
  File "e:\builds\moz2_slave\rel-m-beta-w32_rpk_10-00000000\scripts\lib\python\signing\client.py", line 44, in remote_signfile
    filehash = sha1sum(filename)
  File "e:\builds\moz2_slave\rel-m-beta-w32_rpk_10-00000000\scripts\lib\python\util\file.py", line 44, in sha1sum
    fp = open(f, 'rb')
IOError: [Errno 2] No such file or directory: 'e:/builds/moz2_slave/rel-m-beta-w32_rpk_10-00000000/mozilla-beta/obj-l10n/browser/locales/../../dist/update/win32/ta-LK//firefox-23.0b9.complete.mar'
make[3]: Leaving directory `/e/builds/moz2_slave/rel-m-beta-w32_rpk_10-00000000/mozilla-beta/obj-l10n/tools/update-packaging'
make[2]: Leaving directory `/e/builds/moz2_slave/rel-m-beta-w32_rpk_10-00000000/mozilla-beta/obj-l10n/browser/locales'
make[1]: Leaving directory `/e/builds/moz2_slave/rel-m-beta-w32_rpk_10-00000000/mozilla-beta/obj-l10n/browser/locales'
make[3]: *** [complete-patch] Error 1
make[2]: *** [repackage-zip] Error 2
make[1]: *** [repackage-win32-installer] Error 2
make: *** [repackage-win32-installer-ta-LK] Error 2

make_full_update.sh is supposed to create the file that is said not to exist later... That being said... why are these steps using make instead of pymake (or, at least, the same thing as the rest of the builds)?
Flags: needinfo?(mh+mozilla)
The make vs pymake may just be that 23.0b9 was some time ago now. The 25.0.1 log looks like a pymake call.

I'm less concerned about the intermittent causes of the errors (wonky build slave and bash crashing, respectively) than about the l10n build system being robust against those failures. Do we need another rm on l10n-stage very early in the make-installers-% target, perhaps?
tools/update-packaging/make_full_update.sh is probably not failing when it should.
I would characterize it like this:
* the release repacks are generated by looping over a list of locales
* if there is an error in make_full_update.sh then the current locale is aborted (it does fail properly)
* the loop continues to the next locale but the state is not reset properly, and the installer contains more than one locale
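
The three points above can be sketched as a loop that resets the staging directory before every locale, so an aborted locale cannot leak files into the next one. This is a minimal illustration, not the actual release automation: repack_all, fake_repack and the .txt files are hypothetical stand-ins, and only the l10n-stage directory name comes from the logs.

```python
import shutil
import tempfile
from pathlib import Path


def repack_all(locales, stage_dir, repack_one):
    """Repack each locale, wiping stage_dir first so failures cannot leak files."""
    failed = []
    for locale in locales:
        # Reset state left behind by the previous locale. If the directory
        # cannot be removed, rmtree raises and the whole run stops loudly.
        if stage_dir.exists():
            shutil.rmtree(stage_dir)
        stage_dir.mkdir(parents=True)
        try:
            repack_one(locale, stage_dir)
        except Exception as exc:
            failed.append((locale, str(exc)))  # record it, move to next locale
    return failed


def demo():
    """Simulate a crash in 'rm' and record what each good locale staged."""
    seen = {}

    def fake_repack(locale, stage):
        # Hypothetical stand-in for the real repack step.
        (stage / f"{locale}.txt").write_text(locale)
        if locale == "rm":
            raise RuntimeError("simulated bash crash")
        seen[locale] = sorted(p.name for p in stage.iterdir())

    with tempfile.TemporaryDirectory() as tmp:
        stage = Path(tmp) / "l10n-stage"
        failed = repack_all(["pt-PT", "rm", "ro"], stage, fake_repack)
    return failed, seen


failed, seen = demo()
print(failed)  # → [('rm', 'simulated bash crash')]
print(seen)    # → {'pt-PT': ['pt-PT.txt'], 'ro': ['ro.txt']}
```

Without the rmtree at the top of the loop, the ro entry would also list rm.txt, which is the compounding seen in the uploaded installers.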
make_full_update.sh itself doesn't fail properly: it apparently doesn't create the file, and it's something else using that file later on that makes the build fail "properly". So there *is* something wrong happening that is not being reported, whether in make_full_update.sh itself or in something that runs after it and before the actual failure.
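
One way to catch a step that exits 0 without producing its output is to check for the file right after the step runs, rather than waiting for a later consumer to trip over it. The sketch below assumes a hypothetical helper, run_and_verify, wrapped around whatever command creates the file; it is not how the build system currently invokes make_full_update.sh.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_and_verify(cmd, expected_output):
    """Run cmd; fail if it exits non-zero OR the expected file is missing/empty."""
    expected_output = Path(expected_output)
    if expected_output.exists():
        expected_output.unlink()  # a stale file must not mask a failure
    result = subprocess.run(cmd)
    if result.returncode != 0:
        raise RuntimeError(f"{cmd[0]} exited with status {result.returncode}")
    if not expected_output.is_file() or expected_output.stat().st_size == 0:
        raise RuntimeError(f"{cmd[0]} exited 0 but did not create {expected_output}")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "example.complete.mar"  # hypothetical output name
        # Stand-in for a step that "succeeds" without writing its output.
        try:
            run_and_verify([sys.executable, "-c", "pass"], out)
        except RuntimeError as exc:
            print("caught:", exc)
```

With a check like this the failure is attributed to the step that was supposed to create the mar, instead of surfacing later as an IOError in signtool.py.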
in 25.0.1 case bash.exe segfaulted iirc
I believe this is fixed in mozharness based l10n repacks.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED