Closed Bug 1008052 Opened 11 years ago Closed 10 years ago

Mac repacks sometimes fail, due to signing server issue

Categories

(Release Engineering :: General, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Unassigned)

References

Details

Attachments

(4 files)

eg this for 30.0b3 repacks for mac, ko locale: cd ./firefox && python /builds/slave/rel-m-beta-osx64_rpk_6-0000000/scripts/release/signing/signtool.py --cachedir /builds/slave/rel-m-beta-osx64_rpk_6-0000000/signing_cache -t /builds/slave/rel-m-beta-osx64_rpk_6-0000000/token -n /builds/slave/rel-m-beta-osx64_rpk_6-0000000/nonce -c /builds/slave/rel-m-beta-osx64_rpk_6-0000000/scripts/release/signing/host.cert -H mac-signing3.build.scl1.mozilla.com:9120 -H mac-signing4.build.scl1.mozilla.com:9120 -H mac-signing1.srv.releng.scl3.mozilla.com:9120 -H mac-signing2.srv.releng.scl3.mozilla.com:9120 -f dmg Firefox.app && rm Firefox.app/Contents/CodeResources && cp Firefox.app/Contents/_CodeSignature/CodeResources Firefox.app/Contents && cd /builds/slave/rel-m-beta-osx64_rpk_6-0000000/mozilla-beta/obj-l10n/dist/l10n-stage && /builds/slave/rel-m-beta-osx64_rpk_6-0000000/mozilla-beta/build/package/mac_osx/pkg-dmg --source 'firefox' --target 'mac/ko/Firefox 30.0b3.dmg' --volname 'Firefox' --copy '/builds/slave/rel-m-beta-osx64_rpk_6-0000000/mozilla-beta/obj-l10n/dist/branding/dsstore:/.DS_Store' --mkdir /.background --copy '/builds/slave/rel-m-beta-osx64_rpk_6-0000000/mozilla-beta/obj-l10n/dist/branding/background.png:/.background' --icon '/builds/slave/rel-m-beta-osx64_rpk_6-0000000/mozilla-beta/obj-l10n/dist/branding/disk.icns' --symlink '/Applications:/ ' 2014-05-08 14:47:46,950 - 518a7ff8d0ff9fb3629ecb680d763e6904d22c95: processing Firefox.app.tar.gz on https://mac-signing4.build.scl1.mozilla.com:9120 2014-05-08 14:47:49,597 - 518a7ff8d0ff9fb3629ecb680d763e6904d22c95: uploading for signing 2014-05-08 14:49:59,486 - 518a7ff8d0ff9fb3629ecb680d763e6904d22c95: processing Firefox.app.tar.gz on https://mac-signing4.build.scl1.mozilla.com:9120 2014-05-08 14:51:05,944 - 518a7ff8d0ff9fb3629ecb680d763e6904d22c95: processing Firefox.app.tar.gz on https://mac-signing4.build.scl1.mozilla.com:9120 2014-05-08 14:52:16,401 - 518a7ff8d0ff9fb3629ecb680d763e6904d22c95: OK 2014-05-08 14:52:16,401 - Copying Firefox.app.tar.gz to cache /builds/slave/rel-m-beta-osx64_rpk_6-0000000/signing_cache/dmg/518a7ff8d0ff9fb3629ecb680d763e6904d22c95 Firefox.app/Contents/MacOS/updater.app/Contents/MacOS/updater: (Empty error message) tar: Error exit delayed from previous errors. 2014-05-08 14:52:17,631 - Error unpacking tar file /builds/slave/rel-m-beta-osx64_rpk_6-0000000/mozilla-beta/obj-l10n/dist/l10n-stage/firefox/Firefox.app.tar.gz to /builds/slave/rel-m-beta-osx64_rpk_6-0000000/mozilla-beta/obj-l10n/dist/l10n-stage/firefox We also hit this the end of last week on Aurora when it was time to re-enable updates after the merge. Looking in the signing cache for the log above (perms + ownership removed for brevity): $ ls -lS total 605984 63412144 8 May 14:37 4a7b103cd717df7c189fda40815507f1ade56d4d 63379898 8 May 14:26 557a8c19653236e6ed3f0ff40fe2d27e47a56627 63178808 8 May 14:43 cfb82d1f71ef47c3992b25d52302abdcb0b7d344 63166019 8 May 14:32 5e8da215d3d8c28770633352bd6f7c83edaeb7f1 57118720 8 May 14:52 518a7ff8d0ff9fb3629ecb680d763e6904d22c95 ie. the file is truncated. Also: $ tar tfz 518a7ff8d0ff9fb3629ecb680d763e6904d22c95 | tail Firefox.app/Contents/MacOS/updater.app/Contents/Resources/English.lproj/MainMenu.nib/classes.nib Firefox.app/Contents/MacOS/updater.app/Contents/Resources/English.lproj/MainMenu.nib/info.nib Firefox.app/Contents/MacOS/updater.app/Contents/Resources/English.lproj/MainMenu.nib/keyedobjects.nib Firefox.app/Contents/MacOS/updater.app/Contents/MacOS/updater tar: Truncated input file (needed 160768 bytes, only 0 available) tar: Error exit delayed from previous errors.
Somehow this also causes later jobs to burn, even though we'd be signing a different tarball due to localized content. eg for the 30.0b3 example: mv /builds/slave/rel-m-beta-osx64_rpk_6-0000000/mozilla-beta/obj-l10n/dist/l10n-stage/firefox/Firefox.app/Contents/Resources/en.lproj /builds/slave/rel-m-beta-osx64_rpk_6-0000000/mozilla-beta/obj-l10n/dist/l10n-stage/firefox/Firefox.app/Contents/Resources/ku.lproj mv: rename /builds/slave/rel-m-beta-osx64_rpk_6-0000000/mozilla-beta/obj-l10n/dist/l10n-stage/firefox/Firefox.app/Contents/Resources/en.lproj to /builds/slave/rel-m-beta-osx64_rpk_6-0000000/mozilla-beta/obj-l10n/dist/l10n-stage/firefox/Firefox.app/Contents/Resources/ku.lproj: No such file or directory make[1]: *** [repackage-zip] Error 1 make: *** [repackage-zip-ku] Error 2 Full log: http://ftp.mozilla.org/pub/mozilla.org/firefox/candidates/30.0b3-candidates/build1/logs/release-mozilla-beta-macosx64_repack_6-bm84-build1-build9.txt.gz
Grepping over the signing server log, we have a timeout on the signing call: (egrep '^2014-05-08.*(12029|12334|10.26.52.74|518a7ff8d0ff9fb3629ecb680d763e6904d22c95)' signing.log.1) # has it been signed before ? 2014-05-08 14:47:49,594 - INFO - 10.26.52.74 GET /sign/dmg/518a7ff8d0ff9fb3629ecb680d763e6904d22c95 2014-05-08 14:47:49,595 - DEBUG - Looking for /builds/signing/rel-key-signing-server/signed-files/dmg/518a7ff8d0ff9fb3629ecb680d763e6904d22c95 2014-05-08 14:47:49,595 - DEBUG - /builds/signing/rel-key-signing-server/signed-files/dmg/518a7ff8d0ff9fb3629ecb680d763e6904d22c95 is missing 2014-05-08 14:47:49,595 - INFO - 10.26.52.74 - - [2014-05-08 14:47:49] "GET /sign/dmg/518a7ff8d0ff9fb3629ecb680d763e6904d22c95 HTTP/1.1" 404 96 0.001664 # nope, I'll post it for signing then 2014-05-08 14:48:19,875 - INFO - 10.26.52.74 POST /sign/dmg 2014-05-08 14:49:42,059 - INFO - Request to dmg sign Firefox.app.tar.gz (518a7ff8d0ff9fb3629ecb680d763e6904d22c95) from 10.26.52.74 2014-05-08 14:49:58,355 - DEBUG - Putting (u'518a7ff8d0ff9fb3629ecb680d763e6904d22c95', u'Firefox.app.tar.gz', 'dmg', <gevent.event.Event object at 0x101344810>) on the queue 2014-05-08 14:49:58,447 - INFO - 10.26.52.74 - - [2014-05-08 14:49:58] "POST /sign/dmg HTTP/1.1" 202 98 98.572016 # signing starts here, I think 2014-05-08 14:49:58,577 - INFO - Signing Firefox.app.tar.gz (dmg - 518a7ff8d0ff9fb3629ecb680d763e6904d22c95) 2014-05-08 14:49:58,776 - DEBUG - 12029: ['/builds/signing/rel-key-signing-server/bin/python2.7', [redacted] # is it signed yet ? no, but we're working on it 2014-05-08 14:50:04,579 - INFO - 10.26.52.74 GET /sign/dmg/518a7ff8d0ff9fb3629ecb680d763e6904d22c95 2014-05-08 14:51:04,753 - DEBUG - Looking for /builds/signing/rel-key-signing-server/signed-files/dmg/518a7ff8d0ff9fb3629ecb680d763e6904d22c95 (Firefox.app.tar.gz) 2014-05-08 14:51:04,881 - DEBUG - /builds/signing/rel-key-signing-server/signed-files/dmg/518a7ff8d0ff9fb3629ecb680d763e6904d22c95 is missing 2014-05-08 14:51:04,882 - DEBUG - Pending: {(u'518a7ff8d0ff9fb3629ecb680d763e6904d22c95', 'dmg'): <gevent.event.Event object at 0x101344810>, (u'c697447bd62f33c7e39ae18f6e7435beee14aee9', 'dmg'): <gevent.event.Event object at 0x101344cd0>} # how about nao ? patience, grasshopper 2014-05-08 14:51:04,883 - INFO - 10.26.52.74 - - [2014-05-08 14:51:04] "GET /sign/dmg/518a7ff8d0ff9fb3629ecb680d763e6904d22c95 HTTP/1.1" 404 113 60.303492 2014-05-08 14:51:06,467 - INFO - 10.26.52.74 GET /sign/dmg/518a7ff8d0ff9fb3629ecb680d763e6904d22c95 # Y U take longer than two mins to sign !?!? 2014-05-08 14:51:58,910 - DEBUG - 12029: Exceeded timeout 2014-05-08 14:51:59,911 - INFO - 12029: run_signscript: Failed with rc -1; retrying in a bit # try to sign again 2014-05-08 14:52:09,175 - DEBUG - 12334: ['/builds/signing/rel-key-signing-server/bin/python2.7', [redacted] # partial content written ? definitely the wrong number of bytes returned # we sent 57118859, slave got 57118720 2014-05-08 14:52:09,280 - DEBUG - Looking for /builds/signing/rel-key-signing-server/signed-files/dmg/518a7ff8d0ff9fb3629ecb680d763e6904d22c95 (Firefox.app.tar.gz) 2014-05-08 14:52:14,099 - DEBUG - /builds/signing/rel-key-signing-server/signed-files/dmg/518a7ff8d0ff9fb3629ecb680d763e6904d22c95 is OK 2014-05-08 14:52:16,144 - INFO - 10.26.52.74 - - [2014-05-08 14:52:16] "GET /sign/dmg/518a7ff8d0ff9fb3629ecb680d763e6904d22c95 HTTP/1.1" 200 57118859 69.679274 2014-05-08 14:54:10,192 - DEBUG - Got message: ('done', (u'518a7ff8d0ff9fb3629ecb680d763e6904d22c95', u'Firefox.app.tar.gz', 'dmg', <gevent.event.Event object at 0x101344810>), 'eb9e324616f6b0d92a47f0157734fa5eb3cf54e2') 2014-05-08 14:55:22,647 - INFO - Deleting /builds/signing/rel-key-signing-server/signed-files/dmg/518a7ff8d0ff9fb3629ecb680d763e6904d22c95.out with no unsigned file Perhaps there is a problem handling the retry case. Might need to ensure that the destination file doesn't exist after killing the processes. Chris, any idea what's going on ?
Flags: needinfo?(catlee)
Summary: Mac repacks sometimes fail → Mac repacks sometimes fail, due to signing server issue
FWIW, there are 161 matches to 'Exceeded timeout' for the nightly signing server on mac-signing4, over a period of 7 days. Seems to start on 2014-01-14 03:01:07; curiously bug 945751 landed the day before.
Other servers have different dates for first 'Exceeded timeout' messages, looking at just the nightly case: mac-signing1: 2014-02-12 02:52:44 mac-signing2: 2014-01-14 03:00:26 mac-signing3: 2014-01-21 02:12:55 mac-signing4: 2014-01-14 03:01:07, 1383 times since signing4: 2014-02-24, only 6 in total signing5: 2014-01-14, only 14 in total signing6: 2014-01-27, only 5 in total We have less timeouts on mac-signing4's dep signer, but still 54 times in the last 3 days, so this seems be load related.
CCing Tracy since he's driving Firefox 30.
We may have hit a new symptom of this in 30.0b4, where a repack failed to unpack the signed complete mar: command: START command: perl ../../../tools/update-packaging/unwrap_full_update.pl /builds/slave/rel-m-beta-osx64_rpk_3-0000000/mozilla-beta/obj-l10n/dist/update/mac/en-GB/firefox-30.0b4.complete.mar command: cwd: /builds/slave/rel-m-beta-osx64_rpk_3-0000000/mozilla-beta/obj-l10n/dist/current command: env: {'MOZ_MAKE_COMPLETE_MAR': '1', 'AB_CD': 'en-GB', 'MOZ_SIGN_CMD': 'python /builds/slave/rel-m-beta-osx64_rpk_3-0000000/scripts/release/signing/signtool.py --cachedir /builds/slave/rel-m-beta-osx64_rpk_3-0000000/signing_cache -t /builds/slave/rel-m-beta-osx64_rpk_3-0000000/token -n /builds/slave/rel-m-beta-osx64_rpk_3-0000000/nonce -c /builds/slave/rel-m-beta-osx64_rpk_3-0000000/scripts/release/signing/host.cert -H mac-signing3.build.scl1.mozilla.com:9120 -H mac-signing4.build.scl1.mozilla.com:9120 -H mac-signing1.srv.releng.scl3.mozilla.com:9120 -H mac-signing2.srv.releng.scl3.mozilla.com:9120', 'UPLOAD_SSH_KEY': '~/.ssh/ffxbld_dsa', 'LOCALE_MERGEDIR': '/builds/slave/rel-m-beta-osx64_rpk_3-0000000/mozilla-beta/obj-l10n/browser/locales/merged', 'MOZ_PKG_PLATFORM': 'mac', 'COMM_REV': 'FIREFOX_30_0b4_RELEASE', 'MOZ_OBJDIR': 'obj-l10n', 'MAR': '/builds/slave/rel-m-beta-osx64_rpk_3-0000000/mozilla-beta/obj-l10n/dist/host/bin/mar', 'UPLOAD_TO_TEMP': '1', 'LD_LIBRARY_PATH': '', 'ZIP_IN': '/builds/slave/rel-m-beta-osx64_rpk_3-0000000/firefox.dmg', 'UPLOAD_USER': 'ffxbld', 'MBSDIFF': '/builds/slave/rel-m-beta-osx64_rpk_3-0000000/mozilla-beta/obj-l10n/dist/host/bin/mbsdiff', 'UPLOAD_EXTRA_FILES': 'update/mac/el/firefox-30.0b3-30.0b4.partial.mar update/mac/el/firefox-30.0b3-30.0b4.partial.mar.asc update/mac/el/firefox-30.0b2-30.0b4.partial.mar update/mac/el/firefox-30.0b2-30.0b4.partial.mar.asc', 'UPLOAD_HOST': 'stage.mozilla.org', 'POST_UPLOAD_CMD': 'post_upload.py -p firefox -n 1 -v 30.0b4 --release-to-candidates-dir --signed', 'MOZILLA_REV': 'FIREFOX_30_0b4_RELEASE', 'MOZ_PKG_PRETTYNAMES': '1', 'MOZ_PKG_VERSION': '30.0b4'} command: output: bzip2: Data integrity error when decompressing. Input file = Contents/MacOS/XUL.bz2, output file = Contents/MacOS/XUL It is possible that the compressed file(s) have become corrupted. You can use the -tvv option to test integrity of such files. You can use the `bzip2recover' program to attempt to recover data from undamaged sections of corrupted files. bzip2: Deleting output file Contents/MacOS/XUL, if it exists. Couldn't decompress "Contents/MacOS/XUL" at ../../../tools/update-packaging/unwrap_full_update.pl line 64.
Some signing server logs from the last failure: signing.log.1:2014-05-13 08:09:31,825 - INFO - 10.26.52.108 GET /sign/mar/9f24babe97415d9d5fe2cc88f871852fdeb4b22e signing.log.1:2014-05-13 08:09:31,825 - DEBUG - Looking for /builds/signing/rel-key-signing-server/signed-files/mar/9f24babe97415d9d5fe2cc88f871852fdeb4b22e signing.log.1:2014-05-13 08:09:31,825 - DEBUG - /builds/signing/rel-key-signing-server/signed-files/mar/9f24babe97415d9d5fe2cc88f871852fdeb4b22e is missing signing.log.1:2014-05-13 08:09:31,826 - INFO - 10.26.52.108 - - [2014-05-13 08:09:31] "GET /sign/mar/9f24babe97415d9d5fe2cc88f871852fdeb4b22e HTTP/1.1" 404 96 0.001439 signing.log.1:2014-05-13 08:09:41,888 - INFO - Request to mar sign firefox-30.0b4.complete.mar (9f24babe97415d9d5fe2cc88f871852fdeb4b22e) from 10.26.52.108 signing.log.1:2014-05-13 08:09:43,570 - DEBUG - Putting (u'9f24babe97415d9d5fe2cc88f871852fdeb4b22e', u'firefox-30.0b4.complete.mar', 'mar', <gevent.event.Event object at 0x10134e990>) on the queue signing.log.1:2014-05-13 08:09:43,576 - INFO - Signing firefox-30.0b4.complete.mar (mar - 9f24babe97415d9d5fe2cc88f871852fdeb4b22e) signing.log.1:2014-05-13 08:09:43,586 - DEBUG - 9619: ['/builds/signing/rel-key-signing-server/bin/python2.7', '/builds/signing/rel-key-signing-server/tools/release/signing/signscript.py', '-c', '/builds/signing/rel-key-signing-server/signscript.ini', 'mar', u'/builds/signing/rel-key-signing-server/unsigned-files/9f24babe97415d9d5fe2cc88f871852fdeb4b22e', u'/builds/signing/rel-key-signing-server/signed-files/mar/9f24babe97415d9d5fe2cc88f871852fdeb4b22e', u'firefox-30.0b4.complete.mar'] signing.log.1:2014-05-13 08:09:45,275 - INFO - 10.26.52.108 GET /sign/mar/9f24babe97415d9d5fe2cc88f871852fdeb4b22e signing.log.1:2014-05-13 08:09:51,331 - DEBUG - Got message: ('done', (u'9f24babe97415d9d5fe2cc88f871852fdeb4b22e', u'firefox-30.0b4.complete.mar', 'mar', <gevent.event.Event object at 0x10134e990>), '17718e0fbdd188115a4373e390459f112b70b8a4') signing.log.1:2014-05-13 08:09:51,333 - DEBUG - Looking for /builds/signing/rel-key-signing-server/signed-files/mar/9f24babe97415d9d5fe2cc88f871852fdeb4b22e (firefox-30.0b4.complete.mar) signing.log.1:2014-05-13 08:09:51,615 - DEBUG - /builds/signing/rel-key-signing-server/signed-files/mar/9f24babe97415d9d5fe2cc88f871852fdeb4b22e is OK signing.log.1:2014-05-13 08:11:05,441 - INFO - 10.26.52.108 - - [2014-05-13 08:11:05] "GET /sign/mar/9f24babe97415d9d5fe2cc88f871852fdeb4b22e HTTP/1.1" 200 58067179 80.166659 signing.log.1:2014-05-13 08:11:05,984 - INFO - 10.26.52.108 GET /sign/mar/9f24babe97415d9d5fe2cc88f871852fdeb4b22e signing.log.1:2014-05-13 08:11:05,984 - DEBUG - Looking for /builds/signing/rel-key-signing-server/signed-files/mar/9f24babe97415d9d5fe2cc88f871852fdeb4b22e (firefox-30.0b4.complete.mar) signing.log.1:2014-05-13 08:11:06,253 - DEBUG - /builds/signing/rel-key-signing-server/signed-files/mar/9f24babe97415d9d5fe2cc88f871852fdeb4b22e is OK signing.log.1:2014-05-13 08:11:38,105 - INFO - 10.26.52.108 - - [2014-05-13 08:11:38] "GET /sign/mar/9f24babe97415d9d5fe2cc88f871852fdeb4b22e HTTP/1.1" 200 58067179 32.122056 signing.log.1:2014-05-13 08:19:22,733 - INFO - Deleting /builds/signing/rel-key-signing-server/signed-files/mar/9f24babe97415d9d5fe2cc88f871852fdeb4b22e.out with no unsigned file I don't see anything unusual there...
Flags: needinfo?(catlee)
I'm beginning to think this is a mac-signing4 issue. The early comments here are for that machine, and for 32.0b6 we hit several failures on it. There's also a weird "can't launch python" issue on bug 1037260. Right now the release signer is stopped on it.
See Also: → mac-v2-signing4
We keep logs for 7 days, and hit 50 failures in 884 jobs that tried to sign (1635 total, caching must be working). Signing server breakdown: 3 on mac-signing2 8 on mac-singing3 39 on mac-signing4 So mac-signing4 isn't the only bad guy, just the baddest. The distribution of requests is fairly even over the three servers (295/317/272 respectively). Probably dumb idea, but what about bumping the signing timeout and retest in a few days ?
(In reply to Nick Thomas [:nthomas] from comment #10) > Created attachment 8471992 [details] > Recent l10n nightly failures > > We keep logs for 7 days, and hit 50 failures in 884 jobs that tried to sign > (1635 total, caching must be working). Signing server breakdown: > 3 on mac-signing2 > 8 on mac-singing3 > 39 on mac-signing4 > > So mac-signing4 isn't the only bad guy, just the baddest. The distribution > of requests is fairly even over the three servers (295/317/272 > respectively). > > Probably dumb idea, but what about bumping the signing timeout and retest in > a few days ? Well, it looks like the timeout is a mere 120 seconds right now (http://mxr.mozilla.org/build-central/source/tools/lib/python/signing/server.py#77). Seems worthwhile to try, to me. Even a 5 minute timeout doesn't seem crazy.
Blocks: 1053813
I tried manually running signscript for dmg signing, it takes 26 seconds or so on an idle mac-signing4. We allow up to 4 workers per set of certs, so in theory there could be 12 workers running simultaneously over 3 signing servers per host. In practice that would closer to 4-6, as I expect the number of dep jobs to be 0 to 2 at any one time, and usually no release jobs; and some of those would be much cheaper signing like mar or gpg. We get queues in the nightly signer when l10n is coming through. As for how long it actually takes, I looked at the signing log and we'd have to parse for pairs of lines like 2014-08-10 04:01:19,288 - DEBUG - 78329: ['/builds/signing/nightly-key-signing-server/bin/python2.7', '/builds/signing/nightly-key-signing-server/tools/release/signing/signscript.py', '-c', '/builds/signing/nightly-key-signing-server/signscript.ini', 'gpg', u'/builds/signing/nightly-key-signing-server/unsigned-files/09d3271c5fef0bba6099865e4619939caabd25c7', u'/builds/signing/nightly-key-signing-server/signed-files/gpg/09d3271c5fef0bba6099865e4619939caabd25c7', u'thunderbird-33.0a2.gd.mac.checksums'] 2014-08-10 04:01:25,067 - DEBUG - 78329: Success! taking care to match the pids and check the signing type. Then track that over the day, including the # of concurrent processes. Urgh.
Interrupting the signscript process with SIGTERM leaves behind a <hash>.out file, so it seems that we won't return truncated content if we hit the 120s timeout as we're writing the signed tar.gz. The underlying cause is still unknown.
An arbitrary factor of 2, in lieu of actually understanding the problem.
Attachment #8473484 - Flags: review?(bhearsum)
Comment on attachment 8473484 [details] [diff] [review] [tools] bump timeout to 4 mins Review of attachment 8473484 [details] [diff] [review]: ----------------------------------------------------------------- Might I suggest some duct tape or glue to help this stick to the wall?
Attachment #8473484 - Flags: review?(bhearsum) → review+
(In reply to Nick Thomas [:nthomas] from comment #13) > I tried manually running signscript for dmg signing, it takes 26 seconds or > so on an idle mac-signing4. We allow up to 4 workers per set of certs, so in > theory there could be 12 workers running simultaneously over 3 signing > servers per host. In practice that would closer to 4-6, as I expect the > number of dep jobs to be 0 to 2 at any one time, and usually no release > jobs; and some of those would be much cheaper signing like mar or gpg. We > get queues in the nightly signer when l10n is coming through. I wonder how linearly load scales here. The 10.6 signing machines only have 2 cores, so we're probably thrashing the CPU if we have more than 2 workers running (especially during taring/untaring). I was wondering about I/O performance as well, but I feel like taring and untaring would be more CPU bound than I/O bound...
Bah! Can't deploy this to a new branch in the tools repo because it has the single hook on it.
Running into quite a bit of trouble with this today, so emergency halving of concurrency applied. https://hg.mozilla.org/build/puppet/rev/f7870cf2cbe2 https://hg.mozilla.org/build/puppet/rev/98eee0418b39 I think the issue might just be mac-v2-signing3, but this lets me recover better.
Attachment #8505233 - Flags: checked-in+
(In reply to Nick Thomas [:nthomas] from comment #21) > Created attachment 8505233 [details] [diff] [review] > [puppet] Halve concurrency to 2 on 10.9 mac signing servers > > Running into quite a bit of trouble with this today, so emergency halving of > concurrency applied. > > https://hg.mozilla.org/build/puppet/rev/f7870cf2cbe2 > https://hg.mozilla.org/build/puppet/rev/98eee0418b39 > > I think the issue might just be mac-v2-signing3, but this lets me recover > better. Thanks Nick. The last alert I see is about an hour after this landed (which is many hours ago at this point), so it looks like this helped. We're migrating one of the 10.6 signing machines to 10.9 to help take on load here - that's been tracked in bug 1082685. I'm thinking we should pull mac-v2-signing3 out of the pool and run some diagnostics when that's complete - it certainly does seem like it's the one having issues. I filed that as bug 1083156, though it could arguably be done here...
Good call on the mac-v2-signing3 diagnostics.
Hit this a few times in chemspills today.
This is hitting a second day in a row. We need to figure out why things are busted, and fix it.
Catlee realized that this block is eating all exceptions that happen during dmg_signpackage. It turns out that this was originally fixed in bug 757684 (https://github.com/mozilla/build-tools/commit/ca4907e889b45e1e79592d0f24fff070e9571fb6), but it got lost when the code was refactored in bug 730972 (https://github.com/mozilla/build-tools/commit/e863153c0760e4ea92b5c83df943fec26db68cc2#diff-95d18d8a67730330e3a45e4a5d75de72R258). This should re-fix the bug. It won't stay the signing from failing, but the signing server will return an error and the client will retry.
Attachment #8640065 - Flags: review?(catlee)
Attachment #8640065 - Flags: review?(catlee) → review+
Attachment #8640065 - Flags: checked-in+
I forgot to update the tools checkouts on the signing servers for this yesterday. I did this just now, so the fix should be live. I'm going to watch some l10n logs to see if I can catch them retrying for verification.
(In reply to Ben Hearsum [:bhearsum] from comment #28) > I forgot to update the tools checkouts on the signing servers for this > yesterday. I did this just now, so the fix should be live. I'm going to > watch some l10n logs to see if I can catch them retrying for verification. This looks like a series of retries: 06:29:33 INFO - 2015-07-29 06:29:33,989 - 12d3a5ad099aa4831c8858113badd7414f2403dc: uploading for signing 06:29:50 INFO - 2015-07-29 06:29:50,808 - 12d3a5ad099aa4831c8858113badd7414f2403dc: processing FirefoxNightly.app.tar.gz on https://mac-v2-signing4.srv.releng.scl3.mozilla.com:9100 06:30:54 INFO - 2015-07-29 06:30:54,671 - 12d3a5ad099aa4831c8858113badd7414f2403dc: processing FirefoxNightly.app.tar.gz on https://mac-v2-signing4.srv.releng.scl3.mozilla.com:9100 06:31:55 INFO - 2015-07-29 06:31:55,736 - 12d3a5ad099aa4831c8858113badd7414f2403dc: processing FirefoxNightly.app.tar.gz on https://mac-v2-signing4.srv.releng.scl3.mozilla.com:9100 06:33:01 INFO - 2015-07-29 06:33:01,885 - 12d3a5ad099aa4831c8858113badd7414f2403dc: processing FirefoxNightly.app.tar.gz on https://mac-v2-signing4.srv.releng.scl3.mozilla.com:9100 06:33:31 INFO - 2015-07-29 06:33:31,668 - 12d3a5ad099aa4831c8858113badd7414f2403dc: OK So that's a good sign.
All Aurora repacks are up-to-date today, and the mozilla-central ones are just about there (just waiting on two more chunks to complete). There's a few failures on central, but it looks totally unrelated to signing: 06:58:06 INFO - /builds/slave/m-cen-m64-l10n-ntly-1-00000000/build/mozilla-central/obj-l10n/_virtualenv/bin/python /builds/slave/m-cen-m64-l10n-ntly-1-00000000/build/mozilla-central/toolkit/mozapps/installer/l10n-repack.py /builds/slave/m-cen-m64-l10n-ntly-1-00000000/build/mozilla-central/obj-l10n/dist/l10n-stage/firefox/FirefoxNightly.app/Contents/Resources ../../dist/xpi-stage/locale-en-GB \ 06:58:06 INFO - \ 06:58:07 INFO - Error: Multiple locales aren't supported 06:58:08 ERROR - Traceback (most recent call last): 06:58:08 INFO - File "/builds/slave/m-cen-m64-l10n-ntly-1-00000000/build/mozilla-central/toolkit/mozapps/installer/l10n-repack.py", line 60, in <module> 06:58:08 INFO - main() 06:58:08 INFO - File "/builds/slave/m-cen-m64-l10n-ntly-1-00000000/build/mozilla-central/toolkit/mozapps/installer/l10n-repack.py", line 56, in main 06:58:08 INFO - non_resources=args.non_resource, non_chrome=NON_CHROME) 06:58:08 INFO - File "/builds/slave/m-cen-m64-l10n-ntly-1-00000000/build/mozilla-central/python/mozbuild/mozpack/packager/l10n.py", line 247, in repack 06:58:08 INFO - _repack(app_finder, l10n_finder, copier, formatter, non_chrome) 06:58:08 INFO - File "/tools/python/lib/python2.7/contextlib.py", line 24, in __exit__ 06:58:08 INFO - self.gen.next() 06:58:08 INFO - File "/builds/slave/m-cen-m64-l10n-ntly-1-00000000/build/mozilla-central/python/mozbuild/mozpack/errors.py", line 131, in accumulate 06:58:08 INFO - raise AccumulatedErrors() 06:58:08 INFO - mozpack.errors.AccumulatedErrors 06:58:08 ERROR - make[1]: *** [repackage-zip] Error 1 06:58:08 INFO - make: *** [repackage-zip-en-GB] Error 2 06:58:08 ERROR - Return code: 2 06:58:08 ERROR - make installers-en-GB failed Given all of this, I think we can call this fixed...
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: