Closed
Bug 1062302
Opened 10 years ago
Closed 10 years ago
mac-v2-signing servers sometimes don't return a response to the client
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: massimo, Assigned: bhearsum)
References
Details
Attachments
(5 files, 1 obsolete file)
6.40 KB,
patch
|
bhearsum
:
review+
massimo
:
checked-in+
|
Details | Diff | Splinter Review |
1.01 KB,
patch
|
rail
:
review+
bhearsum
:
checked-in+
|
Details | Diff | Splinter Review |
6.05 KB,
patch
|
rail
:
review+
bhearsum
:
checked-in-
|
Details | Diff | Splinter Review |
2.52 KB,
patch
|
rail
:
review+
bhearsum
:
checked-in-
|
Details | Diff | Splinter Review |
727 bytes,
patch
|
rail
:
review+
bhearsum
:
checked-in-
|
Details | Diff | Splinter Review |
Today we had a lot of issues with the l10n repacks for Beta 33 build2, slaves upload their binaries to the signing servers and nothing happens; buildbot kills the process after 2400 seconds of inactivity. After some investigation, bhearsum has find out that all the errors are coming from mac-v2 servers. With this patch mac-v2 servers will not be used to generate gpg and mar signatures.
Attachment #8483493 -
Flags: review?(bhearsum)
Assignee | ||
Comment 1•10 years ago
|
||
Comment on attachment 8483493 [details] [diff] [review] [puppet] disable gpg and mar from mac-v2 signing servers.patch Review of attachment 8483493 [details] [diff] [review]: ----------------------------------------------------------------- ::: modules/buildmaster/templates/passwords.py.erb @@ +18,5 @@ > ('mac-signing3.srv.releng.scl3.mozilla.com:9100', '<%= scope.function_secret(['signing_server_username']) %>', '<%= scope.function_secret(['signing_server_nightly_password']) %>', ('gpg', 'dmg', 'mar')), > ('mac-signing4.srv.releng.scl3.mozilla.com:9100', '<%= scope.function_secret(['signing_server_username']) %>', '<%= scope.function_secret(['signing_server_nightly_password']) %>', ('gpg', 'dmg', 'mar')), > + ('mac-v2-signing1.srv.releng.scl3.mozilla.com:9100', '<%= scope.function_secret(['signing_server_username']) %>', '<%= scope.function_secret(['signing_server_nightly_password']) %>', ('dmgv2')), > + ('mac-v2-signing2.srv.releng.scl3.mozilla.com:9100', '<%= scope.function_secret(['signing_server_username']) %>', '<%= scope.function_secret(['signing_server_nightly_password']) %>', ('dmgv2')), > + ('mac-v2-signing3.srv.releng.scl3.mozilla.com:9100', '<%= scope.function_secret(['signing_server_username']) %>', '<%= scope.function_secret(['signing_server_nightly_password']) %>', ('dmgv2')), You need a comm after 'dmgv2', otherwise this tuple turns into a string, and 'd', 'm', 'g', 'v', and '2' will all be interpreted as signing formats. r=me with this fixed. Feel free to fix upon landing.
Attachment #8483493 -
Flags: review?(bhearsum) → review+
Reporter | ||
Comment 2•10 years ago
|
||
Comment on attachment 8483493 [details] [diff] [review] [puppet] disable gpg and mar from mac-v2 signing servers.patch Fixed and landed.
Attachment #8483493 -
Flags: checked-in+
Assignee | ||
Comment 3•10 years ago
|
||
Updating summary to make it a bit more generic. The specific conditions we saw were the mac-v2-signing servers under very heavy load when gpg+mar signing was enabled _and_ a release _and_ nightlies were running. I'm not sure why they would be under more pressure than, say, the v1 mac signing servers though. Leaving this open in case we need to debug further.
Summary: Disable "gpg" and "mar" from mac-v2 signing servers → mac-v2-signing servers died under load
Assignee | ||
Updated•10 years ago
|
Component: Release Automation → General Automation
QA Contact: bhearsum → catlee
Assignee | ||
Comment 4•10 years ago
|
||
A quick look at top is showing pretty high CPU usage with no signing happening, most of it coming from "systemstats": Processes: 131 total, 3 running, 2 stuck, 126 sleeping, 440 threads 12:53:23 Load Avg: 1.61, 1.79, 1.76 CPU usage: 8.0% user, 48.44% sys, 43.55% idle SharedLibs: 22M resident, 0B data, 2316K linkedit. MemRegions: 21323 total, 5824M resident, 6536K private, 64M shared. PhysMem: 7182M used (1209M wired), 6788K unused. VM: 333G vsize, 1310M framework vsize, 111381508(896) swapins, 112102954(0) swapouts. Networks: packets: 152779503/157G in, 140299072/130G out. Disks: 10672671/545G read, 20293314/1204G written. PID COMMAND %CPU TIME #TH #WQ #PORTS #MREG MEM RPRVT PURG CMPRS VPRVT VSIZE PGRP PPID 99920 xpcd 0.0 39:27.36 2 0 50161 33 8192B 0B 0B 1072K 22M 2389M 99920 1 99543 python 0.0 00:00.00 1 0 7 149 4096B 0B 0B 1096K 176K 2395M 99533 99534 99542 python 0.0 00:00.00 1 0 7 149 4096B 0B 0B 1100K 176K 2395M 99533 99534 99534 python 0.1 05:02:36 4 0 10 202 2580K 2016K 0B 44M 50M 2443M 99533 1 99446 cfprefsd 0.0 00:00.00 2 1 30 34 4096B 0B 0B 520K 23M 2389M 99446 99399 99445 distnoted 0.0 00:00.35 2 0 41 32 4096B 0B 0B 596K 22M 2411M 99445 99399 99439 XPCTimeStamp 0.0 00:05.58 4 1 68 60 156K 68K 0B 2196K 25M 2414M 99439 1 99399 launchd 0.0 01:11.39 2 0 62 35 4096B 0B 0B 476K 37M 2395M 99399 1 79059 xpcd 0.0 00:00.57 2 0 168 33 4096B 0B 0B 600K 22M 2389M 79059 1 73631 ruby 0.0 00:03.46 2 0 22 180 43M 43M 0B 0B 63M 2425M 73350 73350 73554 top 9.8 00:06.23 1/1 0 24 33 1472K 1304K 0B 124K 26M 2385M 73554 70461 73350 bash 0.0 00:00.00 1 0 19 29 140K 168K 0B 220K 26M 2385M 73350 1 70461 bash 0.0 00:00.03 1 0 19 29 124K 124K 0B 340K 26M 2385M 70461 70451 70451 sshd 0.0 00:00.07 1 0 11 72 228K 156K 0B 308K 1352K 2393M 70444 70444 70444 sshd 0.0 00:02.06 2 0 33 68 32K 0B 0B 1220K 180K 2392M 70444 1 70399 ubd 0.0 00:00.05 2 0 39 47 8192B 0B 0B 1628K 23M 2391M 70399 169 57805 coresymbolic 0.0 00:01.81 2 1 30 33 240K 180K 0B 440K 31M 2389M 57805 1 56477 aslmanager 0.0 00:00.00 2 1 18 33 4096B 0B 0B 336K 23M 2389M 56477 1 56476 aslmanager 0.0 00:00.00 2 1 16 33 4096B 0B 0B 324K 23M 2389M 56476 1 55865 ntpd 0.0 00:00.07 2 0 28 34 4096B 0B 0B 648K 31M 2389M 55865 1 54192 systemstats 46.3 14:08:44 2/1 0 41 7366 5742M- 5743M- 0B 4882M+ 10G 13G 54192 1 xpcd also seems to be doing a lot too, but it seems pretty important: 15:49 < spohl> "xpcd is the daemon that manages XPC services. From Apple's developer site: "The XPC Services API, part of libSystem, provides a lightweight mechanism for basic interprocess communication integrated with Grand Central Dispatch (GCD) and launchd. The XPC Services API allows you to create lightweight helper tools, called XPC services, that perform work on behalf of your application."
Assignee | ||
Comment 5•10 years ago
|
||
Looks like systemstats can be disabled according to http://osxdaily.com/2014/02/27/systemstats-cpu-use-slow-mac-os-x/: Disabling systemstats This is not recommended and it may have unintended consequences beyond disabling property battery and power management monitoring, including the App Nap feature, but you can disable systemstats but unloading the daemon from launchd. This should be limited to advanced users only: sudo launchctl unload -w /System/Library/LaunchDaemons/com.apple.systemstatsd.plist sudo launchctl unload -w /System/Library/LaunchDaemons/com.apple.systemstats.daily.plist sudo launchctl unload -w /System/Library/LaunchDaemons/com.apple.systemstats.analysis.plist
Updated•10 years ago
|
See Also: → mac-v2-signing1
Assignee | ||
Comment 6•10 years ago
|
||
Yikes - load is still quite bad. Load is between 1 and 4 depending on the machine. System stats seems to be a big culprit at times, so I've manually unloaded the service on mac-v2-signing3, which has the worst load. If that makes a big difference, I'll get it disabled on all of them with Puppet.
Assignee | ||
Comment 7•10 years ago
|
||
Load dropped pretty quickly from from 3 -> 2, and dips close to 1 at times - so I'm pretty sure this is helping. Load is still higher than I would've expected though, especially compared to some v1 signing machines. Even more surprising that the load is higher when v1 are doing mar/gpg signing and v2 are not. http://people.mozilla.org/~bhearsum/sattap/0e9e65bf.png In any case, this certainly won't hurt.
Assignee | ||
Comment 8•10 years ago
|
||
I noticed that part of Spotlight was enabled too, let's kill it as well.
Attachment #8493115 -
Attachment is obsolete: true
Attachment #8493115 -
Flags: review?(dustin)
Attachment #8493137 -
Flags: review?(dustin)
Assignee | ||
Comment 9•10 years ago
|
||
Comment on attachment 8493137 [details] [diff] [review] disable system stats + mds Rail, Dustin is at a conference this week. Do you feel comfortable reviewing this?
Attachment #8493137 -
Flags: review?(dustin) → review?(rail)
Comment 10•10 years ago
|
||
Comment on attachment 8493137 [details] [diff] [review] disable system stats + mds syntax check: OK :)
Attachment #8493137 -
Flags: review?(rail) → review+
Assignee | ||
Updated•10 years ago
|
Attachment #8493137 -
Flags: checked-in+
Assignee | ||
Comment 11•10 years ago
|
||
Going to leave this open for now because the v2 signing work is merging to mozilla-central in the next week. We'll see how these hold up when it makes it there - I expect it to cause quite a load increase.
Assignee | ||
Comment 12•10 years ago
|
||
I accidentally turned on mac v2 signing for the latest release and beta builds' partner repacks, and the v2 signing servers managed to cause timeouts there, despite the load not being awfully high. I'd like to investigate this more ASAP, to try to address it before v2 signatures merge to mozilla-central.
Assignee | ||
Comment 13•10 years ago
|
||
Amy, one route of investigation I'm taking is to try and figure out why these systems idle at such high load. mac-v2-signing1 & 3 are rev4 machines and idle around ~1.5 load average. mac-v2-signing2 is a rev5 and idles close to 2.0. mac-signing3 is rev4 and idles at well under .5. I'm baffled - the processes that use the most CPU time on all of them are WindowServer and ScreenSaverEngine - I don't understand why the v2 signing machines are so much more heavily loaded. We may just have to write it off to "10.9 is busier", but I feel like there's something I'm missing here.
Flags: needinfo?(arich)
Assignee | ||
Comment 14•10 years ago
|
||
Another angle of approach is tracing one of the failed signing jobs, eg: 2014-09-24 05:39:44,688 - in /builds/slave/rel-m-beta-osx64_partner_rpk-0/partner-repacks/scripts/repacked_builds/33.0b7/build1/partner-repacks/yandex-tr-tamindir/mac/tr/working 2014-09-24 05:39:54,528 - 01e1bf3f40e6b19fe1d243759b98dc857c42ba16: processing Firefox 33.0b7.dmg.tar.gz on https://mac-v2-signing2.srv.releng.scl3.mozilla.com:9120 2014-09-24 05:39:54,844 - 01e1bf3f40e6b19fe1d243759b98dc857c42ba16: uploading for signing 2014-09-24 05:40:01,929 - 01e1bf3f40e6b19fe1d243759b98dc857c42ba16: processing Firefox 33.0b7.dmg.tar.gz on https://mac-v2-signing2.srv.releng.scl3.mozilla.com:9120 command timed out: 1200 seconds without output running ['python', './partner-repacks.py', '--version', '33.0b7', '--build-number', '1', '--repo', 'releases/mozilla-beta', '--hgroot', 'https://hg.mozilla.org', '--staging-server', 'stage.mozilla.org', '--dmg-extract-script', '/builds/slave/rel-m-beta-osx64_partner_rpk-0/tools/release/common/unpack-diskimage.sh', '--nightly-dir', 'firefox/nightly', '--pkg-dmg', '/builds/slave/rel-m-beta-osx64_partner_rpk-0/partner-repacks/scripts/pkg-dmg', '--platform', 'macosx64', '--signed'], attempting to kill Looking on mac-v2-signing2 I see the following for 01e1bf3f40e6b19fe1d243759b98dc857c42ba16: 2014-09-24 05:39:54,551 - INFO - 10.26.52.70 GET /sign/dmg/01e1bf3f40e6b19fe1d243759b98dc857c42ba16 2014-09-24 05:39:54,831 - DEBUG - Looking for /builds/signing/rel-key-signing-server/signed-files/dmg/01e1bf3f40e6b19fe1d243759b98dc857c42ba16 2014-09-24 05:39:54,840 - DEBUG - /builds/signing/rel-key-signing-server/signed-files/dmg/01e1bf3f40e6b19fe1d243759b98dc857c42ba16 is missing 2014-09-24 05:39:54,841 - INFO - 10.26.52.70 - - [2014-09-24 05:39:54] "GET /sign/dmg/01e1bf3f40e6b19fe1d243759b98dc857c42ba16 HTTP/1.1" 404 96 0.290019 2014-09-24 05:39:55,150 - INFO - 10.26.52.70 POST /sign/dmg 2014-09-24 05:40:00,330 - INFO - Request to dmg sign Firefox 33.0b7.dmg.tar.gz (01e1bf3f40e6b19fe1d243759b98dc857c42ba16) from 10.26.52.70 2014-09-24 05:40:00,924 - DEBUG - Putting (u'01e1bf3f40e6b19fe1d243759b98dc857c42ba16', u'Firefox 33.0b7.dmg.tar.gz', 'dmg', <gevent.event.Event object at 0x101ce6610>) on the queue 2014-09-24 05:40:00,925 - DEBUG - 1 workers active 2014-09-24 05:40:00,926 - INFO - 10.26.52.70 - - [2014-09-24 05:40:00] "POST /sign/dmg HTTP/1.1" 202 98 5.775935 2014-09-24 05:40:00,932 - INFO - Signing Firefox 33.0b7.dmg.tar.gz (dmg - 01e1bf3f40e6b19fe1d243759b98dc857c42ba16) 2014-09-24 05:40:01,088 - DEBUG - 50084: ['/builds/signing/rel-key-signing-server/bin/python2.7', '/builds/signing/rel-key-signing-server/tools/release/signing/signscript.py', '-c', '/builds/signing/rel-key-signing-server/signscript.ini', 'dmg', u'/builds/signing/rel-key-signing-server/unsigned-files/01e1bf3f40e6b19fe1d243759b98dc857c42ba16', u'/builds/signing/rel-key-signing-server/signed-files/dmg/01e1bf3f40e6b19fe1d243759b98dc857c42ba16', u'Firefox 33.0b7.dmg.tar.gz'] 2014-09-24 05:40:01,949 - INFO - 10.26.52.70 GET /sign/dmg/01e1bf3f40e6b19fe1d243759b98dc857c42ba16 2014-09-24 05:40:01,949 - DEBUG - Waiting for pending job 2014-09-24 05:40:05,122 - DEBUG - 50084: Success! 2014-09-24 05:40:05,409 - DEBUG - Copying result to a88fd5cb02a2173c84e0712cbeb7927c9c367264 2014-09-24 05:40:06,716 - DEBUG - no items, exiting 2014-09-24 05:40:06,717 - DEBUG - Worker exiting 2014-09-24 05:40:06,717 - DEBUG - Got message: ('done', (u'01e1bf3f40e6b19fe1d243759b98dc857c42ba16', u'Firefox 33.0b7.dmg.tar.gz', 'dmg', <gevent.event.Event object at 0x101ce6610>), 'a88fd5cb02a2173c84e0712cbeb7927c9c367264') 2014-09-24 05:40:06,842 - DEBUG - Pending job finished! 2014-09-24 05:40:06,887 - DEBUG - Looking for /builds/signing/rel-key-signing-server/signed-files/dmg/01e1bf3f40e6b19fe1d243759b98dc857c42ba16 (Firefox 33.0b7.dmg.tar.gz) 2014-09-24 05:40:07,167 - DEBUG - /builds/signing/rel-key-signing-server/signed-files/dmg/01e1bf3f40e6b19fe1d243759b98dc857c42ba16 is OK 2014-09-24 05:40:07,171 - DEBUG - Done worker 2014-09-24 05:40:07,172 - DEBUG - 0 workers left This is very interesting, because as far as the signing server goes, it thinks it completed the work without issue. This points away from a load issue on the server causing it to hang, and towards the client acting funny, or possibly the result never reaching the client. This makes me wonder if the transfer from the server back to the client was simply extrememly slow (for whatever reason), and the client couldn't get the file within the 20min before it timed out. I'm going to continue looking into this...
Comment 15•10 years ago
|
||
As another datapoint, services on the three v2 signing servers occasionally show up as unreachable via nagios. I'm wondering if there's some sort of issue with the 10.9 network stack.
Assignee | ||
Comment 16•10 years ago
|
||
At the moment, this isn't a blocker, but it may become critical/blocker if it holds up the merge of the Mac v2 signing work from bug 1046306. I'm not 100% sure when that is happening. If possible, it would be nice to get any information we have from any network equipment between these two machines. Anything that meets the following may be helpful: * Sept 24th between 05:39:50 and 05:40:10 (pacific) * source: 10.26.52.70 * destination: 10.26.49.30:9120/tcp
Blocks: 1046306
Assignee | ||
Comment 17•10 years ago
|
||
I ran this with --noop on 10.6, 10.9, and Linux signing machines. Only 10.9 had any changes.
Attachment #8495424 -
Flags: review?(rail)
Assignee | ||
Comment 18•10 years ago
|
||
I found this tip about httplib at http://stackoverflow.com/questions/1170744/how-do-i-get-urllib2-to-log-all-transferred-bytes, and I verified the value of 1 by reading httplib.py (which uses "print" (!!!!!) to log).
Attachment #8495428 -
Flags: review?(rail)
Updated•10 years ago
|
Attachment #8495428 -
Flags: review?(rail) → review+
Updated•10 years ago
|
Attachment #8495424 -
Flags: review?(rail) → review+
Assignee | ||
Updated•10 years ago
|
Attachment #8495428 -
Flags: checked-in+
Assignee | ||
Updated•10 years ago
|
Attachment #8495424 -
Flags: checked-in+
Assignee | ||
Comment 19•10 years ago
|
||
So, the signtool patch kind of worked, except that the output is getting buffered because httplib is using print, not log :(: 2014-09-25 12:21:52,247 - in /builds/slave/m-in-osx64-0000000000000000000/build/obj-firefox/i386/dist/universal/firefox 2014-09-25 12:21:52,342 - doing dmg signing 2014-09-25 12:21:52,342 - possible hosts are ['https://mac-signing4.srv.releng.scl3.mozilla.com:9110', 'https://mac-signing2.srv.releng.scl3.mozilla.com:9110', 'https://mac-signing3.srv.releng.scl3.mozilla.com:9110'] 2014-09-25 12:21:52,343 - pack tar /builds/slave/m-in-osx64-0000000000000000000/build/obj-firefox/i386/dist/universal/firefox/FirefoxNightly.app.tar.gz from folder /builds/slave/m-in-osx64-0000000000000000000/build/obj-firefox/i386/dist/universal/firefox with files 2014-09-25 12:21:52,343 - ['FirefoxNightly.app'] 2014-09-25 12:22:06,778 - FirefoxNightly.app.tar.gz 2014-09-25 12:22:06,778 - checking FirefoxNightly.app.tar.gz for signature... 2014-09-25 12:22:07,162 - dae8f2646c08c468bfc4e82226a179f6c76d56ab: checking cache 2014-09-25 12:22:07,213 - dae8f2646c08c468bfc4e82226a179f6c76d56ab: processing FirefoxNightly.app.tar.gz on https://mac-signing4.srv.releng.scl3.mozilla.com:9110 2014-09-25 12:22:07,213 - dae8f2646c08c468bfc4e82226a179f6c76d56ab: GET https://mac-signing4.srv.releng.scl3.mozilla.com:9110/sign/dmg/dae8f2646c08c468bfc4e82226a179f6c76d56ab 2014-09-25 12:22:07,261 - dae8f2646c08c468bfc4e82226a179f6c76d56ab: uploading for signing 2014-09-25 12:22:21,004 - dae8f2646c08c468bfc4e82226a179f6c76d56ab: processing FirefoxNightly.app.tar.gz on https://mac-signing4.srv.releng.scl3.mozilla.com:9110 2014-09-25 12:22:21,004 - dae8f2646c08c468bfc4e82226a179f6c76d56ab: GET https://mac-signing4.srv.releng.scl3.mozilla.com:9110/sign/dmg/dae8f2646c08c468bfc4e82226a179f6c76d56ab 2014-09-25 12:22:57,711 - dae8f2646c08c468bfc4e82226a179f6c76d56ab: OK 2014-09-25 12:22:57,711 - Creating /builds/slave/m-in-osx64-0000000000000000000/signing_cache/dmg 2014-09-25 12:22:57,711 - Copying FirefoxNightly.app.tar.gz to cache /builds/slave/m-in-osx64-0000000000000000000/signing_cache/dmg/dae8f2646c08c468bfc4e82226a179f6c76d56ab 2014-09-25 12:22:58,201 - unpacking FirefoxNightly.app 2014-09-25 12:22:58,201 - unpack tar /builds/slave/m-in-osx64-0000000000000000000/build/obj-firefox/i386/dist/universal/firefox/FirefoxNightly.app.tar.gz into /builds/slave/m-in-osx64-0000000000000000000/build/obj-firefox/i386/dist/universal/firefox send: 'GET /sign/dmg/dae8f2646c08c468bfc4e82226a179f6c76d56ab HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: mac-signing4.srv.releng.scl3.mozilla.com:9110\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n' reply: 'HTTP/1.1 404 Not Found\r\n' header: Date: Thu, 25 Sep 2014 19:22:07 GMT header: Transfer-Encoding: chunked send: 'POST /sign/dmg HTTP/1.1\r\nAccept-Encoding: identity\r\nContent-Length: 98123283\r\nHost: mac-signing4.srv.releng.scl3.mozilla.com:9110\r\nContent-Type: multipart/form-data; boundary=2caa4e5573864e059408a65acae11620\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n' send: <poster.encode.multipart_yielder instance at 0x100786f38> sendIng an iterable reply: 'HTTP/1.1 202 Accepted\r\n' header: X-Nonce: UNUSED header: Date: Thu, 25 Sep 2014 19:22:19 GMT header: Content-Length: 0 send: 'GET /sign/dmg/dae8f2646c08c468bfc4e82226a179f6c76d56ab HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: mac-signing4.srv.releng.scl3.mozilla.com:9110\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n' reply: 'HTTP/1.1 200 OK\r\n' header: X-Sha1-Digest: 8f4cfe03dd61df64d46cf189882c4ca3873df9aa header: Content-Length: 98132940 header: Date: Thu, 25 Sep 2014 19:22:53 GMT I'm not sure we'll get all of the "send", "reply", and "header" information in the event of a timeout.
Updated•10 years ago
|
Attachment #8495488 -
Flags: review?(rail) → review+
Assignee | ||
Updated•10 years ago
|
Attachment #8495488 -
Flags: checked-in+
Assignee | ||
Comment 21•10 years ago
|
||
I had to land a bustage fix because I forgot to return from the methods I wrapped, but we're not getting logging that will be there even if it times out: 014-09-25 13:42:59,727 - 7ed85480a4c3ae0519794865106a192e461f9f30: checking cache 2014-09-25 13:42:59,757 - 7ed85480a4c3ae0519794865106a192e461f9f30: processing FirefoxNightly.app.tar.gz on https://mac-signing3.srv.releng.scl3.mozilla.com:9110 2014-09-25 13:42:59,757 - 7ed85480a4c3ae0519794865106a192e461f9f30: GET https://mac-signing3.srv.releng.scl3.mozilla.com:9110/sign/dmg/7ed85480a4c3ae0519794865106a192e461f9f30 2014-09-25 13:42:59,759 - send: ('GET /sign/dmg/7ed85480a4c3ae0519794865106a192e461f9f30 HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: mac-signing3.srv.releng.scl3.mozilla.com:9110\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n',), {} 2014-09-25 13:42:59,889 - getting response 2014-09-25 13:42:59,896 - 7ed85480a4c3ae0519794865106a192e461f9f30: uploading for signing 2014-09-25 13:43:00,379 - send: ('POST /sign/dmg HTTP/1.1\r\nAccept-Encoding: identity\r\nContent-Length: 98126066\r\nHost: mac-signing3.srv.releng.scl3.mozilla.com:9110\r\nContent-Type: multipart/form-data; boundary=7abf2c38339147258b9c87d710175cb4\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n',), {} 2014-09-25 13:43:00,431 - send: (<poster.encode.multipart_yielder instance at 0x100786fc8>,), {} 2014-09-25 13:43:10,747 - getting response 2014-09-25 13:43:13,568 - 7ed85480a4c3ae0519794865106a192e461f9f30: processing FirefoxNightly.app.tar.gz on https://mac-signing3.srv.releng.scl3.mozilla.com:9110 2014-09-25 13:43:13,568 - 7ed85480a4c3ae0519794865106a192e461f9f30: GET https://mac-signing3.srv.releng.scl3.mozilla.com:9110/sign/dmg/7ed85480a4c3ae0519794865106a192e461f9f30 2014-09-25 13:43:13,569 - send: ('GET /sign/dmg/7ed85480a4c3ae0519794865106a192e461f9f30 HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: mac-signing3.srv.releng.scl3.mozilla.com:9110\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n',), {} 2014-09-25 13:43:13,669 - getting response
Comment 22•10 years ago
|
||
dustin: apparently they're going to merge the v2 signing stuff into mozilla-central on Monday, so it would be great to get these machines upgraded (I think the firewall is blocking that) and do any debugging we can on Friday.
Flags: needinfo?(dustin)
Assignee | ||
Comment 23•10 years ago
|
||
The issue was repro'ed overnight with the additional logging enabled. From the client: 2014-09-25 18:40:33,971 - doing dmg signing 2014-09-25 18:40:33,972 - possible hosts are ['https://mac-v2-signing2.srv.releng.scl3.mozilla.com:9110', 'https://mac-v2-signing1.srv.releng.scl3.mozilla.com:9110', 'https://mac-v2-signing3.srv.releng.scl3.mozilla.com:9110'] 2014-09-25 18:40:33,972 - pack tar /builds/slave/oak-osx64-00000000000000000000/build/obj-firefox/i386/dist/universal/firefox/FirefoxNightly.app.tar.gz from folder /builds/slave/oak-osx64-00000000000000000000/build/obj-firefox/i386/dist/universal/firefox with files 2014-09-25 18:40:33,972 - ['FirefoxNightly.app'] 2014-09-25 18:40:48,331 - FirefoxNightly.app.tar.gz 2014-09-25 18:40:48,331 - checking FirefoxNightly.app.tar.gz for signature... 2014-09-25 18:40:48,715 - 217bf3f8a3a7bcc3475dc64581225699e51fd71d: checking cache 2014-09-25 18:40:48,756 - 217bf3f8a3a7bcc3475dc64581225699e51fd71d: processing FirefoxNightly.app.tar.gz on https://mac-v2-signing2.srv.releng.scl3.mozilla.com:9110 2014-09-25 18:40:48,756 - 217bf3f8a3a7bcc3475dc64581225699e51fd71d: GET https://mac-v2-signing2.srv.releng.scl3.mozilla.com:9110/sign/dmg/217bf3f8a3a7bcc3475dc64581225699e51fd71d 2014-09-25 18:40:48,758 - send: ('GET /sign/dmg/217bf3f8a3a7bcc3475dc64581225699e51fd71d HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: mac-v2-signing2.srv.releng.scl3.mozilla.com:9110\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n',), {} 2014-09-25 18:40:48,802 - getting response 2014-09-25 18:40:48,806 - 217bf3f8a3a7bcc3475dc64581225699e51fd71d: uploading for signing 2014-09-25 18:40:49,262 - send: ('POST /sign/dmg HTTP/1.1\r\nAccept-Encoding: identity\r\nContent-Length: 98139983\r\nHost: mac-v2-signing2.srv.releng.scl3.mozilla.com:9110\r\nContent-Type: multipart/form-data; boundary=7b58050e46b7485aa17f6d7438a47ea9\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n',), {} 2014-09-25 18:40:49,279 - send: (<poster.encode.multipart_yielder instance at 0x100786fc8>,), {} 2014-09-25 18:40:55,751 - getting response 2014-09-25 18:40:57,665 - 217bf3f8a3a7bcc3475dc64581225699e51fd71d: processing FirefoxNightly.app.tar.gz on https://mac-v2-signing2.srv.releng.scl3.mozilla.com:9110 2014-09-25 18:40:57,666 - 217bf3f8a3a7bcc3475dc64581225699e51fd71d: GET https://mac-v2-signing2.srv.releng.scl3.mozilla.com:9110/sign/dmg/217bf3f8a3a7bcc3475dc64581225699e51fd71d 2014-09-25 18:40:57,666 - send: ('GET /sign/dmg/217bf3f8a3a7bcc3475dc64581225699e51fd71d HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: mac-v2-signing2.srv.releng.scl3.mozilla.com:9110\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n',), {} 2014-09-25 18:40:57,680 - getting response And the server: 2014-09-25 18:40:48,802 - INFO - 10.26.52.78 GET /sign/dmg/217bf3f8a3a7bcc3475dc64581225699e51fd71d 2014-09-25 18:40:48,803 - DEBUG - Looking for /builds/signing/dep-key-signing-server/signed-files/dmg/217bf3f8a3a7bcc3475dc64581225699e51fd71d 2014-09-25 18:40:48,803 - DEBUG - /builds/signing/dep-key-signing-server/signed-files/dmg/217bf3f8a3a7bcc3475dc64581225699e51fd71d is missing 2014-09-25 18:40:48,804 - INFO - 10.26.52.78 - - [2014-09-25 18:40:48] "GET /sign/dmg/217bf3f8a3a7bcc3475dc64581225699e51fd71d HTTP/1.1" 404 96 0.001593 2014-09-25 18:40:49,280 - INFO - 10.26.52.78 POST /sign/dmg 2014-09-25 18:40:55,786 - INFO - Request to dmg sign FirefoxNightly.app.tar.gz (217bf3f8a3a7bcc3475dc64581225699e51fd71d) from 10.26.52.78 2014-09-25 18:40:56,628 - DEBUG - Putting (u'217bf3f8a3a7bcc3475dc64581225699e51fd71d', u'FirefoxNightly.app.tar.gz', 'dmg', <gevent.event.Event object at 0x103c1bdd0>) on the queue 2014-09-25 18:40:56,629 - DEBUG - 1 workers active 2014-09-25 18:40:56,629 - INFO - 10.26.52.78 - - [2014-09-25 18:40:56] "POST /sign/dmg HTTP/1.1" 202 98 7.349819 2014-09-25 18:40:56,636 - INFO - Signing FirefoxNightly.app.tar.gz (dmg - 217bf3f8a3a7bcc3475dc64581225699e51fd71d) 2014-09-25 18:40:56,640 - DEBUG - 35708: ['/builds/signing/dep-key-signing-server/bin/python2.7', '/builds/signing/dep-key-signing-server/tools/release/signing/signscript.py', '-c', '/builds/signing/dep-key-signing-server/signscript.ini', 'dmg', u'/builds/signing/dep-key-signing-server/unsigned-files/217bf3f8a3a7bcc3475dc64581225699e51fd71d', u'/builds/signing/dep-key-signing-server/signed-files/dmg/217bf3f8a3a7bcc3475dc64581225699e51fd71d', u'FirefoxNightly.app.tar.gz'] 2014-09-25 18:40:57,680 - INFO - 10.26.52.78 GET /sign/dmg/217bf3f8a3a7bcc3475dc64581225699e51fd71d 2014-09-25 18:40:57,681 - DEBUG - Waiting for pending job 2014-09-25 18:41:21,572 - DEBUG - 35708: Success! 2014-09-25 18:41:21,960 - DEBUG - Copying result to 096126e7572b4a2a8b1f7b0dbe657b26e70b90f8 2014-09-25 18:41:23,890 - DEBUG - no items, exiting 2014-09-25 18:41:23,890 - DEBUG - Worker exiting 2014-09-25 18:41:23,891 - DEBUG - Got message: ('done', (u'217bf3f8a3a7bcc3475dc64581225699e51fd71d', u'FirefoxNightly.app.tar.gz', 'dmg', <gevent.event.Event object at 0x103c1bdd0>), '096126e7572b4a2a8b1f7b0dbe657b26e70b90f8') 2014-09-25 18:41:23,892 - DEBUG - Pending job finished! 2014-09-25 18:41:23,892 - DEBUG - Looking for /builds/signing/dep-key-signing-server/signed-files/dmg/217bf3f8a3a7bcc3475dc64581225699e51fd71d (FirefoxNightly.app.tar.gz) 2014-09-25 18:41:24,274 - DEBUG - /builds/signing/dep-key-signing-server/signed-files/dmg/217bf3f8a3a7bcc3475dc64581225699e51fd71d is OK 2014-09-25 18:41:24,278 - DEBUG - Done worker 2014-09-25 18:41:24,278 - DEBUG - 0 workers left We can see the client making the GET at 18:40:57 and the server receiving that. It was still in the middle of signing at the time, so it waits up to 60 seconds for that to complete (this is normal: https://github.com/mozilla/build-tools/blob/master/lib/python/signing/server.py#L506). At 18:41:23 we see the "Pending job finished" message along with a bunch of other things later in the server's do_GET method. So as far as the server is concern, it *definitely* finishes processing the GET, and yields the data. The client logging is a bit weirder. Unfortunately, the "getting response" messages seems to happen as soon as the request is made - not when the response starts streaming in. So the fact that we see that doesn't indicate that the server returned anything yet, as I'd hoped it would. Because of that it's hard to know exactly where the client is hanging, but I would _guess_ that it's around https://hg.python.org/cpython/file/22a46f05ce23/Lib/httplib.py#l414, where it's waiting for a header. One thing that may be worth looking into is whether or not some networking gear is dropping the GET after some timeout. In the example above, the GET request starts at 18:40:57, but the response isn't started until at least 10:41:24 - which is at least 27 seconds later. It doesn't seem like an awfully long time, but I know we used to have issues in other contexts with tcp connections getting dropped by a firewall. I'm going to see if I can hack even more logging and repro the problem again.
Comment 24•10 years ago
|
||
Per our discussions in irc, there's no line like INFO - 10.26.52.78 - - [2014-09-25 18:40:56] "GET /sign/dmg/217bf3f8a3a7bcc3475dc64581225699e51fd71d HTTP/1.1" logged on the server. This is logged by the WSGI framework *after* the response body is transmitted. So the HTTP connection hung during the HTTP response, in such a way that neither side saw the response complete, and the usual TCP mechanisms to restart a connection failed. This is consistent with the behavior seen when some L4 intermediary drops the connection from its state table and subsequently drops packets in either direction for that connection. It's *not* consistent with either end host dropping the connection state -- in that case, the end that dropped the connection would see an EOF on the socket and log completion of the transaction. Even without any special configuration, our firewalls accept 60+ seconds of silence on a TCP connection without timing it out. I'd like to gather more data in a tcpdump when this occurs, so that's where I'll look.
Assignee | ||
Comment 25•10 years ago
|
||
Latest status: We're pretty sure that there is some sort of network issue - NOT a load issue. The fact we've only ever hit this 10.9.0 (not on 10.6, not on 10.9.5) hints strongly towards an issue in 10.9.0's network stack or kernel somewhere. However, we've only done ~12 builds with signing on 10.9.5, which is not the biggest sample size. We've verified that Firefox is still signed and runs correctly when done from 10.9.5, so I've gone ahead and upgraded all of the mac-v2-signing machines to it. All of them are now back in service. At this point we haven't been able to think of any ways to try and diagnose this further, given that it's an extremely difficult bug to reproduce. I am encouraged by the 10.9.5 upgrade, but I certainly can't make any guarantees that it has permanently fixed the issue. Once the v2 signing work has merged to a major branch (ie, mozilla-inbound) and we see the signing servers holding up to that load, we can call this fixed. If we see more issue, especially after merging, we are prepared to do additional investigation with tcpdump (we can't do this now because we don't know how to reproduce the problem).
Flags: needinfo?(dustin)
Flags: needinfo?(arich)
Assignee | ||
Comment 26•10 years ago
|
||
Updating summary to match latest investigation. We haven't seen any more timeouts over the weekend. The real test will be today though, after the v2 signing work merges to a mainline branch.
Summary: mac-v2-signing servers died under load → mac-v2-signing servers sometimes don't return a response to the client
Assignee | ||
Comment 27•10 years ago
|
||
Comment on attachment 8495424 [details] [diff] [review] reduce concurrency for v2 signing machines I've backed this out (well, I changed concurrency to be "4" for everything, not backed out the code that lets us set it) now that we're pretty darn sure load isn't a factor here.
Attachment #8495424 -
Flags: checked-in+ → checked-in-
Assignee | ||
Comment 28•10 years ago
|
||
v2 signing landed on mozilla-central and mozilla-inbound this morning. We've had over a hundred builds signed since then, and I've not seen nor heard of any issues related to signing. I'm calling this FIXED. Thanks again to everyone who helped here - particularly Amy and Dustin.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Assignee | ||
Comment 29•10 years ago
|
||
Comment on attachment 8495488 [details] [diff] [review] log harder by wrapping some methods I backed these out because they break get_token.py, which we use for one-off partner repacks.
Attachment #8495488 -
Flags: checked-in+ → checked-in-
Assignee | ||
Updated•10 years ago
|
Attachment #8495428 -
Flags: checked-in+ → checked-in-
Updated•6 years ago
|
Component: General Automation → General
You need to log in
before you can comment on or make changes to this bug.
Description
•