Closed Bug 1062302 Opened 10 years ago Closed 10 years ago

mac-v2-signing servers sometimes don't return a response to the client

Categories

(Release Engineering :: General, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: massimo, Assigned: bhearsum)

References

Details

Attachments

(5 files, 1 obsolete file)

Today we had a lot of issues with the l10n repacks for Beta 33 build2, slaves upload their binaries to the signing servers and nothing happens; buildbot kills the process after 2400 seconds of inactivity.

After some investigation, bhearsum has find out that all the errors are coming from mac-v2 servers.

With this patch mac-v2 servers will not be used to generate gpg and mar signatures.
Attachment #8483493 - Flags: review?(bhearsum)
Comment on attachment 8483493 [details] [diff] [review]
[puppet] disable gpg and mar from mac-v2 signing servers.patch

Review of attachment 8483493 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/buildmaster/templates/passwords.py.erb
@@ +18,5 @@
>          ('mac-signing3.srv.releng.scl3.mozilla.com:9100', '<%= scope.function_secret(['signing_server_username']) %>', '<%= scope.function_secret(['signing_server_nightly_password']) %>', ('gpg', 'dmg', 'mar')),
>          ('mac-signing4.srv.releng.scl3.mozilla.com:9100', '<%= scope.function_secret(['signing_server_username']) %>', '<%= scope.function_secret(['signing_server_nightly_password']) %>', ('gpg', 'dmg', 'mar')),
> +        ('mac-v2-signing1.srv.releng.scl3.mozilla.com:9100', '<%= scope.function_secret(['signing_server_username']) %>', '<%= scope.function_secret(['signing_server_nightly_password']) %>', ('dmgv2')),
> +        ('mac-v2-signing2.srv.releng.scl3.mozilla.com:9100', '<%= scope.function_secret(['signing_server_username']) %>', '<%= scope.function_secret(['signing_server_nightly_password']) %>', ('dmgv2')),
> +        ('mac-v2-signing3.srv.releng.scl3.mozilla.com:9100', '<%= scope.function_secret(['signing_server_username']) %>', '<%= scope.function_secret(['signing_server_nightly_password']) %>', ('dmgv2')),

You need a comm after 'dmgv2', otherwise this tuple turns into a string, and 'd', 'm', 'g', 'v', and '2' will all be interpreted as signing formats. r=me with this fixed. Feel free to fix upon landing.
Attachment #8483493 - Flags: review?(bhearsum) → review+
Comment on attachment 8483493 [details] [diff] [review]
[puppet] disable gpg and mar from mac-v2 signing servers.patch

Fixed and landed.
Attachment #8483493 - Flags: checked-in+
Updating summary to make it a bit more generic. The specific conditions we saw were the mac-v2-signing servers under very heavy load when gpg+mar signing was enabled _and_ a release _and_ nightlies were running. I'm not sure why they would be under more pressure than, say, the v1 mac signing servers though. Leaving this open in case we need to debug further.
Summary: Disable "gpg" and "mar" from mac-v2 signing servers → mac-v2-signing servers died under load
Component: Release Automation → General Automation
QA Contact: bhearsum → catlee
A quick look at top is showing pretty high CPU usage with no signing happening, most of it coming from "systemstats":
Processes: 131 total, 3 running, 2 stuck, 126 sleeping, 440 threads                                     12:53:23
Load Avg: 1.61, 1.79, 1.76  CPU usage: 8.0% user, 48.44% sys, 43.55% idle
SharedLibs: 22M resident, 0B data, 2316K linkedit.
MemRegions: 21323 total, 5824M resident, 6536K private, 64M shared.
PhysMem: 7182M used (1209M wired), 6788K unused.
VM: 333G vsize, 1310M framework vsize, 111381508(896) swapins, 112102954(0) swapouts.
Networks: packets: 152779503/157G in, 140299072/130G out. Disks: 10672671/545G read, 20293314/1204G written.

PID    COMMAND      %CPU TIME     #TH  #WQ  #PORTS #MREG MEM    RPRVT  PURG CMPRS  VPRVT  VSIZE  PGRP  PPID
99920  xpcd         0.0  39:27.36 2    0    50161  33    8192B  0B     0B   1072K  22M    2389M  99920 1
99543  python       0.0  00:00.00 1    0    7      149   4096B  0B     0B   1096K  176K   2395M  99533 99534
99542  python       0.0  00:00.00 1    0    7      149   4096B  0B     0B   1100K  176K   2395M  99533 99534
99534  python       0.1  05:02:36 4    0    10     202   2580K  2016K  0B   44M    50M    2443M  99533 1
99446  cfprefsd     0.0  00:00.00 2    1    30     34    4096B  0B     0B   520K   23M    2389M  99446 99399
99445  distnoted    0.0  00:00.35 2    0    41     32    4096B  0B     0B   596K   22M    2411M  99445 99399
99439  XPCTimeStamp 0.0  00:05.58 4    1    68     60    156K   68K    0B   2196K  25M    2414M  99439 1
99399  launchd      0.0  01:11.39 2    0    62     35    4096B  0B     0B   476K   37M    2395M  99399 1
79059  xpcd         0.0  00:00.57 2    0    168    33    4096B  0B     0B   600K   22M    2389M  79059 1
73631  ruby         0.0  00:03.46 2    0    22     180   43M    43M    0B   0B     63M    2425M  73350 73350
73554  top          9.8  00:06.23 1/1  0    24     33    1472K  1304K  0B   124K   26M    2385M  73554 70461
73350  bash         0.0  00:00.00 1    0    19     29    140K   168K   0B   220K   26M    2385M  73350 1
70461  bash         0.0  00:00.03 1    0    19     29    124K   124K   0B   340K   26M    2385M  70461 70451
70451  sshd         0.0  00:00.07 1    0    11     72    228K   156K   0B   308K   1352K  2393M  70444 70444
70444  sshd         0.0  00:02.06 2    0    33     68    32K    0B     0B   1220K  180K   2392M  70444 1
70399  ubd          0.0  00:00.05 2    0    39     47    8192B  0B     0B   1628K  23M    2391M  70399 169
57805  coresymbolic 0.0  00:01.81 2    1    30     33    240K   180K   0B   440K   31M    2389M  57805 1
56477  aslmanager   0.0  00:00.00 2    1    18     33    4096B  0B     0B   336K   23M    2389M  56477 1
56476  aslmanager   0.0  00:00.00 2    1    16     33    4096B  0B     0B   324K   23M    2389M  56476 1
55865  ntpd         0.0  00:00.07 2    0    28     34    4096B  0B     0B   648K   31M    2389M  55865 1
54192  systemstats  46.3 14:08:44 2/1  0    41     7366  5742M- 5743M- 0B   4882M+ 10G    13G    54192 1


xpcd also seems to be doing a lot too, but it seems pretty important:
15:49 < spohl> "xpcd is the daemon that manages XPC services. From Apple's developer site: "The XPC Services API, part of libSystem, provides a lightweight mechanism for basic interprocess communication integrated with Grand 
               Central Dispatch (GCD) and launchd. The XPC Services API allows you to create lightweight helper tools, called XPC services, that perform work on behalf of your application."
Looks like systemstats can be disabled according to http://osxdaily.com/2014/02/27/systemstats-cpu-use-slow-mac-os-x/:
Disabling systemstats

This is not recommended and it may have unintended consequences beyond disabling property battery and power management monitoring, including the App Nap feature, but you can disable systemstats but unloading the daemon from launchd. This should be limited to advanced users only:

sudo launchctl unload -w /System/Library/LaunchDaemons/com.apple.systemstatsd.plist

sudo launchctl unload -w /System/Library/LaunchDaemons/com.apple.systemstats.daily.plist

sudo launchctl unload -w /System/Library/LaunchDaemons/com.apple.systemstats.analysis.plist
Yikes - load is still quite bad. Load is between 1 and 4 depending on the machine. System stats seems to be a big culprit at times, so I've manually unloaded the service on mac-v2-signing3, which has the worst load. If that makes a big difference, I'll get it disabled on all of them with Puppet.
Attached patch disable system stats services (obsolete) — Splinter Review
Load dropped pretty quickly from from 3 -> 2, and dips close to 1 at times - so I'm pretty sure this is helping. Load is still higher than I would've expected though, especially compared to some v1 signing machines. Even more surprising that the load is higher when v1 are doing mar/gpg signing and v2 are not.

http://people.mozilla.org/~bhearsum/sattap/0e9e65bf.png

In any case, this certainly won't hurt.
Assignee: mgervasini → bhearsum
Status: NEW → ASSIGNED
Attachment #8493115 - Flags: review?(dustin)
I noticed that part of Spotlight was enabled too, let's kill it as well.
Attachment #8493115 - Attachment is obsolete: true
Attachment #8493115 - Flags: review?(dustin)
Attachment #8493137 - Flags: review?(dustin)
Comment on attachment 8493137 [details] [diff] [review]
disable system stats + mds

Rail, Dustin is at a conference this week. Do you feel comfortable reviewing this?
Attachment #8493137 - Flags: review?(dustin) → review?(rail)
Comment on attachment 8493137 [details] [diff] [review]
disable system stats + mds

syntax check: OK :)
Attachment #8493137 - Flags: review?(rail) → review+
Attachment #8493137 - Flags: checked-in+
Going to leave this open for now because the v2 signing work is merging to mozilla-central in the next week. We'll see how these hold up when it makes it there - I expect it to cause quite a load increase.
I accidentally turned on mac v2 signing for the latest release and beta builds' partner repacks, and the v2 signing servers managed to cause timeouts there, despite the load not being awfully high. I'd like to investigate this more ASAP, to try to address it before v2 signatures merge to mozilla-central.
Amy, one route of investigation I'm taking is to try and figure out why these systems idle at such high load. mac-v2-signing1 & 3 are rev4 machines and idle around ~1.5 load average. mac-v2-signing2 is a rev5 and idles close to 2.0. mac-signing3 is rev4 and idles at well under .5.

I'm baffled - the processes that use the most CPU time on all of them are WindowServer and ScreenSaverEngine - I don't understand why the v2 signing machines are so much more heavily loaded. We may just have to write it off to "10.9 is busier", but I feel like there's something I'm missing here.
Flags: needinfo?(arich)
Another angle of approach is tracing one of the failed signing jobs, eg:
2014-09-24 05:39:44,688 - in /builds/slave/rel-m-beta-osx64_partner_rpk-0/partner-repacks/scripts/repacked_builds/33.0b7/build1/partner-repacks/yandex-tr-tamindir/mac/tr/working
2014-09-24 05:39:54,528 - 01e1bf3f40e6b19fe1d243759b98dc857c42ba16: processing Firefox 33.0b7.dmg.tar.gz on https://mac-v2-signing2.srv.releng.scl3.mozilla.com:9120
2014-09-24 05:39:54,844 - 01e1bf3f40e6b19fe1d243759b98dc857c42ba16: uploading for signing
2014-09-24 05:40:01,929 - 01e1bf3f40e6b19fe1d243759b98dc857c42ba16: processing Firefox 33.0b7.dmg.tar.gz on https://mac-v2-signing2.srv.releng.scl3.mozilla.com:9120

command timed out: 1200 seconds without output running ['python', './partner-repacks.py', '--version', '33.0b7', '--build-number', '1', '--repo', 'releases/mozilla-beta', '--hgroot', 'https://hg.mozilla.org', '--staging-server', 'stage.mozilla.org', '--dmg-extract-script', '/builds/slave/rel-m-beta-osx64_partner_rpk-0/tools/release/common/unpack-diskimage.sh', '--nightly-dir', 'firefox/nightly', '--pkg-dmg', '/builds/slave/rel-m-beta-osx64_partner_rpk-0/partner-repacks/scripts/pkg-dmg', '--platform', 'macosx64', '--signed'], attempting to kill

Looking on mac-v2-signing2 I see the following for 01e1bf3f40e6b19fe1d243759b98dc857c42ba16:
2014-09-24 05:39:54,551 - INFO - 10.26.52.70 GET /sign/dmg/01e1bf3f40e6b19fe1d243759b98dc857c42ba16
2014-09-24 05:39:54,831 - DEBUG - Looking for /builds/signing/rel-key-signing-server/signed-files/dmg/01e1bf3f40e6b19fe1d243759b98dc857c42ba16
2014-09-24 05:39:54,840 - DEBUG - /builds/signing/rel-key-signing-server/signed-files/dmg/01e1bf3f40e6b19fe1d243759b98dc857c42ba16 is missing
2014-09-24 05:39:54,841 - INFO - 10.26.52.70 - - [2014-09-24 05:39:54] "GET /sign/dmg/01e1bf3f40e6b19fe1d243759b98dc857c42ba16 HTTP/1.1" 404 96 0.290019

2014-09-24 05:39:55,150 - INFO - 10.26.52.70 POST /sign/dmg
2014-09-24 05:40:00,330 - INFO - Request to dmg sign Firefox 33.0b7.dmg.tar.gz (01e1bf3f40e6b19fe1d243759b98dc857c42ba16) from 10.26.52.70
2014-09-24 05:40:00,924 - DEBUG - Putting (u'01e1bf3f40e6b19fe1d243759b98dc857c42ba16', u'Firefox 33.0b7.dmg.tar.gz', 'dmg', <gevent.event.Event object at 0x101ce6610>) on the queue
2014-09-24 05:40:00,925 - DEBUG - 1 workers active
2014-09-24 05:40:00,926 - INFO - 10.26.52.70 - - [2014-09-24 05:40:00] "POST /sign/dmg HTTP/1.1" 202 98 5.775935

2014-09-24 05:40:00,932 - INFO - Signing Firefox 33.0b7.dmg.tar.gz (dmg - 01e1bf3f40e6b19fe1d243759b98dc857c42ba16)
2014-09-24 05:40:01,088 - DEBUG - 50084: ['/builds/signing/rel-key-signing-server/bin/python2.7', '/builds/signing/rel-key-signing-server/tools/release/signing/signscript.py', '-c', '/builds/signing/rel-key-signing-server/signscript.ini', 'dmg', u'/builds/signing/rel-key-signing-server/unsigned-files/01e1bf3f40e6b19fe1d243759b98dc857c42ba16', u'/builds/signing/rel-key-signing-server/signed-files/dmg/01e1bf3f40e6b19fe1d243759b98dc857c42ba16', u'Firefox 33.0b7.dmg.tar.gz']
2014-09-24 05:40:01,949 - INFO - 10.26.52.70 GET /sign/dmg/01e1bf3f40e6b19fe1d243759b98dc857c42ba16
2014-09-24 05:40:01,949 - DEBUG - Waiting for pending job
2014-09-24 05:40:05,122 - DEBUG - 50084: Success!
2014-09-24 05:40:05,409 - DEBUG - Copying result to a88fd5cb02a2173c84e0712cbeb7927c9c367264
2014-09-24 05:40:06,716 - DEBUG - no items, exiting
2014-09-24 05:40:06,717 - DEBUG - Worker exiting
2014-09-24 05:40:06,717 - DEBUG - Got message: ('done', (u'01e1bf3f40e6b19fe1d243759b98dc857c42ba16', u'Firefox 33.0b7.dmg.tar.gz', 'dmg', <gevent.event.Event object at 0x101ce6610>), 'a88fd5cb02a2173c84e0712cbeb7927c9c367264')
2014-09-24 05:40:06,842 - DEBUG - Pending job finished!
2014-09-24 05:40:06,887 - DEBUG - Looking for /builds/signing/rel-key-signing-server/signed-files/dmg/01e1bf3f40e6b19fe1d243759b98dc857c42ba16 (Firefox 33.0b7.dmg.tar.gz)
2014-09-24 05:40:07,167 - DEBUG - /builds/signing/rel-key-signing-server/signed-files/dmg/01e1bf3f40e6b19fe1d243759b98dc857c42ba16 is OK
2014-09-24 05:40:07,171 - DEBUG - Done worker
2014-09-24 05:40:07,172 - DEBUG - 0 workers left



This is very interesting, because as far as the signing server goes, it thinks it completed the work without issue. This points away from a load issue on the server causing it to hang, and towards the client acting funny, or possibly the result never reaching the client. This makes me wonder if the transfer from the server back to the client was simply extrememly slow (for whatever reason), and the client couldn't get the file within the 20min before it timed out. I'm going to continue looking into this...
As another datapoint, services on the three v2 signing servers occasionally show up as unreachable via nagios. I'm wondering if there's some sort of issue with the 10.9 network stack.
At the moment, this isn't a blocker, but it may become critical/blocker if it holds up the merge of the Mac v2 signing work from bug 1046306. I'm not 100% sure when that is happening.


If possible, it would be nice to get any information we have from any network equipment between these two machines. Anything that meets the following may be helpful:
* Sept 24th between 05:39:50 and 05:40:10 (pacific)
* source: 10.26.52.70
* destination: 10.26.49.30:9120/tcp
Blocks: 1046306
I ran this with --noop on 10.6, 10.9, and Linux signing machines. Only 10.9 had any changes.
Attachment #8495424 - Flags: review?(rail)
I found this tip about httplib at http://stackoverflow.com/questions/1170744/how-do-i-get-urllib2-to-log-all-transferred-bytes, and I verified the value of 1 by reading httplib.py (which uses "print" (!!!!!) to log).
Attachment #8495428 - Flags: review?(rail)
Attachment #8495428 - Flags: review?(rail) → review+
Attachment #8495424 - Flags: review?(rail) → review+
Attachment #8495428 - Flags: checked-in+
Attachment #8495424 - Flags: checked-in+
So, the signtool patch kind of worked, except that the output is getting buffered because httplib is using print, not log :(:
2014-09-25 12:21:52,247 - in /builds/slave/m-in-osx64-0000000000000000000/build/obj-firefox/i386/dist/universal/firefox
2014-09-25 12:21:52,342 - doing dmg signing
2014-09-25 12:21:52,342 - possible hosts are ['https://mac-signing4.srv.releng.scl3.mozilla.com:9110', 'https://mac-signing2.srv.releng.scl3.mozilla.com:9110', 'https://mac-signing3.srv.releng.scl3.mozilla.com:9110']
2014-09-25 12:21:52,343 - pack tar /builds/slave/m-in-osx64-0000000000000000000/build/obj-firefox/i386/dist/universal/firefox/FirefoxNightly.app.tar.gz from folder  /builds/slave/m-in-osx64-0000000000000000000/build/obj-firefox/i386/dist/universal/firefox with files 
2014-09-25 12:21:52,343 - ['FirefoxNightly.app']
2014-09-25 12:22:06,778 - FirefoxNightly.app.tar.gz
2014-09-25 12:22:06,778 - checking FirefoxNightly.app.tar.gz for signature...
2014-09-25 12:22:07,162 - dae8f2646c08c468bfc4e82226a179f6c76d56ab: checking cache
2014-09-25 12:22:07,213 - dae8f2646c08c468bfc4e82226a179f6c76d56ab: processing FirefoxNightly.app.tar.gz on https://mac-signing4.srv.releng.scl3.mozilla.com:9110
2014-09-25 12:22:07,213 - dae8f2646c08c468bfc4e82226a179f6c76d56ab: GET https://mac-signing4.srv.releng.scl3.mozilla.com:9110/sign/dmg/dae8f2646c08c468bfc4e82226a179f6c76d56ab
2014-09-25 12:22:07,261 - dae8f2646c08c468bfc4e82226a179f6c76d56ab: uploading for signing
2014-09-25 12:22:21,004 - dae8f2646c08c468bfc4e82226a179f6c76d56ab: processing FirefoxNightly.app.tar.gz on https://mac-signing4.srv.releng.scl3.mozilla.com:9110
2014-09-25 12:22:21,004 - dae8f2646c08c468bfc4e82226a179f6c76d56ab: GET https://mac-signing4.srv.releng.scl3.mozilla.com:9110/sign/dmg/dae8f2646c08c468bfc4e82226a179f6c76d56ab
2014-09-25 12:22:57,711 - dae8f2646c08c468bfc4e82226a179f6c76d56ab: OK
2014-09-25 12:22:57,711 - Creating /builds/slave/m-in-osx64-0000000000000000000/signing_cache/dmg
2014-09-25 12:22:57,711 - Copying FirefoxNightly.app.tar.gz to cache /builds/slave/m-in-osx64-0000000000000000000/signing_cache/dmg/dae8f2646c08c468bfc4e82226a179f6c76d56ab
2014-09-25 12:22:58,201 - unpacking FirefoxNightly.app
2014-09-25 12:22:58,201 - unpack tar /builds/slave/m-in-osx64-0000000000000000000/build/obj-firefox/i386/dist/universal/firefox/FirefoxNightly.app.tar.gz into /builds/slave/m-in-osx64-0000000000000000000/build/obj-firefox/i386/dist/universal/firefox
send: 'GET /sign/dmg/dae8f2646c08c468bfc4e82226a179f6c76d56ab HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: mac-signing4.srv.releng.scl3.mozilla.com:9110\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 404 Not Found\r\n'
header: Date: Thu, 25 Sep 2014 19:22:07 GMT
header: Transfer-Encoding: chunked
send: 'POST /sign/dmg HTTP/1.1\r\nAccept-Encoding: identity\r\nContent-Length: 98123283\r\nHost: mac-signing4.srv.releng.scl3.mozilla.com:9110\r\nContent-Type: multipart/form-data; boundary=2caa4e5573864e059408a65acae11620\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
send: <poster.encode.multipart_yielder instance at 0x100786f38>
sendIng an iterable
reply: 'HTTP/1.1 202 Accepted\r\n'
header: X-Nonce: UNUSED
header: Date: Thu, 25 Sep 2014 19:22:19 GMT
header: Content-Length: 0
send: 'GET /sign/dmg/dae8f2646c08c468bfc4e82226a179f6c76d56ab HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: mac-signing4.srv.releng.scl3.mozilla.com:9110\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: X-Sha1-Digest: 8f4cfe03dd61df64d46cf189882c4ca3873df9aa
header: Content-Length: 98132940
header: Date: Thu, 25 Sep 2014 19:22:53 GMT

I'm not sure we'll get all of the "send", "reply", and "header" information in the event of a timeout.
Stupid httplib!
Attachment #8495488 - Flags: review?(rail)
Attachment #8495488 - Flags: review?(rail) → review+
Attachment #8495488 - Flags: checked-in+
I had to land a bustage fix because I forgot to return from the methods I wrapped, but we're not getting logging that will be there even if it times out:
014-09-25 13:42:59,727 - 7ed85480a4c3ae0519794865106a192e461f9f30: checking cache
2014-09-25 13:42:59,757 - 7ed85480a4c3ae0519794865106a192e461f9f30: processing FirefoxNightly.app.tar.gz on https://mac-signing3.srv.releng.scl3.mozilla.com:9110
2014-09-25 13:42:59,757 - 7ed85480a4c3ae0519794865106a192e461f9f30: GET https://mac-signing3.srv.releng.scl3.mozilla.com:9110/sign/dmg/7ed85480a4c3ae0519794865106a192e461f9f30
2014-09-25 13:42:59,759 - send: ('GET /sign/dmg/7ed85480a4c3ae0519794865106a192e461f9f30 HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: mac-signing3.srv.releng.scl3.mozilla.com:9110\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n',), {}
2014-09-25 13:42:59,889 - getting response
2014-09-25 13:42:59,896 - 7ed85480a4c3ae0519794865106a192e461f9f30: uploading for signing
2014-09-25 13:43:00,379 - send: ('POST /sign/dmg HTTP/1.1\r\nAccept-Encoding: identity\r\nContent-Length: 98126066\r\nHost: mac-signing3.srv.releng.scl3.mozilla.com:9110\r\nContent-Type: multipart/form-data; boundary=7abf2c38339147258b9c87d710175cb4\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n',), {}
2014-09-25 13:43:00,431 - send: (<poster.encode.multipart_yielder instance at 0x100786fc8>,), {}
2014-09-25 13:43:10,747 - getting response
2014-09-25 13:43:13,568 - 7ed85480a4c3ae0519794865106a192e461f9f30: processing FirefoxNightly.app.tar.gz on https://mac-signing3.srv.releng.scl3.mozilla.com:9110
2014-09-25 13:43:13,568 - 7ed85480a4c3ae0519794865106a192e461f9f30: GET https://mac-signing3.srv.releng.scl3.mozilla.com:9110/sign/dmg/7ed85480a4c3ae0519794865106a192e461f9f30
2014-09-25 13:43:13,569 - send: ('GET /sign/dmg/7ed85480a4c3ae0519794865106a192e461f9f30 HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: mac-signing3.srv.releng.scl3.mozilla.com:9110\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n',), {}
2014-09-25 13:43:13,669 - getting response
dustin: apparently they're going to merge the v2 signing stuff into mozilla-central on Monday, so it would be great to get these machines upgraded (I think the firewall is blocking that) and do any debugging we can on Friday.
Flags: needinfo?(dustin)
The issue was repro'ed overnight with the additional logging enabled. From the client:
2014-09-25 18:40:33,971 - doing dmg signing
2014-09-25 18:40:33,972 - possible hosts are ['https://mac-v2-signing2.srv.releng.scl3.mozilla.com:9110', 'https://mac-v2-signing1.srv.releng.scl3.mozilla.com:9110', 'https://mac-v2-signing3.srv.releng.scl3.mozilla.com:9110']
2014-09-25 18:40:33,972 - pack tar /builds/slave/oak-osx64-00000000000000000000/build/obj-firefox/i386/dist/universal/firefox/FirefoxNightly.app.tar.gz from folder  /builds/slave/oak-osx64-00000000000000000000/build/obj-firefox/i386/dist/universal/firefox with files 
2014-09-25 18:40:33,972 - ['FirefoxNightly.app']
2014-09-25 18:40:48,331 - FirefoxNightly.app.tar.gz
2014-09-25 18:40:48,331 - checking FirefoxNightly.app.tar.gz for signature...
2014-09-25 18:40:48,715 - 217bf3f8a3a7bcc3475dc64581225699e51fd71d: checking cache
2014-09-25 18:40:48,756 - 217bf3f8a3a7bcc3475dc64581225699e51fd71d: processing FirefoxNightly.app.tar.gz on https://mac-v2-signing2.srv.releng.scl3.mozilla.com:9110
2014-09-25 18:40:48,756 - 217bf3f8a3a7bcc3475dc64581225699e51fd71d: GET https://mac-v2-signing2.srv.releng.scl3.mozilla.com:9110/sign/dmg/217bf3f8a3a7bcc3475dc64581225699e51fd71d
2014-09-25 18:40:48,758 - send: ('GET /sign/dmg/217bf3f8a3a7bcc3475dc64581225699e51fd71d HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: mac-v2-signing2.srv.releng.scl3.mozilla.com:9110\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n',), {}
2014-09-25 18:40:48,802 - getting response
2014-09-25 18:40:48,806 - 217bf3f8a3a7bcc3475dc64581225699e51fd71d: uploading for signing
2014-09-25 18:40:49,262 - send: ('POST /sign/dmg HTTP/1.1\r\nAccept-Encoding: identity\r\nContent-Length: 98139983\r\nHost: mac-v2-signing2.srv.releng.scl3.mozilla.com:9110\r\nContent-Type: multipart/form-data; boundary=7b58050e46b7485aa17f6d7438a47ea9\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n',), {}
2014-09-25 18:40:49,279 - send: (<poster.encode.multipart_yielder instance at 0x100786fc8>,), {}
2014-09-25 18:40:55,751 - getting response
2014-09-25 18:40:57,665 - 217bf3f8a3a7bcc3475dc64581225699e51fd71d: processing FirefoxNightly.app.tar.gz on https://mac-v2-signing2.srv.releng.scl3.mozilla.com:9110
2014-09-25 18:40:57,666 - 217bf3f8a3a7bcc3475dc64581225699e51fd71d: GET https://mac-v2-signing2.srv.releng.scl3.mozilla.com:9110/sign/dmg/217bf3f8a3a7bcc3475dc64581225699e51fd71d
2014-09-25 18:40:57,666 - send: ('GET /sign/dmg/217bf3f8a3a7bcc3475dc64581225699e51fd71d HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: mac-v2-signing2.srv.releng.scl3.mozilla.com:9110\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n',), {}
2014-09-25 18:40:57,680 - getting response


And the server:
2014-09-25 18:40:48,802 - INFO - 10.26.52.78 GET /sign/dmg/217bf3f8a3a7bcc3475dc64581225699e51fd71d
2014-09-25 18:40:48,803 - DEBUG - Looking for /builds/signing/dep-key-signing-server/signed-files/dmg/217bf3f8a3a7bcc3475dc64581225699e51fd71d
2014-09-25 18:40:48,803 - DEBUG - /builds/signing/dep-key-signing-server/signed-files/dmg/217bf3f8a3a7bcc3475dc64581225699e51fd71d is missing
2014-09-25 18:40:48,804 - INFO - 10.26.52.78 - - [2014-09-25 18:40:48] "GET /sign/dmg/217bf3f8a3a7bcc3475dc64581225699e51fd71d HTTP/1.1" 404 96 0.001593

2014-09-25 18:40:49,280 - INFO - 10.26.52.78 POST /sign/dmg
2014-09-25 18:40:55,786 - INFO - Request to dmg sign FirefoxNightly.app.tar.gz (217bf3f8a3a7bcc3475dc64581225699e51fd71d) from 10.26.52.78
2014-09-25 18:40:56,628 - DEBUG - Putting (u'217bf3f8a3a7bcc3475dc64581225699e51fd71d', u'FirefoxNightly.app.tar.gz', 'dmg', <gevent.event.Event object at 0x103c1bdd0>) on the queue
2014-09-25 18:40:56,629 - DEBUG - 1 workers active
2014-09-25 18:40:56,629 - INFO - 10.26.52.78 - - [2014-09-25 18:40:56] "POST /sign/dmg HTTP/1.1" 202 98 7.349819

2014-09-25 18:40:56,636 - INFO - Signing FirefoxNightly.app.tar.gz (dmg - 217bf3f8a3a7bcc3475dc64581225699e51fd71d)
2014-09-25 18:40:56,640 - DEBUG - 35708: ['/builds/signing/dep-key-signing-server/bin/python2.7', '/builds/signing/dep-key-signing-server/tools/release/signing/signscript.py', '-c', '/builds/signing/dep-key-signing-server/signscript.ini', 'dmg', u'/builds/signing/dep-key-signing-server/unsigned-files/217bf3f8a3a7bcc3475dc64581225699e51fd71d', u'/builds/signing/dep-key-signing-server/signed-files/dmg/217bf3f8a3a7bcc3475dc64581225699e51fd71d', u'FirefoxNightly.app.tar.gz']
2014-09-25 18:40:57,680 - INFO - 10.26.52.78 GET /sign/dmg/217bf3f8a3a7bcc3475dc64581225699e51fd71d
2014-09-25 18:40:57,681 - DEBUG - Waiting for pending job
2014-09-25 18:41:21,572 - DEBUG - 35708: Success!
2014-09-25 18:41:21,960 - DEBUG - Copying result to 096126e7572b4a2a8b1f7b0dbe657b26e70b90f8
2014-09-25 18:41:23,890 - DEBUG - no items, exiting
2014-09-25 18:41:23,890 - DEBUG - Worker exiting
2014-09-25 18:41:23,891 - DEBUG - Got message: ('done', (u'217bf3f8a3a7bcc3475dc64581225699e51fd71d', u'FirefoxNightly.app.tar.gz', 'dmg', <gevent.event.Event object at 0x103c1bdd0>), '096126e7572b4a2a8b1f7b0dbe657b26e70b90f8')
2014-09-25 18:41:23,892 - DEBUG - Pending job finished!
2014-09-25 18:41:23,892 - DEBUG - Looking for /builds/signing/dep-key-signing-server/signed-files/dmg/217bf3f8a3a7bcc3475dc64581225699e51fd71d (FirefoxNightly.app.tar.gz)
2014-09-25 18:41:24,274 - DEBUG - /builds/signing/dep-key-signing-server/signed-files/dmg/217bf3f8a3a7bcc3475dc64581225699e51fd71d is OK
2014-09-25 18:41:24,278 - DEBUG - Done worker
2014-09-25 18:41:24,278 - DEBUG - 0 workers left


We can see the client making the GET at 18:40:57 and the server receiving that. It was still in the middle of signing at the time, so it waits up to 60 seconds for that to complete (this is normal: https://github.com/mozilla/build-tools/blob/master/lib/python/signing/server.py#L506).

At 18:41:23 we see the "Pending job finished" message along with a bunch of other things later in the server's do_GET method. So as far as the server is concern, it *definitely* finishes processing the GET, and yields the data.

The client logging is a bit weirder. Unfortunately, the "getting response" messages seems to happen as soon as the request is made - not when the response starts streaming in. So the fact that we see that doesn't indicate that the server returned anything yet, as I'd hoped it would. Because of that it's hard to know exactly where the client is hanging, but I would _guess_ that it's around https://hg.python.org/cpython/file/22a46f05ce23/Lib/httplib.py#l414, where it's waiting for a header.

One thing that may be worth looking into is whether or not some networking gear is dropping the GET after some timeout. In the example above, the GET request starts at 18:40:57, but the response isn't started until at least 10:41:24 - which is at least 27 seconds later. It doesn't seem like an awfully long time, but I know we used to have issues in other contexts with tcp connections getting dropped by a firewall.

I'm going to see if I can hack even more logging and repro the problem again.
Per our discussions in irc, there's no line like
  INFO - 10.26.52.78 - - [2014-09-25 18:40:56] "GET /sign/dmg/217bf3f8a3a7bcc3475dc64581225699e51fd71d HTTP/1.1"
logged on the server.  This is logged by the WSGI framework *after* the response body is transmitted.  So the HTTP connection hung during the HTTP response, in such a way that neither side saw the response complete, and the usual TCP mechanisms to restart a connection failed.

This is consistent with the behavior seen when some L4 intermediary drops the connection from its state table and subsequently drops packets in either direction for that connection.

It's *not* consistent with either end host dropping the connection state -- in that case, the end that dropped the connection would see an EOF on the socket and log completion of the transaction.

Even without any special configuration, our firewalls accept 60+ seconds of silence on a TCP connection without timing it out.

I'd like to gather more data in a tcpdump when this occurs, so that's where I'll look.
Latest status:
We're pretty sure that there is some sort of network issue - NOT a load issue. The fact we've only ever hit this 10.9.0 (not on 10.6, not on 10.9.5) hints strongly towards an issue in 10.9.0's network stack or kernel somewhere. However, we've only done ~12 builds with signing on 10.9.5, which is not the biggest sample size.

We've verified that Firefox is still signed and runs correctly when done from 10.9.5, so I've gone ahead and upgraded all of the mac-v2-signing machines to it. All of them are now back in service.

At this point we haven't been able to think of any ways to try and diagnose this further, given that it's an extremely difficult bug to reproduce. I am encouraged by the 10.9.5 upgrade, but I certainly can't make any guarantees that it has permanently fixed the issue. Once the v2 signing work has merged to a major branch (ie, mozilla-inbound) and we see the signing servers holding up to that load, we can call this fixed.

If we see more issue, especially after merging, we are prepared to do additional investigation with tcpdump (we can't do this now because we don't know how to reproduce the problem).
Flags: needinfo?(dustin)
Flags: needinfo?(arich)
Updating summary to match latest investigation. We haven't seen any more timeouts over the weekend. The real test will be today though, after the v2 signing work merges to a mainline branch.
Summary: mac-v2-signing servers died under load → mac-v2-signing servers sometimes don't return a response to the client
Comment on attachment 8495424 [details] [diff] [review]
reduce concurrency for v2 signing machines

I've backed this out (well, I changed concurrency to be "4" for everything, not backed out the code that lets us set it) now that we're pretty darn sure load isn't a factor here.
Attachment #8495424 - Flags: checked-in+ → checked-in-
v2 signing landed on mozilla-central and mozilla-inbound this morning. We've had over a hundred builds signed since then, and I've not seen nor heard of any issues related to signing.

I'm calling this FIXED. Thanks again to everyone who helped here - particularly Amy and Dustin.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Comment on attachment 8495488 [details] [diff] [review]
log harder by wrapping some methods

I backed these out because they break get_token.py, which we use for one-off partner repacks.
Attachment #8495488 - Flags: checked-in+ → checked-in-
Attachment #8495428 - Flags: checked-in+ → checked-in-
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: