Closed Bug 1261498 Opened 8 years ago Closed 8 years ago

mac-v2-signing2 & 7 sometimes fail to sign

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: bhearsum)

Attachments

(2 files)

signtool seems to select mac-v2-signing2 and 7 more often than any others, which leads to them getting the vast majority of the load and sometimes falling over.
I'm starting to doubt that there's actually an issue with signing server selection. The code seems to do it properly, and when counting across all the instances on each server, the numbers aren't too uneven. Here's the total number of matches for "Putting" on each server since April 1:
mac-v2-signing1: 1712
mac-v2-signing2: 1901
mac-v2-signing3: 2039
mac-v2-signing4: 1955
mac-v2-signing6: 1968
mac-v2-signing7: 1393

(grepped for "Putting", because that only happens when a file is put onto the queue, whereas the file hash is printed when the client checks to see if the file is done, too.)
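
For anyone wanting to repeat this kind of count, here is a minimal sketch. It is not the command that was actually run: the function name and the sample log lines are made up for illustration, and real signing server logs will look different.

```python
from collections import Counter
from typing import Dict, Iterable


def count_matches(logs: Dict[str, Iterable[str]], needle: str = "Putting") -> Counter:
    """Count the log lines containing `needle`, per server."""
    counts = Counter()
    for server, lines in logs.items():
        counts[server] = sum(1 for line in lines if needle in line)
    return counts


if __name__ == "__main__":
    # Hypothetical log excerpts, keyed by server name.
    sample = {
        "mac-v2-signing2": [
            "Putting abc on the queue",
            "abc: still pending",
            "Putting def on the queue",
        ],
        "mac-v2-signing7": [
            "Putting 123 on the queue",
            "123: still pending",
        ],
    }
    for server, n in sorted(count_matches(sample).items()):
        print(server, n)
```

Matching on "Putting" rather than on the file hash keeps status checks from inflating the count, which is the reason given above.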

When I break it down to just Nightly signing servers, things look a bit different:
mac-v2-signing1: 215
mac-v2-signing2: 369
mac-v2-signing3: 359
mac-v2-signing4: 313
mac-v2-signing6: 309
mac-v2-signing7: 483

I also grepped for Timeouts, which only occurred on 2 & 7:
mac-v2-signing2: 67
mac-v2-signing7: 156

Every timeout causes another "Putting" message, so that may explain the elevated number of matches in the Nightly logs on those servers.
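
A quick way to check that explanation is to subtract the timeouts from the Nightly "Putting" counts. This is a rough sketch that assumes each timeout adds exactly one extra "Putting" line and that all of the timeouts counted above happened on the Nightly instances; neither assumption has been verified.

```python
# Nightly "Putting" counts and timeout counts, copied from the comment above.
nightly_putting = {
    "mac-v2-signing1": 215,
    "mac-v2-signing2": 369,
    "mac-v2-signing3": 359,
    "mac-v2-signing4": 313,
    "mac-v2-signing6": 309,
    "mac-v2-signing7": 483,
}
timeouts = {"mac-v2-signing2": 67, "mac-v2-signing7": 156}

# Assumption: one extra "Putting" per timeout, so subtract them to estimate
# how many distinct files each server was actually asked to sign.
adjusted = {
    server: count - timeouts.get(server, 0)
    for server, count in nightly_putting.items()
}

for server, count in sorted(adjusted.items()):
    print(server, count)
```

Under those assumptions 2 and 7 come out at 302 and 327, which is inside the 215 to 359 range of the other servers.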
Summary: bad signing server selection by signtool → mac-v2-signing2 & 7 sometimes fail to sign
A few notes about 2 & 7:
* They don't share a hardware class (2 is r5, 7 is r4)
* They were both reimaged recently, along with the rest of the pool
* I can't find any record of hardware diagnostics being run anytime in the past year

Sounds like hardware diagnostics might be the next step here, though I'm not sure how much we trust them on Macs.
Depends on: 1264737
As we discussed, here are changes that should make us actually give up on pending files after a while and try a new server. Increasing the error count after giving up on one server means that we'll eventually give up entirely. With this patch we should try a different signing server after ~5min, and give up entirely after trying 5 servers.
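
The intended behaviour can be sketched roughly as follows. This is an illustration only, not the code in the patch: `sign_with_failover`, the `submit`/`poll`/`wait` callables, and the default values are all hypothetical, with the poll interval picked so that 5 tries add up to about 5 minutes.

```python
import itertools
from typing import Callable, Optional, Sequence


def sign_with_failover(
    servers: Sequence[str],
    submit: Callable[[str], None],
    poll: Callable[[str], Optional[bytes]],
    wait: Callable[[float], None],
    max_pending_tries: int = 5,
    max_errors: int = 5,
    poll_interval: float = 60.0,
) -> Optional[bytes]:
    """Try servers in turn; return the signed data, or None after too many errors."""
    errors = 0
    for server in itertools.cycle(servers):
        if errors >= max_errors:
            return None  # tried enough servers, give up entirely
        submit(server)
        pending = 0  # every server starts with a fresh pending budget
        while pending < max_pending_tries:
            result = poll(server)
            if result is not None:
                return result
            pending += 1
            wait(poll_interval)
        errors += 1  # giving up on one server counts as an error
    return None  # only reached when `servers` is empty


if __name__ == "__main__":
    # Fake callables: only the second server ever finishes signing.
    tried = []
    signed = sign_with_failover(
        ["server-a", "server-b"],
        submit=tried.append,
        poll=lambda s: b"signed" if s == "server-b" else None,
        wait=lambda seconds: None,  # skip real sleeping in the demo
    )
    print(tried, signed)
```

Passing `wait` in as a callable is just a convenience for testing; a real client would presumably sleep between polls.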
Assignee: nobody → bhearsum
Status: NEW → ASSIGNED
Attachment #8741496 - Flags: review?(catlee)
Attachment #8741496 - Flags: review?(catlee) → review+
My patch to the signing client seems to be working fine. I haven't seen it fail over to another server yet, but I'll look for timeouts next week and see if they switch servers after ~5min to confirm.
(In reply to Ben Hearsum (:bhearsum) from comment #5)
> My patch to the signing client seems to be working fine. I haven't seen it
> fail over to another server yet, but I'll look for timeouts next week and
> see if they switch servers after ~5min to confirm.

So, my patch works insofar as it switches to the next server after 5 tries...but it doesn't reset the pending count, so we end up not giving the next server a fair chance to sign:
05:41:14     INFO -  2016-04-16 05:41:14,442 - aba7e1dc2ad11788cc11d80285d6ede90a3cd29f: processing FirefoxNightly.app.tar.gz on https://mac-v2-signing7.srv.releng.scl3.mozilla.com:9100
05:41:36     INFO -  2016-04-16 05:41:36,911 - aba7e1dc2ad11788cc11d80285d6ede90a3cd29f: uploading for signing
05:42:05     INFO -  2016-04-16 05:42:05,402 - aba7e1dc2ad11788cc11d80285d6ede90a3cd29f: processing FirefoxNightly.app.tar.gz on https://mac-v2-signing7.srv.releng.scl3.mozilla.com:9100
05:43:20     INFO -  2016-04-16 05:43:20,902 - aba7e1dc2ad11788cc11d80285d6ede90a3cd29f: processing FirefoxNightly.app.tar.gz on https://mac-v2-signing7.srv.releng.scl3.mozilla.com:9100
05:44:36     INFO -  2016-04-16 05:44:36,455 - aba7e1dc2ad11788cc11d80285d6ede90a3cd29f: processing FirefoxNightly.app.tar.gz on https://mac-v2-signing7.srv.releng.scl3.mozilla.com:9100
05:45:51     INFO -  2016-04-16 05:45:51,663 - aba7e1dc2ad11788cc11d80285d6ede90a3cd29f: processing FirefoxNightly.app.tar.gz on https://mac-v2-signing7.srv.releng.scl3.mozilla.com:9100
05:47:06     INFO -  2016-04-16 05:47:06,864 - aba7e1dc2ad11788cc11d80285d6ede90a3cd29f: processing FirefoxNightly.app.tar.gz on https://mac-v2-signing7.srv.releng.scl3.mozilla.com:9100
05:48:21     INFO -  2016-04-16 05:48:21,887 - aba7e1dc2ad11788cc11d80285d6ede90a3cd29f: giving up after 5 tries
05:48:21     INFO -  2016-04-16 05:48:21,887 - aba7e1dc2ad11788cc11d80285d6ede90a3cd29f: processing FirefoxNightly.app.tar.gz on https://mac-v2-signing3.srv.releng.scl3.mozilla.com:9100
05:48:21     INFO -  2016-04-16 05:48:21,905 - aba7e1dc2ad11788cc11d80285d6ede90a3cd29f: uploading for signing
05:48:31     INFO -  2016-04-16 05:48:31,059 - aba7e1dc2ad11788cc11d80285d6ede90a3cd29f: giving up after 5 tries
05:48:31     INFO -  2016-04-16 05:48:31,059 - aba7e1dc2ad11788cc11d80285d6ede90a3cd29f: processing FirefoxNightly.app.tar.gz on https://mac-v2-signing2.srv.releng.scl3.mozilla.com:9100
05:48:31     INFO -  2016-04-16 05:48:31,079 - aba7e1dc2ad11788cc11d80285d6ede90a3cd29f: uploading for signing
05:48:41     INFO -  2016-04-16 05:48:41,820 - aba7e1dc2ad11788cc11d80285d6ede90a3cd29f: giving up after 5 tries
05:48:41     INFO -  2016-04-16 05:48:41,821 - aba7e1dc2ad11788cc11d80285d6ede90a3cd29f: giving up after 6 tries
05:48:41     INFO -  2016-04-16 05:48:41,821 - Failed to sign FirefoxNightly.app.tar.gz with dmg
05:48:41    ERROR -  make[1]: *** [repackage-zip] Error 1
05:48:41     INFO -  make: *** [repackage-zip-ast] Error 2
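
One way to see the problem in a log like this is to measure how long the client stayed on each server. The sketch below is a hypothetical helper, not part of the signing tools; its regular expressions assume the line format shown in the log above.

```python
import re
from datetime import datetime

# Assumed format: "<timestamp> - <hash>: <message>", as in the log above.
LINE_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) - \w+: (?P<msg>.*)$"
)
SERVER_RE = re.compile(r"processing \S+ on https://(?P<host>[^.:/]+)")


def seconds_per_server(lines):
    """Return (host, seconds) for each consecutive stretch spent on one server."""
    stretches = []  # each entry: [host, first timestamp, last timestamp]
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
        server = SERVER_RE.search(match.group("msg"))
        if server and (not stretches or stretches[-1][0] != server.group("host")):
            stretches.append([server.group("host"), ts, ts])  # moved to a new server
        elif stretches:
            stretches[-1][2] = ts  # still on the same server
    return [(h, (last - first).total_seconds()) for h, first, last in stretches]


if __name__ == "__main__":
    h = "aba7e1dc2ad11788cc11d80285d6ede90a3cd29f"
    base = "https://mac-v2-signing%d.srv.releng.scl3.mozilla.com:9100"
    # A shortened copy of the log above.
    log = [
        "2016-04-16 05:41:14,442 - %s: processing FirefoxNightly.app.tar.gz on %s" % (h, base % 7),
        "2016-04-16 05:48:21,887 - %s: giving up after 5 tries" % h,
        "2016-04-16 05:48:21,887 - %s: processing FirefoxNightly.app.tar.gz on %s" % (h, base % 3),
        "2016-04-16 05:48:31,059 - %s: giving up after 5 tries" % h,
        "2016-04-16 05:48:31,059 - %s: processing FirefoxNightly.app.tar.gz on %s" % (h, base % 2),
        "2016-04-16 05:48:41,820 - %s: giving up after 5 tries" % h,
    ]
    for host, seconds in seconds_per_server(log):
        print(host, seconds)
```

On this log, signing7 gets roughly seven minutes while signing3 and signing2 get about ten seconds each before the client gives up on them.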
Attachment #8742499 - Flags: review?(catlee)
Attachment #8742499 - Flags: review?(catlee) → review+
Comment on attachment 8742499 [details] [diff] [review]
reset pending count when trying a new server

https://hg.mozilla.org/build/tools/rev/710d0b6ec4d2
Attachment #8742499 - Flags: checked-in+
No timeouts since I landed the latest patch, so I couldn't verify it.
Looks to be working now:
05:56:10     INFO -  2016-04-19 05:56:10,873 - 071fcc11d39a4428a42ae48ef9705d9e8479776c: processing Nightly.app.tar.gz on https://mac-v2-signing7.srv.releng.scl3.mozilla.com:9110
05:56:10     INFO -  2016-04-19 05:56:10,909 - 071fcc11d39a4428a42ae48ef9705d9e8479776c: uploading for signing
05:56:31     INFO -  2016-04-19 05:56:31,683 - 071fcc11d39a4428a42ae48ef9705d9e8479776c: processing Nightly.app.tar.gz on https://mac-v2-signing7.srv.releng.scl3.mozilla.com:9110
05:57:46     INFO -  2016-04-19 05:57:46,905 - 071fcc11d39a4428a42ae48ef9705d9e8479776c: processing Nightly.app.tar.gz on https://mac-v2-signing7.srv.releng.scl3.mozilla.com:9110
05:59:06     INFO -  2016-04-19 05:59:06,276 - 071fcc11d39a4428a42ae48ef9705d9e8479776c: processing Nightly.app.tar.gz on https://mac-v2-signing7.srv.releng.scl3.mozilla.com:9110
06:00:21     INFO -  2016-04-19 06:00:21,819 - 071fcc11d39a4428a42ae48ef9705d9e8479776c: processing Nightly.app.tar.gz on https://mac-v2-signing7.srv.releng.scl3.mozilla.com:9110
06:01:43     INFO -  2016-04-19 06:01:43,116 - 071fcc11d39a4428a42ae48ef9705d9e8479776c: processing Nightly.app.tar.gz on https://mac-v2-signing7.srv.releng.scl3.mozilla.com:9110
06:02:58     INFO -  2016-04-19 06:02:58,489 - 071fcc11d39a4428a42ae48ef9705d9e8479776c: giving up after 5 tries
06:02:58     INFO -  2016-04-19 06:02:58,490 - 071fcc11d39a4428a42ae48ef9705d9e8479776c: processing Nightly.app.tar.gz on https://mac-v2-signing1.srv.releng.scl3.mozilla.com:9110
06:02:58     INFO -  2016-04-19 06:02:58,508 - 071fcc11d39a4428a42ae48ef9705d9e8479776c: uploading for signing
06:03:12     INFO -  2016-04-19 06:03:12,940 - 071fcc11d39a4428a42ae48ef9705d9e8479776c: processing Nightly.app.tar.gz on https://mac-v2-signing1.srv.releng.scl3.mozilla.com:9110
06:04:20     INFO -  2016-04-19 06:04:20,718 - 071fcc11d39a4428a42ae48ef9705d9e8479776c: OK
Since the signing client changes were made, there have only been 2 timeouts across the entire pool of signing servers. Load looks much more balanced as well, based on a grep for "Putting" again.

Calling this fixed.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Component: General Automation → General