Closed Bug 762922 Opened 12 years ago Closed 12 years ago

improve signing client retry logic

Categories

(Release Engineering :: General, enhancement, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: catlee)

Details

(Whiteboard: [signing])

Attachments

(1 file)

The signing client currently seems to retry the same server 20 times before moving on to another one. For example:
2012-06-07 17:09:41,734 - a53774f6112fb2458853012d2ea02a21226f4e03: processing mac/is/Thunderbird 14.0b1.dmg on https://mac-signing2.srv.releng.scl3.mozilla.com:9120
2012-06-07 17:10:56,834 - a53774f6112fb2458853012d2ea02a21226f4e03: connection error; trying again soon
2012-06-07 17:12:13,440 - a53774f6112fb2458853012d2ea02a21226f4e03: connection error; trying again soon
2012-06-07 17:13:30,014 - a53774f6112fb2458853012d2ea02a21226f4e03: connection error; trying again soon
2012-06-07 17:14:46,606 - a53774f6112fb2458853012d2ea02a21226f4e03: connection error; trying again soon
2012-06-07 17:16:03,195 - a53774f6112fb2458853012d2ea02a21226f4e03: connection error; trying again soon
2012-06-07 17:17:19,787 - a53774f6112fb2458853012d2ea02a21226f4e03: connection error; trying again soon
2012-06-07 17:18:36,360 - a53774f6112fb2458853012d2ea02a21226f4e03: connection error; trying again soon
2012-06-07 17:19:52,937 - a53774f6112fb2458853012d2ea02a21226f4e03: connection error; trying again soon
2012-06-07 17:21:09,529 - a53774f6112fb2458853012d2ea02a21226f4e03: connection error; trying again soon
2012-06-07 17:22:26,123 - a53774f6112fb2458853012d2ea02a21226f4e03: connection error; trying again soon
2012-06-07 17:23:42,700 - a53774f6112fb2458853012d2ea02a21226f4e03: connection error; trying again soon
2012-06-07 17:24:59,289 - a53774f6112fb2458853012d2ea02a21226f4e03: connection error; trying again soon
2012-06-07 17:26:15,864 - a53774f6112fb2458853012d2ea02a21226f4e03: connection error; trying again soon
2012-06-07 17:27:32,462 - a53774f6112fb2458853012d2ea02a21226f4e03: connection error; trying again soon
2012-06-07 17:28:49,042 - a53774f6112fb2458853012d2ea02a21226f4e03: connection error; trying again soon
2012-06-07 17:30:05,615 - a53774f6112fb2458853012d2ea02a21226f4e03: connection error; trying again soon
2012-06-07 17:31:22,202 - a53774f6112fb2458853012d2ea02a21226f4e03: connection error; trying again soon
2012-06-07 17:32:38,794 - a53774f6112fb2458853012d2ea02a21226f4e03: connection error; trying again soon
2012-06-07 17:33:55,363 - a53774f6112fb2458853012d2ea02a21226f4e03: connection error; trying again soon
2012-06-07 17:35:11,947 - a53774f6112fb2458853012d2ea02a21226f4e03: connection error; trying again soon
2012-06-07 17:35:12,948 - a53774f6112fb2458853012d2ea02a21226f4e03: giving up after 20 tries
2012-06-07 17:35:13,098 - a53774f6112fb2458853012d2ea02a21226f4e03: processing mac/is/Thunderbird 14.0b1.dmg on https://mac-signing4.build.scl1.mozilla.com:9100
2012-06-07 17:35:13,445 - a53774f6112fb2458853012d2ea02a21226f4e03: uploading for signing
2012-06-07 17:35:22,903 - a53774f6112fb2458853012d2ea02a21226f4e03: OK


For batched repacks, this more or less guarantees that your token will expire before your job is done if a signing server is down. The client should be switching servers after fewer failures than this.
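To put a number on the delay, here is a quick sketch that takes the first and last timestamps from the log excerpt above and computes how long the client stayed on the unreachable server. Only the two timestamps come from the log; the variable names are illustrative.

```python
from datetime import datetime

# Timestamp format used by the signing client log lines above.
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"

# First attempt against mac-signing2 and the "giving up after 20 tries" line.
started = datetime.strptime("2012-06-07 17:09:41,734", LOG_TIME_FORMAT)
gave_up = datetime.strptime("2012-06-07 17:35:12,948", LOG_TIME_FORMAT)

elapsed = gave_up - started
minutes_lost = elapsed.total_seconds() / 60

# Roughly 25.5 minutes pass before the client moves to the next server.
print(f"time spent on the dead server: {minutes_lost:.1f} minutes")
```

The second server then signs the file in about ten seconds, so nearly all of the wall-clock time for this file went to retries.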
(In reply to Ben Hearsum [:bhearsum] from comment #0) 
> For batched repacks, this more or less guarantees that your token will
> expire before your job is done if a signing server is down. The client
> should be switching servers after fewer failures than this.

Do we have to wait before switching servers at all, i.e. can we iterate through all possible servers before 'trying again soon' on each cycle?
(In reply to Chris Cooper [:coop] from comment #1)
> (In reply to Ben Hearsum [:bhearsum] from comment #0) 
> > For batched repacks, this more or less guarantees that your token will
> > expire before your job is done if a signing server is down. The client
> > should be switching servers after fewer failures than this.
> 
> Do we have to wait before switching servers at all, i.e. can we iterate
> through all possible servers before 'trying again soon' on each cycle?

We wait right now because we're retrying the same request against the same server, and the wait gives it a chance to come back up first. Something like this would probably be better:
* Try server A
* If that fails, try server B
* If that fails, try server C
* If that fails, wait N seconds and try them all again.

Probably should shuffle the servers, though.
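
A rough sketch of the loop proposed above, not the actual signing client code. The names `try_servers` and `attempt`, and the `max_cycles` and `wait_seconds` parameters, are hypothetical; `attempt` stands in for whatever performs one signing request and returns True on success.

```python
import random
import time


def try_servers(servers, attempt, max_cycles=3, wait_seconds=1.0, sleep=time.sleep):
    """Try every server once per cycle; wait only after a whole cycle fails.

    Returns the server that succeeded, or None if every cycle failed.
    """
    order = list(servers)
    random.shuffle(order)  # avoid always hitting the same server first
    for cycle in range(max_cycles):
        for server in order:
            if attempt(server):
                return server
        if cycle < max_cycles - 1:
            # Every server failed this cycle: wait N seconds, then try them all again.
            sleep(wait_seconds)
    return None
```

With this shape a single dead server costs one failed request per cycle rather than 20 retries with waits in between.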
Whiteboard: [signing]
Severity: normal → enhancement
Priority: -- → P3
Assignee: nobody → catlee
This moves the handling of multiple URLs into remote_signfile.

The URLs are first shuffled and then tried in order. If we fail on one URL, that URL is moved to the end of the list. I think it's worthwhile to keep the small sleep that's in there in case something network-wide is failing.
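
For readers without the attachment, a minimal sketch of the behaviour described here. This is not the patch itself: `sign_with_rotation`, `sign_one`, `max_tries`, and `pause` are hypothetical names, and `sign_one` stands in for a single signing request that returns True on success.

```python
import random
import time
from collections import deque


def sign_with_rotation(urls, sign_one, max_tries=20, pause=1.0, sleep=time.sleep):
    """Try shuffled URLs in order, moving each failed URL to the end of the list.

    Returns the URL that succeeded, or None after max_tries failures.
    """
    order = list(urls)
    random.shuffle(order)
    pending = deque(order)
    for _ in range(max_tries):
        url = pending[0]
        if sign_one(url):
            return url
        pending.rotate(-1)  # the failed URL goes to the back of the queue
        sleep(pause)        # small pause in case the failure is network-wide
    return None
```

Because a failed URL goes to the back, the next attempt always lands on a different server whenever more than one is configured.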
Attachment #651979 - Flags: review?(bhearsum)
Comment on attachment 651979 [details] [diff] [review]
move url retrying into remote_signfile

Review of attachment 651979 [details] [diff] [review]:
-----------------------------------------------------------------

Looks reasonable to me.
Attachment #651979 - Flags: review?(bhearsum) → review+
Attachment #651979 - Flags: checked-in+
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Component: General Automation → General