Use ftp.m.o for test downloads

RESOLVED FIXED

Status

Release Engineering
General
P2
normal
RESOLVED FIXED
6 years ago
17 days ago

People

(Reporter: nthomas, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(3 attachments)

(Reporter)

Description

6 years ago
Currently test downloads hit stage, so if you add together release load + uploads + downloads it makes for a busy box, up to 80Mb/s.
(Reporter)

Comment 1

6 years ago
Created attachment 605967 [details] [diff] [review]
[tools] Swap try to ftp.m.o
Attachment #605967 - Flags: review?(aki)

Updated

6 years ago
Attachment #605967 - Flags: review?(aki) → review+
(Reporter)

Comment 2

6 years ago
Comment on attachment 605967 [details] [diff] [review]
[tools] Swap try to ftp.m.o

http://hg.mozilla.org/build/tools/rev/4aa8b87df068

Deployed to stage.
Attachment #605967 - Flags: checked-in+
(Reporter)

Comment 3

6 years ago
Created attachment 606047 [details] [diff] [review]
[tools] Swap ReleaseToDated and ReleaseToTinderboxBuilds

No problems found so far for the try switch.
Attachment #606047 - Flags: review?(aki)

Updated

6 years ago
Attachment #606047 - Flags: review?(aki) → review+
(Reporter)

Comment 4

6 years ago
Comment on attachment 606047 [details] [diff] [review]
[tools] Swap ReleaseToDated and ReleaseToTinderboxBuilds

http://hg.mozilla.org/build/tools/rev/010d791caebd

Deployed to stage at 15:28 (was rev 4aa8b87df068).
Attachment #606047 - Flags: checked-in+
(Reporter)

Comment 5

6 years ago
Created attachment 606503 [details]
Traffic graph

We ran into a hitch, which philor spotted first. Several jobs requested a couple of files in a 10-second interval, and got a '503 Server Too Busy' response from ftp.m.o. This was between 00:27:10 and 00:27:20 Pacific. You can see from the attached graph that there's a spike in traffic on all three nodes behind ftp.m.o.

dumitru, could you take a look at the load balancer/backend node logs and see if there is anything useful there? During quiet times we have enough idle machines to respond straight away when one compile job turns into many test jobs. That means multiple requests for the same file at the same moment.

----------
Failure details:

Request:
http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-macosx64-debug/1331879846/firefox-14.0a1.en-US.mac64.dmg 
Logs:
https://tbpl.mozilla.org/php/getParsedLog.php?id=10117750&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10117730&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10117733&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10117753&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10117755&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10117748&tree=Mozilla-Inbound

Request:
http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-win32-debug/1331879069/firefox-14.0a1.en-US.win32.zip
Logs:
https://tbpl.mozilla.org/php/getParsedLog.php?id=10117757&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=10117754&tree=Mozilla-Inbound

There's no info in the logs to identify if it was Zeus or one of the backend nodes returning the 503.
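
For illustration only, here is a minimal sketch of how a download step could ride out a short burst of 503s by retrying with exponential backoff and jitter. It is not the code the test automation actually runs; `download_with_retry`, `fetch`, and the default delays are hypothetical names and values chosen for the example.

```python
import random
import time


def download_with_retry(fetch, url, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it stops returning 503, backing off between tries.

    fetch is a hypothetical callable returning (status_code, body).
    """
    for attempt in range(attempts):
        status, body = fetch(url)
        if status != 503:
            return status, body
        if attempt < attempts - 1:
            # Exponential backoff plus random jitter, so that many test jobs
            # started by the same build do not all retry at the same instant.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))
    # Every attempt returned 503; hand the last response back to the caller.
    return status, body
```

The jitter matters here because the failing requests arrived within the same 10-second window; a fixed delay would just move the whole burst a few seconds later.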
(Reporter)

Comment 6

6 years ago
Actually CC dumitru this time. Please see comment #5, thanks.
(Reporter)

Updated

6 years ago
Depends on: 738557
Hitting the 503 Too Busy two or three or four times every day, taking out two or three or six or eight jobs every time. Let me know if you want me to start pasting links to every single log in here :)
(Reporter)

Comment 8

6 years ago
dumitru says he can take a look in 2-3 days, so I'd like to wait for that. If we don't find anything to fix we can swap back to stage until after the SCL3 migration. The downside of going back is that stage has a pretty hard time whenever we do releases, which can impact build/test/log uploads, and that takes longer to recover from than retrying tests.
Hard to see. Can we manually trigger that so I can watch the logs 'live'? Let's coordinate this.
Hit this again tonight:

2012-04-11 18:56:51 ERROR 503: Server Too Busy. [https://tbpl.mozilla.org/php/getParsedLog.php?id=10829415&tree=Firefox&full=1]
2012-04-11 18:57:15 ERROR 503: Server Too Busy. [https://tbpl.mozilla.org/php/getParsedLog.php?id=10829427&tree=Firefox&full=1]

I didn't check all trees though [Times are PDT]
Duplicate of this bug: 747747
IT: anything you can do to help debug this?
Assignee: nrthomas → server-ops
Component: Release Engineering: Automation (General) → Server Operations
Flags: checked-in+
QA Contact: catlee → phong
zeus config says:
maximum simultaneous connections from one IP address: 30
maximum simultaneous connections from the top ten busiest IP addresses: 200
(Reporter)

Comment 18

6 years ago
Do we have any graphs or data on how many connections are in use vs. time?

If all the requests from SCL1 are appearing to come from a single NAT IP then we'd probably hit those limits, particularly the first one.
(Reporter)

Comment 19

6 years ago
The minis in scl1 get a public IP for ftp.m.o (63.245...) so I think they will all appear to be coming from 63.245.222.66. There are more than 500 minis in the colo, so it's not surprising we're getting more than 30 requests sometimes. (I'm assuming going over that limit results in a 503 response.)

Please increase the connection limit for 63.245.222.66 to 100. Hopefully we won't need to do that globally.
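
To make the reasoning above concrete, here is a toy model of a per-source-IP cap on simultaneous connections. It is only a sketch: `PerIpLimiter` is a made-up class, not Zeus configuration or code, and it bakes in the same assumption as above, namely that going over the limit produces a 503. The second limit (200 for the top ten busiest addresses) is not modeled.

```python
from collections import Counter


class PerIpLimiter:
    """Toy model of a per-source-IP cap on simultaneous connections."""

    def __init__(self, max_per_ip=30):
        self.max_per_ip = max_per_ip
        self.active = Counter()  # source IP -> connections currently open

    def connect(self, ip):
        """Return 200 if the connection is accepted, 503 if over the cap."""
        if self.active[ip] >= self.max_per_ip:
            return 503
        self.active[ip] += 1
        return 200

    def disconnect(self, ip):
        """Release one connection slot for ip, if any is held."""
        if self.active[ip] > 0:
            self.active[ip] -= 1


limiter = PerIpLimiter(max_per_ip=30)
# 40 test machines start downloads at once, all seen as one source address.
results = [limiter.connect("63.245.222.66") for _ in range(40)]
print(results.count(200), results.count(503))
```

With 40 simultaneous requests from the one address this prints `30 10`: the first 30 are accepted and the rest are turned away, no matter how many distinct machines sit behind that address.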
(In reply to Nick Thomas [:nthomas] from comment #19)

> Please increase the connection limit for 63.245.222.66 to 100. Hopefully we
> won't need to do that globally.

I've added this IP to the "Allowed IPs" list on Zeus for ftp, which means it should bypass the restrictions above.

Let me know if you folks still see this issue.
(Reporter)

Comment 21

6 years ago
Thanks Shyam. Let's move this back to a RelEng component to monitor.

Incidentally, does that have any effect on caching?
Assignee: server-ops → nobody
Component: Server Operations → Release Engineering: Automation (General)
QA Contact: phong → catlee
(In reply to Nick Thomas [:nthomas] from comment #21)
> Thanks Shyam. Let's move this back to a RelEng component to monitor.

You're welcome!

> Incidentally, does that have any effect on caching?

None. It just doesn't process the connection limits for any connection coming from that IP.
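
As a rough sketch of that behaviour (not Zeus's actual implementation; `admit` and its parameters are hypothetical), the allowed-IP check can be thought of as running before the limit check, so nothing else about the request changes:

```python
def admit(ip, active_from_ip, allowed_ips, max_per_ip=30):
    """Return True if a new connection from ip should be accepted.

    active_from_ip is the number of connections ip already has open.
    """
    if ip in allowed_ips:
        # Allowed addresses skip the connection-limit check entirely.
        return True
    return active_from_ip < max_per_ip
```

For example, `admit("63.245.222.66", 500, {"63.245.222.66"})` is accepted even with 500 connections open, while any address not on the list is still held to the limit of 30.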
(Reporter)

Comment 23

6 years ago
We've moved to the scl3 colo since this change, and that's working well too.
Status: ASSIGNED → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
(Assignee)

Updated

5 years ago
Product: mozilla.org → Release Engineering
(Assignee)

Updated

17 days ago
Component: General Automation → General
Product: Release Engineering → Release Engineering