All trees closed due to ftp bustage

RESOLVED FIXED

Status

Priority: --
Severity: blocker
Status: RESOLVED FIXED
Reported: 4 years ago
Last updated: 3 years ago

People

(Reporter: KWierso, Assigned: nmaul)

Tracking

Details

(Whiteboard: [kanban:webops:https://kanbanize.com/ctrl_board/4/1934] )

Attachments

(3 attachments, 1 obsolete attachment)

(Reporter)

Description

4 years ago
nagios is alerting the following:
ftp2-zlb.vips.scl3.mozilla.com:productdelivery https VIP is CRITICAL: CRITICAL - Socket timeout after 10 seconds (http://m.mozilla.org/productdelivery+https+VIP)

Downloads from ftp are failing.

All trees closed.
#moc notified
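
For context, this is roughly the kind of probe behind that alert. The real check is a Nagios plugin hitting the productdelivery HTTPS VIP; the Python sketch below is only an illustration, and everything beyond the hostname and 10-second timeout taken from the alert is an assumption.

import http.client
import ssl

VIP_HOST = "ftp2-zlb.vips.scl3.mozilla.com"  # hostname from the alert text
TIMEOUT_SECONDS = 10  # matches the 10-second timeout in the alert

def probe(host: str, timeout: float) -> str:
    """Return an OK/CRITICAL string in the spirit of the Nagios check."""
    try:
        conn = http.client.HTTPSConnection(
            host, timeout=timeout, context=ssl.create_default_context()
        )
        conn.request("GET", "/")
        status = conn.getresponse().status
        return f"OK - HTTP {status}"
    except OSError as exc:
        # A saturated back end never finishes the TLS/HTTP exchange, which is
        # what surfaces as "Socket timeout after 10 seconds" in the alert.
        return f"CRITICAL - {exc}"

if __name__ == "__main__":
    print(probe(VIP_HOST, TIMEOUT_SECONDS))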

Comment 2

4 years ago
How are we doing here? This is a mission-critical piece of infrastructure. Any updates? Thanks.

Moving to the correct component; webops was paged by #moc and is investigating.
Assignee: nobody → server-ops-webops
Component: Buildduty → WebOps: Product Delivery
Product: Release Engineering → Infrastructure & Operations
QA Contact: bugspam.Callek → nmaul
Version: unspecified → other

Updated

4 years ago
Whiteboard: [kanban:webops:https://kanbanize.com/ctrl_board/4/1934]
High connection volumes on the FTP cluster have caused most of the back-end nodes to reach their maximum client capacity. Webops has been engaged and is investigating, but they do not yet have a diagnosis for the increased traffic.

Comment 5

4 years ago
Nothing new; the hosts are reaching max clients and popping in and out of the load balancer.

WebOps is looking at where the traffic is coming from but it is not clear yet. Currently no ETA.

Comment 6

4 years ago
per webops update:

No ETA; we are still investigating. There is some oddness with the type of traffic we are seeing, but we do not know what is causing it yet. It looks like high request load, or possibly stale client downloads, but we are not really sure yet.
Assignee: server-ops-webops → jcrowe
There was a similar issue around the Firefox 10th anniversary, where engineers (not our customers) downloaded in high volume directly off of FTP. Customers download from the CDN, which scales. For the 10th anniversary, webdev/releng were to flip a bit to have developers download from the CDN because of the volume. Is this a similar situation, where lots of developers are downloading from FTP directly and should be on the CDN instead?

Comment 8

4 years ago
The issue is still ongoing; jakem and gozer from webops are investigating with no ETA. Will update in about 30 minutes, or when we find out more information.
We would need URL data from the logs to determine if that is the issue and a possible solution.

Webops: what are the top talkers for the fetches?
(In reply to Hal Wine [:hwine] (use needinfo) from comment #9)
> We would need URL data from the logs to determine if that is the issue and a
> possible solution.
> 
> Webops: what are the top talkers for the fetches?

Based on the conversation in #webops, this is not related to the dev edition; it is related to bug 1061975 comment 80 and down.

That process "should" have users obtaining the installer from the CDNs; however, the CDNs appear to be refusing to serve, so requests fall back to ftp.m.o.
Hal, didn't this follow the standard release process?

Comment 12

4 years ago
The issue has been identified and resolved. The hotfix released today had ftp.m.o hard-coded instead of the bouncer. We should see FTP response time improve.
(Assignee)

Updated

4 years ago
See Also: → bug 1061975
This hotfix addon is released out-of-cycle from normal releases and is used to try and get users on old versions of Firefox back onto a modern version.

The url used to download Firefox was switched away from the CDN to FTP in this commit:
https://hg.mozilla.org/releases/firefox-hotfixes/rev/85c4ae522fc3#l2.1

gps says that the best way to stop these users from hammering FTP is to release a new hotfix addon with the correct url. He's working on generating the updated addon now.
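
For readers unfamiliar with the two endpoints, here is a hedged sketch of the difference between the hard-coded FTP origin and the bouncer. The actual hotfix add-on is JavaScript and the real URLs are in the commit linked above; the paths and parameters below are illustrative assumptions, shown in Python.

# Hard-coded FTP origin (what the bad hotfix pointed at): every client download
# hits the ftp.mozilla.org cluster directly, bypassing the CDN entirely.
FTP_URL = (
    "https://ftp.mozilla.org/pub/mozilla.org/firefox/releases/"
    "33.1.1/win32/en-US/Firefox%20Setup%2033.1.1.exe"
)

def bouncer_url(product: str, os_name: str, lang: str) -> str:
    """Bouncer endpoint (what the corrected hotfix should use): bouncer hands
    each request off to a CDN mirror instead of the FTP cluster."""
    return f"https://download.mozilla.org/?product={product}&os={os_name}&lang={lang}"

# Example: the win32 en-US installer for the release the hotfix installs.
print(bouncer_url("firefox-33.1.1", "win", "en-US"))
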
(In reply to Chris AtLee [:catlee] from comment #13)
> This hotfix addon is released out-of-cycle from normal releases and is used
> to try and get users on old versions of Firefox back onto a modern version.
> 
> The url used to download Firefox was switched away from the CDN to FTP in
> this commit:
> https://hg.mozilla.org/releases/firefox-hotfixes/rev/85c4ae522fc3#l2.1
> 
> gps says that the best way to stop these users from hammering FTP is to
> release a new hotfix addon with the correct url. He's working on generating
> the updated addon now.

Did this not follow the standard release process and QA practices that would have caught this?
Posted file hotfix-v20140527.01.xpi (obsolete) —
Callek: please sign.
Depends on: 1061975
Depends on: 1098559
This XPI doesn't have a Python virtualenv in it :)
Attachment #8531755 - Attachment is obsolete: true
(In reply to Van Le [:van] from comment #12)
> The issue has been identified and resolved. The hotfix released today had
> ftp.m.o hard-coded instead of the bouncer. We should see FTP response time
> improve.

Slight correction: issue identified, fix being worked on.
Signed, spot checking is likely worth it
Signed version is staged now.
(In reply to SylvieV from comment #14)
> Did this not follow the standard release process and QA practices that would
> have caught this?

Sylvie: hotfixes have a completely different release process and cycle, driven directly by engineering. They are handled as a special type of add-on, and that add-on went through its normal process. (As mentioned above, see bug 1061975 comment 80 and on.)

Bsmedberg can better answer any hotfix release process questions you have.
(In reply to Justin Wood (:Callek) from comment #19)
> Created attachment 8531760 [details]
> hotfix-v20140527.01-signed.xpi
> 
> Signed, spot checking is likely worth it

Spot check with FF 14.0.1 en-US worked as expected.
I spot checked the new add-on from the dev server and all appears to work.
(Assignee)

Comment 24

4 years ago
FYI: on the server side, we have implemented a ZLB TrafficScript rule to redirect win32 33.1.1 installer downloads that have no Referer header and a Firefox User-Agent over to the CDN. We're trying to be ultra-specific with this... probably more so than necessary... we just wanted to make sure we didn't break things for users actually browsing, or for Sentry/bouncer health checks, Nagios checks, etc.

This doesn't save us the actual hits, but it does save the bandwidth, which is a pretty big help.
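
As a rough sketch of the matching logic described above (the deployed rule is ZLB TrafficScript, not Python, and the path pattern and CDN hostname here are assumptions):

CDN_BASE = "https://download-installer.cdn.mozilla.net/pub/firefox/releases"

def redirect_target(path: str, user_agent: str, referer: str | None) -> str | None:
    """Return a CDN URL for bare win32 33.1.1 installer fetches, or None to let
    the request pass through to the FTP back ends."""
    is_installer = "/33.1.1/win32/" in path and path.endswith(".exe")
    from_firefox_ua = "Firefox" in user_agent
    no_referer = not referer  # the hotfix fetches send no Referer; normal browsing does
    if is_installer and from_firefox_ua and no_referer:
        # Keep the locale/filename portion so the CDN serves the same file.
        suffix = path.split("/releases/", 1)[-1]
        return f"{CDN_BASE}/{suffix}"
    return None

On the load balancer itself this would be expressed with TrafficScript's request-inspection and redirect primitives rather than application code; the narrow match (exact version, win32 path, Firefox UA, no Referer) is what keeps browsing users and the Sentry/bouncer/Nagios health checks unaffected.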

Even before doing this, once the old hotfix was pulled we started seeing improvement as connections completed and new ones didn't appear.

At the moment we are doing very well, infra-wise. Not quite back to normal, but well out of the water.
(Reporter)

Comment 25

4 years ago
So there are still nagios alerts in #buildduty about ftp load. Are we just going to have to wait this out for up to 24 hours until everyone gets an updated hotfix?
(In reply to Wes Kocher (:KWierso) from comment #25)
> So there are still nagios alerts in #buildduty about ftp load. Are we just
> going to have to wait this out for up to 24 hours until everyone gets an
> updated hotfix?

I believe clients will do the hotfix up-to-date check at most once every 24 hours. Assuming clients are running daily, most traffic to ftp should dissipate about 24 hours from now. However, there will be stragglers. We'll likely see residual traffic for weeks, but hopefully not enough to take down the network.

I expect things to shape up rather quickly and we should hopefully be able to open the trees tonight. However, expect another bump from Europe and US east coast tomorrow morning. Hopefully not enough to impact automation. But you never know.
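
As a back-of-the-envelope reading of that 24-hour estimate, here is a toy model under the simplifying assumption that clients' daily hotfix checks are spread uniformly through the day:

# Toy model only: assume each running client re-checks for the hotfix once per
# day at a uniformly random time. The share still carrying the bad hotfix t
# hours after the corrected one ships is roughly the share whose daily check
# has not yet fired. Real clients straggle (machines off, weekly users, etc.).
def remaining_fraction(hours_since_fix: float, check_interval_hours: float = 24.0) -> float:
    return max(0.0, 1.0 - hours_since_fix / check_interval_hours)

for t in (6, 12, 18, 24):
    print(f"{t:>2}h in: ~{remaining_fraction(t):.0%} of clients still on the old hotfix")
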
(Assignee)

Comment 27

4 years ago
Just for reference... this is what the load balancer traffic looked like during the original hotfix, followed by throttling, pulling it, redirecting traffic, and ultimately the new one.
(Reporter)

Comment 28

4 years ago
Just to check in from a treestatus/sheriffing point of view: I've reopened everything aside from the trunk trees. b2g-inbound is set to approval-required so it can pick up Gaia bumperbot changes. I reopened Gaia so they can commit things again. I've triggered new builds on m-c, inbound, b2g-inbound, and fx-team, which will kick off new tests when the builds finish and should catch any new bustage, right around the time Europe's sheriffs sign in, so they can make the call on when to reopen the trunk trees.
The new version of the add-on is now live in prod.
Still getting intermittent failures as of 10pm.
(In reply to Phil Ringnalda (:philor) from comment #30)
> Still getting intermittent failures as of 10pm.

and seems still ongoing, so keeping the trees closed and checking with jd
(In reply to Carsten Book [:Tomcat] from comment #31)
> (In reply to Phil Ringnalda (:philor) from comment #30)
> > Still getting intermittent failures as of 10pm.
> 
> and seems still ongoing, so keeping the trees closed and checking with jd

and reopening trees now as the situation has improved
(In reply to Carsten Book [:Tomcat] from comment #32)

> and reopening trees now as the situation has improved

and we are closed again for bug 1107462; no idea yet whether this is related to this bug, investigation is ongoing

Updated

4 years ago
Blocks: 1107475
See Also: → bug 1107462

Updated

4 years ago
Assignee: jcrowe → nmaul
To clarify, all work related to the hotfix-caused ftp.m.o issue has been completed.
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard