All trees closed due to ftp bustage

RESOLVED FIXED

Status

Priority: --
Severity: blocker
Status: RESOLVED FIXED
Reported: 4 years ago
Last updated: 3 years ago

People

(Reporter: KWierso, Assigned: nmaul)

Tracking

Details

(Whiteboard: [kanban:webops:https://kanbanize.com/ctrl_board/4/1934] )

Attachments

(3 attachments, 1 obsolete attachment)

(Reporter)

Description

4 years ago
nagios is alerting the following:
ftp2-zlb.vips.scl3.mozilla.com:productdelivery https VIP is CRITICAL: CRITICAL - Socket timeout after 10 seconds (http://m.mozilla.org/productdelivery+https+VIP)

Downloads from ftp are failing.

All trees closed.
#moc notified
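
For context, this is roughly the kind of probe behind that alert. The real check is a Nagios plugin hitting the productdelivery HTTPS VIP; the Python sketch below is only an illustration, and everything beyond the hostname and 10-second timeout taken from the alert is an assumption.

import http.client
import ssl

VIP_HOST = "ftp2-zlb.vips.scl3.mozilla.com"  # hostname from the alert text
TIMEOUT_SECONDS = 10  # matches the 10-second timeout in the alert

def probe(host: str, timeout: float) -> str:
    """Return an OK/CRITICAL string in the spirit of the Nagios check."""
    try:
        conn = http.client.HTTPSConnection(
            host, timeout=timeout, context=ssl.create_default_context()
        )
        conn.request("GET", "/")
        status = conn.getresponse().status
        return f"OK - HTTP {status}"
    except OSError as exc:
        # A saturated back end never finishes the TLS/HTTP exchange, which is
        # what surfaces as "Socket timeout after 10 seconds" in the alert.
        return f"CRITICAL - {exc}"

if __name__ == "__main__":
    print(probe(VIP_HOST, TIMEOUT_SECONDS))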

Comment 2

4 years ago
How are we doing here? This is a mission-critical piece of infrastructure. Any updates? Thanks.

Moving to the correct component; webops was paged by #moc and is investigating.
Assignee: nobody → server-ops-webops
Component: Buildduty → WebOps: Product Delivery
Product: Release Engineering → Infrastructure & Operations
QA Contact: bugspam.Callek → nmaul
Version: unspecified → other

Updated

4 years ago
Whiteboard: [kanban:webops:https://kanbanize.com/ctrl_board/4/1934]
High connection volumes on the FTP cluster have caused most of the back-end nodes to reach their maximum client capacity. Webops has been engaged and is investigating, but they do not yet have a diagnosis for the increased traffic.

Comment 5

4 years ago
Nothing new; the hosts are reaching max clients and popping in and out of the load balancer.

WebOps is looking at where the traffic is coming from but it is not clear yet. Currently no ETA.

Comment 6

4 years ago
per webops update:

No ETA; we are still investigating. There is some oddness with the type of traffic we are seeing, but we do not know what is causing it yet. It looks like high request load, or possibly stale client downloads, but we are not really sure yet.
Assignee: server-ops-webops → jcrowe
There was a similar issue around the Firefox 10th anniversary, where engineers (not our customers) downloaded in high volume directly off of FTP. Customers download from the CDN, which scales. For the 10th anniversary, webdev/releng were to flip a bit to have developers download from the CDN because of the volume. Is this a similar situation, where lots of developers are downloading from FTP directly and should be on the CDN instead?

Comment 8

4 years ago
The issue is still ongoing; jakem and gozer from webops are investigating with no ETA. Will update in about 30 minutes, or when we find out more information.
We would need URL data from the logs to determine if that is the issue and a possible solution.

Webops: what are the top talkers for the fetches?
(In reply to Hal Wine [:hwine] (use needinfo) from comment #9)
> We would need URL data from the logs to determine if that is the issue and a
> possible solution.
> 
> Webops: what are the top talkers for the fetches?

Based on the conversation in #webops, this is not related to the dev edition; it is related to bug 1061975 comment 80 and down.

That process "should" have users obtaining the installer from the CDNs; however, the CDNs appear to be refusing to serve, so requests fall back to ftp.m.o.
Hal, didn't this follow the standard release process?

Comment 12

4 years ago
The issue has been identified and resolved. The hotfix released today had ftp.m.o hard-coded instead of the bouncer. We should see FTP response time improve.
(Assignee)

Updated

4 years ago
See Also: → bug 1061975
This hotfix addon is released out-of-cycle from normal releases and is used to try and get users on old versions of Firefox back onto a modern version.

The url used to download Firefox was switched away from the CDN to FTP in this commit:
https://hg.mozilla.org/releases/firefox-hotfixes/rev/85c4ae522fc3#l2.1

gps says that the best way to stop these users from hammering FTP is to release a new hotfix addon with the correct url. He's working on generating the updated addon now.
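
For readers unfamiliar with the two endpoints, here is a hedged sketch of the difference between the hard-coded FTP origin and the bouncer. The actual hotfix add-on is JavaScript and the real URLs are in the commit linked above; the paths and parameters below are illustrative assumptions, shown in Python.

# Hard-coded FTP origin (what the bad hotfix pointed at): every client download
# hits the ftp.mozilla.org cluster directly, bypassing the CDN entirely.
FTP_URL = (
    "https://ftp.mozilla.org/pub/mozilla.org/firefox/releases/"
    "33.1.1/win32/en-US/Firefox%20Setup%2033.1.1.exe"
)

def bouncer_url(product: str, os_name: str, lang: str) -> str:
    """Bouncer endpoint (what the corrected hotfix should use): bouncer hands
    each request off to a CDN mirror instead of the FTP cluster."""
    return f"https://download.mozilla.org/?product={product}&os={os_name}&lang={lang}"

# Example: the win32 en-US installer for the release the hotfix installs.
print(bouncer_url("firefox-33.1.1", "win", "en-US"))
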
(In reply to Chris AtLee [:catlee] from comment #13)
> This hotfix addon is released out-of-cycle from normal releases and is used
> to try and get users on old versions of Firefox back onto a modern version.
> 
> The url used to download Firefox was switched away from the CDN to FTP in
> this commit:
> https://hg.mozilla.org/releases/firefox-hotfixes/rev/85c4ae522fc3#l2.1
> 
> gps says that the best way to stop these users from hammering FTP is to
> release a new hotfix addon with the correct url. He's working on generating
> the updated addon now.

Did this not follow the standard release process and QA practices that would have caught this?
Posted file hotfix-v20140527.01.xpi (obsolete) —
Callek: please sign.
Depends on: 1061975
Depends on: 1098559
This XPI doesn't have a Python virtualenv in it :)
Attachment #8531755 - Attachment is obsolete: true
(In reply to Van Le [:van] from comment #12)
> The issue has been identified and resolved. The hotfix released today had
> ftp.m.o hard-coded instead of the bouncer. We should see FTP response time
> improve.

Slight correction: issue identified, fix being worked on.
Signed, spot checking is likely worth it
Signed version is staged now.
(In reply to SylvieV from comment #14)
> Did this not follow the standard release process and QA practices that would
> have caught this?

Sylvie: hotfixes have a completely different release process and cycle, driven directly by engineering. They are handled as a special type of add-on, and that add-on went through its normal process. (As mentioned above, see bug 1061975 comment 80 and on.)

Bsmedberg can better answer any hotfix release process questions you have.
(In reply to Justin Wood (:Callek) from comment #19)
> Created attachment 8531760 [details]
> hotfix-v20140527.01-signed.xpi
> 
> Signed, spot checking is likely worth it

Spot check with FF 14.0.1 en-US worked as expected.
I spot checked the new add-on from the dev server and all appears to work.
(Assignee)

Comment 24

4 years ago
FYI: on the server side, we have implemented a ZLB TrafficScript rule to redirect win32 33.1.1 installer downloads that have no Referer header and a Firefox User-Agent over to the CDN. We're trying to be ultra-specific with this... probably more so than necessary... we just wanted to make sure we didn't break things for users actually browsing, or for Sentry/bouncer health checks, Nagios checks, etc.

This doesn't save us the actual hits, but it does save the bandwidth, which is a pretty big help.
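
As a rough sketch of the matching logic described above (the deployed rule is ZLB TrafficScript, not Python, and the path pattern and CDN hostname here are assumptions):

CDN_BASE = "https://download-installer.cdn.mozilla.net/pub/firefox/releases"

def redirect_target(path: str, user_agent: str, referer: str | None) -> str | None:
    """Return a CDN URL for bare win32 33.1.1 installer fetches, or None to let
    the request pass through to the FTP back ends."""
    is_installer = "/33.1.1/win32/" in path and path.endswith(".exe")
    from_firefox_ua = "Firefox" in user_agent
    no_referer = not referer  # the hotfix fetches send no Referer; normal browsing does
    if is_installer and from_firefox_ua and no_referer:
        # Keep the locale/filename portion so the CDN serves the same file.
        suffix = path.split("/releases/", 1)[-1]
        return f"{CDN_BASE}/{suffix}"
    return None

On the load balancer itself this would be expressed with TrafficScript's request-inspection and redirect primitives rather than application code; the narrow match (exact version, win32 path, Firefox UA, no Referer) is what keeps browsing users and the Sentry/bouncer/Nagios health checks unaffected.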

Even before doing this, once the old hotfix was pulled we started seeing improvement as connections completed and new ones didn't appear.

At the moment we are doing very well, infra-wise. Not quite back to normal, but well out of the water.
(Reporter)

Comment 25

4 years ago
So there are still nagios alerts in #buildduty about ftp load. Are we just going to have to wait this out for up to 24 hours until everyone gets an updated hotfix?
(In reply to Wes Kocher (:KWierso) from comment #25)
> So there are still nagios alerts in #buildduty about ftp load. Are we just
> going to have to wait this out for up to 24 hours until everyone gets an
> updated hotfix?

I believe clients will do the hotfix up-to-date check at most once every 24 hours. Assuming clients are running daily, most traffic to ftp should dissipate about 24 hours from now. However, there will be stragglers. We'll likely see residual traffic for weeks, but hopefully not enough to take down the network.

I expect things to shape up rather quickly and we should hopefully be able to open the trees tonight. However, expect another bump from Europe and US east coast tomorrow morning. Hopefully not enough to impact automation. But you never know.
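
As a back-of-the-envelope reading of that 24-hour estimate, here is a toy model under the simplifying assumption that clients' daily hotfix checks are spread uniformly through the day:

# Toy model only: assume each running client re-checks for the hotfix once per
# day at a uniformly random time. The share still carrying the bad hotfix t
# hours after the corrected one ships is roughly the share whose daily check
# has not yet fired. Real clients straggle (machines off, weekly users, etc.).
def remaining_fraction(hours_since_fix: float, check_interval_hours: float = 24.0) -> float:
    return max(0.0, 1.0 - hours_since_fix / check_interval_hours)

for t in (6, 12, 18, 24):
    print(f"{t:>2}h in: ~{remaining_fraction(t):.0%} of clients still on the old hotfix")
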
(Assignee)

Comment 27

4 years ago
Just for reference... this is what the load balancer traffic looked like during the original hotfix, followed by throttling, pulling it, redirecting traffic, and ultimately the new one.
(Reporter)

Comment 28

4 years ago
Just to check in from a treestatus/sheriffing point of view: I've reopened everything aside from the trunk trees. b2g-inbound is set to approval-required so it can pick up Gaia bumperbot changes. I reopened Gaia so they can commit things again. I've triggered new builds on m-c, inbound, b2g-inbound, and fx-team, which will kick off new tests when the builds finish and should catch any new bustage, right around the time Europe's sheriffs sign in, so they can make the call on when to reopen the trunk trees.
The new version of the add-on is now live in prod.
Still getting intermittent failures as of 10pm.
(In reply to Phil Ringnalda (:philor) from comment #30)
> Still getting intermittent failures as of 10pm.

and seems still ongoing, so keeping the trees closed and checking with jd
(In reply to Carsten Book [:Tomcat] from comment #31)
> (In reply to Phil Ringnalda (:philor) from comment #30)
> > Still getting intermittent failures as of 10pm.
> 
> and seems still ongoing, so keeping the trees closed and checking with jd

and reopening trees now as the situation has improved
(In reply to Carsten Book [:Tomcat] from comment #32)

> and reopening trees now as the situation has improved

and we are closed again for bug 1107462; no idea yet whether this is related to this bug, investigation is ongoing

Updated

4 years ago
Blocks: 1107475
See Also: → bug 1107462

Updated

4 years ago
Assignee: jcrowe → nmaul
To clarify, all work related to the hotfix-caused ftp.m.o issue has been completed.
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard