Closed
Bug 1107156
Opened 10 years ago
Closed 10 years ago
All trees closed due to ftp bustage
Categories
(Infrastructure & Operations Graveyard :: WebOps: Product Delivery, task)
Infrastructure & Operations Graveyard
WebOps: Product Delivery
x86
Windows 8.1
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: KWierso, Assigned: nmaul)
References
Details
(Whiteboard: [kanban:webops:https://kanbanize.com/ctrl_board/4/1934] )
Attachments
(3 files, 1 obsolete file)
Nagios is alerting the following:
ftp2-zlb.vips.scl3.mozilla.com:productdelivery https VIP is CRITICAL: CRITICAL - Socket timeout after 10 seconds (http://m.mozilla.org/productdelivery+https+VIP)
Downloads from ftp are failing.
All trees closed.
Comment 2•10 years ago
How are we doing here? This is a mission-critical piece of infrastructure. Any updates? Thanks.
Moving to the correct component; WebOps was paged by #moc and is investigating.
Assignee: nobody → server-ops-webops
Component: Buildduty → WebOps: Product Delivery
Product: Release Engineering → Infrastructure & Operations
QA Contact: bugspam.Callek → nmaul
Version: unspecified → other
Comment 4•10 years ago
High connection volumes on the FTP cluster have caused most of the back-end nodes to reach their maximum client capacity. WebOps has been engaged and is investigating, but they do not yet have a diagnosis for the increased traffic.
Comment 5•10 years ago
Nothing new; the hosts are reaching max clients and popping in and out of the load balancer.
WebOps is looking at where the traffic is coming from, but it is not clear yet. Currently no ETA.
Comment 6•10 years ago
Per WebOps update:
No ETA; we are still investigating. There is some oddness with the type of traffic we are seeing, but we do not know what is causing it yet. It looks like high request load, or possibly stale client downloads, but we are not really sure yet.
Updated•10 years ago
Assignee: server-ops-webops → jcrowe
Comment 7•10 years ago
There was a similar issue at the Firefox 10th anniversary, where engineers (not our customers) downloaded in high volume off of FTP. Customers download from the CDN, which scales. On the 10th anniversary, webdev/releng were to flip a bit to have developers download from the CDN due to volumes. Is this a similar situation, where lots of developers are downloading from FTP directly when it should be off the CDN?
Comment 8•10 years ago
The issue is still ongoing; jakem and gozer from WebOps are investigating, with no ETA. Will update in about 30 minutes, or when we find out more information.
We would need URL data from the logs to determine if that is the issue and a possible solution.
Webops: what are the top talkers for the fetches?
Comment 10•10 years ago
(In reply to Hal Wine [:hwine] (use needinfo) from comment #9)
> We would need URL data from the logs to determine if that is the issue and a
> possible solution.
>
> Webops: what are the top talkers for the fetches?
Based on conversation in #webops, this is not related to Developer Edition; it is related to bug 1061975 comment 80 and onward.
That process "should" have users obtaining the installer from CDNs; however, the CDNs appear to be refusing to serve, so requests fall back to ftp.m.o.
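The failure mode described here, CDN first with FTP as the fallback, can be sketched as follows. This is a minimal illustration, not the actual hotfix code; the hostnames and the `fetch` function are assumptions for the example, and `fetch` is hard-wired to fail so the fallback path is visible.

```python
def fetch(url):
    """Stand-in for an HTTP GET; returns bytes or raises IOError.
    Hard-coded to fail here, simulating the CDN refusing to serve."""
    raise IOError("refused: " + url)

def download_installer(path):
    """Try the bouncer/CDN first, then fall back to ftp.m.o.
    When the CDN refuses, every client lands on FTP, which does not scale."""
    sources = [
        "https://download.mozilla.org" + path,  # bouncer -> CDN (scales)
        "https://ftp.mozilla.org" + path,       # ftp.m.o (fallback)
    ]
    for url in sources:
        try:
            return fetch(url)
        except IOError:
            continue  # source refused; try the next one
    raise IOError("all sources failed")
```

With both sources failing as above, the call raises after exhausting the list, which is roughly the traffic pattern the FTP cluster was absorbing.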
Comment 11•10 years ago
Hal, didn't this follow the standard release process?
Comment 12•10 years ago
The issue has been identified and resolved. The hotfix released today had ftp.m.o hardcoded instead of the bouncer. We should see FTP response times improve.
Comment 13•10 years ago
This hotfix addon is released out-of-cycle from normal releases and is used to try and get users on old versions of Firefox back onto a modern version.
The url used to download Firefox was switched away from the CDN to FTP in this commit:
https://hg.mozilla.org/releases/firefox-hotfixes/rev/85c4ae522fc3#l2.1
gps says that the best way to stop these users from hammering FTP is to release a new hotfix addon with the correct url. He's working on generating the updated addon now.
Comment 14•10 years ago
(In reply to Chris AtLee [:catlee] from comment #13)
> This hotfix addon is released out-of-cycle from normal releases and is used
> to try and get users on old versions of Firefox back onto a modern version.
>
> The url used to download Firefox was switched away from the CDN to FTP in
> this commit:
> https://hg.mozilla.org/releases/firefox-hotfixes/rev/85c4ae522fc3#l2.1
>
> gps says that the best way to stop these users from hammering FTP is to
> release a new hotfix addon with the correct url. He's working on generating
> the updated addon now.
Did this not follow the standard release process and QA practices that would have caught this?
Comment 15•10 years ago
Callek: please sign.
Comment 16•10 years ago
This XPI doesn't have a Python virtualenv in it :)
Attachment #8531755 - Attachment is obsolete: true
Comment 17•10 years ago
Comment 18•10 years ago
(In reply to Van Le [:van] from comment #12)
> The issue has been identified and resolved. The hotfix released today had
> ftp.m.o hardcoded instead of the bouncer. We should see FTP response times
> improve.
Slight correction: issue identified, fix being worked on.
Comment 19•10 years ago
Signed; spot-checking is likely worth it.
Comment 20•10 years ago
Signed version is staged now.
Comment 21•10 years ago
(In reply to SylvieV from comment #14)
> Did this not follow the standard release process and QA practices that would
> have caught this?
Sylvie - hotfixes have a completely different release process and cycle, driven directly by engineering. They are handled as a special type of add-on, and that add-on went through their normal processes. (As mentioned above, see bug 1061975 comment 80 and on.)
Bsmedberg can better answer any hotfix release process questions you have.
Comment 22•10 years ago
(In reply to Justin Wood (:Callek) from comment #19)
> Created attachment 8531760 [details]
> hotfix-v20140527.01-signed.xpi
>
> Signed; spot-checking is likely worth it.
Spot check with FF 14.0.1 en-US worked as expected.
Comment 23•10 years ago
I spot checked the new add-on from the dev server and all appears to work.
(Assignee)
Comment 24•10 years ago
FYI: on the server side, we have implemented a ZLB TrafficScript rule to redirect win32 33.1.1 installer downloads that have no Referer header and a Firefox User-Agent over to the CDN. We're trying to be ultra-specific with this, probably more so than necessary; we just wanted to make sure we didn't break things for users actually browsing, or for Sentry/bouncer health checks, Nagios checks, etc.
This doesn't save us the actual hits, but it does save the bandwidth, which is a pretty big help.
Even before doing this, once the old hotfix was pulled we started seeing improvement as connections completed and new ones didn't appear.
At the moment we are doing very well, infra-wise. Not quite back to normal, but well out of the water.
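The matching logic of the redirect rule described above can be sketched in Python. This is a hedged illustration of the conditions the comment lists (Firefox User-Agent, no Referer, win32 33.1.1 installer path), not the actual TrafficScript; the CDN hostname and path pattern are assumptions for the example.

```python
# Assumed CDN base URL, purely illustrative.
CDN_BASE = "https://download-installer.cdn.mozilla.net"

def redirect_target(path, headers):
    """Return a CDN redirect URL for requests matching the rule's
    conditions, or None to let the request through to FTP."""
    ua = headers.get("User-Agent", "")
    if "Firefox" not in ua:
        return None  # only redirect Firefox clients (hotfix traffic)
    if headers.get("Referer"):
        return None  # keep traffic from users actually browsing on FTP
    if "33.1.1" not in path or "win32" not in path:
        return None  # ultra-specific: only the win32 33.1.1 installer
    return CDN_BASE + path

# A hotfix-style download with no Referer is bounced to the CDN.
hit = redirect_target(
    "/pub/firefox/releases/33.1.1/win32/en-US/installer.exe",
    {"User-Agent": "Mozilla/5.0 ... Firefox/14.0.1"},
)
```

Keeping the conditions narrow mirrors the comment's design choice: the rule only diverts the specific installer traffic and leaves health checks and normal browsing untouched.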
(Reporter)
Comment 25•10 years ago
So there are still nagios alerts in #buildduty about ftp load. Are we just going to have to wait this out for up to 24 hours until everyone gets an updated hotfix?
Comment 26•10 years ago
(In reply to Wes Kocher (:KWierso) from comment #25)
> So there are still nagios alerts in #buildduty about ftp load. Are we just
> going to have to wait this out for up to 24 hours until everyone gets an
> updated hotfix?
I believe clients will do the hotfix up-to-date check at most every 24 hours. Assuming clients are running daily, most traffic to ftp should dissipate about 24 hours from now. However, there will be stragglers. We'll likely see residual traffic for weeks. But hopefully not enough to take down the network.
I expect things to shape up rather quickly and we should hopefully be able to open the trees tonight. However, expect another bump from Europe and US east coast tomorrow morning. Hopefully not enough to impact automation. But you never know.
(Assignee)
Comment 27•10 years ago
Just for reference... this is what the load balancer traffic looked like during the original hotfix, followed by throttling, pulling it, redirecting traffic, and ultimately the new one.
(Reporter)
Comment 28•10 years ago
Just to check in from a treestatus/sheriffing point of view: I've reopened everything aside from the trunk trees. b2g-inbound is set to approval-required so it can pick up Gaia bumperbot changes. I reopened Gaia so they can commit things again. I've triggered new builds on m-c, inbound, b2g-inbound, and fx-team, which will kick off new tests when the builds finish; those should catch any new bustage right around the time Europe's sheriffs sign in, so they can make the call on when to reopen the trunk trees.
Comment 29•10 years ago
The new version of the add-on is now live in prod.
Comment 30•10 years ago
Still getting intermittent failures as of 10pm.
Comment 31•10 years ago
(In reply to Phil Ringnalda (:philor) from comment #30)
> Still getting intermittent failures as of 10pm.
and seems still ongoing, so keeping the trees closed and checking with jd
Comment 32•10 years ago
(In reply to Carsten Book [:Tomcat] from comment #31)
> (In reply to Phil Ringnalda (:philor) from comment #30)
> > Still getting intermittent failures as of 10pm.
>
> and seems still ongoing, so keeping the trees closed and checking with jd
and reopening trees now as the situation has improved
Comment 33•10 years ago
(In reply to Carsten Book [:Tomcat] from comment #32)
> and reopening trees now as the situation has improved
And we are closed again for bug 1107462; no idea if this is related to this bug, investigation is ongoing.
Updated•10 years ago
Assignee: jcrowe → nmaul
Comment 34•10 years ago
To clarify: all work related to the hotfix-caused ftp.m.o issue has been completed.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Updated•9 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard