Closed Bug 393462 Opened 13 years ago Closed 13 years ago

Increasing load on manna

Categories

(Release Engineering :: General, defect, P1, critical)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: aravind, Unassigned)

Details

Over the last few weeks, we have been getting more and more pages about increasing load on ftp.mozilla.org (which just points back to manna).  All of this increase in traffic has been due to HTTP load on ftp.mozilla.org.

I did some preliminary digging through one hour's worth of logs, and it seems like a large chunk of it is coming from folks requesting "/pub/mozilla.org/firefox/nightly/2.0.0.6-candidates/rc2/firefox-2.0.0.6.*.complete.mar"

A rough estimate of the number of folks requesting this (in one hour):

Firefox/2.0b2 - win32 - en-US - 1153 hits
Firefox/2.0b2 - win32 -  de   - 141 hits
Firefox/2.0b2 - win32 -  ru   - 348 hits
Firefox/2.0b2 - mac   -  fr   - 35 hits


The other major chunk requesting this is folks using BonEcho/2.0a2, and all those were from win32 and en-US - 3809 hits.

This problem seems to be getting progressively worse, so I would appreciate any help on this.  Note that the above numbers are all from just one hour's worth of logs.
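A minimal sketch of the kind of per-version tally described above. It assumes the access log is in Apache "combined" format, which the thread does not confirm; `LOG_RE`, `MAR_RE`, and `count_by_product` are hypothetical names for illustration only.

```python
import re
from collections import Counter

# Assumption: Apache "combined" log layout. The real ftp.mozilla.org
# log format is not shown in this bug and may differ.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Matches the candidate-directory complete MARs named in the report.
MAR_RE = re.compile(
    r"/2\.0\.0\.6-candidates/rc2/firefox-2\.0\.0\.6\..*\.complete\.mar$"
)


def count_by_product(lines):
    """Count requests for the rc2 complete MARs, keyed by the last
    User-Agent token (e.g. "Firefox/2.0b2" or "BonEcho/2.0a2")."""
    counts = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if not m or not MAR_RE.search(m.group("path")):
            continue  # unparseable line, or a request for some other file
        tokens = m.group("agent").split()
        counts[tokens[-1] if tokens else "(none)"] += 1
    return counts
```

Feeding it an open log file, for example `count_by_product(open(path))`, and then calling `.most_common()` on the result would give a breakdown by client version; locale and platform could be split out of the request path in the same way.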
Is it mostly coming from any IP or block of IPs in particular?

The 2.0a2 stuff is weird.. people should be using bouncer in any case, sounds like someone is crawling us maybe.

Is ftp.m.o a mirror in bouncer?
Nope, it's not in bouncer, I checked that already.

And nope, the hits don't seem to be coming from any IP pattern.  It could still be a DOS attack, but I wanted to run it by you folks in case there was a build configuration issue.
(In reply to comment #2)
> nope, its not in bouncer, I checked that already.
> 
> And nope, the hits don't seem to coming from any ip pattern.  It could still be
> a DOS attack, but I wanted to run it by you folks in case there was a build
> configuration issue.
> 

I can't think of anything we're doing that should cause this..
After discussing this with joduinn and aravind, we should at least check to make sure:

1. The snippets for Firefox/2.0a1, BonEcho/2.0a2, Firefox/2.0b1, and Firefox/2.0b2 all point to download.mozilla.org (and NOT ftp.mozilla.org as the logs are showing).  Looks like 2.0a2 and 2.0b2 make up the majority of the pings from the log.

2. We didn't forget to rename ftp.m.o to download.m.o at some point between QA testing the updates for beta/release on stage, and us pushing the bits to the mirrors.  The fact that folks are trying to get complete.mar from the nightly candidates directory is really odd (regardless of update channel).

3. There probably is no easy way to tell, but can we find out whether something is wacky on the AUS end?  Are people getting back update.xml files?  I'm not sure where the download URL (in this case they point to the nightly candidates dir) is created for the user.  Is it:  AUS snippets + User agent = Download URL?
(In reply to comment #4)
> After discussing this with joduinn and aravind, we should at least check to
> make sure:
> 
> 1. The snippets for Firefox/2.0a1, BonEcho/2.0a2, Firefox/2.0b1, and
> Firefox/2.0b2 all point to download.mozilla.org (and NOT ftp.mozilla.org as the
> logs are showing).  Looks like 2.0a2 and 2.0b2 make up the majority of the
> pings from the log.


These are on the beta channel, which certainly does point directly at ftp.mozilla.org and does not go through bouncer.


> 2. We didn't forget to rename ftp.m.o to download.m.o at some point between QA
> testing the updates for beta/release on stage, and us pushing the bits to the
> mirrors.  The fact that folks are trying to get complete.mar from the nightly
> candidates directory is really odd (regardless of update channel).


See above,


> 3. There probably is no easy way to tell, but can we find out whether something
> is wacky on the AUS end?  Are people getting back update.xml files?  I'm not
> sure where the download URL (in this case they point to the nightly candidates
> dir) is created for the user.  Is it:  AUS snippets + User agent = Download
> URL?


I'd really love to be able to match AUS pings against Bouncer hits, for diagnosing this and lots of other issues; can we do this via cookies, or at least IP matching, from the HTTP logs?
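A rough sketch of what IP-based matching could look like, assuming both logs have already been parsed into `(ip, timestamp)` pairs. The function name, the one-hour window, and the input shape are all hypothetical; nothing here reflects the actual AUS or Bouncer log formats. IP matching is also coarse, since several clients can share one address.

```python
from datetime import datetime, timedelta


def downloads_without_ping(aus_pings, download_hits, window=timedelta(hours=1)):
    """Return the download hits that have no AUS ping from the same IP
    within `window` before the download.

    Both arguments are iterables of (ip, datetime) pairs. The window
    length is an arbitrary illustrative default, not a measured value.
    """
    pings_by_ip = {}
    for ip, ts in aus_pings:
        pings_by_ip.setdefault(ip, []).append(ts)

    unmatched = []
    for ip, ts in download_hits:
        # A download counts as matched if any ping from the same IP
        # happened at or before it, and no more than `window` earlier.
        matched = any(
            timedelta(0) <= ts - ping <= window
            for ping in pings_by_ip.get(ip, [])
        )
        if not matched:
            unmatched.append((ip, ts))
    return unmatched
```

A large share of unmatched downloads would hint at traffic that is not driven by the update service, while a small share would point back at real update clients.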

(In reply to comment #1)
> Is it mostly coming from any IP or block of IPs in particular?
> 
> The 2.0a2 stuff is weird.. people should be using bouncer in any case, sounds
> like someone is crawling us maybe.
> 
> Is ftp.m.o a mirror in bouncer?


Sorry, I don't know what I was thinking here; yes all of the releases specified here would have updates on the "beta" channel pointing directly to ftp.m.o, not going through bouncer. That's expected.

We should not suddenly have higher traffic though, that's kind of odd.
(In reply to comment #6)
> We should not suddenly have higher traffic though, that's kind of odd.
> 

Hasn't been sudden, it's been building up over the last couple of weeks.  From nagios history, I think it started happening around Aug 2nd.
Group: security → mozillaorgconfidential
setting to p1 because of impact to ftp site.
Status: NEW → ASSIGNED
Priority: -- → P1
That was a busy time. Looking through old emails, I see that we shipped:

Firefox 2.0.0.6 and updates at 6:15pm PDT 30july2007
Thunderbird 2.0.0.6 and updates at 6:29pm PST 01aug2007
Gecko1.9a7 shipped (no updates) at 11:46am PST 03aug2007 

Seth:  Any idea why this problem has been getting worse over the past few weeks?  

If a user automatically checks and downloads the complete.mar and fails the update... do we clean everything up and retry?  What is the time delay for auto check for update?

Priority: P1 → --
Rhelmer:  We intentionally served Firefox 2.0 Beta updates through ftp?  I thought we always switched to bouncer for betas and final, and only did alphas off ftp?

Regardless of what our policy is, why did 2.0 alpha/beta users all of a sudden want to get the 2.0.0.6 updates?  Why had they not tried for 2.0.0.1 through 2.0.0.5 in the past?  I wonder if those same users have been continuously failing or if we somehow didn't serve them updates until 2.0.0.6 (I doubt this is the case).

Just throwing some more ideas out there...maybe one will stick. :-)
At rhelmer's suggestion, I did some grepping of the update snippet store. There are live updates on the beta and betatest channels for all 2.0 versions starting at 2.0a{1,2,3} (en-US only), through the betas & 2.0 RCs to 2.0.0.5. We point people on the beta channel at ftp.m.o because we haven't done the full signing, staging, mirroring-out process of a full release, and want to get these test builds out to users promptly.

If http://10.2.72.16:8500/aus2/aus2_dynamic.cfm is to be believed, and Justin has warned in the past that it's not supported, then the number of beta users is unchanged over the last 3 months, currently peaking at about 80k.

Nightly updates also come off ftp.m.o, that's about 25,000 users/day but I would guess most of them are partials. Pretty much flat over recent weeks.

I don't have any good ideas about why we are suddenly getting a bunch more traffic, especially such a large number of 2.0a2 users. Can we look at the User-Agents to sanity check? Anyone noticed anything in the blogosphere?
(In reply to comment #12)
> Nightly updates also come off ftp.m.o, that's about 25,000 users/day but I
> would guess most of them are partials. Pretty much flat over recent weeks.

Wouldn't the partial requests be for something other than the complete.mar files?

> Can we look at the
> User-Agents to sanity check ? Anyone noticed anything in the blogosphere ?

What specifically do you want about the User agents?  They look legitimate to me.  Most are like "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1b2) Gecko/20060821 Firefox/2.0b2"

It's also odd to me that some of the IPs seem to be requesting the same file multiple times (sometimes up to 100 times) in the hour.  Also, they seem to be requesting this over multiple days.  Not all of them, but a few of the IPs I have checked so far seem to be doing this.  I am not ruling out a DOS attack, but just sharing what I see.


(In reply to comment #13)
> (In reply to comment #12)
> > Nightly updates also come off ftp.m.o, that's about 25,000 users/day but I
> > would guess most of them are partials. Pretty much flat over recent weeks.
> 
> Wouldn't the partial requests be for something other than the complete.mar
> files?


No, because for previous releases we don't generate partial diffs, we just serve the complete.

> > Can we look at the
> > User-Agents to sanity check ? Anyone noticed anything in the blogosphere ?
> 
> What specifically do you want about the User agents?  They look legitimate to
> me.  Most are like "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1b2)
> Gecko/20060821 Firefox/2.0b2"
> 
> Its also odd to me that some of the ips seem to be requesting the same file
> multiple times (sometimes up to 100 times) in the hour.  Also, they seem to be
> requesting this over multiple days.  Not all of them, but a few of the ips I
> have checked so far seem to be doing this.  I am not ruling out a DOS attack,
> but just sharing what I see.

The same IP sounds really fishy to me. Did you check to see if they have the same cookie or not? Hoping to narrow down if this is some crazy client bug, or if it's a DOS or other attack faking their user agent.
(In reply to comment #13)
> (In reply to comment #12)
> > Nightly updates also come off ftp.m.o, that's about 25,000 users/day but I
> > would guess most of them are partials. Pretty much flat over recent weeks.
> Wouldn't the partial requests be for something other than the complete.mar
> files?

I mentioned nightlies just to rule out other possible sources of load. We do serve completes for them if you are more than one build behind, but they wouldn't come out of firefox/nightly/2.0.0.6-candidates/.

Looks like it was just a red herring.
 
> > Can we look at the
> > User-Agents to sanity check ? Anyone noticed anything in the blogosphere ?
> 
> What specifically do you want about the User agents?  They look legitimate to
> me.  Most are like "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1b2)
> Gecko/20060821 Firefox/2.0b2"

Sorry, wasn't specific enough - I meant checking that AUS request URLs matched up with the User-Agent. The 20060821 build date is right for 2.0b2, so that's not a frankenfox.

> Its also odd to me that some of the ips seem to be requesting the same file
> multiple times (sometimes up to 100 times) in the hour.  Also, they seem to be
> requesting this over multiple days.  Not all of them, but a few of the ips I
> have checked so far seem to be doing this.  I am not ruling out a DOS attack,
> but just sharing what I see.

Hmm, you'd think that users would get sick of getting prompted to update. Are any of these frequent IPs identifiable as a corporate NAT?
(In reply to comment #14)
> (In reply to comment #13)
> > (In reply to comment #12)
> > > Nightly updates also come off ftp.m.o, that's about 25,000 users/day but I
> > > would guess most of them are partials. Pretty much flat over recent weeks.
> > 
> > Wouldn't the partial requests be for something other than the complete.mar
> > files?
> No, because for previous releases we don't generate partial diffs, we just
> serve the complete.

Meaning, for example, that when we ship version 'n', we ship partial updates for users on version 'n-1', so they have a small, quick, download. However, any user on version 'n-2' or older will be served a complete update, instead of having to upgrade from 'n-2' -> 'n-1' and then from 'n-1' -> 'n'.
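The rule can be sketched roughly as follows. This is only an illustration of the 'n-1' versus 'n-2 or older' policy described here, not the actual AUS logic; `choose_update` and its arguments are hypothetical.

```python
def choose_update(client_version, releases):
    """Pick the update type for a client under the policy described above.

    `releases` is a list of version strings ordered oldest to newest;
    the last entry is the version being shipped ('n').
    Returns "partial", "complete", or None if the client is up to date.
    """
    latest = releases[-1]
    if client_version == latest:
        return None  # already on version n, nothing to serve
    if len(releases) >= 2 and client_version == releases[-2]:
        return "partial"  # version n-1 gets the small binary diff
    # n-2 or older (or anything unrecognised) gets the full package
    return "complete"
```

With releases ending in 2.0.0.5 and 2.0.0.6, a 2.0.0.5 client would get a partial, while 2.0b2 or 2.0a2 clients would get the complete.mar, which is consistent with the requests seen in the logs.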


> > > Can we look at the
> > > User-Agents to sanity check ? Anyone noticed anything in the blogosphere ?
> > 
> > What specifically do you want about the User agents?  They look legitimate to
> > me.  Most are like "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1b2)
> > Gecko/20060821 Firefox/2.0b2"
> > 
> > Its also odd to me that some of the ips seem to be requesting the same file
> > multiple times (sometimes up to 100 times) in the hour.  Also, they seem to be
> > requesting this over multiple days.  Not all of them, but a few of the ips I
> > have checked so far seem to be doing this.  I am not ruling out a DOS attack,
> > but just sharing what I see.
> 
> The same IP sounds really fishy to me. Did you check to see if they have the
> same cookie or not? Hoping to narrow down if this is some crazy client bug, or
> if it's a DOS or other attack faking their user agent.

Could these same symptoms happen if the client is trying to apply the complete download, but failing to apply the update for some reason, erroring out, and then later re-detecting the same pending update, retrying, re-failing, and repeating?
(In reply to comment #16)
> (In reply to comment #14)
> > The same IP sounds really fishy to me. Did you check to see if they have the
> > same cookie or not? Hoping to narrow down if this is some crazy client bug, or
> > if it's a DOS or other attack faking their user agent.
> 
> Could these same symptoms happen if the client is trying to apply the complete
> download, but failing to apply the update for some reason, erroring out, and
> then later re-detecting the same pending update, retrying, re-failing, and
> repeating?

We should try correlating these with AUS logs. I think that if what you propose is what is happening we should see people trying the partial, failing, then the full, failing and then retrying the next day. Trying e.g. 100x per hour does not sound like normal behavior.
Blocks: 393714
Group: mozillaorgconfidential
https://bugzilla.mozilla.org/show_bug.cgi?id=393714

Whatever is causing this is starting to affect nightly/hourly testers.
Bumping up the severity since this is not impacting nightly testing.
Severity: normal → major
Bumping up the severity since this is now impacting nightly testing.
We should consider pulling the snippets pointing to ftp.m.o. I'll go ahead and make preparations.
Aravind and I met up on IRC and agreed to remove from the update datastore 
    Firefox/2*/<all platforms>/<all locales>/beta
to test if this is real update traffic or something more malicious. The pull was completed at 0819 PDT.

Backup prior to this is 20070826-1-pre-20070826-Remove-Fx2-beta.tar.bz2, made with the pushsnip script and an empty dir to push in.

HTTP and FTP access to ftp.m.o was disabled at 9am PDT to let rsync catch up.
Looks like no new hourlies or nightlies are showing up at all now on ftp since the last respin.
Raising severity because of ongoing impact on testers and nightly updates. I've made all of the files inside
   ../firefox/nightly/2.0.0.6-candidates/rc2/
read only to remove that source of http load.

Rsync from staging to ftp.m.o is currently 140 minutes behind. What's the current status ?
Severity: major → critical
Priority: P1 → --
(In reply to comment #25)
...
> read only to remove that source of http load.

s/read only/inaccessible/

I think the problem is related to hardware and overlapping rsyncs.  I am working on trying to speed up the rsyncs.  But the load on the box is under control now.  So I will close this bug.

Thanks to the build team for all their effort.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
For the record, the multiple requests from the same IP are more than likely chunked downloads.  It'll request a few K at a time over a long period of time, to prevent soaking the user's bandwidth if they're not actively watching it download.  Sometimes this takes a few days if they're on a slow connection.  These should be detectable in the logs by looking for the HTTP response code being 206 instead of 200.
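A small sketch of that check, assuming the log has already been parsed into `(ip, path, status)` tuples. The function names and the "mostly 206" heuristic are illustrative assumptions, not something taken from the update client or the server configuration.

```python
from collections import defaultdict


def summarize_status(records):
    """Group requests by (ip, path) and count full (200), partial (206),
    and other responses for each pair.

    `records` is an iterable of (ip, path, status) with an integer status.
    """
    summary = defaultdict(lambda: {"full": 0, "partial": 0, "other": 0})
    for ip, path, status in records:
        if status == 206:
            kind = "partial"
        elif status == 200:
            kind = "full"
        else:
            kind = "other"
        summary[(ip, path)][kind] += 1
    return dict(summary)


def looks_chunked(counts):
    """Heuristic (an assumption for illustration): repeated requests are
    probably one chunked download if 206 responses outnumber 200s."""
    return counts["partial"] > counts["full"]
```

An IP with a hundred 206 responses for one file in an hour would then read as a single slow download, whereas a hundred 200 responses would still deserve a closer look.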
(In reply to comment #22)
> Aravind and I met up on IRC and agreed to remove from the update datastore 
>     Firefox/2*/<all platforms>/<all locales>/beta
> to test if this is real update traffic or something more malicious. The pull
> was completed at 0819 PDT.

(In reply to comment #25)
> I've
> made all of the files inside
>    ../firefox/nightly/2.0.0.6-candidates/rc2/
> read only to remove that source of http load.

Both of these changes have been reversed.
Product: mozilla.org → Release Engineering