Closed Bug 358025 Opened 13 years ago Closed 12 years ago

AUS needs a way to stagger updates to reduce load on mirrors

Categories

(AUS Graveyard :: Administration, task)

task
Not set

Tracking

(Not tracked)

RESOLVED FIXED
4.x (triaged)

People

(Reporter: morgamic, Assigned: morgamic)

Details

Attachments

(2 files, 1 obsolete file)

The current mirror infrastructure floods with connections when a small patch gets released.  The 2.0 .mar is 10 times that size, so we need to think of our strategy .  AUS should most likely be adjusted to stagger by locales or some other identifier so that the mirrors don't get flooded.
For context, this bug is a server-side fix for the problems caused by bug 352853.
Summary: Need a way to stagger 1.5.0.8 updates to reduce load on mirrors → AUS needs a way to stagger updates to reduce load on mirrors
Another option could be to stagger by percentage and/or random seed.  If we gradually increased the probability of offering an update, we could reduce the # of mirror requests initially and even out the distribution.

A point was brought up about the difficulties it presents when trying to test/verify updates for all locales.  My response to that would be that we could also easily code a bypass query variable to the AUS code that would tell AUS to not use the random seed logic and just offer whatever it knows about for testing.
Working on a patch to stagger, just in case.
Status: NEW → ASSIGNED
Simple patch to throttle, in case we need it.  If we would plan on using it I would add two acceptance tests -- one for making sure 100 throttle returns updates all the time and another ensuring that 0 throttle never does.  Testing the in-between is tricky.
Attachment #268001 - Flags: review?(preed)
Is there a config file already defined where we could put the value of THROTTLE?  I know it's basically the same - but if we have to check-in a rush change to the throttle amount might be nice to be sure it is a config only (and not code change).  If there isn't already a config.php file no bigs..
 
(In reply to comment #5)
> If there isn't already a config.php file no bigs..
It's in config-dist.php -- in prod there is a config.php which could be locally modified by sysadmins as needed.  Though I think that right now since config.php didn't have anything different it is just a symlink to config-dist.php.  That can change, of course -- but not hard to do.
Comment on attachment 268001 [details] [diff] [review]
v1, patch to throttle updates based on config var

Alright... insert the "Preed is so paranoid, I hate him"-chorus here, but could we wrap all this logic in a boolean *and* a percentage... i.e. THROTTLE = 0/1, and THROTTLE_LEVEL = 0-100.

That way, it's easy to turn this code off with a flag, and we're not relying on the output of mr_rand() at all.

(I'm basically looking for a way to null out/turn off this code entirely.)
Attachment #268001 - Flags: review?(preed) → review-
You could argue that this is redundant, but I won't and I won't. :)
Attachment #268001 - Attachment is obsolete: true
Attachment #268452 - Flags: review?
Attachment #268452 - Flags: review? → review?(preed)
Comment on attachment 268452 [details] [diff] [review]
v2, added boolean flag per preed's request

(In reply to comment #8)
> You could argue that this is redundant, but I won't and I won't. :)

I do appreciate it. Really. Appreciate it.

I just want a very easy way to turn off the entire logic (including calls to anything else; for instance, I could see an [possibly improbable] situation where mt_rand() calls start to stack up because we run out of entropy... although, looking at mt_rand(), this isn't supposed to happen); an ifdef 0 of sorts.

With this, I'm comfortable with this going in for MU. That doesn't imply that we use it, but I'm ok with it going in (because we can keep it off for now).

Any other testing we need/would want to do?
Attachment #268452 - Flags: review?(preed) → review+
I think morgamic has done acceptance testing (morgamic can you verify)?  If so, I'd like to get this out there in production soon so it has some bake time before the major release.  When would you guys be ready for a push?
Yeah, acceptance tests work fine w/ the code change.  So I'd be okay with pushing this if build is.
(In reply to comment #10)
> I think morgamic has done acceptance testing (morgamic can you verify)?  If so,
> I'd like to get this out there in production soon so it has some bake time
> before the major release.  When would you guys be ready for a push?

I believe that we are still on track for Thursday, June 22.

To clarify, I'm ok with checking it in, but I don't believe we're planning on turning it on.

Do I have that correct?

The bake time is important, but if THROTTLING is set to false, then...

One other thought: morgamic: is it worth adding some error_log() (or other) logic to specify when a hit was deferred vs. not? Or do we not care? It would be nice to add to the access_log that such and such client ping, but did not get an update because of throttling.

(Sorry for having just thought of this.)
We could write to the standard apache error log every time the THROTTLE prevents the script from continuing.  The interesting thing is that the way this patch is set up, it will be the same percentage across all requests, and we won't know if the user was supposed to get an update when it terminates the request with a "no updates" xml.  Is whether or not the user was supposed to get an update important?  Logging with what we have now would just get us an idea of what percentage of overall hits were "throttled".
Please no logging on every hit - that would create a lot more i/o for the webheads - rob seemed to think there was an easier way to test before the release.
(In reply to comment #15)
> Please no logging on every hit - that would create a lot more i/o for the
> webheads - rob seemed to think there was an easier way to test before the
> release.

But, I think we should log :) It would make testing much easier.
for testing maybe, but once we go live I'd like it off - I dont want to double the logging load (2x per hit).
I'll add logging and make it configurable.  Will resubmit a patch and sample log when I've got it.
The previous version was checked in after preed approved it.  This is a new patch to add logging.  Log file format will look like this:
[client 10.2.72.11] AUS2 THROTTLE: 10.2.72.11 /austest/update/1/Firefox/1.5.0.7/2006090918/WINNT_x86-msvc/en-US/release-yahoo/update.xml

IP looks redundant but is like that because the IP apache logs will be the netscalers.
Attachment #268562 - Flags: review?(preed)
Comment on attachment 268562 [details] [diff] [review]
v3, with logging IP and path info

Looks fine. So, so I understand this: if we wanted to figure out a percentage of clients that got the update, based on logs alone, we would have to take the access_log and the error_log and correlate between the two to see which IPs at which times got updates?

Also, this part looks fine, but it doesn't take into account bug 377088; will that be in another patch?
Attachment #268562 - Flags: review?(preed) → review+
Logging stuff has been checked in, but based on feedback we aren't going to be pushing this before major update because two days is not enough time to test it.

So before I concede, this could have been pushed on the 14th, but we had to add two additional features we thought of after-the-fact.  The boolean is redundant and the logging is basically verifying the behavior of mt_rand()?

I should note that these changes are a) easily disabled b) tested by the acceptance tests and spot checks c) pretty important if we reach a point where we can't handle the MU load.

So we can put it in now while we still have two days or we can wait until we are 100% certain we need it and push it during duress and eminent server craziness as a last resort.  

Ultimately, if we're confident that our current plan for handling excessive load is good enough, then I'm fine with punting on this.  If we are uncertain then I'm not so sure ditching this patch is a good idea even if major update is just two days away.
We need this in for the release - we'll be pushing it tonight.  Who from the build side can be on to test?
How about we just set up aus2-staging and test that behind the Netscaler tonight?  Would that be better -- so that we know for sure it works in case we do need it at some point?

We need aus2-staging set up soonish either way.  It disappeared after the migration.

Also... needed to reply to preed:
(In reply to comment #20)
> (From update of attachment 268562 [details] [diff] [review])
> Looks fine. So, so I understand this: if we wanted to figure out a percentage
> of clients that got the update, based on logs alone, we would have to take the
> access_log and the error_log and correlate between the two to see which IPs at
> which times got updates?
Yeah, that would be the case, though I thought we were more interested in verifying the distribution of updates -- i.e. if we set throttle to 10% only 10% get updates, etc.

> Also, this part looks fine, but it doesn't take into account bug 377088; will
> that be in another patch?
That will be in another patch.
(In reply to comment #22)
> We need this in for the release - we'll be pushing it tonight.  Who from the
> build side can be on to test?

So, we talked about this at our build team meeting today, and we (in the meeting) were all of the mind that any code changes--even as simple as this seems--before a push like MU aren't a great idea.

We were all under the assumption we were going to go with the less-optimal-but-still-doable 2 hour on/off throttling, if necessary, as we agreed at the last major update meeting that you, Seth, Juan, Jay, rhelmer, joduinn and Basil were at.

Are we changing that plan now? If so, who do we need to talk to to make sure it's tested before Thursday?
Decision tree to me looks like this:

1) If we think the infra might need throttling (cause it might melt) than we should be ready to both have it and *use it*.   (I don't want to be rushing into this, apologizing for killing mirrors, or apologizing for busted infra that we knew about in advance).  

2) If we take the patch we should take the time to test it (both on an off) to make sure it is doing what is expected and is actually useful.

3) We should take the time to do this right.

So:

Justin: Make the call on whether you want this avail in production or not (question #1).

Preed/Morgamic: Need to understanding a testing plan and timeframe that makes you comfortable with this code in production (#2). 
This was pushed but it's currently disabled in production pending a bit more verification on aus2-staging.  Once those are finished up with some input from QA we'll give the green light on staggering if/when we need it.

Pre-deployment test comparing disabled aus2-staging throttling with aus2 prod:
http://people.mozilla.com/~morgamic/aus2-tests/no-throttling.txt

Post-deployment test verifying that indeed nothing has changed:
http://people.mozilla.com/~morgamic/aus2-tests/post-deployment.txt
Somehow missing from this bug thread:

The major update was postponed a week (now June28), due to a lastminute showstopper problem discovered in CJKT locales. For details, see https://bugzilla.mozilla.org/show_bug.cgi?id=385281
Err, this was fixed a while back.  :)
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.