Closed Bug 467502 Opened 16 years ago Closed 16 years ago

QA AMO behind Zeus ZXTM load balancer

Categories

(mozilla.org Graveyard :: Server Operations, task)

All
Other
task
Not set
minor

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: mrz, Assigned: oremj)

Attachments

(1 file)

Need some QA on AMO behind Zeus ZXTM.  Site is staged at 63.245.209.107 and only answers https right now.
I'll save Reed the trouble and add him.
Assignee: server-ops → oremj
Group: infra
Works for http too, redirects.

Desktop-Computer:~ mrz$ curl -H'Host: addons.mozilla.org' -v http://addons.mozilla.org/
* About to connect() to addons.mozilla.org port 80 (#0)
*   Trying 63.245.209.107... connected
* Connected to addons.mozilla.org (63.245.209.107) port 80 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.16.3 (powerpc-apple-darwin9.0) libcurl/7.16.3 OpenSSL/0.9.7l zlib/1.2.3
> Accept: */*
> Host: addons.mozilla.org
> 
< HTTP/1.1 301 Moved Permanently
< Content-Length: 0
< Date: Tue, 02 Dec 2008 07:22:04 GMT
< Connection: Keep-Alive
< Location: https://addons.mozilla.org/
< Content-Type: text/html
< 
* Connection #0 to host addons.mozilla.org left intact
* Closing connection #0
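
The redirect check above can also be scripted. A minimal sketch, assuming Python: `is_https_redirect` is a hypothetical helper (not part of any tool mentioned in this bug) that only inspects a status code and Location header that have already been captured, for example from the curl output above.

```python
from urllib.parse import urlsplit


def is_https_redirect(status, location, host):
    """Return True if the response redirects to the https:// version of host."""
    # Only treat permanent/temporary redirects as candidates.
    if status not in (301, 302):
        return False
    parts = urlsplit(location)
    # The target must use https and stay on the same hostname.
    return parts.scheme == "https" and parts.hostname == host


# Values copied from the curl transcript above.
print(is_https_redirect(301, "https://addons.mozilla.org/", "addons.mozilla.org"))
```

Keeping the check as a pure function means it can be run against saved headers from either the Netscaler or the ZXTM side without touching the network.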
curl -H'Host: addons.mozilla.org' -v -k https://63.245.209.107/en-US/firefox/

wfm.

Did you duplicate the cookie rules that the NS currently has for AMO?
Not looking good thus far:

[1] Taking upward of 40 seconds once I've clicked on the "Add to Firefox" button
[2] Taking even longer --60 seconds or so-- to get the file-failed-to-download message of:
[3] "Firefox could not install the file at 

https://addons.mozilla.org/en-US/firefox/downloads/file/42123/adblock_plus-1.0-fx+sm+tb.xpi

because: Download error
-228"

Maybe I should've added services.addons.mozilla.org, too, to my HOSTS file?

https://services.addons.mozilla.org/en-US/firefox/api/1.1/search/live%20http%20headers/all/10/WINNT/3.0.4

More tomorrow; bed now :-P
(In reply to comment #3)
> Did you duplicate the cookie rules that the NS currently has for AMO?

Those were specific to the Netscaler's version of caching (or rather how it chose to ignore/break things).  As such those rules aren't directly "duplicate-able" and I'm not sure if they're needed.  

This bug is here to track all of that though :)
oremj - this is on zxlb04 / 10.2.10.72.  Still using a self-signed SSL cert,
feel free to replace it with a moco signed *.mozilla.org one or something.
On prod, I get the following two things right after that 302:

https://addons.mozilla.org/en-US/firefox/

GET /en-US/firefox/ HTTP/1.1
Host: addons.mozilla.org
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.4) Gecko/2008102920 Firefox/3.0.4
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Referer: https://addons.mozilla.org/en-US/firefox/
Cookie: __utma=164683759.1720830908.1228207447.1228255323.1228255654.11; __utmz=164683759.1228207447.1.1.utmccn=(direct)|utmcsr=(direct)|utmcmd=(none); __utmb=164683759; __utmc=164683759; AMOappName=firefox

HTTP/1.x 200 OK
Date: Tue, 02 Dec 2008 21:47:41 GMT
Content-Length: 10575
Connection: Keep-Alive
Via: NS-CACHE-6.0:   4
Server: Apache/2.2.3 (Red Hat)
X-Powered-By: PHP/5.1.6
X-AMO-ServedBy: mrapp06
Content-Type: text/html; charset=UTF-8
Cache-Control: private
Content-Encoding: gzip
I don't see an addon cookie in your logout request in comment 7.
(In reply to comment #9)
> I don't see an addon cookie in your logout request in comment 7.

Yeah; I didn't modify anything--that's the straight output :-(
I think I fixed that issue.  Let me know if logging in/out is working for you now.
Hey guys - what is the schedule/timeline for this?  Would like someone on webdev to give stephend a hand with this.  What are you guys thinking as a switch date?
Until we buy some hardware to front zeus I believe this is just going to be a backup for the netscaler.  

Mrz, what's the layer 4 load balancer evaluation timeline?
I'm still seeing the problems I mentioned in comment 4, along with:

[4] When I log in, the page doesn't reflect that fact; it still displays "Log in", until I load another link or click Refresh.

Log out, though, seems to now work correctly.
(In reply to comment #12)
> Hey guys - what is the schedule/timeline for this?  Would like someone on
> webdev to give stephend a hand with this.  What are you guys thinking as a
> switch date?

Right now I want a fallback plan for the Dec 16 release.  I think we have enough stuff shifted around for MU on Thursday, though it would be great to have AMO ready to go on Zeus should that be necessary.

(In reply to comment #13)
> Until we buy some hardware to front zeus I believe this is just going to be a
> backup for the netscaler.  

Zeus is "production" for vamo & fxfeeds.  Moving those took a significant load off the Netscalers.
 
> Mrz, what's the layer 4 load balancer evaluation timeline?

Just got that Cisco ACE try-n-buy signed today, should get hardware soon.
(In reply to comment #14)
> I'm still seeing the problems I mentioned in comment 4, along with:
> 
> [4] When I log in, the page doesn't reflect that fact; it still displays "Log
> in", until I load another link or click Refresh.
> 
> Log out, though, seems to now work correctly.

Jeremy has fixed the performance and failure-to-download error (-228); the load balancer was returning https://releases..., rather than http://releases...

Issue [4], however, still remains: I'm getting the cached homepage view after I log in to AMO (shows "Log in" as a link, rather than "My Account", "Developer Tools", and "Log out").
Do you have a way to reliably reproduce Issue 4?
(In reply to comment #17)
> Do you have a way to reliably reproduce Issue 4?

Yes, 100%.

1. Just click "Log in", log in, and then look at the page -- it'll still appear as though you're logged out, though if you refresh the page, you'll clearly see you're logged in.
2. If you click "Log out" from this page (once that link is visible after a reload/refresh), you can keep reproducing this.

http://pastebin.com/m461f37d2 has the headers
OK, yet-another-update:

At 2:26pm today, Jeremy did something that appears to have fixed issue # 4.

Since then, I've tested:

* Account creation
* Password reset
* Different application views

Additionally, I've thrown my two Selenium testcases--search and the more general one--at it, with no errors.

Next up, the Dev CP...
What kind of performance improvement does the zeus box have over the netscalers?
(In reply to comment #20)
> What kind of performance improvement does the zeus box have over the
> netscalers?

Hard to quantify, but it's a more scalable solution than Netscaler is, which is just two boxes.  Zeus ZXTM runs on commodity hardware (on top of RHEL5) and I can keep adding ZXTM nodes to grow it. 

The only thing Netscaler does in specialized hardware is SSL offload.  The Netscaler 12k on paper can do 28,000 SSL tps but we see closer to 8,000 before it fails.  Zeus claims a dual quad-core L5450 Xeon server can do about 12,000 SSL tps.  At 1/20th the cost of Netscaler, I can get a couple boxes to handle 28k and can exceed that a lot easier.

(For production, I'm looking at dual E5460 Xeons)
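
The sizing arithmetic in the comment above can be worked through as a rough sketch. The throughput figures are the ones quoted; the assumption that "1/20th the cost" applies per ZXTM box is mine and is not stated in the comment.

```python
import math

NETSCALER_PAPER_TPS = 28_000     # Netscaler 12k, on paper (from the comment)
NETSCALER_OBSERVED_TPS = 8_000   # roughly where it fails in practice (from the comment)
ZEUS_CLAIMED_TPS = 12_000        # Zeus's claim for a dual quad-core L5450 server
COST_RATIO_PER_BOX = 1 / 20      # ASSUMPTION: one ZXTM box costs 1/20th of a Netscaler


def boxes_needed(target_tps, per_box_tps):
    """Smallest whole number of boxes whose combined tps meets the target."""
    return math.ceil(target_tps / per_box_tps)


paper = boxes_needed(NETSCALER_PAPER_TPS, ZEUS_CLAIMED_TPS)
observed = boxes_needed(NETSCALER_OBSERVED_TPS, ZEUS_CLAIMED_TPS)
relative_cost = paper * COST_RATIO_PER_BOX

print(f"boxes to match the paper figure: {paper}")
print(f"boxes to match the observed figure: {observed}")
print(f"cost relative to Netscaler (under the assumption above): {relative_cost:.2f}")
```

On the claimed numbers, two boxes would give 24,000 tps and a third is needed to clear 28,000; a single box already exceeds the roughly 8,000 tps seen in practice. These are vendor and on-paper figures, so real capacity would still need the load test discussed later in this bug.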
Sweet, that sounds great.
I've tested uploading an add-on, as well as changing its description and screenshots, now, with no issues.

I'd feel more comfortable if a few webdev folks could also poke at AMO, too :-P
I clicked around and it worked fine for me.
That's great!

I'd like to do a performance test run and shift AMO over to ZXTM Tuesday night and into at least Wednesday morning/afternoon to pick up peak traffic. Maybe even Tuesday - Thursday.

This will affect log processing for the duration.

Trying to gauge how well ZXTM does under AMO load, how Gomez views external performance, and how large the eventual ZXTM cluster might need to be to handle it (right now it's 80% idle with vamo & fxfeeds).

Any show stoppers to that?
(In reply to comment #25)
> This will affect log processing for the duration.

What's that mean?  I think anything that affects our statistics is a show stopper.
Means deinspanjer will need to process the ZXTM log dir for AMO instead of (or in addition to) the normal AMO log dir.

Anything else look at those logs?
(In reply to comment #27)
> Anything else look at those logs?

Yes, AMO has its own log processing, which went critical yesterday due to other log problems.
AMO's log parse scripts will need to be updated for the VAMO and AMO log changes, as logs from both sites are processed. Considering VAMO already moved to ZXTM earlier, the VAMO part of the addons processing scripts has been broken since then.
I don't understand how this keeps happening.

The information on reconfiguring the log dirs and re-parsing the old logs is here: https://wiki.mozilla.org/Update:Developers/Statistics
(In reply to comment #30)
> I don't understand how this keeps happening.

Probably because there's no monitor to remind anyone about it when it stops grabbing current data.  Is that something easy to do?
(In reply to comment #31)
> (In reply to comment #30)
> > I don't understand how this keeps happening.
> 
> Probably because there's no monitor to remind anyone about it when it stops
> grabbing current data.  Is that something easy to do?

I said in comment #28 that the current AMO log processing monitor I wrote for this purpose went critical yesterday, but it looks like it was ignored (at least from my point of view).

I have to agree with fligtar here, though. This continues to happen every time something happens to the logs (I can personally count at least 3-4+ times). What needs to be done to make sure that everybody is pulled in every time something happens or needs to happen to any AMO and/or VAMO logs? deinspanjer was notified in bug 467412 about this change, but it doesn't seem like anybody told fligtar.
It's not even necessary that I be told - it was my understanding that we handed the log configuration aspect off to IT when that documentation was written and we had trouble last time and made the integrity check script and nagios monitor.
(In reply to comment #30)
> I don't understand how this keeps happening.
> 
> The information on reconfiguring the log dirs and re-parsing the old logs is
> here: https://wiki.mozilla.org/Update:Developers/Statistics

Sorry, we switched over to Zeus in an emergency situation, so stats were secondary to keeping the service up.  I did get a page yesterday(?):

14:15 <@nagios> [79] dm-stats01:addons stats is CRITICAL:
[FAIL][Adblock Plus] [Update Pings] Count from Wednesday changed by 85% from 2008-11-26 count of 5202612
[FAIL][Adblock Plus] [Update Pings] Count from Wednesday changed by 82% from 2008-11-19 count of 4348990
[FAIL][NoScript] [Update Pings] Count from Wednesday changed by 89% from 2008-11-26 count of 1818583
[FAIL][NoScript] [U

I think I've come up with a solution that will fix the logs without changing the configs.

I'll start running the script in Attachment 35183 [details] [diff] which will consolidate all the zeus logs to one place.
Oops, meant Attachment 351833 [details].
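
The Nagios alert quoted above reports how far an update-ping count moved relative to an earlier week. A minimal sketch of that kind of check follows. The real integrity check script is not shown in this bug, so the function names, the 50% threshold, and the example current count are all assumptions; only the baseline of 5202612 comes from the alert.

```python
def percent_change(current, baseline):
    """Absolute change of current relative to baseline, in percent."""
    return abs(current - baseline) / baseline * 100


def check_count(name, current, baseline, threshold_pct=50.0):
    """Format a FAIL/OK line; threshold_pct is a hypothetical default."""
    change = percent_change(current, baseline)
    status = "FAIL" if change > threshold_pct else "OK"
    return f"[{status}][{name}] count changed by {change:.0f}% from count of {baseline}"


# Baseline from the alert; the current count is illustrative only.
print(check_count("Adblock Plus", current=780_392, baseline=5_202_612))
```

A check like this only flags that the numbers moved; it cannot tell a real traffic drop from logs landing in a directory the parser no longer reads, which is the situation described in this thread.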
Thanks Jeremy - can you update the update ping counter config to parse that directory and re-run the update ping script for last Wednesday (2008-12-03)?
Blocks: 468570
It's probably stating the obvious but when the AMO test happens at 9pm tonight the log parsing scripts need to continue to work with the Zeus logs.
(In reply to comment #37)
> It's probably stating the obvious but when the AMO test happens at 9pm tonight
> the log parsing scripts need to continue to work with the Zeus logs.
I'm hardlinking the logs into the directory the stats scripts are already looking at, so hopefully everything will just work.
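
The hardlinking approach described above could look roughly like the following. This is a sketch under stated assumptions, not the script from the attachment: `hardlink_logs`, the directory layout, and the prefix naming scheme are all hypothetical.

```python
import os


def hardlink_logs(source_dirs, dest_dir):
    """Hard-link every regular file from source_dirs into dest_dir.

    Each link is prefixed with its source directory name so that files with
    the same name from different nodes do not collide. Existing links are
    skipped, so the function can be re-run safely. Returns the new link paths.
    """
    os.makedirs(dest_dir, exist_ok=True)
    created = []
    for src_dir in source_dirs:
        prefix = os.path.basename(os.path.normpath(src_dir))
        for name in sorted(os.listdir(src_dir)):
            src = os.path.join(src_dir, name)
            if not os.path.isfile(src):
                continue  # skip subdirectories and other non-regular entries
            dest = os.path.join(dest_dir, f"{prefix}-{name}")
            if os.path.exists(dest):
                continue  # already linked on a previous run
            os.link(src, dest)  # hard link: same inode, no data copied
            created.append(dest)
    return created
```

Hard links only work within a single filesystem, so this assumes the Zeus log directories and the stats directory live on the same mount. The upside is that the stats scripts keep reading their usual path and no parser config has to change.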
Site has been live behind the ZXTM boxes since ~9:10pm.
Still looking good so far.
From the vserver | Content Compression:

The following table holds a list of the MIME types for the content that will be compressed.

MIME Type       	
 text/css      	
 text/plain      	
 text/html      


Default Mime-types don't include javascript.  There was some bug some time ago to include javascript.  The matching Netscaler rule is:

add policy expression moz_javascript "RES.HTTP.HEADER Content-Type CONTAINS javascript"                                 

Pretty sure I did a CONTAINS because I didn't want to have to match on these:

text/javascript
application/x-javascript

Added both of those to the addons-ssl vserver (sure wish there was a global setting).
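
The difference between the Netscaler "CONTAINS javascript" rule and an explicit MIME type list can be illustrated with a short sketch. This models the two matching strategies only; how ZXTM itself compares Content-Type values (parameters, case, wildcards) is not stated in this bug, and the function names are hypothetical.

```python
# The three default types from the vserver table plus the two added above.
COMPRESSIBLE_EXACT = {
    "text/css",
    "text/plain",
    "text/html",
    "text/javascript",
    "application/x-javascript",
}


def media_type(content_type):
    """Strip parameters such as '; charset=UTF-8' and normalise case."""
    return content_type.split(";", 1)[0].strip().lower()


def compress_by_list(content_type):
    """Explicit-list strategy: the media type must be listed exactly."""
    return media_type(content_type) in COMPRESSIBLE_EXACT


def compress_by_contains(content_type):
    """Substring strategy, in the spirit of the Netscaler CONTAINS rule."""
    return "javascript" in content_type.lower()


for ct in ("text/javascript", "application/x-javascript", "application/javascript"):
    print(ct, compress_by_list(ct), compress_by_contains(ct))
```

The substring rule catches any JavaScript variant, including `application/javascript`, while the explicit list only covers the types that were added by hand, so a new variant served by the backend would go uncompressed until the list is updated.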
Test complete.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Verified -- for my part, anyway, which was what this bug was about :-)
Status: RESOLVED → VERIFIED
Jeremy - can you take a look at comment #39? Has that script been run?
(In reply to comment #44)
> Jeremy - can you take a look at comment #39? Has that script been run?

You sure that's the right comment?

Are we caught up on stats crunching now?  I'm still hearing reports about broken stats.
I meant comment #36, but I'm filing another bug now because we now have 2 weeks of broken stats.
Is this fixed now? The stats are still way down. Even if we're not retroactively parsing the missed stats (yet?), the current stats don't look right either. E.g. https://addons.mozilla.org/en-US/firefox/statistics/addon/1865 fell off a cliff after Nov 25, and continues to drop despite a huge spike in downloads (due to a new release).
The new bug for the broken stats is bug 469376
Blocks: 477130
Product: mozilla.org → mozilla.org Graveyard