Bug 467502 - QA AMO behind Zeus ZXTM load balancer
Status: VERIFIED FIXED
Product: mozilla.org Graveyard
Classification: Graveyard
Component: Server Operations
Version: other
Platform: All Other
Importance: -- minor
Target Milestone: ---
Assigned To: Jeremy Orem [:oremj]
QA Contact: matthew zeier [:mrz]
Mentors:
Depends on:
Blocks: 468570 477130
 
Reported: 2008-12-01 23:17 PST by matthew zeier [:mrz]
Modified: 2015-03-12 08:17 PDT
See Also:
QA Whiteboard:
Iteration: ---
Points: ---


Attachments
Live HTTP Headers output of failure to log out (2.52 KB, text/plain)
2008-12-02 14:12 PST, Stephen Donner [:stephend]

Description matthew zeier [:mrz] 2008-12-01 23:17:23 PST
Need some QA on AMO behind Zeus ZXTM.  Site is staged at 63.245.209.107 and only answers https right now.
Comment 1 matthew zeier [:mrz] 2008-12-01 23:18:39 PST
I'll save Reed the trouble and add him.
Comment 2 matthew zeier [:mrz] 2008-12-01 23:22:42 PST
Works for http too, redirects.

Desktop-Computer:~ mrz$ curl -H'Host: addons.mozilla.org' -v http://addons.mozilla.org/
* About to connect() to addons.mozilla.org port 80 (#0)
*   Trying 63.245.209.107... connected
* Connected to addons.mozilla.org (63.245.209.107) port 80 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.16.3 (powerpc-apple-darwin9.0) libcurl/7.16.3 OpenSSL/0.9.7l zlib/1.2.3
> Accept: */*
> Host: addons.mozilla.org
> 
< HTTP/1.1 301 Moved Permanently
< Content-Length: 0
< Date: Tue, 02 Dec 2008 07:22:04 GMT
< Connection: Keep-Alive
< Location: https://addons.mozilla.org/
< Content-Type: text/html
< 
* Connection #0 to host addons.mozilla.org left intact
* Closing connection #0
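
The same check can be scripted so it is repeatable during QA. This is only a minimal sketch: the function name `is_https_upgrade` is made up for illustration, and the values passed in are copied from the curl output above rather than fetched live.

```python
from urllib.parse import urlsplit


def is_https_upgrade(status, location, host):
    """Return True if the response is a 301 pointing at https:// on the same host."""
    if status != 301 or not location:
        return False
    parts = urlsplit(location)
    return parts.scheme == "https" and parts.hostname == host


# Status line and Location header as shown in the curl trace above.
print(is_https_upgrade(301, "https://addons.mozilla.org/", "addons.mozilla.org"))
```

A 200 response, or a 301 whose Location still uses http://, would make the check return False.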
Comment 3 Reed Loden [:reed] (use needinfo?) 2008-12-01 23:28:06 PST
curl -H'Host: addons.mozilla.org' -v -k https://63.245.209.107/en-US/firefox/

wfm.

Did you duplicate the cookie rules that the NS currently has for AMO?
Comment 4 Stephen Donner [:stephend] 2008-12-02 01:02:02 PST
Not looking good thus far;

[1] Taking upward of ~40 seconds once I've clicked on the "Add to Firefox" button
[2] Taking even longer (60 seconds or so) to get the file-failed-to-download message of:
[3] "Firefox could not install the file at

https://addons.mozilla.org/en-US/firefox/downloads/file/42123/adblock_plus-1.0-fx+sm+tb.xpi

because: Download error -228"

Maybe I should've added services.addons.mozilla.org, too, to my HOSTS file?

https://services.addons.mozilla.org/en-US/firefox/api/1.1/search/live%20http%20headers/all/10/WINNT/3.0.4

More tomorrow; bed now :-P
Comment 5 matthew zeier [:mrz] 2008-12-02 08:32:41 PST
(In reply to comment #3)
> Did you duplicate the cookie rules that the NS currently has for AMO?

Those were specific to the Netscaler's version of caching (or rather how it chose to ignore/break things).  As such those rules aren't directly "duplicate-able" and I'm not sure if they're needed.

This bug is here to track all of that though :)
Comment 6 matthew zeier [:mrz] 2008-12-02 09:56:23 PST
oremj - this is on zxlb04 / 10.2.10.72.  Still using a self-signed SSL cert,
feel free to replace it with a moco signed *.mozilla.org one or something.
Comment 7 Stephen Donner [:stephend] 2008-12-02 14:12:15 PST
Created attachment 351054 [details]
Live HTTP Headers output of failure to log out
Comment 8 Stephen Donner [:stephend] 2008-12-02 14:18:33 PST
On prod, I get the following two things right after that 302:

https://addons.mozilla.org/en-US/firefox/

GET /en-US/firefox/ HTTP/1.1
Host: addons.mozilla.org
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.4) Gecko/2008102920 Firefox/3.0.4
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Referer: https://addons.mozilla.org/en-US/firefox/
Cookie: __utma=164683759.1720830908.1228207447.1228255323.1228255654.11; __utmz=164683759.1228207447.1.1.utmccn=(direct)|utmcsr=(direct)|utmcmd=(none); __utmb=164683759; __utmc=164683759; AMOappName=firefox

HTTP/1.x 200 OK
Date: Tue, 02 Dec 2008 21:47:41 GMT
Content-Length: 10575
Connection: Keep-Alive
Via: NS-CACHE-6.0:   4
Server: Apache/2.2.3 (Red Hat)
X-Powered-By: PHP/5.1.6
X-AMO-ServedBy: mrapp06
Content-Type: text/html; charset=UTF-8
Cache-Control: private
Content-Encoding: gzip
Comment 9 Jeremy Orem [:oremj] 2008-12-03 15:41:23 PST
I don't see an addon cookie in your logout request in comment 7.
Comment 10 Stephen Donner [:stephend] 2008-12-03 15:48:26 PST
(In reply to comment #9)
> I don't see an addon cookie in your logout request in comment 7.

Yeah; I didn't modify anything--that's the straight output :-(
Comment 11 Jeremy Orem [:oremj] 2008-12-03 16:10:19 PST
I think I fixed that issue.  Let me know if logging in/out is working for you now.
Comment 12 Michael Morgan [:morgamic] 2008-12-03 16:11:31 PST
Hey guys - what is the schedule/timeline for this?  Would like someone on webdev to give stephend a hand with this.  What are you guys thinking as a switch date?
Comment 13 Jeremy Orem [:oremj] 2008-12-03 16:15:56 PST
Until we buy some hardware to front zeus I believe this is just going to be a backup for the netscaler.  

Mrz, what's the layer 4 load balancer evaluation timeline?
Comment 14 Stephen Donner [:stephend] 2008-12-03 16:30:33 PST
I'm still seeing the problems I mentioned in comment 4, along with:

[4] When I log in, the page doesn't reflect that fact; it still displays "Log in", until I load another link or click Refresh.

Log out, though, seems to now work correctly.
Comment 15 matthew zeier [:mrz] 2008-12-03 16:58:52 PST
(In reply to comment #12)
> Hey guys - what is the schedule/timeline for this?  Would like someone on
> webdev to give stephend a hand with this.  What are you guys thinking as a
> switch date?

Right now I want a fallback plan for the Dec 16 release.  I think we have enough stuff shifted around for MU on Thursday, though it would be great to have AMO ready to go on Zeus should that be necessary.

(In reply to comment #13)
> Until we buy some hardware to front zeus I believe this is just going to be a
> backup for the netscaler.  

Zeus is "production" for vamo & fxfeeds.  Moving those took a significant load off the Netscalers.
 
> Mrz, what's the layer 4 load balancer evaluation timeline?

Just got that Cisco ACE try-n-buy signed today, should get hardware soon.
Comment 16 Stephen Donner [:stephend] 2008-12-03 18:21:20 PST
(In reply to comment #14)
> I'm still seeing the problems I mentioned in comment 4, along with:
> 
> [4] When I log in, the page doesn't reflect that fact; it still displays "Log
> in", until I load another link or click Refresh.
> 
> Log out, though, seems to now work correctly.

Jeremy has fixed the performance and failure-to-download error (-228); the load balancer was returning https://releases..., rather than http://releases...

Issue [4], however, still remains: I'm getting the cached homepage view after I log in to AMO (shows "Log in" as a link, rather than "My Account", "Developer Tools", and "Log out").
Comment 17 Jeremy Orem [:oremj] 2008-12-04 12:19:46 PST
Do you have a way to reliably reproduce Issue 4?
Comment 18 Stephen Donner [:stephend] 2008-12-04 13:14:45 PST
(In reply to comment #17)
> Do you have a way to reliably reproduce Issue 4?

Yes, 100%.

1. Just click "Log in", log in, and then look at the page -- it'll still appear as though you're logged out, though if you refresh the page, you'll clearly see you're logged in.
2. If you click "Log out" from this page (once that link is visible after a reload/refresh), you can keep reproducing this.

http://pastebin.com/m461f37d2 has the headers
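
For readers unfamiliar with this class of bug: a stale "Log in" link right after logging in usually means a cached anonymous page was served to a request that already carried a login cookie. The thread does not say what the actual cause or fix was here, so the following is only a sketch of one common guard. The cookie name `session` and the function `should_bypass_cache` are hypothetical and are not taken from AMO or ZXTM.

```python
def should_bypass_cache(cookie_header, session_cookie_names=("session",)):
    """Return True when the request carries a cookie that marks a logged-in user.

    session_cookie_names is an assumption; the real cookie name is not given in this bug.
    """
    names = set()
    for part in cookie_header.split(";"):
        name, _, _value = part.strip().partition("=")
        if name:
            names.add(name)
    return any(name in names for name in session_cookie_names)


# Anonymous request: only analytics and app-name cookies, so a cached page is fine.
print(should_bypass_cache("AMOappName=firefox; __utmc=164683759"))
# Logged-in request (hypothetical cookie): the cache should be skipped.
print(should_bypass_cache("AMOappName=firefox; session=abc123"))
```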
Comment 19 Stephen Donner [:stephend] 2008-12-04 16:00:50 PST
OK, yet-another-update:

At 2:26pm today, Jeremy did something that appears to have fixed issue # 4.

Since then, I've tested:

* Account creation
* Password reset
* Different application views

Additionally, I've thrown my two Selenium testcases--search and the more general one--at it, with no errors.

Next up, the Dev CP...
Comment 20 Wil Clouser [:clouserw] 2008-12-04 16:14:43 PST
What kind of performance improvement does the zeus box have over the netscalers?
Comment 21 matthew zeier [:mrz] 2008-12-04 16:19:37 PST
(In reply to comment #20)
> What kind of performance improvement does the zeus box have over the
> netscalers?

Hard to quantify, but it's a more scalable solution than Netscaler is, which is just two boxes.  Zeus ZXTM runs on commodity hardware (on top of RHEL5) and I can keep adding ZXTM nodes to grow it.

The only thing Netscaler does in specialized hardware is SSL offload.  The Netscaler 12k on paper can do 28,000 SSL tps but we see closer to 8,000 before it fails.  Zeus claims a dual quad-core L5450 Xeon server can do about 12,000 SSL tps.  At 1/20th the cost of Netscaler, I can get a couple of boxes to handle 28k and can exceed that a lot more easily.

(For production, I'm looking at dual E5460 Xeons)
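
The sizing arithmetic behind those figures can be written out as a quick sketch. It takes the vendor's claimed 12,000 SSL tps per server at face value and ignores headroom and redundancy, both of which are simplifying assumptions; `nodes_needed` is just an illustrative helper.

```python
import math


def nodes_needed(target_tps, per_node_tps):
    """Smallest number of nodes whose combined claimed capacity covers target_tps."""
    return math.ceil(target_tps / per_node_tps)


# Netscaler 12k paper figure vs. the claimed per-server Zeus figure.
print(nodes_needed(28000, 12000))  # 3
# The load actually observed before the Netscaler fails.
print(nodes_needed(8000, 12000))   # 1
```

Taken strictly, the paper figure of 28k needs three such servers rather than two, while the observed 8,000 tps fits on one.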
Comment 22 Wil Clouser [:clouserw] 2008-12-04 16:26:14 PST
Sweet, that sounds great.
Comment 23 Stephen Donner [:stephend] 2008-12-06 21:23:31 PST
I've tested uploading an add-on, as well as changing its description and screenshots, now, with no issues.

I'd feel more comfortable if a few webdev folks could also poke at AMO, too :-P
Comment 24 Wil Clouser [:clouserw] 2008-12-06 21:56:51 PST
I clicked around and it worked fine for me.
Comment 25 matthew zeier [:mrz] 2008-12-07 08:50:01 PST
That's great!

I'd like to do a performance test run and shift AMO over to ZXTM Tuesday night and into at least Wednesday morning/afternoon to pick up peak traffic. Maybe even Tuesday - Thursday.

This will affect log processing for the duration.

Trying to gauge how well ZXTM does under AMO load, how Gomez views external performance and how large the eventual ZXTM cluster might need to be to handle it (right now it's 80% idle with vamo & fxfeeds).

Any show stoppers to that?
Comment 26 Wil Clouser [:clouserw] 2008-12-07 10:35:25 PST
(In reply to comment #25)
> This will affect log processing for the duration.

What's that mean?  I think anything that affects our statistics is a show stopper.
Comment 27 matthew zeier [:mrz] 2008-12-07 10:52:51 PST
Means deinspanjer will need to process the ZXTM log dir for AMO instead of (or in addition to) the normal AMO log dir.

Anything else look at those logs?
Comment 28 Reed Loden [:reed] (use needinfo?) 2008-12-07 11:07:43 PST
(In reply to comment #27)
> Anything else look at those logs?

Yes, AMO has its own log processing, which went critical yesterday due to other log problems.
Comment 29 Reed Loden [:reed] (use needinfo?) 2008-12-07 11:20:07 PST
AMO's log parse scripts will need to be updated for the VAMO and AMO log changes, as logs from both sites are processed. Considering VAMO already moved to ZXTM earlier, the VAMO part of the addons processing scripts has been broken since then.
Comment 30 Justin Scott [:fligtar] 2008-12-07 14:27:30 PST
I don't understand how this keeps happening.

The information on reconfiguring the log dirs and re-parsing the old logs is here: https://wiki.mozilla.org/Update:Developers/Statistics
Comment 31 matthew zeier [:mrz] 2008-12-07 14:29:53 PST
(In reply to comment #30)
> I don't understand how this keeps happening.

Probably because there's no monitor to remind anyone about it when it stops grabbing current data.  Is that something easy to do?
Comment 32 Reed Loden [:reed] (use needinfo?) 2008-12-07 14:42:22 PST
(In reply to comment #31)
> (In reply to comment #30)
> > I don't understand how this keeps happening.
> 
> Probably because there's no monitor to remind anyone about it when it stops
> grabbing current data.  Is that something easy to do?

I said in comment #28 that the current AMO log processing monitor I wrote for this purpose went critical yesterday, but it looks like it was ignored (at least from my point of view).

I have to agree with fligtar here, though. This continues to happen every time something happens to the logs (I can personally count at least 3-4+ times). What needs to be done to make sure that everybody is pulled in every time something happens or needs to happen to any AMO and/or VAMO logs? deinspanjer was notified in bug 467412 about this change, but it doesn't seem like anybody told fligtar.
Comment 33 Justin Scott [:fligtar] 2008-12-07 14:51:25 PST
It's not even necessary that I be told - it was my understanding that we handed the log configuration aspect off to IT when that documentation was written and we had trouble last time and made the integrity check script and nagios monitor.
Comment 34 Jeremy Orem [:oremj] 2008-12-07 16:52:44 PST
(In reply to comment #30)
> I don't understand how this keeps happening.
> 
> The information on reconfiguring the log dirs and re-parsing the old logs is
> here: https://wiki.mozilla.org/Update:Developers/Statistics

Sorry, we switched over to Zeus in an emergency situation, so stats were secondary to keeping the service up.  I did get a page yesterday(?):

14:15 <@nagios> [79] dm-stats01:addons stats is CRITICAL: [FAIL][Adblock Plus] [Update Pings] Count from Wednesday changed by 85% from 2008-11-26 count of 5202612 [FAIL][Adblock Plus] [Update Pings] Count from Wednesday changed by 82% from 2008-11-19 count of 4348990 [FAIL][NoScript] [Update Pings] Count from Wednesday changed by 89% from 2008-11-26 count of 1818583 [FAIL][NoScript] [U

I think I've come up with a solution that will fix the logs without changing the configs.

I'll start running the script in Attachment 35183 [details] [diff] which will consolidate all the zeus logs to one place.
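
For context, the kind of week-over-week check that produces a page like the one quoted above can be sketched as follows. This is not the actual integrity check script: the 50% threshold, the function names, and the current count of 780000 are all assumptions chosen so that the example lands near the 85% in the alert. Only the previous count of 5202612 comes from the page.

```python
def percent_change(previous, current):
    """Absolute change from previous to current, as a percentage of previous."""
    if previous == 0:
        raise ValueError("previous count must be non-zero")
    return abs(current - previous) / previous * 100.0


def check_count(name, previous, current, threshold=50.0):
    """Format an OK/FAIL line; threshold is a hypothetical alerting limit."""
    change = percent_change(previous, current)
    status = "FAIL" if change > threshold else "OK"
    return f"[{status}][{name}] count changed by {change:.0f}% from count of {previous}"


# previous is from the nagios page; current is an invented value for illustration.
print(check_count("Adblock Plus", 5202612, 780000))
```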
Comment 35 Jeremy Orem [:oremj] 2008-12-07 17:06:42 PST
Oops, meant Attachment 351833 [details].
Comment 36 Justin Scott [:fligtar] 2008-12-08 18:10:58 PST
Thanks Jeremy - can you update the update ping counter config to parse that directory and re-run the update ping script for last Wednesday (2008-12-03)?
Comment 37 Wil Clouser [:clouserw] 2008-12-09 18:16:35 PST
It's probably stating the obvious but when the AMO test happens at 9pm tonight the log parsing scripts need to continue to work with the Zeus logs.
Comment 38 Jeremy Orem [:oremj] 2008-12-09 21:36:45 PST
(In reply to comment #37)
> It's probably stating the obvious but when the AMO test happens at 9pm tonight
> the log parsing scripts need to continue to work with the Zeus logs.
I'm hardlinking the logs in to the directory the stats scripts are already looking at, so hopefully everything will just work.
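
A minimal sketch of that hardlinking approach, assuming both directories live on the same filesystem (hard links cannot cross filesystems). The function name and any paths passed to it are hypothetical; the real script is the attachment mentioned in comment 35.

```python
import os


def hardlink_logs(source_dir, target_dir):
    """Hard-link each regular file in source_dir into target_dir.

    Names that already exist in target_dir are skipped, so the function can be
    re-run safely. Returns the list of names linked on this run.
    """
    linked = []
    os.makedirs(target_dir, exist_ok=True)
    for name in sorted(os.listdir(source_dir)):
        src = os.path.join(source_dir, name)
        dst = os.path.join(target_dir, name)
        if not os.path.isfile(src) or os.path.exists(dst):
            continue
        os.link(src, dst)  # same inode, second directory entry
        linked.append(name)
    return linked
```

Because a hard link is a second name for the same file, the stats scripts see the Zeus logs in the directory they already read, without copying data or changing their configs.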
Comment 39 Jeremy Orem [:oremj] 2008-12-09 21:37:28 PST
Site has been live behind the ZXTM boxes since ~9:10pm.
Comment 40 Stephen Donner [:stephend] 2008-12-09 21:47:09 PST
Still looking good so far.
Comment 41 matthew zeier [:mrz] 2008-12-10 01:11:10 PST
From the vserver | Content Compression:

The following table holds a list of the MIME types for the content that will be compressed.

MIME Type
 text/css
 text/plain
 text/html


Default Mime-types don't include javascript.  There was some bug some time ago to include javascript.  The matching Netscaler rule is:

add policy expression moz_javascript "RES.HTTP.HEADER Content-Type CONTAINS javascript"                                 

Pretty sure I did a CONTAINS because I didn't want to have to match on these:

text/javascript
application/x-javascript

Added both of those to the addons-ssl vserver (sure wish there was a global setting).
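
The difference between the two matching styles can be illustrated with a small sketch. It only models the decision logic and is not ZXTM or Netscaler configuration; the function names are made up, and lowercasing the header before comparing is an assumption about how the match should behave.

```python
# Default ZXTM types from the table above, plus the two added in this comment.
COMPRESSIBLE_TYPES = {
    "text/css",
    "text/plain",
    "text/html",
    "text/javascript",
    "application/x-javascript",
}


def media_type(content_type):
    """Strip parameters such as '; charset=UTF-8' and normalise case."""
    return content_type.split(";", 1)[0].strip().lower()


def matches_exact_list(content_type):
    """ZXTM-style: the media type must be one of the listed entries."""
    return media_type(content_type) in COMPRESSIBLE_TYPES


def matches_contains_javascript(content_type):
    """Netscaler-style rule: Content-Type CONTAINS javascript."""
    return "javascript" in content_type.lower()


for header in ("text/javascript; charset=UTF-8", "application/javascript"):
    print(header, matches_exact_list(header), matches_contains_javascript(header))
```

An exact list misses any JavaScript type that was not added explicitly, such as application/javascript, which is the trade-off against the broader CONTAINS rule.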
Comment 42 Jeremy Orem [:oremj] 2008-12-10 15:03:09 PST
Test complete.
Comment 43 Stephen Donner [:stephend] 2008-12-10 15:22:10 PST
Verified -- for my part, anyway, which was what this bug was about :-)
Comment 44 Justin Scott [:fligtar] 2008-12-11 19:47:31 PST
Jeremy - can you take a look at comment #39? Has that script been run?
Comment 45 Wil Clouser [:clouserw] 2008-12-12 10:55:55 PST
(In reply to comment #44)
> Jeremy - can you take a look at comment #39? Has that script been run?

You sure that's the right comment?

Are we caught up on stats crunching now?  I'm still hearing reports about broken stats.
Comment 46 Justin Scott [:fligtar] 2008-12-12 11:23:19 PST
I meant comment #36, but I'm filing another bug now because we now have 2 weeks of broken stats.
Comment 47 Daniel Veditz [:dveditz] 2008-12-12 12:28:50 PST
Is this fixed now? The stats are still way down. Even if we're not retroactively parsing the missed stats (yet?), the current stats don't look right either. E.g. https://addons.mozilla.org/en-US/firefox/statistics/addon/1865 fell off a cliff after Nov 25, and continues to drop despite a huge spike in downloads (due to a new release).
Comment 48 Wil Clouser [:clouserw] 2008-12-12 13:03:41 PST
The new bug for the broken stats is bug 469376
