Closed Bug 1075542 Opened 10 years ago Closed 9 years ago

flip the switch to move release users over to balrog

Categories

(Release Engineering Graveyard :: Applications: Balrog (backend), defect)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Unassigned)

References

Details

Attachments

(2 files)

Nick, Catlee, and I talked about this a bit yesterday. We're shooting for the week of October 20th, which is one week after Firefox 33 ships, and should give us time to fix the remaining blocking issues.

We talked a bit about mechanics and decided we'd like to switch it in Zeus, like we did for Beta. We'll start by switching 1% of the traffic and gradually move up from there, so we can keep an eye on load on the web backends and database server. We also want to prevent any requests with an application version less than 4.0 from switching, because Balrog can't support the types of requests that they send. I need to speak to IT to make sure this is all still doable.
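
For illustration only, here's a minimal sketch of the gating logic described above, assuming the /update/<n>/<product>/<version>/... URL format. The real rule would live in the Zeus load balancer, not in application code, and the helper name and percentage value here are hypothetical:

import random

BALROG_PERCENTAGE = 1  # start at 1% of release traffic, ramp up gradually

def route_update_request(path):
    # Expected shape: /update/3/Firefox/33.0/<buildid>/.../update.xml
    parts = path.strip("/").split("/")
    try:
        major_version = int(parts[3].split(".")[0])
    except (IndexError, ValueError):
        return "aus3"  # anything we can't parse stays on the old system
    if major_version < 4:
        # Balrog can't support the types of requests these old clients send.
        return "aus3"
    if random.uniform(0, 100) < BALROG_PERCENTAGE:
        return "balrog"
    return "aus3"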

I also want to have a chat with someone from the db team, to talk about scaling the load to that database.

One thing that came up is that because we're tying the test channel name changes to the switch to Balrog, we must be 100% switched over before we can ship another release. This has two implications:
1) We have about 4 weeks between switching the initial 1% over and building 34.0. We need to be 100% switched over at that point for the test channels to work.
2) If we chemspill between 33.0 and moving 100% of users to Balrog, we need to revert fully back to aus3 until the chemspill is shipped. (Again, so the test channels will work.)
Blocks: 1082601
With cyliang's help, we did some load testing by switching over a portion of the release channel traffic to Balrog. We started at 1% at 10:16am pacific, then 4% at 10:21am pacific, and finally 10% at 10:26am pacific.

Here are some graphs from one of the web heads: http://people.mozilla.org/~bhearsum/sattap/1de26986.png

You can see that CPU and network load grow linearly - both most likely reflect the increase in the number of JSON blobs we're retrieving and parsing from the database. Yellow is rx on the network graph, so as expected we're receiving much more data than we're transmitting (responses to requests are typically < 10kb). The load average spikes very differently, and I'm not sure how to interpret that. About 5 minutes after increasing traffic to 10%, we started hitting max clients on the web heads, as reported by Nagios:
[11-12-2014 10:31:17] SERVICE ALERT: aus4.webapp.phx1.mozilla.com;httpd max clients;WARNING;SOFT;1;Using 256 out of 256 Clients
[11-12-2014 10:30:57] SERVICE ALERT: aus1.webapp.phx1.mozilla.com;httpd max clients;WARNING;SOFT;1;Using 256 out of 256 Clients
[11-12-2014 10:30:57] SERVICE ALERT: aus2.webapp.phx1.mozilla.com;httpd max clients;WARNING;SOFT;1;Using 256 out of 256 Clients

Around the same time I started getting "Service Unavailable" for some manual requests I was making. It's possible that increasing the max clients would've let us handle more load, but given that the CPU usage was very close to 100%, it may have just caused the machines to fall over.

The database server held up much better: http://people.mozilla.org/~bhearsum/sattap/4e34278c.png

CPU usage increased slowly, but never went higher than ~20%. Network tx increased, matching the rx we saw on the web heads. I'm not sure how much the link between those can handle, but we topped out at around 600Mb/sec, which is pretty high - extrapolating linearly, full release channel traffic would put that link at roughly 6Gb/sec.

--

If we were truly close to saturating the web heads or the network link between them and the db server with only 10% of release channel traffic being sent to them, it's clear that we need some application-level improvements to handle the full load. I'd like to talk to someone from webops and maybe dbops first, though, to make sure I'm reading all of this data correctly.

If network load is going to be a bottleneck, we probably need some sort of application caching of queries to the database. If CPU load on the web heads is going to be a bottleneck, we probably need to cache parsed JSON blobs (which will spike memory - so we'd need to watch out for that). If we do either one of these, it may be trivial to just do both, since they're different parts of the same operation.
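
For illustration, a minimal sketch of what that caching could look like, assuming a simple in-process TTL cache keyed by blob name. The function and parameter names here are hypothetical, not Balrog's actual API:

import json
import time

CACHE_TTL = 60  # seconds; trades a little freshness for a lot less load
_blob_cache = {}  # blob_name -> (expires_at, parsed_blob)

def get_release_blob(blob_name, fetch_raw_json):
    # Return a parsed blob, hitting the database only on a cache miss or expiry.
    # fetch_raw_json(blob_name) is assumed to return the blob's JSON string
    # from the database.
    now = time.time()
    cached = _blob_cache.get(blob_name)
    if cached and cached[0] > now:
        return cached[1]
    raw = fetch_raw_json(blob_name)   # skipping this saves network to the db
    parsed = json.loads(raw)          # skipping this saves CPU on the web head
    _blob_cache[blob_name] = (now + CACHE_TTL, parsed)
    return parsed
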
cyliang put together a nice graphite dashboard, too, if anyone wants to poke at the data: https://graphite-phx1.mozilla.org/dashboard/#AUS
I went over the data with cturra today and he agrees that we're going to need some app caching to handle the full release channel load. Bug 671488 was filed ages ago to add some sort of caching to Balrog, so we can take care of it there.
Blocks: diesnippets
Depends on: 1119844
Blocks: 1120420
Attached file latest comparison urls
This is 10.0 through 34.0.5, Firefox and Thunderbird, release channel, all locales. A comparison between aus3 and aus4 is a full pass, with two caveats:
* sha512 vs SHA512 (the client doesn't care)
* Thunderbird detailsURLs are ignored. (Balrog uses the newer, more correct URLs.)

This needs to be run again after 35.0 ships, because things currently point at 34.0.5.
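
For reference, a rough sketch of how the two caveats above can be normalized away before diffing the aus3 and aus4 responses (hypothetical helpers, not the actual comparison script):

import re

def normalize_update_xml(xml, product):
    # The client treats the hash function name case-insensitively.
    xml = xml.replace('hashFunction="sha512"', 'hashFunction="SHA512"')
    if product == "Thunderbird":
        # Balrog serves newer, more correct detailsURLs; ignore them in the diff.
        xml = re.sub(r'detailsURL="[^"]*"', 'detailsURL="IGNORED"', xml)
    return xml

def responses_match(aus3_xml, aus4_xml, product):
    return normalize_update_xml(aus3_xml, product) == normalize_update_xml(aus4_xml, product)
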
(In reply to Ben Hearsum [:bhearsum] from comment #4)
> Created attachment 8547642 [details]
> latest comparison urls
> 
> This is 10.0 through 34.0.5, Firefox and Thunderbird, release channel, all
> locales. A comparison between aus3 and aus4 is a full pass, with two caveats:
> * sha512 vs SHA512 (the client doesn't care)
> * Thunderbird detailsURLs are ignored. (Balrog uses the newer, more correct
> URLs.)
> 
> This needs to be run again after 35.0 ships, because things currently point
> at 34.0.5.

I ran them again, another full pass! Nothing can stop us now!!
How about adding some partner urls, and possibly getting a log from aus3 ?
(In reply to Nick Thomas [:nthomas] from comment #6)
> How about adding some partner urls, and possibly getting a log from aus3 ?

Good idea about the partner urls. What do you mean by getting a log from aus3 though? To look for other edge case updates?
I did a bunch of testing of partner urls. This list contains a subset of our partner channels (I stripped out a bunch of yahoo ones) and every single locale that we ship ANY partner build for (so we're testing things like zh-TW yandex updates, which don't exist, but it was easier to generate this list than to maintain partner-specific locale lists).

All of these passed:
PASS: /update/3/Firefox/10.0/20120129021758/WINNT_x86-msvc/tr/release-cck-yandexua/Windows_NT%205.2/default/default/update.xml?force=1
PASS: /update/3/Firefox/10.0/20120129021758/WINNT_x86-msvc/uk/release-cck-yandexua/Windows_NT%205.2/default/default/update.xml?force=1
PASS: /update/3/Firefox/10.0/20120129021758/WINNT_x86-msvc/vi/release-cck-yandexua/Windows_NT%205.2/default/default/update.xml?force=1
PASS: /update/3/Firefox/10.0/20120129021758/WINNT_x86-msvc/zh-TW/release-cck-yandexua/Windows_NT%205.2/default/default/update.xml?force=1
Tested 216000 paths.
Pass count: 215980
Fail count: 0
Error count: 0
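
For reference, a rough sketch of how a channel x locale matrix of test paths like the above can be generated. The lists and template here are illustrative, not the actual tooling:

from itertools import product

PARTNER_CHANNELS = ["release-cck-yandexua"]    # subset of partner channels, for illustration
PARTNER_LOCALES = ["tr", "uk", "vi", "zh-TW"]  # every locale with ANY partner build
PATH_TEMPLATE = ("/update/3/Firefox/10.0/20120129021758/WINNT_x86-msvc/"
                 "{locale}/{channel}/Windows_NT%205.2/default/default/update.xml?force=1")

def generate_test_paths():
    # Cross every channel with every locale, including combinations that don't
    # ship; those just need aus3 and Balrog to return the same thing.
    for channel, locale in product(PARTNER_CHANNELS, PARTNER_LOCALES):
        yield PATH_TEMPLATE.format(locale=locale, channel=channel)

for path in generate_test_paths():
    print(path)
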
This is done \o/. Load looks good on the web heads, and has gone down to almost nothing on the aus3 web heads.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Release Engineering → Release Engineering Graveyard
