Closed Bug 1075542 Opened 10 years ago Closed 9 years ago

flip the switch to move release users over to balrog

Categories

(Release Engineering Graveyard :: Applications: Balrog (backend), defect)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Unassigned)

References

Details

Attachments

(2 files)

Nick, Catlee, and I talked about this a bit yesterday. We're shooting for the week of October 20th, which is one week after Firefox 33 ships, and should give us time to fix the remaining blocking issues.

We talked a bit about mechanics and decided we'd like to switch it in Zeus, like we did for Beta. We'll start by switching 1% of the traffic and gradually move up from there, so we can keep an eye on load on the web backends and database server. We also want to prevent any requests with an application version less than 4.0 from switching, because Balrog can't support the types of requests that they send. I need to speak to IT to make sure this is all still doable.
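
For illustration only, here's a minimal sketch of the gating logic described above, assuming the /update/<n>/<product>/<version>/... URL format. The real rule would live in the Zeus load balancer, not in application code, and the helper name and percentage value here are hypothetical:

import random

BALROG_PERCENTAGE = 1  # start at 1% of release traffic, ramp up gradually

def route_update_request(path):
    # Expected shape: /update/3/Firefox/33.0/<buildid>/.../update.xml
    parts = path.strip("/").split("/")
    try:
        major_version = int(parts[3].split(".")[0])
    except (IndexError, ValueError):
        return "aus3"  # anything we can't parse stays on the old system
    if major_version < 4:
        # Balrog can't support the types of requests these old clients send.
        return "aus3"
    if random.uniform(0, 100) < BALROG_PERCENTAGE:
        return "balrog"
    return "aus3"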

I also want to have a chat with someone from the db team, to talk about scaling the load to that database.

One thing that came up is that because we're tying the test channel name changes to the switch to Balrog, we must be 100% switched over before we can ship another release. This has two implications:
1) We have about 4 weeks between switching the initial 1% over and building 34.0. We need to be 100% switched over at that point for the test channels to work.
2) If we chemspill between 33.0 and moving 100% of users to Balrog, we need to revert fully back to aus3 until the chemspill is shipped. (Again, so the test channels will work.)
Blocks: 1082601
With cyliang's help, we did some load testing by switching over a portion of the release channel traffic to Balrog. We started at 1% at 10:16am pacific, then 4% at 10:21am pacific, and finally 10% at 10:26am pacific.

Here are some graphs from one of the web heads: http://people.mozilla.org/~bhearsum/sattap/1de26986.png

You can see that CPU and network load grow linearly - both most likely reflect the increase in the number of JSON blobs we're retrieving and parsing from the database. Yellow is rx on the network graph, so as expected we're receiving much more data than we're transmitting (responses to requests are typically < 10kb). The load average spikes very differently, and I'm not sure how to interpret that. About 5 minutes after increasing traffic to 10%, we started hitting max clients on the web heads, as reported by Nagios:
[11-12-2014 10:31:17] SERVICE ALERT: aus4.webapp.phx1.mozilla.com;httpd max clients;WARNING;SOFT;1;Using 256 out of 256 Clients
[11-12-2014 10:30:57] SERVICE ALERT: aus1.webapp.phx1.mozilla.com;httpd max clients;WARNING;SOFT;1;Using 256 out of 256 Clients
[11-12-2014 10:30:57] SERVICE ALERT: aus2.webapp.phx1.mozilla.com;httpd max clients;WARNING;SOFT;1;Using 256 out of 256 Clients

Around the same time I started getting "Service Unavailable" for some manual requests I was making. It's possible that increasing the max clients would've let us handle more load, but given that the CPU usage was very close to 100%, it may have just caused the machines to fall over.

The database server held up much better: http://people.mozilla.org/~bhearsum/sattap/4e34278c.png

CPU usage increased slowly, but never went higher than ~20%. Network tx increased, matching the rx we saw on the web heads. I'm not sure how much the link between those can handle, but we topped out at around 600Mb/sec, which is pretty high - extrapolating linearly, full release channel traffic would put that link at roughly 6Gb/sec.

--

If we were truly close to saturating the web heads or the network link between them and the db server with only 10% of release channel traffic being sent to them, it's clear that we need some application-level improvements to handle the full load. I'd like to talk to someone from webops and maybe dbops first, though, to make sure I'm reading all of this data correctly.

If network load is going to be a bottleneck, we probably need some sort of application caching of queries to the database. If CPU load on the web heads is going to be a bottleneck, we probably need to cache parsed JSON blobs (which will spike memory - so we'd need to watch out for that). If we do either one of these, it may be trivial to just do both, since they're different parts of the same operation.
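
For illustration, a minimal sketch of what that caching could look like, assuming a simple in-process TTL cache keyed by blob name. The function and parameter names here are hypothetical, not Balrog's actual API:

import json
import time

CACHE_TTL = 60  # seconds; trades a little freshness for a lot less load
_blob_cache = {}  # blob_name -> (expires_at, parsed_blob)

def get_release_blob(blob_name, fetch_raw_json):
    # Return a parsed blob, hitting the database only on a cache miss or expiry.
    # fetch_raw_json(blob_name) is assumed to return the blob's JSON string
    # from the database.
    now = time.time()
    cached = _blob_cache.get(blob_name)
    if cached and cached[0] > now:
        return cached[1]
    raw = fetch_raw_json(blob_name)   # skipping this saves network to the db
    parsed = json.loads(raw)          # skipping this saves CPU on the web head
    _blob_cache[blob_name] = (now + CACHE_TTL, parsed)
    return parsed
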
cyliang put together a nice graphite dashboard, too, if anyone wants to poke at the data: https://graphite-phx1.mozilla.org/dashboard/#AUS
I went over the data with cturra today and he agrees that we're going to need some app caching to handle the full release channel load. Bug 671488 was filed ages ago to add some sort of caching to Balrog, so we can take care of it there.
Blocks: diesnippets
Depends on: 1119844
Blocks: 1120420
Attached file latest comparison urls
This is 10.0 through 34.0.5, Firefox and Thunderbird, release channel, all locales. A comparison between aus3 and aus4 is a full pass, with two caveats:
* sha512 vs SHA512 (the client doesn't care)
* Thunderbird detailsURLs are ignored. (Balrog uses the newer, more correct URLs.)

This needs to be run again after 35.0 ships, because things currently point at 34.0.5.
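
For reference, a rough sketch of how the two caveats above can be normalized away before diffing the aus3 and aus4 responses (hypothetical helpers, not the actual comparison script):

import re

def normalize_update_xml(xml, product):
    # The client treats the hash function name case-insensitively.
    xml = xml.replace('hashFunction="sha512"', 'hashFunction="SHA512"')
    if product == "Thunderbird":
        # Balrog serves newer, more correct detailsURLs; ignore them in the diff.
        xml = re.sub(r'detailsURL="[^"]*"', 'detailsURL="IGNORED"', xml)
    return xml

def responses_match(aus3_xml, aus4_xml, product):
    return normalize_update_xml(aus3_xml, product) == normalize_update_xml(aus4_xml, product)
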
(In reply to Ben Hearsum [:bhearsum] from comment #4)
> Created attachment 8547642 [details]
> latest comparison urls
> 
> This is 10.0 through 34.0.5, Firefox and Thunderbird, release channel, all
> locales. A comparison between aus3 and aus4 is a full pass, with two caveats:
> * sha512 vs SHA512 (the client doesn't care)
> * Thunderbird detailsURLs are ignored. (Balrog uses the newer, more correct
> URLs.)
> 
> This needs to be run again after 35.0 ships, because things currently point
> at 34.0.5.

I ran them again, another full pass! Nothing can stop us now!!
How about adding some partner urls, and possibly getting a log from aus3 ?
(In reply to Nick Thomas [:nthomas] from comment #6)
> How about adding some partner urls, and possibly getting a log from aus3 ?

Good idea about the partner urls. What do you mean by getting a log from aus3 though? To look for other edge case updates?
I did a bunch of testing of partner urls. This list contains a subset of our partner channels (I stripped out a bunch of yahoo ones) and every single locale that we ship ANY partner build for (so we're testing things like zh-TW yandex updates, which don't exist, but it was easier to generate this list than to maintain partner-specific locale lists).

All of these passed:
PASS: /update/3/Firefox/10.0/20120129021758/WINNT_x86-msvc/tr/release-cck-yandexua/Windows_NT%205.2/default/default/update.xml?force=1
PASS: /update/3/Firefox/10.0/20120129021758/WINNT_x86-msvc/uk/release-cck-yandexua/Windows_NT%205.2/default/default/update.xml?force=1
PASS: /update/3/Firefox/10.0/20120129021758/WINNT_x86-msvc/vi/release-cck-yandexua/Windows_NT%205.2/default/default/update.xml?force=1
PASS: /update/3/Firefox/10.0/20120129021758/WINNT_x86-msvc/zh-TW/release-cck-yandexua/Windows_NT%205.2/default/default/update.xml?force=1
Tested 216000 paths.
Pass count: 215980
Fail count: 0
Error count: 0
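
For reference, a rough sketch of how a channel x locale matrix of test paths like the above can be generated. The lists and template here are illustrative, not the actual tooling:

from itertools import product

PARTNER_CHANNELS = ["release-cck-yandexua"]    # subset of partner channels, for illustration
PARTNER_LOCALES = ["tr", "uk", "vi", "zh-TW"]  # every locale with ANY partner build
PATH_TEMPLATE = ("/update/3/Firefox/10.0/20120129021758/WINNT_x86-msvc/"
                 "{locale}/{channel}/Windows_NT%205.2/default/default/update.xml?force=1")

def generate_test_paths():
    # Cross every channel with every locale, including combinations that don't
    # ship; those just need aus3 and Balrog to return the same thing.
    for channel, locale in product(PARTNER_CHANNELS, PARTNER_LOCALES):
        yield PATH_TEMPLATE.format(locale=locale, channel=channel)

for path in generate_test_paths():
    print(path)
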
This is done \o/. Load looks good on the web heads, and has gone down to almost nothing on the aus3 web heads.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Release Engineering → Release Engineering Graveyard
