1075192 - figure out zlb traffic script for shipping balrog to release users

Reporter

Description

•

11 years ago

We decided that we need at least the following:

* Don't redirect for application versions less than 4.0
* Want to start at 1%, and move the knob up slowly.

Filing here for now, will move to IT after discussing with cturra.

bhearsum@mozilla.com (:bhearsum)

Reporter

Comment 1

•

11 years ago

I talked with Chris about this today. We decided that the best way to implement the application version restrictions was to do clever string matching, because Zeus doesn't support regexes. So, something like:

    if url.contains("Firefox/3.") or url.contains("Firefox/2.") or url.contains("Thunderbird/3.") ....:
        # don't redirect

I'll provide the full list of things we shouldn't redirect on before we make the switch.

We also talked about redirecting a certain percentage of users. Chris, I believe you said that you weren't sure how to do this off the top of your head, but that you were pretty sure it was possible, and you'd look into it.

We think that the best way to do this when we're ready is to move the knob up slowly until we get to a point where we can see what the load curve looks like. At that point, we will probably have a good idea how to deal with it. E.g., if we have linear growth and don't exhaust our resources before 50%, doubling the nodes will probably suffice. Or if we have exponential growth and exhaust resources at 10%, we probably need to look at some app level caching.
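The substring idea can be modelled outside Zeus. The following is only a rough Python sketch of the matching logic, not TrafficScript; the marker tuple is a partial, illustrative list and the function name and sample URLs are made up.

```python
# Illustrative, incomplete list of "too old" markers (assumption, not the real list).
OLD_VERSION_MARKERS = ("Firefox/2.", "Firefox/3.", "Thunderbird/3.")


def should_redirect(url: str) -> bool:
    """Return False when the URL names one of the old application versions."""
    lowered = url.lower()
    return not any(marker.lower() in lowered for marker in OLD_VERSION_MARKERS)


# The trailing dot matters: "Firefox/3." matches 3.6 but not 33.0.
print(should_redirect("/update/3/Firefox/3.6.28/update.xml"))
print(should_redirect("/update/3/Firefox/33.0/update.xml"))
```

Keeping the dot at the end of each marker is what lets plain substring matching stand in for a regex here.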

Assignee: nobody → server-ops-webops

Component: Balrog: Backend → WebOps: Product Delivery

Product: Release Engineering → Infrastructure & Operations

QA Contact: bhearsum → nmaul

Version: unspecified → other

:kanban

Updated

•

11 years ago

Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/1401]

Chris Turra [:cturra]

Assignee

Updated

•

11 years ago

Assignee: server-ops-webops → cturra

Chris Turra [:cturra]

Assignee

Comment 2

•

11 years ago

i have good news. i have successfully written a traffic script that will help us throttle the percentage of traffic we want to direct to aus4 while we slowly roll this out. the basic concept can be found below. i want to do a little more testing to see how well the random number generator will work for us as we increase the numbers.

* in the example below, you will see the traffic script is looking for the 'cturra' header. this is in place so we can test without impacting any other nightly/aurora/beta traffic.

---
$host = http.getHostHeader();
$cturra = http.getHeader("cturra");
$trottle = 1;

if ( string.containsI($host, "aus3.mozilla.org") && string.containsI($cturra, "true") ){
   if ( math.random(100) < $trottle ){
      http.redirect("http://www.cturra.com");
   }
}
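To get a feel for how evenly a random draw splits traffic, the throttle check can be simulated. This is a Python sketch only, assuming math.random(100) yields an integer from 0 to 99; the function names, the seed, and the "aus3"/"aus4" labels are illustrative.

```python
import random


def pick_backend(throttle_percent: int, rng: random.Random) -> str:
    """Mimic `math.random(100) < $trottle`: draw 0..99 and compare to the knob."""
    if rng.randrange(100) < throttle_percent:
        return "aus4"
    return "aus3"


def observed_share(throttle_percent: int, requests: int, seed: int = 0) -> float:
    """Fraction of simulated requests that ended up on aus4."""
    rng = random.Random(seed)
    hits = sum(pick_backend(throttle_percent, rng) == "aus4" for _ in range(requests))
    return hits / requests


for knob in (1, 4, 10):
    print(knob, observed_share(knob, 100_000))
```

With a strict `<` comparison, a knob of 0 sends nothing and a knob of 100 sends everything, so the value reads directly as a percentage.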

bhearsum@mozilla.com (:bhearsum)

Reporter

Comment 3

•

11 years ago

Attached file: list of things not to redirect for

Chris, here's the list of things that we _shouldn't_ redirect for. It's 216 items long :(. If that's too much to handle, let me know and I'll try to find a way to shrink it into more clever substrings.
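One way the list might be shrunk, as a rough Python sketch. The "Product/version" entry format and the sample entries are assumptions made for illustration; the real attachment is not reproduced here, and both function names are hypothetical.

```python
def collapse_to_prefixes(excluded: list[str]) -> list[str]:
    """Reduce 'Product/major.minor...' entries to unique 'Product/major.' prefixes."""
    prefixes = set()
    for entry in excluded:
        product, _, version = entry.partition("/")
        major = version.split(".", 1)[0]
        prefixes.add(f"{product}/{major}.")
    return sorted(prefixes)


def wrongly_matched(prefixes: list[str], must_redirect: list[str]) -> list[str]:
    """Entries that should be redirected but contain one of the prefixes."""
    return [v for v in must_redirect if any(p in v for p in prefixes)]


# Made-up sample data, not taken from the attachment.
sample = ["Firefox/3.0.1", "Firefox/3.6.28", "Firefox/2.0.0.20", "Thunderbird/3.1.20"]
prefixes = collapse_to_prefixes(sample)
print(prefixes)
print(wrongly_matched(prefixes, ["Firefox/4.0", "Firefox/33.0", "Thunderbird/31.2.0"]))
```

The second function is the safety check: any collapsed prefix that also matches a version we do want to redirect would show up there and would need to be kept in its longer form.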

bhearsum@mozilla.com (:bhearsum)

Reporter

Comment 4

•

11 years ago

Can you also encase this in if foo.contains("cturra") like we did for https://bugzilla.mozilla.org/show_bug.cgi?id=1041745#c3 ? I'm hoping to have QE test there at the end of the week.

Flags: needinfo?(cturra)

Chris Turra [:cturra]

Assignee

Comment 5

•

11 years ago

i just wrote, but have not applied, the following traffic script. it first checks for cturra/cturra-cdntest in the path, then confirms an older version is not being requested. finally, it should (aus3->aus4) pool select about 1% of traffic.

----
$host = http.getHostHeader();
$path = http.getPath();
$trottle = 1;

if (string.containsI($host, "aus3.mozilla.org")){
   if ( string.containsI($path, "/cturra/") ||
        string.containsI($path, "/cturra-cdntest/") ){
      # none of the old versions may match, so the negated checks are joined with &&
      if ( !string.containsI($path, "Firefox/1.5") &&
           !string.containsI($path, "Firefox/2.") &&
           !string.containsI($path, "Firefox/3.") &&
           !string.containsI($path, "Thunderbird/1.5") &&
           !string.containsI($path, "Thunderbird/2.") &&
           !string.containsI($path, "Thunderbird/3.") ){
         if ( math.random(100) < $trottle ){
            pool.select("aus4-prod-https");
         } # end throttle check
      } # end old version check
   } # end cturra/test check
} # end aus3 check
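The intended decision order (host, test channel, old version, throttle) can be checked with a small Python model. This is a sketch of the logic only, not TrafficScript; DEFAULT_POOL is a hypothetical label for "no pool selected" and the sample paths are made up.

```python
import random

OLD_VERSION_MARKERS = (
    "firefox/1.5", "firefox/2.", "firefox/3.",
    "thunderbird/1.5", "thunderbird/2.", "thunderbird/3.",
)
TEST_CHANNELS = ("/cturra/", "/cturra-cdntest/")
DEFAULT_POOL = "aus3-default"  # hypothetical name, stands for "rule did nothing"


def select_pool(host: str, path: str, throttle_percent: int, rng: random.Random) -> str:
    host, path = host.lower(), path.lower()  # containsI is case-insensitive
    if "aus3.mozilla.org" not in host:
        return DEFAULT_POOL
    if not any(channel in path for channel in TEST_CHANNELS):
        return DEFAULT_POOL
    # A request passes the old-version check only if *none* of the markers match:
    # (not a) and (not b) and ... Joining the negations with "or" would let
    # practically every path through, since no real path contains all markers.
    if any(marker in path for marker in OLD_VERSION_MARKERS):
        return DEFAULT_POOL
    if rng.randrange(100) < throttle_percent:
        return "aus4-prod-https"
    return DEFAULT_POOL


rng = random.Random(0)
print(select_pool("aus3.mozilla.org", "/update/3/Firefox/33.0/cturra/update.xml", 100, rng))
print(select_pool("aus3.mozilla.org", "/update/3/Firefox/3.6.28/cturra/update.xml", 100, rng))
```

Running the model with the throttle at 100 makes the version and channel checks deterministic, which is handy for verifying them before the knob is turned back down.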

Flags: needinfo?(cturra)

bhearsum@mozilla.com (:bhearsum)

Reporter

Comment 6

•

11 years ago

Comment #5 looks sensible to me, but I'm having trouble verifying it with a 1% throttle - it never seems to return anything except aus3 backends - I think the 30s zeus cache is playing a factor here too, though. Can we change it to 90 or something else really high for testing purposes?

Chris Turra [:cturra]

Assignee

Comment 7

•

11 years ago

just a follow up on our irc conversation. this rule hadn't been applied, which is why you weren't seeing this rule "working" yet. for testing, we increased the throttle to 100% and have applied to the aus3 pool in zeus. let me know how the testing goes and if you need this rule updated at any point.

Nick Thomas [:nthomas] (UTC+12)

Comment 8

•

11 years ago

(In reply to Chris Turra [:cturra] from comment #5)
> if ( math.random(100) < $trottle ){
>    pool.select("aus4-prod-https");
> } # end throttle check

Potential typo there with '$trottle'.

Chris Turra [:cturra]

Assignee

Comment 9

•

11 years ago

there was a typo! tho, b/c the parameter was misspelled during declaration, it works as expected. i have updated the spelling.

:kanban-engops

Updated

•

11 years ago

Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/1401] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2138] [kanban:https://kanbanize.com/ctrl_board/4/1401]

:kanban-engops

Updated

•

11 years ago

Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2138] [kanban:https://kanbanize.com/ctrl_board/4/1401] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2147] [kanban:https://kanbanize.com/ctrl_board/4/1401]

Chris Turra [:cturra]

Assignee

Updated

•

11 years ago

Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2147] [kanban:https://kanbanize.com/ctrl_board/4/1401] → [kanban:https://kanbanize.com/ctrl_board/4/1401]

C. Liang [:cyliang]

Comment 10

•

11 years ago

In testing with bhearsum, we removed the cturra wrapping and dialed back the throttle. After collecting some data, the aus3-ignore-old-builds was disabled.

bhearsum@mozilla.com (:bhearsum)

Reporter

Comment 11

•

11 years ago

With cyliang's help, we did some load testing by switching over a portion of the release channel traffic to Balrog. We started at 1% at 10:16am pacific, then 4% at 10:21am pacific, and finally 10% at 10:26am pacific. Here's some graphs from one of the web heads: http://people.mozilla.org/~bhearsum/sattap/1de26986.png

You can see the CPU and network load grows linearly - both of which most likely represent an increase in the number of JSON blobs we're retrieving and parsing from the database. Yellow is rx on the network graph, so as expected we're receiving much more data than we're transmitting (responses to requests are < 10kb typically). The load average spikes very differently, and I'm not sure how to interpret that.

About 5 minutes after increasing traffic to 10%, we started hitting max clients on the web heads, as reported by Nagios:

[11-12-2014 10:31:17] SERVICE ALERT: aus4.webapp.phx1.mozilla.com;httpd max clients;WARNING;SOFT;1;Using 256 out of 256 Clients
[11-12-2014 10:30:57] SERVICE ALERT: aus1.webapp.phx1.mozilla.com;httpd max clients;WARNING;SOFT;1;Using 256 out of 256 Clients
[11-12-2014 10:30:57] SERVICE ALERT: aus2.webapp.phx1.mozilla.com;httpd max clients;WARNING;SOFT;1;Using 256 out of 256 Clients

Around the same time I started getting "Service Unavailable" for some manual requests I was making. It's possible that increasing the max clients would've let us handle more load, but given that the CPU usage was very close to 100%, it may have just caused the machines to fall over.

The database server held up much better: http://people.mozilla.org/~bhearsum/sattap/4e34278c.png

CPU usage increased slowly, but never went higher than ~20%. Network tx increased, matching the rx we saw on the web heads. I'm not sure how much the link between those can handle, but we topped around 600Mb/sec, which is pretty high.

--

If we were truly close to saturating the web heads or the network link between them and the db server with only 10% of release channel traffic being sent to them, it's clear that we need some application level improvements to handle the full load. I'd like to talk to someone from webops and maybe dbops to make sure I'm reading all of this data correctly first, though.

If network load is going to be a bottleneck, we probably need some sort of application caching of queries to the database. If CPU load on the web heads is going to be a bottleneck, we probably need to cache parsed JSON blobs (which will spike memory - so we'd need to watch out for that). If we do either one of these it may be trivial to just do both, since they're different parts of the same operation.
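As a rough illustration of the caching idea, a bounded in-process cache could cover both the database read and the JSON parse in one step. This is a Python sketch under stated assumptions, not Balrog's actual code: FAKE_DB, fetch_raw_blob, get_parsed_blob, the blob name, and the (name, data_version) cache key are all hypothetical.

```python
import json
from functools import lru_cache

# Stand-in for the database: (name, data_version) -> raw JSON text.
FAKE_DB = {("Firefox-33.0-build1", 1): '{"schema_version": 3, "platforms": {}}'}
db_reads = 0


def fetch_raw_blob(name: str, data_version: int) -> str:
    """Pretend database query; counts how often it is actually hit."""
    global db_reads
    db_reads += 1
    return FAKE_DB[(name, data_version)]


@lru_cache(maxsize=256)  # bounded, so memory use stays capped
def get_parsed_blob(name: str, data_version: int) -> dict:
    """Fetch and parse once per (name, data_version); later calls reuse the result."""
    return json.loads(fetch_raw_blob(name, data_version))


get_parsed_blob("Firefox-33.0-build1", 1)
get_parsed_blob("Firefox-33.0-build1", 1)
print(db_reads, get_parsed_blob.cache_info())
```

Keying on a version number means an updated blob misses the cache instead of being served stale. The cached dict is shared between callers, so it would have to be treated as read-only.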

bhearsum@mozilla.com (:bhearsum)

Reporter

Comment 12

•

11 years ago

Oops, my comment was actually meant for bug 1075542.

Chris Turra [:cturra]

Assignee

Comment 13

•

11 years ago

we can probably close this bug off now that we've sorted out the traffic script rules around throttling releases/etc \o/

Status: NEW → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

•

9 years ago

Product: Infrastructure & Operations → Infrastructure & Operations Graveyard