figure out zlb traffic script for shipping balrog to release users

Status

RESOLVED FIXED
4 years ago
2 years ago

People

(Reporter: bhearsum, Assigned: cturra)

Details

(Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/1401] )

Attachments

(1 attachment)

We decided that we need at least the following:
* Don't redirect for application versions less than 4.0
* Want to start at 1%, and move the knob up slowly.

Filing here for now, will move to IT after discussing with cturra.
I talked with Chris about this today. We decided that the best way to implement the application version restrictions was to do clever string matching, because Zeus doesn't support regexes. So, something like:
if url.contains("Firefox/3.") or url.contains("Firefox/2.") or url.contains("Thunderbird/3.") ....:
  # don't redirect

I'll provide the full list of things we shouldn't redirect on before we make the switch.

We also talked about redirecting a certain percentage of users. Chris, I believe you said that you weren't sure how to do this off the top of your head, but that you were pretty sure it was possible, and you'd look into it.

We think that the best way to do this when we're ready is to move the knob up slowly until we get to a point where we can see what the load curve looks like. At that point, we will probably have a good idea how to deal with it. E.g., if we have linear growth and don't exhaust our resources before 50%, doubling the nodes will probably suffice. Or if we have exponential growth and exhaust resources at 10%, we probably need to look at some app-level caching.
Assignee: nobody → server-ops-webops
Component: Balrog: Backend → WebOps: Product Delivery
Product: Release Engineering → Infrastructure & Operations
QA Contact: bhearsum → nmaul
Version: unspecified → other

Updated

4 years ago
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/1401]
(Assignee)

Updated

4 years ago
Assignee: server-ops-webops → cturra
(Assignee)

Comment 2

4 years ago
i have good news. i have successfully written a traffic script that will help us throttle the percentage of traffic we want to direct to aus4 while we slowly roll this out. the basic concept can be found below. i want to do a little more testing to see how well the random number generator will work for us as we increase the numbers.

* in the example below, the traffic script looks for the 'cturra' header. this is in place so we can test without impacting any other nightly/aurora/beta traffic.

---
 
$host    = http.getHostHeader();
$cturra  = http.getHeader("cturra");
$trottle = 1;

if ( string.containsI($host, "aus3.mozilla.org") &&
     string.containsI($cturra, "true")
    ){
   
   if ( math.random(100) < $trottle ){
     http.redirect("http://www.cturra.com");
   }
  
}
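
To check offline how closely a random draw tracks the configured percentage, the comparison can be simulated. The sketch below is Python rather than TrafficScript, and it assumes math.random(100) yields a uniform integer from 0 to 99. The function name, seed, and sample size are made up for illustration.

```python
import random


def redirected_fraction(throttle, requests, seed=0):
    """Simulate `math.random(100) < throttle` over many requests.

    Assumes the draw is a uniform integer in 0..99. That is an assumption
    about the load balancer's generator, not something verified here.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    hits = sum(1 for _ in range(requests) if rng.randrange(100) < throttle)
    return hits / requests


if __name__ == "__main__":
    for throttle in (1, 4, 10, 50):
        share = redirected_fraction(throttle, requests=200_000)
        print(f"throttle={throttle:>3}  observed share={share:.4f}")
```

With a throttle of 1, only about one request in a hundred takes the new path, so a handful of manual requests will usually all land on the old backend.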
Created attachment 8505012 [details]
list of things not to redirect for

Chris, here's the list of things that we _shouldn't_ redirect for. It's 216 items long :(. If that's too much to handle, let me know and I'll try to find a way to shrink it into more clever substrings.
Can you also encase this in if foo.contains("cturra") like we did for https://bugzilla.mozilla.org/show_bug.cgi?id=1041745#c3 ? I'm hoping to have QE test there at the end of the week.
Flags: needinfo?(cturra)
(Assignee)

Comment 5

4 years ago
i just wrote, but have not applied, the following traffic script. it first checks for cturra/cturra-cdntest in the path, then confirms an older version is not being requested. finally, it should (aus3->aus4) pool select about 1% of traffic.

----
$host = http.getHostHeader();
$path = http.getPath();   
$trottle = 1;

if (string.containsI($host, "aus3.mozilla.org")){  
   
   if ( string.containsI($path, "/cturra/") ||
        string.containsI($path, "/cturra-cdntest/")
      ){

      # proceed only when none of the old-version markers appear in the path
      if ( !string.containsI($path, "Firefox/1.5")     &&
           !string.containsI($path, "Firefox/2.")      &&
           !string.containsI($path, "Firefox/3.")      &&
           !string.containsI($path, "Thunderbird/1.5") &&
           !string.containsI($path, "Thunderbird/2.")  &&
           !string.containsI($path, "Thunderbird/3.")
         ){
         
           if ( math.random(100) < $trottle ){
              pool.select("aus4-prod-https");
           } # end throttle check
         
      } # end old version check
      
   } # end cturra/test check
   
} # end aus3 check
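
The decision this rule is meant to make can be modelled offline, which makes it easier to test edge cases than a 1% throttle on live traffic. The sketch below is a Python model, not TrafficScript. The marker lists are copied from the rule above, while the function names, the `roll` parameter (standing in for math.random(100)), and the example paths are hypothetical.

```python
OLD_VERSION_MARKERS = (
    "Firefox/1.5", "Firefox/2.", "Firefox/3.",
    "Thunderbird/1.5", "Thunderbird/2.", "Thunderbird/3.",
)
TEST_PATH_MARKERS = ("/cturra/", "/cturra-cdntest/")


def contains_i(haystack, needle):
    """Case-insensitive substring test, modelled on string.containsI."""
    return needle.lower() in haystack.lower()


def is_old_version(path):
    """True when any old-version marker appears in the path."""
    return any(contains_i(path, marker) for marker in OLD_VERSION_MARKERS)


def should_use_aus4(host, path, roll, throttle=1):
    """Return True when the request would be sent to the aus4 pool.

    `roll` stands in for math.random(100), assumed to be an integer 0..99.
    """
    if not contains_i(host, "aus3.mozilla.org"):
        return False
    if not any(contains_i(path, marker) for marker in TEST_PATH_MARKERS):
        return False
    if is_old_version(path):
        return False
    return roll < throttle


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    new_client = "/update/3/Firefox/33.0/build/cturra/update.xml"
    old_client = "/update/3/Firefox/3.6/build/cturra/update.xml"
    print(should_use_aus4("aus3.mozilla.org", new_client, roll=0))
    print(should_use_aus4("aus3.mozilla.org", old_client, roll=0))
```

"None of the markers is present" is an AND of negated tests (equivalently `not any(...)`). An OR of negated tests is true for any realistic path, because no single path contains all six markers, so it would not exclude anything.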
Flags: needinfo?(cturra)
Comment #5 looks sensible to me, but I'm having trouble verifying it with a 1% throttle - it never seems to return anything except aus3 backends - I think the 30s zeus cache is a factor here too, though. Can we change it to 90 or something else really high for testing purposes?
(Assignee)

Comment 7

4 years ago
just a follow up on our irc conversation. this rule hadn't been applied, which is why you weren't seeing this rule "working" yet.

for testing, we increased the throttle to 100% and have applied to the aus3 pool in zeus. let me know how the testing goes and if you need this rule updated at any point.
(In reply to Chris Turra [:cturra] from comment #5)
>            if ( math.random(100) < $trottle ){
>               pool.select("aus4-prod-https");
>            } # end throttle check

Potential typo there with '$trottle'.
(Assignee)

Comment 9

4 years ago
there was a typo! tho, b/c the parameter was misspelled during declaration, it works as expected. i have updated the spelling.

Updated

4 years ago
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/1401] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2138] [kanban:https://kanbanize.com/ctrl_board/4/1401]

Updated

4 years ago
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2138] [kanban:https://kanbanize.com/ctrl_board/4/1401] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2147] [kanban:https://kanbanize.com/ctrl_board/4/1401]
(Assignee)

Updated

4 years ago
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2147] [kanban:https://kanbanize.com/ctrl_board/4/1401] → [kanban:https://kanbanize.com/ctrl_board/4/1401]

Comment 10

4 years ago
In testing with bhearsum, we removed the cturra wrapping and dialed back the throttle.  After collecting some data, the aus3-ignore-old-builds was disabled.
With cyliang's help, we did some load testing by switching over a portion of the release channel traffic to Balrog. We started at 1% at 10:16am pacific, then 4% at 10:21am pacific, and finally 10% at 10:26am pacific.

Here are some graphs from one of the web heads: http://people.mozilla.org/~bhearsum/sattap/1de26986.png

You can see the CPU and network load grow linearly - both of which most likely represent an increase in the number of JSON blobs we're retrieving and parsing from the database. Yellow is rx on the network graph, so as expected we're receiving much more data than we're transmitting (responses to requests are < 10kb typically). The load average spikes very differently, and I'm not sure how to interpret that. About 5 minutes after increasing traffic to 10%, we started hitting max clients on the web heads, as reported by Nagios:
[11-12-2014 10:31:17] SERVICE ALERT: aus4.webapp.phx1.mozilla.com;httpd max clients;WARNING;SOFT;1;Using 256 out of 256 Clients
[11-12-2014 10:30:57] SERVICE ALERT: aus1.webapp.phx1.mozilla.com;httpd max clients;WARNING;SOFT;1;Using 256 out of 256 Clients
[11-12-2014 10:30:57] SERVICE ALERT: aus2.webapp.phx1.mozilla.com;httpd max clients;WARNING;SOFT;1;Using 256 out of 256 Clients

Around the same time I started getting "Service Unavailable" for some manual requests I was making. It's possible that increasing the max clients would've let us handle more load, but given that the CPU usage was very close to 100%, it may have just caused the machines to fall over.

The database server held up much better: http://people.mozilla.org/~bhearsum/sattap/4e34278c.png

CPU usage increased slowly, but never went higher than ~20%. Network tx increased, matching the rx we saw on the web heads. I'm not sure how much the link between those can handle, but we topped out at around 600Mb/sec, which is pretty high.
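
As a rough back-of-the-envelope check, the figures measured at 10% can be scaled to full traffic. The sketch below assumes load keeps growing linearly with traffic share, which the CPU and network graphs suggest but which is not established beyond 10%. The function name is made up for illustration.

```python
def extrapolate_linear(observed, observed_percent, target_percent=100):
    """Scale a measurement taken at observed_percent of traffic.

    Assumes load grows linearly with traffic share. That is an assumption,
    not a measured property of the service.
    """
    if not 0 < observed_percent <= 100:
        raise ValueError("observed_percent must be in (0, 100]")
    return observed * target_percent / observed_percent


if __name__ == "__main__":
    # ~600Mb/sec on the database link at 10% of release traffic (from above).
    full_load_mbps = extrapolate_linear(600, observed_percent=10)
    print(f"estimated db link traffic at 100%: {full_load_mbps:.0f} Mb/sec")
```

Under that assumption, full release traffic would need roughly ten times the database link throughput and ten times the web head capacity seen in this test.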

--

If we were truly close to saturating the web heads or the network link between them and the db server with only 10% of release channel traffic being sent to them, it's clear that we need some application level improvements to handle the full load. I'd like to talk to someone from webops and maybe dbops to make sure I'm reading all of this data correctly first, though.

If network load is going to be a bottleneck, we probably need some sort of application caching of queries to the database. If CPU load on the web heads is going to be a bottleneck, we probably need to cache parsed JSON blobs (which will spike memory - so we'd need to watch out for that). If we do either one of these it may be trivial to just do both, since they're different parts of the same operation.
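
A minimal sketch of what caching parsed blobs could look like, assuming a short time-to-live is acceptable for update data. This is not Balrog's implementation. The class name, the loader callback, and the 60 second TTL are hypothetical.

```python
import json
import time


class BlobCache:
    """Cache parsed JSON blobs by name for a fixed time-to-live.

    `loader` is a hypothetical callback that returns the raw JSON string
    for a blob name (for example, a database query).
    """

    def __init__(self, loader, ttl_seconds=60, clock=time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # name -> (expires_at, parsed blob)

    def get(self, name):
        now = self._clock()
        entry = self._entries.get(name)
        if entry is not None and entry[0] > now:
            return entry[1]  # fresh hit: no database read, no JSON parse
        parsed = json.loads(self._loader(name))
        self._entries[name] = (now + self._ttl, parsed)
        return parsed


if __name__ == "__main__":
    reads = []

    def fake_loader(name):
        reads.append(name)  # count simulated database reads
        return '{"name": "%s"}' % name

    cache = BlobCache(fake_loader)
    cache.get("example-blob")
    cache.get("example-blob")
    print(f"database reads: {len(reads)}")
```

One cache hit saves both the database round trip and the JSON parse, which is why doing both at once is cheap. The cache here is unbounded and returns shared objects, so a real version would need a size limit to contain the memory growth mentioned above, and callers must not mutate the returned blobs.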
Oops, my comment was actually meant for bug 1075542.
(Assignee)

Comment 13

4 years ago
we can probably close this bug off now that we've sorted out the traffic script rules around throttling releases/etc \o/
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard