this bug is to track the 1% throttled release traffic testing on the aus4 cluster. we're testing the new caching layer between the web heads and database servers. :bhearsum and i have planned to do this work tomorrow (12/18) at 11:00 pacific. there is no impact to end users for this change. only 1% of the release traffic will be served from the new (aus4) cluster. if for whatever reason these requests fail, the user will not see any error. the update service will simply try to get an update again later.
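For readers unfamiliar with percentage throttling, here is a minimal sketch of the idea. It is an assumption for illustration only: this bug does not say how the split is actually done (it is most likely configured at the load balancer), and `pick_cluster` is a hypothetical helper, not part of any real tool. The old cluster is called "aus3" here, as it is later in this bug.

```python
import random


def pick_cluster(aus4_percent, rng=random):
    """Return the cluster that should serve one update request.

    Hypothetical helper: sends roughly `aus4_percent` percent of requests
    to the new cluster and the rest to the existing one.
    """
    if not 0 <= aus4_percent <= 100:
        raise ValueError("aus4_percent must be between 0 and 100")
    # rng.random() is in [0, 1), so 0 never selects aus4 and 100 always does.
    return "aus4" if rng.random() * 100 < aus4_percent else "aus3"


# Simulate 100,000 requests at the planned 1% throttle.
rng = random.Random(0)
served_by_aus4 = sum(pick_cluster(1, rng) == "aus4" for _ in range(100_000))
print(f"{served_by_aus4} of 100000 simulated requests went to aus4")
```

Because the choice is made per request, a failed aus4 request affects only that one update check, which matches the note above that the client simply retries later.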
Just to clarify: I'd like to start at 1% and slowly dial up until we either hit 100%, or crumple under load. Either way we'll be rolling back fully to aus3 at the end -- we don't want to flip the switch for good until after the holidays. I've already cleared this with Release Management and other stakeholders.
Created attachment 8538722
balrog release.png

We did this test today and it was awesome. We managed to easily serve 100% of the release channel traffic for more than 30 minutes, so this is a huge success. Here's the record of the percentage of release traffic we sent to aus4 at specific times:

2:05pm - 1%
2:08pm - 3%
2:11pm - 6%
2:14pm - 12%
2:17pm - 20%
2:20pm - 35%
2:25pm - 55%
2:31pm - 75%
2:37pm - 100%
3pm - back at 0%

Our peak CPU usage was just over 40% on aus2-4, and just over 50% on aus1. Attached is a graph that shows the CPU and load average over time.

In the nearly 1 hour we were live we served 1.4 million requests from aus4. In the prior hour we served just under 400,000.

Chris, you mentioned that you still wanted to add 1 or 2 more web heads here, both so we have room to grow and so we're OK if we lose 2 web heads. Was there anything else you wanted to add or change?
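As a quick sanity check on the ramp record above, the following sketch computes how long each throttle level was held. It assumes the listed times are exact and that "3pm" means 3:00pm; the names `RAMP` and `step_durations` are made up for this example.

```python
# (hour, minute) in 24h time, percent of release traffic sent to aus4.
RAMP = [
    ((14, 5), 1),
    ((14, 8), 3),
    ((14, 11), 6),
    ((14, 14), 12),
    ((14, 17), 20),
    ((14, 20), 35),
    ((14, 25), 55),
    ((14, 31), 75),
    ((14, 37), 100),
    ((15, 0), 0),  # rolled back
]


def step_durations(ramp):
    """Return (percent, minutes held) for each step before the rollback."""
    steps = []
    for ((h1, m1), pct), ((h2, m2), _next_pct) in zip(ramp, ramp[1:]):
        steps.append((pct, (h2 * 60 + m2) - (h1 * 60 + m1)))
    return steps


durations = step_durations(RAMP)
total_minutes = sum(minutes for _pct, minutes in durations)
minutes_at_full = sum(minutes for pct, minutes in durations if pct == 100)
# Average share of release traffic on aus4, weighted by time at each level.
time_weighted_pct = sum(pct * m for pct, m in durations) / total_minutes

for pct, minutes in durations:
    print(f"{pct:>3}% for {minutes} min")
print(f"total live: {total_minutes} min, at 100%: {minutes_at_full} min")
print(f"time-weighted share on aus4: {time_weighted_pct:.1f}%")
```

By these timestamps the test was live for 55 minutes, which fits "nearly 1 hour". The 100% step comes out at 23 minutes, less than the "more than 30 minutes" stated above, so the recorded times are probably approximate.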
Anything to add before we close this out?
Great news. I expect that traffic is somewhat lower than on release day. Does this test prove that aus4 can handle release day traffic or do we need follow up testing to prove that?
(In reply to Lawrence Mandel [:lmandel] (use needinfo) from comment #4)
> Great news. I expect that traffic is somewhat lower than on release day.
> Does this test prove that aus4 can handle release day traffic or do we need
> follow up testing to prove that?

Do we actually get a spike on release day? In any case, the existing nodes are only at half capacity, and we're probably adding two more. I think we'd hold up except under incredible load.
(In reply to Ben Hearsum [:bhearsum] from comment #3)
> Anything to add before we close this out?

happy to close this bug off now. you will see that i have opened another with our infra team to get a couple additional aus web heads added to the aus4 cluster.
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
(In reply to Chris Turra [:cturra] from comment #6)
> (In reply to Ben Hearsum [:bhearsum] from comment #3)
> > Anything to add before we close this out?
>
> happy to close this bug off now. you will see that i have opened another
> with our infra team to get a couple additional aus web heads added to the
> aus4 cluster.

Sounds good, thanks Chris!