Bug 1026644 (Closed): Opened 10 years ago, Closed 10 years ago

Deploy BrowserID-Verifier 0.2.2 to Stage

Categories

(Cloud Services :: Operations: Deployment Requests - DEPRECATED, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: jbonacci, Assigned: mostlygeek)

References

Details

(Whiteboard: [qa+])

Whiteboard: [qa+]
Blocks: 1014496
Version 0.2.2 tagged and ready to go; bug title updated accordingly
Summary: Deploy latest BrowserID-Verifier to Stage → Deploy BrowserID-Verifier 0.2.2
Assignee: nobody → bwong
Summary: Deploy BrowserID-Verifier 0.2.2 → Deploy BrowserID-Verifier 0.2.2 to Stage
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Verified the physical deployment to Stage.
We are now running on two c3.large instances:
i-2fb8137d
i-ca6fbde1

Verified the version bump:
rpm -qa | grep verifier
fxa-browserid-verifier-svcops 0.2.2-1 x86_64 29249727

We don't have any quick deployment-verification tests at this time, so we're moving on to load testing, keeping the following fixes/updates/comments in mind:
https://github.com/mozilla/browserid-verifier/pull/55
https://github.com/mozilla-services/puppet-config/issues/600
The 30-minute load test looked good for the first 20 minutes, then started showing errors in the Loads dashboard:
Tests over    337881
Successes     337743
Failures      138

I will investigate once the test is over...
Final stats:
Test was launched by     jbonacci
Run Id     0721432e-b4b4-4db4-ab5a-ee121c3c2381
Duration     30 min and 22 sec.
Started     2014-06-18 21:04:07 UTC
Ended     2014-06-18 21:34:29 UTC
State     Ended

Users     [20]
Hits     None
Agents     5
Duration     1800
Server URL     https://verifier.stage.mozaws.net

Tests over     442362 
Successes     442100
Failures     0
Errors     0
TCP Hits     442362
Opened web sockets     0
Total web sockets     0
Bytes/websockets     0
Requests / second (RPS)     242

addFailure     262

REF: https://loads.services.mozilla.com/run/0721432e-b4b4-4db4-ab5a-ee121c3c2381
OK, so the addFailure count (262) exactly matches the number of 503s on instance ec2-54-81-41-31.
The other instance is clean.

:mostlygeek can you take a look at this?
Forgot to add: in the verifier_err.log file on the same server, there are the following messages:
{"op":"bid.server","name":"bid.server","time":"2014-06-18T21:26:37.964Z","pid":2250,"v":1,"hostname":"ip-10-203-170-59","message":"too busy"}
(262 of them ;-)   )
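For context, here is a minimal sketch of how a Node/Express service typically wires up the toobusy module to produce this kind of 503 plus "too busy" log line. This is illustrative only, not the verifier's actual code; the middleware, logger call, and response body are assumptions.

var express = require('express');
var toobusy = require('toobusy');

var app = express();

// Reject requests up front whenever measured event-loop lag exceeds maxLag.
app.use(function (req, res, next) {
  if (toobusy()) {
    // A structured log line like the {"message":"too busy"} entries above
    // would be emitted at this point.
    console.error(JSON.stringify({ op: 'bid.server', message: 'too busy' }));
    return res.send(503, 'server is too busy');
  }
  next();
});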
> We don't have any quick deployment verification tests

That's what the `make test` target in the loadtest scripts is for :-)
This is reminiscent of Bug 996763 Comment 44: 503s with only 40% CPU utilization.  I'm also surprised to see it failing on "toobusy" rather than from the compute-cluster.  Any chance we've busted some configuration here?  I'll dig in...
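Background on that distinction: the verifier farms CPU-heavy crypto out to child processes via the compute-cluster module, which rejects new work when its backlog fills up, a failure mode separate from toobusy's event-loop check. A rough sketch, with an illustrative worker path and backlog limit (not the verifier's actual configuration):

var ComputeCluster = require('compute-cluster');

// Hypothetical worker script and backlog limit, for illustration only.
var cc = new ComputeCluster({
  module: './crypto-worker.js',
  max_backlog: 100
});

cc.enqueue({ op: 'verify' }, function (err, result) {
  if (err) {
    // A full backlog surfaces as an enqueue error here -- a separate
    // failure mode from toobusy's event-loop lag check.
    console.error('compute-cluster rejected work:', err);
  }
});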
Config seems to be OK.  Checking the log, all of the toobusy errors occurred in the space of <1 second, from 2014-06-18T21:26:37.861Z to 2014-06-18T21:26:38.616Z.  The loadtest then continued successfully for another five minutes, until 2014-06-18T22:43:25.008Z.

The toobusy module polls the event-loop every 500ms, so this is likely a single latency spike triggering toobusy and then recovering quickly on the next poll.  It's quite possible we just got unlucky here, dealt with it gracefully, and recovered quickly.
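Roughly, the detection technique looks like this. This is a simplified sketch, not toobusy's actual source; the real module also smooths successive measurements rather than acting on a single raw sample.

var INTERVAL = 500; // poll the event loop every 500ms
var MAX_LAG = 70;   // default threshold, in ms
var lagging = false;

var last = Date.now();
setInterval(function () {
  var now = Date.now();
  // How late did this timer fire? One slow turn of the event loop pushes
  // a single measurement over MAX_LAG; the next on-time poll clears it,
  // giving exactly the spike-then-quick-recovery pattern described above.
  var lag = now - last - INTERVAL;
  lagging = lag > MAX_LAG;
  last = now;
}, INTERVAL);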

It looks like we're using the default toobusy maxLag of 70ms, which is quite small.  Given the generally compute-heavy workload here, maybe we should consider bumping it upwards a little to smooth out these spikes.
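If we do bump it, it's a one-line change via the module's maxLag setter. The 120ms below is purely illustrative; the right value would need tuning under load.

var toobusy = require('toobusy');
// Raise the lag threshold from the 70ms default so brief compute-heavy
// spikes don't trip the 503 path. 120 is an illustrative value, not tuned.
toobusy.maxLag(120);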

CPU utilization of 40% is not awesome though.  Is this due to e.g. having more or bigger instances than in the previous test?
And, why only on one of the two instances?
Teamwork.
16:18 < rfkelly> jbonacci mostlygeek I think we can ship it
16:19 < mostlygeek> ship.it.
:-)
Status: RESOLVED → VERIFIED
Blocks: 1027392
> And, why only on one of the two instances?

Well, that's part of why I think it was just a random latency spike putting us over the line.  Only once, on only one of the instances, and recovered straight away.  I will file a puppet-config PR for a slightly higher maxLag setting.