Bug 1026644 (Closed): Opened 10 years ago, Closed 10 years ago

Deploy BrowserID-Verifier 0.2.2 to Stage

Categories

(Cloud Services :: Operations: Deployment Requests - DEPRECATED, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: jbonacci, Assigned: mostlygeek)

References

Details

(Whiteboard: [qa+])

Whiteboard: [qa+]
Blocks: 1014496
Version 0.2.2 tagged and ready to go; bug title updated accordingly
Summary: Deploy latest BrowserID-Verifier to Stage → Deploy BrowserID-Verifier 0.2.2
Assignee: nobody → bwong
Summary: Deploy BrowserID-Verifier 0.2.2 → Deploy BrowserID-Verifier 0.2.2 to Stage
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Verified the physical deployment to Stage.
We are now running on two c3.large instances:
i-2fb8137d
i-ca6fbde1

Verified the version bump:
rpm -qa | grep verifier
fxa-browserid-verifier-svcops 0.2.2-1 x86_64 29249727

We don't have any quick deployment-verification tests at this time, so we're moving on to load testing, keeping the following fixes/updates/comments in mind:
https://github.com/mozilla/browserid-verifier/pull/55
https://github.com/mozilla-services/puppet-config/issues/600
The 30-minute load test looked good for the first 20 minutes, then started showing errors in the Loads dashboard:
Tests over    337881
Successes     337743
Failures      138

I will investigate once the test is over...
Final stats:
Test was launched by     jbonacci
Run Id     0721432e-b4b4-4db4-ab5a-ee121c3c2381
Duration     30 min and 22 sec.
Started     2014-06-18 21:04:07 UTC
Ended     2014-06-18 21:34:29 UTC
State     Ended

Users     [20]
Hits     None
Agents     5
Duration     1800
Server URL     https://verifier.stage.mozaws.net

Tests over     442362 
Successes     442100
Failures     0
Errors     0
TCP Hits     442362
Opened web sockets     0
Total web sockets     0
Bytes/websockets     0
Requests / second (RPS)     242

addFailure     262

REF: https://loads.services.mozilla.com/run/0721432e-b4b4-4db4-ab5a-ee121c3c2381
OK, so the addFailure count (262) exactly matches the number of 503s on instance ec2-54-81-41-31.
The other instance is clean.

:mostlygeek can you take a look at this?
Forgot to add: in the verifier_err.log file on the same server, there are the following messages:
{"op":"bid.server","name":"bid.server","time":"2014-06-18T21:26:37.964Z","pid":2250,"v":1,"hostname":"ip-10-203-170-59","message":"too busy"}
(262 of them ;-)   )
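For context, here is a minimal sketch of how a Node/Express service typically wires up the toobusy module to produce this kind of 503 plus "too busy" log line. This is illustrative only, not the verifier's actual code; the middleware, logger call, and response body are assumptions.

var express = require('express');
var toobusy = require('toobusy');

var app = express();

// Reject requests up front whenever measured event-loop lag exceeds maxLag.
app.use(function (req, res, next) {
  if (toobusy()) {
    // A structured log line like the {"message":"too busy"} entries above
    // would be emitted at this point.
    console.error(JSON.stringify({ op: 'bid.server', message: 'too busy' }));
    return res.send(503, 'server is too busy');
  }
  next();
});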
> We don't have any quick deployment verification tests

That's what the `make test` target in the loadtest scripts is for :-)
This is reminiscent of Bug 996763 Comment 44: 503s with only 40% CPU utilization.  I'm also surprised to see it failing on "toobusy" rather than from the compute-cluster.  Any chance we've busted some configuration here?  I'll dig in...
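Background on that distinction: the verifier farms CPU-heavy crypto out to child processes via the compute-cluster module, which rejects new work when its backlog fills up, a failure mode separate from toobusy's event-loop check. A rough sketch, with an illustrative worker path and backlog limit (not the verifier's actual configuration):

var ComputeCluster = require('compute-cluster');

// Hypothetical worker script and backlog limit, for illustration only.
var cc = new ComputeCluster({
  module: './crypto-worker.js',
  max_backlog: 100
});

cc.enqueue({ op: 'verify' }, function (err, result) {
  if (err) {
    // A full backlog surfaces as an enqueue error here -- a separate
    // failure mode from toobusy's event-loop lag check.
    console.error('compute-cluster rejected work:', err);
  }
});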
Config seems to be OK.  Checking the log, all of the toobusy errors occurred in the space of <1 second, from 2014-06-18T21:26:37.861Z to 2014-06-18T21:26:38.616Z.  The loadtest then continued successfully for another five minutes, until 2014-06-18T22:43:25.008Z.

The toobusy module polls the event-loop every 500ms, so this is likely a single latency spike triggering toobusy and then recovering quickly on the next poll.  It's quite possible we just got unlucky here, dealt with it gracefully, and recovered quickly.
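Roughly, the detection technique looks like this. This is a simplified sketch, not toobusy's actual source; the real module also smooths successive measurements rather than acting on a single raw sample.

var INTERVAL = 500; // poll the event loop every 500ms
var MAX_LAG = 70;   // default threshold, in ms
var lagging = false;

var last = Date.now();
setInterval(function () {
  var now = Date.now();
  // How late did this timer fire? One slow turn of the event loop pushes
  // a single measurement over MAX_LAG; the next on-time poll clears it,
  // giving exactly the spike-then-quick-recovery pattern described above.
  var lag = now - last - INTERVAL;
  lagging = lag > MAX_LAG;
  last = now;
}, INTERVAL);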

It looks like we're using the default toobusy maxLag of 70ms, which is quite small.  Given the generally compute-heavy workload here, maybe we should consider bumping it upwards a little to smooth out these spikes.
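If we do bump it, it's a one-line change via the module's maxLag setter. The 120ms below is purely illustrative; the right value would need tuning under load.

var toobusy = require('toobusy');
// Raise the lag threshold from the 70ms default so brief compute-heavy
// spikes don't trip the 503 path. 120 is an illustrative value, not tuned.
toobusy.maxLag(120);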

CPU utilization of 40% is not awesome though.  Is this due to e.g. having more or bigger instances than in the previous test?
And, why only on one of the two instances?
Teamwork.
16:18 < rfkelly> jbonacci mostlygeek I think we can ship it
16:19 < mostlygeek> ship.it.
:-)
Status: RESOLVED → VERIFIED
Blocks: 1027392
> And, why only on one of the two instances?

Well, that's part of why I think it was just a random latency spike putting us over the line.  Only once, on only one of the instances, and recovered straight away.  I will file a puppet-config PR for a slightly higher maxLag setting.