Dev, QA, OPs needs for monitoring the correct data, including requirements for metrics.
This covers most non-metrics-specific needs: data generation and collection, logs, monitoring (general and for load), etc.

Associated GitHub links:
https://github.com/mozilla/fxa-auth-server/issues/292
https://github.com/mozilla/fxa-auth-server/issues/372
https://github.com/mozilla/fxa-auth-server/pull/376
https://github.com/mozilla/fxa-auth-server/issues/351
https://github.com/mozilla/fxa-auth-server/issues/349
https://github.com/mozilla/fxa-auth-server/issues/312
https://github.com/mozilla/fxa-auth-server/issues/17
https://github.com/mozilla/fxa-auth-server/pull/28
https://github.com/mozilla/fxa-auth-server/pull/159
https://github.com/mozilla/fxa-auth-server/issues/222
https://github.com/mozilla/fxa-auth-server/issues/205
https://github.com/mozilla/fxa-auth-server/issues/30
Note: not including relevant links from fxa-content-server or fxa-scrypt-helper at this time.

Notes from the 12/5 meeting with Dev and OPs:
We will have a Heka-ES-Kibana setup for Stage and Production, all through an aggregated logger.
OPsView - what Persona uses for availability monitoring
CloudWatch - perf monitoring
(More options are available if we continue to use Heka in AWS)
Also, the api.md file has some good details about errors and error handling: https://github.com/mozilla/fxa-auth-server/blob/master/docs/api.md
Also useful is the following GitHub repo, which gives some current methods for deploying Dev and Load test environments, using Heka for data gathering, using a log aggregator, etc.: https://github.com/mozilla/fxa-deployment
Here is a working Heka Dashboard from a Dev site, so you can see the types of data tracked: http://ec2-50-112-66-71.us-west-2.compute.amazonaws.com:4352/
I have also attached 3 screen captures of a working Kibana dashboard for the load test stack. The first two show the general layout of the dashboard and the type of data graphed or tabled. The third image shows a detail of one of the GETs.
Working OPs-style monitoring for Persona is documented here: https://github.com/mozilla/identity-ops/wiki/Access%20Guide#monitoring We will be starting with the same basic model for FxA, I believe...
One thing missing from our current kibana setup is good latency monitoring, e.g. graphs of mean and peak request latency, db query running time, etc.
:rfkelly - as POC, can this be set up quickly or at all for the load test stack? Otherwise, we can certainly see what Persona has in terms of latency monitoring and go from there.
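For reference, here is a rough sketch of the latency figures asked for above (mean and peak request latency per route), computed from already-parsed log records. The record shape, the `route` and `duration_ms` field names, and the sample routes are assumptions made for illustration only; they are not taken from the fxa-auth-server log format.

```python
from statistics import mean


def latency_summary(records, field="duration_ms"):
    """Group records by route and return count, mean, and peak latency.

    `records` is an iterable of dicts. The "route" key and the default
    `field` name are hypothetical; adjust them to the real log schema.
    """
    by_route = {}
    for rec in records:
        by_route.setdefault(rec["route"], []).append(rec[field])
    return {
        route: {"count": len(values), "mean": mean(values), "peak": max(values)}
        for route, values in by_route.items()
    }


# Made-up sample records, for illustration only.
sample = [
    {"route": "GET /example/a", "duration_ms": 12.0},
    {"route": "GET /example/a", "duration_ms": 30.0},
    {"route": "POST /example/b", "duration_ms": 48.0},
]
summary = latency_summary(sample)
```

The same grouping could be applied to db query running time by passing a different `field`, assuming the logs carry such a value.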
Nice reference from the GeoLocation project for hekad --> carbon/graphite: https://github.com/mozilla/ichnaea/pull/59/files
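To give an idea of what ends up on the carbon side, here is a minimal sketch of Carbon's plaintext protocol, which takes one `path value timestamp` line per data point (port 2003 by default). This is not what the linked pull request does, and the metric path and host are hypothetical placeholders.

```python
import socket
import time


def carbon_line(path, value, timestamp=None):
    """Format one data point for Carbon's plaintext protocol."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def send_to_carbon(lines, host, port=2003):
    """Send pre-formatted lines to a carbon listener over TCP.

    `host` is a placeholder; 2003 is Carbon's default plaintext port.
    """
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))


# Hypothetical metric path; nothing is sent until send_to_carbon is called.
line = carbon_line("fxa.loadtest.request.latency_mean_ms", 21.0, 1386201600)
```

Calling `send_to_carbon([line], "graphite.example.internal")` would push the point, assuming a carbon listener is reachable at that (made-up) host.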
Another good location for OPs-specific issues and resolution: https://github.com/mozilla-services/puppet-config
bumping priority here.
Severity: normal → blocker
Priority: -- → P2
This is a work in progress. Most of your monitoring is happening via Stackdriver. Logging is set up to run through to the dashboard. Work is still being done on that, and on adding more metrics and more items to track and log. See the puppet-config GitHub repo...
I think this is done. I opened it, I am closing it. We have Heka dashboards, Kibana dashboards, and Stackdriver now for Stage and Prod.
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED