Metlog out, heka in. :RaFromBRC says this should be a pretty straightforward search-and-replace.
Some thoughts on the approach here, from IRC:

<RaFromBRC> probably the least amount of work is to replace metlog-py w/ heka-py, w/ minimal adjustments to get it working as before
<rfkelly> *nod*
<RaFromBRC> but that was all done a long time ago, heka's a different beast now, and we're not so focused on a custom protocol or a custom client these days
<RaFromBRC> it might make sense to write to files and parse, or to use syslog, etc.
<rfkelly> ok; more in line with what fxa-auth-server does
<rfkelly> and let heka pull it in from wherever
<RaFromBRC> right
<rfkelly> that sounds simpler longer-term
<RaFromBRC> yeah, depends on the priorities
<RaFromBRC> rfkelly: deciding what to do would be a matter of looking at what data is being pushed through the metlog client now, plus any other data that we know we'd want to push through heka but aren't yet
<RaFromBRC> and looking at it from above to put together a strategy for how to get it all there
<rfkelly> pretty sure it's limited to timing of various things, and application logs (e.g. tracebacks, warnings, etc)
<RaFromBRC> yeah, the decorators
<RaFromBRC> doing the timings
<rfkelly> if the heka-py transition would be pretty easy, it may be worth doing that first regardless, then refactor from there
<RaFromBRC> yeah, i think that's probably wise
<rfkelly> which would also help us understand/remember exactly how it's used so far
<RaFromBRC> bingo
<rfkelly> (I'm going to snapshot this into my bugs for reference)
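For reference, the "decorators doing the timings" mentioned above could be replicated on top of stdlib logging with something like the following sketch. This is not the actual metlog/heka client code; the metric name, logger name, and log-line format here are all assumptions for illustration.

```python
import functools
import logging
import time

# Assumed logger name; the real metrics channel name may differ.
logger = logging.getLogger("tokenserver.metrics")

def timed(name):
    """Hypothetical timing decorator: logs the wall-clock duration
    of the wrapped function under the given metric name."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed_ms = (time.time() - start) * 1000.0
                logger.info("timer:%s %.2fms", name, elapsed_ms)
        return wrapper
    return decorator

@timed("verify_assertion")
def verify_assertion(assertion):
    # Placeholder for the real verification work.
    return bool(assertion)
```

Swapping metlog for something like this keeps the call sites decorated the same way while routing the timing data through whatever handler stdlib logging is configured with.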
Some feedback from the ops side:
- we're seeing high CPU usage under load testing; we haven't profiled the code, but we're assuming it's metlog + circus + stdout-to-file logging that is causing the very high load
- we've seen something like this before in campaign manager, where logging to disk caused high CPU load
- we'd like to see this under heka-py

For implementation:
- if we go with a file stream, do we have to worry about file rotation?
- it would be nicer for ops if we streamed this into heka directly via UDP to 127.0.0.1, so the two are decoupled for the most part; or another input method that does not require us to worry about IO performance, disk space usage, or file rotation
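To make the UDP suggestion concrete, here is a minimal sketch of a stdlib logging handler that serializes records as JSON and fires them at a local UDP port. The port number and JSON field names are assumptions, not heka's actual wire format; a real deployment would match whatever input/decoder heka is configured with.

```python
import json
import logging
import socket

class UDPJSONHandler(logging.Handler):
    """Sketch of a handler that sends one JSON object per log record
    over UDP, so the app never blocks on disk IO and ops need not
    manage file rotation."""

    def __init__(self, host="127.0.0.1", port=5565):  # port is an assumption
        super().__init__()
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def emit(self, record):
        try:
            payload = json.dumps({
                "logger": record.name,
                "severity": record.levelno,
                "message": record.getMessage(),
            }).encode("utf-8")
            self.sock.sendto(payload, self.addr)
        except Exception:
            self.handleError(record)

logger = logging.getLogger("tokenserver")
logger.addHandler(UDPJSONHandler())
logger.setLevel(logging.INFO)
logger.info("hello heka")
```

Since UDP is fire-and-forget, the application and the log collector stay decoupled: if heka is down, sends are silently dropped rather than backing up into the app.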
Update re: high CPU: the cause was that tokenserver was not using HTTPS connection pooling, so creating a new SSL connection to the verifier on every request was crushing the box. We still want the conversion to heka-py, though, to help us in our debugging.
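The idea behind the pooling fix can be illustrated with a stdlib-only sketch: reuse one HTTPS connection so the TLS handshake cost is paid once, not per request. This is not the actual tokenserver verifier client; the host name and helper are hypothetical.

```python
import http.client

_conn = None  # cached connection, reused across requests

def get_connection(host="verifier.example.com"):  # hypothetical host
    """Return a cached HTTPS connection, creating it (and paying the
    TLS handshake cost) only the first time instead of per request."""
    global _conn
    if _conn is None:
        _conn = http.client.HTTPSConnection(host, timeout=30)
    return _conn
```

A real fix would more likely use a pooling HTTP client library, which handles reconnects and concurrency; the point is simply that connection reuse, not raw request volume, was the CPU bottleneck.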
Related GitHub issues:
https://github.com/mozilla-services/puppet-config/issues/81
https://github.com/mozilla-services/puppet-config/issues/206
https://github.com/mozilla-services/puppet-config/issues/287
https://github.com/mozilla-services/puppet-config/pull/318
Status: NEW → ASSIGNED
Priority: P2 → P1
Updated list of related GitHub issues:
https://github.com/mozilla-services/puppet-config/issues/81
https://github.com/mozilla-services/puppet-config/issues/206
https://github.com/mozilla-services/puppet-config/issues/287
https://github.com/mozilla-services/puppet-config/pull/317
https://github.com/mozilla-services/puppet-config/pull/318
Is bug 998054 a potential dup of this?
Created attachment 8425266 [details] [diff] [review] tokenserver-simplify-metrics.diff This patch updates tokenserver for the new simplified metrics infra proposed in Bug 1012509. It's almost all just replacing metlog with stdlib logging, plus tweaking the details of a few timers. The missing half of this is a deploy config change to make it use the JSON logger in production, and then trying it out in stage to see whether heka can slurp the logs in properly.
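The "JSON logger in production" half of this could look something like the following sketch: a stdlib logging Formatter that emits one JSON object per line for heka to slurp from the log stream. The field names here are assumptions for illustration, not the attachment's actual output format.

```python
import json
import logging

class JSONFormatter(logging.Formatter):
    """Sketch of a formatter emitting one JSON object per line,
    a shape a log-file collector like heka can be set up to decode."""

    def format(self, record):
        return json.dumps({
            "name": record.name,
            "levelname": record.levelname,
            "message": record.getMessage(),
        })

# Wire it up to stdout; in production the stream would be whatever
# file or pipe heka is configured to read.
handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logging.getLogger("tokenserver").addHandler(handler)
```

With this in place, swapping log destinations becomes a deploy-config concern (which handler is attached) rather than an application-code change, which matches the "deploy config change" described above.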
Attachment #8425266 - Flags: review?(telliott)
Attachment #8425266 - Flags: review?(telliott) → review+
Status: ASSIGNED → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
I can verify this when we deploy bug 1014496
Verified heka is in use via the shared heka dashboard...
Status: RESOLVED → VERIFIED