Metlog out, heka in. :RaFromBRC says this should be a pretty straightforward search-and-replace.
Some thoughts on the approach here, from IRC:

<RaFromBRC> probably the least amount of work is to replace metlog-py w/ heka-py, w/ minimal adjustments to get it working as before
<rfkelly> *nod*
<RaFromBRC> but that was all done a long time ago, heka's a different beast now, and we're not so focused on a custom protocol or a custom client these days
<RaFromBRC> it might make sense to write to files and parse, or to use syslog, etc.
<rfkelly> ok; more in line with what fxa-auth-server does
<rfkelly> and let heka pull it in from wherever
<RaFromBRC> right
<rfkelly> that sounds simpler longer-term
<RaFromBRC> yeah, depends on the priorities
<RaFromBRC> rfkelly: deciding what to do would be a matter of looking at what data is being pushed through the metlog client now, plus any other data that we know we'd want to push through heka but aren't yet
<RaFromBRC> and looking at it from above to put together a strategy for how to get it all there
<rfkelly> pretty sure it's limited to timing of various things, and application logs (e.g. tracebacks, warnings, etc)
<RaFromBRC> yeah, the decorators
<RaFromBRC> doing the timings
<rfkelly> if the heka-py transition would be pretty easy, it may be worth doing that first regardless, then refactor from there
<RaFromBRC> yeah, i think that's probably wise
<rfkelly> which would also help us understand/remember exactly how it's used so far
<RaFromBRC> bingo
<rfkelly> (I'm going to snapshot this into my bugs for reference)
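For reference, the "decorators doing the timings" mentioned above could be replicated on top of stdlib logging with something like the following sketch. This is not the actual metlog/heka client code; the metric name, logger name, and log-line format here are all assumptions for illustration.

```python
import functools
import logging
import time

# Assumed logger name; the real metrics channel name may differ.
logger = logging.getLogger("tokenserver.metrics")

def timed(name):
    """Hypothetical timing decorator: logs the wall-clock duration
    of the wrapped function under the given metric name."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed_ms = (time.time() - start) * 1000.0
                logger.info("timer:%s %.2fms", name, elapsed_ms)
        return wrapper
    return decorator

@timed("verify_assertion")
def verify_assertion(assertion):
    # Placeholder for the real verification work.
    return bool(assertion)
```

Swapping metlog for something like this keeps the call sites decorated the same way while routing the timing data through whatever handler stdlib logging is configured with.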
Some feedback from the ops side:
- we're seeing high CPU usage under load testing; we haven't profiled the code, but we're assuming it's metlog + circus + stdout-to-file logging that is causing the very high load
- we've seen something like this before in campaign manager, where logging to disk caused high CPU load
- we'd like to see this under heka-py

For implementation:
- if we go with a file stream, do we have to worry about file rotation?
- it would be nicer for ops if we streamed this into heka directly via UDP to 127.0.0.1, so the two are decoupled for the most part; or another input method that does not require us to worry about IO performance, disk space usage, or file rotation
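To make the UDP suggestion concrete, here is a minimal sketch of a stdlib logging handler that serializes records as JSON and fires them at a local UDP port. The port number and JSON field names are assumptions, not heka's actual wire format; a real deployment would match whatever input/decoder heka is configured with.

```python
import json
import logging
import socket

class UDPJSONHandler(logging.Handler):
    """Sketch of a handler that sends one JSON object per log record
    over UDP, so the app never blocks on disk IO and ops need not
    manage file rotation."""

    def __init__(self, host="127.0.0.1", port=5565):  # port is an assumption
        super().__init__()
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def emit(self, record):
        try:
            payload = json.dumps({
                "logger": record.name,
                "severity": record.levelno,
                "message": record.getMessage(),
            }).encode("utf-8")
            self.sock.sendto(payload, self.addr)
        except Exception:
            self.handleError(record)

logger = logging.getLogger("tokenserver")
logger.addHandler(UDPJSONHandler())
logger.setLevel(logging.INFO)
logger.info("hello heka")
```

Since UDP is fire-and-forget, the application and the log collector stay decoupled: if heka is down, sends are silently dropped rather than backing up into the app.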
Update re: high CPU: the cause was that tokenserver was not using HTTPS connection pooling, so creating a new SSL connection to the verifier on every request was crushing the box. We still want the conversion to heka-py, though, to help us in our debugging.
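The idea behind the pooling fix can be illustrated with a stdlib-only sketch: reuse one HTTPS connection so the TLS handshake cost is paid once, not per request. This is not the actual tokenserver verifier client; the host name and helper are hypothetical.

```python
import http.client

_conn = None  # cached connection, reused across requests

def get_connection(host="verifier.example.com"):  # hypothetical host
    """Return a cached HTTPS connection, creating it (and paying the
    TLS handshake cost) only the first time instead of per request."""
    global _conn
    if _conn is None:
        _conn = http.client.HTTPSConnection(host, timeout=30)
    return _conn
```

A real fix would more likely use a pooling HTTP client library, which handles reconnects and concurrency; the point is simply that connection reuse, not raw request volume, was the CPU bottleneck.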
Related GitHub issues:
https://github.com/mozilla-services/puppet-config/issues/81
https://github.com/mozilla-services/puppet-config/issues/206
https://github.com/mozilla-services/puppet-config/issues/287
https://github.com/mozilla-services/puppet-config/pull/318
Status: NEW → ASSIGNED
Priority: P2 → P1
Updated list of related GitHub issues:
https://github.com/mozilla-services/puppet-config/issues/81
https://github.com/mozilla-services/puppet-config/issues/206
https://github.com/mozilla-services/puppet-config/issues/287
https://github.com/mozilla-services/puppet-config/pull/317
https://github.com/mozilla-services/puppet-config/pull/318
Is bug 998054 a potential dup of this?
Created attachment 8425266 [details] [diff] [review] tokenserver-simplify-metrics.diff This patch updates tokenserver for the new simplified metrics infra proposed in Bug 1012509. It's almost all just replacing metlog with stdlib logging, plus tweaking the details of a few timers. The missing half of this is a deploy config change to make it use the JSON logger in production, and then trying it out in stage to see whether heka can slurp the logs in properly.
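The "JSON logger in production" half of this could look something like the following sketch: a stdlib logging Formatter that emits one JSON object per line for heka to slurp from the log stream. The field names here are assumptions for illustration, not the attachment's actual output format.

```python
import json
import logging

class JSONFormatter(logging.Formatter):
    """Sketch of a formatter emitting one JSON object per line,
    a shape a log-file collector like heka can be set up to decode."""

    def format(self, record):
        return json.dumps({
            "name": record.name,
            "levelname": record.levelname,
            "message": record.getMessage(),
        })

# Wire it up to stdout; in production the stream would be whatever
# file or pipe heka is configured to read.
handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logging.getLogger("tokenserver").addHandler(handler)
```

With this in place, swapping log destinations becomes a deploy-config concern (which handler is attached) rather than an application-code change, which matches the "deploy config change" described above.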
Attachment #8425266 - Flags: review?(telliott)
Attachment #8425266 - Flags: review?(telliott) → review+
Status: ASSIGNED → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
I can verify this when we deploy bug 1014496
Verified heka is in use via the shared heka dashboard...
Status: RESOLVED → VERIFIED