Closed Bug 1151839 Opened 5 years ago Closed 5 years ago

Missing telemetry data in S3 on+after April 3rd

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)

(Not tracked)

(Reporter: mreid, Assigned: whd)
I noticed that we were missing data with docType=main after April 2nd in the S3 data store. 

I suspect the cause is the same one that made :whd roll back a decoder change on the CEP node. Maybe it didn't get rolled back on the Data Store Loader node at the same time?

18:19 < whd> I went to deploy the cuckoo filter change today, but rolled back after seeing a bunch of decode failures:
18:19 < whd> Apr  2 22:07:49 ip-172-31-14-40 hekad: 2015/04/02 22:07:49 Decoder 'TelemetryKafkaInput6-TelemetryDecoders' error: Subdecoder 'TelemetryDecoder' decode
             error: Failed parsing: missing info object payload: <binary>
18:20 < whd> Which is related to this line:
18:20 < whd> Which was added since the last deploy.
18:23 < whd> I'll mention it in the meeting tomorrow, but it looks old-telemetry related.

I suspect that it's triggered by Line 177 rather than 130, but in any case, we should roll back the DWL node too.
The DWL was actually the only node that was updated and rolled back. The rollback procedure was simply to downgrade and restart heka. The problem this exposed is that a daily cron job restarts the service whenever the geoip RPM has been updated, and it ran after the rollback. The solution is to downgrade the puppet RPMs as well, or to disable the cron job, which I have done.
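For reference, the restart-on-geoip-update check might look roughly like this. A minimal sketch only: the package, service, and state-file names are assumptions, and the real cron job is puppet-managed, not this code:

```python
import subprocess
from pathlib import Path

STATE_FILE = Path("/var/run/heka-geoip.version")  # hypothetical state file

def needs_restart(current_version: str, previous_version: str) -> bool:
    """Restart only when the installed geoip RPM version has changed."""
    return current_version != previous_version

def check_and_restart() -> None:
    # "GeoIP-data" and "hekad" are assumed names, for illustration only.
    current = subprocess.run(["rpm", "-q", "GeoIP-data"],
                             capture_output=True, text=True).stdout.strip()
    previous = STATE_FILE.read_text().strip() if STATE_FILE.exists() else ""
    if needs_restart(current, previous):
        STATE_FILE.write_text(current)
        # This unconditional restart is the step that can silently undo a
        # manual rollback, as happened here.
        subprocess.run(["service", "hekad", "restart"])
```

Disabling the cron job (or keeping the puppet RPMs in lockstep with the heka version) closes the window where a routine restart reapplies the rolled-back decoder.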

We need monitoring for this (quite possibly the decoder statistics already available on the CEP), which I'll work on setting up.
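A first-pass check could simply count decode failures in the syslog output quoted above. A sketch under assumptions (the log pattern is taken from the IRC excerpt; the threshold is hypothetical), not the CEP-side statistics themselves:

```python
import re

# Pattern based on the syslog line quoted in the IRC excerpt above.
DECODE_ERROR = re.compile(r"Decoder '[^']+' error: Subdecoder '[^']+' decode")

def count_decode_errors(log_lines) -> int:
    """Count decoder failures in an iterable of syslog lines."""
    return sum(1 for line in log_lines if DECODE_ERROR.search(line))

def should_alert(log_lines, threshold: int = 10) -> bool:
    """Alert when failures reach a (hypothetical) threshold."""
    return count_decode_errors(log_lines) >= threshold
```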
Blocks: 1142543
Note that we still have the "raw" data, so it's not a case of total data loss. We will need to re-process data during the outage period (April 2 - 7) and update the published data in S3 accordingly.
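The reprocessing window itself is easy to enumerate. A sketch of the driver loop only; the actual download/decode/upload steps are left as comments, since the bucket layout and decoder invocation aren't shown in this bug:

```python
from datetime import date, timedelta

def outage_days(start=date(2015, 4, 2), end=date(2015, 4, 7)):
    """Yield each day of the outage window, inclusive, as YYYYMMDD strings."""
    day = start
    while day <= end:
        yield day.strftime("%Y%m%d")
        day += timedelta(days=1)

# Hypothetical driver (bucket names and tooling are assumptions):
# for day in outage_days():
#     1. fetch the raw landfill objects for `day` from S3
#     2. re-run the decoder over them
#     3. upload the regenerated records to the published S3 location
```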
Assignee: nobody → whd
Priority: -- → P1
This ended up requiring splicing the http edge decoder and the telemetry decoders, because the edge decoder contains path-parsing logic that happens post-landfill injection (roughly, everything after the landfill injection point). The results have been uploaded to S3, and the data they displace has been backed up to EBS until :mreid has verified that the newly decoded data looks good. I processed April 2 for comparison with known data, and the landfill decode looked good to me.

Code-munging is obviously not an ideal solution, but it was an expedient one. :trink and I talked about various ways to deal with this; none were particularly satisfactory. The message in flight through the pipeline, and its various transformations along the way, can be difficult to track. In this case we need information embedded in the path to route the data to kafka, which is why path parsing must occur on the edge. But since we don't want to rely on our path-parsing code being error-free, we inject to the landfill before parsing happens, which necessitates a hybrid edge/telemetry decoder for reading landfill data back out.
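The ordering constraint described above (durable landfill write first, fallible path parsing second) can be sketched like this; the path layout, field names, and helpers are illustrative assumptions, not the real edge code:

```python
def parse_path(raw_path: str) -> dict:
    """Toy path parser: '/submit/<namespace>/<docId>/<docType>/...'."""
    parts = raw_path.strip("/").split("/")
    if len(parts) < 4 or parts[0] != "submit":
        raise ValueError("malformed path: " + raw_path)
    return {"namespace": parts[1], "docId": parts[2], "docType": parts[3]}

def topic_for(fields: dict) -> str:
    """Routing needs the parsed fields, so parsing must happen on the edge."""
    return fields["namespace"] + "." + fields["docType"]

def handle_request(raw_path: str, payload: bytes,
                   landfill: list, kafka: list) -> None:
    # Durable raw copy FIRST: a path-parsing bug can then never lose data.
    landfill.append((raw_path, payload))
    fields = parse_path(raw_path)               # may raise on malformed paths
    kafka.append((topic_for(fields), payload))  # routing happens after injection
```

Reading landfill data back out then requires re-running the path parsing that normally happens on the edge, which is why a hybrid edge/telemetry decoder is needed.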
The record counts for "saved session" pings are correct (they match exactly what was there before), and the missing "main" pings are now present.

The records appear slightly different now, probably due to the fix to Fields[os]. They are generally a few bytes larger than they were before.
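The comparison being reported here can be expressed as a couple of toy checks; the record shapes and docType strings are assumptions for illustration, while the real comparison was over the S3/EBS data sets:

```python
from collections import Counter

def count_by_doctype(records) -> Counter:
    """Tally records per docType."""
    return Counter(r["docType"] for r in records)

def verify_reprocess(old_records, new_records) -> bool:
    """Mirror the checks above: saved-session counts must match exactly,
    and the previously missing main pings must now be present."""
    old, new = count_by_doctype(old_records), count_by_doctype(new_records)
    return new["saved-session"] == old["saved-session"] and new["main"] > 0
```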

Overall, things look good to me. Is there anything specific you'd like me to check?
Nothing specific, I just wanted a second pair of eyes before closing this out.
Closed: 5 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard