Bug 1151839 (Closed): Opened 9 years ago, Closed 9 years ago

Missing telemetry data in S3 on and after April 3rd

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mreid, Assigned: whd)

References

Details

I noticed that we were missing data with docType=main after April 2nd in the S3 data store. 

I suspect the cause is the same one that made :whd roll back a decoder change on the CEP node. Maybe the change didn't get rolled back on the Data Store Loader node at the same time?

18:19 < whd> I went to deploy the cuckoo filter change today, but rolled back after seeing a bunch of decode failures:
18:19 < whd> Apr  2 22:07:49 ip-172-31-14-40 hekad: 2015/04/02 22:07:49 Decoder 'TelemetryKafkaInput6-TelemetryDecoders' error: Subdecoder 'TelemetryDecoder' decode
             error: Failed parsing: missing info object payload: <binary>
18:20 < whd> Which is related to this line: https://github.com/mozilla-services/data-pipeline/blob/master/heka/sandbox/decoders/extract_telemetry_dimensions.lua#L130
18:20 < whd> Which was added since the last deploy.
18:23 < whd> I'll mention it in the meeting tomorrow, but it looks old-telemetry related.

I suspect that it's triggered by Line 177 rather than 130, but in any case, we should roll back the DWL node too.
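
For reference, the failure looks like the kind produced by a strict payload check. The following is a minimal, hypothetical sketch of that shape of validation; it is not the actual extract_telemetry_dimensions.lua code, and the parse/validate structure is an assumption based on the error message above:

local cjson = require "cjson"

-- Hypothetical sketch only: the kind of strict check that rejects
-- old-style telemetry payloads lacking an "info" object. Not the
-- actual decoder code.
local function validate_payload(payload)
    local ok, parsed = pcall(cjson.decode, payload)
    if not ok then
        return nil, "Failed parsing: invalid JSON"
    end
    if type(parsed.info) ~= "table" then
        return nil, "Failed parsing: missing info object"
    end
    return parsed
end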
The DWL was actually the only node that was updated and rolled back. The rollback procedure was simply to downgrade and restart heka. The problem this exposed is that a cron job runs daily to restart the service if the geoip RPM has been updated, and that job ran after the rollback. The solution is to downgrade the puppet RPMs as well, or to disable the cron job, which I have done.

We need monitoring for this (the decoder statistics available on the CEP could very well serve that purpose), which I'll work on setting up.
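
As a rough illustration, the monitoring could be as simple as a Heka sandbox filter that counts decoder failure messages and reports the count periodically. This is only a sketch under assumed names; the message_matcher and output handling would need to match our actual config:

-- Sketch of a Heka sandbox filter that tallies decode failures.
-- Assumes the filter's message_matcher routes decoder error messages
-- here; the matcher and payload name are placeholders.
failures = 0

function process_message()
    failures = failures + 1
    return 0
end

function timer_event(ns)
    -- Emit the running count so a downstream alert/graph can use it.
    inject_payload("txt", "decode_failures", string.format("%d\n", failures))
    failures = 0
end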
Blocks: 1142543
Note that we still have the "raw" data, so it's not a case of total data loss. We will need to re-process the data from the outage period (April 2 - 7) and update the published data in S3 accordingly.
Assignee: nobody → whd
Priority: -- → P1
This ended up requiring splicing together the http edge decoder and the telemetry decoders, because the edge decoder contains path-parsing logic that happens post-landfill injection (roughly, everything after https://github.com/mozilla-services/data-pipeline/blob/master/heka/sandbox/decoders/http_edge_decoder.lua#L135). The results have been uploaded to S3, and the data they displace has been backed up to EBS until :mreid has verified that the newly decoded data looks good. I processed April 2 for comparison with known data, and the landfill decode looked good to me.

Code-munging is obviously not an ideal solution to this, but it was an expedient one. :trink and I talked about various ways to deal with this; none were particularly satisfactory. The message in flight through the pipeline, and its various transformations along the way, can be a bit difficult to track. In this case we need information embedded in the path to route the data to Kafka, which is why path parsing must occur on the edge. But since we don't want to rely on our path-parsing code being error-free, we inject into the landfill before that parsing happens, which necessitates a hybrid edge/telemetry decoder for reading landfill data back out.
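
In outline, the spliced decoder looks something like the following. This is a heavily simplified sketch: parse_path() and extract_dimensions() are hypothetical stand-ins for the relevant chunks of http_edge_decoder.lua and extract_telemetry_dimensions.lua, and the field names are assumptions:

-- Hypothetical sketch of the hybrid landfill decoder: apply the edge
-- decoder's post-landfill path parsing, then the telemetry dimension
-- extraction, to raw messages read back out of the landfill.

function process_message()
    local raw  = read_message("Payload")
    local path = read_message("Fields[Path]")  -- assumed field name

    -- Edge half: the path parsing that normally happens after landfill
    -- injection (namespace, docType, docId, appName, ...).
    local dims = parse_path(path)
    if not dims then return -1 end

    -- Telemetry half: pull dimensions out of the payload itself and
    -- merge them into the same table.
    if not extract_dimensions(raw, dims) then return -1 end

    for k, v in pairs(dims) do
        write_message("Fields[" .. k .. "]", v)
    end
    return 0
end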
The record counts for "saved session" pings are correct (they match exactly what was there before), and the missing "main" pings are now present.

The records appear slightly different now, probably due to the fix to Fields[os]. They are generally a few bytes larger than they were before.

Overall, things look good to me. Is there anything specific you'd like me to check?
Nothing specific, I just wanted a second pair of eyes before closing this out.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard