Closed Bug 1335219 Opened 7 years ago Closed 7 years ago

Recent nightly pings all missing 'payload/processes', some missing 'payload/childPayloads'

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: frank, Assigned: mreid)

References

Details

Attachments

(2 files)

It seems that recent data is missing payload information. I'm not sure how widespread this is, but we've found data > 1 week old missing this information. 

For nightly builds matching 201701*, no pings had 'payload/processes'; 773150 were missing childPayloads as well, and 2079321 had childPayloads.

For build IDs matching 20170120*, a third had neither childPayloads nor processes, and all were missing processes.
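
(For anyone wanting to reproduce these counts, something like the following Dataset API sketch is roughly what's involved; the filters, the sample rate, and the `sc` SparkContext are assumptions, not the exact code used here.)

from moztelemetry.dataset import Dataset

# Rough sketch: sample nightly "main" pings for January 2017 builds and count
# how many are missing the sections in question. Assumes an existing SparkContext `sc`.
pings = (Dataset.from_source('telemetry')
         .where(docType='main',
                appUpdateChannel='nightly',
                appBuildId=lambda b: b.startswith('201701'))
         .records(sc, sample=0.1))

missing_processes = pings.filter(lambda p: 'processes' not in p.get('payload', {})).count()
missing_child = pings.filter(lambda p: 'childPayloads' not in p.get('payload', {})).count()
print(missing_processes, missing_child)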
WHD mentioned that it might be a recent upgrade of python_moztelemetry. I downgraded to 0.5.0 (we're currently at 0.6.0), and the issue persisted.
I checked my personal Nightly and according to about:telemetry, it has both a childPayloads section (with simpleMeasurements, as expected) and a processes section (with parent.scalars and content.{stuff}).

Recomponenting to pipeline as the problem may be server-side.
Component: Telemetry → Metrics: Pipeline
Product: Toolkit → Cloud Services
I've verified that the incoming data is correct at the CEP and in the "raw" store on S3 (for both the telemetry-2 and telemetry-3 prefixes).

I also did a quick-and-dirty local test of the python_moztelemetry heka-parsing code, and that appears to be where the data is being dropped.

I'll write up a better test case to confirm.
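
(Roughly the shape I have in mind, as a sketch only: it assumes the parser entry point is moztelemetry.heka.message_parser.parse_heka_message and a locally saved heka-framed record, so the exact import and filename may differ.)

# Sketch of a regression test; the import below is an assumption about where the
# heka parser lives in python_moztelemetry and may need adjusting.
from moztelemetry.heka.message_parser import parse_heka_message

def test_processes_survive_parsing():
    # 'sample.heka' is a placeholder for a locally saved heka-framed ping that is
    # known (e.g. from the CEP or the raw S3 store) to contain payload/processes.
    with open('sample.heka', 'rb') as f:
        for ping in parse_heka_message(f):
            assert 'processes' in ping['payload']
            assert 'childPayloads' in ping['payload']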
Assignee: nobody → mreid
Points: --- → 1
Priority: -- → P1
What was the effect of this bug? Did it only affect analysis.tmo, or were dashboards also affected? When did it start happening? Will we be able to see payload/processes in submissions from January once the patch lands?
Flags: needinfo?(mreid)
FWIW aggregates.tmo appears unaffected, as https://telemetry.mozilla.org dashboards are showing plenty of child-process data from nightly 54.
(In reply to Bill McCloskey (:billm) from comment #6)
> What was the effect of this bug? Did it only affect analysis.tmo, or were
> dashboards also affected? When did it start happening? Will we be able to
> see payload/processes in submissions from January once the patch lands?

Code that used the Dataset API on ATMO with (I believe) version 0.5.2 or 0.6.0 of python_moztelemetry would be affected.

The data itself is not lost. The attached code change to python_moztelemetry should make all of the `payload/processes` sections visible from all historic data as well.
Flags: needinfo?(mreid)
For anyone interested in finding out if their job was affected, the release of version 0.5.2 was 2017/01/25. You can check the logs of spark jobs, at $LOG_ROOT/node/$MASTER_EC2_INSTANCE_ID/bootstrap-actions/1/stdout.gz, and grep for 'python_moztelemetry'.
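(If you've pulled a copy of that stdout.gz locally, here's a quick sketch of the check; the filename is a placeholder.)

import gzip

# Placeholder path: a local copy of the cluster's bootstrap-actions stdout.gz.
with gzip.open('stdout.gz', 'rt', errors='replace') as log:
    for line in log:
        if 'python_moztelemetry' in line:
            # The installed version shows whether the job picked up 0.5.2 or later.
            print(line.rstrip())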
So I was actually wrong: https://telemetry.mozilla.org is indeed affected, but only for data collected from Jan 25 onwards. Older probes may show parent-heavy numbers until the db rows are purged and refilled with correct data. :frank is working on this part.
Blocks: 1335556
Ok, the immediate problem is fixed with python_moztelemetry 0.6.1, which I've just deployed.

Please file new follow-up bugs for any related cleanup or backfill as needed.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
The bug, as I understand it, made any payload.* or environment.* objects that are not in the list we extract specially during ingestion inaccessible via python_moztelemetry. Here's the list (as an aside, we might want to add processes to it):

https://github.com/mozilla-services/lua_sandbox_extensions/blob/master/moz_telemetry/io_modules/decoders/moz_telemetry/ping.lua#L156-L185

I haven't checked extensively, but this appears to be essentially limited to "payload/webrtc" and "payload/processes".

I looked through all the Airflow and ATMO scheduled jobs and didn't find any Python jobs accessing anything in payload or environment that wasn't specially extracted, except for aggregates (which uses "payload/processes"). The fact that aggregates.tmo sees plenty of child-process data from nightly 54 supports this, as there's only a 2-day margin between when nightly 54 was cut and when this bug was introduced.

So hopefully the amount of backfilling necessary is limited to "payload/processes" histograms for aggregates.tmo.
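
For context, a rough sketch of the Python-side access pattern that would have been affected (assuming a `pings` RDD from the Dataset API; the histogram name is just an example):

from moztelemetry import get_pings_properties

# Paths outside the specially-extracted list, such as "payload/processes",
# came back missing/None from python_moztelemetry while the bug was live.
subset = get_pings_properties(pings, ["clientId",
                                      "payload/processes",
                                      "payload/histograms/GC_MS"])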
(In reply to Wesley Dawson [:whd] from comment #12)
> I looked through all the airflow and ATMO scheduled jobs and didn't find any
> python jobs that were accessing anything in payload or environment that
> wasn't specially extracted except for aggregates (which uses
> "payload/processes"). The fact that aggregates.tmo sees plenty of
> child-process data from nightly 54 supports this, as there's only a 2-day
> margin between when nightly 54 was cut and this bug being introduced.
> 
> So hopefully the amount of backfilling necessary is limited to
> "payload/processes" histograms for aggregates.tmo.

The Longitudinal job uses "payload/processes" to extract the scalar data; see https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/views/Longitudinal.scala#L810
(In reply to Alessio Placitelli [:Dexter] from comment #13)
> The Longitudinal job uses "payload/processes" to extract the scalar data,
> see
> https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/
> com/mozilla/telemetry/views/Longitudinal.scala#L810

The Longitudinal job is written in Scala and is thus unaffected by this bug.
Product: Cloud Services → Cloud Services Graveyard