Closed Bug 1335219 Opened 7 years ago Closed 7 years ago

Recent nightly pings all missing 'payload/processes', some missing 'payload/childPayloads'

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: frank, Assigned: mreid)

References

Details

Attachments

(2 files)

It seems that recent data is missing payload information. I'm not sure how widespread this is, but we've found data > 1 week old missing this information. 

For nightly builds matching 201701*, no pings had 'payload/processes'; 773150 were missing childPayloads as well, and 2079321 had childPayloads.

For build IDs matching 20170120*, a third had neither childPayloads nor processes, and all were missing processes.
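
(For anyone wanting to reproduce these counts, something like the following Dataset API sketch is roughly what's involved; the filters, the sample rate, and the `sc` SparkContext are assumptions, not the exact code used here.)

from moztelemetry.dataset import Dataset

# Rough sketch: sample nightly "main" pings for January 2017 builds and count
# how many are missing the sections in question. Assumes an existing SparkContext `sc`.
pings = (Dataset.from_source('telemetry')
         .where(docType='main',
                appUpdateChannel='nightly',
                appBuildId=lambda b: b.startswith('201701'))
         .records(sc, sample=0.1))

missing_processes = pings.filter(lambda p: 'processes' not in p.get('payload', {})).count()
missing_child = pings.filter(lambda p: 'childPayloads' not in p.get('payload', {})).count()
print(missing_processes, missing_child)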
WHD mentioned that it might be a recent upgrade of python_moztelemetry. I downgraded to 0.5.0 (we're currently at 0.6.0), and the issue persisted.
I checked my personal Nightly and according to about:telemetry, it has both a childPayloads section (with simpleMeasurements, as expected) and a processes section (with parent.scalars and content.{stuff}).

Recomponenting to pipeline as the problem may be server-side.
Component: Telemetry → Metrics: Pipeline
Product: Toolkit → Cloud Services
I've verified that the incoming data is correct at the CEP and in the "raw" store on S3 (for both the telemetry-2 and telemetry-3 prefixes).

I also did a quick-and-dirty local test of the python_moztelemetry heka-parsing code, and that appears to be where the data is being dropped.

I'll write up a better test case to confirm.
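
(Roughly the shape I have in mind, as a sketch only: it assumes the parser entry point is moztelemetry.heka.message_parser.parse_heka_message and a locally saved heka-framed record, so the exact import and filename may differ.)

# Sketch of a regression test; the import below is an assumption about where the
# heka parser lives in python_moztelemetry and may need adjusting.
from moztelemetry.heka.message_parser import parse_heka_message

def test_processes_survive_parsing():
    # 'sample.heka' is a placeholder for a locally saved heka-framed ping that is
    # known (e.g. from the CEP or the raw S3 store) to contain payload/processes.
    with open('sample.heka', 'rb') as f:
        for ping in parse_heka_message(f):
            assert 'processes' in ping['payload']
            assert 'childPayloads' in ping['payload']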
Assignee: nobody → mreid
Points: --- → 1
Priority: -- → P1
What was the effect of this bug? Did it only affect analysis.tmo, or were dashboards also affected? When did it start happening? Will we be able to see payload/processes in submissions from January once the patch lands?
Flags: needinfo?(mreid)
FWIW aggregates.tmo appears unaffected, as https://telemetry.mozilla.org dashboards are showing plenty of child-process data from nightly 54.
(In reply to Bill McCloskey (:billm) from comment #6)
> What was the effect of this bug? Did it only affect analysis.tmo, or were
> dashboards also affected? When did it start happening? Will we be able to
> see payload/processes in submissions from January once the patch lands?

Code that used the Dataset API on ATMO with (I believe) version 0.5.2 or 0.6.0 of python_moztelemetry would be affected.

The data itself is not lost. The attached code change to python_moztelemetry should make all of the `payload/processes` sections visible from all historic data as well.
Flags: needinfo?(mreid)
For anyone interested in finding out if their job was affected, the release of version 0.5.2 was 2017/01/25. You can check the logs of spark jobs, at $LOG_ROOT/node/$MASTER_EC2_INSTANCE_ID/bootstrap-actions/1/stdout.gz, and grep for 'python_moztelemetry'.
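(If you've pulled a copy of that stdout.gz locally, here's a quick sketch of the check; the filename is a placeholder.)

import gzip

# Placeholder path: a local copy of the cluster's bootstrap-actions stdout.gz.
with gzip.open('stdout.gz', 'rt', errors='replace') as log:
    for line in log:
        if 'python_moztelemetry' in line:
            # The installed version shows whether the job picked up 0.5.2 or later.
            print(line.rstrip())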
So I was actually wrong: https://telemetry.mozilla.org is indeed affected, but only for data collected from Jan 25 onwards. Older probes may show parent-heavy numbers until the db rows are purged and refilled with correct data. :frank is working on this part.
Blocks: 1335556
Ok, the immediate problem is fixed with python_moztelemetry 0.6.1, which I've just deployed.

Please file new follow-up bugs for any related cleanup or backfill as needed.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
The bug, as I understand it, made any payload.* or environment.* objects that are not in the list we extract specially during ingestion inaccessible via python_moztelemetry. Here's the list (as an aside, we might want to add processes to it):

https://github.com/mozilla-services/lua_sandbox_extensions/blob/master/moz_telemetry/io_modules/decoders/moz_telemetry/ping.lua#L156-L185

I haven't checked extensively, but this appears to be essentially limited to "payload/webrtc" and "payload/processes".

I looked through all the Airflow and ATMO scheduled jobs and didn't find any Python jobs accessing anything in payload or environment that wasn't specially extracted, except for aggregates (which uses "payload/processes"). The fact that aggregates.tmo sees plenty of child-process data from nightly 54 supports this, as there's only a 2-day margin between when nightly 54 was cut and when this bug was introduced.

So hopefully the amount of backfilling necessary is limited to "payload/processes" histograms for aggregates.tmo.
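
For context, a rough sketch of the Python-side access pattern that would have been affected (assuming a `pings` RDD from the Dataset API; the histogram name is just an example):

from moztelemetry import get_pings_properties

# Paths outside the specially-extracted list, such as "payload/processes",
# came back missing/None from python_moztelemetry while the bug was live.
subset = get_pings_properties(pings, ["clientId",
                                      "payload/processes",
                                      "payload/histograms/GC_MS"])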
(In reply to Wesley Dawson [:whd] from comment #12)
> I looked through all the airflow and ATMO scheduled jobs and didn't find any
> python jobs that were accessing anything in payload or environment that
> wasn't specially extracted except for aggregates (which uses
> "payload/processes"). The fact that aggregates.tmo sees plenty of
> child-process data from nightly 54 supports this, as there's only a 2-day
> margin between when nightly 54 was cut and this bug being introduced.
> 
> So hopefully the amount of backfilling necessary is limited to
> "payload/processes" histograms for aggregates.tmo.

The Longitudinal job uses "payload/processes" to extract the scalar data; see https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/views/Longitudinal.scala#L810
(In reply to Alessio Placitelli [:Dexter] from comment #13)
> The Longitudinal job uses "payload/processes" to extract the scalar data,
> see
> https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/
> com/mozilla/telemetry/views/Longitudinal.scala#L810

The Longitudinal job is written in Scala and is thus unaffected by this bug.
Product: Cloud Services → Cloud Services Graveyard