Bug 1335219 (Closed)
Opened 8 years ago, closed 8 years ago
Recent nightly pings all missing 'payload/processes', some missing 'payload/childPayloads'
Categories
(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: frank, Assigned: mreid)
Attachments
(2 files)
It seems that recent data is missing payload information. I'm not sure how widespread this is, but we've found data more than a week old missing this information.
For all nightly builds matching 201701*, none of the pings had 'payload/processes'; 773,150 were also missing 'payload/childPayloads', while 2,079,321 still had it.
For build IDs matching 20170120*, about one third of pings were missing both 'payload/childPayloads' and 'payload/processes', and all were missing 'payload/processes'.
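(For reference, a check along these lines can be run from an ATMO notebook. The sketch below assumes a Spark context `sc` and the python_moztelemetry Dataset API; the filters and sample fraction are illustrative rather than the exact queries used here.)

# Sketch only: count nightly main pings missing the two sections.
from moztelemetry.dataset import Dataset
from moztelemetry import get_pings_properties

pings = (Dataset.from_source("telemetry")
         .where(docType="main",
                appUpdateChannel="nightly",
                appBuildId=lambda b: b.startswith("201701"))
         .records(sc, sample=0.1))

# get_pings_properties returns None for paths that are absent from a ping.
subset = get_pings_properties(pings, ["payload/processes",
                                      "payload/childPayloads"])

missing_processes = subset.filter(lambda p: p["payload/processes"] is None).count()
missing_child_payloads = subset.filter(lambda p: p["payload/childPayloads"] is None).count()
print(missing_processes, missing_child_payloads)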
Reporter
Comment 1 • 8 years ago
Reporter
Comment 2 • 8 years ago
WHD mentioned that it might be due to a recent upgrade of python_moztelemetry. I downgraded to 0.5.0 (we are currently at 0.6.0), and the issue still occurred.
Comment 3 • 8 years ago
I checked my personal Nightly and according to about:telemetry, it has both a childPayloads section (with simpleMeasurements, as expected) and a processes section (with parent.scalars and content.{stuff}).
Recomponenting to pipeline as the problem may be server-side.
Component: Telemetry → Metrics: Pipeline
Product: Toolkit → Cloud Services
Assignee
Comment 4 • 8 years ago
I've verified that the incoming data is correct at the CEP and in the "raw" store on S3 (for both the telemetry-2 and telemetry-3 prefixes).
I also did a quick and dirty local test of the python_moztelemetry heka-parsing code, and it appears that this is where the data is being dropped.
I'll write up a better test case to confirm.
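(For illustration only, a quick local check might look roughly like the following. This is a sketch under the assumption that the parser exposes a parse_heka_message-style entry point and that a framed record file has been downloaded locally as sample.heka; the exact module path and function name may differ between python_moztelemetry versions.)

# Sketch only: parse_heka_message and sample.heka are assumptions, not
# verified against a specific python_moztelemetry release.
from moztelemetry.heka.message_parser import parse_heka_message

with open("sample.heka", "rb") as fobj:
    for ping in parse_heka_message(fobj):
        payload = ping.get("payload", {})
        # With the affected parser versions, sections that are not
        # specially extracted during ingestion (e.g. "processes") are
        # missing here even though they exist in the raw submission.
        print(sorted(payload.keys()))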
Assignee
Updated • 8 years ago
Assignee: nobody → mreid
Points: --- → 1
Priority: -- → P1
Comment 5 • 8 years ago
What was the effect of this bug? Did it only affect analysis.tmo, or were dashboards also affected? When did it start happening? Will we be able to see payload/processes in submissions from January once the patch lands?
Flags: needinfo?(mreid)
Comment 7 • 8 years ago
FWIW aggregates.tmo appears unaffected, as https://telemetry.mozilla.org dashboards are showing plenty of child-process data from nightly 54.
Assignee
Comment 8 • 8 years ago
(In reply to Bill McCloskey (:billm) from comment #6)
> What was the effect of this bug? Did it only affect analysis.tmo, or were
> dashboards also affected? When did it start happening? Will we be able to
> see payload/processes in submissions from January once the patch lands?
Code that used the Dataset API on ATMO with (I believe) version 0.5.2 or 0.6.0 of python_moztelemetry would be affected.
The data itself is not lost. The attached code change to python_moztelemetry should make all of the `payload/processes` sections visible for all historic data as well.
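(To confirm which version a given cluster or notebook picked up, a quick check is shown below. This is a sketch, assuming the package was installed under its PyPI distribution name, python_moztelemetry.)

# Print the installed python_moztelemetry version; per the comment above,
# 0.5.2 and 0.6.0 are the versions believed to be affected.
import pkg_resources
print(pkg_resources.get_distribution("python_moztelemetry").version)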
Flags: needinfo?(mreid)
Reporter
Comment 9 • 8 years ago
For anyone interested in finding out whether their job was affected: version 0.5.2 was released on 2017-01-25. You can check the logs of your Spark jobs at $LOG_ROOT/node/$MASTER_EC2_INSTANCE_ID/bootstrap-actions/1/stdout.gz and grep for 'python_moztelemetry'.
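(A sketch of that check in Python, assuming the placeholders in the path above are substituted with real values:)

# Equivalent of grepping the gzipped bootstrap log for 'python_moztelemetry'.
import gzip

log_path = "$LOG_ROOT/node/$MASTER_EC2_INSTANCE_ID/bootstrap-actions/1/stdout.gz"  # substitute real values
with gzip.open(log_path, "rt") as fh:
    for line in fh:
        if "python_moztelemetry" in line:
            print(line.rstrip())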
Comment 10 • 8 years ago
It turns out I was actually wrong: https://telemetry.mozilla.org is indeed affected, but only for data collected from Jan 25 onwards. So older probes may have parent-heavy responses until the DB rows are purged and refilled with correct data. :frank is working on this part.
Assignee
Comment 11 • 8 years ago
OK, the immediate problem is fixed by python_moztelemetry 0.6.1, which I've just deployed.
Please file new follow-up bugs for any related cleanup or backfill as needed.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Comment 12 • 8 years ago
The bug, as I understand it, made any payload.* or environment.* objects that are not in the list of fields we extract specially during ingestion inaccessible via python_moztelemetry. Here's the list (as an aside, we might want to add processes to it):
https://github.com/mozilla-services/lua_sandbox_extensions/blob/master/moz_telemetry/io_modules/decoders/moz_telemetry/ping.lua#L156-L185
I haven't checked extensively, but this appears to be essentially limited to "payload/webrtc" and "payload/processes".
I looked through all the Airflow and ATMO scheduled jobs and didn't find any Python jobs accessing anything in payload or environment that wasn't specially extracted, except for aggregates (which uses "payload/processes"). The fact that aggregates.tmo sees plenty of child-process data from nightly 54 is consistent with this, as there's only a two-day margin between when nightly 54 was cut and this bug being introduced.
So hopefully the amount of backfilling necessary is limited to "payload/processes" histograms for aggregates.tmo.
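(To make the failure mode concrete, here is a purely illustrative sketch, not the actual ingestion or parser code: the decoder promotes a fixed whitelist of paths to separate message fields, and a reader that consults only those promoted fields silently loses everything else, such as payload/processes. The whitelist below is a tiny stand-in for the real list in ping.lua linked above.)

# Illustrative only: a toy whitelist and submission showing why reading
# just the specially extracted fields drops payload.processes.
EXTRACTED = {"payload.histograms", "payload.childPayloads", "environment.build"}

submission = {
    "payload": {"histograms": {}, "childPayloads": [], "processes": {"content": {}}},
    "environment": {"build": {}},
}

extracted_fields = {
    "{}.{}".format(section, key): value
    for section, obj in submission.items()
    for key, value in obj.items()
    if "{}.{}".format(section, key) in EXTRACTED
}

print("payload.processes" in extracted_fields)  # False: the section appears missing
# A correct reader must also fall back to the full submission for any
# path that is not in the extracted list.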
Comment 13 • 8 years ago
(In reply to Wesley Dawson [:whd] from comment #12)
> I looked through all the airflow and ATMO scheduled jobs and didn't find any
> python jobs that were accessing anything in payload or environment that
> wasn't specially extracted except for aggregates (which uses
> "payload/processes"). The fact that aggregates.tmo sees plenty of
> child-process data from nightly 54 supports this, as there's only a 2-day
> margin between when nightly 54 was cut and this bug being introduced.
>
> So hopefully the amount of backfilling necessary is limited to
> "payload/processes" histograms for aggregates.tmo.
The Longitudinal job uses "payload/processes" to extract the scalar data, see https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/views/Longitudinal.scala#L810
Comment 14 • 8 years ago
(In reply to Alessio Placitelli [:Dexter] from comment #13)
> The Longitudinal job uses "payload/processes" to extract the scalar data, see
> https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/views/Longitudinal.scala#L810
The Longitudinal job is written in Scala and is thus unaffected by this bug.
Updated • 6 years ago
Product: Cloud Services → Cloud Services Graveyard