[meta] Ingest mach / Firefox build system data

NEW
Unassigned

Status

Data Platform and Tools
Pipeline Ingestion
P3
normal
2 years ago
2 months ago

People

(Reporter: Katie Parlante, Unassigned)

Tracking

(Depends on: 1 bug, Blocks: 1 bug)

Details

(Reporter)

Description

2 years ago
Dan Minor and his team would like to start collecting local developer build data through the pipeline.

Initial meeting notes:
https://public.etherpad-mozilla.org/p/local-developer-build-metrics-telemetry

Some additional info from the meeting:
- Getting a first pass going is a Q1 goal for dminor
- They have their own tools for analysis once the data is stored in S3
- opt-in, likely low volume and slow growth to start
- for a first pass, probably OK to make the data available to all Mozillians
- a JSON Schema is probably a good idea, perhaps with validation at ingestion

Updated

2 years ago
Priority: -- → P3
Summary: Ingest mach / Firefox build system data → [meta] Ingest mach / Firefox build system data
Depends on: 1242017

Comment 1

2 years ago
I've filed Bug 1244160 to track creating a JSON Schema for this data.

Since this is early on and we're not asking for any processing other than getting the data into S3, my plan is to create a fairly non-restrictive schema. We can lock things down later, once we've collected some data and had an opportunity to iterate on what is being reported.
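
As a rough illustration of what a "fairly non-restrictive" schema could look like, the sketch below requires only two top-level keys and accepts anything else. The field names (`command`, `time`) are hypothetical placeholders, not the actual build-metrics format, and `check` is a hand-rolled stand-in for a real JSON Schema validator that only understands the keywords used here.

```python
# Hypothetical permissive schema: the field names are placeholders,
# not the real mach build-metrics format.
PERMISSIVE_SCHEMA = {
    "type": "object",
    "required": ["command", "time"],
    "properties": {
        "command": {"type": "string"},
        "time": {"type": "number"},
    },
    # Unknown fields are accepted for now; this can be tightened later.
    "additionalProperties": True,
}

# Mapping from the schema type names used above to Python types.
_TYPES = {"string": str, "number": (int, float)}


def check(doc, schema=PERMISSIVE_SCHEMA):
    """Return a list of problems; an empty list means the doc passes."""
    if not isinstance(doc, dict):
        return ["document is not an object"]
    problems = [
        "missing required field: " + key
        for key in schema["required"]
        if key not in doc
    ]
    for key, sub in schema["properties"].items():
        if key in doc and not isinstance(doc[key], _TYPES[sub["type"]]):
            problems.append("wrong type for field: " + key)
    return problems
```

Extra fields pass untouched, which matches the intent of collecting data first and locking the format down afterwards.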

Comment 2

mreid and I chatted about this on #telemetry just now. He believes that for now, submitting data into the existing telemetry endpoint would be fine, and we can change it to point to a custom endpoint later if we need to. He wanted to touch base with people on the platform team to make sure they're OK with us submitting non-Firefox data there, though.
Flags: needinfo?(mreid)

Comment 3

I thought we were doing this already.

You can post build metrics JSON docs to this endpoint (they must conform to the secret format, or be signed with Hawk credentials):
http://54.149.253.188:80/build-metrics-dev

It responds with the URL of the bucket/file where the data was saved. The bucket is private, though, so only those with auth can access it.
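
A minimal sketch of such a submission, using only the Python standard library. It assumes the server accepts a plain JSON body with a `Content-Type: application/json` header, and it does not implement Hawk signing or the required document format; the example client in the MoDataSubmission repository is the real reference. The function names are made up for illustration.

```python
import json
import urllib.request

# Endpoint quoted in this comment; it is a raw IP and may no longer be valid.
ENDPOINT = "http://54.149.253.188:80/build-metrics-dev"


def build_request(doc, url=ENDPOINT):
    """Build (but do not send) a POST request carrying doc as JSON."""
    body = json.dumps(doc).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        # Assumption: the server expects a JSON content type.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit(doc, url=ENDPOINT, timeout=10):
    """Send the document and return the server's response body as text.

    Per the comment, the response should contain the URL of the
    bucket/file where the data was stored.
    """
    with urllib.request.urlopen(build_request(doc, url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Splitting request construction from sending keeps the payload easy to inspect without touching the network.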

The code for the server, and an example client, is here:
https://github.com/klahnakoski/MoDataSubmission

I also have a block of code (not in use, barely tested) that translates these resource objects to denormalized JSON documents.
https://github.com/klahnakoski/ActiveData-ETL/blob/dev/activedata_etl/imports/resource_usage.py

I believed the next step was to let telemetry deal with the contents of the bucket.

Comment 4

We do have code to submit data on an opt-in basis. I looked at my local machine (where I enabled opt-in a while ago) and it was failing to submit, probably because the server IP we have in the code is no longer correct:
https://dxr.mozilla.org/mozilla-central/rev/1d025ac534a6333a8170a59a95a8a3673d4028ee/build/submit_telemetry_data.py#17

Changing the IP to the one you provided seems to work locally, but our goal is to have this enabled by default, so I think we need a more stable endpoint.

Comment 5

a year ago
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #2)
> mreid and I chatted about this on #telemetry just now. He believes that for
> now, submitting data into the existing telemetry endpoint would be fine, and
> we can change it to point to a custom endpoint later if we need to. He
> wanted to touch base with people on the platform team to make sure they're
> OK with us submitting non-Firefox data there, though.

I got some feedback from :trink and :whd, and we figure that while sending through the telemetry endpoint would technically work, it makes more sense to treat this as a new endpoint (per bug 1242017).

It's great news that there's already a JSON Schema for these docs at [1], though a few enhancements would help. In particular, the "array" type fields should declare a data type for their elements (using the "items" field), and I noticed that :ted's example submission has a few extra fields that are not covered by the current schema. If we can get these tweaks in, we could consider writing this data out directly in Parquet form so it is queryable via sql.telemetry.m.o
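
To illustrate the "items" suggestion: an array schema without "items" accepts elements of any type, while one with "items" pins the element type, which a columnar format such as Parquet needs in order to define the column. The schemas below are hypothetical fragments, not taken from the actual schema at [1], and `array_ok` is a hand-rolled check rather than a full JSON Schema validator.

```python
# Hypothetical array schemas; the real schema's array fields may differ.
LOOSE = {"type": "array"}                               # element type unspecified
TYPED = {"type": "array", "items": {"type": "string"}}  # every element is a string

# Mapping from JSON Schema type names to Python types (subset only).
_PY_TYPES = {"string": str, "integer": int, "number": (int, float)}


def array_ok(value, schema):
    """Check a value against an array schema, honoring 'items' if present."""
    if not isinstance(value, list):
        return False
    items = schema.get("items")
    if items is None:
        return True  # no constraint on the elements
    expected = _PY_TYPES[items["type"]]
    return all(isinstance(element, expected) for element in value)
```

A mixed list such as `["a", 1]` passes the loose schema but fails the typed one.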
Flags: needinfo?(mreid)

Comment 6

We had a meeting today to hash out the details here:
https://public.etherpad-mozilla.org/p/20170508-mach-telemetry-ingestion

In summary, the plan is to use the in-development "generic ingestion service", ETA mid-June. There's an API spec available (to be provided by mreid) and we can code to that.

Updated

a year ago
Component: Metrics: Pipeline → Pipeline Ingestion
Product: Cloud Services → Data Platform and Tools