Closed Bug 1541065 Opened 6 years ago Closed 5 years ago

Create a Draft Schema to Provide to Perf team for importing Talos/Perf Data into Parquet

Categories

(Data Science :: General, task)

Hardware: x86_64
OS: macOS
Type: task
Priority: Not set
Severity: normal
Points: 3

Tracking

(Not tracked)

RESOLVED INACTIVE

People

(Reporter: joy, Assigned: joy)

References

(Depends on 1 open bug)

Details

Attachments

(1 file)

Brief Description of the request (required):

Currently, Talos data is tucked away behind a REST API or ActiveData. The former is not well suited to data analysis workflows (large amounts of data must be fetched) and the latter has a non-standard query interface. We want to get the raw data into Parquet.

Business purpose for this request (required):

Analyzing the data in its current state means working with either the REST APIs or ActiveData's custom query syntax. All our analysts are trained to use Spark (Python) or SQL. If all our data is stored in Parquet, there is little lead time before an analyst can study this data.

Requested timelines for the request or how this fits into roadmaps or critical decisions (required):

Next two weeks

Links to any assets (e.g. start of a PHD, BRD; any document that helps describe the project):

https://metrics.mozilla.com/~sguha/mz/indicesv3/talos/example_talos_measures.html

Name of Data Scientist (If Applicable):

Saptarshi and Corey

Please note that if not enough information is provided, triage of this request will be delayed.

A first draft

Based on the following query:

https://treeherder.mozilla.org/api/project/mozilla-central/performance/signatures/?interval=3000

The fields wanted are as follows (see the fetch sketch after this list):

Name: Signature-Info

  • signature
  • channel (here mozilla-central)
  • id
  • machine_platform
  • framework_id (ideally translated to a test string)
  • option_collection_hash (converted using the output from a call to
    https://treeherder.mozilla.org/api/optioncollectionhash/)
  • suite
  • has_subtests: **should be FALSE or missing** (we do not want signatures corresponding
    to aggregates; our analyses will aggregate the raw performance data)
  • no need to record parent_signature since this will refer to the aggregated signature
  • extra_options
  • lower_is_better
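
As a rough illustration, here is a minimal Python sketch of pulling these fields from the signatures endpoint above. The response shapes (a JSON object keyed by signature hash, and a list of hash-to-options mappings from the optioncollectionhash endpoint) are assumptions based on the queries in this comment and should be verified against the live API.

import requests

BASE = "https://treeherder.mozilla.org/api"

def fetch_signatures(project="mozilla-central", interval=3000):
    # Assumed: the endpoint returns a JSON object keyed by signature hash.
    resp = requests.get(
        f"{BASE}/project/{project}/performance/signatures/",
        params={"interval": interval},
        timeout=60,
    )
    resp.raise_for_status()
    raw = resp.json()

    # Assumed shape: a list of {option_collection_hash, options: [{name}, ...]}
    # entries, used to translate option_collection_hash into readable names.
    opt_resp = requests.get(f"{BASE}/optioncollectionhash/", timeout=60)
    opt_resp.raise_for_status()
    options = {
        entry["option_collection_hash"]: [o["name"] for o in entry["options"]]
        for entry in opt_resp.json()
    }

    rows = []
    for sig_hash, sig in raw.items():
        # Skip aggregate signatures; our analyses aggregate the raw data.
        if sig.get("has_subtests"):
            continue
        rows.append({
            "signature": sig_hash,
            "channel": project,
            "id": sig.get("id"),
            "machine_platform": sig.get("machine_platform"),
            # Translating framework_id to a test string is left out here.
            "framework_id": sig.get("framework_id"),
            "options": options.get(sig.get("option_collection_hash")),
            "suite": sig.get("suite"),
            "extra_options": sig.get("extra_options"),
            "lower_is_better": sig.get("lower_is_better"),
        })
    return rows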

This gives information about the signature, which is a combination of
machine platform, options, and suite, but it does not tell us about an actual run
of such a test. Note that we do not want the signature info and the performance data in one table.
The performance queries contain a push_id, which corresponds to the set of commits
the performance tests are checked on, and a job_id, which corresponds to an actual Talos (and build)
run. A given job_id corresponds to one push_id, but a push_id can have many job_ids (corresponding
to the different tests and platforms).

From a query such as

https://treeherder.mozilla.org/api/project/mozilla-central/performance/data/?interval=86400&signatures=8d3706a1ca47dc996fdcdc77e95f8d7b141d207d

we want to include, along with the signature information (see the sketch after this list):

  • job_id
  • push_id
  • push_timestamp
  • value
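
A companion sketch for the performance-data endpoint, under the same caveats (the response is assumed to map each signature hash to a list of datum objects carrying the fields listed above):

import requests

BASE = "https://treeherder.mozilla.org/api"

def fetch_perf_data(signature, project="mozilla-central", interval=86400):
    resp = requests.get(
        f"{BASE}/project/{project}/performance/data/",
        params={"interval": interval, "signatures": signature},
        timeout=60,
    )
    resp.raise_for_status()
    # Assumed shape: {"<hash>": [{job_id, push_id, push_timestamp, value}, ...]}
    datums = resp.json().get(signature, [])
    return [
        {
            "signature": signature,
            "job_id": d.get("job_id"),
            "push_id": d.get("push_id"),
            "push_timestamp": d.get("push_timestamp"),
            "value": d.get("value"),
        }
        for d in datums
    ]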

Finally, we would like to include (or at least be able to join, on push_id and channel, with) a push table (i.e., bugs
related to this push) via

https://treeherder.mozilla.org/api/project/mozilla-central/push/447720/

The result is nicely structured and would fit well into a Parquet Struct/Array
schema.
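
As a hypothetical illustration of what such a Struct/Array schema could look like in Spark (the nested field names here are guesses at the push endpoint's revisions structure, not a confirmed layout):

from pyspark.sql.types import (
    ArrayType, LongType, StringType, StructField, StructType,
)

# Sketch of a possible push-table schema; revisions is the nested
# Array-of-Structs part that maps naturally onto Parquet.
push_schema = StructType([
    StructField("push_id", LongType()),
    StructField("channel", StringType()),
    StructField("push_timestamp", LongType()),
    StructField("revisions", ArrayType(StructType([
        StructField("revision", StringType()),
        StructField("author", StringType()),
        StructField("comments", StringType()),
    ]))),
])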

Eric, does this sound like a good start?

Flags: needinfo?(esmyth)

Also, I think the update frequency ought to be 2-3x a day (to match the rate of pushes). For example, between "2019-04-07 14:48:20 PDT" and "2019-04-08 12:30:06 PDT" (about 21 hours), there were on mozilla-central:

  • 7 pushes
  • that launched 3540 jobs
  • across 189 test suites

Running this at least 3x a day would likely match the push rate of mozilla-central.

Assignee: nobody → sguha
Status: NEW → ASSIGNED
Points: --- → 3
Flags: needinfo?(esmyth)

I am investigating the MySQL instance, and the ability to connect to it from Python/R/Databricks, to determine:
1.) whether migration to Parquet is necessary
2.) if so, whether the MySQL DB should be utilized as the import source

This is awaiting read-only access to the MySQL instance (see the connection sketch after the links below).

https://sqoop.apache.org/
https://docs.databricks.com/spark/latest/data-sources/sql-databases.html#python-example
https://treeherder.readthedocs.io/accessing_data.html#direct-database-access
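
For the Databricks route, a minimal sketch of reading one table over JDBC and writing it out as Parquet, following the Databricks docs linked above. The host, database, table, secret scope, and output path are placeholders, and `spark`/`dbutils` are assumed to be provided by the Databricks notebook environment:

# Read a Treeherder table from the (read-only) MySQL instance into Spark.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://treeherder-ro.example.internal:3306/treeherder")  # placeholder host
    .option("dbtable", "performance_datum")  # placeholder table
    .option("user", "readonly_user")  # placeholder credentials
    .option("password", dbutils.secrets.get(scope="treeherder", key="ro-password"))
    .load()
)

# Write it back out as Parquet to a placeholder location.
df.write.mode("overwrite").parquet("s3://example-bucket/treeherder/performance_datum")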

Depends on: 1543445

If migration is necessary, it appears we can utilize https://sqoop.apache.org/ to convert. However, this still requires specifying the conversion, specifically the mapping from SQL tables to Parquet. A 1-1 mapping is likely not desirable, given the normalized and, consequently, numerous tables of the SQL DB. This could clutter the existing Parquet tables.

  • A subset of tables?
  • Condensing the tables into a flatter, non-normalized format?

As a first cut, does the table example provided above work? I believe most of that could be obtained from the performance datum, signatures, and push tables.

Migration to Parquet is necessary for the following reasons:

  1. The MySQL DB is only for development purposes: it is not available for DS purposes.
  2. Potential analytics required by data science can be very computation- and/or memory-demanding. The Parquet format allows distribution of the analytics across standard or custom clusters. Therefore, we will not be overly punishing any DB server (and producing angry DBAs).

In reference to :ekyle's comment:

":joy, I do not need an explanation about why you want the data; I am already convinced. Please elaborate on how you want the data, because I may be able to help."

Kyle,
My understanding of the Mozilla generic ingestor is that it consumes JSON, conforming to a predefined schema, from an HTTPS endpoint. An example of this process exists for our WebPageTest instance.

Therefore, we could use your insight/help on how to emit new ActiveData records (or transform those going into ActiveData) into the (yet-to-be-defined) valid JSON format.

Currently, we are implementing the BigQuery/Parquet schema for the data residing in ActiveData; that is, we are in the data-architecture phase. Once that is complete, the process of implementing the JSON schema to validate against can begin (a validation sketch follows below).
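
Once defined, validation could look something like this minimal sketch using the jsonschema library; the schema below is a placeholder built from the draft fields earlier in this bug, not the final schema:

import jsonschema

# Placeholder schema for one performance datum row (fields from the draft above).
PERF_DATUM_SCHEMA = {
    "type": "object",
    "required": ["signature", "job_id", "push_id", "push_timestamp", "value"],
    "properties": {
        "signature": {"type": "string"},
        "job_id": {"type": "integer"},
        "push_id": {"type": "integer"},
        "push_timestamp": {"type": "integer"},
        "value": {"type": "number"},
    },
}

# Example record using the signature hash and push id cited in this bug;
# the other values are made up for illustration.
record = {
    "signature": "8d3706a1ca47dc996fdcdc77e95f8d7b141d207d",
    "job_id": 1,
    "push_id": 447720,
    "push_timestamp": 1554500000,
    "value": 123.4,
}

# Raises jsonschema.ValidationError if the record does not conform.
jsonschema.validate(instance=record, schema=PERF_DATUM_SCHEMA)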

Flags: needinfo?(klahnakoski)

ActiveData has a few tables, each with its own schema, none of them defined by humans.

The original PERFHERDER records, found in the logs, have a human-declared schema: https://github.com/mozilla/treeherder/blob/master/schemas/performance-artifact.json

I attached an example schema out of ES. It is not a formal JSONSchema, but may be good enough for you to auto-translate to one.

The page snapshots, used in each run_test, are updated periodically. However, there is currently no field in ActiveData or elsewhere which stores this information. This information is key to analyzing these datasets, due to the strong dependency of the metrics on the page snapshot.

1.) This information needs to be made available, if not emitted from TaskCluster. Once available, it needs to be retrieved.
2.) The schema needs to be updated to support this information.

Depends on: 1558055

Page snapshot update timestamps are now being added to the log:

https://bugzilla.mozilla.org/show_bug.cgi?id=1558764

Work for the DS team is now tracked in Jira. You can search the Data Science Jira project for the corresponding ticket.

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → INACTIVE
