Closed Bug 1541065 Opened 6 years ago Closed 5 years ago

Create a Draft Schema to Provide to Perf team for importing Talos/Perf Data into Parquet

Categories

(Data Science :: General, task)

Hardware: x86_64
OS: macOS
Type: task
Priority: Not set
Severity: normal
Points: 3

Tracking

(Not tracked)

RESOLVED INACTIVE

People

(Reporter: joy, Assigned: joy)

References

(Depends on 1 open bug)

Details

Attachments

(1 file)

Brief Description of the request (required):

Currently, Talos data is tucked away behind a REST API or ActiveData. The former is not well suited to data analysis workflows (large amounts of data must be fetched) and the latter has a non-standard query interface. We want to get the raw data into Parquet.

Business purpose for this request (required):

Analyzing the data in its current state means working with either the REST APIs or ActiveData's custom query syntax. All our analysts are trained to use Spark (Python) or SQL. If all our data is stored in Parquet, there is little lead time before an analyst can study this data.

Requested timelines for the request or how this fits into roadmaps or critical decisions (required):

Next two weeks

Links to any assets (e.g. start of a PHD, BRD; any document that helps describe the project):

https://metrics.mozilla.com/~sguha/mz/indicesv3/talos/example_talos_measures.html

Name of Data Scientist (If Applicable):

Saptarshi and Corey

Please note that if not enough information is provided, triage of this request will be delayed.

A first draft

Based on the following query:

https://treeherder.mozilla.org/api/project/mozilla-central/performance/signatures/?interval=3000

The fields wanted are as follows (see the fetch sketch after this list):

Name: Signature-Info

  • signature
  • channel (here mozilla-central)
  • id
  • machine_platform
  • framework_id (ideally translated to a test string)
  • option_collection_hash (converted using the output from a call to
    https://treeherder.mozilla.org/api/optioncollectionhash/)
  • suite
  • has_subtests: **should be FALSE or missing** (we do not want signatures corresponding
    to aggregates; our analyses will aggregate the raw performance data)
  • no need to record parent_signature since this will refer to the aggregated signature
  • extra_options
  • lower_is_better
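
As a rough illustration, here is a minimal Python sketch of pulling these fields from the signatures endpoint above. The response shapes (a JSON object keyed by signature hash, and a list of hash-to-options mappings from the optioncollectionhash endpoint) are assumptions based on the queries in this comment and should be verified against the live API.

import requests

BASE = "https://treeherder.mozilla.org/api"

def fetch_signatures(project="mozilla-central", interval=3000):
    # Assumed: the endpoint returns a JSON object keyed by signature hash.
    resp = requests.get(
        f"{BASE}/project/{project}/performance/signatures/",
        params={"interval": interval},
        timeout=60,
    )
    resp.raise_for_status()
    raw = resp.json()

    # Assumed shape: a list of {option_collection_hash, options: [{name}, ...]}
    # entries, used to translate option_collection_hash into readable names.
    opt_resp = requests.get(f"{BASE}/optioncollectionhash/", timeout=60)
    opt_resp.raise_for_status()
    options = {
        entry["option_collection_hash"]: [o["name"] for o in entry["options"]]
        for entry in opt_resp.json()
    }

    rows = []
    for sig_hash, sig in raw.items():
        # Skip aggregate signatures; our analyses aggregate the raw data.
        if sig.get("has_subtests"):
            continue
        rows.append({
            "signature": sig_hash,
            "channel": project,
            "id": sig.get("id"),
            "machine_platform": sig.get("machine_platform"),
            # Translating framework_id to a test string is left out here.
            "framework_id": sig.get("framework_id"),
            "options": options.get(sig.get("option_collection_hash")),
            "suite": sig.get("suite"),
            "extra_options": sig.get("extra_options"),
            "lower_is_better": sig.get("lower_is_better"),
        })
    return rows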

This gives information about the signature, which is a combination of
machine platform, options, and suite, but it does not tell us about an actual run
of such a test. Note that we do not want the signature info and the performance data in one table.
The performance queries contain a push_id, which corresponds to the set of commits
the performance tests are checked on, and a job_id, which corresponds to an actual Talos (and build)
run. A given job_id corresponds to one push_id, but a push_id can have many job_ids (corresponding
to the different tests and platforms).

From a query such as

https://treeherder.mozilla.org/api/project/mozilla-central/performance/data/?interval=86400&signatures=8d3706a1ca47dc996fdcdc77e95f8d7b141d207d

we want to include, along with the signature information (see the sketch after this list):

  • job_id
  • push_id
  • push_timestamp
  • value
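
A companion sketch for the performance-data endpoint, under the same caveats (the response is assumed to map each signature hash to a list of datum objects carrying the fields listed above):

import requests

BASE = "https://treeherder.mozilla.org/api"

def fetch_perf_data(signature, project="mozilla-central", interval=86400):
    resp = requests.get(
        f"{BASE}/project/{project}/performance/data/",
        params={"interval": interval, "signatures": signature},
        timeout=60,
    )
    resp.raise_for_status()
    # Assumed shape: {"<hash>": [{job_id, push_id, push_timestamp, value}, ...]}
    datums = resp.json().get(signature, [])
    return [
        {
            "signature": signature,
            "job_id": d.get("job_id"),
            "push_id": d.get("push_id"),
            "push_timestamp": d.get("push_timestamp"),
            "value": d.get("value"),
        }
        for d in datums
    ]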

Finally, we would like to include (or at least be able to join, on push_id and channel, with) a push table (i.e., bugs
related to this push) via

https://treeherder.mozilla.org/api/project/mozilla-central/push/447720/

The result is nicely structured and would fit well into a Parquet Struct/Array
schema.
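
As a hypothetical illustration of what such a Struct/Array schema could look like in Spark (the nested field names here are guesses at the push endpoint's revisions structure, not a confirmed layout):

from pyspark.sql.types import (
    ArrayType, LongType, StringType, StructField, StructType,
)

# Sketch of a possible push-table schema; revisions is the nested
# Array-of-Structs part that maps naturally onto Parquet.
push_schema = StructType([
    StructField("push_id", LongType()),
    StructField("channel", StringType()),
    StructField("push_timestamp", LongType()),
    StructField("revisions", ArrayType(StructType([
        StructField("revision", StringType()),
        StructField("author", StringType()),
        StructField("comments", StringType()),
    ]))),
])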

Eric, does this sound like a good start?

Flags: needinfo?(esmyth)

Also, I think the update frequency ought to be 2-3x a day (to match the rate of pushes). For example, between "2019-04-07 14:48:20 PDT" and "2019-04-08 12:30:06 PDT" (about 21 hours), there were on mozilla-central:

  • 7 pushes
  • that launched 3540 jobs
  • across 189 test suites

Running this at least 3x a day would likely match the push rate of mozilla-central.

Assignee: nobody → sguha
Status: NEW → ASSIGNED
Points: --- → 3
Flags: needinfo?(esmyth)

I am investigating the MySQL instance, and the ability to connect to it from Python/R/Databricks, to determine:
1.) whether migration to Parquet is necessary
2.) if so, whether the MySQL DB should be utilized as the import source

This is awaiting read-only access to the MySQL instance (see the connection sketch after the links below).

https://sqoop.apache.org/
https://docs.databricks.com/spark/latest/data-sources/sql-databases.html#python-example
https://treeherder.readthedocs.io/accessing_data.html#direct-database-access
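
For the Databricks route, a minimal sketch of reading one table over JDBC and writing it out as Parquet, following the Databricks docs linked above. The host, database, table, secret scope, and output path are placeholders, and `spark`/`dbutils` are assumed to be provided by the Databricks notebook environment:

# Read a Treeherder table from the (read-only) MySQL instance into Spark.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://treeherder-ro.example.internal:3306/treeherder")  # placeholder host
    .option("dbtable", "performance_datum")  # placeholder table
    .option("user", "readonly_user")  # placeholder credentials
    .option("password", dbutils.secrets.get(scope="treeherder", key="ro-password"))
    .load()
)

# Write it back out as Parquet to a placeholder location.
df.write.mode("overwrite").parquet("s3://example-bucket/treeherder/performance_datum")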

Depends on: 1543445

If migration is necessary, it appears we can utilize https://sqoop.apache.org/ to convert. However, this still requires specifying the conversion, specifically the mapping from SQL tables to Parquet. A 1-1 mapping is likely not desirable, given the normalized and, consequently, numerous tables of the SQL DB. This could clutter the existing Parquet tables.

  • A subset of tables?
  • Condensing the tables into a flatter, non-normalized format?

As a first cut, does the table example provided above work? I believe most of that could be obtained from the performance datum, signatures, and push tables.

Migration to Parquet is necessary for the following reasons:

  1. The MySQL DB is only for development purposes: it is not available for DS purposes.
  2. Potential analytics required by data science can be very computation- and/or memory-demanding. The Parquet format allows distribution of the analytics across standard or custom clusters. Therefore, we will not be overly punishing any DB server (and producing angry DBAs).

In reference to :ekyle's comment:

":joy, I do not need an explanation about why you want the data; I am already convinced. Please elaborate on how you want the data, because I may be able to help."

Kyle,
My understanding of the Mozilla generic ingestor is that it consumes JSON, conforming to a predefined schema, from an HTTPS endpoint. An example of this process exists for our WebPageTest instance.

Therefore, we could use your insight/help on how to emit new ActiveData records (or transform those going into ActiveData) into the (yet-to-be-defined) valid JSON format.

Currently, we are implementing the BigQuery/Parquet schema for the data residing in ActiveData; that is, we are in the data-architecture phase. Once that is complete, the process of implementing the JSON schema to validate against can begin (a validation sketch follows below).
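
Once defined, validation could look something like this minimal sketch using the jsonschema library; the schema below is a placeholder built from the draft fields earlier in this bug, not the final schema:

import jsonschema

# Placeholder schema for one performance datum row (fields from the draft above).
PERF_DATUM_SCHEMA = {
    "type": "object",
    "required": ["signature", "job_id", "push_id", "push_timestamp", "value"],
    "properties": {
        "signature": {"type": "string"},
        "job_id": {"type": "integer"},
        "push_id": {"type": "integer"},
        "push_timestamp": {"type": "integer"},
        "value": {"type": "number"},
    },
}

# Example record using the signature hash and push id cited in this bug;
# the other values are made up for illustration.
record = {
    "signature": "8d3706a1ca47dc996fdcdc77e95f8d7b141d207d",
    "job_id": 1,
    "push_id": 447720,
    "push_timestamp": 1554500000,
    "value": 123.4,
}

# Raises jsonschema.ValidationError if the record does not conform.
jsonschema.validate(instance=record, schema=PERF_DATUM_SCHEMA)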

Flags: needinfo?(klahnakoski)

ActiveData has a few tables, each with its own schema, none of them defined by humans.

The original PERFHERDER records, found in the logs, have a human-declared schema: https://github.com/mozilla/treeherder/blob/master/schemas/performance-artifact.json

I attached an example schema out of ES. It is not a formal JSONSchema, but may be good enough for you to auto-translate to one.

The page snapshots, used in each run_test, are updated periodically. However, there is currently no field in ActiveData or elsewhere which stores this information. This information is key to analyzing these datasets, due to the strong dependency of the metrics on the page snapshot.

1.) This information needs to be made available, if not emitted from TaskCluster. Once available, it needs to be retrieved.
2.) The schema needs to be updated to support this information.

Depends on: 1558055

Page snapshot update timestamps are now being added to the log:

https://bugzilla.mozilla.org/show_bug.cgi?id=1558764

Work for the DS team is now tracked in Jira. You can search the Data Science Jira project for the corresponding ticket.

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → INACTIVE
