Corey Dow-Hygelund [:ccd]

Reporter

Description

•

6 years ago

•

Edited

Please provide access to Corey Dow-Hygelund for direct read-only access to the TreeHerder MySQL instance.

Use-case

Access performance data from Python (or R) for use in modeling performance, specifically regarding regressions and comparison of benchmark data to telemetry from Nightly/Beta/Release branches.

Utilizing a JDBC driver from Databricks for usage within Spark framework.
Programmatic access with Python using sqlalchemy library.

https://docs.databricks.com/spark/latest/data-sources/sql-databases.html#python-example

https://treeherder.readthedocs.io/accessing_data.html#direct-database-access

Corey Dow-Hygelund [:ccd]

Reporter

Updated

•

6 years ago

Blocks: 1541065

"Saptarshi Guha[:joy]"

Comment 1

•

6 years ago

Could i gain access too(same use case)

Corey Dow-Hygelund [:ccd]

Reporter

Updated

•

6 years ago

Blocks: 1542404

Karl Thiessen [:kthiessen, he/him]

Comment 2

•

6 years ago

Hi, folks. I'm Karl, and I manage the Treeherder team; I've just come back from a week of PTO. Let me check in with the team and find out who currently has the access (and the time) to make read-only replicas.

NI'ing myself to remind me to come back to this bug with an update in less than 48 hours.

Assignee: nobody → kthiessen

Flags: needinfo?(kthiessen)

Sarah Clements [:sclements]

Comment 3

•

6 years ago

Hi Corey, can you give more details/context about what you're planning to do with the read-only access? I'm not familiar with data-bricks.

Sarah Clements [:sclements]

Comment 4

•

6 years ago

•

Edited

If you're just wanting access to data, per :jmaher ActiveData should have all the Perfherder data (:ekyle is a good person to ask about this) and redash has a full read only database as well. The TH read-only replica is specifically used for development purposes.

"Saptarshi Guha[:joy]"

Comment 5

•

6 years ago

I think some of the reasons i can outline

a) activedata has a custom query syntax, rather we would like to stick to SQL used from within R/Python
b) we want data going back months (for analysis reasons) and iirc active data failed here (no sure about redash)
c) analysis can often be interactive and redash doesn't easily enabled CLI/Databricks based analysis (you need to write the query in redash, get the URL to the CSV file and then download that)

Sarah Clements [:sclements]

Comment 6

•

6 years ago

Joel, is there anything you can add?

Flags: needinfo?(jmaher)

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 7

•

6 years ago

activedata should have a long history and I only use it via python via https://github.com/mozilla/active-data-recipes, I believe that should work for you to query it.

the other option of redash is to use a python library to access it:
https://github.com/mozilla/libmozdata/blob/master/libmozdata/redash.py

I have done the redash query -> csv file -> small import shim many times, that is frustrating.

Flags: needinfo?(jmaher)

Corey Dow-Hygelund [:ccd]

Reporter

Comment 8

•

6 years ago

Saptarshi and I have both tried to utilize ActiveData, and, though we can access the data, its retrieval functionality will not work for data science purposes. I give additional reasons below in addition to what Saptarshi gave above. Two such data science tasks that require access to this data are determining a single metric that encapsulates the important information of multiple tp6 runs, and investigating how Raptor/Talos measures relate to what we are observing in telemetry.

An important item to mention is that all data scientists at Mozilla are well-versed in SQL. For this performance data to be effectively utilized for advanced analytics requires it to be in a commonly accessible format, such as SQL, in frameworks that enable advanced analytics such as Jupyter, DataBricks, or RStudio. Currently that isn't possible, but will be enabled if this data is made available as parquet tables in our telemetry pipeline.

ActiveData:

Query times are prohibitive, in addition to data munging times
Syntax is custom and not well-documented.
In addition syntax is prohibitive, such as lack of ability to filter
Custom code must be written that requires significant data munging
Requires building custom cronjobs to build the requisite datasets. This isn't maintainable nor efficient.

libmozdata

Requires implementing the queries on the redash interface to generate a query ID. Interactive analytics is difficult.
Library isn't well documented and requires reading code-base to effectively utilize. This is prohibitive for general DS use.

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 9

•

6 years ago

Thanks for the explanation- The only other questions I would have is what data do you specifically need? Do you need raw replicates, or summarizes reported values (i.e. geomean)? Activedata would have the raw data, but treeherder/perfherder only stores the top level metrics.

Possibly based on your use cases the summarized top level metrics is all you need.

Corey Dow-Hygelund [:ccd]

Reporter

Comment 10

•

6 years ago

•

Edited

We need the raw replicate values, especially for the single metric task, as the results of this might replace geomean as the aggregation metric.

Does this mean that the MySQL instance only stores the top level metrics? If this is correct, then we need to utilize the data coming from ActiveData.

"Saptarshi Guha[:joy]"

Comment 11

•

6 years ago

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #9)

Thanks for the explanation- The only other questions I would have is what data do you specifically need? Do you need raw replicates, or summarizes reported values (i.e. geomean)? Activedata would have the raw data, but treeherder/perfherder only stores the top level metrics.

Possibly based on your use cases the summarized top level metrics is all you need.

as corey pointed out we need the raw data. Moreover IIRC the redash tables (which are based on the MSQL data source) do contain the raw data? This is definitively true using the REST API which i guess also has its data derived from the MSQL tables.

"Saptarshi Guha[:joy]"

Comment 12

•

6 years ago

Hmm, do you mean the field "replicates" in the actual job data url is not present MySQL? For example consider this query

select performance_datum .*,  performance_signature .*
from performance_datum  join   performance_signature 
on performance_datum.signature_id = performance_signature.id
where  performance_datum.push_id=447731 
and performance_datum.job_id=234973585

output here: https://sql.telemetry.mozilla.org/queries/62346/source

[1] we can get the job log url: https://treeherder.mozilla.org/api/project/mozilla-central/job-log-url/?job_id=234973585
[2] and from here the actual log: https://taskcluster-artifacts.net/H1xhkACaTGqfAQIYIV62eA/0/public/logs/live_backing.log

we can search for the first entry 44.19 (corresponding to id=755748654 in the result of the above query) which is the overall score of raptor-stylebench-firefox which has several subtests one of which is Structural pseudo classes/Removing leaf elements - 4/Sync and that value is 0.2 (see id=755748658 in the result of the above query). If you search for 0.49 in the link above [2], we see Structural pseudo classes/Removing leaf elements - 4/Sync has replicates

{"name": "Structural pseudo classes/Removing leaf elements - 4/Sync", "lowerIsBetter": true, "alertThreshold": 2.0, "replicates": [1.58, 1.52, 1.12, 1.4, 1.08, 1.04, 1.54, 1.08, 1.1, 1.46, 1.3, 1.12, 1.3, 1.4, 1.14, 1.4, 1.24, 1.0, 1.46, 1.16, 1.08, 1.4, 1.2, 1.2, 1.14], "value": 1.2, "unit": "score"}

Q: you mean the Perfheder/Treeherder/Redash data does not contain the replicates data?

cheers
Saptarshi

Flags: needinfo?(joel.maher)

Karl Thiessen [:kthiessen, he/him]

Comment 13

•

6 years ago

I'm going to hand this bug to Joel, as he has more historical context than I do. Please continue to work toward consensus about the best place to get the data you need, and my team will do what we can to make it easy for you.

Assignee: kthiessen → jmaher

Flags: needinfo?(kthiessen)

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 14

•

6 years ago

if you look at redash treeherder database:
select * from performance_datum where job_id='240144999';

this is related to this log file:
https://taskcluster-artifacts.net/ZJG3sUI2SCO9V_gkJuYaTQ/0/public/logs/live_backing.log

we only have the summarized values (for each subtest as well, but only the summarized value which typically is the geometric mean of the replicates), not the raw replicates

on the flipside, we have all the raw metrics accessible via activedata:
https://activedata.allizom.org/tools/query.html#query_id=QsO5yJej

it did take me longer to find the data in activedata (15 minutes) than redash (5 minutes). After the activedata upgrade in october the uptime and response time is on par with redash. It is still behind in processing (sometimes a few days, typically <24 hours)

:joy, does this make sense? possibly we can help with some code to bridge the gap on discoverability so this could be less painful

Flags: needinfo?(joel.maher) → needinfo?(sguha)

Kyle Lahnakoski [:ekyle]

Comment 15

•

6 years ago

:joy

Last I heard, there was plan to provide me with some code that will accept JSON documents and write parquet files. The hope was the ActiveData-ETL pipeline would use that code so your toolset can read it. Maybe that is not enough? Since you will still need to scan an S3 bucket to find all the parquet files.

Data

You can scan the S3 bucket for all the files containing JSON perf records:

Streaming

There have been other requests for large amounts of data. Maybe we can work together on something that will deliver the data in a format that is acceptable? The query language would not change, but that is a minor setup cost compared to the ongoing extraction.

request for more data
There are 5.5million raptor records for 2019 200million overall.
ES can now "scroll" through data maybe this is "easy".

SQL/JDBC

It seems you are not just asking for data, but also compute power providing a SQL/JDBC interface. We do not have all three in one spot:

We have the data, but it is S3
JDBC/SQL can be had with a Treeherder replica, but it is too weak for what you plan.
We have computer power in the ES cluster; but given your concerns about the ActiveData query interface, I am sure you will reject the ES Scripting required

Maybe we can get ActiveData to provide a Sqlite database containing some slice of the data? This will provide the data in SQL-queryable form, but I suspect you have a particular flavor of SQL in mind.

"Saptarshi Guha[:joy]"

Comment 16

•

6 years ago

Hi Kyle, the approach of accepting JSON and writing a (specific) JSON-schema compliant output is still on. It's finding resources. Corey will describe more why we want our Performance and Build data ingested via Mozilla's Generic Ingestion tool.

Flags: needinfo?(sguha) → needinfo?(cdowhygelund)

Kyle Lahnakoski [:ekyle]

Comment 17

•

6 years ago

:joy

I do not need an explanation about why you want the data; I am already convinced. Please elaborate on how you want the data, because I may be able to help.

Corey Dow-Hygelund [:ccd]

Reporter

Comment 18

•

6 years ago

Kyle, moving this conversation to the bug initiating this process of adding perf/build data to the Mozilla ingestion pipeline.

Flags: needinfo?(cdowhygelund)

Corey Dow-Hygelund [:ccd]

Reporter

Updated

•

6 years ago

Status: NEW → RESOLVED

Closed: 6 years ago

Resolution: --- → WONTFIX

Request read-only TreeHerder MySQL access

Use-case

Data

Streaming

SQL/JDBC