SSL Ratios dashboard stopped updating April 17th
Categories
(Data Platform and Tools :: General, defect, P1)
Tracking
(Not tracked)
People
(Reporter: jcj, Assigned: rmiller)
Attachments
(1 file)
The dataset for SSL Ratios, added in Bug 1414839, used by Let's Encrypt (https://docs.telemetry.mozilla.org/datasets/other/ssl/reference.html) is populated from data-dumps via this query on Redash:
https://sql.telemetry.mozilla.org/queries/49323/source#table
The data has stalled as of 17 April 2019, and that matches Redash's note that it's been three months since the query last ran.
Do we need a new URL / Redash query?
Comment 1•6 years ago
I hit the execute button just now and it completed fine, in 11 minutes. Maybe we need to turn the scheduled job off and on again.
Comment 2•6 years ago
Daniel, what do you think about moving this into bigquery-etl (and changing the redash query to "select * from ssl_data" to maintain the existing public-facing plumbing)?
Rob, can you take a look at why this query stopped running on its schedule in STMO?
Comment 3•6 years ago
(In reply to Mark Reid [:mreid] from comment #2)
> Daniel, what do you think about moving this into bigquery-etl (and changing the redash query to "select * from ssl_data" to maintain the existing public-facing plumbing)?

Sounds perfect.
Comment 4•6 years ago (Assignee)
Okay, we've done some digging on this. The query in question had its 'schedule_failures' value set to 17, which, thanks to exponential back-off on the retry schedule, would mean that it wouldn't run again for 6 months. Jason Thomas (primary STMO ops person) reset that back to 0, so scheduling should work again.
We suspect that we got into this state thanks to repeated failures related to the (now resolved) https://bugzilla.mozilla.org/show_bug.cgi?id=1547779. Hopefully things should now be in a functional state. :mreid, can you confirm that all is now working as expected?
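To illustrate why a high 'schedule_failures' count can park a query for months, here is a minimal sketch of exponential back-off. It is not the scheduler's actual code: the one-minute base and the plain doubling rule are assumptions, and `backoff_delay` is a hypothetical helper. Under those assumptions 17 failures adds roughly 91 days and 18 failures roughly 182 days, so the exact figure depends on the scheduler's real constants.

```python
from datetime import timedelta

# Assumption: each consecutive failure doubles the extra wait, starting from
# a base of one minute. The real scheduler's base and formula may differ.
BASE_DELAY = timedelta(minutes=1)


def backoff_delay(schedule_failures: int, base: timedelta = BASE_DELAY) -> timedelta:
    """Extra wait added before the next scheduled run after N failures."""
    if schedule_failures <= 0:
        return timedelta(0)
    return base * (2 ** schedule_failures)


# Show how quickly the delay grows as failures accumulate.
for failures in (0, 5, 10, 17, 18):
    print(failures, backoff_delay(failures))
```

Resetting the counter to 0 removes the extra delay entirely, which is why the fix was a one-field change rather than a new query.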
Comment 5•6 years ago
It looks like the data is still stale as of July 12th - Re:dash shows "Updated 6 days ago" in the bottom right, which matches when Tim ran it manually in comment 1.
Comment 6•6 years ago (Assignee)
The 'schedule_failures' value was just reset this morning, so we should probably wait 24 hours to see if it runs as expected on the daily schedule.
Comment 7•6 years ago
What time of day would you expect the data to be updated? It's still showing the same update time.
Comment 8•6 years ago
I ran this query a couple times manually and it is failing against Athena after 10-15 minutes with Internal Error. I think we should try using Presto or switch to BigQuery.
Comment 9•6 years ago (Assignee)
I checked in with :mreid; he's already opened https://bugzilla.mozilla.org/show_bug.cgi?id=1567281 about implementing this in BigQuery.
For now, I've changed the existing query from Athena to Presto. Running it manually was successful, returning in 2 minutes. I'm going to leave it like this over the weekend. We can check on Monday, and if it's been updating as expected we'll close this issue as resolved.
Comment 10•6 years ago
I grabbed a recently updated version of the file and the normalized_pageloads data went to 0.0 for all data points, old and new:
2019-07-17 00:00:00.000,Windows_NT,SG,0.8,0.0,0.8
2019-07-18 00:00:00.000,Windows_NT,SE,0.8,0.0,0.8
2019-07-18 00:00:00.000,Windows_NT,MY,0.7,0.0,0.4
2019-07-18 00:00:00.000,Linux,AU,0.8,0.0,0.9
......
2016-12-08 00:00:00.000,Windows_NT,BE,0.0,0.0,0.5
2016-11-20 00:00:00.000,Windows_NT,TW,0.0,0.0,0.4
2017-01-12 00:00:00.000,Windows_NT,RO,0.0,0.0,0.5
2016-11-18 00:00:00.000,Windows_NT,BE,0.0,0.0,0.5
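A regression like this can be caught mechanically before the file is published. The sketch below is only an illustration using three rows copied from the sample above; `column_is_all_zero` is a hypothetical helper, and the column position is an assumption (index 4 is taken to be normalized_pageloads because it is the field that reads 0.0 on every row).

```python
import csv
import io

# A few rows copied from the comment above. No header is shown there, so the
# column position is an assumption: index 4 is taken to be
# normalized_pageloads because it is the field that reads 0.0 on every row.
SAMPLE = """\
2019-07-17 00:00:00.000,Windows_NT,SG,0.8,0.0,0.8
2019-07-18 00:00:00.000,Linux,AU,0.8,0.0,0.9
2016-12-08 00:00:00.000,Windows_NT,BE,0.0,0.0,0.5
"""

NORMALIZED_PAGELOADS_INDEX = 4  # assumed position


def column_is_all_zero(text: str, index: int) -> bool:
    """True when the input has rows and every value at `index` is 0.0."""
    rows = list(csv.reader(io.StringIO(text)))
    return bool(rows) and all(float(row[index]) == 0.0 for row in rows)


print(column_is_all_zero(SAMPLE, NORMALIZED_PAGELOADS_INDEX))
```

A check of this kind only flags the symptom; it says nothing about why the values collapsed.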
Comment 11•6 years ago (Assignee)
Okay, I've switched this back to using the Athena data source while we dig into why Presto might be rounding these very small numbers down to zero. I've just run the query by hand, and it completed successfully. It's now scheduled to run daily at a non-peak time, and we'll be watching it carefully to make sure it completes. Meanwhile, work on the BigQuery version is still underway.
Comment 12•6 years ago
Thank you! I've confirmed that the data looks correct, is up to date, and renders as expected when graphed: https://letsencrypt.org/stats
Comment 13•6 years ago
The Athena query continues to fail fairly often, now even when run manually.
We're working on a more robust ETL job over in bug 1567281 and investigating why we get zeros from Presto in bug 1568621.
Comment 14•6 years ago
The original query continues to fail pretty often, so we should get this moved over to the new dataset. I wrote a quick query[1] to compare with the last successful run on Athena, and it looks like the normalized_pageloads field is ever-so-slightly higher in the new data.
Allen, can you take a look at this and see if this query looks equivalent to the original at [2], and if so, why that one column might look slightly different?
Once we resolve that difference, we should be good to update the original to use a simple query against the new dataset.
[1] https://sql.telemetry.mozilla.org/queries/64225/source
[2] https://sql.telemetry.mozilla.org/queries/49323/source
Comment 15•6 years ago
The normalization is over the total number of pageloads in the dataset over all days, so the difference in which days are included seems to account for the difference.
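A toy example makes the effect concrete. This is a sketch with made-up counts, not the real query: `normalize` is a hypothetical helper that divides each day's pageloads by the total across every day it is given, so removing a day shrinks the denominator and nudges every remaining value upward.

```python
# Hypothetical daily pageload counts; not real telemetry.
pageloads_by_day = {
    "2018-10-30": 1000,
    "2018-10-31": 1200,
    "2018-11-01": 800,
}


def normalize(counts: dict[str, int]) -> dict[str, float]:
    """Divide each day's count by the total over every day in the input."""
    total = sum(counts.values())
    return {day: count / total for day, count in counts.items()}


full = normalize(pageloads_by_day)

# Drop one day, as if it were missing from the dataset.
partial = normalize(
    {day: count for day, count in pageloads_by_day.items() if day != "2018-10-31"}
)

print(full["2018-10-30"])     # 1000 / 3000
print(partial["2018-10-30"])  # 1000 / 1800
```

The same day gets a larger normalized value in the dataset with fewer days, which is the direction of the difference described above.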
Comment 16•6 years ago
Looks like the moz-fx-data-derived-datasets.telemetry_derived.ssl_ratios_v1 table is missing several days' worth of data:
2018-10-31
2018-11-12
2018-11-15
2018-11-23
2018-12-09
2018-12-12
And data is missing after 2019-07-29 - I believe the query still needs to be scheduled in Airflow.
Allen, should we handle that in this bug or should I reopen Bug 1567281?
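Gaps like these can be found by comparing the dates present in the table against a full calendar range. The following is a rough sketch, assuming the distinct dates have already been fetched into a set; `missing_days` is a hypothetical helper and the sample dates are illustrative only.

```python
from datetime import date, timedelta


def missing_days(present: set[date], start: date, end: date) -> list[date]:
    """Return every day in [start, end] that is absent from `present`."""
    span = (end - start).days
    return [
        start + timedelta(days=offset)
        for offset in range(span + 1)
        if start + timedelta(days=offset) not in present
    ]


# Hypothetical input: the distinct dates returned by a query on the table.
start, end = date(2018, 10, 29), date(2018, 11, 2)
present = {
    date(2018, 10, 29),
    date(2018, 10, 30),
    date(2018, 11, 1),
    date(2018, 11, 2),
}

print(missing_days(present, start, end))
```

Running a check like this after each backfill would confirm whether the listed days have been restored.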
Comment 17•6 years ago
I'm not familiar with the Airflow situation. relud, can you address the scheduling issues for this?
Comment 18•6 years ago
Comment 20•6 years ago
Airflow has been moved to read the BigQuery table.