Closed Bug 1565659 Opened 6 years ago Closed 6 years ago

SSL Ratios dashboard stopped updating April 17th

Categories

(Data Platform and Tools :: General, defect, P1)

Points: 1

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jcj, Assigned: rmiller)

Attachments

(1 file)

The dataset for SSL Ratios, added in Bug 1414839, used by Let's Encrypt (https://docs.telemetry.mozilla.org/datasets/other/ssl/reference.html) is populated from data-dumps via this query on Redash:

https://sql.telemetry.mozilla.org/queries/49323/source#table

The data has stalled as of 17 April 2019, and that matches Redash's note that it's been three months since the query last ran.

Do we need a new URL / Redash query?

I hit the execute button just now and it completed fine, in 11 minutes. Maybe we need to turn the scheduled job off and on again.

Daniel, what do you think about moving this into bigquery-etl (and changing the redash query to "select * from ssl_data" to maintain the existing public-facing plumbing)?

Rob, can you take a look at why this query stopped running on its schedule in STMO?

Flags: needinfo?(rmiller)
Flags: needinfo?(dthorn)
Assignee: nobody → rmiller

(In reply to Mark Reid [:mreid] from comment #2)

Daniel, what do you think about moving this into bigquery-etl (and changing the redash query to "select * from ssl_data" to maintain the existing public-facing plumbing)?

sounds perfect

Flags: needinfo?(dthorn)

Okay, we've done some digging on this. The query in question had its 'schedule_failures' value set to 17, which, thanks to exponential back-off on the retry schedule, meant that it wouldn't run again for 6 months. Jason Thomas (primary STMO ops person) reset that back to 0, so scheduling should work again.

We suspect that we got into this state thanks to repeated failures related to the (now resolved) https://bugzilla.mozilla.org/show_bug.cgi?id=1547779. Hopefully things should now be in a functional state. :mreid, can you confirm that all is now working as expected?
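
For reference, the back-off arithmetic described above can be sketched as follows. This is only an illustration, not the scheduler's actual code: it assumes the extra delay is one minute multiplied by 2 for every recorded failure, and `backoff_delay` is a hypothetical helper. Under that assumption a failure count in the high teens pushes the next run out by months; the exact figure depends on the base unit the real implementation uses.

```python
from datetime import timedelta


def backoff_delay(failures: int, base: timedelta = timedelta(minutes=1)) -> timedelta:
    """Extra delay added to the schedule after `failures` recorded failures.

    Assumption (not taken from the scheduler source): delay = base * 2**failures.
    """
    if failures <= 0:
        return timedelta(0)
    return base * (2 ** failures)


# Print how the assumed delay grows with the failure count.
for n in (0, 1, 10, 17, 18):
    print(n, backoff_delay(n))
```

Resetting the counter to 0 removes the extra delay entirely, which is why the reset alone should be enough to resume the normal schedule.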

Flags: needinfo?(rmiller) → needinfo?(mreid)
See Also: → 1567281

It looks like the data is still stale as of July 12th - Re:dash shows "Updated 6 days ago" in the bottom right, which matches when Tim ran it manually in comment 1.

Flags: needinfo?(mreid)

The 'schedule_failures' value was just reset this morning, so we should probably wait 24 hours to see if it runs as expected on the daily schedule.

What time of day would you expect the data to be updated? It's still showing the same update time.

Flags: needinfo?(rmiller)

I ran this query a couple of times manually and it is failing against Athena after 10-15 minutes with "Internal Error". I think we should try using Presto or switch to BigQuery.

I checked in with :mreid; he's already opened https://bugzilla.mozilla.org/show_bug.cgi?id=1567281 about implementing this in BigQuery.

For now, I've changed the existing query from Athena to Presto. Running it manually was successful, returning in 2 minutes. I'm going to leave it like this over the weekend. We can check on Monday, and if it's been updating as expected we'll close this issue as resolved.

Flags: needinfo?(rmiller)
Points: --- → 1
Priority: -- → P1

I grabbed a recently updated version of the file and the normalized_pageloads data went to 0.0 for all data points, old and new:

2019-07-17 00:00:00.000,Windows_NT,SG,0.8,0.0,0.8
2019-07-18 00:00:00.000,Windows_NT,SE,0.8,0.0,0.8
2019-07-18 00:00:00.000,Windows_NT,MY,0.7,0.0,0.4
2019-07-18 00:00:00.000,Linux,AU,0.8,0.0,0.9
......
2016-12-08 00:00:00.000,Windows_NT,BE,0.0,0.0,0.5
2016-11-20 00:00:00.000,Windows_NT,TW,0.0,0.0,0.4
2017-01-12 00:00:00.000,Windows_NT,RO,0.0,0.0,0.5
2016-11-18 00:00:00.000,Windows_NT,BE,0.0,0.0,0.5
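
A quick way to catch this kind of regression is to check whether a column is zero in every row of the exported file. The sketch below is an assumption-laden illustration: `column_is_all_zero` is a hypothetical helper, and the column position (index 4) is inferred from the sample rows above, where it is the only field that is 0.0 throughout.

```python
import csv
import io


def column_is_all_zero(csv_text: str, index: int) -> bool:
    """Return True if every non-empty row holds a zero in column `index`."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    # An empty file is not treated as "all zero".
    return bool(rows) and all(float(row[index]) == 0.0 for row in rows)


# A few rows copied from the sample above.
sample = """\
2019-07-17 00:00:00.000,Windows_NT,SG,0.8,0.0,0.8
2019-07-18 00:00:00.000,Linux,AU,0.8,0.0,0.9
2016-12-08 00:00:00.000,Windows_NT,BE,0.0,0.0,0.5
"""

print(column_is_all_zero(sample, 4))  # column assumed to be normalized_pageloads
print(column_is_all_zero(sample, 5))
```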

Okay, I've switched this back to using the Athena data source while we dig into why Presto might be rounding these very small numbers down to zero. I've just run the query by hand, and it completed successfully. It's now scheduled to run daily at a non-peak time, and we'll be watching it carefully to make sure it completes. Meanwhile, work on the BigQuery version is still underway.

Thank you! I've confirmed that the data looks correct, is up to date, and renders as expected when graphed. https://letsencrypt.org/stats

The Athena query continues to fail fairly often, now even when run manually.

We're working on a more robust ETL job over in bug 1567281 and investigating why we get zeros from Presto in bug 1568621.

The original query continues to fail pretty often, so we should get this moved over to the new dataset. I wrote a quick query[1] to compare with the last successful run on Athena and it looks like the normalized_pageloads field is ever-so-slightly higher in the new data.

Allen, can you take a look at this and see if this query looks equivalent to the original at [2], and if so, why that one column might look slightly different?

Once we resolve that difference, we should be good to update the original to use a simple query against the new dataset.

[1] https://sql.telemetry.mozilla.org/queries/64225/source
[2] https://sql.telemetry.mozilla.org/queries/49323/source

Flags: needinfo?(ashort)

The normalization is over the total number of pageloads in the dataset over all days, so the difference in which days are included seems to account for the difference.
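
To illustrate the effect, here is a minimal sketch with made-up counts (the numbers and the `normalize` helper are hypothetical, not taken from either query). Because each value is divided by the total over every day present, dropping days shrinks the denominator and nudges the remaining normalized values upward.

```python
def normalize(pageloads_by_day: dict[str, float]) -> dict[str, float]:
    """Divide each day's pageloads by the total over all days in the input."""
    total = sum(pageloads_by_day.values())
    return {day: count / total for day, count in pageloads_by_day.items()}


# Made-up counts, for illustration only.
full = {"2019-07-01": 100.0, "2019-07-02": 300.0, "2019-07-03": 100.0}
# Same data with one day left out.
partial = {day: n for day, n in full.items() if day != "2019-07-03"}

print(normalize(full)["2019-07-02"])     # 300 / 500
print(normalize(partial)["2019-07-02"])  # 300 / 400
```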

Flags: needinfo?(ashort)

Looks like the moz-fx-data-derived-datasets.telemetry_derived.ssl_ratios_v1 table is missing several days' worth of data:
2018-10-31
2018-11-12
2018-11-15
2018-11-23
2018-12-09
2018-12-12

And data is missing after 2019-07-29 - I believe the query still needs to be scheduled in Airflow.
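
A gap check like the one above can be sketched in a few lines once the distinct dates present in the table have been fetched. The `missing_dates` helper and the toy dates are hypothetical; how the dates are pulled from the table is left out.

```python
from datetime import date, timedelta


def missing_dates(present: set[date], start: date, end: date) -> list[date]:
    """Return every date in [start, end] that is absent from `present`."""
    days = (end - start).days + 1
    window = (start + timedelta(days=i) for i in range(days))
    return [d for d in window if d not in present]


# Toy example: a five-day window with one gap.
present = {date(2018, 10, 29), date(2018, 10, 30), date(2018, 11, 1), date(2018, 11, 2)}
print(missing_dates(present, date(2018, 10, 29), date(2018, 11, 2)))
```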

Allen, should we handle that in this bug or should I reopen Bug 1567281?

Flags: needinfo?(ashort)

I'm not familiar with the Airflow situation. relud, can you address scheduling issues for this?

Flags: needinfo?(ashort) → needinfo?(dthorn)

airflow PR filed

Flags: needinfo?(dthorn)
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED

Airflow has been moved to read the BigQuery table.

Component: Datasets: General → General