Bug 1471207
Opened 7 years ago
Closed 7 years ago
Submission date aggregates missing from past two All Hands
Categories
(Data Platform and Tools :: General, enhancement, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: chutten, Assigned: klukas)
References
()
Details
https://mzl.la/2KsKBYx
We have submission date aggregates up to Wednesday, December 13, 2017. We have submission date aggregates from Monday, December 18, 2017 up to Monday, June 11, 2018. We have submission date aggregates from Monday, June 18, 2018 until the present day (ish).
But not in between.
During (parts of) the past two All Hands we have no submission date aggregates, which is causing more than a few alerts and some confusion on the list: https://groups.google.com/forum/#!topic/mozilla.dev.telemetry-alerts/G4lT5E9HnS8
Is this an expected result of a backfill? (if so, why do we have submission date aggregates from 0604 up to 0611?) Or is this as weird to you as it is to me?
Assignee
Comment 1 • 7 years ago
Based on https://bugzilla.mozilla.org/show_bug.cgi?id=1467860 it looks like :frank did some manual backfilling outside of Airflow up to about 20180610, which likely explains why the observed start of the outage is 6/11 rather than 6/4.
Airflow was redeployed on 6/20, and we made some logic changes that week to allow Airflow to write to the aggregates db. Airflow was then able to start working down a backlog of jobs, but I don't have detailed notes on what date that backlog started from.
Looking at DAG runs in the Airflow UI [0], though, shows missing executions for 6/12 through 6/17, plus 6/19.
I will investigate if I can generate new executions for those missing dates.
[0] https://workflow.telemetry.mozilla.org/admin/dagrun/?search=telemetry_aggregates
Assignee: nobody → jklukas
Assignee
Comment 2 • 7 years ago
We have jobs scheduled in Airflow for the missing dates (see [1]). They should run sequentially and take ~4 hours apiece, so we should be back in business tomorrow. If all goes well with that, I'll also schedule backfill jobs for the December dates.
[1] https://workflow.telemetry.mozilla.org/admin/airflow/tree?dag_id=telemetry_aggregates
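As a rough check on the "back in business tomorrow" estimate, here is a minimal sketch. It assumes the seven missing June executions listed in comment 1 (6/12 through 6/17, plus 6/19) run strictly one after another at about 4 hours each. The function name and the start time are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta

def estimate_backfill_finish(start, num_jobs, hours_per_job=4):
    """Return the estimated finish time for jobs that run one at a time."""
    return start + timedelta(hours=num_jobs * hours_per_job)

# Missing June execution days reported in the Airflow UI: 6/12-6/17 plus 6/19.
missing_days = [12, 13, 14, 15, 16, 17, 19]

# Hypothetical start time, for illustration only (not taken from the bug).
start = datetime(2018, 6, 27, 12, 0)

total_hours = len(missing_days) * 4
finish = estimate_backfill_finish(start, len(missing_days))
print(f"{total_hours} hours of sequential work, done around {finish}")
```

Seven sequential jobs come to roughly 28 hours, so a next-day finish is plausible only if nothing blocks the queue.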
Assignee
Comment 3 • 7 years ago
I indeed see missing executions in [0] for 12/15 through 12/17, so we should hopefully be able to fill those in via Airflow as well.
Reporter
Updated • 7 years ago
Priority: -- → P1
Comment 4 • 7 years ago
Another hypothesis: People don't allow Nightly to restart while they're running around at All Hands meetings?
Assignee
Comment 5 • 7 years ago
6/12, 6/13, 6/14, and 6/19 are now processed. Had to unblock a task this morning to get the others to keep running. Also kicking off December jobs now.
Reporter
Comment 6 • 7 years ago
I can see those now in the builddate aggregates (https://mzl.la/2K9SHpk) but the submission_date aggregates are still missing for those days (https://mzl.la/2KsKBYx). Why is that?
Assignee
Comment 7 • 7 years ago
The builddate aggregates appear to have entries for all dates, even the ones that still haven't finished processing (6/15, 6/16, 6/17), so it's likely those have always been fine? Or were those definitely broken too?
I will look more closely at logs for 6/12 and friends to see if they truly reported success.
Reporter
Comment 8 • 7 years ago
The builddate aggregates always had data, but the volume was suspiciously low for the affected periods. (this is probably data that was submitted by users still on those builds when the aggregator started working normally again)
Assignee
Comment 9 • 7 years ago
I logged into the read replica via psql and it certainly looks like the results of the successful jobs are in the DB. I see this list of tables:
...
submission_date_nightly_62_20180613
submission_date_nightly_62_20180614
submission_date_nightly_62_20180618
submission_date_nightly_62_20180619
...
The missing tables there correspond to the dates for the jobs still waiting to run. Each of those tables has about the same number of entries, too (~1.4M).
So, AFAICT, the data exists in the DB, but I may be making an incorrect assumption about what TMO is querying.
I think I've reached the limit of what I can poke around at, :chutten, unless you can give me some details about the query that TMO is issuing for the submission_date aggregates.
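The check described above (comparing the table list against the expected dates) can be sketched as follows. This is a minimal example, assuming table names follow the `submission_date_<channel>_<version>_<YYYYMMDD>` pattern seen in the four names listed; the regular expression and the `missing_dates` helper are hypothetical and not part of python_mozaggregator.

```python
import re
from datetime import date, timedelta

# Assumed naming pattern, inferred only from the table names shown above.
TABLE_RE = re.compile(
    r"^submission_date_(?P<channel>[a-z]+)_(?P<version>\d+)_(?P<day>\d{8})$"
)

def missing_dates(table_names, channel, version):
    """Return dates with no table between the earliest and latest table found."""
    days = set()
    for name in table_names:
        m = TABLE_RE.match(name)
        if m and m["channel"] == channel and m["version"] == str(version):
            d = m["day"]
            days.add(date(int(d[:4]), int(d[4:6]), int(d[6:])))
    if not days:
        return []
    gaps = []
    current, last = min(days), max(days)
    while current <= last:
        if current not in days:
            gaps.append(current)
        current += timedelta(days=1)
    return gaps

tables = [
    "submission_date_nightly_62_20180613",
    "submission_date_nightly_62_20180614",
    "submission_date_nightly_62_20180618",
    "submission_date_nightly_62_20180619",
]
gaps = missing_dates(tables, "nightly", 62)
print([d.strftime("%Y%m%d") for d in gaps])  # ['20180615', '20180616', '20180617']
```

Note that this only reports gaps between the earliest and latest table in the list it is given, so a missing date at either end of the range would go unnoticed.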
Reporter
Comment 10 • 7 years ago
I don't know the query, but I know this is the url: https://aggregates.telemetry.mozilla.org/aggregates_by/submission_date/channels/nightly/dates/
For nightly/62 it doesn't report 20180613 or the other missing dates. Let's see if I can poke at the code and come up with the query that URL'd run...
select * from list_buildids('submission_date', 'nightly')
And list_buildids is here: https://github.com/mozilla/python_mozaggregator/blob/8ae5ba5050c56804cce143707c25fcca233def12/mozaggregator/sql.py#L195
Not sure if that's enough to help, but maybe it's something?
Assignee
Comment 11 • 7 years ago
Fascinating. I ran list_buildids directly and I see the dates that have completed. For example:
> select * from list_buildids('submission_date', 'nightly') where buildid = '20180613' limit 3;
version | buildid
---------+----------
62 | 20180613
61 | 20180613
60 | 20180613
Is it possible that we're getting a cached response from aggregates.telemetry.mozilla.org?
I do see that the response carries a header (Cache-Control: max-age=89999), which suggests the response could be cached for about 25 hours. That could potentially explain what we're seeing here.
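The 25-hour figure follows from the max-age value, which is expressed in seconds. A minimal sketch of that conversion, assuming the header value is exactly as quoted above (the `max_age_seconds` helper is hypothetical):

```python
import re

def max_age_seconds(cache_control):
    """Extract max-age (in seconds) from a Cache-Control value, or None if absent."""
    m = re.search(r"max-age\s*=\s*(\d+)", cache_control, flags=re.IGNORECASE)
    return int(m.group(1)) if m else None

seconds = max_age_seconds("max-age=89999")
hours = seconds / 3600  # max-age is given in seconds
print(seconds, round(hours, 2))  # 89999 25.0
```

89999 seconds is one second short of 25 hours, so any cache honoring this header could keep serving a stale date list for a little over a day after the backfill completes.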
Reporter
Comment 12 • 7 years ago
Hm, a fresh load on a fresh profile on a disused aurora build still gets me no love. I wonder if there's an intermediary cache that needs invalidation. I'll try again tomorrow and see :)
Assignee
Comment 13 • 7 years ago
Airflow now shows all dates as complete for June and December. I still don't see a change in graphs when I visit https://mzl.la/2KsKBYx in Firefox Nightly, but if I open a new private browsing window and navigate to that page, only 6/16 and 6/17 are missing now.
Definitely looks like we have multiple levels of caching going on. I'm expecting the last two missing dates will eventually show up once invalidation happens.
Reporter
Comment 14 • 7 years ago
Looks good to me. Thanks, :klukas!
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Updated • 3 years ago
Component: Datasets: Telemetry Aggregates → General