Closed Bug 1471207 Opened 7 years ago Closed 7 years ago

Submission date aggregates missing from past two All Hands

Categories

(Data Platform and Tools :: General, enhancement, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: chutten, Assigned: klukas)

Details

https://mzl.la/2KsKBYx
We have submission date aggregates up to Wednesday, December 13, 2017.
We have submission date aggregates from Monday, December 18, 2017 up to Monday, June 11, 2018.
We have submission date aggregates from Monday, June 18, 2018 until the present day (ish).
But not in between. During (parts of) the past two All Hands we have no submission date aggregates, which is causing more than a few alerts and some confusion on the list: https://groups.google.com/forum/#!topic/mozilla.dev.telemetry-alerts/G4lT5E9HnS8
Is this an expected result of a backfill? (If so, why do we have submission date aggregates from 0604 up to 0611?) Or is this as weird to you as it is to me?
See Also: → 1467860
Based on https://bugzilla.mozilla.org/show_bug.cgi?id=1467860 it looks like :frank did some manual backfilling outside of Airflow up to about 20180610, which likely explains why the observed start of the outage is 6/11 rather than 6/4. Airflow was redeployed on 6/20, and that week we made some logic changes to allow Airflow to write to the aggregates db. Airflow was then able to start working down a backlog of jobs, but I don't have detailed notes on what date that backlog started from. The DAG runs in the Airflow UI [0], though, show missing executions for 6/12 through 6/17, plus 6/19. I will investigate whether I can generate new executions for those missing dates.
[0] https://workflow.telemetry.mozilla.org/admin/dagrun/?search=telemetry_aggregates
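For illustration only, the gap check described in the comment above could be sketched in Python roughly as follows. This assumes the set of dates with a successful run is already known. The `completed` set holds illustrative values chosen to reproduce the gap reported above, and the `missing_dates` helper is hypothetical; neither comes from Airflow or python_mozaggregator.

```python
from datetime import date, timedelta


def missing_dates(start, end, completed):
    """Return every date in [start, end] that is absent from `completed`."""
    total_days = (end - start).days
    window = [start + timedelta(days=i) for i in range(total_days + 1)]
    return [day for day in window if day not in completed]


# Illustrative input (an assumption): dates in the window that already have data.
completed = {date(2018, 6, 11), date(2018, 6, 18)}

gaps = missing_dates(date(2018, 6, 11), date(2018, 6, 19), completed)
print([day.strftime("%Y%m%d") for day in gaps])
```

With these inputs the result covers 6/12 through 6/17 plus 6/19, which matches the missing executions listed above.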
Assignee: nobody → jklukas
We have jobs scheduled in Airflow for the missing dates (see [1]). They should run sequentially and take ~4 hours apiece, so we should be back in business tomorrow. If all goes well with that, I'll also schedule backfill jobs for the December dates. [1] https://workflow.telemetry.mozilla.org/admin/airflow/tree?dag_id=telemetry_aggregates
I indeed see missing executions in [0] for 12/15 through 12/17, so we should hopefully be able to fill those in via Airflow as well.
Priority: -- → P1
Another hypothesis: People don't allow Nightly to restart while they're running around at All Hands meetings?
6/12, 6/13, 6/14, and 6/19 are now processed. Had to unblock a task this morning to get the others to keep running. Also kicking off December jobs now.
I can see those now in the builddate aggregates (https://mzl.la/2K9SHpk) but the submission_date aggregates are still missing for those days (https://mzl.la/2KsKBYx). Why is that?
The builddate aggregates appear to have entries for all dates, even the ones that still haven't finished processing (6/15, 6/16, 6/17), so it's likely those have always been fine? Or were those definitely broken too? I will look more closely at logs for 6/12 and friends to see if they truly reported success.
The builddate aggregates always had data, but the volume was suspiciously low for the affected periods. (this is probably data that was submitted by users still on those builds when the aggregator started working normally again)
I logged into the read replica via psql and it certainly looks like the results of the successful jobs are in the DB. I see this list of tables:
    ...
    submission_date_nightly_62_20180613
    submission_date_nightly_62_20180614
    submission_date_nightly_62_20180618
    submission_date_nightly_62_20180619
    ...
The missing tables there correspond to the dates for the jobs still waiting to run. Each of those tables has about the same number of entries, too (~1.4M). So, AFAICT, the data exists in the DB, but I may be making an incorrect assumption about what TMO is querying. I think I've reached the limit of what I can poke around at, :chutten, unless you can give me some details about the query that TMO is issuing for the submission_date aggregates.
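As a rough sketch of the check above, the table names can be parsed and compared against the dates we expect to see. The naming pattern `submission_date_<channel>_<version>_<YYYYMMDD>` is an assumption inferred from the four names quoted above, and `parse_table` and `dates_without_table` are hypothetical helpers, not part of python_mozaggregator.

```python
import re

# Assumed pattern, inferred only from the table names quoted in this bug.
TABLE_RE = re.compile(
    r"^submission_date_(?P<channel>[a-z]+)_(?P<version>\d+)_(?P<date>\d{8})$"
)


def parse_table(name):
    """Split a table name into (channel, version, date), or None if it does not match."""
    match = TABLE_RE.match(name)
    if match is None:
        return None
    return match.group("channel"), int(match.group("version")), match.group("date")


def dates_without_table(table_names, channel, version, expected_dates):
    """Return the expected dates that have no table for the given channel and version."""
    present = set()
    for name in table_names:
        parsed = parse_table(name)
        if parsed is not None and parsed[0] == channel and parsed[1] == version:
            present.add(parsed[2])
    return [day for day in expected_dates if day not in present]


tables = [
    "submission_date_nightly_62_20180613",
    "submission_date_nightly_62_20180614",
    "submission_date_nightly_62_20180618",
    "submission_date_nightly_62_20180619",
]
expected = ["201806%02d" % day for day in range(13, 20)]  # 20180613 .. 20180619

still_missing = dates_without_table(tables, "nightly", 62, expected)
print(still_missing)
```

For the four tables listed, the dates left over are 20180615 through 20180617, the jobs that were still waiting to run.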
I don't know the query, but I know this is the url: https://aggregates.telemetry.mozilla.org/aggregates_by/submission_date/channels/nightly/dates/
For nightly/62 it doesn't report 20180613 or the other missing dates. Let's see if I can poke at the code and come up with the query that URL'd run...
    select * from list_buildids('submission_date', 'nightly')
And list_buildids is here: https://github.com/mozilla/python_mozaggregator/blob/8ae5ba5050c56804cce143707c25fcca233def12/mozaggregator/sql.py#L195
Not sure if that's enough to help, but maybe it's something?
Fascinating. I ran list_buildids directly and I see the dates that have completed. For example:
    > select * from list_buildids('submission_date', 'nightly') where buildid = '20180613' limit 3;
     version | buildid
    ---------+----------
     62      | 20180613
     61      | 20180613
     60      | 20180613
Is it possible that we're getting a cached response from aggregates.telemetry.mozilla.org? I do see that there's a header (Cache-Control: max-age=89999) that makes it sound like it could cache the response for 25 hours, which could potentially explain what we're seeing here.
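To make the 25-hour figure concrete, here is a minimal Python sketch that pulls `max-age` out of a Cache-Control value and converts it to hours. The `max_age_seconds` helper is hypothetical and handles only the simple comma-separated form; it is not a full HTTP header parser.

```python
def max_age_seconds(cache_control):
    """Return the max-age directive in seconds, or None if it is absent."""
    for directive in cache_control.split(","):
        name, _, value = directive.strip().partition("=")
        value = value.strip()
        if name.strip().lower() == "max-age" and value.isdigit():
            return int(value)
    return None


header_value = "max-age=89999"  # the value quoted in the comment above
seconds = max_age_seconds(header_value)
hours = seconds / 3600
print(f"{seconds} s is about {hours:.2f} h")
```

89999 seconds is one second short of 25 hours, so a client or intermediary honoring this header could keep serving the old list of dates for roughly a day after the backfill lands.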
Hm, a fresh load on a fresh profile on a disused aurora build still gets me no love. I wonder if there's an intermediary cache that needs invalidation. I'll try again tomorrow and see :)
Airflow now shows all dates as complete for June and December. I still don't see a change in graphs when I visit https://mzl.la/2KsKBYx in Firefox Nightly, but if I open a new private browsing window and navigate to that page, only 6/16 and 6/17 are missing now. Definitely looks like we have multiple levels of caching going on. I'm expecting the last two missing dates will eventually show up once invalidation happens.
Looks good to me. Thanks, :klukas!
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Component: Datasets: Telemetry Aggregates → General