Closed Bug 1353396 Opened 7 years ago Closed 7 years ago

Hardware survey default view is not populating graphs

Categories

(Cloud Services :: Metrics: Product Metrics, defect, P1)

defect
Points:
1

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: thuelbert, Assigned: Dexter)

References

Details

(Whiteboard: [measurement:client])

1. point your browser at https://metrics.mozilla.com/firefox-hardware-survey/ 
2. eyeball the graphs

results: no values

expected: populate graphs

notes: this is a regression, but not sure when it broke
Assignee: nobody → alessio.placitelli
Points: --- → 1
Priority: -- → P1
Whiteboard: [measurement:client]
tl;dr - The website is up and running now, with the latest data.

This is a bit weird. The job didn't fail on Airflow the last time it ran (nor the week before), but it produced an odd data entry:

>  {
>    "date": "2017-03-26",
>    "broken": 0,
>    "inactive": 1
>  },

According to this entry, the ETL job saw every user as inactive over the past week, which is obviously not true. In fact, re-running the job on a newly spawned cluster produces the correct output.

Inspecting the output of the job that ran over Airflow doesn't provide any useful insight.

I'll investigate the root cause of this a bit more tomorrow.
There was no useful insight from the Airflow log. In order to gather more evidence about the issue, I filed a PR [1] to make the ETL job validate outgoing data before trying to push it to S3. This PR also produces more informative logs.

I've put together a postmortem document at [2].

Please note that this PR doesn't fix the problem, but rather makes the scheduled job fail on invalid data so that we don't break the public-facing website. An email is sent when the job fails.

[1] - https://github.com/mozilla/firefox-hardware-report/pull/26
[2] - https://docs.google.com/document/d/15n4VHHaxOBshFn3e8Eh2CkRwXF74_T8FXD71Dm1AXP8/edit
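
For readers who want a concrete picture of "validate outgoing data before pushing to S3", here is a minimal Python sketch. It is not the code from the PR: the function names, the tolerance, and the assumption that every non-date field is a fraction between 0 and 1 are all hypothetical. Only the "date", "broken" and "inactive" keys come from the entry quoted in this bug.

```python
def validate_entry(entry, tolerance=1e-9):
    """Return a list of problems found in one weekly entry (empty if it looks sane).

    Assumption: every field other than "date" is a fraction in [0, 1].
    """
    problems = []
    if not isinstance(entry.get("date"), str):
        problems.append("missing or non-string 'date'")

    numeric = {}
    for key, value in entry.items():
        if key == "date":
            continue
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{key!r} is not a number")
        elif not 0 <= value <= 1:
            problems.append(f"{key!r} is outside [0, 1]")
        else:
            numeric[key] = value

    # The entry seen in this bug had "inactive": 1, i.e. no usable users at all.
    unusable = numeric.get("inactive", 0) + numeric.get("broken", 0)
    if unusable >= 1 - tolerance:
        problems.append("all users reported as inactive or broken")
    return problems


def check_before_upload(entries):
    """Raise ValueError so the scheduled job fails instead of publishing bad data."""
    if not entries:
        raise ValueError("no data to publish")
    for entry in entries:
        problems = validate_entry(entry)
        if problems:
            raise ValueError(
                f"invalid entry for {entry.get('date')}: {'; '.join(problems)}"
            )
```

Calling check_before_upload() just before the upload step would turn an entry like the one above into a failed Airflow task (and therefore an email) rather than an empty graph on the website.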
The PR from comment 2 was merged. If something goes wrong when Airflow triggers the ETL job, we should now have enough information to figure out why it's failing.
Depends on: 1355153
Looks like the fixes from bug 1355153 worked and the HW-report website updated correctly this week. Closing this as Fixed.

For additional information about the outage, see the postmortem doc in comment 2.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED