Closed Bug 1355153 Opened 7 years ago Closed 7 years ago

Airflow ETL jobs access old version of the Longitudinal table

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, enhancement, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Dexter, Assigned: whd)

References

Details

(Whiteboard: [SvcOps])

In the hardware-report ETL job [1], scheduled through Airflow, we open the Longitudinal table using the Metastore:

> sqlContext.sql("SELECT ... FROM longitudinal")

We then look through the users to find those that were active during the week prior to the job execution.

This had been working for months, but started failing 3 weeks ago: the job now reports that all users were inactive for the reference week (see bug 1353396).

It turns out that the failure coincides with the deployment of this PR [2], which could be making the ETL job access a version of the table that is more than 7 days old.

When running the same ETL job outside of Airflow, everything works fine.

[1] - https://github.com/mozilla/firefox-hardware-report/tree/master/report
[2] - https://github.com/mozilla/emr-bootstrap-spark/pull/75
Blocks: 1353396
Blake, did you have a chance to look into this? Can this be fixed, or should we change all the dependent jobs?
Flags: needinfo?(bimsland)
Assignee: nobody → whd
Flags: needinfo?(whd)
Priority: -- → P1
Whiteboard: [SvcOps]
Flags: needinfo?(bimsland)
From IRC discussion, it appears that the reason this is occurring is that the longitudinal dataset isn't added to the centralized metastore via p2h immediately after it is generated. This was fine when every Spark instance re-ran p2h, but it means that jobs that run as dependents of longitudinal will now most likely see a week-old version of the data. I think the easiest solution would be to run p2h at the end of the longitudinal job on Airflow, in https://github.com/mozilla/telemetry-airflow/blob/master/jobs/longitudinal_view.sh. In theory this also means we can remove the cron jobs that update p2h for longitudinal. In fact, it may make sense to do this for all of our jobs that generate datasets with dependencies, and possibly for all datasets that are generated.
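A minimal sketch of what that could look like at the end of jobs/longitudinal_view.sh (the S3 path and the pipe-to-bash invocation are assumptions based on typical parquet2hive usage, not the actual job script):

    # Hypothetical tail of jobs/longitudinal_view.sh: once the dataset has
    # been written, register the new version in the central metastore so
    # downstream Airflow jobs pick it up instead of a week-old copy.
    parquet2hive "s3://telemetry-parquet/longitudinal" | bash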
Flags: needinfo?(whd)
:whd robotblake and I looked into this a bit during Q3/Q4 2016. Basically, we need to call p2h remotely. I made a hack [0], and I know robotblake tested out running WebHCat [1], which apparently was not up to the task. Either way, we should then have a Python client that we can just call, so that it's easy for users to load datasets as well.

[0] https://github.com/fbertsch/parquet2hive_server
[1] https://cwiki.apache.org/confluence/display/Hive/WebHCat
The other option is to use the --sql flag and use beeline to connect to the remote Hive port.
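For illustration, a hedged sketch of that approach (the hostname and port are placeholders, and it assumes --sql makes parquet2hive emit raw DDL rather than bash commands):

    # Hypothetical: generate the Hive DDL for the freshly written dataset and
    # run it against the remote Hive server through beeline over JDBC.
    SQL="$(parquet2hive --sql s3://telemetry-parquet/longitudinal)"
    beeline -u "jdbc:hive2://hive.metastore.example.internal:10000" -e "$SQL"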
That seems like the best way. Can we then just use pyhive? It would be nice to have a user-friendly wrapper around this, so the user doesn't have to deal with the SQL output or the metastore DNS.
I've filed https://github.com/mozilla/emr-bootstrap-spark/pull/79 and
https://github.com/mozilla/telemetry-airflow/pull/102 which should fix this.

I manually tested that the environment variables set in the bootstrap script are available to the airflow script, so I think this should be all we need.

If this approach works, we can begin moving our cron-based p2h updates into airflow where the jobs are generated.
The latest run of longitudinal used the post-run p2h update and appears to have succeeded. I checked the output of the hardware report and saw a latest output date of "2017-04-09", whereas the previous week's run had "2017-03-26" (these dates themselves were somewhat confusing to me, as the job ran on 2017-04-16, but that appears to be how Airflow dates work). As these dates are two weeks apart, I'm assuming the p2h update worked and am closing this bug.

The more general solution for running p2h after dataset creation is the subject of bug #1356703.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard