Closed Bug 1355153 Opened 7 years ago Closed 7 years ago

Airflow ETL jobs access old version of the Longitudinal table

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, enhancement, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Dexter, Assigned: whd)

References

Details

(Whiteboard: [SvcOps])

In the hardware-report ETL job [1], scheduled through Airflow, we open the Longitudinal table using the Metastore:

> sqlContext.sql("SELECT ... FROM longitudinal")

We then look through the users to find those that were active during the week prior to the job execution.

This had been working for months, but started failing 3 weeks ago: the job now reports that all users were inactive for the reference week (see bug 1353396).

It turns out that the failure coincides with the deployment of this PR [2], which could be making the ETL job access a version of the table that is more than 7 days old.

When running the same ETL job outside of Airflow, everything works fine.

[1] - https://github.com/mozilla/firefox-hardware-report/tree/master/report
[2] - https://github.com/mozilla/emr-bootstrap-spark/pull/75
Blocks: 1353396
Blake, did you have a chance to look into this? Can this be fixed, or should we change all the dependent jobs?
Flags: needinfo?(bimsland)
Assignee: nobody → whd
Flags: needinfo?(whd)
Priority: -- → P1
Whiteboard: [SvcOps]
Flags: needinfo?(bimsland)
From IRC discussion, it appears that the reason this is occurring is that the longitudinal dataset isn't added to the centralized metastore via p2h immediately after it is generated. This was fine when every Spark instance re-ran p2h, but it means that jobs that run as dependents of longitudinal will now most likely see a week-old version of the data. I think the easiest solution would be to run p2h at the end of the longitudinal job on Airflow, in https://github.com/mozilla/telemetry-airflow/blob/master/jobs/longitudinal_view.sh. In theory this also means we can remove the cron jobs that update p2h for longitudinal. In fact, it may make sense to do this for all of our jobs that generate datasets with dependencies, and possibly for all datasets that are generated.
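A minimal sketch of what that could look like at the end of jobs/longitudinal_view.sh (the S3 path and the pipe-to-bash invocation are assumptions based on typical parquet2hive usage, not the actual job script):

    # Hypothetical tail of jobs/longitudinal_view.sh: once the dataset has
    # been written, register the new version in the central metastore so
    # downstream Airflow jobs pick it up instead of a week-old copy.
    parquet2hive "s3://telemetry-parquet/longitudinal" | bash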
Flags: needinfo?(whd)
:whd robotblake and I looked into this a bit during Q3/Q4 2016. Basically, we need to call p2h remotely. I made a hack [0], and I know robotblake tested out running WebHCat [1], which apparently was not up to the task. Either way, we should then have a Python client that we can just call, so that it's easy for users to load datasets as well.

[0] https://github.com/fbertsch/parquet2hive_server
[1] https://cwiki.apache.org/confluence/display/Hive/WebHCat
The other option is to use the --sql flag and use beeline to connect to the remote Hive port.
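For illustration, a hedged sketch of that approach (the hostname and port are placeholders, and it assumes --sql makes parquet2hive emit raw DDL rather than bash commands):

    # Hypothetical: generate the Hive DDL for the freshly written dataset and
    # run it against the remote Hive server through beeline over JDBC.
    SQL="$(parquet2hive --sql s3://telemetry-parquet/longitudinal)"
    beeline -u "jdbc:hive2://hive.metastore.example.internal:10000" -e "$SQL"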
That seems like the best way. Can we then just use pyhive? It would be nice to have a user-friendly wrapper around this, so the user doesn't have to deal with the SQL output or the metastore DNS.
I've filed https://github.com/mozilla/emr-bootstrap-spark/pull/79 and
https://github.com/mozilla/telemetry-airflow/pull/102 which should fix this.

I manually tested that the environment variables set in the bootstrap script are available to the airflow script, so I think this should be all we need.

If this approach works, we can begin moving our cron-based p2h updates into airflow where the jobs are generated.
The latest run of longitudinal used the post-run p2h update and appears to have succeeded. I checked the output of the hardware report and saw a latest output date of "2017-04-09", whereas the previous week's run had "2017-03-26" (these dates themselves were somewhat confusing to me, as the job ran on 2017-04-16, but that appears to be how Airflow dates work). As these dates are two weeks apart, I'm assuming the p2h update worked and am closing this bug.

The more general solution for running p2h after dataset creation is the subject of bug #1356703.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard