Open Bug 1883727 Opened 4 months ago Updated 15 days ago

bqetl_artifact_deployment should run after schema deployment

Categories

(Data Platform and Tools :: General, defect)

Tracking

(Not tracked)

People

(Reporter: benwu, Unassigned)

References

(Blocks 1 open bug)

Details

(Whiteboard: [dataplatform])

There have been a couple of issues in the past week caused by the bqetl_artifact_deployment DAG running a few hours before either schema generation or schema deployment. A couple of options:

  • make bqetl_artifact_deployment depend on schema deployment
    • I think schema deployment is done with a Jenkins job owned by SRE rather than in Airflow, so it's not a simple Airflow dependency
  • schedule bqetl_artifact_deployment a few hours later
    • not as robust but this should work if schema deployment is done on a fixed schedule
    • need to check what the impact of this is and whether anything else indirectly depends on this DAG

Some details about the issues this caused:

On 2024-02-26, org.mozilla.ios.TikTok-Reporter.TikTok-ReporterShare was added to probe scraper, and the next day artifact deployment ran before schemas were deployed. As a result, the derived Glean usage tables were not generated, but because the schemas were deployed afterwards, bigquery-etl CI generated the SQL in the generated-sql branch. This resulted in failed builds on main because the tables weren't found when dry running the SQL (link to a CI failure).

On 2024-03-04, a couple of new fields were added to client_info in all Glean pings (PR). The publish_views task ran before schema deployment, so it didn't know about these new fields. The generated ping views in the fenix dataset union client_info in the org_mozilla_firefox* tables with a struct listing a set of fields in org_mozilla_fenix*. The view didn't know about the new fields, so the union failed due to a schema mismatch.
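The mismatch in the second incident can be illustrated with a small sketch. It assumes one side of the union passes client_info through as-is from the table while the other side lists its fields explicitly in the view SQL. The function name and the field names are hypothetical and are not the ones added in the PR.

```python
def union_field_mismatch(passthrough_fields, explicit_fields):
    """Return fields that appear on only one side of a union of two structs."""
    left, right = set(passthrough_fields), set(explicit_fields)
    return sorted(left ^ right)


# Hypothetical field names, for illustration only.
explicit_struct = ["client_id", "app_build"]  # field list baked into the view SQL
table_before = ["client_id", "app_build"]  # table schema when the view SQL was generated
table_after = ["client_id", "app_build", "new_field"]  # table schema after schema deployment

mismatch_before = union_field_mismatch(table_before, explicit_struct)
mismatch_after = union_field_mismatch(table_after, explicit_struct)
```

Under these assumptions the two sides agree until the schema is deployed, after which the explicit struct lags behind the table until the view is regenerated and republished.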

In both cases, rerunning either table or view deployment fixed the issue.

Another option:

  • trigger bqetl_artifact_deployment from Jenkins. We had a setup briefly for triggering this DAG from CircleCI, so maybe we can do something very similar in Jenkins once schemas have been deployed.

We discussed this in the infra-wg meeting today, and the consensus is that triggering off of Jenkins after schemas deploy makes sense. We'll want to switch to the cloud function based invocation for DAG triggers to simplify Jenkins config, which I talked to Mikael about earlier this week.
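As a rough sketch of what a trigger could look like, the snippet below builds (but does not send) a request against the Airflow 2 stable REST API endpoint for creating a DAG run. The host, the bearer-token authentication, and the `triggered_by` conf key are assumptions for illustration. The cloud function based invocation mentioned above may work differently.

```python
import json
import urllib.request


def build_dag_trigger_request(base_url, dag_id, token, conf=None):
    """Build a POST request asking Airflow to start a run of dag_id."""
    url = f"{base_url.rstrip('/')}/api/v1/dags/{dag_id}/dagRuns"
    body = json.dumps({"conf": conf or {}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: token auth; the real deployment may use another scheme.
            "Authorization": f"Bearer {token}",
        },
    )


request = build_dag_trigger_request(
    "https://airflow.example.com",  # placeholder host
    "bqetl_artifact_deployment",
    token="placeholder-token",
    conf={"triggered_by": "schema_deploy"},  # hypothetical conf key
)
# Sending it would be: urllib.request.urlopen(request)
```

A Jenkins job could run something like this as its final step once schema deployment has succeeded, so that the DAG never starts against stale schemas.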

In the future (hopefully soon, as things should be sped up a bit with the recent dryrun access changes) we want to trigger artifact deployment on Airflow directly after bqetl merges. Without this, relying on Jenkins to trigger the DAG will only cause artifact deployments after an MPS change, so we should continue to schedule it in a cron-like fashion until that point.

There's a specific case where we don't want to trigger artifact deployment directly from CI: the introduction of new datasets. The CI logic should detect dataset changes and only trigger Airflow deployments when (in the common case) no datasets are added. When datasets are added, the standard Jenkins pipeline trigger of private-generated-sql will eventually result in the datasets being deployed, which subsequently triggers artifact deployment.
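The conditional check could look roughly like the sketch below. It assumes changed files follow a `sql/<project>/<dataset>/<table>/...` layout and that CI has some way to obtain the set of already deployed datasets. Both function names and the example paths are hypothetical.

```python
from pathlib import PurePosixPath


def added_datasets(changed_paths, deployed_datasets):
    """Return (project, dataset) pairs touched by the change that are not deployed yet."""
    new = set()
    for path in changed_paths:
        parts = PurePosixPath(path).parts
        # Only consider paths shaped like sql/<project>/<dataset>/<table>/...
        if len(parts) >= 4 and parts[0] == "sql":
            dataset = (parts[1], parts[2])
            if dataset not in deployed_datasets:
                new.add(dataset)
    return new


def should_trigger_airflow(changed_paths, deployed_datasets):
    """Trigger Airflow only when the change introduces no new datasets."""
    return not added_datasets(changed_paths, deployed_datasets)


deployed = {("example-project", "telemetry_derived")}
common_case = ["sql/example-project/telemetry_derived/some_table_v1/query.sql"]
new_dataset_case = ["sql/example-project/brand_new_derived/some_table_v1/query.sql"]
```

In the common case the check returns True and CI triggers the DAG directly. When a new dataset shows up it returns False and the change is left to the Jenkins pipeline.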

There's also a specific case where Jenkins still needs to deploy views until we sort out how to better handle authorized views. For this case I think it makes sense to add something like --only-authorized, akin to --skip-authorized, so that Jenkins is only ever responsible for deploying authorized views. If we don't add something like this, we'd waste a bunch of time deploying views on Jenkins and then triggering bqetl artifact deployment on Airflow to (more robustly and visibly) do the same.

Regressions: 1902284

We'll want to switch to the cloudfunction based invocation for DAG triggers to simplify Jenkins config, which I talked to Mikael about earlier this week.

We hammered this out at the work week, so I think setting up Jenkins to trigger artifact deployment after schemas deploy is unblocked and will possibly land this week.

We don't do SQL generation in the DAG anymore, so just triggering the DAG now might not do anything: it would deploy the SQL in the image, which was generated from the previous schemas. For example, this wouldn't fix the fenix views issues we had. One thing we could do is add a DAG parameter to include SQL generation in the run. Or is it possible/sensible to trigger CircleCI from Jenkins?
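A minimal sketch of the DAG parameter idea, written as plain Python rather than actual Airflow operators: the run's conf carries a flag, and the step list includes SQL generation only when the flag is set. The `generate_sql` key and the `generate_sql` and `publish_tables` step names are hypothetical. Only publish_views is a task name taken from this bug.

```python
def plan_deployment_steps(conf):
    """Return the ordered steps for a run, given its trigger-time conf dict."""
    steps = []
    # Hypothetical flag: regenerate SQL against the freshly deployed schemas
    # instead of deploying the SQL already baked into the image.
    if conf.get("generate_sql", False):
        steps.append("generate_sql")
    steps += ["publish_tables", "publish_views"]
    return steps


scheduled_run = plan_deployment_steps({})
jenkins_triggered_run = plan_deployment_steps({"generate_sql": True})
```

With this shape, scheduled runs keep their current behaviour, while a run triggered after schema deployment can opt in to regeneration.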

We've hit the issue of incorrect view schemas for the third time in 2 weeks: https://bugzilla.mozilla.org/show_bug.cgi?id=1904009. We should go with triggering the artifact deployment DAG after schemas deploy.

(In reply to Ben Wu [:benwu] from comment #4)

(...) One thing we could do is add a dag parameter to include sql generation in the run. Or is it possible/sensible to trigger circleci from jenkins?

The first one seems faster, so it would leave views broken for a shorter time.

See Also: → 1904009