Closed Bug 1566623 Opened 5 years ago Closed 5 years ago

TAAR weekly DAG uses incorrect start_time param and does not run regularly

Categories

(Data Platform and Tools Graveyard :: Operations, task, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: vng, Assigned: hwoo)

Details

(Whiteboard: [DataOps])

Attachments

(2 files)

https://workflow.telemetry.mozilla.org/tree?dag_id=taar_weekly

The TAAR weekly DAG doesn't seem to be running, and it isn't throwing any errors either.

There's no error that I can debug right now as it doesn't look like Airflow is executing the job at all.

Last time this occurred, the DAG had to be forced to start.

Whiteboard: [DataOps]

It looks like we may need to manually force the runs to start until the most recent runs are in a successful state. Also, the DAG points to the URL https://raw.githubusercontent.com/mozilla/python_mozetl/master/bin/mozetl-databricks.sh, which doesn't seem to exist.

Assignee: nobody → hwoo
Priority: -- → P1

Crap. I've fixed the file extension in the DAG script in https://github.com/mozilla/telemetry-airflow/pull/554

Let's see if the file extension is the only issue with this DAG running.
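For reference, a hypothetical sketch of the kind of change involved (the constant name is illustrative, not the actual telemetry-airflow source, and the corrected extension is my reading of the PR, not confirmed here):

# Before: this path 404s; there is no .sh bootstrap script at this location
MOZETL_URL = "https://raw.githubusercontent.com/mozilla/python_mozetl/master/bin/mozetl-databricks.sh"

# After (assumed fix): point at the Python bootstrap script instead
MOZETL_URL = "https://raw.githubusercontent.com/mozilla/python_mozetl/master/bin/mozetl-databricks.py"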

I took a quick look at this and tried resetting all of the job runs to a failed state. I ran into the following error when trying to set one of the jobs to failed (or success).

-------------------------------------------------------------------------------
Node: airflow-prod-airflow-app-1-web-745cd79659-nfd7b
-------------------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/flask/app.py", line 1982, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python2.7/site-packages/flask/app.py", line 1614, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python2.7/site-packages/flask/app.py", line 1517, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python2.7/site-packages/flask/app.py", line 1612, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python2.7/site-packages/flask/app.py", line 1598, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/usr/local/lib/python2.7/site-packages/airflow/www_rbac/decorators.py", line 121, in wrapper
    return f(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/flask_appbuilder/security/decorators.py", line 26, in wraps
    return f(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/airflow/www_rbac/decorators.py", line 56, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/airflow/www_rbac/views.py", line 1121, in failed
    future, past, State.FAILED)
  File "/usr/local/lib/python2.7/site-packages/airflow/www_rbac/views.py", line 1092, in _mark_task_instance_state
    commit=False)
  File "/usr/local/lib/python2.7/site-packages/airflow/utils/db.py", line 73, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/airflow/api/common/experimental/mark_tasks.py", line 100, in set_state
    dates = dag.date_range(start_date=start_date, end_date=end_date)
  File "/usr/local/lib/python2.7/site-packages/airflow/models.py", line 3475, in date_range
    num=num, delta=self._schedule_interval)
  File "/usr/local/lib/python2.7/site-packages/airflow/utils/dates.py", line 84, in date_range
    end_date = timezone.make_naive(end_date, tz)
  File "/usr/local/lib/python2.7/site-packages/airflow/utils/timezone.py", line 141, in make_naive
    o = value.astimezone(timezone)
  File "/usr/local/lib/python2.7/site-packages/pendulum/tz/timezone_info.py", line 99, in fromutc
    tzinfo = self._tz._tzinfos[self._tz._transitions[idx]._tzinfo_index]
IndexError: list index out of range

I recommend setting all the job runs for taar_weekly.clients_daily and taar_weekly.taar_ensemble between 2019-06-09 and 2019-07-14 to a failure state. This ensures that each date has a valid entry in the Airflow database. Then, all of the job runs should be cleared to backfill the job.
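A minimal sketch of one way to do that from the instance, using the experimental API the web UI calls (it appears in the traceback above). The dates and flags are assumptions, and note this goes through the same set_state path that raised the IndexError, so it may only succeed once the underlying DAG bug is fixed:

from datetime import datetime

from airflow.api.common.experimental.mark_tasks import set_state
from airflow.models import DagBag
from airflow.utils.state import State

dag = DagBag().get_dag("taar_weekly")
for task_id in ("clients_daily", "taar_ensemble"):
    # Mark 2019-06-09 and all later scheduled dates as failed so every
    # date has a row in the Airflow database; clear them afterwards to
    # trigger the backfill.
    set_state(
        task=dag.get_task(task_id),
        execution_date=datetime(2019, 6, 9),
        future=True,
        state=State.FAILED,
        commit=True,
    )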

When I look at https://workflow.telemetry.mozilla.org/tree?dag_id=taar_weekly - none of the jobs have backfilled yet.

Do we need to do something to force the backfills to occur?

Flags: needinfo?(amiyaguchi)

Yes, it looks like a backfill will need to be run manually from the instance.

:hwoo, could you backfill the taar_weekly dag and see if the issue resolves, please?

airflow backfill \
    --start_date 2019-06-09 \
    --end_date 2019-07-14 \
    --reset_dagruns \
    taar_weekly

Airflow CLI docs

Flags: needinfo?(amiyaguchi) → needinfo?(hwoo)

The suggested backfill command fails with the logs pasted below. I get the same error (TypeError: can't compare datetime.datetime to str) when I remove the --reset_dagruns flag. Odd, since other backfills don't throw this error.

I was able to run the job manually via the UI, however, by starting with clients_daily and then taar_ensemble. I did have to kick off each step by hand. Since there are only 5 weekly runs left, is it fine if you kick these off manually rather than troubleshooting why the backfill fails? You can do so by clicking the white box and then clicking "Run".

$ airflow backfill --start_date 2019-06-09 --end_date 2019-07-14 --reset_dagruns taar_weekly
[2019-07-25 01:31:30,581] {settings.py:174} INFO - settings.configure_orm(): Using pool settings. pool_size=5, pool_recycle=3600, pid=257425
/usr/local/lib/python2.7/site-packages/airflow/utils/helpers.py:356: DeprecationWarning: Importing 'BaseSensorOperator' directly from 'airflow.operators' has been deprecated. Please import from 'airflow.operators.[operator_module]' instead. Support for direct imports will be dropped entirely in Airflow 2.0.
  DeprecationWarning)
[2019-07-25 01:31:40,887] {default_celery.py:90} WARNING - You have configured a result_backend of redis://10.0.0.4:6379/1, it is highly recommended to use an alternative result_backend (i.e. a database).
[2019-07-25 01:31:51,506] {__init__.py:51} INFO - Using executor CeleryExecutor
[2019-07-25 01:31:53,689] {models.py:273} INFO - Filling up the DagBag from /app/pvmount/telemetry-airflow/dags
[2019-07-25 01:31:54,817] {credentials.py:925} INFO - Found credentials in environment variables.
[2019-07-25 01:31:57,881] {models.py:360} INFO - File /app/pvmount/telemetry-airflow/dags/__init__.py assumed to contain no DAGs. Skipping.
/usr/local/lib/python2.7/site-packages/airflow/operators/dummy_operator.py:35: PendingDeprecationWarning: Invalid arguments were passed to DummyOperator (task_id: clients_daily_v6_dummy). Support for passing such arguments will be dropped in Airflow 2.0. Invalid arguments were:
*args: ()
**kwargs: {'job_name': 'A placeholder for the implicit clients daily dependency'}
  super(DummyOperator, self).__init__(*args, **kwargs)
[2019-07-25 01:32:02,277] {models.py:2532} WARNING - start_date for <Task(MozDatabricksSubmitRunOperator): taar_ensemble> isn't datetime.datetime
/usr/local/lib/python2.7/site-packages/airflow/utils/helpers.py:356: DeprecationWarning: Importing 'BashOperator' directly from 'airflow.operators' has been deprecated. Please import from 'airflow.operators.[operator_module]' instead. Support for direct imports will be dropped entirely in Airflow 2.0.
  DeprecationWarning)
You are about to delete these 3 tasks:
<TaskInstance: taar_weekly.clients_daily 2019-06-09 00:00:00+00:00 [None]>
<TaskInstance: taar_weekly.clients_daily 2019-06-16 00:00:00+00:00 [None]>
<TaskInstance: taar_weekly.taar_ensemble 2019-06-09 00:00:00+00:00 [None]>

Are you sure? (yes/no):
yes
Traceback (most recent call last):
  File "/usr/local/bin/airflow", line 32, in <module>
    args.func(args)
  File "/usr/local/lib/python2.7/site-packages/airflow/utils/cli.py", line 74, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/airflow/bin/cli.py", line 220, in backfill
    rerun_failed_tasks=args.rerun_failed_tasks,
  File "/usr/local/lib/python2.7/site-packages/airflow/models.py", line 4324, in run
    job.run()
  File "/usr/local/lib/python2.7/site-packages/airflow/jobs.py", line 202, in run
    self._execute()
  File "/usr/local/lib/python2.7/site-packages/airflow/utils/db.py", line 73, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/airflow/jobs.py", line 2440, in _execute
    session=session)
  File "/usr/local/lib/python2.7/site-packages/airflow/utils/db.py", line 69, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/airflow/jobs.py", line 2380, in _execute_for_run_dates
    dag_run = self._get_dag_run(next_run_date, session=session)
  File "/usr/local/lib/python2.7/site-packages/airflow/utils/db.py", line 69, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/airflow/jobs.py", line 2043, in _get_dag_run
    run.verify_integrity(session=session)
  File "/usr/local/lib/python2.7/site-packages/airflow/utils/db.py", line 69, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/airflow/models.py", line 5326, in verify_integrity
    if task.start_date > self.execution_date and not self.is_backfill:
TypeError: can't compare datetime.datetime to str

Group: mozilla-employee-confidential
Flags: needinfo?(hwoo)

Can we reset the start date of this job instead?

The ensemble task is extremely compute-intensive and expensive to run, and because of the way TAAR ensemble works, the backfill isn't going to generate different data for each backfilled date anyway.

I think I found the problem. The start_date parameter was added to avoid errors under test, but it was added as a string without dashes. The source actually mentions that this parameter should be a datetime.datetime, which is why the stacktrace has the following lines:

File "/usr/local/lib/python2.7/site-packages/airflow/models.py", line 5326, in verify_integrity
if task.start_date > self.execution_date and not self.is_backfill:
TypeError: can't compare datetime.datetime to str
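In other words, the fix is to pass start_date as a datetime.datetime in the DAG definition rather than as a string. A minimal sketch (the literal date is an assumption for illustration, not the actual DAG value):

from datetime import datetime

# Broken: a string, which verify_integrity later compares against a
# datetime.datetime, raising the TypeError above.
default_args = {"start_date": "20190609"}

# Fixed: Airflow expects a datetime.datetime here.
default_args = {"start_date": datetime(2019, 6, 9)}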
Summary: TAAR weekly DAG and taar_ensemble job doesn't seem to be running or throwing errors → TAAR weekly DAG uses incorrect start_time param and does not run regularly
Group: mozilla-employee-confidential
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Product: Data Platform and Tools → Data Platform and Tools Graveyard
