Closed Bug 1589411 Opened 6 years ago Closed 6 years ago

main_summary timed out twice for 10/15, generating three duplicate copies of the same data

Categories

(Data Platform and Tools :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mconnor, Unassigned)

Details

https://sql.telemetry.mozilla.org/queries/63630#162821 shows a weird spike that's around 3x expected volume.

Doing a quick breakdown by engine, it looks like the counts for October 15th are uniformly about 3 times what they were on the 14th, no matter what the engine:

14th: https://sql.telemetry.mozilla.org/queries/65407/source
15th: https://sql.telemetry.mozilla.org/queries/65406/source

This indicates to me that we probably just accidentally scheduled the job three times are something equally silly. I'll try to fix it.

Summary: data issue on 10/15 → search aggregates data issue on 10/15

This seems to be a bug related to scheduling main_summary on Databricks. It looks like the main summary job for 2019-10-15 hit the 4 hour retry limit on the first two attempts, but failed to remove the partial data on the last and final attempt. This means that downstream jobs will need to be reprocessed.

For reference, these are the jobs that failed on Databricks in the order that they appear in the airflow logs:

The timeout message is as follows:

[2019-10-16 05:00:33,163] {logging_mixin.py:95} INFO - [2019-10-16 05:00:33,163] {timeout.py:41} ERROR - Process timed out
[2019-10-16 05:00:33,174] {logging_mixin.py:95} INFO - [2019-10-16 05:00:33,174] {base_hook.py:83} INFO - Using connection to: id: databricks_default. Host: dbc-caf9527b-e073.cloud.databricks.com, Port: None, Schema: None, Login: dataops+airflow@mozilla.com, Password: XXXXXXXX, extra: {}
[2019-10-16 05:00:33,175] {logging_mixin.py:95} INFO - [2019-10-16 05:00:33,174] {databricks_hook.py:123} INFO - Using basic auth.
[2019-10-16 05:00:33,314] {databricks_operator.py:286} INFO - Task: main_summary with run_id: 19818 was requested to be cancelled.
[2019-10-16 05:00:33,324] {models.py:1788} ERROR - Timeout
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/airflow/models.py", line 1652, in _run_raw_task
    result = task_copy.execute(context=context)
  File "/app/pvmount/telemetry-airflow/plugins/moz_databricks.py", line 246, in execute
    super(MozDatabricksSubmitRunOperator, self).execute(context)
  File "/usr/local/lib/python2.7/site-packages/airflow/contrib/operators/databricks_operator.py", line 279, in execute
    _handle_databricks_operator_execution(self, hook, self.log, context)
  File "/usr/local/lib/python2.7/site-packages/airflow/contrib/operators/databricks_operator.py", line 91, in _handle_databricks_operator_execution
    time.sleep(operator.polling_period_seconds)
  File "/usr/local/lib/python2.7/site-packages/airflow/utils/timeout.py", line 42, in handle_timeout
    raise AirflowTaskTimeout(self.error_message)
AirflowTaskTimeout: Timeout
[2019-10-16 05:00:33,428] {models.py:1811} INFO - Marking task as UP_FOR_RETRY

The written files sorted by timestamps in the s3 bucket illustrate the unexpected behavior of the day's run:

$ aws s3 ls s3://telemetry-parquet/main_summary/v4/submission_date_s3=20191015/sample_id=1/ | sort

2019-10-15 21:11:40          0 _started_1829421004631413645
2019-10-15 21:11:53  632179467 part-00069-tid-1829421004631413645-c8ac8f30-4701-4137-9050-2f48b1c7f7c4-43641.c000.snappy.parquet
2019-10-15 21:14:50  618449172 part-00069-tid-1829421004631413645-c8ac8f30-4701-4137-9050-2f48b1c7f7c4-43641.c001.snappy.parquet
2019-10-15 21:17:41  628159619 part-00069-tid-1829421004631413645-c8ac8f30-4701-4137-9050-2f48b1c7f7c4-43641.c002.snappy.parquet
2019-10-15 21:20:34  624111730 part-00069-tid-1829421004631413645-c8ac8f30-4701-4137-9050-2f48b1c7f7c4-43641.c003.snappy.parquet
2019-10-15 21:23:27  624747294 part-00069-tid-1829421004631413645-c8ac8f30-4701-4137-9050-2f48b1c7f7c4-43641.c004.snappy.parquet
2019-10-15 21:26:24  620872605 part-00069-tid-1829421004631413645-c8ac8f30-4701-4137-9050-2f48b1c7f7c4-43641.c005.snappy.parquet
2019-10-15 21:30:40  106874072 part-00069-tid-1829421004631413645-c8ac8f30-4701-4137-9050-2f48b1c7f7c4-43641.c006.snappy.parquet
2019-10-15 21:52:45  621423445 part-00069-tid-1829421004631413645-c8ac8f30-4701-4137-9050-2f48b1c7f7c4-46847.c000.snappy.parquet
2019-10-15 21:55:38  625756810 part-00069-tid-1829421004631413645-c8ac8f30-4701-4137-9050-2f48b1c7f7c4-46847.c001.snappy.parquet
2019-10-15 23:51:15  623783516 part-00069-tid-4333148849994260298-3c64137e-f730-4a07-b1cd-d019641c7bf6-32070.c000.snappy.parquet
2019-10-15 23:54:03  631457622 part-00069-tid-4333148849994260298-3c64137e-f730-4a07-b1cd-d019641c7bf6-32070.c001.snappy.parquet
2019-10-15 23:56:55  627433720 part-00069-tid-4333148849994260298-3c64137e-f730-4a07-b1cd-d019641c7bf6-32070.c002.snappy.parquet
2019-10-15 23:59:44  628382537 part-00069-tid-4333148849994260298-3c64137e-f730-4a07-b1cd-d019641c7bf6-32070.c003.snappy.parquet
2019-10-16 00:02:36  612779783 part-00069-tid-4333148849994260298-3c64137e-f730-4a07-b1cd-d019641c7bf6-32070.c004.snappy.parquet
2019-10-16 00:05:22  629447320 part-00069-tid-4333148849994260298-3c64137e-f730-4a07-b1cd-d019641c7bf6-32070.c005.snappy.parquet
2019-10-16 00:08:10  104627863 part-00069-tid-4333148849994260298-3c64137e-f730-4a07-b1cd-d019641c7bf6-32070.c006.snappy.parquet
2019-10-16 02:05:51          0 _started_4333148849994260298
2019-10-16 05:09:26  618755127 part-00069-tid-8319816361844412313-19650789-1730-41bf-ad8b-dab8dceff2f7-47365.c000.snappy.parquet
2019-10-16 05:12:10  629436180 part-00069-tid-8319816361844412313-19650789-1730-41bf-ad8b-dab8dceff2f7-47365.c001.snappy.parquet
2019-10-16 05:14:57  618260389 part-00069-tid-8319816361844412313-19650789-1730-41bf-ad8b-dab8dceff2f7-47365.c002.snappy.parquet
2019-10-16 05:17:41  631481012 part-00069-tid-8319816361844412313-19650789-1730-41bf-ad8b-dab8dceff2f7-47365.c003.snappy.parquet
2019-10-16 05:20:28  623351887 part-00069-tid-8319816361844412313-19650789-1730-41bf-ad8b-dab8dceff2f7-47365.c004.snappy.parquet
2019-10-16 05:23:13  628055732 part-00069-tid-8319816361844412313-19650789-1730-41bf-ad8b-dab8dceff2f7-47365.c005.snappy.parquet
2019-10-16 05:25:59  103997061 part-00069-tid-8319816361844412313-19650789-1730-41bf-ad8b-dab8dceff2f7-47365.c006.snappy.parquet
2019-10-16 06:37:10          0 _started_8319816361844412313
2019-10-16 06:49:25          0 _SUCCESS
2019-10-16 06:49:25        724 _committed_8319816361844412313

There are 3 transaction id's (tid): 1829421004631413645, 4333148849994260298, and 8319816361844412313. These are used as part of the metadata for marking when the partitions are being written to disk via the _started and _committed files.

:sunahsuh brought up that this behavior might be specific to the Databricks module that commits partitions to s3. The expected behavior is for Spark's overwrite mode to clear the existing partitions, since none of these jobs overlapped and had at least 30 minutes of buffer between retries.

Summary: search aggregates data issue on 10/15 → main_summary timed out twice for 10/15, generating three duplicate copies of the same data

I increased the timeout from 4 to 6 hours: https://github.com/mozilla/telemetry-airflow/commit/22a6e96ea5f68c644085d211e12051d6431bfa54 and reran the main summary job by itself

Arkadiusz, can you email databricks support to see if they have any insight into why the data didn't get overwritten as expected?

The code in question is here (thanks to :sunahsuh for pointing it out to me): https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/views/MainSummaryView.scala#L388

(this should work according to stackoverflow: https://stackoverflow.com/a/38347141)

Flags: needinfo?(akomarzewski)

Meanwhile, I have retriggered the main summary job for the 15th one last time on airflow to see if the problem recurs.

(In reply to William Lachance (:wlach) (use needinfo!) from comment #3)

Arkadiusz, can you email databricks support to see if they have any insight into why the data didn't get overwritten as expected?

Sure, you're cc'd and I'll update this bug when we hear back.

Flags: needinfo?(akomarzewski)

Looks like the old files have not been deleted :

➜  ~ aws s3 ls s3://telemetry-parquet/main_summary/v4/submission_date_s3=20191015/sample_id=1/|sort

2019-10-16 06:11:40          0 _started_1829421004631413645
2019-10-16 06:11:53  632179467 part-00069-tid-1829421004631413645-c8ac8f30-4701-4137-9050-2f48b1c7f7c4-43641.c000.snappy.parquet
2019-10-16 06:14:50  618449172 part-00069-tid-1829421004631413645-c8ac8f30-4701-4137-9050-2f48b1c7f7c4-43641.c001.snappy.parquet
2019-10-16 06:17:41  628159619 part-00069-tid-1829421004631413645-c8ac8f30-4701-4137-9050-2f48b1c7f7c4-43641.c002.snappy.parquet
2019-10-16 06:20:34  624111730 part-00069-tid-1829421004631413645-c8ac8f30-4701-4137-9050-2f48b1c7f7c4-43641.c003.snappy.parquet
2019-10-16 06:23:27  624747294 part-00069-tid-1829421004631413645-c8ac8f30-4701-4137-9050-2f48b1c7f7c4-43641.c004.snappy.parquet
2019-10-16 06:26:24  620872605 part-00069-tid-1829421004631413645-c8ac8f30-4701-4137-9050-2f48b1c7f7c4-43641.c005.snappy.parquet
2019-10-16 06:30:40  106874072 part-00069-tid-1829421004631413645-c8ac8f30-4701-4137-9050-2f48b1c7f7c4-43641.c006.snappy.parquet
2019-10-16 06:52:45  621423445 part-00069-tid-1829421004631413645-c8ac8f30-4701-4137-9050-2f48b1c7f7c4-46847.c000.snappy.parquet
2019-10-16 06:55:38  625756810 part-00069-tid-1829421004631413645-c8ac8f30-4701-4137-9050-2f48b1c7f7c4-46847.c001.snappy.parquet
2019-10-16 08:51:15  623783516 part-00069-tid-4333148849994260298-3c64137e-f730-4a07-b1cd-d019641c7bf6-32070.c000.snappy.parquet
2019-10-16 08:54:03  631457622 part-00069-tid-4333148849994260298-3c64137e-f730-4a07-b1cd-d019641c7bf6-32070.c001.snappy.parquet
2019-10-16 08:56:55  627433720 part-00069-tid-4333148849994260298-3c64137e-f730-4a07-b1cd-d019641c7bf6-32070.c002.snappy.parquet
2019-10-16 08:59:44  628382537 part-00069-tid-4333148849994260298-3c64137e-f730-4a07-b1cd-d019641c7bf6-32070.c003.snappy.parquet
2019-10-16 09:02:36  612779783 part-00069-tid-4333148849994260298-3c64137e-f730-4a07-b1cd-d019641c7bf6-32070.c004.snappy.parquet
2019-10-16 09:05:22  629447320 part-00069-tid-4333148849994260298-3c64137e-f730-4a07-b1cd-d019641c7bf6-32070.c005.snappy.parquet
2019-10-16 09:08:10  104627863 part-00069-tid-4333148849994260298-3c64137e-f730-4a07-b1cd-d019641c7bf6-32070.c006.snappy.parquet
2019-10-16 11:05:51          0 _started_4333148849994260298
2019-10-16 15:49:25        724 _committed_8319816361844412313
2019-10-17 23:09:47  621312962 part-00069-tid-1830809560743930977-ca6c5318-295e-4fd9-9743-caade1f14a04-32070.c000.snappy.parquet
2019-10-17 23:12:32  627697803 part-00069-tid-1830809560743930977-ca6c5318-295e-4fd9-9743-caade1f14a04-32070.c001.snappy.parquet
2019-10-17 23:15:19  629300676 part-00069-tid-1830809560743930977-ca6c5318-295e-4fd9-9743-caade1f14a04-32070.c002.snappy.parquet
2019-10-17 23:18:04  628839540 part-00069-tid-1830809560743930977-ca6c5318-295e-4fd9-9743-caade1f14a04-32070.c003.snappy.parquet
2019-10-17 23:20:51  626150294 part-00069-tid-1830809560743930977-ca6c5318-295e-4fd9-9743-caade1f14a04-32070.c004.snappy.parquet
2019-10-17 23:23:39  619546433 part-00069-tid-1830809560743930977-ca6c5318-295e-4fd9-9743-caade1f14a04-32070.c005.snappy.parquet
2019-10-17 23:26:22  104483484 part-00069-tid-1830809560743930977-ca6c5318-295e-4fd9-9743-caade1f14a04-32070.c006.snappy.parquet
2019-10-18 00:25:13          0 _started_1830809560743930977
2019-10-18 00:48:49       1434 _committed_1830809560743930977
2019-10-18 00:48:50          0 _SUCCESS
2019-10-18 00:48:52         96 _committed_vacuum1241633759612501202

I will delete s3://telemetry-parquet/main_summary/v4/submission_date_s3=20191015/ and retrigger main summary dag. I don't expect this to happen in downstream jobs since they weren't cancelled during write transaction, but their outputs will need to be verified too.

I think the data issues are now resolved. :akomar did you want to continue to use this bug to track any fixes to main summary generation, or should we file a followup?

Flags: needinfo?(akomarzewski)

(In reply to William Lachance (:wlach) (use needinfo!) from comment #7)

I think the data issues are now resolved. :akomar did you want to continue to use this bug to track any fixes to main summary generation, or should we file a followup?

Let's close this, we're moving this job out of Databricks.

Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(akomarzewski)
Resolution: --- → FIXED
Component: Datasets: Search → Datasets: General
Component: Datasets: General → General
You need to log in before you can comment on or make changes to this bug.