Closed Bug 1384324 Opened 7 years ago Closed 6 years ago

Investigate long stage causing timeout in experiment aggregates

Categories

(Data Platform and Tools :: General, enhancement, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bugzilla, Unassigned)

References

Details

Attachments

(1 file)

The experiment aggregates job for 20170724 timed out on try 1, but we have a bit more insight into it now that we're logging at the INFO level for this job. So far, this is what I've been able to determine:
- There have been zero code changes since the last successful job (that is, since we re-ran 20170723 yesterday afternoon, which took ~5 hours)
- The job timed out while running a single stage (stage 660.0) for five hours
- Re-running the experiment the job was processing at the time of the timeout on a (single-node) ad-hoc cluster completed in 8 minutes (a rough re-run sketch follows below)

¯\_(ツ)_/¯
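For future reference, an ad-hoc re-run of a single experiment can be approximated with a snippet along these lines; the dataset path and column names below are assumptions for illustration, not the view's verified inputs:

```scala
import org.apache.spark.sql.SparkSession

// Rough single-experiment reproduction sketch; paths and columns are assumptions.
object ExperimentRerunSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("experiment-aggregates-repro").getOrCreate()
    import spark.implicits._

    // Assumed layout: experiment pings partitioned by experiment_id and submission date.
    val pings = spark.read.parquet("s3://telemetry-parquet/experiments/v1/")
      .where($"experiment_id" === "pref-flip-layframe-1370253")
      .where($"submission_date_s3" === "20170724")

    // A trivial aggregation is enough to see whether the experiment is slow on its
    // own or only when processed after the others in the same Spark context.
    pings.groupBy($"experiment_branch").count().show()

    spark.stop()
  }
}
```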
Assigned to Anthony as P1
Assignee: ssuh → amiyaguchi
Priority: -- → P1
Logging is only accessible through the Spark logs; these are the relevant lines for 20170729.[1]

```
$ cat spark.log | grep "INFO ExperimentAnalysisView"

17/07/31 05:30:30 INFO ExperimentAnalysisView$: =======================================================================================
17/07/31 05:30:30 INFO ExperimentAnalysisView$: Starting experiment_analysis for date 20170729
17/07/31 05:30:38 INFO ExperimentAnalysisView$: List of experiments to process for 20170729 is: List(shield-public-infobar-display-bug1368141, pref-flip-screenshots-1369150, pref-flip-rcwn-1367951, pref-flip-jsbc-1373241, shield-lazy-client-classify, pref-flip-quantum-css-style-r1-1381147, pref-flip-layframe-1370253, pref-flip-rcwn2-1381816)
17/07/31 05:30:38 INFO ExperimentAnalysisView$: Aggregating pings for experiment shield-public-infobar-display-bug1368141
17/07/31 05:42:08 INFO ExperimentAnalysisView$: Wrote aggregates to s3://telemetry-parquet/experiments_aggregates/v1/experiment_id=shield-public-infobar-display-bug1368141/date=20170729
17/07/31 05:42:08 INFO ExperimentAnalysisView$: Aggregating pings for experiment pref-flip-screenshots-1369150
17/07/31 07:15:09 INFO ExperimentAnalysisView$: Wrote aggregates to s3://telemetry-parquet/experiments_aggregates/v1/experiment_id=pref-flip-screenshots-1369150/date=20170729
17/07/31 07:15:09 INFO ExperimentAnalysisView$: Aggregating pings for experiment pref-flip-rcwn-1367951
17/07/31 07:30:00 INFO ExperimentAnalysisView$: Wrote aggregates to s3://telemetry-parquet/experiments_aggregates/v1/experiment_id=pref-flip-rcwn-1367951/date=20170729
17/07/31 07:30:00 INFO ExperimentAnalysisView$: Aggregating pings for experiment pref-flip-jsbc-1373241
17/07/31 07:47:04 INFO ExperimentAnalysisView$: Wrote aggregates to s3://telemetry-parquet/experiments_aggregates/v1/experiment_id=pref-flip-jsbc-1373241/date=20170729
17/07/31 07:47:04 INFO ExperimentAnalysisView$: Aggregating pings for experiment shield-lazy-client-classify
17/07/31 08:11:42 INFO ExperimentAnalysisView$: Wrote aggregates to s3://telemetry-parquet/experiments_aggregates/v1/experiment_id=shield-lazy-client-classify/date=20170729
17/07/31 08:11:42 INFO ExperimentAnalysisView$: Aggregating pings for experiment pref-flip-quantum-css-style-r1-1381147
17/07/31 10:23:14 INFO ExperimentAnalysisView$: Wrote aggregates to s3://telemetry-parquet/experiments_aggregates/v1/experiment_id=pref-flip-quantum-css-style-r1-1381147/date=20170729
17/07/31 10:23:14 INFO ExperimentAnalysisView$: Aggregating pings for experiment pref-flip-layframe-1370253
```

`pref-flip-layframe-1370253` is the particular experiment that is timing out. 

We currently do not collect Spark application logs because Spark does not have access to the logging directory in EMR. This would be nice to have while debugging the application (bug 1329779).

[1] s3://telemetry-airflow/logs/ssuh@mozilla.com/Experiments Aggregates View/j-1MD9N3IK95CN0/node/i-09abcbd2e4a860b52/applications/spark/spark.log.gz
[2] https://github.com/mozilla/telemetry-airflow/blob/master/dags/main_summary.py#L114
Increasing the number of nodes from 10 to 20 only surfaces a new issue (a possible mitigation is sketched after the exception below).

```
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 36985 tasks (4.0 GB) is bigger than spark.driver.maxResultSize (4.0 GB)

```
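One possible mitigation, sketched here and not necessarily what the job ended up doing, is to raise the driver's result-size cap when building the session; whether the job should be shipping that much data back to the driver at all is a separate question:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: raise the cap on serialized task results returned to the driver.
// Setting it to "0" removes the limit entirely; a bounded value is safer on a
// shared cluster. The same setting can be passed via spark-submit with
// --conf spark.driver.maxResultSize=8g.
val spark = SparkSession.builder()
  .appName("experiment-aggregates")
  .config("spark.driver.maxResultSize", "8g")
  .getOrCreate()
```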
Assignee: amiyaguchi → ssuh
Hey folks. Do we have a status update on this?
We've merged in these two PRs:
https://github.com/mozilla/telemetry-batch-view/pull/274
https://github.com/mozilla/telemetry-batch-view/pull/276

But we're still seeing failures. The current theory is that we have a memory leak that's accumulating as we iterate through experiments, and Frank and I are working on two different tacks to alleviate this. One or both should be done before the end of the day.
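If the leak really is per-experiment state piling up in the long-lived session, one obvious tack is to drop everything cached for an experiment before starting the next one. A minimal sketch of that idea (the loop and helper names are illustrative, not the actual telemetry-batch-view code):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Illustrative only: aggregate each experiment in turn and aggressively release
// anything cached for it, so state cannot accumulate across iterations in the
// long-lived SparkSession.
def processExperiments(spark: SparkSession,
                       experiments: Seq[String],
                       aggregate: String => DataFrame,
                       outputPrefix: String): Unit = {
  experiments.foreach { experimentId =>
    val aggregates = aggregate(experimentId).persist()
    aggregates.write
      .mode("overwrite")
      .parquet(s"$outputPrefix/experiment_id=$experimentId")
    aggregates.unpersist(blocking = true)
    // Also drop anything aggregate() may have cached internally.
    spark.catalog.clearCache()
  }
}
```

Clearing the catalog cache between experiments is blunt, but it makes it easy to tell whether accumulation in the shared session is actually the culprit.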
Awesome. Thanks for the update!
Aggregates job is fixed, but the import job is broken -- being tracked here: https://github.com/mozilla/experiments-viewer/issues/198
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
I'm reopening this as a P3 since we should still try to understand why the job was so slow when running multiple sequential experiments in the same cluster, so we can avoid similar situations in the future.

For the future investigator: the commit hash from before the fix was 783ea945479aaf4485af0067f37acb1ed1a97941 (https://github.com/mozilla/telemetry-batch-view/commit/783ea945479aaf4485af0067f37acb1ed1a97941#diff-f0a72dc1efa84cdd5d8ec4e13ed89fc1)
Assignee: ssuh → nobody
Status: RESOLVED → REOPENED
Priority: P1 → P3
Resolution: FIXED → ---
> running multiple sequential experiments in the same cluster
s/cluster/spark context/
Status: REOPENED → RESOLVED
Closed: 7 years ago → 6 years ago
Resolution: --- → FIXED
Component: Datasets: Experiments → General