Closed Bug 1384324 Opened 7 years ago Closed 6 years ago

Investigate long stage causing timeout in experiment aggregates

Categories

(Data Platform and Tools :: General, enhancement, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bugzilla, Unassigned)

References

Details

Attachments

(1 file)

The experiment aggregates job for 20170724 timed out on try 1, but we have a bit more insight into it now that we're logging at the INFO level for this job. So far, this is what I've been able to determine:
- There have been zero code changes since the last successful job (that is, since we re-ran 20170723 yesterday afternoon, which took ~5 hours)
- The job timed out while running a single stage (stage 660.0) for five hours
- Re-running the experiment the job was processing at the time of the timeout on a (single-node) ad-hoc cluster completed in 8 minutes (a rough re-run sketch follows below)

¯\_(ツ)_/¯
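For future reference, an ad-hoc re-run of a single experiment can be approximated with a snippet along these lines; the dataset path and column names below are assumptions for illustration, not the view's verified inputs:

```scala
import org.apache.spark.sql.SparkSession

// Rough single-experiment reproduction sketch; paths and columns are assumptions.
object ExperimentRerunSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("experiment-aggregates-repro").getOrCreate()
    import spark.implicits._

    // Assumed layout: experiment pings partitioned by experiment_id and submission date.
    val pings = spark.read.parquet("s3://telemetry-parquet/experiments/v1/")
      .where($"experiment_id" === "pref-flip-layframe-1370253")
      .where($"submission_date_s3" === "20170724")

    // A trivial aggregation is enough to see whether the experiment is slow on its
    // own or only when processed after the others in the same Spark context.
    pings.groupBy($"experiment_branch").count().show()

    spark.stop()
  }
}
```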
Assigned to Anthony as P1
Assignee: ssuh → amiyaguchi
Priority: -- → P1
Logging is only accessible through the Spark logs; these are the relevant lines for 20170729.[1]

```
$ cat spark.log | grep "INFO ExperimentAnalysisView"

17/07/31 05:30:30 INFO ExperimentAnalysisView$: =======================================================================================
17/07/31 05:30:30 INFO ExperimentAnalysisView$: Starting experiment_analysis for date 20170729
17/07/31 05:30:38 INFO ExperimentAnalysisView$: List of experiments to process for 20170729 is: List(shield-public-infobar-display-bug1368141, pref-flip-screenshots-1369150, pref-flip-rcwn-1367951, pref-flip-jsbc-1373241, shield-lazy-client-classify, pref-flip-quantum-css-style-r1-1381147, pref-flip-layframe-1370253, pref-flip-rcwn2-1381816)
17/07/31 05:30:38 INFO ExperimentAnalysisView$: Aggregating pings for experiment shield-public-infobar-display-bug1368141
17/07/31 05:42:08 INFO ExperimentAnalysisView$: Wrote aggregates to s3://telemetry-parquet/experiments_aggregates/v1/experiment_id=shield-public-infobar-display-bug1368141/date=20170729
17/07/31 05:42:08 INFO ExperimentAnalysisView$: Aggregating pings for experiment pref-flip-screenshots-1369150
17/07/31 07:15:09 INFO ExperimentAnalysisView$: Wrote aggregates to s3://telemetry-parquet/experiments_aggregates/v1/experiment_id=pref-flip-screenshots-1369150/date=20170729
17/07/31 07:15:09 INFO ExperimentAnalysisView$: Aggregating pings for experiment pref-flip-rcwn-1367951
17/07/31 07:30:00 INFO ExperimentAnalysisView$: Wrote aggregates to s3://telemetry-parquet/experiments_aggregates/v1/experiment_id=pref-flip-rcwn-1367951/date=20170729
17/07/31 07:30:00 INFO ExperimentAnalysisView$: Aggregating pings for experiment pref-flip-jsbc-1373241
17/07/31 07:47:04 INFO ExperimentAnalysisView$: Wrote aggregates to s3://telemetry-parquet/experiments_aggregates/v1/experiment_id=pref-flip-jsbc-1373241/date=20170729
17/07/31 07:47:04 INFO ExperimentAnalysisView$: Aggregating pings for experiment shield-lazy-client-classify
17/07/31 08:11:42 INFO ExperimentAnalysisView$: Wrote aggregates to s3://telemetry-parquet/experiments_aggregates/v1/experiment_id=shield-lazy-client-classify/date=20170729
17/07/31 08:11:42 INFO ExperimentAnalysisView$: Aggregating pings for experiment pref-flip-quantum-css-style-r1-1381147
17/07/31 10:23:14 INFO ExperimentAnalysisView$: Wrote aggregates to s3://telemetry-parquet/experiments_aggregates/v1/experiment_id=pref-flip-quantum-css-style-r1-1381147/date=20170729
17/07/31 10:23:14 INFO ExperimentAnalysisView$: Aggregating pings for experiment pref-flip-layframe-1370253
```

`pref-flip-layframe-1370253` is the particular experiment that is timing out. 

We currently do not collect Spark application logs because Spark does not have access to the logging directory in EMR. This would be nice to have while debugging the application (bug 1329779).

[1] s3://telemetry-airflow/logs/ssuh@mozilla.com/Experiments Aggregates View/j-1MD9N3IK95CN0/node/i-09abcbd2e4a860b52/applications/spark/spark.log.gz
[2] https://github.com/mozilla/telemetry-airflow/blob/master/dags/main_summary.py#L114
Increasing the number of nodes from 10 to 20 only surfaces a new issue (a possible mitigation is sketched after the exception below).

```
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 36985 tasks (4.0 GB) is bigger than spark.driver.maxResultSize (4.0 GB)

```
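One possible mitigation, sketched here and not necessarily what the job ended up doing, is to raise the driver's result-size cap when building the session; whether the job should be shipping that much data back to the driver at all is a separate question:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: raise the cap on serialized task results returned to the driver.
// Setting it to "0" removes the limit entirely; a bounded value is safer on a
// shared cluster. The same setting can be passed via spark-submit with
// --conf spark.driver.maxResultSize=8g.
val spark = SparkSession.builder()
  .appName("experiment-aggregates")
  .config("spark.driver.maxResultSize", "8g")
  .getOrCreate()
```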
Assignee: amiyaguchi → ssuh
Hey folks. Do we have a status update on this?
We've merged in these two PRs:
https://github.com/mozilla/telemetry-batch-view/pull/274
https://github.com/mozilla/telemetry-batch-view/pull/276

But we're still seeing failures. The current theory is that we have a memory leak that's accumulating as we iterate through experiments, and Frank and I are working on two different tacks to alleviate this. One or both should be done before the end of the day.
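If the leak really is per-experiment state piling up in the long-lived session, one obvious tack is to drop everything cached for an experiment before starting the next one. A minimal sketch of that idea (the loop and helper names are illustrative, not the actual telemetry-batch-view code):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Illustrative only: aggregate each experiment in turn and aggressively release
// anything cached for it, so state cannot accumulate across iterations in the
// long-lived SparkSession.
def processExperiments(spark: SparkSession,
                       experiments: Seq[String],
                       aggregate: String => DataFrame,
                       outputPrefix: String): Unit = {
  experiments.foreach { experimentId =>
    val aggregates = aggregate(experimentId).persist()
    aggregates.write
      .mode("overwrite")
      .parquet(s"$outputPrefix/experiment_id=$experimentId")
    aggregates.unpersist(blocking = true)
    // Also drop anything aggregate() may have cached internally.
    spark.catalog.clearCache()
  }
}
```

Clearing the catalog cache between experiments is blunt, but it makes it easy to tell whether accumulation in the shared session is actually the culprit.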
Awesome. Thanks for the update!
Aggregates job is fixed, but the import job is broken -- being tracked here: https://github.com/mozilla/experiments-viewer/issues/198
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
I'm reopening this as a P3 since we should still try to understand why the job was so slow when running multiple sequential experiments in the same cluster, so we can avoid similar situations in the future.

For the future investigator: the commit hash from before the fix was 783ea945479aaf4485af0067f37acb1ed1a97941 (https://github.com/mozilla/telemetry-batch-view/commit/783ea945479aaf4485af0067f37acb1ed1a97941#diff-f0a72dc1efa84cdd5d8ec4e13ed89fc1)
Assignee: ssuh → nobody
Status: RESOLVED → REOPENED
Priority: P1 → P3
Resolution: FIXED → ---
> running multiple sequential experiments in the same cluster
s/cluster/spark context/
Status: REOPENED → RESOLVED
Closed: 7 years ago → 6 years ago
Resolution: --- → FIXED
Component: Datasets: Experiments → General