Bug 1384324 (Closed)
Opened 7 years ago • Closed 6 years ago

Investigate long stage causing timeout in experiment aggregates

Categories: Data Platform and Tools :: General, enhancement, P3
Tracking: (Not tracked)
Status: RESOLVED FIXED
People: Reporter: bugzilla; Unassigned
Attachments: 1 file (2.75 MB, text/plain)
Description

The experiment aggregates job for 20170724 timed out on try 1, but we have a bit more insight into it now that we're logging at the INFO level for this job. So far, this is what I've been able to determine:

- There have been 0 code changes since the last successful job (that is, since we re-ran 20170723 yesterday afternoon, which took ~5 hours).
- The job timed out while running a single stage (stage 660.0) for five hours.
- Re-running the experiment the job was processing at timeout on a (single-node) ad-hoc cluster completed in 8 minutes. ¯\_(ツ)_/¯
Comment 4•7 years ago
Logging is only accessible through the Spark logs; these are the lines for 20170729.[1]

```
$ cat spark.log | grep "INFO ExperimentAnalysisView"
17/07/31 05:30:30 INFO ExperimentAnalysisView$: =======================================================================================
17/07/31 05:30:30 INFO ExperimentAnalysisView$: Starting experiment_analysis for date 20170729
17/07/31 05:30:38 INFO ExperimentAnalysisView$: List of experiments to process for 20170729 is: List(shield-public-infobar-display-bug1368141, pref-flip-screenshots-1369150, pref-flip-rcwn-1367951, pref-flip-jsbc-1373241, shield-lazy-client-classify, pref-flip-quantum-css-style-r1-1381147, pref-flip-layframe-1370253, pref-flip-rcwn2-1381816)
17/07/31 05:30:38 INFO ExperimentAnalysisView$: Aggregating pings for experiment shield-public-infobar-display-bug1368141
17/07/31 05:42:08 INFO ExperimentAnalysisView$: Wrote aggregates to s3://telemetry-parquet/experiments_aggregates/v1/experiment_id=shield-public-infobar-display-bug1368141/date=20170729
17/07/31 05:42:08 INFO ExperimentAnalysisView$: Aggregating pings for experiment pref-flip-screenshots-1369150
17/07/31 07:15:09 INFO ExperimentAnalysisView$: Wrote aggregates to s3://telemetry-parquet/experiments_aggregates/v1/experiment_id=pref-flip-screenshots-1369150/date=20170729
17/07/31 07:15:09 INFO ExperimentAnalysisView$: Aggregating pings for experiment pref-flip-rcwn-1367951
17/07/31 07:30:00 INFO ExperimentAnalysisView$: Wrote aggregates to s3://telemetry-parquet/experiments_aggregates/v1/experiment_id=pref-flip-rcwn-1367951/date=20170729
17/07/31 07:30:00 INFO ExperimentAnalysisView$: Aggregating pings for experiment pref-flip-jsbc-1373241
17/07/31 07:47:04 INFO ExperimentAnalysisView$: Wrote aggregates to s3://telemetry-parquet/experiments_aggregates/v1/experiment_id=pref-flip-jsbc-1373241/date=20170729
17/07/31 07:47:04 INFO ExperimentAnalysisView$: Aggregating pings for experiment shield-lazy-client-classify
17/07/31 08:11:42 INFO ExperimentAnalysisView$: Wrote aggregates to s3://telemetry-parquet/experiments_aggregates/v1/experiment_id=shield-lazy-client-classify/date=20170729
17/07/31 08:11:42 INFO ExperimentAnalysisView$: Aggregating pings for experiment pref-flip-quantum-css-style-r1-1381147
17/07/31 10:23:14 INFO ExperimentAnalysisView$: Wrote aggregates to s3://telemetry-parquet/experiments_aggregates/v1/experiment_id=pref-flip-quantum-css-style-r1-1381147/date=20170729
17/07/31 10:23:14 INFO ExperimentAnalysisView$: Aggregating pings for experiment pref-flip-layframe-1370253
```

`pref-flip-layframe-1370253` is the particular experiment that is timing out.

We currently do not collect Spark application logs because Spark does not have access to the logging directory in EMR. This would be nice to have while debugging the application (bug 1329779).

[1] s3://telemetry-airflow/logs/ssuh@mozilla.com/Experiments Aggregates View/j-1MD9N3IK95CN0/node/i-09abcbd2e4a860b52/applications/spark/spark.log.gz
[2] https://github.com/mozilla/telemetry-airflow/blob/master/dags/main_summary.py#L114
Comment 5•7 years ago
Comment 6•7 years ago
Increasing the number of nodes from 10 to 20 only surfaces a new issue:

```
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 36985 tasks (4.0 GB) is bigger than spark.driver.maxResultSize (4.0 GB)
```
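For reference, one common mitigation for this error (not necessarily the fix applied here) is to raise the driver's result-size cap, either in `spark-defaults.conf` or via `--conf` on `spark-submit`. Spark's default for this setting is 1g; judging by the error message, this cluster was already running with 4g:

```
# spark-defaults.conf (or: spark-submit --conf spark.driver.maxResultSize=8g ...)
# Cap on the total serialized size of task results returned to the driver.
# 0 means unlimited, which risks OOM-ing the driver; a concrete bound is safer.
spark.driver.maxResultSize  8g
```

Raising the cap only buys headroom, though; the underlying problem of 36985 tasks shipping results back to the driver usually calls for reducing task count (e.g. coalescing partitions) or avoiding driver-side collection altogether.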
Updated•7 years ago
Assignee: amiyaguchi → ssuh
Comment 7•7 years ago
Hey folks. Do we have a status update on this?
We've merged in these two PRs:

https://github.com/mozilla/telemetry-batch-view/pull/274
https://github.com/mozilla/telemetry-batch-view/pull/276

But we're still seeing failures. The current theory is that we have a memory leak that's accumulating as we iterate through experiments, and Frank and I are working on two different tacks to alleviate this. One or both should be done before the end of the day.
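The suspected failure mode — per-experiment state that is never released while looping over experiments on a shared Spark context — can be sketched generically. This is plain Python with hypothetical names, not the actual job (which is Scala/Spark, where the analogue of releasing the cache is calling `unpersist()` on cached DataFrames between iterations):

```python
# Minimal sketch of the suspected leak pattern, with hypothetical names.
# Each "cache" stands in for a dataset cached during one experiment's
# aggregation; the leaky variant never releases it before the next one.

def aggregate_leaky(experiments):
    """Caches accumulate across iterations: memory grows per experiment."""
    live_caches = []
    for exp in experiments:
        cache = [exp] * 1_000        # stand-in for a cached dataset
        live_caches.append(cache)    # reference kept -> never freed
    return len(live_caches)          # every cache is still alive at the end

def aggregate_bounded(experiments):
    """Release each cache before the next iteration: memory stays flat."""
    max_live = 0
    for exp in experiments:
        cache = [exp] * 1_000
        max_live = max(max_live, 1)  # at most one cache alive at a time
        del cache                    # analogue of DataFrame.unpersist()
    return max_live

experiments = ["pref-flip-layframe-1370253", "shield-lazy-client-classify"]
print(aggregate_leaky(experiments))    # 2: one live cache per experiment
print(aggregate_bounded(experiments))  # 1: bounded regardless of count
```

With eight experiments processed sequentially on one context (as in the log above), the leaky shape holds eight caches' worth of memory by the final, largest experiment — consistent with the last experiment being the one that stalls.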
Comment 9•7 years ago
Awesome. Thanks for the update!
Reporter
Comment 10•7 years ago
Aggregates job is fixed, but the import job is broken -- being tracked here: https://github.com/mozilla/experiments-viewer/issues/198
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Reporter
Comment 11•7 years ago
I'm reopening this as a P3, since we should still try to understand why the job was so slow when running multiple sequential experiments in the same cluster, so we can avoid similar situations in the future.

For the future investigator: the commit hash from before the fix was 783ea945479aaf4485af0067f37acb1ed1a97941 (https://github.com/mozilla/telemetry-batch-view/commit/783ea945479aaf4485af0067f37acb1ed1a97941#diff-f0a72dc1efa84cdd5d8ec4e13ed89fc1).
Assignee: ssuh → nobody
Status: RESOLVED → REOPENED
Priority: P1 → P3
Resolution: FIXED → ---
Reporter
Comment 12•7 years ago
> running multiple sequential experiments in the same cluster
s/cluster/spark context/
Status: REOPENED → RESOLVED
Closed: 7 years ago → 6 years ago
Resolution: --- → FIXED
Assignee
Updated•2 years ago
Component: Datasets: Experiments → General