Some of the scheduled jobs on ATMO v1 are failing (see https://us-west-2.console.aws.amazon.com/elasticmapreduce/home?region=us-west-2#cluster-list:). We should make sure that none of them is failing due to bugs in our batch scheduling mechanism.
Points: --- → 2
Priority: -- → P1
Tried to repro with the beta_release_os_gfx job, which failed on 10/4. Running it via the AWS CLI passed successfully.
Failed job: https://us-west-2.console.aws.amazon.com/elasticmapreduce/home?region=us-west-2#cluster-details:j-3S7U9DHV2OYQ8
Successful job: https://us-west-2.console.aws.amazon.com/elasticmapreduce/home?region=us-west-2#cluster-details:j-2HPK7IP95H7PN
Unclear what the difference is.
Confirmed that the timeout for this job is set to 2h. Sunah, this job has been failing for a while; is it still being used?
Ignore comment 2.
Frank, this job ran successfully today, so maybe the bootstrap process failed because PyPI was unreachable.
Perhaps, but we still had another failure: https://us-west-2.console.aws.amazon.com/elasticmapreduce/home?region=us-west-2#cluster-details:j-UTQCYRYHUF71
So it looks like quite a few jobs are just timing out. I'll contact owners / file bugs about these to delete them or extend the timeout. Others are failing more consistently:

mobile-android-clients: Failed 3/5, Timed out 2/5
ShieldOutcomesFunnel: Failed 6/6
TxP DAU MAU: Failed 7/7
txp_mau_dau_daily: Failed 1/1
created_then_anything_by_platform_macintosh: Failed 7/7
Addon analysis: Failed 5/5
Mauro and I discovered that mobile-android-clients failed because it timed out during an S3 read: http://nbviewer.jupyter.org/urls/s3-us-west-2.amazonaws.com/telemetry-public-analysis-2/mobile-android-clients-v1/data/android-clients.ipynb
Investigating the other notebooks to see what issues they are running into.
The moztelemetry implementation should handle timeouts correctly, i.e. retry up to N times, then give up and log the error.
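A minimal sketch of that retry policy in plain Python (the function name and signature here are illustrative, not moztelemetry's actual API):

```python
import time


def read_with_retries(read_fn, max_retries=3, base_delay=1.0):
    """Call read_fn(), retrying on IOError with exponential backoff.

    Matches the proposed behavior: retry up to N times, then log the
    error and give up (re-raise) so the job fails visibly.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return read_fn()
        except IOError as e:
            if attempt == max_retries:
                print("read failed after %d attempts: %s" % (max_retries, e))
                raise
            # back off: 1s, 2s, 4s, ... before the next attempt
            time.sleep(base_delay * 2 ** (attempt - 1))
```

In the S3 case, `read_fn` would wrap the actual object fetch; transient socket timeouts then get absorbed instead of killing the whole job.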
ShieldOutcomesFunnel is failing because it is pointing to the incorrect object in S3. Note that there is whitespace in the name; not sure why it's affecting this job in particular.
job: https://us-west-2.console.aws.amazon.com/elasticmapreduce/home?region=us-west-2#cluster-details:j-1ZI9ZFNZHRX8P
object being pointed to: s3://telemetry-analysis-code-2/jobs/ShieldOutcomesFunnel/Shield%20Studies%20Offer %20Response %20Outcome%20Funnel.ipynb
object that should be pointed at: s3://telemetry-analysis-code-2/jobs/ShieldOutcomesFunnel/Shield%2520Studies%2520Offer%2C%2520Response%2C%2520Outcome%2520Funnel.ipynb
Job runs: https://us-west-2.console.aws.amazon.com/elasticmapreduce/home?region=us-west-2#cluster-details:j-HQ1VD9WDW2MQ
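The `%20` vs `%2520` mismatch between the two keys above looks like double percent-encoding: the key with spaces was encoded once, then encoded again, turning each `%` into `%25`. A quick Python 3 illustration (the filename is just the one from this bug):

```python
from urllib.parse import quote

name = "Shield Studies Offer, Response, Outcome Funnel.ipynb"

encoded_once = quote(name)           # spaces become %20, commas %2C
encoded_twice = quote(encoded_once)  # '%' itself becomes %25, so %20 -> %2520

print(encoded_once)
print(encoded_twice)
```

So whichever layer stored the key appears to have percent-encoded an already-encoded name, and the scheduler then looked up the singly-encoded one.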
created_then_anything_by_platform_macintosh is failing because of a class-not-found exception for com.databricks.spark.csv; see here: http://nbviewer.jupyter.org/urls/s3-us-west-2.amazonaws.com/telemetry-public-analysis-2/created_then_anything_by_platform_macintosh/data/created_then_anything_by_platform_macintosh.ipynb
I *think* this should be included when users run our Jupyter notebooks. Do we have other included jars? Otherwise we should document how to include these (e.g. via spark-submit --packages; note that sc.addPyFile() only distributes Python dependencies, not jars).
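One way to pull the package in from a notebook is to set it on the SparkConf before the context is created. This is a sketch, assuming the cluster runs Spark 1.x on Scala 2.10; the Maven coordinates and app name are illustrative:

```python
from pyspark import SparkConf, SparkContext

# Ask Spark to resolve and ship the spark-csv package to the executors.
# Must be set before SparkContext creation; has no effect on a running context.
conf = (SparkConf()
        .setAppName("created_then_anything_by_platform_macintosh")
        .set("spark.jars.packages", "com.databricks:spark-csv_2.10:1.5.0"))
sc = SparkContext(conf=conf)
```

If our scheduler creates the context for the user, the cleaner fix is probably to bake the package into the cluster bootstrap instead.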
re: ShieldOutcomesFunnel
The run I put above as "Job runs" did not actually run. I went ahead and copied the file, renamed it to something with no special characters, and it now *actually* ran. I'm not sure what happened, but I do think we should sanitize filenames.
actually runs: https://us-west-2.console.aws.amazon.com/elasticmapreduce/home?region=us-west-2#cluster-details:j-317KCPAR5FTQG
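A sanitizer along these lines could run when the notebook is uploaded, so S3 keys never need percent-encoding in the first place (the function and character whitelist are a suggestion, not existing scheduler code):

```python
import re


def sanitize_filename(name):
    """Replace any run of characters outside [A-Za-z0-9._-] with a
    single underscore, and trim leading/trailing underscores."""
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", name)
    return cleaned.strip("_")
```

For example, the notebook name from this bug would become `Shield_Studies_Offer_Response_Outcome_Funnel.ipynb`, which round-trips through S3 and the EMR step args without any encoding surprises.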
Addon Analysis: this is failing due to an import error:
ImportError: /home/hadoop/anaconda2/lib/libreadline.so.6: undefined symbol: PC
Seems to be a common problem: https://github.com/ContinuumIO/anaconda-issues/issues/152
Notebook: http://nbviewer.jupyter.org/urls/s3-us-west-2.amazonaws.com/telemetry-public-analysis-2/Addon%20analysis/data/AddonAnalysis.ipynb
TxP DAU MAU: looks like this one is just a typical Python error (object of type 'NoneType' has no len()).
notebook: http://nbviewer.jupyter.org/gist/fbertsch/436a693ae4a7ad5e09bb29b4b809623a
txp_mau_dau_daily: another typical Python error ("df is not defined").
notebook: http://nbviewer.jupyter.org/urls/s3-us-west-2.amazonaws.com/telemetry-public-analysis-2/txp_mau_dau_daily/data/TxP%20-%20Mau%20Dau.ipynb
Points: 2 → 3
We've taken care of all the problem jobs.
Status: NEW → RESOLVED
Resolution: --- → FIXED