Closed Bug 1307096 Opened 8 years ago Closed 8 years ago

Atmo v1 scheduled jobs are failing

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rvitillo, Assigned: frank)

References

Details

User Story

Some of the scheduled jobs with atmo v1 are failing (see https://us-west-2.console.aws.amazon.com/elasticmapreduce/home?region=us-west-2#cluster-list:). We should make sure that none of them fails due to bugs in our batch scheduling mechanism.
      No description provided.
Depends on: 1307095
Severity: normal → major
Flags: needinfo?(mdoglio)
Points: --- → 2
Priority: -- → P1
Assignee: nobody → fbertsch
Flags: needinfo?(mdoglio)
I tried to reproduce the failure with the beta_release_os_gfx job, which failed on 10/4. Running it through the AWS CLI succeeded.

Failed job: https://us-west-2.console.aws.amazon.com/elasticmapreduce/home?region=us-west-2#cluster-details:j-3S7U9DHV2OYQ8
Successful job: https://us-west-2.console.aws.amazon.com/elasticmapreduce/home?region=us-west-2#cluster-details:j-2HPK7IP95H7PN

Unclear what the difference is.
I confirm that the timeout for this job is set to 2h. Sunah, this job has been failing for a while; is it still being used?
Frank, this job ran successfully today, so maybe the bootstrap process failed because PyPI was unreachable.
So it looks like quite a few jobs are just timing out. I'll contact the owners or file bugs about these so that the jobs are either deleted or given a longer timeout.

Others are failing more consistently:
mobile-android-clients: Failed 3/5, Timed out 2/5 
ShieldOutcomesFunnel: Failed 6/6
TxP DAU MAU: Failed 7/7 
txp_mau_dau_daily: Failed 1/1
created_then_anything_by_platform_macintosh: Failed 7/7
Addon analysis: Failed 5/5
Mauro and I discovered that mobile-android-clients failed because it timed out during an S3 read: http://nbviewer.jupyter.org/urls/s3-us-west-2.amazonaws.com/telemetry-public-analysis-2/mobile-android-clients-v1/data/android-clients.ipynb

Investigating the other notebooks to see what issues they are running into.
The moztelemetry implementation should deal correctly with timeouts, i.e. retry up to N times and then give up and log the error.
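
A minimal sketch of the retry behaviour suggested above, in Python 3. This is not the moztelemetry code: `read_with_retries`, its parameters, and the choice of exceptions to retry on are all assumptions made for illustration. The caller passes in whatever function performs the S3 read.

```python
import logging
import time

logger = logging.getLogger("s3_retry_sketch")


def read_with_retries(read_fn, max_attempts=3, base_delay=1.0,
                      retry_on=(TimeoutError, ConnectionError),
                      sleep=time.sleep):
    """Call read_fn(); retry on timeout-like errors, then log and re-raise.

    Hypothetical helper: the names and defaults are illustrative only.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return read_fn()
        except retry_on as exc:
            if attempt == max_attempts:
                # Out of attempts: log the error and let the caller see it.
                logger.error("giving up after %d attempts: %s", attempt, exc)
                raise
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            delay = base_delay * 2 ** (attempt - 1)
            logger.warning("attempt %d failed (%s); retrying in %.1fs",
                           attempt, exc, delay)
            sleep(delay)
```

Re-raising after the last attempt keeps the failure visible to the scheduler instead of silently returning partial data.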
ShieldOutcomesFunnel is failing because it points to the wrong object in S3. Note that there is whitespace in the name; I'm not sure why it affects this job in particular.

job: https://us-west-2.console.aws.amazon.com/elasticmapreduce/home?region=us-west-2#cluster-details:j-1ZI9ZFNZHRX8P

object being pointed to: s3://telemetry-analysis-code-2/jobs/ShieldOutcomesFunnel/Shield%20Studies%20Offer %20Response %20Outcome%20Funnel.ipynb

object that should be pointed at: s3://telemetry-analysis-code-2/jobs/ShieldOutcomesFunnel/Shield%2520Studies%2520Offer%2C%2520Response%2C%2520Outcome%2520Funnel.ipynb

Job runs: https://us-west-2.console.aws.amazon.com/elasticmapreduce/home?region=us-west-2#cluster-details:j-HQ1VD9WDW2MQ
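
One reading that is consistent with the two object paths above (an assumption, not something confirmed in this bug): the stored key itself contains literal "%20" sequences and commas, and the longer path is that key percent-encoded once more. A short Python 3 sketch of how that would produce the "%2520" and "%2C" sequences:

```python
from urllib.parse import quote, unquote

# Assumption: the key as stored already contains literal "%20" and ",".
stored_name = "Shield%20Studies%20Offer,%20Response,%20Outcome%20Funnel.ipynb"

# Encoding it once more turns "%" into "%25" and "," into "%2C".
encoded_once = quote(stored_name)
print(encoded_once)
# Shield%2520Studies%2520Offer%2C%2520Response%2C%2520Outcome%2520Funnel.ipynb

# Decoding the stored name shows the human-readable title behind it.
readable = unquote(stored_name)
print(readable)
# Shield Studies Offer, Response, Outcome Funnel.ipynb
```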
created_then_anything_by_platform_macintosh is failing because of a class not found exception: com.databricks.spark.csv

see this here: http://nbviewer.jupyter.org/urls/s3-us-west-2.amazonaws.com/telemetry-public-analysis-2/created_then_anything_by_platform_macintosh/data/created_then_anything_by_platform_macintosh.ipynb

I *think* this should be included when users run our Jupyter notebooks. Do we have other included jars? Otherwise we should document how to include them (e.g. sc.addPyFile('some_package.jar')).
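
A hedged sketch of one way a notebook could pull in spark-csv itself. sc.addPyFile is meant for Python dependencies (.py, .zip, .egg), so for a JVM package the usual route is Spark's --packages option, passed through the PYSPARK_SUBMIT_ARGS environment variable. The helper name `pyspark_submit_args` is hypothetical, and the spark-csv coordinate and version below are assumptions that would need to match the cluster's Scala and Spark versions.

```python
import os


def pyspark_submit_args(packages=(), extra_jars=()):
    """Build a PYSPARK_SUBMIT_ARGS value (hypothetical helper)."""
    parts = []
    if packages:
        # Maven coordinates, resolved and downloaded at JVM startup.
        parts += ["--packages", ",".join(packages)]
    if extra_jars:
        # Paths to local JAR files to put on the classpath.
        parts += ["--jars", ",".join(extra_jars)]
    # The value must end with "pyspark-shell" for PySpark to accept it.
    parts.append("pyspark-shell")
    return " ".join(parts)


# Assumption: coordinate and version chosen for illustration only.
SPARK_CSV = "com.databricks:spark-csv_2.10:1.5.0"
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args([SPARK_CSV])
```

This only takes effect if it is set before the SparkContext is created. If the notebook kernel already provides `sc`, the package would instead have to be configured at the cluster level (for example via spark.jars.packages).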
Flags: needinfo?(rvitillo)
re: ShieldOutcomesFunnel

The run I listed above under "Job runs" did not actually run. I copied the file, renamed it to something with no special characters, and it now *actually* ran. I'm not sure what happened, but I do think we should sanitize filenames.

actually runs: https://us-west-2.console.aws.amazon.com/elasticmapreduce/home?region=us-west-2#cluster-details:j-317KCPAR5FTQG
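
A minimal sketch of what filename sanitization could look like, in Python 3. It is not what atmo does; `sanitize_notebook_name` and its rules (decode percent-escapes, drop non-ASCII characters, collapse everything except letters, digits, "_" and "-" into underscores) are assumptions for illustration.

```python
import re
import unicodedata
from urllib.parse import unquote


def sanitize_notebook_name(name, default="notebook"):
    """Return a name containing only [A-Za-z0-9_-] plus a plain extension."""
    # Undo percent-encoding, including names that were encoded twice.
    decoded = name
    for _ in range(3):
        previous, decoded = decoded, unquote(decoded)
        if decoded == previous:
            break
    # Drop characters that have no ASCII equivalent.
    ascii_name = (unicodedata.normalize("NFKD", decoded)
                  .encode("ascii", "ignore").decode("ascii"))
    # Split off the extension so its dot survives the cleanup.
    stem, dot, ext = ascii_name.rpartition(".")
    if not dot:
        stem, ext = ascii_name, ""
    # Collapse each run of unsafe characters into a single underscore.
    stem = re.sub(r"[^A-Za-z0-9_-]+", "_", stem).strip("_") or default
    ext = re.sub(r"[^A-Za-z0-9]+", "", ext)
    return f"{stem}.{ext}" if ext else stem


print(sanitize_notebook_name(
    "Shield%20Studies%20Offer,%20Response,%20Outcome%20Funnel.ipynb"))
# Shield_Studies_Offer_Response_Outcome_Funnel.ipynb
```

Applying this at upload time would mean the scheduler never has to quote the object path when it builds the job command.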
Addon Analysis:

This is failing due to an import error: ImportError: /home/hadoop/anaconda2/lib/libreadline.so.6: undefined symbol: PC

This seems to be a common problem: https://github.com/ContinuumIO/anaconda-issues/issues/152

Notebook: http://nbviewer.jupyter.org/urls/s3-us-west-2.amazonaws.com/telemetry-public-analysis-2/Addon%20analysis/data/AddonAnalysis.ipynb
TxP DAU MAU:

Looks like this one is just a typical Python error: NoneType has no attribute len()

notebook: http://nbviewer.jupyter.org/gist/fbertsch/436a693ae4a7ad5e09bb29b4b809623a
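
For illustration only, a sketch of how this class of error typically arises and one way to guard against it. The record layout and the `events` field are hypothetical and not taken from the notebook; the point is only that calling len() on a missing value raises a TypeError unless None is handled first.

```python
def safe_len(value):
    """Return len(value), treating None as an empty collection."""
    return 0 if value is None else len(value)


# Hypothetical records: the field may be present, None, or missing.
records = [{"events": ["a", "b"]}, {"events": None}, {}]

# len(record.get("events")) would raise TypeError on the last two records.
counts = [safe_len(record.get("events")) for record in records]
print(counts)
# [2, 0, 0]
```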
Points: 2 → 3
Flags: needinfo?(rvitillo)
Depends on: 1308197
Depends on: 1308199
Depends on: 1308201
We've taken care of all the problem jobs.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard