Bug 1238016
Opened 8 years ago
Closed 8 years ago
Experiments scheduled job started failing Wednesday (6-Jan-2016)
Categories
(Cloud Services Graveyard :: Metrics: Pipeline, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: benjamin, Unassigned)
Details
The experiments scheduled job started failing on Wednesday. There are two problems:

A. the failure
B. I didn't get any emails. Probably because the shell script didn't have -e and so it returned successfully.

I believe that https://github.com/mozilla/telemetry-server/tree/master/mapreduce/experiments has the code that's currently running, but I'm not 100% sure.

Beginning job experiments ...
Today is 20160106, and we're gathering experiment data for 20160105
./run.sh: line 32: cd: /home/ubuntu/telemetry-server: No such file or directory
Starting the experiment export for 20160105
/usr/bin/python: No module named mapreduce
Mapreduce job exited with code: 1
./run.sh: line 45: cd: OLDPWD not set
grep: /mnt/telemetry/data.csv: No such file or directory
End of error lines.
Adding header line and removing error lines...
Traceback (most recent call last):
  File "postprocess.py", line 4, in <module>
    import simplejson as json
ImportError: No module named simplejson
Removing temp file
rm: cannot remove ‘/mnt/telemetry/data.csv’: No such file or directory
Listing:
total 0
Done!
Finished job experiments
'./run.sh' exited with code 0

Mark, do you know what might have changed between Tuesday and Wednesday for this?
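For reference, problem B comes down to exit-code propagation: the log shows the Mapreduce step exiting with code 1, yet './run.sh' exited with code 0, so the scheduler had nothing to alert on. In a shell script, `set -e` near the top is the usual way to stop at the first failing command. The same idea is sketched below in Python. This is only an illustration, not the code in run.sh; `run_steps` and the example commands are hypothetical names made up for the sketch.

```python
import subprocess
import sys


def run_steps(steps):
    """Run each command in order and stop at the first failure.

    Returns 0 if every step succeeded, otherwise the exit code of the
    first failing step. Steps after a failure are not run.
    """
    for cmd in steps:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            # Report which step failed, then hand its code to the caller
            # so the job as a whole can be marked as failed.
            print("step failed with code %d: %r" % (result.returncode, cmd),
                  file=sys.stderr)
            return result.returncode
    return 0
```

A wrapper would then end with something like `sys.exit(run_steps(steps))`, so that a failed export leads to a non-zero exit for the whole job instead of a misleading 0.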
Flags: needinfo?(mreid)
Comment 1•8 years ago
There was a release of analysis.t.m.o to deploy Spark 1.5 around that time, though I did not expect it to have any impact on "old style" scheduled jobs. I'm investigating now.
Updated•8 years ago
Flags: needinfo?(mreid)
Comment 2•8 years ago
Wes, a.t.m.o is currently using a worker AMI of telemetry-worker-hvm-20151214 (ami-ea4c538b). It appears that this AMI may not have been created properly. Do you know where it came from? It seems to be missing the ~/telemetry-server directory (and maybe other things). I'll generate a new AMI now; we should update the CFN stack accordingly unless you know of a reason not to.
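A missing directory or module like this could be caught before a job starts with a small preflight check. The sketch below is an assumption about how such a check might look, not something that exists in telemetry-server; `missing_prerequisites` is a hypothetical helper, and the directory and module names passed to it would come from whatever the job actually needs.

```python
import importlib.util
import os


def missing_prerequisites(required_dirs, required_modules):
    """Return a list describing every required item that is absent.

    required_dirs: directory paths ('~' is expanded).
    required_modules: top-level module names to look up without importing.
    """
    missing = []
    for path in required_dirs:
        if not os.path.isdir(os.path.expanduser(path)):
            missing.append("dir:" + path)
    for name in required_modules:
        # find_spec returns None when a top-level module cannot be located.
        if importlib.util.find_spec(name) is None:
            missing.append("module:" + name)
    return missing
```

For the failure in this bug, a call such as `missing_prerequisites(["~/telemetry-server"], ["simplejson"])` on the worker would presumably have reported both items, and a non-empty result could be used to abort the job with a non-zero exit code.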
Flags: needinfo?(whd)
Comment 3•8 years ago
I created that AMI using the build_ami playbook from https://github.com/mozilla/telemetry-server/tree/master/provisioning/ansible. We can probably roll back the worker AMI if there weren't any changes, or build a new one.
Flags: needinfo?(whd)
Comment 4•8 years ago
Update: I created a new AMI telemetry-worker-hvm-20160111 (ami-4bb2a92a), and Wes deployed it. Jobs should start working again now. Let me know if you need help backfilling failed jobs.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Updated•6 years ago
Product: Cloud Services → Cloud Services Graveyard