Closed Bug 1238016 Opened 8 years ago Closed 8 years ago

Experiments scheduled job started failing Wednesday (6-Jan-2016)

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: benjamin, Unassigned)

Details

The experiments scheduled job started failing on Wednesday. There are two problems:

A. the failure
B. I didn't get any failure emails, probably because the shell script didn't use set -e and so it exited successfully despite the errors.
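The second problem can be reproduced in isolation. This is a minimal sketch (not the actual run.sh) showing that without set -e, a script's exit status reflects only its last command, so earlier failures are silently swallowed:

```shell
#!/bin/bash
# Hypothetical demo functions, not part of the real run.sh.

demo_without_e() {
  bash -c '
    false                 # fails with exit 1, but execution continues
    echo "still running"
    true                  # last command succeeds, so the script exits 0
  '
}

demo_with_e() {
  bash -c '
    set -e                # abort immediately when any command fails
    false                 # the script exits here with code 1
    echo "never reached"
  '
}

demo_without_e; echo "without set -e: exit $?"   # exit 0 despite the failure
demo_with_e;    echo "with set -e:    exit $?"   # exit 1, failure propagated
```

With set -e in place, a failing command makes the whole job exit non-zero, which is what a cron wrapper needs in order to send failure mail.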

I believe that https://github.com/mozilla/telemetry-server/tree/master/mapreduce/experiments has the code that's currently running, but I'm not 100% sure.

Beginning job experiments ...
Today is 20160106, and we're gathering experiment data for 20160105
./run.sh: line 32: cd: /home/ubuntu/telemetry-server: No such file or directory
Starting the experiment export for 20160105
/usr/bin/python: No module named mapreduce
Mapreduce job exited with code: 1
./run.sh: line 45: cd: OLDPWD not set
grep: /mnt/telemetry/data.csv: No such file or directory
End of error lines.
Adding header line and removing error lines...
Traceback (most recent call last):
  File "postprocess.py", line 4, in <module>
    import simplejson as json
ImportError: No module named simplejson
Removing temp file
rm: cannot remove ‘/mnt/telemetry/data.csv’: No such file or directory
Listing:
total 0
Done!
Finished job experiments
'./run.sh' exited with code 0
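The ImportError from postprocess.py above suggests the worker image was also missing the simplejson package. As a hedged illustration only (the actual fix was rebuilding the AMI, not patching the script), a script can fall back to the standard-library json module, whose loads/dumps API is compatible for ordinary JSON data:

```python
# Hypothetical fallback sketch, not the real postprocess.py fix:
# prefer simplejson if installed, otherwise use the stdlib json
# module, which exposes a compatible loads/dumps API.
try:
    import simplejson as json
except ImportError:
    import json

record = json.loads('{"experiment": "demo", "count": 3}')
print(json.dumps(record, sort_keys=True))  # {"count": 3, "experiment": "demo"}
```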

Mark, do you know what might have changed between Tuesday and Wednesday for this?
Flags: needinfo?(mreid)
There was a release of analysis.t.m.o to deploy Spark 1.5 around that time, though I did not expect it to have any impact on "old style" scheduled jobs.

I'm investigating now.
Flags: needinfo?(mreid)
Wes, a.t.m.o is currently using a worker AMI of telemetry-worker-hvm-20151214 (ami-ea4c538b) - it appears that this AMI may not have been created properly. Do you know where it came from?

It seems to be missing the ~/telemetry-server directory (and maybe other things).

I'll generate a new AMI now, we should update the CFN stack accordingly unless you know of a reason not to.
Flags: needinfo?(whd)
I created that AMI using the build_ami playbook from https://github.com/mozilla/telemetry-server/tree/master/provisioning/ansible. We can probably roll back to the previous worker AMI if nothing else changed, or build a new one.
Flags: needinfo?(whd)
Update: I created a new AMI telemetry-worker-hvm-20160111 (ami-4bb2a92a), and Wes deployed it.

Jobs should start working again now. Let me know if you need help backfilling failed jobs.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard