Closed Bug 1238016 Opened 8 years ago Closed 8 years ago

Experiments scheduled job started failing Wednesday (6-Jan-2016)

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: benjamin, Unassigned)

Details

The experiments scheduled job started failing on Wednesday. There are two problems:

A. the failure
B. I didn't get any failure emails, probably because the shell script didn't use set -e and so it exited successfully despite the errors.
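The second problem can be reproduced in isolation. This is a minimal sketch (not the actual run.sh) showing that without set -e, a script's exit status reflects only its last command, so earlier failures are silently swallowed:

```shell
#!/bin/bash
# Hypothetical demo functions, not part of the real run.sh.

demo_without_e() {
  bash -c '
    false                 # fails with exit 1, but execution continues
    echo "still running"
    true                  # last command succeeds, so the script exits 0
  '
}

demo_with_e() {
  bash -c '
    set -e                # abort immediately when any command fails
    false                 # the script exits here with code 1
    echo "never reached"
  '
}

demo_without_e; echo "without set -e: exit $?"   # exit 0 despite the failure
demo_with_e;    echo "with set -e:    exit $?"   # exit 1, failure propagated
```

With set -e in place, a failing command makes the whole job exit non-zero, which is what a cron wrapper needs in order to send failure mail.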

I believe that https://github.com/mozilla/telemetry-server/tree/master/mapreduce/experiments has the code that's currently running, but I'm not 100% sure.

Beginning job experiments ...
Today is 20160106, and we're gathering experiment data for 20160105
./run.sh: line 32: cd: /home/ubuntu/telemetry-server: No such file or directory
Starting the experiment export for 20160105
/usr/bin/python: No module named mapreduce
Mapreduce job exited with code: 1
./run.sh: line 45: cd: OLDPWD not set
grep: /mnt/telemetry/data.csv: No such file or directory
End of error lines.
Adding header line and removing error lines...
Traceback (most recent call last):
  File "postprocess.py", line 4, in <module>
    import simplejson as json
ImportError: No module named simplejson
Removing temp file
rm: cannot remove ‘/mnt/telemetry/data.csv’: No such file or directory
Listing:
total 0
Done!
Finished job experiments
'./run.sh' exited with code 0
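The ImportError from postprocess.py above suggests the worker image was also missing the simplejson package. As a hedged illustration only (the actual fix was rebuilding the AMI, not patching the script), a script can fall back to the standard-library json module, whose loads/dumps API is compatible for ordinary JSON data:

```python
# Hypothetical fallback sketch, not the real postprocess.py fix:
# prefer simplejson if installed, otherwise use the stdlib json
# module, which exposes a compatible loads/dumps API.
try:
    import simplejson as json
except ImportError:
    import json

record = json.loads('{"experiment": "demo", "count": 3}')
print(json.dumps(record, sort_keys=True))  # {"count": 3, "experiment": "demo"}
```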

Mark, do you know what might have changed between Tuesday and Wednesday for this?
Flags: needinfo?(mreid)
There was a release of analysis.t.m.o to deploy Spark 1.5 around that time, though I did not expect it to have any impact on "old style" scheduled jobs.

I'm investigating now.
Flags: needinfo?(mreid)
Wes, a.t.m.o is currently using a worker AMI of telemetry-worker-hvm-20151214 (ami-ea4c538b) - it appears that this AMI may not have been created properly. Do you know where it came from?

It seems to be missing the ~/telemetry-server directory (and maybe other things).

I'll generate a new AMI now, we should update the CFN stack accordingly unless you know of a reason not to.
Flags: needinfo?(whd)
I created that AMI using the build_ami playbook from https://github.com/mozilla/telemetry-server/tree/master/provisioning/ansible. We can probably roll back to the previous worker AMI if nothing else changed, or build a new one.
Flags: needinfo?(whd)
Update: I created a new AMI telemetry-worker-hvm-20160111 (ami-4bb2a92a), and Wes deployed it.

Jobs should start working again now. Let me know if you need help backfilling failed jobs.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard