Closed Bug 1225080 Opened 9 years ago Closed 8 years ago

a.t.m.o should support Spark 1.5

Categories: Cloud Services Graveyard :: Metrics: Pipeline (defect, P1)
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: rvitillo; Assigned: rvitillo
The spark1.5 branch of emr-bootstrap-spark contains a working set of scripts to launch an interactive job. Mark, could you add the required changes to a.t.m.o to allow users to launch Spark jobs from the dashboard?

I am going to add support for batch jobs in the next few days.
Flags: needinfo?(mreid)
Flags: needinfo?(mreid) → needinfo?(whd)
I think :whd is going to be looking at this soon.

This amounts to updating the launcher scripts to use a command similar to the one in the "Interactive job" section here:
https://github.com/mozilla/emr-bootstrap-spark/tree/spark1.5#interactive-job
Assignee: nobody → rvitillo
Mark, Wesley, who is supposed to review the patch?
Flags: needinfo?(whd)
Flags: needinfo?(mreid)
I'm reviewing this presently, which includes setting up a staging environment to test current jobs.
Flags: needinfo?(whd)
Flags: needinfo?(mreid)
In the interest of expedition I'm testing the current scheduled spark jobs via a database dump and standalone emr cluster first instead of setting up a proper staging environment.

The first job I tested completed (Addon Analysis), but had very different results from the current output (90 entries on a.t.m.o vs. 16 on spark 1.5.2). I'll look into this more presently.
(In reply to Wesley Dawson [:whd] from comment #6)
> In the interest of expedition I'm testing the current scheduled spark jobs
> via a database dump and standalone emr cluster first instead of setting up a
> proper staging environment.
> 
> The first job I tested completed (Addon Analysis), but had very different
> results from the current output (90 entries on a.t.m.o vs. 16 on spark
> 1.5.2). I'll look into this more presently.

Do you have a log of the job by chance?
I re-ran the addon analysis and found that the mismatched results were PEBKAC. I had futzed with the get_pings parameters because the query for "yesterday" was returning an empty RDD, and while diagnosing this I changed the 1.3.0 notebook to look at release pings instead of nightly, which caused the discrepancy.
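For reference, a minimal sketch of how a "yesterday" submission date might be computed before being passed to get_pings. The get_pings call and its channel/submission_date parameters shown in the comment are assumptions based on the Telemetry analysis API, not taken from this bug; channel="nightly" vs. "release" is what caused the discrepancy described above.

```python
from datetime import datetime, timedelta

def yesterday_yyyymmdd(today=None):
    """Return yesterday's date as a yyyymmdd string (the format
    Telemetry submission dates are commonly keyed by)."""
    today = today or datetime.utcnow()
    return (today - timedelta(days=1)).strftime("%Y%m%d")

# Hypothetical notebook usage (get_pings and its parameters are
# assumptions for illustration only):
# pings = get_pings(sc, channel="nightly",
#                   submission_date=yesterday_yyyymmdd())
```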

Looking at the other scheduled jobs, everything else I've tested so far is working as expected. Spark 1.5.2 seems to lose some precision in some calculations (e.g. 8.62309073992e-05 vs. 8.623090739921546e-05 in 1.3.0), but that's well within the noise.
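The precision difference quoted above can be confirmed as noise with a relative-tolerance comparison; a small sketch using the two values from this comment:

```python
import math

# The two values observed for the same calculation under Spark 1.5.2
# and Spark 1.3.0 (quoted from the comment above).
v_152 = 8.62309073992e-05
v_130 = 8.623090739921546e-05

# The relative difference is on the order of 1e-13, far below any
# threshold that would matter for these analyses.
assert math.isclose(v_152, v_130, rel_tol=1e-9)
```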

I'll finish up testing the remaining jobs and then deploy to a.t.m.o.
Priority: -- → P1
Another issue I noticed and forgot to mention about Spark 1.5: the Spark web UI now does some HTTP redirection, which makes it harder to access via port forwarding. Where before I could simply forward port 4040 to the local host, that now results in a redirect to a different port on the host's internal hostname, something like:

http://ip-172-31-10-159.us-west-2.compute.internal:20888/proxy/application_1449014891980_0011/

If you know what you are doing this can be surmounted, but it's certainly an inconvenience.
Unfortunately the interface never really worked with simple port forwarding; you should use a SOCKS proxy instead.
An update here: the code has been merged and I deployed it. However, during a final round of testing, launching a Spark cluster resulted in a bootstrap failure, so I rolled back. I have not been able to reproduce the bootstrap failure when running things manually, and the logs captured before the EMR cluster is terminated (as an aside, we should enable S3 logging of EMR jobs) don't show anything fatal. I do see the emacs build failing:

make: *** [install-emacs] Error 255
/mnt/var/lib/bootstrap-actions/1/telemetry.sh: line 107: Submodule: command not found

but the final command (ipython) is succeeding. EMR says the bootstrap script is exiting with status code 2.

I've got a shadow copy of a.t.m.o running at ec2-54-213-222-151.us-west-2.compute.amazonaws.com (all scheduled jobs disabled) which I will continue to test with until I figure out what the problem is.
Looks like this was an issue with overriding the system python. https://github.com/mozilla/emr-bootstrap-spark/pull/11 has the fix. As for why it only happens when running via telemetry-dash, I haven't the faintest.
Landed.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard