Closed
Bug 1225080
Opened 9 years ago
Closed 8 years ago
a.t.m.o should support Spark 1.5
Categories
(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: rvitillo, Assigned: rvitillo)
The spark1.5 branch of emr-bootstrap-spark contains a working set of scripts to launch an interactive job. Mark, could you add the required changes to a.t.m.o so that users can launch Spark jobs from the dashboard? I am going to add support for batch jobs in the coming days.
Updated•9 years ago
Flags: needinfo?(mreid)
Updated•9 years ago
Flags: needinfo?(mreid) → needinfo?(whd)
Comment 1•9 years ago
I think :whd is going to be looking at this soon. This amounts to updating the launcher scripts to use a similar command to the "Interactive job" section here: https://github.com/mozilla/emr-bootstrap-spark/tree/spark1.5#interactive-job
Comment 2•9 years ago
https://github.com/mozilla/telemetry-server/pull/133
Flags: needinfo?(whd)
Updated•9 years ago
Assignee: nobody → rvitillo
Comment 3•9 years ago
https://github.com/mozilla/emr-bootstrap-spark/pull/10
Comment 4•9 years ago
Mark, Wesley, who is supposed to review the patch?
Flags: needinfo?(whd)
Flags: needinfo?(mreid)
Comment 5•9 years ago
I'm reviewing this presently, which includes setting up a staging environment to test current jobs.
Flags: needinfo?(whd)
Flags: needinfo?(mreid)
Comment 6•9 years ago
In the interest of expedition I'm testing the current scheduled Spark jobs via a database dump and a standalone EMR cluster first, instead of setting up a proper staging environment.

The first job I tested completed (Addon Analysis), but had very different results from the current output (90 entries on a.t.m.o vs. 16 on Spark 1.5.2). I'll look into this more presently.
Comment 7•9 years ago
(In reply to Wesley Dawson [:whd] from comment #6)
> In the interest of expedition I'm testing the current scheduled spark jobs
> via a database dump and standalone emr cluster first instead of setting up a
> proper staging environment.
>
> The first job I tested completed (Addon Analysis), but had very different
> results from the current output (90 entries on a.t.m.o vs. 16 on spark
> 1.5.2). I'll look into this more presently.

Do you have a log of the job by chance?
Comment 8•9 years ago
I re-ran the addon analysis and found that the reason the results were mismatched was PEBKAC. I had futzed with the get_pings parameters, since the query for "yesterday" was returning an empty RDD. While diagnosing this I changed the 1.3.0 notebook to look at release pings instead of nightly, which caused the discrepancy.

Looking at the other scheduled jobs, everything else I've tested so far is working as expected. Spark 1.5.2 seems to lose some precision in some calculations (e.g. 8.62309073992e-05 vs 8.623090739921546e-05 in 1.3.0), but that's more or less well into the noise. I'll finish up testing the remaining jobs and then deploy to a.t.m.o.
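A quick sanity check (a minimal sketch, not part of the bug itself) confirms that the two printed values above differ only far below any meaningful noise floor:

```python
# The two values reported in the comment above, as IEEE-754 doubles.
a = 8.62309073992e-05        # value printed under Spark 1.5.2
b = 8.623090739921546e-05    # value printed under Spark 1.3.0

diff = abs(a - b)
print(diff)                  # on the order of 1e-17

# Relative error is well below anything that could matter for these jobs.
assert diff / abs(b) < 1e-12
```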
Updated•9 years ago
Priority: -- → P1
Comment 9•9 years ago
Another issue I noticed and forgot to mention about Spark 1.5: the Spark web UI now does some HTTP redirection, which makes it harder to access via port forwarding. Where before I could simply forward port 4040 to the local host, that now results in a redirect to a different port, using the internal hostname of the host itself, something like:

http://ip-172-31-10-159.us-west-2.compute.internal:20888/proxy/application_1449014891980_0011/

If you know what you are doing this can be surmounted, but it's certainly an inconvenience.
Comment 10•9 years ago
Unfortunately the interface never really worked with simple port forwarding; you should use a SOCKS proxy instead.
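For reference, a sketch of the SOCKS approach (the key path and hostname below are placeholders, not taken from this bug):

```shell
# Open a dynamic (SOCKS v5) tunnel on local port 8157 to the EMR master node.
# Replace the key file and hostname with your own values.
ssh -i ~/.ssh/my-emr-key.pem -N -D 8157 hadoop@ec2-xx-xx-xx-xx.us-west-2.compute.amazonaws.com

# Then point the browser (or a proxy switcher) at the SOCKS proxy
# localhost:8157; internal *.compute.internal hostnames, such as the
# YARN proxy URL quoted in comment 9, will then resolve from inside the VPC.
```

Since name resolution happens on the remote end of a SOCKS tunnel, the redirect to the internal hostname works transparently, which plain port forwarding cannot do.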
Comment 11•9 years ago
An update here: the code has been merged and I deployed it. However, when doing a final round of testing, launching a Spark cluster resulted in a bootstrap failure, so I rolled back. I have not been able to reproduce the bootstrap failure when running things manually, and the logs before the EMR cluster is terminated don't show anything fatal (aside: we should enable S3 logging of EMR jobs). I do see the emacs build failing:

make: *** [install-emacs] Error 255
/mnt/var/lib/bootstrap-actions/1/telemetry.sh: line 107: Submodule: command not found

but the final command (ipython) is succeeding. EMR says the bootstrap script is exiting with status code 2.

I've got a shadow copy of a.t.m.o running at ec2-54-213-222-151.us-west-2.compute.amazonaws.com (all scheduled jobs disabled), which I will continue to test with until I figure out what the problem is.
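The S3-logging aside above is straightforward to act on; a sketch using the AWS CLI (cluster name, instance settings, and bucket are placeholder values, not the project's actual configuration):

```shell
# Persist EMR bootstrap and step logs to S3 so they survive cluster
# termination; --log-uri is the relevant flag. Everything else here
# is a placeholder for illustration.
aws emr create-cluster \
  --name "telemetry-spark-test" \
  --release-label emr-4.1.0 \
  --applications Name=Spark \
  --instance-type m3.xlarge \
  --instance-count 1 \
  --log-uri s3://my-log-bucket/emr-logs/ \
  --use-default-roles
```

With `--log-uri` set, EMR copies each node's bootstrap-action output to the bucket a few minutes after it is written, so a failing bootstrap script can be diagnosed even after the cluster terminates.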
Comment 12•9 years ago
Looks like this was an issue with overriding the system python. https://github.com/mozilla/emr-bootstrap-spark/pull/11 has the fix. As for why it only happens when running via telemetry-dash, I haven't the faintest.
Comment 13•8 years ago
Landed.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Updated•6 years ago
Product: Cloud Services → Cloud Services Graveyard