Increase cluster size for pioneer data analysis
Categories
(Data Platform and Tools Graveyard :: Operations, enhancement)
Tracking
(Not tracked)
People
(Reporter: mlopatka, Assigned: whd)
Details
I'd like to request a larger cluster allocation for the Pioneer analysis instances (or at least one of them).
Given the size of the data collected in the JESTr study, I'm seeing the kernel die on simple operations (count, group-by, lambda filtering).
Analysis is intractable given current resources.
Comment 1•7 years ago
We've got a few 11-node clusters; see https://mana.mozilla.org/wiki/display/SVCOPS/Access+to+Pioneer+Resources#AccesstoPioneerResources-ResourceContention. :mlopatka, if you're already using those and having issues (or switch to using them and still hit issues), I will provision new clusters with more nodes until you're able to perform your analysis.
Ah! Thanks for pointing that out. I was indeed using the 1-node clusters.
I'm now having authentication issues accessing the pioneer clusters altogether.
I'm still hitting performance issues.
Currently I sample 0.001 of the total payload (2572.122 GB!) just to get a count.
A temporary increase in cluster size would let me generate a few filtered derived datasets for specific analytical tasks, though retaining the raw data remains critically important as we go through exploratory analysis.
Can we coordinate a resource increase for a time when I can spend cycles on generating the derived datasets?
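As an aside, the fractional-sampling trick described above (count a small sample, scale back up) can be sketched in plain Python. On the actual cluster this would be something like `df.sample(fraction=0.001).count()` in PySpark; the function and numbers below are purely illustrative and not from this bug.

```python
import random

def estimated_count(records, fraction, seed=42):
    """Estimate the total record count from a uniform random sample.

    Mirrors counting a 0.001 sample of a payload that is too large to
    scan in full, then scaling the sampled count back up by 1/fraction.
    """
    rng = random.Random(seed)
    sampled = sum(1 for _ in records if rng.random() < fraction)
    return int(sampled / fraction)

# Toy illustration with a known population size: the estimate lands
# close to the true count, with sampling error on the order of
# 1/sqrt(sampled) relative.
true_n = 1_000_000
estimate = estimated_count(range(true_n), fraction=0.001)
```

The trade-off is the one hit in this bug: the sampled count is fast but only approximate, and any derived dataset still requires at least one full pass over the raw data.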
Comment 4•7 years ago
(In reply to mlopatka from comment #3)
A temporary increase in cluster size would let me generate a few filtered derived datasets for specific analytical tasks, though retaining the raw data remains critically important as we go through exploratory analysis.
Can we coordinate a resource increase for a time when I can spend cycles on generating the derived datasets?
Yes, I'm currently looking into the EMR sizing we generally use for datasets of this magnitude. When I have that information we should coordinate the timing.
Comment 5•7 years ago
I have determined that for a comparable workload (main summary, low TBs) we've used a cluster of 40 nodes, and that a job across all data takes between 6-12 hours. Since there will probably need to be some exploratory work when creating derived dataset(s) I think budgeting 3-5 days of running the expanded cluster seems reasonable. We can also decommission the 10-node clusters and use their capacity to reduce the budgetary impact if more time is required.
Let me know when you would like the cluster to be made available.
Comment 6•7 years ago
Let me know when you would like the cluster to be made available.
I have allocated 8 hours on Friday (February 8th), from 08:00 to 16:00 (UTC+1), to work on generating derived datasets from the JESTr payload.
Is it possible to have the monstrous 40-node cluster ready to roll on that day?
Comment 7•7 years ago
(In reply to mlopatka from comment #6)
Let me know when you would like the cluster to be made available.
I have allocated 8 hours on Friday (February 8th), from 08:00 to 16:00 (UTC+1), to work on generating derived datasets from the JESTr payload.
Is it possible to have the monstrous 40-node cluster ready to roll on that day?
Sounds good. I will update the bug when the cluster is available, which I will provision in the evening on Thursday (PST).
Comment 8•7 years ago
I've bumped the number of core nodes in pioneer-analysis-7.data.mozaws.net to 40. With this I ran a simple count on 1% of the data in about 10 minutes. If this is still not performant enough, I will provide contact info via email so you can reach me directly. :sunahsuh (UTC-06:00) is the best person on the data engineering side to discuss potential performance issues with.
Comment 9•7 years ago
:mlopatka was able to work with the data and didn't run into any unexpected issues. I've scaled down the cluster for now, but there is remaining work around generating derived datasets. We're going to use this bug to coordinate additional cluster size increases for such work.
Comment 10•6 years ago
I'd like to request another large cluster for processing the JESTr data. I've got a job ready to go that I'd like to run on the full dataset (converting the raw pings into a dataframe for downstream use). I've tried the 11-node clusters, but the job kept crashing (OOM) even after tweaking Spark's memory settings and partitioning.
I expect a 30-node cluster should be sufficient. I'd be able to launch the job as soon as it's ready. I expect to need it for <1 day for the job to run, and maybe an extra day or two in case anything goes wrong.
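For reference, the kind of Spark memory and partitioning tweaks mentioned above typically look like the following PySpark session configuration. This is a hedged sketch: the actual settings tried in this bug are not recorded, and the values shown are placeholders rather than recommendations.

```python
# Hypothetical PySpark configuration illustrating the memory and
# partitioning knobs one would tune before resorting to a larger
# cluster. Values are placeholders, not the ones used in this bug.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("jestr-derived-dataset")
    # Give each executor more heap, plus off-heap headroom for
    # shuffle buffers and native allocations.
    .config("spark.executor.memory", "20g")
    .config("spark.executor.memoryOverhead", "4g")
    # Increase shuffle parallelism so individual partitions stay
    # small enough to fit in executor memory.
    .config("spark.sql.shuffle.partitions", "2000")
    .getOrCreate()
)

# Repartitioning the raw pings before a wide transformation can also
# keep per-task memory bounded, e.g.:
# df = spark.read.parquet(raw_pings_path).repartition(2000)
```

When a job still OOMs after this kind of tuning, the remaining levers are a larger cluster (as requested here) or restructuring the job to avoid wide shuffles over the full payload.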
Comment 11•6 years ago
I've bumped the number of core nodes in pioneer-analysis-7.data.mozaws.net to 30.
Comment 12•6 years ago
:dzeber, let me know (by cancelling the NI) when I can scale this cluster back down.
Comment 14•6 years ago
The cluster has been scaled back down.
Comment 15•6 years ago
Given https://bugzilla.mozilla.org/show_bug.cgi?id=1590211 we're not likely to need this bug anymore.
Updated•3 years ago