Closed Bug 1457151 Opened 6 years ago Closed 6 years ago

Install R and packages on the Pioneer analysis clusters

Categories

(Data Platform and Tools Graveyard :: Operations, defect)

Priority: Not set
Severity: critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rweiss, Assigned: whd)

Details

(Whiteboard: [DataOps])

Attachments

(2 files)

Please install R (https://www.r-project.org/) version 3.5 (ideally) or 3.4 on the Pioneer analysis clusters, along with the following packages:

* the tidyverse package (https://tidyverse.tidyverse.org/) and all dependencies
* ggplot2
* knitr
* plyr
* reshape2
* data.table
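For reference, a minimal sketch of what the package installation amounts to once R itself is present on the nodes (the CRAN mirror and running from a root R session are assumptions on my part; the exact mechanism on EMR may differ):

    # run from an R session with write access to the site library (e.g. sudo R)
    pkgs <- c("tidyverse", "ggplot2", "knitr", "plyr", "reshape2", "data.table")
    install.packages(pkgs, repos = "https://cloud.r-project.org")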
Whiteboard: [DataOps]
I'm going to take a stab at this today. Given that we don't presently support this workflow on ATMO/EMR clusters, and that it may be less trivial than simply running "sudo apt-get"-type commands on the cluster, I may end up requesting assistance from :rweiss and :joy next week with setting up this environment. In the future we should support such workflows via our standard analysis infrastructure tooling, since Pioneer clusters are simply a special-case variant of those with specific ACLs.
Assignee: nobody → whd
Status: NEW → ASSIGNED
pioneer-analysis-7.data.mozaws.net (10 core nodes) is available now with sparkR and these libraries. R is at version 3.4.1, the latest EMR-supported release. RStudio is also installed and available on port 8787, but it doesn't currently see the metastore datasets, for reasons as yet unknown (sparkR can see them). I'll look into that next week.
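For reference, a quick way to double-check metastore visibility from the sparkR console (a minimal sketch; it assumes the shell's pre-initialized Spark session and the default Hive metastore configuration):

    # in the sparkR shell a SparkSession is already set up, so this should work directly
    collect(sql("SHOW TABLES"))   # tables registered in the Hive metastore
    tableNames()                  # equivalent SparkR catalog helper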
I can confirm that I can access the RStudio interface here: https://screenshots.firefox.com/paUMdA7a4NpJGmG8/192.168.1.17 However, this is connecting to pioneer-analysis-1.data.aws.net. When I connect to 7, I am presented with an authentication log-in screen (https://screenshots.firefox.com/4uckCROp5DDHqi6x/192.168.1.17). I will work out of 1 for now, but I assume I should be using 7, not 1. I agree; if this exercise proves useful, we should incorporate this workflow into our standard tooling. Where is the best place to track that conversation separately from this bug?
Flags: needinfo?(whd)
I used analysis-1 for testing some RStudio installation issues, but you should not be using it (it's a one-node cluster). The authentication for RStudio on 7 (which is indeed the cluster you should be using) is hadoop/hadoop. However, as I mentioned, RStudio (or, specifically, sparklyr) is not presently hooked up to the metastore and is currently unsupported. You should be using sparkR at the console for now. We can track the RStudio work in the other bug you filed.
Flags: needinfo?(whd)
Rebecca, are you all set with this? Katie gave me a heads-up on this, so I want to close the loop and make sure things are working for you, or find out whether there is additional work needed. Please let me know. Thanks, M
Flags: needinfo?(rweiss)
Attached file r errors
Everything was working fine up until this afternoon, when I received the attached error from SparkR. At this point I can't load any data (namely the parquet data) without getting additional errors, typically around configuring executor memory overhead and exceeding executor memory. I am getting the same errors with sparklyr. My best guess is that I set off something that has run off with all the memory. Can we restart this cluster?
Flags: needinfo?(rweiss) → needinfo?(whd)
Attached file output.log
Here's my output, running on the same cluster (7). I see that you are also logged in. At any rate it appears to work for me, so you might try logging out and back in again. I did not save the workspace after running these commands.
Flags: needinfo?(whd)
I will also add that I am not surprised you are running into issues accessing the data in a somewhat untested and unsupported environment. EMR is configured to use the same tuning parameters that our standard analysis clusters use, but I don't know whether they extend to SparkR, or how well our Data Engineering team supports SparkR on our platform. I anticipate you may encounter more issues. I suggest :sunahsuh or possibly :joy as good points of contact for troubleshooting some of these data-connectivity issues. I can also set up some additional clusters configured in the same way as 7 as a preemptive attempt to avoid blocking you on these kinds of failures, but if we end up needing to do nontrivial performance tuning and debugging, that is likely to introduce significant delay.
Understood; it has been very clear that this is an experimental and unsupported workflow. For record keeping, it appears that this was ultimately related to the number of executor instances I configured, as well as the amount of memory allocated to each instance and to the driver. This would explain why the observed behavior (both yours and mine) was inconsistent: sometimes queries worked, sometimes they did not (and returned errors that made it seem as if the reference to the data was not persistent). Ultimately, for this particular dataset and analysis, 50 executor instances at 3 GB each, plus 3 GB for the driver, appears to have solved this specific issue. However, in chatting with :sunahsuh, we both agreed that there is an existing question about the health of these long-running clusters. This is an issue that is independent of R and is related to the Pioneer analysis environments themselves. :melissa, what's your preferred way to raise and track this issue?
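For the record, expressed as a sparklyr connection config the working settings amount to roughly the following (the master URL and exact option spelling are my assumptions here, since I set them slightly differently in practice):

    # illustrative sparklyr config matching the settings that worked for this analysis
    library(sparklyr)
    config <- spark_config()
    config$spark.executor.instances <- 50
    config$spark.executor.memory    <- "3g"
    config$spark.driver.memory      <- "3g"
    sc <- spark_connect(master = "yarn-client", config = config)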
Flags: needinfo?(moconnor)
(In reply to Rebecca Weiss from comment #9)
> However, in chatting with :sunahsuh, we both agreed that there is an
> existing question about the health of these long-running clusters. This is
> an issue that is independent of R and is related to the Pioneer analysis
> environments themselves. :melissa, what's your preferred way to raise and
> track this issue?

The preferred way to raise and track this issue is with a separate bug. I've asked :sunahsuh to file said bug and include the particular concerns, which she will do tomorrow. It sounds like these concerns are related to long-lived EMR processes consuming potentially unbounded resources on the master node, which could probably be remedied by e.g. aggressive Spark history pruning, HDFS pruning, etc.
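To sketch what "aggressive pruning" might look like on the history-server side, it could presumably be handled with spark-defaults.conf entries along these lines (the retention values are illustrative, not tuned for these clusters):

    spark.history.fs.cleaner.enabled   true
    spark.history.fs.cleaner.interval  1d
    spark.history.fs.cleaner.maxAge    7d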
Flags: needinfo?(moconnor)
I've re-imaged analysis clusters 4-5 to also include these R packages, so all the clusters designed for non-trivial computation have them. I'm going to file a separate bug for upstreaming the R installation logic to our standard analysis clusters.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Product: Data Platform and Tools → Data Platform and Tools Graveyard