Closed Bug 1466936 Opened 7 years ago Closed 7 years ago

Make pyspark_hyperloglog work on atmo (and airflow)

Categories

(Data Platform and Tools :: General, defect, P1)

Points: 5

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: wlach, Assigned: klukas)

References

Details

(Whiteboard: [DataPlatform])

:amiyaguchi made spark-hyperloglog callable from Python in bug 1305087. Unfortunately, I get an error when trying to run it from an ATMO notebook (after pip installing pyspark_hyperloglog on the command line):

/mnt/anaconda2/lib/python2.7/site-packages/pyspark_hyperloglog/hll.pyc in register()
      8     spark.createDataFrame([{'a': 1}]).count()
      9     sc = spark.sparkContext
---> 10     sc._jvm.com.mozilla.spark.sql.hyperloglog.functions.package.registerUdf()
     11
     12

TypeError: 'JavaPackage' object is not callable

We think this is because the Scala package is not actually exposed to pyspark. Ideally, pyspark_hyperloglog would be available out-of-the-box in ATMO (or whatever future services we use pyspark from, e.g. Databricks). We should also make sure it works from telemetry-airflow.
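For reference, here is a rough reconstruction of the register() call the traceback points at (a sketch based only on the traceback above, not the library's actual source). The TypeError means py4j cannot find the com.mozilla... classes on the JVM classpath, so the dotted name falls back to a generic JavaPackage placeholder, which is not callable:

# Sketch of pyspark_hyperloglog.hll.register(), reconstructed from the traceback above.
from pyspark.sql import SparkSession

def register():
    spark = SparkSession.builder.getOrCreate()
    # Touch a DataFrame so the session (and its JVM) is fully started.
    spark.createDataFrame([{'a': 1}]).count()
    sc = spark.sparkContext
    # Raises "TypeError: 'JavaPackage' object is not callable" when the
    # spark-hyperloglog jar is missing from the driver's classpath.
    sc._jvm.com.mozilla.spark.sql.hyperloglog.functions.package.registerUdf()

Supplying the jar at launch time (e.g. via --jars or spark.jars.packages) is what makes that JVM namespace resolvable.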
Blocks: 1450329
I've reproduced the same error on a Spark cluster managed by Databricks. It looks like there is some issue getting the jar registered with the Spark executors. Here is the notebook with the same error as in comment #1: https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/16095

The ipynb is also available as a gist: https://gist.github.com/acmiyaguchi/9c05b7ecefd35e9e0aaa67c8354090ff
Points: --- → 2
Priority: -- → P2
Adding that this blocks the release of MozData, since MozData uses the same method for distributing jars and will hit the same issue.
Blocks: 1460343
Whiteboard: [DataPlatform]
No longer blocks: 1460343
Component: Telemetry Analysis Service (ATMO) → Telemetry APIs for Analysis
Assignee: amiyaguchi → jklukas
Priority: P2 → P1
I recently used a library called GraphFrames, which is available through the spark-packages repository.[1] The Python files in "$ROOT/python" are compiled via `python -m compileall` and then included in the jar.[2] This is how pyspark-hyperloglog _should_ work, but something isn't set up correctly. The package should be attached to the cluster as per the story in bug 1305087 and the instructions on the spark-hyperloglog package page.[3] I think the job has been failing because we've been installing the pyspark package instead of the spark jar.

[1] https://github.com/graphframes/graphframes
[2] https://github.com/databricks/sbt-spark-package/blob/master/src/main/scala/sbtsparkpackage/SparkPackagePlugin.scala#L178-L203
[3] https://spark-packages.org/package/vitillo/spark-hyperloglog
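For comparison, this is roughly how the GraphFrames package gets consumed from pyspark; the coordinates match the jar name that shows up later in this bug, and the import is GraphFrames' usual Python entry point (an illustrative sketch, not a verified transcript from our clusters):

$ pyspark --packages graphframes:graphframes:0.5.0-spark2.1-s_2.11
>>> # The compiled python files shipped inside the package's jar are what make
>>> # this import work without a separate pip install -- the mechanism we'd
>>> # like spark-hyperloglog to use.
>>> from graphframes import GraphFrame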
I'm not convinced that the Python packaging we're doing on spark-hyperloglog is going to work. It works for pyspark, I think, because they end up providing the pyspark executable and can reference jars from there. Even if we correctly get the jars included in the Python wheel, I'm not sure there's a good option for making sure they're accessible from Spark.

It seems like Spark "packages" may be the way to go. The spark-hyperloglog project is already set up with the relevant sbt plugin to create a zip uploadable to spark-packages.org, which can then be passed to Spark via --packages. If we can get our current code onto spark-packages.org, we can make it available by default in ATMO by modifying the configuration file [4] to set spark.jars.packages to download spark-hyperloglog; that should include making the Python packaging available to the pyspark shell and notebooks.

I'm going to try forking the repo and uploading my fork to spark-packages.org to prove the concept.

[4] https://console.aws.amazon.com/s3/object/telemetry-spark-emr-2/configuration/configuration.json
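For illustration, the relevant piece of that configuration file would presumably look something like the standard EMR spark-defaults classification below; the exact shape of our configuration.json is an assumption here, and the package coordinates are the ones that end up being published later in this bug:

{
  "Classification": "spark-defaults",
  "Properties": {
    "spark.jars.packages": "mozilla:spark-hyperloglog:2.2.0"
  }
}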
Sadly, it appears that running pyspark with `--packages` doesn't pull in the python lib from the package. Based on a few issues I'm seeing on various pyspark packages, this may be an EMR-specific problem, something about the overall Spark configuration; one likely culprit is that EMR sets "spark.executorEnv.PYTHONPATH". So the quick-and-dirty change here is to treat the python lib and the jar as separate things: the python lib pulled down via pip install, and the jar fetched either from S3 directly or via spark-packages.org.
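Concretely, the quick-and-dirty split would look something like this; the S3 location below is a hypothetical placeholder, not a real path:

$ pip install pyspark_hyperloglog
$ aws s3 cp s3://example-bucket/jars/spark-hyperloglog-assembly.jar .   # hypothetical jar location
$ pyspark --jars spark-hyperloglog-assembly.jar
>>> from pyspark_hyperloglog import hll
>>> hll.register()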
It looks like `--packages` actually does get an entry on sys.path on EMR, but the entry doesn't actually exist on the filesystem:

ls: cannot access /mnt/spark-1eee4ab9-b8f3-4748-9d56-3553f549b47f/userFiles-5eedf4d9-8885-4a97-ac8a-9e154eb7b65d/graphframes_graphframes-0.5.0-spark2.1-s_2.11.jar: No such file or directory
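A quick way to spot these dangling entries from a pyspark session (a throwaway diagnostic, not part of any fix):

>>> import os, sys
>>> # List sys.path entries that point at paths that don't exist on disk.
>>> [p for p in sys.path if p and not os.path.exists(p)]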
Ugly workaround for python to be able to access these packages:

import sys
pyfiles = str(sc.getConf().get(u'spark.submit.pyFiles')).split(',')
sys.path.extend(pyfiles)
I've submitted a support case to AWS [5] asking about EMR's strange behavior here.

[5] https://console.aws.amazon.com/support/v1?region=us-west-2#/case/?displayId=5170206571&language=en
My general plan of action at this point is the following:

- Remove the pyspark_hyperloglog package from PyPI
- PR mozilla/spark-hyperloglog to clean it up for including the python bits in a spark-packages.org package
- Create a spark-packages.org package for the project

Then, we make this package available in the following places:

- As a library on our Databricks clusters; this should just work
- Add to new ATMO clusters by adding the lib to a new 'spark.jars.packages' config value; I believe this would happen in emr-bootstrap-spark [6]
- Figure out how to make it available to Airflow

On ATMO clusters, the lib would then be available in Scala. To make it available in pyspark or notebooks, it would be necessary to include the workaround of extending sys.path based on the contents of the Spark configuration.

[6] https://github.com/mozilla/emr-bootstrap-spark/blob/master/ansible/files/configuration/configuration.json
We've landed a first deploy of spark-hyperloglog at spark-packages.org [7] and it's usable from the ATMO command line via:

$ pyspark --packages mozilla:spark-hyperloglog:2.2.0
>>> import sys; sys.path.extend(str(sc.getConf().get(u'spark.submit.pyFiles')).split(','))
>>> from pyspark_hyperloglog import hll
>>> hll.register()

To have each new SparkSession load the package automatically, you can modify the Spark cluster configuration:

echo "spark.jars.packages mozilla:spark-hyperloglog:2.2.0" | sudo tee --append /usr/lib/spark/conf/spark-defaults.conf

That should allow Jupyter notebooks and the like to run the same code as the pyspark session above. You could also avoid having to run the `sys.path` line by doing a `pip install pyspark_hyperloglog`.

Still to think about is whether we can mirror the package to our S3 maven repo. Also still to do: get this added by default to our EMR bootstrap and to Databricks.

[7] https://spark-packages.org/package/mozilla/spark-hyperloglog
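Once registered, the UDFs are usable from Spark SQL. A hedged usage sketch, assuming the registered function names are hll_create, hll_merge, and hll_cardinality (the names I believe spark-hyperloglog registers; check the package docs if in doubt):

>>> df = spark.createDataFrame([{'client_id': 'a'}, {'client_id': 'b'}, {'client_id': 'a'}])
>>> df.createOrReplaceTempView('clients')
>>> # Build an HLL sketch per row, merge the sketches, then read the approximate distinct count.
>>> spark.sql("SELECT hll_cardinality(hll_merge(hll_create(client_id, 12))) AS approx_clients FROM clients").show()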
I forgot to mention that it's necessary to restart Spark to make any changes to spark-defaults.conf available:

$ sudo restart hadoop-yarn-resourcemanager
`sudo restart` appears to have been a lie. You have to stop and then start:

sudo stop hadoop-yarn-resourcemanager
sudo start hadoop-yarn-resourcemanager
The "mozilla:spark-hyperloglog:2.2.0" package is now "attached" to both the datascience and shared_serverless clusters on databricks. Here's a notebook that demonstrates it working and shows that it's pulling the python bindings from the package jar: https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/18509
Working on a PR for emr-bootstrap to get this package loaded by default in ATMO: https://github.com/mozilla/emr-bootstrap-spark/pull/393
Would like to merge this on Monday along with various python package update PRs. Asking :mreid about policy around that.
Had a discussion in Slack, and it sounds like we generally see the risk/reward ratio for pyup-bot PRs as low, so I'm just going to merge the py4j update and my PR on Monday.
PRs are merged, so closing this out.

In summary, the conclusion here is that we'd ideally just be able to include the python files at the root of the jar, publish that jar to our S3 maven repo, and then add the package via maven coordinates to Databricks clusters and to emr-bootstrap-spark.

The above is sufficient for Databricks, but not for EMR due to a bug there. The best option seems to be to additionally publish the python library (without any jar) to PyPI and include it in python-requirements.txt in emr-bootstrap-spark. AWS has confirmed this is a known issue but won't give any ETA for fixing the behavior, so we should be prepared to support this workaround long-term.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
When we deployed this, we learned it was only compatible with EMR 5.13+, so we had to roll back. I've posted a WIP PR for a new approach that relies on staging an uberjar with all our desired Spark packages assembled together: https://github.com/mozilla/emr-bootstrap-spark/pull/407
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
See Also: → 1477760
Points: 2 → 5
Broke out uberjar creation into https://github.com/mozilla/telemetry-spark-packages-assembly and now have that integrated into the PR for review that will get this running on ATMO: https://github.com/mozilla/emr-bootstrap-spark/pull/407
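Once the assembly jar is staged, the emr-bootstrap change would presumably point new clusters at it via the Spark configuration rather than at spark-packages.org. A hedged sketch of what that could look like on a cluster, using a hypothetical jar path (the actual mechanism in the PR may differ):

echo "spark.jars /usr/lib/spark/jars/telemetry-spark-packages-assembly.jar" | sudo tee --append /usr/lib/spark/conf/spark-defaults.conf   # hypothetical jar path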
Depends on: 1478093
The emr-bootstrap change is deployed and an email sent to fx-data-platform.
Status: REOPENED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Also sent email to fx-data-dev
Component: Telemetry APIs for Analysis → General