Closed Bug 1466936 Opened 7 years ago Closed 7 years ago

Make pyspark_hyperloglog work on atmo (and airflow)

Categories

(Data Platform and Tools :: General, defect, P1)

Points: 5

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: wlach, Assigned: klukas)

References

Details

(Whiteboard: [DataPlatform])

:amiyaguchi made spark-hyperloglog callable from Python in bug 1305087. Unfortunately, I get an error when trying to run it from an ATMO notebook (after pip installing pyspark_hyperloglog on the command line):

/mnt/anaconda2/lib/python2.7/site-packages/pyspark_hyperloglog/hll.pyc in register()
      8     spark.createDataFrame([{'a': 1}]).count()
      9     sc = spark.sparkContext
---> 10     sc._jvm.com.mozilla.spark.sql.hyperloglog.functions.package.registerUdf()
     11
     12

TypeError: 'JavaPackage' object is not callable

We think this is because the Scala package is not actually exposed to pyspark. Ideally, pyspark_hyperloglog would be available out-of-the-box in ATMO (or whatever future services we use pyspark from, e.g. Databricks). We should also make sure it works from telemetry-airflow.
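For reference, here is a rough reconstruction of the register() call the traceback points at (a sketch based only on the traceback above, not the library's actual source). The TypeError means py4j cannot find the com.mozilla... classes on the JVM classpath, so the dotted name falls back to a generic JavaPackage placeholder, which is not callable:

# Sketch of pyspark_hyperloglog.hll.register(), reconstructed from the traceback above.
from pyspark.sql import SparkSession

def register():
    spark = SparkSession.builder.getOrCreate()
    # Touch a DataFrame so the session (and its JVM) is fully started.
    spark.createDataFrame([{'a': 1}]).count()
    sc = spark.sparkContext
    # Raises "TypeError: 'JavaPackage' object is not callable" when the
    # spark-hyperloglog jar is missing from the driver's classpath.
    sc._jvm.com.mozilla.spark.sql.hyperloglog.functions.package.registerUdf()

Supplying the jar at launch time (e.g. via --jars or spark.jars.packages) is what makes that JVM namespace resolvable.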
Blocks: 1450329
I've reproduced the same error on a Spark cluster managed by Databricks. It looks like there is some issue getting the jar registered with the Spark executors. Here is the notebook with the same error as in comment #1: https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/16095

The ipynb is also available as a gist: https://gist.github.com/acmiyaguchi/9c05b7ecefd35e9e0aaa67c8354090ff
Points: --- → 2
Priority: -- → P2
Adding that this blocks the release of MozData, since MozData uses the same method for distributing jars and will hit the same issue.
Blocks: 1460343
Whiteboard: [DataPlatform]
No longer blocks: 1460343
Component: Telemetry Analysis Service (ATMO) → Telemetry APIs for Analysis
Assignee: amiyaguchi → jklukas
Priority: P2 → P1
I recently used a library called GraphFrames, which is available through the spark-packages repository.[1] The Python files in "$ROOT/python" are compiled via `python -m compileall` and then included in the jar.[2] This is how pyspark-hyperloglog _should_ work, but something isn't set up correctly. The package should be attached to the cluster as per the story in bug 1305087 and the instructions on the spark-hyperloglog package page.[3] I think the job has been failing because we've been installing the pyspark package instead of the spark jar.

[1] https://github.com/graphframes/graphframes
[2] https://github.com/databricks/sbt-spark-package/blob/master/src/main/scala/sbtsparkpackage/SparkPackagePlugin.scala#L178-L203
[3] https://spark-packages.org/package/vitillo/spark-hyperloglog
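For comparison, this is roughly how the GraphFrames package gets consumed from pyspark; the coordinates match the jar name that shows up later in this bug, and the import is GraphFrames' usual Python entry point (an illustrative sketch, not a verified transcript from our clusters):

$ pyspark --packages graphframes:graphframes:0.5.0-spark2.1-s_2.11
>>> # The compiled python files shipped inside the package's jar are what make
>>> # this import work without a separate pip install -- the mechanism we'd
>>> # like spark-hyperloglog to use.
>>> from graphframes import GraphFrame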
I'm not convinced that the Python packaging we're doing on spark-hyperloglog is going to work. It works for pyspark, I think, because they end up providing the pyspark executable and can reference jars from there. Even if we correctly get the jars included in the Python wheel, I'm not sure there's a good option for making sure they're accessible from Spark.

It seems like Spark "packages" may be the way to go. The spark-hyperloglog project is already set up with the relevant sbt plugin to create a zip uploadable to spark-packages.org, which can then be passed to Spark via --packages. If we can get our current code onto spark-packages.org, we can make it available by default in ATMO by modifying the configuration file [4] to set spark.jars.packages to download spark-hyperloglog; that should include making the Python packaging available to the pyspark shell and notebooks.

I'm going to try forking the repo and uploading my fork to spark-packages.org to prove the concept.

[4] https://console.aws.amazon.com/s3/object/telemetry-spark-emr-2/configuration/configuration.json
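For illustration, the relevant piece of that configuration file would presumably look something like the standard EMR spark-defaults classification below; the exact shape of our configuration.json is an assumption here, and the package coordinates are the ones that end up being published later in this bug:

{
  "Classification": "spark-defaults",
  "Properties": {
    "spark.jars.packages": "mozilla:spark-hyperloglog:2.2.0"
  }
}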
Sadly, it appears that running pyspark with `--packages` doesn't pull in the python lib from the package. Based on a few issues I'm seeing on various pyspark packages, this may be an EMR-specific problem, something about the overall Spark configuration; one likely culprit is that EMR sets "spark.executorEnv.PYTHONPATH". So the quick-and-dirty change here is to treat the python lib and the jar as separate things: the python lib pulled down via pip install, and the jar fetched either from S3 directly or via spark-packages.org.
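Concretely, the quick-and-dirty split would look something like this; the S3 location below is a hypothetical placeholder, not a real path:

$ pip install pyspark_hyperloglog
$ aws s3 cp s3://example-bucket/jars/spark-hyperloglog-assembly.jar .   # hypothetical jar location
$ pyspark --jars spark-hyperloglog-assembly.jar
>>> from pyspark_hyperloglog import hll
>>> hll.register()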
It looks like `--packages` actually does get an entry on sys.path on EMR, but the entry doesn't actually exist on the filesystem:

ls: cannot access /mnt/spark-1eee4ab9-b8f3-4748-9d56-3553f549b47f/userFiles-5eedf4d9-8885-4a97-ac8a-9e154eb7b65d/graphframes_graphframes-0.5.0-spark2.1-s_2.11.jar: No such file or directory
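A quick way to spot these dangling entries from a pyspark session (a throwaway diagnostic, not part of any fix):

>>> import os, sys
>>> # List sys.path entries that point at paths that don't exist on disk.
>>> [p for p in sys.path if p and not os.path.exists(p)]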
Ugly workaround for python to be able to access these packages:

import sys
pyfiles = str(sc.getConf().get(u'spark.submit.pyFiles')).split(',')
sys.path.extend(pyfiles)
I've submitted a support case to AWS [5] asking about EMR's strange behavior here.

[5] https://console.aws.amazon.com/support/v1?region=us-west-2#/case/?displayId=5170206571&language=en
My general plan of action at this point is the following:

- Remove the pyspark_hyperloglog package from PyPI
- PR mozilla/spark-hyperloglog to clean it up for including the python bits in a spark-packages.org package
- Create a spark-packages.org package for the project

Then, we make this package available in the following places:

- As a library on our Databricks clusters; this should just work
- Add to new ATMO clusters by adding the lib to a new 'spark.jars.packages' config value; I believe this would happen in emr-bootstrap-spark [6]
- Figure out how to make it available to Airflow

On ATMO clusters, the lib would then be available in Scala. To make it available in pyspark or notebooks, it would be necessary to include the workaround of extending sys.path based on the contents of the Spark configuration.

[6] https://github.com/mozilla/emr-bootstrap-spark/blob/master/ansible/files/configuration/configuration.json
We've landed a first deploy of spark-hyperloglog at spark-packages.org [7] and it's usable from the ATMO command line via:

$ pyspark --packages mozilla:spark-hyperloglog:2.2.0
>>> import sys; sys.path.extend(str(sc.getConf().get(u'spark.submit.pyFiles')).split(','))
>>> from pyspark_hyperloglog import hll
>>> hll.register()

To have each new SparkSession load the package automatically, you can modify the Spark cluster configuration:

echo "spark.jars.packages mozilla:spark-hyperloglog:2.2.0" | sudo tee --append /usr/lib/spark/conf/spark-defaults.conf

That should allow Jupyter notebooks and the like to run the same code as the pyspark session above. You could also avoid having to run the `sys.path` line by doing a `pip install pyspark_hyperloglog`.

Still to think about is whether we can mirror the package to our S3 maven repo. Also still to do: get this added by default to our EMR bootstrap and to Databricks.

[7] https://spark-packages.org/package/mozilla/spark-hyperloglog
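Once registered, the UDFs are usable from Spark SQL. A hedged usage sketch, assuming the registered function names are hll_create, hll_merge, and hll_cardinality (the names I believe spark-hyperloglog registers; check the package docs if in doubt):

>>> df = spark.createDataFrame([{'client_id': 'a'}, {'client_id': 'b'}, {'client_id': 'a'}])
>>> df.createOrReplaceTempView('clients')
>>> # Build an HLL sketch per row, merge the sketches, then read the approximate distinct count.
>>> spark.sql("SELECT hll_cardinality(hll_merge(hll_create(client_id, 12))) AS approx_clients FROM clients").show()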
I forgot to mention that it's necessary to restart Spark to make any changes to spark-defaults.conf available:

$ sudo restart hadoop-yarn-resourcemanager
`sudo restart` appears to have been a lie. You have to stop and then start:

sudo stop hadoop-yarn-resourcemanager
sudo start hadoop-yarn-resourcemanager
The "mozilla:spark-hyperloglog:2.2.0" package is now "attached" to both the datascience and shared_serverless clusters on databricks. Here's a notebook that demonstrates it working and shows that it's pulling the python bindings from the package jar: https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/18509
Working on a PR for emr-bootstrap to get this package loaded by default in ATMO: https://github.com/mozilla/emr-bootstrap-spark/pull/393
Would like to merge this on Monday along with various python package update PRs. Asking :mreid about policy around that.
Had a discussion in Slack, and it sounds like we generally see the risk/reward ratio for pyup-bot PRs as low, so I'm just going to merge the py4j update and my PR on Monday.
PRs are merged, so closing this out.

In summary, the conclusion here is that we'd ideally just be able to include the python files at the root of the jar, publish that jar to our S3 maven repo, and then add the package via maven coordinates to Databricks clusters and to emr-bootstrap-spark.

The above is sufficient for Databricks, but not for EMR due to a bug there. The best option seems to be to additionally publish the python library (without any jar) to PyPI and include it in python-requirements.txt in emr-bootstrap-spark. AWS has confirmed this is a known issue but won't give any ETA for fixing the behavior, so we should be prepared to support this workaround long-term.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
When we deployed this, we learned it was only compatible with EMR 5.13+, so we had to roll back. I've posted a WIP PR for a new approach that relies on staging an uberjar with all our desired Spark packages assembled together: https://github.com/mozilla/emr-bootstrap-spark/pull/407
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
See Also: → 1477760
Points: 2 → 5
Broke out uberjar creation into https://github.com/mozilla/telemetry-spark-packages-assembly and now have that integrated into the PR for review that will get this running on ATMO: https://github.com/mozilla/emr-bootstrap-spark/pull/407
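Once the assembly jar is staged, the emr-bootstrap change would presumably point new clusters at it via the Spark configuration rather than at spark-packages.org. A hedged sketch of what that could look like on a cluster, using a hypothetical jar path (the actual mechanism in the PR may differ):

echo "spark.jars /usr/lib/spark/jars/telemetry-spark-packages-assembly.jar" | sudo tee --append /usr/lib/spark/conf/spark-defaults.conf   # hypothetical jar path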
Depends on: 1478093
The emr-bootstrap change is deployed and an email sent to fx-data-platform.
Status: REOPENED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Also sent email to fx-data-dev
Component: Telemetry APIs for Analysis → General