Closed Bug 1363136 Opened 7 years ago Closed 4 years ago

Install version of py4j that ships with the cluster's Spark version in EMR bootstrap script

Categories

(Data Platform and Tools :: General, enhancement, P3)

x86
macOS
enhancement
Points: 1

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: bugzilla, Unassigned)

References

Details

We're well behind on the py4j version that our EMR bootstrap script installs (https://github.com/mozilla/emr-bootstrap-spark/blob/b3c7412b2f6b61c02b27125d6cad5935c16985ad/ansible/files/bootstrap/telemetry.sh#L164). The pinned version doesn't work with the new SparkSession API introduced in Spark 2.0. Spark ships with its own copy of py4j, and unless there's another dependency on py4j that I don't know about, it seems prudent to install the version that the cluster's Spark release ships with. That copy is located here: $SPARK_HOME/python/lib/py4j-0.*.*-src.zip (0.10.3 on my EMR 5.2.1 cluster)

We can either add the py4j zip directly to PYTHONPATH (which is how the pyspark script does it), or we can do some terrible things with awk or a regex to extract the version number and install the matching release via pip. Or we can just punt this down the road and manually update the py4j version in telemetry.sh.
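For illustration, here is a minimal Python sketch of the first two options, not what telemetry.sh actually does. The helper names `find_py4j_zip` and `py4j_version` are hypothetical. It assumes the zip sits under $SPARK_HOME/python/lib as described above, and it uses the slightly broader glob py4j-*-src.zip.

```python
import glob
import os
import re


def find_py4j_zip(spark_home):
    """Return the path of the py4j source zip bundled with Spark, or None.

    Assumes the layout described in this bug:
    $SPARK_HOME/python/lib/py4j-<version>-src.zip
    """
    pattern = os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")
    matches = sorted(glob.glob(pattern))
    # A Spark install is expected to bundle exactly one py4j zip; take the
    # first match if there happen to be several.
    return matches[0] if matches else None


def py4j_version(zip_path):
    """Extract the version from a file name such as py4j-0.10.3-src.zip.

    Returns None when the name does not follow that pattern.
    """
    name = os.path.basename(zip_path)
    match = re.search(r"^py4j-(\d+(?:\.\d+)*)-src\.zip$", name)
    return match.group(1) if match else None
```

With these helpers, the PYTHONPATH option amounts to putting the returned zip path on the import path (for example with `sys.path.insert(0, zip_path)`), while the pip option would pin `"py4j==" + py4j_version(zip_path)`. Either way the version follows whatever the cluster's Spark ships, instead of being hard-coded in the bootstrap script.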
Component: General → Spark
Priority: -- → P3
Blocks: 1357749
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → WONTFIX
Component: Spark → General