Closed Bug 1298123 Opened 9 years ago Closed 8 years ago

Refactor dataset classes so they do not use case classes

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: harter, Assigned: harter)

References

Details

Attachments

(1 file)

In order to create Spark Datasets, we need classes from which Spark can infer the schema. These are currently implemented as case classes, which are limited to 22 fields in Scala prior to 2.11. We'll need to refactor these classes to remove this restriction.
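For context, this is roughly the pattern involved (a minimal sketch, assuming the Spark 2.x SparkSession entry point; ClientSummary and its fields are illustrative, not the actual view's schema): Spark derives a Dataset's encoder, and therefore its schema, from a case class, and a pre-2.11 case class cannot declare more than 22 fields.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative row type; the real views carry well over 22 fields,
// which is where the pre-2.11 case-class limit bites.
case class ClientSummary(clientId: String, os: String, sessionHours: Double)

object DatasetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataset-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // The encoder (and the schema) is derived from the case class definition.
    val ds = Seq(
      ClientSummary("a", "Windows_NT", 1.5),
      ClientSummary("b", "Darwin", 0.25)
    ).toDS()

    ds.printSchema()
    spark.stop()
  }
}
```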
Priority: P1 → P2
Blocks: 1298120
Now that I've researched this problem, I'll suggest an alternate solution. The 22-field limit for case classes is removed in Scala 2.11 [0]. Instead of defining a custom class implementing Product, let's upgrade to 2.11.

Unfortunately, upgrading the telemetry-batch-view library to 2.11 creates a dependency conflict with spark-hyperloglog. spark-hyperloglog requires a 2.10 scalatest library, but this dependency should really be limited to the test scope. I've created a pull request [1] to fix this.

Roberto: do we need to explicitly upgrade the clusters to use 2.11, or will the config change be enough? Also, I'm unclear on how we're distributing spark-hyperloglog. Do we need to upgrade that package to 2.11 as well? Comments and concerns are welcome.

For context, I took a crack at implementing the case class replacement [2]. Unfortunately, we'd need to update the valSeq variable manually whenever adding new fields. This is as clean as I could get the implementation without upgrading Scala.

[0] https://issues.scala-lang.org/browse/SI-7296
[1] https://github.com/vitillo/spark-hyperloglog/pull/1
[2] https://github.com/harterrt/telemetry-batch-view/blob/4b4e40601abab30881c5c90986fec37f5466372f/src/main/scala/com/mozilla/telemetry/views/CrossSectionalView.scala#L30
Flags: needinfo?(rvitillo)
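For readers following along, the replacement described in [2] has roughly this shape: a plain class that implements Product by hand, which sidesteps the field limit at the cost of a manually maintained element sequence. A minimal sketch (the class and field names are illustrative, not the actual CrossSectionalView schema):

```scala
// Sketch of a hand-rolled Product, in the spirit of [2]; illustrative fields only.
class CrossSectional(
    val clientId: String,
    val os: String,
    val sessionHours: Double
    // ... the real view carries more fields than a 2.10 case class allows
) extends Product with Serializable {

  // Must be updated by hand whenever a field is added -- the maintenance
  // burden mentioned in the comment above.
  private val valSeq: Seq[Any] = Seq(clientId, os, sessionHours)

  override def productArity: Int = valSeq.length
  override def productElement(n: Int): Any = valSeq(n)
  override def canEqual(that: Any): Boolean = that.isInstanceOf[CrossSectional]
}
```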
(In reply to Ryan Harter [:harter] from comment #1)
> Now that I've researched this problem, I'll suggest an alternate solution.
> The 22-field limit for case classes is removed in Scala 2.11 [0]. Instead of
> defining a custom class implementing Product, let's upgrade to 2.11.
>
> Unfortunately, upgrading the telemetry-batch-view library to 2.11 creates a
> dependency conflict with spark-hyperloglog. spark-hyperloglog requires a
> 2.10 scalatest library, but this dependency should really be limited to the
> test scope. I've created a pull request [1] to fix this.

I don't think we can easily move to 2.11, since the Spark version we are currently using in production is built against 2.10. That said, moving to EMR release 5 should do the trick. That might not happen soon, though.
Flags: needinfo?(rvitillo)
Roberto, I believe I was able to create an example view and run the tests on a cluster with Scala 2.11 by removing the HLL library and the packages which depend on it. Is there any other problem I should look out for?
Flags: needinfo?(rvitillo)
(In reply to Ryan Harter [:harter] from comment #3)
> Roberto, I believe I was able to create an example view and run the tests on
> a cluster with Scala 2.11 by removing the HLL library and the packages which
> depend on it. Is there any other problem I should look out for?

All our jobs have to be tested with Scala 2.11 before we can do the upgrade in production. Furthermore, AFAIK Scala 2.10 is not binary compatible with Scala 2.11, and we should not mix code compiled for different major versions in production.
Flags: needinfo?(rvitillo)
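The binary-compatibility constraint is also why sbt appends a Scala-version suffix to cross-built artifacts via `%%`: every dependency on the classpath has to be published for the same Scala major version. A hedged build.sbt sketch (versions are illustrative, not the project's actual build), also showing scalatest confined to the test scope as proposed in [1]:

```scala
// build.sbt sketch -- versions are illustrative.
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  // %% resolves to the _2.11 artifact; mixing _2.10 and _2.11 jars on the
  // same classpath is what the binary-compatibility warning is about.
  "org.apache.spark" %% "spark-sql" % "2.0.0" % "provided",
  // Limiting scalatest to the test scope keeps its Scala version out of the
  // runtime dependency graph (the change proposed in [1]).
  "org.scalatest" %% "scalatest" % "2.2.6" % "test"
)
```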
Priority: P2 → P1
SGTM, Roberto. As discussed offline, we don't want to block the cross-sectional dataset on upgrading to Scala 2.11. I've added a workaround PR.
Attachment #8788644 - Flags: review?(rvitillo)
Attachment #8788644 - Flags: review?(rvitillo) → review+
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard