Closed
Bug 1298123
Opened 9 years ago
Closed 8 years ago
Refactor dataset classes so they do not use case classes
Categories
(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)
Cloud Services Graveyard
Metrics: Pipeline
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: harter, Assigned: harter)
References
Details
Attachments
(1 file)
In order to create Spark Datasets, we need spark classes so we can impute types. These are currently implemented as case classes, which are limited to 22 fields in Scala. We'll need to refactor these classes to remove this restriction.
Assignee | ||
Updated•9 years ago
|
Priority: P1 → P2
Assignee | ||
Comment 1•9 years ago
|
||
Now that I've researched this problem, I'll suggest an alternate solution.
The 22 field limit for case classes is removed in Scala 2.11 [0].
Instead of defining a custom class implementing Product, let's upgrade to 2.11.
Unfortunately, upgrading the telemetry-batch-view library to 2.11 creates a dependency conflict with spark-hyperloglog.
Spark-HyperLogLog requires a 2.10 scalatest library, but this dependency should really be limited to the test scope.
I've created a pull request [1] to fix.
Roberto: do we need to explicitly upgrade the clusters to use 2.11 or will the config change be enough?
Also, I'm unclear on how we're distributing spark-hyperloglog.
Do we need to upgrade that package to 2.11 as well?
Comments and concerns are welcome.
For context, I took a crack at implementing the case class replacement [2].
Unfortunately, we'd need to update the valSeq variable manually whenever adding new fields.
This is as clean as I could get the implementation without upgrading Scala.
[0] https://issues.scala-lang.org/browse/SI-7296
[1] https://github.com/vitillo/spark-hyperloglog/pull/1
[2] https://github.com/harterrt/telemetry-batch-view/blob/4b4e40601abab30881c5c90986fec37f5466372f/src/main/scala/com/mozilla/telemetry/views/CrossSectionalView.scala#L30
Flags: needinfo?(rvitillo)
Comment 2•8 years ago
|
||
(In reply to Ryan Harter [:harter] from comment #1)
> Now that I've researched this problem, I'll suggest an alternate solution.
> The 22 field limit for case classes is removed in Scala 2.11 [0].
> Instead of defining a custom class implementing Product, let's upgrade to
> 2.11.
>
> Unfortunately, upgrading the telemetry-batch-view library to 2.11 creates a
> dependency conflict with spark-hyperloglog.
> Spark-HyperLogLog requires a 2.10 scalatest library, but this dependency
> should really be limited to the test scope.
> I've created a pull request [1] to fix.
I don't think we can't easily move to 2.11 since the current Spark version we are using in production is using 2.10. That said, moving to EMR release 5 should do the trick. That might not happen soon though.
Flags: needinfo?(rvitillo)
Assignee | ||
Comment 3•8 years ago
|
||
Roberto, I believe I was able to create an example view and run the tests on a cluster with Spark 2.11 by removing the HLL library and the packages which depend on it. Is there any other problem I should look out for?
Flags: needinfo?(rvitillo)
Comment 4•8 years ago
|
||
(In reply to Ryan Harter [:harter] from comment #3)
> Roberto, I believe I was able to create an example view and run the tests on
> a cluster with Spark 2.11 by removing the HLL library and the packages which
> depend on it. Is there any other problem I should look out for?
All our jobs have to be tested with Scala 2.11 before we can do the upgrade in production. Furthermore, afaik, Scala 2.10 is not binary compatible with Scala 2.11 and we should not mix code compiled for different major versions in production.
Flags: needinfo?(rvitillo)
Assignee | ||
Updated•8 years ago
|
Priority: P2 → P1
Assignee | ||
Comment 5•8 years ago
|
||
Assignee | ||
Comment 6•8 years ago
|
||
SGTM, Roberto. As discussed offline, we don't want to block the cross sectional dataset on upgrading to spark 2.11. I've added a work around PR.
Assignee | ||
Updated•8 years ago
|
Attachment #8788644 -
Flags: review?(rvitillo)
Updated•8 years ago
|
Attachment #8788644 -
Flags: review?(rvitillo) → review+
Updated•8 years ago
|
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Updated•6 years ago
|
Product: Cloud Services → Cloud Services Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•