Closed Bug 1298123 Opened 9 years ago Closed 8 years ago

Refactor dataset classes so they do not use case classes

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: harter, Assigned: harter)

References

Details

Attachments

(1 file)

In order to create Spark Datasets, we need classes from which Spark can infer the schema. These are currently implemented as case classes, which are limited to 22 fields in Scala prior to 2.11. We'll need to refactor these classes to remove this restriction.
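For context, this is roughly the pattern involved (a minimal sketch, assuming the Spark 2.x SparkSession entry point; ClientSummary and its fields are illustrative, not the actual view's schema): Spark derives a Dataset's encoder, and therefore its schema, from a case class, and a pre-2.11 case class cannot declare more than 22 fields.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative row type; the real views carry well over 22 fields,
// which is where the pre-2.11 case-class limit bites.
case class ClientSummary(clientId: String, os: String, sessionHours: Double)

object DatasetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataset-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // The encoder (and the schema) is derived from the case class definition.
    val ds = Seq(
      ClientSummary("a", "Windows_NT", 1.5),
      ClientSummary("b", "Darwin", 0.25)
    ).toDS()

    ds.printSchema()
    spark.stop()
  }
}
```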
Priority: P1 → P2
Blocks: 1298120
Now that I've researched this problem, I'll suggest an alternate solution. The 22-field limit for case classes is removed in Scala 2.11 [0]. Instead of defining a custom class implementing Product, let's upgrade to 2.11.

Unfortunately, upgrading the telemetry-batch-view library to 2.11 creates a dependency conflict with spark-hyperloglog. spark-hyperloglog requires a 2.10 scalatest library, but this dependency should really be limited to the test scope. I've created a pull request [1] to fix this.

Roberto: do we need to explicitly upgrade the clusters to use 2.11, or will the config change be enough? Also, I'm unclear on how we're distributing spark-hyperloglog. Do we need to upgrade that package to 2.11 as well? Comments and concerns are welcome.

For context, I took a crack at implementing the case class replacement [2]. Unfortunately, we'd need to update the valSeq variable manually whenever adding new fields. This is as clean as I could get the implementation without upgrading Scala.

[0] https://issues.scala-lang.org/browse/SI-7296
[1] https://github.com/vitillo/spark-hyperloglog/pull/1
[2] https://github.com/harterrt/telemetry-batch-view/blob/4b4e40601abab30881c5c90986fec37f5466372f/src/main/scala/com/mozilla/telemetry/views/CrossSectionalView.scala#L30
Flags: needinfo?(rvitillo)
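For readers following along, the replacement described in [2] has roughly this shape: a plain class that implements Product by hand, which sidesteps the field limit at the cost of a manually maintained element sequence. A minimal sketch (the class and field names are illustrative, not the actual CrossSectionalView schema):

```scala
// Sketch of a hand-rolled Product, in the spirit of [2]; illustrative fields only.
class CrossSectional(
    val clientId: String,
    val os: String,
    val sessionHours: Double
    // ... the real view carries more fields than a 2.10 case class allows
) extends Product with Serializable {

  // Must be updated by hand whenever a field is added -- the maintenance
  // burden mentioned in the comment above.
  private val valSeq: Seq[Any] = Seq(clientId, os, sessionHours)

  override def productArity: Int = valSeq.length
  override def productElement(n: Int): Any = valSeq(n)
  override def canEqual(that: Any): Boolean = that.isInstanceOf[CrossSectional]
}
```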
(In reply to Ryan Harter [:harter] from comment #1)
> Now that I've researched this problem, I'll suggest an alternate solution.
> The 22-field limit for case classes is removed in Scala 2.11 [0]. Instead of
> defining a custom class implementing Product, let's upgrade to 2.11.
>
> Unfortunately, upgrading the telemetry-batch-view library to 2.11 creates a
> dependency conflict with spark-hyperloglog. spark-hyperloglog requires a
> 2.10 scalatest library, but this dependency should really be limited to the
> test scope. I've created a pull request [1] to fix this.

I don't think we can easily move to 2.11, since the Spark version we are currently using in production is built against 2.10. That said, moving to EMR release 5 should do the trick. That might not happen soon, though.
Flags: needinfo?(rvitillo)
Roberto, I believe I was able to create an example view and run the tests on a cluster with Scala 2.11 by removing the HLL library and the packages which depend on it. Is there any other problem I should look out for?
Flags: needinfo?(rvitillo)
(In reply to Ryan Harter [:harter] from comment #3)
> Roberto, I believe I was able to create an example view and run the tests on
> a cluster with Scala 2.11 by removing the HLL library and the packages which
> depend on it. Is there any other problem I should look out for?

All our jobs have to be tested with Scala 2.11 before we can do the upgrade in production. Furthermore, AFAIK Scala 2.10 is not binary compatible with Scala 2.11, and we should not mix code compiled for different major versions in production.
Flags: needinfo?(rvitillo)
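The binary-compatibility constraint is also why sbt appends a Scala-version suffix to cross-built artifacts via `%%`: every dependency on the classpath has to be published for the same Scala major version. A hedged build.sbt sketch (versions are illustrative, not the project's actual build), also showing scalatest confined to the test scope as proposed in [1]:

```scala
// build.sbt sketch -- versions are illustrative.
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  // %% resolves to the _2.11 artifact; mixing _2.10 and _2.11 jars on the
  // same classpath is what the binary-compatibility warning is about.
  "org.apache.spark" %% "spark-sql" % "2.0.0" % "provided",
  // Limiting scalatest to the test scope keeps its Scala version out of the
  // runtime dependency graph (the change proposed in [1]).
  "org.scalatest" %% "scalatest" % "2.2.6" % "test"
)
```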
Priority: P2 → P1
SGTM, Roberto. As discussed offline, we don't want to block the cross-sectional dataset on upgrading to Scala 2.11. I've added a workaround PR.
Attachment #8788644 - Flags: review?(rvitillo)
Attachment #8788644 - Flags: review?(rvitillo) → review+
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard