Closed Bug 1318709 Opened 8 years ago Closed 6 years ago

Create longitudinal dataset with 100% of pre-release data

Categories

(Data Platform and Tools Graveyard :: Datasets: Longitudinal, defect, P3)


Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: rvitillo, Assigned: frank)

References

Details

User Story

This dataset should contain 100% of data on nightly & aurora with all histograms (opt-in & opt-out). Ideally we would like to have a fraction of Beta users as well (e.g. 10%).
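For illustration, a minimal sketch of the channel selection this user story implies, in Spark/Scala. The column names, and the use of sample_id (assumed numeric, 0-99 in telemetry pings) for the 10% beta sample, are assumptions rather than a spec:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical sketch: keep 100% of nightly & aurora pings, plus roughly
// 10% of beta clients by sampling on sample_id (assumed numeric, 0-99).
def selectPrereleasePings(pings: DataFrame): DataFrame =
  pings.filter(
    col("normalized_channel").isin("nightly", "aurora") ||
      (col("normalized_channel") === "beta" && col("sample_id") < 10)
  )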
No description provided.
Hey BDS, are we correct in thinking this is something you are currently working on?
Flags: needinfo?(benjamin)
Assignee: nobody → benjamin
Me personally? Not at all.
Assignee: benjamin → nobody
Flags: needinfo?(benjamin)
I think the story needs refinement. Really, what I think we should have is a separate longitudinal dataset per channel, e.g. a nightly-longitudinal, an aurora-longitudinal, and a beta-longitudinal. I don't think there's much value in having them combined.
Q1, maybe earlier.
Points: --- → 3
Priority: -- → P3
I'll be working on this during this sprint. I may partition by channel, which would give fast access to a per-channel longitudinal without cluttering the table namespace.
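For illustration, a minimal sketch of that write-time partitioning (the helper name, path, and column name are assumptions, not the actual telemetry-batch-view code):

import org.apache.spark.sql.DataFrame

// Hypothetical sketch: one Parquet dataset partitioned by channel gives
// fast per-channel reads via partition pruning, without adding separate
// nightly-/aurora-/beta-longitudinal tables.
def writePartitionedByChannel(longitudinal: DataFrame, outputPath: String): Unit =
  longitudinal.write
    .mode("overwrite")
    .partitionBy("channel")
    .parquet(outputPath)

A reader filtering on channel = 'nightly' would then only scan the nightly partition.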
Assignee: nobody → fbertsch
Priority: P3 → P1
Currently, longitudinal includes opt-in scalars. I'm going to change that as part of this work.
Lowering the priority; some Aggregates work that's higher priority is coming down the queue.
Priority: P1 → P2
Priority: P2 → P1
Okay, this is blocked by the following: https://issues.apache.org/jira/browse/SPARK-18016. Basically, code generation for wide/nested dataframes sometimes fails. Fixes for other, similar code-generation problems were backported to previous Spark releases, so we'll just have to sit and wait until the same happens here.
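For reference, a minimal sketch of the failure mode under discussion (not the longitudinal schema itself): on affected Spark versions, collecting a dataframe with thousands of columns can fail because the generated projection class exceeds the JVM's 64K constant-pool limit.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

object WideSchemaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("wide-schema-sketch").getOrCreate()
    // Build a dataframe with a few thousand columns; on affected versions,
    // collect() can fail with "Constant pool ... has grown past JVM limit
    // of 0xFFFF" from the generated SpecificSafeProjection class.
    val cols = (1 to 4000).map(i => lit(i).as(s"c$i"))
    val wide = spark.range(10).select(cols: _*)
    wide.collect()
    spark.stop()
  }
}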
To move this forward we could either:
- drop columns that aren't used often, to reduce the schema size;
- split the schema and generate different datasets (e.g. by process type).
I like the idea of splitting into process types. Right now there'd be three: parent, content, and gpu (eventually, addons?). Even if you need to compare between process types, that should still be fine for both re:dash and ATMO, since you can read multiple tables in a single query with either one.
FYI, you might want to be aware of this thread on dev-platform if you're talking about doing things per-process: https://groups.google.com/forum/#!topic/mozilla.dev.platform/6RsdfRUJz2E
Thanks for that, Ryan. I think we'll be fine, because all we really care about are the processes returned in the "processes" portion of the ping. If those really start multiplying later, we can reevaluate, but hopefully the Spark bug will be fixed by then and we can just merge all these tables.
I'm bumping up the points on this because there are quite a few changes that have to be made, and we're actually making multiple new datasets here.
Points: 3 → 5
It doesn't look like splitting up by process is enough: each process table still has ~1100 columns. I'm getting the following errors. In Spark 2.1.0 it's the same error as before; in Spark 2.0.2 it's a different one (which technically doesn't error out, but the tasks fail and leave the dataframe empty, so the tests fail).

2.1.0:

org.codehaus.janino.JaninoRuntimeException: Constant pool for class org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection has grown past JVM limit of 0xFFFF
  at org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)

2.0.2:

17/03/14 16:39:47 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
17/03/14 16:39:53 WARN ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
17/03/14 16:46:31 WARN CodeGenerator: Error calculating stats of compiled class. java.io.EOFException
I have an alternative that works: we *only* include opt-in probes. If we wanted, we could have 6 tables: 3 with opt-in probes, one per process, and 3 with opt-out probes, one per process. It seems to me that any insight gleaned from the opt-out data would mostly be available in the regular longitudinal, but we can always add more as we see fit.
Only opt-in isn't really helpful; we routinely need to correlate multiple measures. Is this just a limitation on the number of top-level columns? Could we group the same number of measurements into fewer columns? Example: instead of devtools_responsive_opened_count_gpu being a separate top-level column from devtools_canvasdebugger_opened_count_content, have a single devtools column which then has subfields for each particular measure. This means you're loading more data in some cases, but if you assume that measurements are usually grouped by prefix anyway, it's probably not that much more.
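As a sketch of that grouping (the struct layout and helper name are assumptions; the column names come from the example above):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, struct}

// Hypothetical sketch: fold prefix-grouped measures into a single struct
// column, so devtools.responsive_opened_count_gpu replaces the top-level
// devtools_responsive_opened_count_gpu column.
def groupDevtoolsColumns(df: DataFrame): DataFrame =
  df.withColumn("devtools", struct(
      col("devtools_responsive_opened_count_gpu").as("responsive_opened_count_gpu"),
      col("devtools_canvasdebugger_opened_count_content").as("canvasdebugger_opened_count_content")))
    .drop("devtools_responsive_opened_count_gpu",
      "devtools_canvasdebugger_opened_count_content")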
No, this is a limitation on complex data: bringing all the data into a smaller number of columns doesn't help if those columns are more complex, because the nesting required would still create the same failure conditions. The alternative is a predefined set of probes to include in this dataset, but that doesn't seem very useful as a generalized dataset; if the set is too large it may still fail, and the list would need monitoring and updating. Finally, we could just do both an opt-in and an opt-out table for each process. If you need to correlate between an opt-in and an opt-out probe, you'll have to do a join. The code is already set up for this.
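For reference, the join would be along these lines, e.g. from a spark-shell (the table paths here are hypothetical; the real dataset locations aren't specified in this bug):

// Hypothetical sketch: correlate an opt-in measure with an opt-out one
// across the split tables via a client_id join.
val optIn  = spark.read.parquet("s3://telemetry-parquet/longitudinal_optin_parent/v1")
val optOut = spark.read.parquet("s3://telemetry-parquet/longitudinal_optout_parent/v1")
val joined = optIn.join(optOut, Seq("client_id"))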
Component: Metrics: Pipeline → Datasets: Longitudinal
Product: Cloud Services → Data Platform and Tools
As an update: we are still blocked on the Spark bug mentioned in Comment 8. The code is ready and committed to telemetry-batch-view, but the actual creation of the dataset won't be able to happen until some future time.
Status: NEW → ASSIGNED
Priority: P1 → --
Priority: -- → P2
Could we build a pre-release Longitudinal dataset using a curated list of Histograms as a start? The list could be expanded over time, as needed. This could avoid the "too many columns" Spark bug in the short term, and when we get unblocked we could drop the list and include all histograms.
We can reach out to a few of the engineers and other users of the pre-release data to build an initial list.
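As a sketch of the curated-list approach (the histogram names and helper are hypothetical, chosen only for illustration):

import org.apache.spark.sql.DataFrame

// Hypothetical starter list; the real one would come from the engineers
// using pre-release data, and could grow over time.
val curatedHistograms = Seq(
  "gc_ms",
  "fx_tab_switch_total_ms",
  "cycle_collector_max_pause")

// Keep only the dimension columns plus the curated histograms, keeping the
// schema narrow enough to sidestep the codegen limit in the short term.
def selectCurated(df: DataFrame, dimensions: Seq[String]): DataFrame =
  df.select((dimensions ++ curatedHistograms).map(df.col): _*)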
Priority: P2 → P3
At this point we will not be following through with this as-is. Instead, we'll be moving to a Generic Longitudinal built from all of main_summary.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → INVALID
Product: Data Platform and Tools → Data Platform and Tools Graveyard