Closed Bug 1318709 Opened 8 years ago Closed 6 years ago

Create longitudinal dataset with 100% of pre-release data

Categories

(Data Platform and Tools Graveyard :: Datasets: Longitudinal, defect, P3)


Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: rvitillo, Assigned: frank)

References

Details

User Story

This dataset should contain 100% of data on nightly & aurora with all histograms (opt-in & opt-out). Ideally we would like to have a fraction of Beta users as well (e.g. 10%).
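For illustration, a minimal sketch of the channel selection this user story implies, in Spark/Scala. The column names, and the use of sample_id (assumed numeric, 0-99 in telemetry pings) for the 10% beta sample, are assumptions rather than a spec:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical sketch: keep 100% of nightly & aurora pings, plus roughly
// 10% of beta clients by sampling on sample_id (assumed numeric, 0-99).
def selectPrereleasePings(pings: DataFrame): DataFrame =
  pings.filter(
    col("normalized_channel").isin("nightly", "aurora") ||
      (col("normalized_channel") === "beta" && col("sample_id") < 10)
  )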
No description provided.
Hey BDS, are we correct in thinking this is something you are currently working on?
Flags: needinfo?(benjamin)
Assignee: nobody → benjamin
Me personally? Not at all.
Assignee: benjamin → nobody
Flags: needinfo?(benjamin)
I think the story needs refinement. Really, what I think we should have is a separate longitudinal dataset per channel, e.g. a nightly-longitudinal, an aurora-longitudinal, and a beta-longitudinal. I don't think there's much value in having them combined.
Q1, maybe earlier.
Points: --- → 3
Priority: -- → P3
I'll be working on this during this sprint. I may partition by channel, which would give fast access to a per-channel longitudinal without cluttering the table namespace.
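For illustration, a minimal sketch of that write-time partitioning (the helper name, path, and column name are assumptions, not the actual telemetry-batch-view code):

import org.apache.spark.sql.DataFrame

// Hypothetical sketch: one Parquet dataset partitioned by channel gives
// fast per-channel reads via partition pruning, without adding separate
// nightly-/aurora-/beta-longitudinal tables.
def writePartitionedByChannel(longitudinal: DataFrame, outputPath: String): Unit =
  longitudinal.write
    .mode("overwrite")
    .partitionBy("channel")
    .parquet(outputPath)

A reader filtering on channel = 'nightly' would then only scan the nightly partition.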
Assignee: nobody → fbertsch
Priority: P3 → P1
Currently, longitudinal includes opt-in scalars. I'm going to change that as part of this work.
Lowering the priority; some Aggregates work that's higher priority is coming down the queue.
Priority: P1 → P2
Priority: P2 → P1
Okay, this is blocked by the following: https://issues.apache.org/jira/browse/SPARK-18016. Basically, code generation for wide/nested dataframes sometimes fails. Fixes for other, similar code-generation problems were backported to previous Spark releases, so we'll just have to sit and wait until the same happens here.
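For reference, a minimal sketch of the failure mode under discussion (not the longitudinal schema itself): on affected Spark versions, collecting a dataframe with thousands of columns can fail because the generated projection class exceeds the JVM's 64K constant-pool limit.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

object WideSchemaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("wide-schema-sketch").getOrCreate()
    // Build a dataframe with a few thousand columns; on affected versions,
    // collect() can fail with "Constant pool ... has grown past JVM limit
    // of 0xFFFF" from the generated SpecificSafeProjection class.
    val cols = (1 to 4000).map(i => lit(i).as(s"c$i"))
    val wide = spark.range(10).select(cols: _*)
    wide.collect()
    spark.stop()
  }
}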
To move this forward we could either:
- drop columns that aren't used often, to reduce the schema size;
- split the schema and generate different datasets (e.g. by process type).
I like the idea of splitting into process types. Right now there'd be three: parent, content, and gpu (eventually, addons?). Even if you need to compare between process types, that should still be fine for both re:dash and ATMO, since you can read multiple tables in a single query with either one.
FYI, you might want to be aware of this thread on dev-platform if you're talking about doing things per-process: https://groups.google.com/forum/#!topic/mozilla.dev.platform/6RsdfRUJz2E
Thanks for that, Ryan. I think we'll be fine, because all we really care about are the processes returned in the "processes" portion of the ping. If those really start multiplying later, we can reevaluate, but hopefully the Spark bug will be fixed by then and we can just merge all these tables.
I'm bumping up the points on this because there are quite a few changes that have to be made, and we're actually making multiple new datasets here.
Points: 3 → 5
It doesn't look like splitting up by process is enough: each process table still has ~1100 columns. I'm getting the following errors. In Spark 2.1.0 it's the same error as before; in Spark 2.0.2 it's a different one (which technically doesn't error out, but the tasks fail and leave the dataframe empty, so the tests fail).

2.1.0:

org.codehaus.janino.JaninoRuntimeException: Constant pool for class org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection has grown past JVM limit of 0xFFFF
  at org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)

2.0.2:

17/03/14 16:39:47 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
17/03/14 16:39:53 WARN ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
17/03/14 16:46:31 WARN CodeGenerator: Error calculating stats of compiled class. java.io.EOFException
I have an alternative that works: we *only* include opt-in probes. If we wanted, we could have 6 tables: 3 with opt-in probes, one per process, and 3 with opt-out probes, one per process. It seems to me that any insight gleaned from the opt-out data would mostly be available in the regular longitudinal, but we can always add more as we see fit.
Only opt-in isn't really helpful; we routinely need to correlate multiple measures. Is this just a limitation on the number of top-level columns? Could we group the same number of measurements into fewer columns? Example: instead of devtools_responsive_opened_count_gpu being a separate top-level column from devtools_canvasdebugger_opened_count_content, have a single devtools column which then has subfields for each particular measure. This means you're loading more data in some cases, but if you assume that measurements are usually grouped by prefix anyway, it's probably not that much more.
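As a sketch of that grouping (the struct layout and helper name are assumptions; the column names come from the example above):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, struct}

// Hypothetical sketch: fold prefix-grouped measures into a single struct
// column, so devtools.responsive_opened_count_gpu replaces the top-level
// devtools_responsive_opened_count_gpu column.
def groupDevtoolsColumns(df: DataFrame): DataFrame =
  df.withColumn("devtools", struct(
      col("devtools_responsive_opened_count_gpu").as("responsive_opened_count_gpu"),
      col("devtools_canvasdebugger_opened_count_content").as("canvasdebugger_opened_count_content")))
    .drop("devtools_responsive_opened_count_gpu",
      "devtools_canvasdebugger_opened_count_content")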
No, this is a limitation on complex data: bringing all the data into a smaller number of columns doesn't help if those columns are more complex, because the nesting required would still create the same failure conditions. The alternative is a predefined set of probes to include in this dataset, but that doesn't seem very useful as a generalized dataset; if the set is too large it may still fail, and the list would need monitoring and updating. Finally, we could just do both an opt-in and an opt-out table for each process. If you need to correlate between an opt-in and an opt-out probe, you'll have to do a join. The code is already set up for this.
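For reference, the join would be along these lines, e.g. from a spark-shell (the table paths here are hypothetical; the real dataset locations aren't specified in this bug):

// Hypothetical sketch: correlate an opt-in measure with an opt-out one
// across the split tables via a client_id join.
val optIn  = spark.read.parquet("s3://telemetry-parquet/longitudinal_optin_parent/v1")
val optOut = spark.read.parquet("s3://telemetry-parquet/longitudinal_optout_parent/v1")
val joined = optIn.join(optOut, Seq("client_id"))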
Component: Metrics: Pipeline → Datasets: Longitudinal
Product: Cloud Services → Data Platform and Tools
As an update: we are still blocked on the Spark bug mentioned in Comment 8. The code is ready and committed to telemetry-batch-view, but the actual creation of the dataset won't be able to happen until some future time.
Status: NEW → ASSIGNED
Priority: P1 → --
Priority: -- → P2
Could we build a pre-release Longitudinal dataset using a curated list of Histograms as a start? The list could be expanded over time, as needed. This could avoid the "too many columns" Spark bug in the short term, and when we get unblocked we could drop the list and include all histograms.
We can reach out to a few of the engineers and other users of the pre-release data to build an initial list.
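As a sketch of the curated-list approach (the histogram names and helper are hypothetical, chosen only for illustration):

import org.apache.spark.sql.DataFrame

// Hypothetical starter list; the real one would come from the engineers
// using pre-release data, and could grow over time.
val curatedHistograms = Seq(
  "gc_ms",
  "fx_tab_switch_total_ms",
  "cycle_collector_max_pause")

// Keep only the dimension columns plus the curated histograms, keeping the
// schema narrow enough to sidestep the codegen limit in the short term.
def selectCurated(df: DataFrame, dimensions: Seq[String]): DataFrame =
  df.select((dimensions ++ curatedHistograms).map(df.col): _*)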
Priority: P2 → P3
At this point we will not be following through with this as-is. Instead, we'll be moving to a Generic Longitudinal built from all of main_summary.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → INVALID
Product: Data Platform and Tools → Data Platform and Tools Graveyard