Closed
Bug 1318709
Opened 8 years ago
Closed 6 years ago
Create longitudinal dataset with 100% of pre-release data
Categories
(Data Platform and Tools Graveyard :: Datasets: Longitudinal, defect, P3)
Data Platform and Tools Graveyard
Datasets: Longitudinal
Tracking
(Not tracked)
RESOLVED
INVALID
People
(Reporter: rvitillo, Assigned: frank)
References
Details
User Story
This dataset should contain 100% of data on nightly & aurora with all histograms (opt-in & opt-out). Ideally we would like to have a fraction of Beta users as well (e.g. 10%).
No description provided.
Comment 1•8 years ago
hey BDS, are we correct in thinking this is something you are currently working on?
Flags: needinfo?(benjamin)
Updated•8 years ago
Assignee: nobody → benjamin
Comment 2•8 years ago
Me personally? Not at all.
Assignee: benjamin → nobody
Flags: needinfo?(benjamin)
Comment 3•8 years ago
I think the story needs refinement. What we should really have is a separate longitudinal dataset per channel, e.g. a nightly-longitudinal, aurora-longitudinal, and beta-longitudinal. I don't think there's much value in having them combined.
Assignee
Comment 5•8 years ago
I'll be working on this this sprint. I may partition by channel, which would give fast access to a per-channel longitudinal, without cluttering the table namespace.
Assignee: nobody → fbertsch
Priority: P3 → P1
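The partition-by-channel idea above can be sketched in plain Python: with a Hive-style `channel=<value>` partition layout, a per-channel read only touches that channel's directory, so there's one logical table rather than three. All paths and names here are illustrative, not actual telemetry-batch-view code.

```python
# Sketch: simulating a Hive-style partition-by-channel layout.
# A Parquet dataset partitioned on `channel` is written as one directory per
# channel value; reading a single channel only scans that directory.
# Paths and record fields are made up for illustration.

def partition_dirs(base, records):
    """Group records into Hive-style partition directories by channel."""
    dirs = {}
    for rec in records:
        path = f"{base}/channel={rec['channel']}"
        dirs.setdefault(path, []).append(rec)
    return dirs

records = [
    {"client_id": "a", "channel": "nightly"},
    {"client_id": "b", "channel": "aurora"},
    {"client_id": "c", "channel": "nightly"},
]

layout = partition_dirs("s3://bucket/longitudinal/v20170101", records)
# A "nightly" read only needs the channel=nightly directory:
nightly = layout["s3://bucket/longitudinal/v20170101/channel=nightly"]
```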
Assignee
Comment 6•8 years ago
Currently, longitudinal includes opt-in scalars. I'm going to change that as part of this code.
Assignee
Comment 7•8 years ago
Lowering the priority. Some higher-priority Aggregates work is coming down the queue.
Priority: P1 → P2
Assignee
Updated•8 years ago
Priority: P2 → P1
Assignee
Comment 8•8 years ago
Okay this is blocked by the following:
https://issues.apache.org/jira/browse/SPARK-18016
Basically, code generation for wide/nested dataframes sometimes fails. They had other, similar code-generation problems whose fixes were backported to all previous Spark releases, so we'll just have to sit and wait until the same happens here.
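The failure mode behind SPARK-18016 is a hard JVM limit: a class file's constant pool holds at most 0xFFFF entries, and Spark's generated projection class consumes several entries per field, so wide nested schemas overflow it. A rough back-of-the-envelope sketch, where the per-field cost is an illustrative guess, not a measured number:

```python
# Rough arithmetic for why wide/nested schemas break Spark's code generation
# (SPARK-18016): a JVM class file's constant pool is capped at 0xFFFF entries,
# and the generated SpecificSafeProjection class consumes several entries per
# field. ENTRIES_PER_FIELD is a hypothetical cost, not a measured value.

JVM_CONSTANT_POOL_LIMIT = 0xFFFF  # 65535, fixed by the class-file format
ENTRIES_PER_FIELD = 60            # illustrative cost of one nested histogram column

def max_safe_fields(entries_per_field):
    return JVM_CONSTANT_POOL_LIMIT // entries_per_field

# Under this assumption, the ~1100 per-process columns mentioned later in
# this bug would already be past the limit:
assert 1100 > max_safe_fields(ENTRIES_PER_FIELD)
```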
Reporter
Comment 9•8 years ago
To move this forward we could either:
- drop columns that aren't used often to reduce the schema size;
- split the schema and generate different datasets (e.g. by process type).
Assignee
Comment 10•8 years ago
I like the idea of splitting into process types. Right now there'd be three - parent, content, gpu (eventually, addons?). Even if you need to compare between process types, that should still be fine for both re:dash and ATMO, since you can read multiple tables in a single query in either one.
Comment 11•8 years ago
FYI, you might want to be aware of this thread on dev-platform if you're talking about doing things per-process:
https://groups.google.com/forum/#!topic/mozilla.dev.platform/6RsdfRUJz2E
Assignee
Comment 12•8 years ago
Thanks for that Ryan. I think we'll be fine, because all we really care about are the processes returned in the "processes" portion of the ping. If those really start bumping up later, we can reevaluate - but hopefully the initial bug will be fixed and we can just merge all these tables.
Assignee
Comment 13•8 years ago
I'm bumping up the points on this because there are quite a few changes that have to be made, and we're actually making multiple new datasets here.
Points: 3 → 5
Assignee
Comment 14•8 years ago
It doesn't look like splitting up by process is enough: each process table still has ~1100 columns. I'm getting the following errors. In Spark 2.1.0 it's the same error as before; in Spark 2.0.2 it's a different failure (nothing is technically thrown, but the tasks fail and leave the dataframe empty, so the tests fail):
2.1.0:
org.codehaus.janino.JaninoRuntimeException: Constant pool for class org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection has grown past JVM limit of 0xFFFF
at org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
2.0.2:
17/03/14 16:39:47 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
17/03/14 16:39:53 WARN ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
17/03/14 16:46:31 WARN CodeGenerator: Error calculating stats of compiled class.
java.io.EOFException
Assignee
Comment 15•8 years ago
I have an alternative that works - we *only* include opt-in probes. If we wanted, we could have 6 tables - 3 for opt-in probes for each process, 3 for opt-out probes for each process. It seems to me that any insight gleaned from the opt-out data would mostly be available in the regular longitudinal, but we can always add more as we see fit.
Comment 16•8 years ago
Only opt-in isn't really helpful. We routinely need to correlate multiple measures.
Is this just a limitation on the number of toplevel columns? Could we group the same number of measurements into fewer columns? Example: instead of devtools_responsive_opened_count_gpu being a separate toplevel column from devtools_canvasdebugger_opened_count_content, have a single devtools column which then has subfields for each particular measure?
This means you're loading more data in some cases, but if you assume that measurements are usually grouped by prefix anyway, it's probably not that much more.
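The prefix-grouping suggestion above can be sketched in plain Python: collapse many flat top-level columns into fewer columns keyed by probe prefix, each holding subfields for the individual measures. Column names and values here are made up for illustration.

```python
# Sketch: grouping flat histogram columns under their probe prefix, so many
# top-level columns become a few struct-like columns with subfields.
# Column names and values are illustrative only.
from collections import defaultdict

def group_by_prefix(columns):
    """Map e.g. 'devtools_responsive_opened_count' under a 'devtools' key."""
    grouped = defaultdict(dict)
    for name, value in columns.items():
        prefix, _, rest = name.partition("_")
        grouped[prefix][rest] = value
    return dict(grouped)

flat = {
    "devtools_responsive_opened_count": 3,
    "devtools_canvasdebugger_opened_count": 1,
    "gc_ms": [12, 40],
}
nested = group_by_prefix(flat)
# nested["devtools"] now has two subfields; nested["gc"] has one.
```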
Assignee
Comment 17•8 years ago
No, this is a limitation on complex data. Bringing all the data to a smaller number of columns doesn't help if those columns are more complex. The nesting required would still create the failure conditions.
The alternative is a predefined set of probes to include in this dataset, but that doesn't seem too useful as a generalized dataset. Also if it's too many it may still fail, and the list will need monitoring and updating.
Finally, we could just do both an opt-in and an opt-out table for each process. If you need to correlate between an opt-in and an opt-out probe, you'll have to do a join. Code is already set up for this.
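The opt-in/opt-out split above amounts to correlating two per-client tables with a join on client_id. A minimal sketch in plain Python; the table contents and column names are made up for illustration:

```python
# Sketch: separate opt-in and opt-out tables per process, correlated by an
# inner join on client_id when a query needs probes from both.
# Contents and column names are illustrative only.

opt_out = {  # keyed by client_id, as a per-process opt-out table would be
    "a": {"gc_ms_sum": 120},
    "b": {"gc_ms_sum": 45},
}
opt_in = {
    "a": {"devtools_opened_count": 2},
}

def join_on_client_id(left, right):
    """Inner join of two client_id-keyed tables, merging their columns."""
    return {
        cid: {**left[cid], **right[cid]}
        for cid in left.keys() & right.keys()
    }

joined = join_on_client_id(opt_out, opt_in)
# Only client "a" appears in both tables, so only "a" survives the join.
```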
Updated•7 years ago
Component: Metrics: Pipeline → Datasets: Longitudinal
Product: Cloud Services → Data Platform and Tools
Assignee
Comment 18•7 years ago
As an update - we are still blocked on the Spark bug mentioned in Comment 8. The code is ready and committed to telemetry-batch-view, but the actual creation of the dataset won't be able to happen until some future time.
Status: NEW → ASSIGNED
Priority: P1 → --
Updated•7 years ago
Priority: -- → P2
Comment 19•7 years ago
Could we build a pre-release Longitudinal dataset using a curated list of Histograms as a start?
The list could be expanded over time, as needed. This could avoid the "too many columns" Spark bug in the short term, and when we get unblocked we could drop the list and include all histograms.
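The curated-list workaround above can be sketched in plain Python: build the pre-release longitudinal from only a whitelisted subset of histograms, expandable over time, and drop the list once the Spark bug is fixed. The histogram names below are illustrative, not an actual curated list.

```python
# Sketch: building the dataset from a curated histogram whitelist to stay
# under the "too many columns" Spark limit. Names are illustrative only.

CURATED_HISTOGRAMS = {"GC_MS", "CYCLE_COLLECTOR", "FX_TAB_SWITCH_TOTAL_MS"}

def select_curated(ping_histograms):
    """Keep only histograms on the curated list; expand the list as needed."""
    return {k: v for k, v in ping_histograms.items() if k in CURATED_HISTOGRAMS}

ping = {"GC_MS": [1, 2], "MEMORY_VSIZE": [3], "FX_TAB_SWITCH_TOTAL_MS": [9]}
subset = select_curated(ping)
# MEMORY_VSIZE is dropped; the two curated histograms remain.
```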
Comment 20•7 years ago
We can reach out to a few engineers and other users of the pre-release data to build an initial list.
Assignee
Updated•7 years ago
Priority: P2 → P3
Assignee
Comment 22•6 years ago
At this point we will not be following through with this as-is. Instead we'll be moving to a Generic Longitudinal from all of main_summary.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → INVALID
Updated•5 years ago
Product: Data Platform and Tools → Data Platform and Tools Graveyard