Closed Bug 1360256 Opened 7 years ago Closed 7 years ago

Add direct-to-parquet for new "new-profile" ping

Categories

(Data Platform and Tools :: General, enhancement, P1)

Points:
1

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Dexter, Assigned: whd)

References

Details

(Whiteboard: [measurement:client:tracking] [SvcOps])

Once bug 1120370 lands, we need to make the ping data available for analysis in re:dash and other data tools.
Blocks: 1351394
Whiteboard: [measurement:client]
Component: Metrics: Pipeline → Datasets: General
Product: Cloud Services → Data Platform and Tools
this done Wesley?
Flags: needinfo?(whd)
Priority: -- → P2
Whiteboard: [measurement:client] → [measurement:client] [SvcOps]
(In reply to thuelbert from comment #1)
> this done Wesley?

This bug still doesn't have a parquet schema and lua config attached to it. I need a schema in order to deploy the change, for which a separate bug can optionally be filed if we're using this one to track actually creating the schema.
Flags: needinfo?(whd)
(In reply to Wesley Dawson [:whd] from comment #2)
> (In reply to thuelbert from comment #1)
> > this done Wesley?
> 
> This bug still doesn't have a parquet schema and lua config attached to it.
> I need a schema in order to deploy the change, for which a separate bug can
> optionally be filed if we're using this one to track actually creating the
> schema.

Is a "parquet" schema the same as the JSON schemas that can be found https://github.com/mozilla-services/mozilla-pipeline-schemas/ ?
(In reply to Alessio Placitelli [:Dexter] from comment #3)

> Is a "parquet" schema the same as the JSON schemas that can be found
> https://github.com/mozilla-services/mozilla-pipeline-schemas/ ?

No, though that repo does contain some parquet schemas. Compare https://github.com/mozilla-services/mozilla-pipeline-schemas/blob/master/schemas/telemetry/core/core.9.parquetmr.txt to https://github.com/mozilla-services/mozilla-pipeline-schemas/blob/master/schemas/telemetry/core/core.9.schema.json. See also bug #1333206 for an example that includes a lua config, which demonstrates that the schema was tested on https://pipeline-cep.prod.mozaws.net/ with some real data. The lua config is needed mostly for json_objects and s3_path_dimensions, the latter of which should at a minimum include Fields[submissionDate].
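For a concrete picture, here is a minimal sketch of what such a lua config might contain, modeled loosely on the existing direct-to-parquet configs; the matcher, schema body, and field lists below are illustrative placeholders, not the reviewed values:

    -- Illustrative direct-to-parquet config sketch (not the deployed file).
    -- The message_matcher and schema are assumptions for this example.
    message_matcher = "Type == 'telemetry' && Fields[docType] == 'new-profile'"

    parquet_schema = [=[
        message new_profile {
            required binary clientId (UTF8);
            required binary submissionDate (UTF8);
        }
    ]=]

    -- Fields that arrive as serialized JSON and must be parsed into columns.
    json_objects = {"environment.build", "environment.settings"}

    -- Output partitioning; submissionDate is the stated minimum.
    s3_path_dimensions = {
        {name = "submission_date", source = "Fields[submissionDate]"},
    }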
Trink, is this something you can take in the current or next iteration?
Flags: needinfo?(mtrinkala)
I can definitely mentor/review the PRs; ideally the team with the need/spec in their head should be handling the implementation and documentation, as a recent gap in https://bugzilla.mozilla.org/show_bug.cgi?id=1353110#c9 demonstrates. See the various parquetmr schemas in the pipeline repo, e.g. https://github.com/mozilla-services/mozilla-pipeline-schemas/blob/master/schemas/telemetry/shield-study/shield-study.3.parquetmr.txt

Dataset reference documentation: https://github.com/mozilla/firefox-data-docs/tree/master/datasets
Flags: needinfo?(mtrinkala)
(In reply to Mike Trinkala [:trink] from comment #6)
> I can definitely mentor/review the PRs; ideally the team with the need/spec
> in their head should be handling the implementation and documentation

The team in this case is us and we don't have time.
It was mentioned that you could have capacity for this.
Flags: needinfo?(mtrinkala)
This is part of creating a new ping type and it should be budgeted into the schedule as such; again I would be happy to help you learn how to self-serve. Let me know when you can work it into your schedule, thanks.
Flags: needinfo?(mtrinkala)
To clarify, it will take more time and effort for me to mentor someone. The time/work isn't the issue; the point is to bring other people up to speed on the infrastructure. This will be the third time D2P mentoring has been deferred because people are 'too busy'. The pushback here is so we start to address what is becoming a pathological problem.

That being said, and looking at the schema:
1) It doesn't actually match what is submitted, e.g. the meta object. This leads me to believe there is some additional debugging work to do before the parquet schema is even developed.
2) Will it be a one-to-one conversion, i.e. do we intend to write out the entire environment section to parquet? A full parquet schema has not been written for this common section, and it would constitute the bulk of the work.
2a) If we are cherry-picking fields for the parquet output, then an additional specification is needed here.

The scoping depends on the answers to the questions above but regardless it shouldn't be more than 1 point.
Flags: needinfo?(gfritzsche)
Per IRC, we don't have the time for this on our side.
trink can take this in the meantime (although someone from the pipeline side could be mentored too).

Forwarding the specific questions to Alessio.
Flags: needinfo?(gfritzsche) → needinfo?(alessio.placitelli)
(In reply to Mike Trinkala [:trink] from comment #9)
> To clarify, it will take more time and effort for me to mentor someone. The
> time/work isn't the issue; the point is to bring other people up to speed on
> the infrastructure. This will be the third time D2P mentoring has been
> deferred because people are 'too busy'. The pushback here is so we start to
> address what is becoming a pathological problem.
> 
> That being said, and looking at the schema:
> 1) It doesn't actually match what is submitted, e.g. the meta object. This
> leads me to believe there is some additional debugging work to do before the
> parquet schema is even developed.

Mh, the "meta" object doesn't need to be there? I built this schema by cargo-culting the "crash"/"main" schemas. Does meta need to be removed from there too (e.g. [1])?

> 2) Will it be a one-to-one conversion, i.e. do we intend to write out the
> entire environment section to parquet? A full parquet schema has not been
> written for this common section, and it would constitute the bulk of the
> work.
> 2a) If we are cherry-picking fields for the parquet output, then an
> additional specification is needed here.

This dataset will be used, among other things, for the "bot profiles" studies. For this reason, I'd expect us to need the full environment section.

Do we allow changes to the schema? I could gather a subset of potentially useful fields from the environment and expand the schema later, as needed.

[1] - https://github.com/mozilla-services/mozilla-pipeline-schemas/blob/master/schemas/telemetry/crash/crash.4.schema.json#L523
Flags: needinfo?(alessio.placitelli) → needinfo?(mtrinkala)
1) Yes, the crash and main pings should also be fixed.
2) Technically yes, but I am not sure *all* of the reporting tools are name-based as opposed to column-based, Frank?
Flags: needinfo?(mtrinkala) → needinfo?(fbertsch)
(In reply to Mike Trinkala [:trink] from comment #12)
> 1) Yes, the crash and main pings should also be fixed.

Created https://github.com/mozilla-services/mozilla-pipeline-schemas/pull/50 for that.
(In reply to Mike Trinkala [:trink] from comment #12)
> 2) Technically yes, but I am not sure *all* of the reporting tools are name-based as opposed to column-based, Frank?

Presto, Spark, and Athena all support this now, so we should be fine on the schema evolution front.
Flags: needinfo?(fbertsch)
Where are we at? Do we need the 100+ columns in the parquet table or only a subset? Since this is configuration work, it is not tied to the sprint packaging and can be deployed by Ops any time after completion (it will be packed into the June 12-25 sprint).
Flags: needinfo?(alessio.placitelli)
Hey Saptarshi, this bug is about creating a direct-to-parquet dataset for the 'new-profile' ping. As you know, this will be used for both the 'bot profiles' investigation and, in the future, for growth.

Other than the 'new-profile' ping payload, the client id, and all the common ping data [1], does this dataset need to contain the full environment section for each ping [2], or would a subset of the environment fields be enough?

If the latter, do you have any suggestions on which fields may be most useful? Please keep in mind that the table can be changed in the future to add more fields.

[1] - https://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/data/common-ping.html
[2] - https://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/data/environment.html
Flags: needinfo?(alessio.placitelli) → needinfo?(sguha)
Hi Alessio,

In the environment, I would keep, at a minimum:

build->
applicationName
architecture
version
vendor
hotfixVersion

settings->
isDefaultBrowser
defaultSearchEngine
locale
attribution

profile-> creationDate

partner

os->{name, version}

I think this would be a good start.
Peter, am I missing anything?
Flags: needinfo?(sguha) → needinfo?(pdolanjski)
Whiteboard: [measurement:client] [SvcOps] → [measurement:client:tracking] [SvcOps]
I don't know where the channel ID is normally, but we should have it to identify funnel cakes.
Otherwise looks good.
Flags: needinfo?(pdolanjski)
(In reply to Peter Dolanjski [:pdol] from comment #18)
> I don't know where the channel ID is normally, but we should have it to
> identify funnel cakes.
> Otherwise looks good.

Thanks Peter!

So, to recap, here's the minimum that's needed from the environment:

build.applicationName
build.architecture
build.version
build.buildId
build.vendor
build.hotfixVersion

settings.isDefaultBrowser
settings.defaultSearchEngine
settings.defaultSearchEngineData.*
settings.telemetryEnabled
settings.locale
settings.attribution.*
settings.update.*

profile.creationDate

partner.*

os.name
os.version
os.locale

Trink, thank you for your patience, this should be actionable now.
Flags: needinfo?(mtrinkala)
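(For illustration, a rough parquetmr-style sketch of how part of that recap might map onto a schema; the field names and types below are guesses based on the environment documentation, not the reviewed schema.)

    message new_profile {
        required binary clientId (UTF8);
        required binary submissionDate (UTF8);
        optional group environment {
            optional group build {
                optional binary applicationName (UTF8);
                optional binary architecture (UTF8);
                optional binary version (UTF8);
                optional binary buildId (UTF8);
                optional binary vendor (UTF8);
                optional binary hotfixVersion (UTF8);
            }
            optional group settings {
                optional boolean isDefaultBrowser;
                optional binary defaultSearchEngine (UTF8);
                optional boolean telemetryEnabled;
                optional binary locale (UTF8);
            }
            optional group profile {
                optional int64 creationDate;
            }
            optional group os {
                optional binary name (UTF8);
                optional binary version (UTF8);
                optional binary locale (UTF8);
            }
        }
    }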
Just to clarify - any schema evolution that may happen needs to be in top-level fields. This is the only way we can *guarantee* support for schema evolution.

Nested field schema evolution is supported in Spark, using `spark.option("mergeSchema", "true")`.
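As a quick illustration of that option, a minimal sketch (the S3 path is a placeholder, not the real dataset location):

    # Minimal sketch: read parquet output with nested schema merging enabled.
    # The bucket/prefix below is a placeholder, not the real dataset location.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("merge-schema-example").getOrCreate()

    df = (spark.read
          .option("mergeSchema", "true")  # reconcile column sets across files
          .parquet("s3://example-bucket/new_profile/v1/"))
    df.printSchema()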

Our current Presto config does not support nested field schema evolution. Bug 1352521 will fix that, however; that won't happen until we move Presto off of EMR, so hopefully early Q3, but Blake can give a better estimate.

The only unknown is Athena. We haven't run any tests for schema evolution support there, but I filed bug 1371706 to evaluate it.
(In reply to Alessio Placitelli [:Dexter] from comment #19)
> (In reply to Peter Dolanjski [:pdol] from comment #18)
> > I don't know where the channel ID is normally, but we should have it to
> > identify funnel cakes.
> > Otherwise looks good.
> 
> Thanks Peter!
> 
> So, to recap, here's the minimum that's needed from the environment:
> 
> build.applicationName
> build.architecture
> build.version
> build.buildId
> build.vendor
> build.hotfixVersion
> 
> settings.isDefaultBrowser
> settings.defaultSearchEngine
> settings.defaultSearchEngineData.*
> settings.telemetryEnabled
> settings.locale
> settings.attribution.*
> settings.update.*
> 
> profile.creationDate
> 
> partner.* (includes distribution_id)
> 
> os.name
> os.version
> os.locale
> 
> Trink, thank you for your patience, this should be actionable now.

Also (I didn't see these above):

application.channel (but I think this is top-level and included by default)
Assignee: nobody → mtrinkala
Points: --- → 1
Flags: needinfo?(mtrinkala)
Priority: P2 → P1
The JSON schema has environment and payload as optional. Are there times when these wouldn't be sent in a new-profile ping? Also, there is a lot of verification of fields/structures that are not used in the output; that extra schema can be removed.
Flags: needinfo?(alessio.placitelli)
(In reply to Mike Trinkala [:trink] from comment #22)
> The JSON schema has environment and payload as optional. Are there times
> when these wouldn't be sent in a new-profile ping? Also, there is a lot of
> verification of fields/structures that are not used in the output; that
> extra schema can be removed.

No, both the environment and the payload must always be sent with the new-profile ping. I made them mandatory here: https://github.com/mozilla-services/mozilla-pipeline-schemas/pull/52
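For readers following along, making those sections mandatory in a JSON schema amounts to listing them under "required"; a simplified sketch (the actual PR is structured around the full ping schema):

    {
      "type": "object",
      "required": ["environment", "payload"],
      "properties": {
        "environment": { "type": "object" },
        "payload": { "type": "object" }
      }
    }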

What do you mean by "output"? Do you mean not used in the parquet schema? If that's the case, since this data might be used outside of d2p (e.g. Spark), wouldn't it still be valuable to have schema validation for all the fields that are sent with the ping itself?
Flags: needinfo?(alessio.placitelli) → needinfo?(mtrinkala)
Yes. I am just saying there are associated processing, bandwidth, and storage costs; it would be nice to only pay for what we use in future efforts (and including data that doesn't have a specific use case should be avoided).
Flags: needinfo?(mtrinkala)
(In reply to Mike Trinkala [:trink] from comment #24)
> Yes. I am just saying there are associated processing, bandwidth, and
> storage costs; it would be nice to only pay for what we use in future
> efforts (and including data that doesn't have a specific use case should be
> avoided).

Yes, that's a very good point. For this particular use case, we're not sure upfront what the strictest set of dimensions would be, as this will be used for the bot-profile investigation too. However, we're considering introducing a slimmed-down version of the environment that is considerably smaller than the current one; see bug 1372220.
Hey Trink, what's the next action on this? Should this be in Ops' hands now?
Flags: needinfo?(mtrinkala)
Yes, Ops will have to deploy the cfg. They will most likely use the new schema packaging, and I am working with them on the transition. You will see a new template-based version of the schema being PR'ed shortly (plus an integration test suite that will test the schemas using the real ingestion infrastructure).
Assignee: mtrinkala → whd
Flags: needinfo?(mtrinkala)
https://github.com/mozilla-services/svcops/pull/1363/files and https://github.com/mozilla-services/puppet-config/pull/2601/files. Given no other information, I partitioned this by submission date only. Let me know if that's insufficient. I also derived json_objects from the parquetmr schema, as it wasn't made explicit either.

The parquet data should be available as telemetry_new_profile_parquet from re:dash whenever cron runs to pick it up.
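For example, once the table is picked up, a hypothetical re:dash query against it might look like this (the partition column name is an assumption based on the config above):

    -- Hypothetical query; assumes a submission_date partition column.
    SELECT submission_date, count(*) AS new_profiles
    FROM telemetry_new_profile_parquet
    WHERE submission_date >= '20170601'
    GROUP BY submission_date
    ORDER BY submission_date;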
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Component: Datasets: General → General