Closed Bug 1360256 Opened 7 years ago Closed 7 years ago

Add direct-to-parquet for new "new-profile" ping

Categories

(Data Platform and Tools :: General, enhancement, P1)

Points:
1

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Dexter, Assigned: whd)

References

Details

(Whiteboard: [measurement:client:tracking] [SvcOps])

Once bug 1120370 lands, we need to make the ping data available for analysis in re:dash and other data tools.
Blocks: 1351394
Whiteboard: [measurement:client]
Component: Metrics: Pipeline → Datasets: General
Product: Cloud Services → Data Platform and Tools
this done Wesley?
Flags: needinfo?(whd)
Priority: -- → P2
Whiteboard: [measurement:client] → [measurement:client] [SvcOps]
(In reply to thuelbert from comment #1)
> this done Wesley?

This bug still doesn't have a parquet schema and lua config attached to it. I need a schema in order to deploy the change, for which a separate bug can optionally be filed if we're using this one to track actually creating the schema.
Flags: needinfo?(whd)
(In reply to Wesley Dawson [:whd] from comment #2)
> (In reply to thuelbert from comment #1)
> > this done Wesley?
> 
> This bug still doesn't have a parquet schema and lua config attached to it.
> I need a schema in order to deploy the change, for which a separate bug can
> optionally be filed if we're using this one to track actually creating the
> schema.

Is a "parquet" schema the same as the JSON schemas that can be found https://github.com/mozilla-services/mozilla-pipeline-schemas/ ?
(In reply to Alessio Placitelli [:Dexter] from comment #3)

> Is a "parquet" schema the same as the JSON schemas that can be found
> https://github.com/mozilla-services/mozilla-pipeline-schemas/ ?

No, though that repo does contain some parquet schemas. Compare https://github.com/mozilla-services/mozilla-pipeline-schemas/blob/master/schemas/telemetry/core/core.9.parquetmr.txt to https://github.com/mozilla-services/mozilla-pipeline-schemas/blob/master/schemas/telemetry/core/core.9.schema.json. See also bug #1333206 for an example that includes a lua config, which demonstrates that the schema was tested on https://pipeline-cep.prod.mozaws.net/ with some real data. The lua config is needed mostly for json_objects and s3_path_dimensions, the latter of which should at a minimum include Fields[submissionDate].
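For a concrete picture, here is a minimal sketch of what such a lua config might contain, modeled loosely on the existing direct-to-parquet configs; the matcher, schema body, and field lists below are illustrative placeholders, not the reviewed values:

    -- Illustrative direct-to-parquet config sketch (not the deployed file).
    -- The message_matcher and schema are assumptions for this example.
    message_matcher = "Type == 'telemetry' && Fields[docType] == 'new-profile'"

    parquet_schema = [=[
        message new_profile {
            required binary clientId (UTF8);
            required binary submissionDate (UTF8);
        }
    ]=]

    -- Fields that arrive as serialized JSON and must be parsed into columns.
    json_objects = {"environment.build", "environment.settings"}

    -- Output partitioning; submissionDate is the stated minimum.
    s3_path_dimensions = {
        {name = "submission_date", source = "Fields[submissionDate]"},
    }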
Trink, is this something you can take in the current or next iteration?
Flags: needinfo?(mtrinkala)
I can definitely mentor/review the PRs; ideally the team with the need/spec in their head should be handling the implementation and documentation, as a recent gap in https://bugzilla.mozilla.org/show_bug.cgi?id=1353110#c9 demonstrates. See the various parquetmr schemas in the pipeline repo, e.g. https://github.com/mozilla-services/mozilla-pipeline-schemas/blob/master/schemas/telemetry/shield-study/shield-study.3.parquetmr.txt

Dataset reference documentation: https://github.com/mozilla/firefox-data-docs/tree/master/datasets
Flags: needinfo?(mtrinkala)
(In reply to Mike Trinkala [:trink] from comment #6)
> I can definitely mentor/review the PRs; ideally the team with the need/spec
> in their head should be handling the implementation and documentation

The team in this case is us and we don't have time.
It was mentioned that you could have capacity for this.
Flags: needinfo?(mtrinkala)
This is part of creating a new ping type and it should be budgeted into the schedule as such; again I would be happy to help you learn how to self-serve. Let me know when you can work it into your schedule, thanks.
Flags: needinfo?(mtrinkala)
To clarify, it will take more time and effort for me to mentor someone. The time/work isn't the issue; the point is to bring other people up to speed on the infrastructure. This will be the third time D2P mentoring has been deferred because people are 'too busy'. The pushback here is so we start to address what is becoming a pathological problem.

That being said, and looking at the schema:
1) It doesn't actually match what is submitted, e.g. the meta object. This leads me to believe there is some additional debugging work to do before the parquet schema is even developed.
2) Will it be a one-to-one conversion, i.e. do we intend to write out the entire environment section to parquet? A full parquet schema has not been written for this common section, and it would constitute the bulk of the work.
2a) If we are cherry-picking fields for the parquet output, then an additional specification is needed here.

The scoping depends on the answers to the questions above but regardless it shouldn't be more than 1 point.
Flags: needinfo?(gfritzsche)
Per IRC, we don't have the time for this on our side.
trink can take this in the meantime (although someone from the pipeline side could be mentored too).

Forwarding the specific questions to Alessio.
Flags: needinfo?(gfritzsche) → needinfo?(alessio.placitelli)
(In reply to Mike Trinkala [:trink] from comment #9)
> To clarify, it will take more time and effort for me to mentor someone. The
> time/work isn't the issue; the point is to bring other people up to speed on
> the infrastructure. This will be the third time D2P mentoring has been
> deferred because people are 'too busy'. The pushback here is so we start to
> address what is becoming a pathological problem.
> 
> That being said, and looking at the schema:
> 1) It doesn't actually match what is submitted, e.g. the meta object. This
> leads me to believe there is some additional debugging work to do before the
> parquet schema is even developed.

Mh, the "meta" object doesn't need to be there? I built this schema by cargo-culting the "crash"/"main" schemas. Does meta need to be removed from there too (e.g. [1])?

> 2) Will it be a one-to-one conversion, i.e. do we intend to write out the
> entire environment section to parquet? A full parquet schema has not been
> written for this common section, and it would constitute the bulk of the
> work.
> 2a) If we are cherry-picking fields for the parquet output, then an
> additional specification is needed here.

This dataset will be used, among other things, for the "bot profiles" studies. For this reason, I'd expect us to need the full environment section.

Do we allow changes to the schema? I could gather a subset of potentially useful fields from the environment and expand the schema later, as needed.

[1] - https://github.com/mozilla-services/mozilla-pipeline-schemas/blob/master/schemas/telemetry/crash/crash.4.schema.json#L523
Flags: needinfo?(alessio.placitelli) → needinfo?(mtrinkala)
1) Yes, the crash and main pings should also be fixed.
2) Technically yes, but I am not sure *all* of the reporting tools are name-based as opposed to column-based, Frank?
Flags: needinfo?(mtrinkala) → needinfo?(fbertsch)
(In reply to Mike Trinkala [:trink] from comment #12)
> 1) Yes, the crash and main pings should also be fixed.

Created https://github.com/mozilla-services/mozilla-pipeline-schemas/pull/50 for that.
(In reply to Mike Trinkala [:trink] from comment #12)
> 2) Technically yes, but I am not sure *all* of the reporting tools are name-based as opposed to column-based, Frank?

Presto, Spark, and Athena all support this now, so we should be fine on the schema evolution front.
Flags: needinfo?(fbertsch)
Where are we at? Do we need the 100+ columns in the parquet table or only a subset? Since this is configuration work, it is not tied to the sprint packaging and can be deployed by Ops any time after completion (it will be packed into the June 12-25 sprint).
Flags: needinfo?(alessio.placitelli)
Hey Saptarshi, this bug is about creating a direct-to-parquet dataset for the 'new-profile' ping. As you know, this will be used for both the 'bot profiles' investigation and, in the future, for growth.

Other than the 'new-profile' ping payload, the client id, and all the common ping data [1], does this dataset need to contain the full environment section for each ping [2], or would a subset of the environment fields be enough?

If the latter, do you have any suggestions on which fields may be most useful? Please keep in mind that the table can be changed in the future to add more fields.

[1] - https://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/data/common-ping.html
[2] - https://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/data/environment.html
Flags: needinfo?(alessio.placitelli) → needinfo?(sguha)
Hi Alessio,

In the environment, I would keep, at a minimum:

build->
applicationName
architecture
version
vendor
hotfixVersion

settings->
isDefaultBrowser
defaultSearchEngine
locale
attribution

profile-> creationDate

partner

os->{name, version}

I think this would be a good start.
Peter, am I missing anything?
Flags: needinfo?(sguha) → needinfo?(pdolanjski)
Whiteboard: [measurement:client] [SvcOps] → [measurement:client:tracking] [SvcOps]
I don't know where the channel ID is normally, but we should have it to identify funnel cakes.
Otherwise looks good.
Flags: needinfo?(pdolanjski)
(In reply to Peter Dolanjski [:pdol] from comment #18)
> I don't know where the channel ID is normally, but we should have it to
> identify funnel cakes.
> Otherwise looks good.

Thanks Peter!

So, to recap, here's the minimum that's needed from the environment:

build.applicationName
build.architecture
build.version
build.buildId
build.vendor
build.hotfixVersion

settings.isDefaultBrowser
settings.defaultSearchEngine
settings.defaultSearchEngineData.*
settings.telemetryEnabled
settings.locale
settings.attribution.*
settings.update.*

profile.creationDate

partner.*

os.name
os.version
os.locale

Trink, thank you for your patience, this should be actionable now.
Flags: needinfo?(mtrinkala)
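(For illustration, a rough parquetmr-style sketch of how part of that recap might map onto a schema; the field names and types below are guesses based on the environment documentation, not the reviewed schema.)

    message new_profile {
        required binary clientId (UTF8);
        required binary submissionDate (UTF8);
        optional group environment {
            optional group build {
                optional binary applicationName (UTF8);
                optional binary architecture (UTF8);
                optional binary version (UTF8);
                optional binary buildId (UTF8);
                optional binary vendor (UTF8);
                optional binary hotfixVersion (UTF8);
            }
            optional group settings {
                optional boolean isDefaultBrowser;
                optional binary defaultSearchEngine (UTF8);
                optional boolean telemetryEnabled;
                optional binary locale (UTF8);
            }
            optional group profile {
                optional int64 creationDate;
            }
            optional group os {
                optional binary name (UTF8);
                optional binary version (UTF8);
                optional binary locale (UTF8);
            }
        }
    }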
Just to clarify - any schema evolution that may happen needs to be in top-level fields. This is the only way we can *guarantee* support for schema evolution.

Nested field schema evolution is supported in Spark, using `spark.option("mergeSchema", "true")`.
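As a quick illustration of that option, a minimal sketch (the S3 path is a placeholder, not the real dataset location):

    # Minimal sketch: read parquet output with nested schema merging enabled.
    # The bucket/prefix below is a placeholder, not the real dataset location.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("merge-schema-example").getOrCreate()

    df = (spark.read
          .option("mergeSchema", "true")  # reconcile column sets across files
          .parquet("s3://example-bucket/new_profile/v1/"))
    df.printSchema()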

Our current Presto config does not support nested field schema evolution. Bug 1352521 will fix that, however; that won't happen until we move Presto off of EMR, so hopefully early Q3, but Blake can give a better estimate.

The only unknown is Athena. We haven't run any tests for schema evolution support there, but I filed bug 1371706 to evaluate it.
(In reply to Alessio Placitelli [:Dexter] from comment #19)
> (In reply to Peter Dolanjski [:pdol] from comment #18)
> > I don't know where the channel ID is normally, but we should have it to
> > identify funnel cakes.
> > Otherwise looks good.
> 
> Thanks Peter!
> 
> So, to recap, here's the minimum that's needed from the environment:
> 
> build.applicationName
> build.architecture
> build.version
> build.buildId
> build.vendor
> build.hotfixVersion
> 
> settings.isDefaultBrowser
> settings.defaultSearchEngine
> settings.defaultSearchEngineData.*
> settings.telemetryEnabled
> settings.locale
> settings.attribution.*
> settings.update.*
> 
> profile.creationDate
> 
> partner.* (includes distribution_id)
> 
> os.name
> os.version
> os.locale
> 
> Trink, thank you for your patience, this should be actionable now.

Also (I didn't see these above):

application.channel (but I think this is top-level and included by default)
Assignee: nobody → mtrinkala
Points: --- → 1
Flags: needinfo?(mtrinkala)
Priority: P2 → P1
The JSON schema has environment and payload as optional. Are there times when these wouldn't be sent in a new-profile ping? Also, there is a lot of verification of fields/structures that are not used in the output; that extra schema can be removed.
Flags: needinfo?(alessio.placitelli)
(In reply to Mike Trinkala [:trink] from comment #22)
> The JSON schema has environment and payload as optional. Are there times
> when these wouldn't be sent in a new-profile ping? Also, there is a lot of
> verification of fields/structures that are not used in the output; that
> extra schema can be removed.

No, both the environment and the payload must always be sent with the new-profile ping. I made them mandatory here: https://github.com/mozilla-services/mozilla-pipeline-schemas/pull/52
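For readers following along, making those sections mandatory in a JSON schema amounts to listing them under "required"; a simplified sketch (the actual PR is structured around the full ping schema):

    {
      "type": "object",
      "required": ["environment", "payload"],
      "properties": {
        "environment": { "type": "object" },
        "payload": { "type": "object" }
      }
    }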

What do you mean by "output"? Do you mean not used in the parquet schema? If that's the case, since this data might be used outside of d2p (e.g. Spark), wouldn't it still be valuable to have schema validation for all the fields that are sent with the ping itself?
Flags: needinfo?(alessio.placitelli) → needinfo?(mtrinkala)
Yes. I am just saying there are associated processing, bandwidth, and storage costs; it would be nice to only pay for what we use in future efforts (and including data that doesn't have a specific use case should be avoided).
Flags: needinfo?(mtrinkala)
(In reply to Mike Trinkala [:trink] from comment #24)
> Yes. I am just saying there are associated processing, bandwidth, and
> storage costs; it would be nice to only pay for what we use in future
> efforts (and including data that doesn't have a specific use case should be
> avoided).

Yes, that's a very good point. For this particular use case, we're not sure upfront what the strictest set of dimensions would be, as this will be used for the bot-profile investigation too. However, we're considering introducing a slimmed-down version of the environment that is considerably smaller than the current one; see bug 1372220.
Hey Trink, what's the next action on this? Should this be in Ops' hands now?
Flags: needinfo?(mtrinkala)
Yes, Ops will have to deploy the cfg. They will most likely use the new schema packaging, and I am working with them on the transition. You will see a new template-based version of the schema being PR'ed shortly (plus an integration test suite that will test the schemas using the real ingestion infrastructure).
Assignee: mtrinkala → whd
Flags: needinfo?(mtrinkala)
https://github.com/mozilla-services/svcops/pull/1363/files and https://github.com/mozilla-services/puppet-config/pull/2601/files. Given no other information, I partitioned this by submission date only. Let me know if that's insufficient. I also derived json_objects from the parquetmr schema, as it wasn't made explicit either.

The parquet data should be available as telemetry_new_profile_parquet from re:dash whenever cron runs to pick it up.
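For example, once the table is picked up, a hypothetical re:dash query against it might look like this (the partition column name is an assumption based on the config above):

    -- Hypothetical query; assumes a submission_date partition column.
    SELECT submission_date, count(*) AS new_profiles
    FROM telemetry_new_profile_parquet
    WHERE submission_date >= '20170601'
    GROUP BY submission_date
    ORDER BY submission_date;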
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Component: Datasets: General → General