Bugzilla

Comment 1

•

4 years ago

I think we should probably focus more on the migration to Glean.js, if this takes effort. We will migrate to Glean.js anyway, so we might as well save some bandwidth for us :)

Reporter

Comment 2

•

4 years ago

Agreed, this is a forward-facing bug. I do think it'll be confusing in the future to call everything in infrastructure code pioneer and everything external rally.

Assignee

Comment 3

•

4 years ago

Note also that we've embedded pioneer into schemas (i.e. pioneer_id) and so this change is going to require more than edge routing and project provisioning changes. I strongly agree that we should avoid making these changes until we've sorted everything else out.

Assignee

Updated

•

4 years ago

Assignee: nobody → whd

Priority: -- → P3

Assignee

Comment 4

•

4 years ago

I met with :amiyaguchi briefly to discuss the particulars of this work. We came up with the following:

Pipeline family and associated infrastructure retain pioneer in their names
i.e. pubsub topics, gcp projects, bigquery datasets, vpc-sc etc. will all stay the same. Since the application view into the data uses authorized views, the fact that the internal name of the datasets is pioneer will largely be invisible.

The edge ROUTE_TABLE will be updated to route by namespace prefix "rally-" to the pioneer/rally environment

Pioneer deployment of MPS schemas will include "rally-" in addition to "pioneer-"
Likewise, data-shared will exclude "rally-".

Operational schemas for payload_bytes_{decoded,error} will add a new optional rally_id field
We're choosing to include both rally_id and the existing pioneer_id in operational tables to avoid major changes to infrastructure.

Views into errors for any particular study will SELECT * EXCEPT to remove the client id field that isn't being used for a particular study

Rally schemas will use rally_id instead of pioneer_id

Any other application (beam) updates needed to support rally_id, as determined by :amiyaguchi
This includes sorting out deletion-request semantics and behavior with :Dexter.

Shredder will be updated to support rally_id
This depends on sorting out the application-level piece.

Analysis project naming convention will be changed to use rally instead of pioneer
The convention for external partners is to include their organization in the namespace: rally-[org]-[study-id], and for project ids to be of the form moz-fx-data-rally-[truncated org]-[secrets.token_hex(2)]
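
For reference, the naming convention in the last item could be sketched roughly as below. This is only an illustration: the function names are hypothetical, and the truncation rule is an assumption (the org is cut to whatever fits within GCP's 30-character project id limit), since the exact truncation length isn't settled above.

```python
import secrets

PROJECT_PREFIX = "moz-fx-data-rally-"
GCP_PROJECT_ID_MAX = 30  # GCP project ids may be at most 30 characters


def study_namespace(org: str, study_id: str) -> str:
    """Namespace for an external partner study: rally-[org]-[study-id]."""
    return f"rally-{org}-{study_id}"


def analysis_project_id(org: str) -> str:
    """Project id: moz-fx-data-rally-[truncated org]-[secrets.token_hex(2)].

    Assumption: the org is truncated so the full id fits the GCP limit.
    """
    suffix = secrets.token_hex(2)  # 2 random bytes -> 4 hex characters
    # Room left for the org after the prefix, the separating "-" and the suffix.
    room = GCP_PROJECT_ID_MAX - len(PROJECT_PREFIX) - len(suffix) - 1
    truncated = org[:room].rstrip("-")
    return f"{PROJECT_PREFIX}{truncated}-{suffix}"
```

With these assumptions an org named "princeton-university" would be truncated to "princet", and the random suffix differs on every call.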

Comment 5

•

4 years ago

•

Edited

(In reply to Wesley Dawson [:whd] from comment #4)

I met with :amiyaguchi briefly to discuss the particulars of this work. We came up with the following:

Pipeline family and associated infrastructure retain pioneer in their names
i.e. pubsub topics, gcp projects, bigquery datasets, vpc-sc etc. will all stay the same. Since the application view into the data uses authorized views, the fact that the internal name of the datasets is pioneer will largely be invisible.

The edge ROUTE_TABLE will be updated to route by namespace prefix "rally-" to the pioneer/rally environment

Pioneer deployment of MPS schemas will include "rally-" in addition to "pioneer-"
Likewise, data-shared will exclude "rally-".

Operational schemas for payload_bytes_{decoded,error} will add a new optional rally_id field
We're choosing to include both rally_id and the existing pioneer_id in operational tables to avoid major changes to infrastructure.

Views into errors for any particular study will SELECT * EXCEPT to remove the client id field that isn't being used for a particular study

Rally schemas will use rally_id instead of pioneer_id

Any other application (beam) updates needed to support rally_id, as determined by :amiyaguchi
This includes sorting out deletion-request semantics and behavior with :Dexter.

Shredder will be updated to support rally_id
This depends on sorting out the application-level piece.

Analysis project naming convention will be changed to use rally instead of pioneer
The convention for external partners is to include their organization in the namespace: rally-[org]-[study-id], and for project ids to be of the form moz-fx-data-rally-[truncated org]-[secrets.token_hex(2)]

I'll be talking to Anthony later today/this week to go through this. Let's hold back any action/update until we sync up, please :)

Assignee

Updated

•

4 years ago

Blocks: 1693305

Assignee

Comment 6

•

4 years ago

NI: amiyaguchi and :Dexter for results of discussion, as I haven't made any progress here per comment #5.

Flags: needinfo?(amiyaguchi)

Flags: needinfo?(alessio.placitelli)

Comment 7

•

4 years ago

(In reply to Wesley Dawson [:whd] from comment #6)

NI: amiyaguchi and :Dexter for results of discussion, as I haven't made any progress here per comment #5.

I don't think that's required for the upcoming study we're planning to launch, since it will be using the legacy data collection mechanism.

Flags: needinfo?(alessio.placitelli)

Assignee

Comment 8

•

4 years ago

I'm going to leave the NI on :amiyaguchi because I still need this information to proceed and I'd like to avoid a situation where this work becomes a last minute blocking request.

No longer blocks: 1693305

Reporter

Comment 9

•

4 years ago

My understanding is that the deadline for this is 2021-03-31, in order to support the implementation for bug 1675479. Please correct me if I'm wrong. I'm planning to make the mozilla-pipeline-schemas changes before making the decoder changes, which I am planning to start in the next two weeks.

Flags: needinfo?(amiyaguchi)

Comment 10

•

4 years ago

(In reply to Anthony Miyaguchi [:amiyaguchi] from comment #9)

My understanding is that the deadline for this is 2021-03-31, in order to support the implementation for bug 1675479. Please correct me if I'm wrong. I'm planning to make the mozilla-pipeline-schemas changes before making the decoder changes, which I am planning to start in the next two weeks.

You are correct.

Reporter

Updated

•

4 years ago

Blocks: 1697342

Reporter

Comment 11

•

4 years ago

Alessio and I did meet on 2021-01-21 regarding comment #5, the day after :whd and I discussed the operational piece regarding comment #4. I wish I had notes to recall the outcome of that discussion, but my current understanding is that this operational plan to bring the rally namespace online shouldn't conflict with the proposal/specification outlined in the Glean.js encryption document discussed and approved in bug 1675479.

Simply, the rally id will be in the client_info.client_id section and pulled out into a rally_id field in the metadata schema for operational purposes (errors, shredder, sample_id, etc). This will be visible to consumers as a primary client id field, in addition to the client_info.client_id field.

Reporter

Updated

•

4 years ago

Comment 12

•

4 years ago

(In reply to Anthony Miyaguchi [:amiyaguchi] from comment #11)

Simply, the rally id will be in the client_info.client_id section and pulled out into a rally_id field in the metadata schema for operational purposes (errors, shredder, sample_id, etc). This will be visible to consumers as a primary client id field, in addition to the client_info.client_id field.

Consumers of the Glean SDK/Client can't set the client id manually. We will define a new metric, "rally_id", and send this with Glean pings coming from Rally.
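
To make the decoder side of this concrete, here is a rough sketch of pulling a "rally_id" metric out of a decoded ping and copying it into the metadata used for operational purposes. Everything specific here is an assumption for illustration only: the metric being a uuid stored under metrics.uuid.rally_id, and the helper names, are hypothetical and not taken from the actual Glean.js or decoder implementation.

```python
import uuid
from typing import Any, Dict, Optional


def extract_rally_id(ping: Dict[str, Any]) -> Optional[str]:
    """Return the normalized rally id from a ping, or None if absent/invalid.

    Hypothetical location: ping["metrics"]["uuid"]["rally_id"].
    """
    value = ping.get("metrics", {}).get("uuid", {}).get("rally_id")
    if not isinstance(value, str):
        return None
    try:
        # Parsing validates the value and normalizes it to lowercase.
        return str(uuid.UUID(value))
    except ValueError:
        return None


def attach_rally_id(ping: Dict[str, Any], metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Copy the rally id into a new metadata dict when the ping carries one."""
    out = dict(metadata)
    rally_id = extract_rally_id(ping)
    if rally_id is not None:
        out["rally_id"] = rally_id
    return out
```

A ping without the metric leaves the metadata unchanged, which is the case a deletion job would have to handle separately.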

Flags: needinfo?(amiyaguchi)

GitHub Bugzilla PR Linker

Reporter

Comment 13

•

4 years ago

Will all pings, including things like the baseline ping, be guaranteed to include this "rally_id" metric? What is client_info.client_id set to if it's not set by the core addon? We need a stable client identifier for shredder to appropriately delete data; it would be useful to know what can be extracted during decoding re: bug 1697342.

Flags: needinfo?(amiyaguchi) → needinfo?(alessio.placitelli)

Comment 14

•

4 years ago

Attached file: Link to GitHub pull-request: https://github.com/mozilla-services/mozilla-pipeline-schemas/pull/665

Comment 15

•

4 years ago

(In reply to Anthony Miyaguchi [:amiyaguchi] from comment #13)

Will all pings, including things like the baseline ping, be guaranteed to include this "rally_id" metric? What is client_info.client_id set to if it's not set by the core addon? We need a stable client identifier for shredder to appropriately delete data; it would be useful to know what can be extracted during decoding re: bug 1697342.

Glean.js doesn't currently define pings other than the deletion-request (so this is not an immediate problem). In the future, Rally will likely not be sending baseline pings or, if it ever does, they will include the "rally_id" metric.

Flags: needinfo?(alessio.placitelli)

Assignee

Comment 16

•

4 years ago

https://github.com/mozilla-services/cloudops-infra/pull/2962 contains most of the ops changes codifying comment #4. I will plan to deploy the edge routing changes before the first rally- schemas are deployed (probably rally-debug from bug #1698913).
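
As a rough illustration of the prefix routing described in comment #4 (not the actual ROUTE_TABLE format or the contents of the PR), the decision could look like the sketch below. The /submit/<namespace>/... URI layout, the target labels and all names in the code are assumptions made for the example.

```python
from typing import Tuple

# Hypothetical (prefix, target) pairs: both prefixes go to the same environment.
PREFIX_ROUTES: Tuple[Tuple[str, str], ...] = (
    ("pioneer-", "pioneer"),
    ("rally-", "pioneer"),
)
DEFAULT_TARGET = "data-shared"  # assumed label for everything else


def route_for(uri: str) -> str:
    """Pick a target environment from the namespace in a submission URI.

    Assumes URIs of the form /submit/<namespace>/<doctype>/<version>/<docid>.
    """
    parts = uri.strip("/").split("/")
    if len(parts) < 2 or parts[0] != "submit":
        raise ValueError(f"not a submission URI: {uri!r}")
    namespace = parts[1]
    for prefix, target in PREFIX_ROUTES:
        if namespace.startswith(prefix):
            return target
    return DEFAULT_TARGET


print(route_for("/submit/rally-debug/demographics/1/abc"))
print(route_for("/submit/telemetry/main/4/abc"))
```

Matching on "rally-" with the trailing hyphen means a namespace that merely starts with "rally" is not routed to the pioneer/rally environment.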

Shredder work exists in https://bugzilla.mozilla.org/show_bug.cgi?id=1697454 and https://bugzilla.mozilla.org/show_bug.cgi?id=1698934, which :akomar will mostly be handling.