Adding `rally-` prefix as acceptable schema prefix for pioneer/rally projects
Categories
(Data Platform and Tools Graveyard :: Operations, enhancement, P3)
Tracking
(Not tracked)
People
(Reporter: amiyaguchi, Assigned: whd)
Attachments
(1 file)
Now that Rally is the new official name for what was known as pioneer-v2, new studies will want to be prefixed with the rally- prefix in mozilla-pipeline-schemas. The behavior should remain the same for existing pioneer- prefixed datasets. We would likely create projects with the rally- prefix too.
See https://github.com/mozilla-ion/ion-core-addon/issues/124 and https://docs.google.com/document/d/1_DNHPmFOdfPg_PLleenur185mcgVKUH8EGp4V-UeJZk/edit?ts=5faec577#heading=h.561oygfk6bnm (in particular, option (2)).
Comment 1•4 years ago
I think we should probably focus more on the migration to Glean.js, if this takes effort. We will migrate to Glean.js anyway, so we might as well save some bandwidth for us :)
Comment 2•4 years ago (Reporter)
Agreed, this is a forward-facing bug. I do think it'll be confusing in the future to be calling everything in infrastructure code pioneer and everything externally rally.
Comment 3•4 years ago (Assignee)
Note also that we've embedded pioneer into schemas (i.e. pioneer_id) and so this change is going to require more than edge routing and project provisioning changes. I strongly agree that we should avoid making these changes until we've sorted everything else out.
Updated•4 years ago (Assignee)
Comment 4•4 years ago (Assignee)
I met with :amiyaguchi briefly to discuss the particulars of this work. We came up with the following:
- Pipeline family and associated infrastructure retain `pioneer` in their names
  i.e. pubsub topics, gcp projects, bigquery datasets, vpc-sc etc. will all stay the same. Since the application view into the data uses authorized views, the fact that the internal name of the datasets is `pioneer` will largely be invisible.
- The edge `ROUTE_TABLE` will be updated to route by namespace prefix "rally-" to the pioneer/rally environment
- Pioneer deployment of MPS schemas will include "rally-" in addition to "pioneer-"
  Likewise, `data-shared` will exclude "rally-".
- Operational schemas for `payload_bytes_{decoded,error}` will add a new optional `rally_id` field
  We're choosing to include both `rally_id` and the existing `pioneer_id` in operational tables to avoid major changes to infrastructure.
- Views into errors for any particular study will `SELECT * EXCEPT` to remove the client id field that isn't being used for a particular study
- Rally schemas will use `rally_id` instead of `pioneer_id`
- Any other application (beam) updates needed to support `rally_id`, as determined by :amiyaguchi
  This includes sorting out deletion-request semantics and behavior with :Dexter.
- Shredder will be updated to support `rally_id`
  This depends on sorting out the application-level piece.
- Analysis project naming convention will be changed to use `rally` instead of `pioneer`
  The convention for external partners is to include their organization in the namespace: `rally-[org]-[study-id]`, and for project ids to be of the form `moz-fx-data-rally-[truncated org]-[secrets.token_hex(2)]`
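As an illustration only, a few items from the list above (prefix-based routing, the per-study error view, and the analysis project id convention) could be sketched in Python roughly as follows. The table contents, the `default` environment name, the function names, and the 8-character org truncation are all assumptions made for this sketch; the actual edge `ROUTE_TABLE` format and the real view definitions are not shown in this bug.

```python
import secrets

# Hypothetical routing table: namespace prefix -> pipeline environment.
# The real edge ROUTE_TABLE format is not shown in this bug.
ROUTE_TABLE = {
    "pioneer-": "pioneer",
    "rally-": "pioneer",  # rally- namespaces share the pioneer/rally environment
}
DEFAULT_ENVIRONMENT = "default"  # assumed name for "everything else"


def route(namespace: str) -> str:
    """Return the environment for a document namespace, matched by prefix."""
    for prefix, environment in ROUTE_TABLE.items():
        if namespace.startswith(prefix):
            return environment
    return DEFAULT_ENVIRONMENT


def error_view_sql(table: str, study_id_field: str) -> str:
    """Build a view query that drops whichever id field the study does not use."""
    id_fields = {"pioneer_id", "rally_id"}
    if study_id_field not in id_fields:
        raise ValueError(f"unknown id field: {study_id_field}")
    # Exactly one field remains once the study's own id field is removed.
    (unused,) = id_fields - {study_id_field}
    return f"SELECT * EXCEPT ({unused}) FROM `{table}`"


def analysis_project_id(org: str, max_org_len: int = 8) -> str:
    """Project id of the form moz-fx-data-rally-[truncated org]-[token].

    The truncation length is an assumption; the bug does not state one.
    """
    return f"moz-fx-data-rally-{org[:max_org_len]}-{secrets.token_hex(2)}"
```

For example, `route("rally-debug")` and `route("pioneer-core")` both resolve to the same environment under this table, while the view for a Rally study excludes `pioneer_id` and the view for a legacy study excludes `rally_id`.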
Comment 5•4 years ago
(In reply to Wesley Dawson [:whd] from comment #4)
I'll be talking to Anthony later today/this week to go through this. Let's hold back any action/update until we sync up, please :)
Comment 6•4 years ago (Assignee)
NI: amiyaguchi and :Dexter for results of discussion, as I haven't made any progress here per comment #5.
Comment 7•4 years ago
(In reply to Wesley Dawson [:whd] from comment #6)
> NI: amiyaguchi and :Dexter for results of discussion, as I haven't made any progress here per comment #5.
I don't think that's required for the upcoming study we're planning to launch, since it will be using the legacy data collection mechanism.
Comment 8•4 years ago (Assignee)
I'm going to leave the NI on :amiyaguchi because I still need this information to proceed and I'd like to avoid a situation where this work becomes a last minute blocking request.
Comment 9•4 years ago (Reporter)
My understanding is that the deadline for this is 2021-03-31, in order to support the implementation for bug 1675479. Please correct me if I'm wrong. I'm planning to make the mozilla-pipeline-schemas changes before making the decoder changes, which I am planning to start in the next two weeks.
Comment 10•4 years ago
(In reply to Anthony Miyaguchi [:amiyaguchi] from comment #9)
> My understanding is that the deadline for this is 2021-03-31, in order to support the implementation for bug 1675479. Please correct me if I'm wrong. I'm planning to make the mozilla-pipeline-schemas changes before making the decoder changes, which I am planning to start in the next two weeks.
You are correct.
Comment 11•4 years ago (Reporter)
Alessio and I did meet on 2021-01-21 regarding comment #5, the day after :whd and I discussed the operational piece regarding comment #4. I wish I had notes to recall the outcome of that discussion, but my current understanding is that this operational plan to bring the rally namespace online shouldn't conflict with the proposal/specification outlined in the Glean.js encryption document discussed and approved in bug 1675479.
Simply, the rally id will be in the `client_info.client_id` section and pulled out into a `rally_id` field in the metadata schema for operational purposes (errors, shredder, sample_id, etc). This will be visible to consumers as a primary client id field, in addition to the `client_info.client_id` field.
Comment 12•4 years ago
(In reply to Anthony Miyaguchi [:amiyaguchi] from comment #11)
> Simply, the rally id will be in the `client_info.client_id` section and pulled out into a `rally_id` field in the metadata schema for operational purposes (errors, shredder, sample_id, etc). This will be visible to consumers as a primary client id field, in addition to the `client_info.client_id` field.
Consumers of the Glean SDK/Client can't set the client id manually. We will define a new metric, "rally_id", and send this with Glean pings coming from Rally.
Comment 13•4 years ago (Reporter)
Will all pings, including things like the baseline ping, be guaranteed to include this "rally_id" metric? What is `client_info.client_id` set to if it's not set by the core addon? We need a stable client identifier for shredder to appropriately delete data; it would be useful to know what can be extracted during decoding re: bug 1697342.
Comment 14•4 years ago
Comment 15•4 years ago
(In reply to Anthony Miyaguchi [:amiyaguchi] from comment #13)
> Will all pings, including things like the baseline ping, be guaranteed to include this "rally_id" metric? What is `client_info.client_id` set to if it's not set by the core addon? We need a stable client identifier for shredder to appropriately delete data; it would be useful to know what can be extracted during decoding re: bug 1697342.
Glean.js doesn't currently define pings other than the deletion-request (so this is not an immediate problem). In the future, Rally will likely not send baseline pings; if it ever does, they will include the "rally_id" metric.
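To make the shredder concern concrete, here is a minimal sketch of pulling a "rally_id" metric out of a decoded Glean ping. It assumes the metric is a UUID metric stored under `metrics.uuid` with the hypothetical key `rally.id`; the real metric category and name are not given in this bug, and `extract_rally_id` is an illustrative helper, not part of the decoder.

```python
from typing import Any, Dict, Optional

# Assumed key of the metric inside the ping's "metrics.uuid" section.
RALLY_ID_METRIC = "rally.id"


def extract_rally_id(ping: Dict[str, Any]) -> Optional[str]:
    """Return the rally id carried as a metric, or None if the ping lacks it."""
    uuid_metrics = ping.get("metrics", {}).get("uuid", {})
    return uuid_metrics.get(RALLY_ID_METRIC)


# Usage: a ping that carries the metric, and one that does not.
with_id = {"metrics": {"uuid": {"rally.id": "3fa85f64-5717-4562-b3fc-2c963f66afa6"}}}
without_id = {"client_info": {"client_id": "unrelated"}}

print(extract_rally_id(with_id))
print(extract_rally_id(without_id))
```

A `None` result is the case raised in comment 13: without a stable identifier, a row cannot be matched to a deletion request.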
Comment 16•4 years ago (Assignee)
https://github.com/mozilla-services/cloudops-infra/pull/2962 contains most of the ops changes codifying comment #4. I will plan to deploy the edge routing changes before the first rally- schemas are deployed (probably rally-debug from bug #1698913).
Shredder work exists in https://bugzilla.mozilla.org/show_bug.cgi?id=1697454 and https://bugzilla.mozilla.org/show_bug.cgi?id=1698934, which :akomar will mostly be handling.
Comment 17•4 years ago (Assignee)
Per resolution of bug #1698913 this work is also complete.
Updated•2 years ago