Should we allow products to be removed from the probe-scraper without deleting their associated data?
Categories
(Data Platform and Tools :: General, enhancement)
Tracking
(Not tracked)
People
(Reporter: Dexter, Unassigned)
References
Details
(Whiteboard: [dataplatform])
As of today, removing any product from the repositories.yaml file for the probe-scraper will result in a mozilla-schema-generator failure (see bug 1764332), e.g.
items removed from the base
-org-mozilla-bergamot.custom.1.txt
This bug is for understanding why products can't be removed from the probe-scraper and if/how to fix that.
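To illustrate the failure mode, here is a minimal sketch of the kind of diff check that produces the error above; the function name and details are hypothetical, not the actual mozilla-schema-generator code:

```python
# Hypothetical sketch: fail generation if files present on the base
# branch have vanished from the regenerated artifact.

def check_removed(base_files: set, generated_files: set) -> list:
    removed = sorted(base_files - generated_files)
    if removed:
        print("items removed from the base")
        for name in removed:
            print(f"-{name}")
    return removed

# Removing a product from repositories.yaml drops its schemas from the
# regenerated artifact, which this check then reports:
check_removed({"org-mozilla-bergamot.custom.1.txt"}, set())
```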
Reporter
Comment 1 • 3 years ago
Hey :whd, why can't we remove things from the probe info service (and so from mozilla-schema-generator) without additionally deleting the tables?
I think the ideal state would be being able to deprecate products, stopping scraping them without deleting the existing tables. What's preventing us from getting there? Are you the best person to ask?
Comment 2 • 3 years ago
I've updated the title to be more specific. We support removing products from probe scraper as long as the source tables are also deleted, which requires a separate process. That process is not optimized and we could look at optimizing it, but it is a separate matter from what :Dexter is primarily concerned about, so I won't discuss it further here.

To be specific, I don't think we actually disallow probe-scraper itself from removing products (for the last 3 days of schemas generation failures, probe-scraper itself succeeded). Rather, the schemas generation[1] pipeline, and the schemas deployment pipeline as a whole, will fail if we remove schemas for deprecated products that have been deleted from probe-scraper (or have had their history rewritten in upstream git repos) without following the procedure above.
why can't we remove things from the probe info service (and so from mozilla-schema-generator) without additionally deleting the tables?
You can think of the schemas generation pipeline as idempotent: every run, it rebuilds https://github.com/mozilla-services/mozilla-pipeline-schemas/tree/generated-schemas from scratch (i.e. from the repository history of every repository probe-scraper scrapes, plus MPS's main branch).
From a BQ deployment perspective, this artifact is the only data that matters, as it indicates which tables are to be deployed (or by their absence relative to production state, deleted). MSG is responsible for generating this artifact, but the probe-scraper, MSG, and Probe Info services are all implementation details that update this artifact, and the BQ deployment pipeline has no knowledge of any of these components.
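To illustrate, a minimal sketch of this idempotent rebuild model; the helper names are hypothetical stand-ins for probe-scraper/MSG logic, not the real code:

```python
# Hypothetical stand-in for recomputing an app's schemas from its
# full git history.
def derive_schemas(history: list) -> dict:
    return {entry["file"]: entry["schema"] for entry in history}

def rebuild_generated_schemas(scraped_repos: dict) -> dict:
    """Each run starts from nothing, so deleting a repo from
    repositories.yaml (or rewriting its history) silently drops its
    schemas from the output."""
    artifact = {}
    for history in scraped_repos.values():
        artifact.update(derive_schemas(history))
    return artifact

repos = {"bergamot": [{"file": "org-mozilla-bergamot.custom.1.txt",
                       "schema": "..."}]}
print(rebuild_generated_schemas(repos))  # includes the bergamot schema
del repos["bergamot"]
print(rebuild_generated_schemas(repos))  # {} -- indistinguishable from deletion
```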
What you are asking for is not easy to implement without an additional data store, because at the end of the day the generated-schemas branch represents what should exist in production, regardless of deprecation status. We can't simply "forget" these tables exist (treating ops infra state as the aforementioned "additional data store"), as we would then have no way of deleting the tables when they go from deprecation to deletion, or of updating them when e.g. ACLs change or new metadata headers are introduced.
So the BQ deployment pipeline doesn't really need to understand the concept of deprecation: it only needs to understand which tables should be deployed or deleted. An additional layer would need to be implemented somewhere in the schemas generation pipeline to support a "deprecated, but don't delete this data" state for tables, whereby their schemas remain extant in the final generated-schemas artifact without consulting their upstream git repos. I would expect the end result in generated-schemas to look almost exactly the same to the BQ deployment pipeline after such a layer was introduced.
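To illustrate the deployment pipeline's diff-based view, a minimal sketch, assuming it only compares the generated artifact against production state (names hypothetical):

```python
# A table "forgotten" during generation looks exactly like one
# scheduled for deletion.
def deployment_plan(generated: set, production: set) -> dict:
    return {
        "create": generated - production,
        "delete": production - generated,  # absence from the artifact means delete
        "update": generated & production,  # ACL/metadata updates need the schema present
    }

print(deployment_plan(generated={"a", "b"}, production={"b", "c"}))
# e.g. create={'a'}, delete={'c'}, update={'b'}
```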
I do have some thoughts on how we might support such a state via an additional data store, but it is a complex enough problem that no trivial solutions come to mind. My expertise is generally with the BQ deployment pipeline rather than the schemas generation pipeline, so I've CC'd some folks from DE who might have thoughts. Given all of the above, however, and the fact that a deprecated flag exists in probe-scraper metadata, I think it might be worth reframing the issue as whether we can reasonably use the existing deprecated flag to lock schemas for deprecated projects to their "N-1" version instead of recomputing their schemas from their git histories (though this breaks the idempotency invariant). In this model, the source of truth on what is deprecated (but not deleted) remains in probe-scraper, but is resistant to changes like the one that motivated this bug, while deletion still occurs when entries are removed from probe-scraper.
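A minimal sketch of what this N-1 pinning might look like; all names are hypothetical, not the real mozilla-schema-generator code:

```python
def schemas_for_app(app: dict, previous_artifact: dict, recompute) -> dict:
    if app.get("deprecated"):
        # Keep the "N-1" schemas from the previously published artifact:
        # resistant to upstream history rewrites, at the cost of the
        # pure-rebuild (idempotency) invariant.
        return {path: schema for path, schema in previous_artifact.items()
                if path.startswith(app["schema_prefix"])}
    return recompute(app)

previous = {"org-mozilla-bergamot.custom.1.txt": "..."}
app = {"schema_prefix": "org-mozilla-bergamot", "deprecated": True}
print(schemas_for_app(app, previous, recompute=lambda a: {}))
```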
[1] This can be a bit confusing, so to be precise, I'm referring to three related terms:
schemas generation pipeline (the pipeline that updates generated-schemas)
BQ deployment pipeline (the pipeline that deploys BQ resources to GCP based on generated-schemas)
schemas deployment pipeline (the combination of the above two pipelines)
Reporter
Comment 3 • 3 years ago
(In reply to Wesley Dawson [:whd] from comment #2)
I've updated the title to be more specific. We support removing products from probe scraper as long as the source tables are also deleted, which requires a separate process. [...]
Thanks for clarifying. Yes, I'm looking for the ability to deprecate products in the probe-scraper without blocking the rest of the pipeline on that.
Repos can move or change, and some repos may even be deleted entirely.
why can't we remove things from the probe info service (and so from mozilla-schema-generator) without additionally deleting the tables?
[...]
Thank you for the fantastic, in-depth, clear answer.
[...] Given all of the above, however, and the fact that a deprecated flag exists in probe-scraper metadata, I think it might be worth reframing the issue as whether we can reasonably use the existing deprecated flag to lock schemas for deprecated projects to their "N-1" version instead of recomputing their schemas from their git histories [...]
From my consumer POV, I believe that this would be a good compromise, given the explanation about the BQ deployment pipeline.
Waiting for other folks to chime in on this.
Comment 4 • 3 years ago
:relud is working this half on replacing the "probe scraper" model with a "probe pusher" model where Glean apps push new metrics to a webhook as part of CI rather than relying on Git history. I imagine that new model would naturally alleviate this situation, as a deprecated app could simply remove the CI task that pushes new metrics.
:relud Does that sound right to you? Is it clear to you how app deprecation will play into the probe pusher model?
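For illustration, a minimal sketch of what such a CI push task might look like; the endpoint URL and payload shape are invented, not the actual probe pusher interface:

```python
import json
import urllib.request

def push_metrics(app_id: str, metrics_yaml: str,
                 endpoint: str = "https://example.com/glean-push") -> None:
    """POST an app's metrics definitions from CI instead of relying on
    probe-scraper reading git history. Invented endpoint and payload."""
    body = json.dumps({"app_id": app_id, "metrics": metrics_yaml}).encode()
    req = urllib.request.Request(
        endpoint, data=body, headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)

# Deprecating an app then just means removing this CI step: no new
# pushes, while existing schemas and tables are left alone.
```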
Comment 5 • 3 years ago
a deprecated app could simply remove the CI task that pushes new metrics.
Yep, this is what I expect to happen.
Reporter
Comment 6 • 3 years ago
:relud, is this the current proposal?
Any chance we could update the proposal to mention how it would solve this problem?
Thanks!
Reporter
Comment 8 • 3 years ago
Thanks. Looks like this will be solved, as stated, in the future evolution of probe-scraper. Should we keep this bug open to track that progress?
Reporter
Comment 9 • 2 years ago
Hey Anna, this was closed as FIXED, but the linked ticket doesn't offer much more information. Is it now safe to remove apps from the probe scraper without the risk of deleting the associated tables?
Comment 10 • 2 years ago
I believe so, but I'm tagging :relud to confirm (or to reopen the ticket in case I closed it prematurely).
Comment 11 • 2 years ago
Removing applications from repositories.yaml continues to be an indication that tables should be deleted. Instead, probe scraper allows applications to become "deprecated" by disabling their glean-push CI job, or even by deleting the repository entirely, so that an application is no longer scraped.
Although it was not the originally planned outcome, I believe this means that this ticket is FIXED.
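For illustration, a minimal sketch of the distinction comment 11 describes; the names are hypothetical, not the actual probe scraper code:

```python
def table_action(in_repositories_yaml: bool, still_pushing: bool) -> str:
    if not in_repositories_yaml:
        return "delete tables"  # removal from repositories.yaml still means deletion
    if not still_pushing:
        # glean-push CI job disabled, or repo deleted entirely
        return "deprecated: keep existing tables and last-known schemas"
    return "active: keep accepting pushes / scraping"

print(table_action(in_repositories_yaml=True, still_pushing=False))
```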