The code that loads validation schemas currently loads a fixed set of document types. We should change it to pre-load the vacuous schema and then, when it sees a new document type, attempt to load that schema on demand. If the schema loads, cache it and use it to validate; if not, use the vacuous schema. The doctype must be sanitized before being treated as a filename to prevent path-traversal exploits. For reference, the code that monitors submission counts restricts doctypes to the characters [-0-9a-zA-Z_]: https://github.com/mozilla-services/lua_sandbox_extensions/blob/master/moz_telemetry/io_modules/decoders/moz_telemetry/ping.lua#L134-L139
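A minimal sketch of the proposed on-demand loading (in Python rather than the Lua used by the actual plugin; the function name, the schema directory layout, and the `.schema.json` suffix are illustrative assumptions):

```python
import json
import os
import re

# Allowed doctype characters, mirroring the [-0-9a-zA-Z_] whitelist used by
# the submission-count monitor referenced above.
DOCTYPE_RE = re.compile(r"^[-0-9a-zA-Z_]+$")

VACUOUS_SCHEMA = {}  # an empty JSON schema matches any document

def load_schema(doctype, schema_dir, cache={}):
    """Return the schema for doctype, falling back to the vacuous schema.

    The doctype is validated against the whitelist before being used as a
    filename component, so it can never escape schema_dir.
    """
    if doctype in cache:
        return cache[doctype]
    schema = VACUOUS_SCHEMA
    if DOCTYPE_RE.match(doctype):
        path = os.path.join(schema_dir, doctype + ".schema.json")
        try:
            with open(path) as f:
                schema = json.load(f)
        except (OSError, ValueError):
            pass  # missing or malformed schema: fall back to vacuous
    cache[doctype] = schema
    return schema
```

Note that both successes and failures are cached, so each doctype costs at most one filesystem read.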
After having taken a stab at this in https://github.com/mozilla-services/lua_sandbox_extensions/compare/master...whd:bug_1337927?expand=1, I think it may be better to read all the available schemas from the filesystem directory on plugin start (using e.g. the lfs module to list the directory) instead of loading them dynamically. Otherwise we run into an issue if an attacker spams new (but potentially valid) docTypes: if we cache each of these as vacuous we can run out of memory in the input plugin, but if we don't, known documents for which we have no schema (like the new system diagnostics ping) will incur a read attempt for every ping we receive. The implementation above does not perform the caching, and so would incur thousands of failed reads every second. Loading from a directory listing means new doctype schemas can't be added without restarting the input plugin, but since the deploy process involves spinning up new EC2 instances anyway, this isn't really an issue.
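For comparison, a sketch of the startup-preload approach (again in Python rather than the lfs-based Lua; function names and the `.schema.json` naming convention are assumptions):

```python
import json
import os

VACUOUS_SCHEMA = {}  # an empty JSON schema matches any document

def preload_schemas(schema_dir):
    """Load every *.schema.json in schema_dir once, at plugin start.

    Unknown doctypes seen later simply map to the vacuous schema, so an
    attacker spamming novel doctypes costs neither memory nor file reads.
    """
    schemas = {}
    for name in os.listdir(schema_dir):
        if not name.endswith(".schema.json"):
            continue
        doctype = name[:-len(".schema.json")]
        with open(os.path.join(schema_dir, name)) as f:
            schemas[doctype] = json.load(f)
    return schemas

def schema_for(doctype, schemas):
    # Dictionary lookup only: no filesystem access on the hot path.
    return schemas.get(doctype, VACUOUS_SCHEMA)
```

The memory footprint is bounded by the number of schema files on disk, regardless of what doctypes arrive on the wire.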
Before we do this, we need to make sure that the `shield-study` schema will not cause us to discard (or send to the error stream) all of the shield-study pings prior to version 3. There is a "v2" shield study being deployed presently, which will run for a couple of weeks. If this change lands before that study is done, we will need to handle both the old style and new style submissions.
(In reply to Wesley Dawson [:whd] from comment #1)

I'm not that concerned about memory:
1) there would have to be A LOT of docTypes, and that would also cause issues for other plugins
1a) they would all share a single instance of the vacuous schema, so the memory requirement is pretty low
2) failed reads would only happen once for each docType; after the first failure the vacuous schema would match for that docType and no more reads should be attempted

That being said, I am +1 for loading everything we have at startup and calling it done; everything else maps to vacuous. The directory could be re-read without restarting the plugin (and will eventually have to be), but for now this is fine. Mark wanted an alert on new docTypes; would this be useful or just annoying?
(In reply to Mike Trinkala [:trink] from comment #3)
> 2) Failed reads would only happen once for each docType. After the first
> failure the vacuous schema would match for that docType and no more reads
> should be attempted.

My point is that if I as an attacker spam e.g. millions of new (but potentially valid) doc types, we can run out of memory mapping them all to the vacuous schema, or else we are forced to check the filesystem every time, or to implement some other mechanism for handling this.

> That being said I am +1 for loading everything we have at startup and
> calling it done; everything else maps to vacuous. The directory could be
> re-read without restarting the plugin (and will eventually have to be) but
> for now this is fine.

We can do this as we do with geoip, which is reloaded every minute to pick up geoip updates.

> Mark wanted an alert on new docTypes, would this be useful or just
> annoying?

It could be both. I want to know about new valid doc types (like system-diagnostics), but it can be hard to tell those from noise. Maybe a once-per-[interval] summary of new pings seen?
(In reply to Mark Reid [:mreid] from comment #2)
> If this change lands before that study is done, we will need to handle both
> the old style and new style submissions.

I don't think there is any urgency in deploying this, as shield schemas are the only schemas we have but don't load. This can go out on the normal sprint cadence or at any other time, assuming we don't add schemas for some other docType in the meantime.

(In reply to Mike Trinkala [:trink] from comment #3)
> 1) there would have to be A LOT of docTypes and this would also cause issues
> for other plugins

This is indeed a potential issue for any plugin that stores information on a per-docType basis without using some kind of "other" bucket. I am much more concerned about ingestion breaking than about analysis plugins, though.
Assignee: nobody → mtrinkala
Status: NEW → ASSIGNED
Points: 2 → 1
Priority: P2 → P1
Gregg, has the "couple of weeks" from comment 2 above elapsed? Is it now safe to turn on the shield schema validation and potentially discard old-style shield-study v2 pings that do not adhere to the current schemas?
Hello! I thought I had answered this, but I guess not! So, there is going to be at least one more old shield study, I think, which means that for the SHIELD-STUDY ping type we will still be receiving old-style pings. Let me see if I can get a more sensible / KILL WITH FIRE answer in the next few days. GL
Component: Metrics: Pipeline → Pipeline Ingestion
Product: Cloud Services → Data Platform and Tools
Rather than waiting for shield v2 pings to stop coming in, we will add support for per-version schemas in the pipeline schema repo. Per discussion with :trink and :rvitillo just now, this is the proposed new layout:

schemas/<namespace>/<doctype>/<doctype>.<version>.<format identifier>

The "schemas" prefix may seem redundant, but it allows us to separate the validation code and the future-planned build step from the actual schemas. In the case of shield-related documents, we will end up with:

schemas/telemetry/shield-study/shield-study.4.schema.json
schemas/telemetry/shield-study/shield-study.4.parquetmr.txt
schemas/telemetry/shield-study-error/shield-study-error.4.schema.json
schemas/telemetry/shield-study-error/shield-study-error.4.parquetmr.txt
schemas/telemetry/shield-study-addon/shield-study-addon.4.schema.json
schemas/telemetry/shield-study-addon/shield-study-addon.4.parquetmr.txt

Any shield submission that comes in with a version other than "4" will use the vacuous schema - or we can add schemas for versions prior to 4 to end up with something like:

schemas/telemetry/shield-study/shield-study.1.schema.json
schemas/telemetry/shield-study/shield-study.2.schema.json
schemas/telemetry/shield-study/shield-study.3.schema.json
schemas/telemetry/shield-study/shield-study.4.schema.json
schemas/telemetry/shield-study/shield-study.4.parquetmr.txt

:whd this will presumably impact how schemas are deployed to decoder nodes; does this seem reasonable to you? What sort of coordination will we need to ensure we don't cause problems deploying the existing decoder until this has been done?

https://github.com/mozilla-services/mozilla-pipeline-schemas
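As a sanity check on the layout, a small Python helper (hypothetical; not part of the repo or the decoder) that builds a schema path from the components above:

```python
import os

def schema_path(base, namespace, doctype, version, fmt="schema.json"):
    """Build a path under the proposed layout:
    schemas/<namespace>/<doctype>/<doctype>.<version>.<format identifier>
    """
    filename = "{}.{}.{}".format(doctype, version, fmt)
    return os.path.join(base, "schemas", namespace, doctype, filename)
```

For example, `schema_path("", "telemetry", "shield-study", 4)` yields `schemas/telemetry/shield-study/shield-study.4.schema.json`, matching the shield examples above.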
(In reply to Mark Reid [:mreid] from comment #9)
> :whd this will presumably impact how schemas are deployed to decoder nodes,
> does this seem reasonable to you?

Yes.

> What sort of coordination will we need to ensure we don't cause problems
> deploying the existing decoder until this has been done?

CCing me on the schema repo PR should be sufficient.
Status: ASSIGNED → RESOLVED
Last Resolved: a year ago
Resolution: --- → FIXED
This was deployed today after https://github.com/mozilla-services/mozilla-pipeline-schemas/pull/44 was merged.