Closed Bug 1401979 Opened 7 years ago Closed 5 years ago

Add normalized_build (or official_build) to validated pings

Categories

(Data Platform and Tools :: General, enhancement, P3)

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: frank, Unassigned)

There are two options:

1. Include a boolean about whether this ping is an "official" build (could this actually come from the client?)
2. Include a normalized_build field that includes the build IFF it is an official build
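A minimal sketch of how the two options would differ in the resulting record. It assumes a flat ping dict with a hypothetical `build_id` key and an `is_official` decision made elsewhere; the field names `official_build` and `normalized_build` follow the bug title, and nothing here reflects the actual pipeline schema.

```python
from typing import Any, Dict, Optional


def annotate_option_1(ping: Dict[str, Any], is_official: bool) -> Dict[str, Any]:
    """Option 1: add a boolean flag; the raw build ID is left untouched."""
    out = dict(ping)  # shallow copy so the input ping is not mutated
    out["official_build"] = is_official
    return out


def annotate_option_2(ping: Dict[str, Any], is_official: bool) -> Dict[str, Any]:
    """Option 2: add normalized_build, populated only for official builds."""
    out = dict(ping)
    # "build_id" is a hypothetical key used for illustration only.
    normalized: Optional[str] = ping.get("build_id") if is_official else None
    out["normalized_build"] = normalized
    return out
```

With option 2, a query can filter on `normalized_build` being non-null and group by it in one step; with option 1, the flag and the raw build ID have to be combined by the consumer.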
Component: Datasets: General → Pipeline Ingestion
This needs a bit more design - how will we incorporate official build info? I don't think we can rely on the client to report whether it's official (in fact, I think we already have some info in the ping about this). If we can, I think it can be done relatively easily. If not, we'll need to scope out how to get official build info in a timely fashion.
Flags: needinfo?(fbertsch)
:chutten and I were just discussing build-hub integration. Presumably we could get "official" builds from buildhub? The only downside to that might be a delay in receiving new build-ids; i.e. if the build goes out, people crash, pings come in, and THEN we receive info about that build - we'll set those crashes as unofficial builds and miss them (bad!).

Can we get official build-ids before they are built? Or at least before they are deployed?
Flags: needinfo?(fbertsch)
(In reply to Frank Bertsch [:frank] from comment #2)
> :chutten and I were just discussing build-hub integration. Presumably we
> could get "official" builds from buildhub? The only downside to that might
> be delay of receiving new build-ids; i.e. if the build goes out, people
> crash, pings come in, and THEN we receive info about that build - we'll set
> those crashes as unofficial builds and miss them (bad!).
>
> Can we get official build-ids before they are built? Or at least before they
> are deployed?

There is also the additional detail that on Linux, pretty much all builds are coming from third parties so won't be in the buildid database (I suppose this might change in the next year or so, in a future where we produce our own flatpak and snappy builds).

I don't think using buildhub for this is the right approach tbh. It just creates another system that we have to depend on / query every time we want to determine whether a ping represents an official build. If we want to see whether a build was produced by us, we should include that in the payload only in official builds (yes, there is an issue that a motivated actor could work around -- but then, they could also submit a fake ping with a known buildid...).

Could we use the vendor field for this?

http://searchfox.org/mozilla-central/source/toolkit/components/telemetry/docs/data/environment.rst#31

I think it is always set to "Mozilla" right now (even local builds set this value), but we should be able to make it configurable. Or we could just use one of Frank's suggestions in comment 0.
(In reply to William Lachance (:wlach) (use needinfo!) from comment #3)
> I don't think using buildhub for this is the right approach tbh. It just
> creates another system that we have to depend on / query every time we want
> to determine whether a ping represents an official build. If we want to see
> whether a build was produced by us, we should include that in the payload
> only in official builds (yes, there is an issue that a motivated actor could
> work around -- but then, they could also submit a fake ping with a known
> buildid...).

Thinking about it, I'd rather we checked for valid build-ids server-side. My rationale is that if we have these people report unofficial builds as official, that will clutter up the set of official builds that we've seen, and will still require cleanup server-side. Alternatively, by validating build-ids, they can certainly fake one, but that data will just be munged in with the rest of that build.
(In reply to Frank Bertsch [:frank] from comment #4)
> (In reply to William Lachance (:wlach) (use needinfo!) from comment #3)
> > I don't think using buildhub for this is the right approach tbh. It just
> > creates another system that we have to depend on / query every time we want
> > to determine whether a ping represents an official build. If we want to see
> > whether a build was produced by us, we should include that in the payload
> > only in official builds (yes, there is an issue that a motivated actor could
> > work around -- but then, they could also submit a fake ping with a known
> > buildid...).
>
> Thinking about it, I'd rather we checked for valid build-ids server-side. My
> rationale is that if we have these people report unofficial builds as
> official, that will clutter up the set of official builds that we've seen,
> and will still require cleanup server-side. Alternatively, by validating
> build-ids, they can certainly fake one, but that data will just be munged in
> with the rest of that build.

Yeah, I suppose we would be trusting people not to abuse a field that we provide.

Assuming we went with the buildhub approach, I would imagine that we would want some kind of data structure that provided a set of valid channel/platform/buildid combinations. I *suspect* this wouldn't be too huge?

Maybe it's time we roped in the buildhub people to see if they can provide advice. Mathieu, from the buildhub commit log it looks like you're the most active contributor to the product. Can you tell us:

1) What is the latency of a build being released and it appearing in buildhub, if any?
2) Could you get buildhub to produce some kind of whitelist of valid buildids, as described above for the purposes of validating telemetry pings? I know we could look them up one at a time using the existing API, but that's not obviously going to scale for our purposes. :)
Flags: needinfo?(mathieu)
Hi there!

Thanks for your interest in buildhub ;)

> 1) What is the latency of a build being released and it appearing in buildhub,
> if any?
>

Currently, our strategy consists of relying on Amazon S3 events to publish entries in buildhub. In other words, it should appear in buildhub a few seconds after the release file is uploaded on archive.mozilla.org.

> 2) Could you get buildhub to produce some kind of whitelist of valid buildids,
> as described above for the purposes of validating telemetry pings? I know we
> could look them up one at a time using the existing API, but that's not
> obviously going to scale for our purposes. :)
>

Currently buildhub is just a raw instance of Kinto with an ElasticSearch plugin. The buildhub repo only contains a set of scripts to populate the data from archive.mozilla.org.

So, everything is possible, «famous last words» ;) but for obvious reasons, before adding specificities to the API, we'll want to make sure we cannot answer your needs with the current querying features :)

The /search endpoint is a bridge to the underlying ElasticSearch where you can perform powerful queries.

For example, you can obtain the list of build ids of a specific version:

    curl -s "https://buildhub.stage.mozaws.net/v1/buckets/build-hub/collections/releases/search?q=target.version:57.0b3" | \
      jq -r '.hits.hits[] | ._source.build.id' | \
      sort -u

The /records endpoint contains ETag headers that you can use to poll for changes or cache results locally.

Let us know what your ideal API would look like, and we could elaborate a plan from that :)
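For consumers that would rather stay in Python than shell out to jq, the extraction step of the pipeline above can be sketched as follows. It assumes the response body has already been parsed with `json.loads` and follows only the paths the jq filter reads (`hits.hits[]._source.build.id`); the function name and the sample fragment are hypothetical, not real buildhub data.

```python
from typing import Any, Dict, List


def build_ids_from_search(response: Dict[str, Any]) -> List[str]:
    """Return the sorted, de-duplicated build IDs found in one search response.

    Mirrors `jq -r '.hits.hits[] | ._source.build.id' | sort -u`.
    """
    ids = set()
    for hit in response.get("hits", {}).get("hits", []):
        build_id = hit.get("_source", {}).get("build", {}).get("id")
        if build_id is not None:  # skip records that carry no build ID
            ids.add(str(build_id))
    return sorted(ids)


# Hypothetical fragment, shaped like the paths the jq filter reads.
sample_response = {
    "hits": {
        "hits": [
            {"_source": {"build": {"id": "20170928180207"}}},
            {"_source": {"build": {"id": "20170926190823"}}},
            {"_source": {"build": {"id": "20170928180207"}}},
        ]
    }
}

unique_ids = build_ids_from_search(sample_response)
```

Here `unique_ids` holds two entries, because the repeated ID collapses just as `sort -u` would collapse it.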
Flags: needinfo?(mathieu)
(In reply to Mathieu Leplatre (:leplatrem) from comment #6)
> Hi there!
>
> Thanks for your interest in buildhub ;)
>
> > 1) What is the latency of a build being released and it appearing in buildhub,
> > if any?
> >
>
> Currently, our strategy consists of relying on Amazon S3 events to
> publish entries in buildhub. In other words, it should appear in buildhub a
> few seconds after the release file is uploaded on archive.mozilla.org.

Great! This should be more than fast enough for our needs.

> > 2) Could you get buildhub to produce some kind of whitelist of valid buildids,
> > as described above for the purposes of validating telemetry pings? I know we
> > could look them up one at a time using the existing API, but that's not
> > obviously going to scale for our purposes. :)
> >
>
> Currently buildhub is just a raw instance of Kinto with an ElasticSearch
> plugin. The buildhub repo only contains a set of scripts to populate the
> data from archive.mozilla.org.
>
> So, everything is possible, «famous last words» ;) but for obvious
> reasons, before adding specificities to the API, we'll want to make sure we
> cannot answer your needs with the current querying features :)
>
> ...
> Let us know what your ideal API would look like, and we could elaborate a
> plan from that :)

So our use case here is that in various telemetry tools, we'd like to validate whether a ping is coming from an "official" build. The volume of pings we process is very high so I don't think an API call-per-ping is really going to cut it.

What I'd like to be able to do is just store a whitelist in memory of valid build IDs per channel/platform in our transformation code, something like:

    {
      <platform e.g. macosx>: {
        <channel e.g. release>: Set([buildid1, buildid2, buildid3]),
        <channel ..>: Set([...]),
        ...
      },
      <platform> : { ... },
      ...
    }

I *think* the cardinality of buildids is small enough that this would work.

Anyway, I'd be happy with any API that made it easy to construct such a table. Obviously a json api with the exact information above would be the most obvious implementation, but I'm open to other approaches.
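The nested table described above could be sketched in Python roughly as follows. This assumes the input arrives as (platform, channel, build ID) triples from whatever source ends up providing them; the names `build_allowlist` and `is_official` are illustrative and not part of any existing telemetry or buildhub API.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Allowlist = Dict[str, Dict[str, Set[str]]]


def build_allowlist(records: Iterable[Tuple[str, str, str]]) -> Allowlist:
    """Group (platform, channel, build_id) triples into nested sets."""
    table = defaultdict(lambda: defaultdict(set))
    for platform, channel, build_id in records:
        table[platform][channel].add(build_id)
    # Convert to plain dicts so later lookups never create empty entries.
    return {platform: dict(channels) for platform, channels in table.items()}


def is_official(table: Allowlist, platform: str, channel: str, build_id: str) -> bool:
    """True only if this exact platform/channel/build ID combination is known."""
    return build_id in table.get(platform, {}).get(channel, set())
```

Each check is two dict lookups and one set membership test, so it needs no network call per ping, which is the property the comment above is after.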
(In reply to William Lachance (:wlach) (use needinfo!) from comment #3)
> (In reply to Frank Bertsch [:frank] from comment #2)
> > :chutten and I were just discussing build-hub integration. Presumably we
> > could get "official" builds from buildhub? The only downside to that might
> > be delay of receiving new build-ids; i.e. if the build goes out, people
> > crash, pings come in, and THEN we receive info about that build - we'll set
> > those crashes as unofficial builds and miss them (bad!).
> >
> > Can we get official build-ids before they are built? Or at least before they
> > are deployed?
>
> There is also the additional detail that on Linux, pretty much all builds
> are coming from third parties so won't be in the buildid database (I suppose
> this might change in the next year or so, in a future where we produce our
> own flatpak and snappy builds).

As stated in bug 1233687 comment 18, Linux repacks that maintain the name and functionality are to be considered official and, as such, we're getting Telemetry from them. Before we finish with flatpak (bug 1278719), we could still reach out to the main distros [1] and ask for their build ids.

[1] - https://sql.telemetry.mozilla.org/queries/40172
(In reply to William Lachance (:wlach) (use needinfo!) from comment #7)
>
> {
>   <platform e.g. macosx>: {
>     <channel e.g. release>: Set([buildid1, buildid2, buildid3]),
>     <channel ..>: Set([...]),
>     ...
>   },
>   <platform> : { ... },
>   ...
> }
>
> I *think* the cardinality of buildids is small enough that this would work.
>
> Anyway, I'd be happy with any API that made it easy to construct such a
> table. Obviously a json api with the exact information above would be the
> most obvious implementation, but I'm open to other approaches.

I wrote a small Python script that puts together the results of several ElasticSearch queries to fulfill your use-case:

https://gist.github.com/leplatrem/ef29b3cd06690f8a3c8bdc693fd46a2c

(You can even simplify it by hard coding known platforms and channels. Or optimize it to group the queries together if you know the ElasticSearch API well.)

It is rather fast and, depending on your context, you could re-run it every hour or so...

Good thing is that this little experiment helped us realize that a lot of Mac OS records were missing :) We fixed it and are now filling up the gaps...

Let us know if something like this could do it, or if it is totally off beam :)
Flags: needinfo?(wlachance)
(In reply to Mathieu Leplatre (:leplatrem) from comment #9)
> It is rather fast and, depending on your context, you could re-run it
> every hour or so...
>
> Good thing is that this little experiment helped us realize that a lot of
> Mac OS records were missing :) We fixed it and are now filling up the gaps...
>
> Let us know if something like this could do it, or if it is totally off beam
> :)

Hi Mathieu, this is great! It seems to complete in less than 2 seconds from my office internet connection, which is more than fast enough. Uncompressed, the json file is less than 500k.

We still need to sort out the details, but my guess is that something like this should be more than enough for our needs.
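The "re-run it every hour or so" idea could be sketched as a small in-memory cache that re-fetches the table once it is older than a configurable age. This is only an illustration under stated assumptions: `fetch` stands in for whatever produces the table (for example a script like the gist in comment 9), and the class name, the one-hour default, and the injectable `clock` are all hypothetical.

```python
import time
from typing import Callable, Dict, Optional, Set

Allowlist = Dict[str, Dict[str, Set[str]]]


class RefreshingAllowlist:
    """Keeps an allowlist in memory and re-fetches it once it is too old."""

    def __init__(
        self,
        fetch: Callable[[], Allowlist],
        max_age_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch            # hypothetical loader supplied by the caller
        self._max_age = max_age_seconds
        self._clock = clock            # injectable so the refresh logic is testable
        self._table: Optional[Allowlist] = None
        self._fetched_at = 0.0

    def get(self) -> Allowlist:
        """Return the cached table, fetching it first if missing or stale."""
        now = self._clock()
        if self._table is None or now - self._fetched_at >= self._max_age:
            self._table = self._fetch()
            self._fetched_at = now
        return self._table
```

A shorter `max_age_seconds` narrows the window in which pings from a freshly released build would be classified as unofficial, at the cost of more frequent fetches.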
Flags: needinfo?(wlachance)
Blocks: 1408525
This is a nice-to-have, but as far as I know nothing is dependent on it at this time. As such, I'm setting to P3 to be re-prioritized next quarter.
Priority: -- → P3

We... actually talked about this today but I'm going to close this as WONTFIX and we can track any new work in this space with new bugs.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX
Component: Pipeline Ingestion → General