Closed Bug 823303 Opened 9 years ago Closed 8 years ago
Review: Publicize Bug Metadata from Bugzilla
> Who is/are the point of contact(s) for this review?
Kyle Lahnakoski

> Please provide a short description of the feature / application (e.g. problem solved, use cases, etc.):
Metrics has pulled information from Bugzilla, and placed it in an Elastic Search document store. There are no titles, no comments, and no descriptions. We would like this information available to the public so that it can be analyzed and digested by the community at large.

> Please provide links to additional information (e.g. feature page, wiki) if available and not yet included in feature description:
The (Extract-Transform-Load) code is at https://github.com/mozilla-metrics/bugzilla_etl/tree/es_schema_v2. The schema is attached. My page uses this data: http://people.mozilla.com/~klahnakoski/ and there are requests from universities to make this type of information public.

> Does this request block another bug?
NO

> To help prioritize this work request, does this project support a goal specifically listed on this quarter's goal list? If so, which goal?
NO

> Does this feature or code change affect Firefox, Thunderbird or any product or service the Mozilla ships to end users?
NO

> Are there any portions of the project that interact with 3rd party services?
NO.

> Will your application/service collect user data?
NO

> If you feel something is missing here or you would like to provide other kind of feedback, feel free to do so here (no limits on size):
**** The significant difference between this ES data and the public BZ data, is that ES HAS ALL THE SECURITY BUGS, but they lack descriptions, titles and comments. The ES version has ALL HISTORY; which is technically already available, but in the context of security bugs may increase risk. I am not asking for the ES service to be exposed (it can both query and update, so it is in no position to be exposed), but rather I am asking for the information to be exposed.
The schema attached is probably the best source for determining what will be exposed.
we'll triage this at the 2013.01.02 meeting
Whiteboard: [pending secreview] → [pending secreview][triage needed]
Assignee: nobody → mgoodwin
ugh, https://github.com/mozilla-metrics/bugzilla_etl/blob/es_schema_v2/configuration/kettle/bugzilla_aliases.txt has a ton of e-mail addresses in it, which makes it great for spammers (though, bugmail addresses are generally ripe for spamming anyway). Also, that list has many incorrect aliases... For example:

firstname.lastname@example.orgemail@example.com;multi;1;453594
firstname.lastname@example.orgemail@example.com;multi;1;374568
firstname.lastname@example.orgemail@example.com;multi;1;489109

Where's the best place to report the above issues?
There's a ton of stuff in the schema that we should *never* be publicly releasing for security bugs (or even bugs in any groups at all). Attachment descriptions, private attachments, status whiteboard entries, aliases, ... Will all that be included?
I have asked Mark Reid, and copied Reed Loden, about where to report alias issues. There is meta information on attachments, but no descriptions or patches. Status whiteboard entries do show, aliases show also. I am attaching an example of what the attachment content looks like. If you want an example of a specific bug, please ask.
ES stores bug version history as JSON. Here is an example bug.
(In reply to Reed Loden [:reed] from comment #2)
> Also, that list has many incorrect aliases... For example:
>
> firstname.lastname@example.orgemail@example.com;multi;1;453594
> firstname.lastname@example.orgemail@example.com;multi;1;374568
> firstname.lastname@example.orgemail@example.com;multi;1;489109
Only rows marked as "curated" or "single" in column 2 are legitimate aliases. The ones marked "multi" are false positives and are only included so that they can be manually vetted/curated.
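As a rough illustration of the rule above, the following sketch keeps only rows whose second column is "curated" or "single". It assumes the alias file is plain semicolon-delimited text with the marker in column 2, as in the rows quoted above; the function name and the sample addresses are hypothetical and are not taken from the ETL code.

```python
import csv
import io

# Markers that identify vetted aliases, per the comment above.
LEGITIMATE_KINDS = {"curated", "single"}


def legitimate_aliases(text):
    """Return the rows whose second column marks them as vetted aliases.

    Assumes semicolon-delimited rows; rows with fewer than two columns
    are skipped rather than guessed at.
    """
    reader = csv.reader(io.StringIO(text), delimiter=";")
    return [
        row for row in reader
        if len(row) >= 2 and row[1].strip() in LEGITIMATE_KINDS
    ]


# Hypothetical sample rows in the same shape as the quoted ones.
sample = (
    "a@example.org;multi;1;453594\n"
    "b@example.org;curated;1;100\n"
    "c@example.org;single;1;200\n"
)
rows = legitimate_aliases(sample)
```

Rows marked "multi" are dropped here only for consumption purposes; they would still need to stay in the file if they are kept for manual vetting.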
(In reply to Kyle Lahnakoski from comment #4)
> There is meta information on attachments, but no descriptions or patches.
> Status whiteboard entries do show, aliases show also.
Yeah, status whiteboard entries and aliases (on at least bugs within groups) need to be masked.
> I am attaching an example of what the attachment content looks like. If you
> want an example of a specific bug, please ask.
Can you give me an example of bug 713926? It's just a random security bug that I've worked on, but it has some fields set and whatnot.
The ES datastore keeps multiple snapshots of every bug over time. Furthermore, when possible, it keeps previous values with the current snapshot. This makes the data redundant, but easier to query with ES's limited query language. There is nothing particularly sensitive in this attachment.
I'm worried that attackers could figure out quite a bit just from the component and CC list. And making the whiteboard public scares me too. Why do you want this information about security bugs to be public?
We cannot get accurate trend information if we remove work from the pile. The goal is not to make the security bugs specifically public, but to have as much info as possible out there so we can track workloads and build much more robust dashboards, removing the load from Bugzilla. We can definitely make them "private bugs" and remove any info that would be of concern from a security point of view. The more info we can leave in the better, so going too far and removing all whiteboard tags, products, and components would likely cripple the utility, so we need to be very specific of what exactly needs to be removed or how we can white wash it so that people don't recognize the bug as specifically a security bug. The current question is, how do we white wash the data so we can let 3rd parties query the rest interface without compromising security information?
Jesse: I can understand the CC list being a concern; and since I have not used it yet, I have no immediate concern with removing it. But, the whiteboard tags are important for the types of queries we run: May you give me an example where whiteboard tags are an issue?
(In reply to Martin Best (:mbest) from comment #10)
> we need to be very specific of what exactly
> needs to be removed or how we can white wash it so that people don't
> recognize the bug as specifically a security bug.
Well, if you're leaving the keywords available, that's an easy indication that the bug is a security bug, as we have various sec-*, wsec-*, csec-* keywords that are added to almost all security bugs, noting the severity of the issue. Plus, there's a groups section, so it's easy enough to look to see which bugs are in which security/private groups.

(In reply to Kyle Lahnakoski from comment #11)
> But, the whiteboard tags are important for the types of queries we run: May
> you give me an example where whiteboard tags are an issue?
Security bugs regularly have private information in them (mention of shadow bugs, embargo info, etc.). Besides that, MoCo-confidential bugs sometimes use the whiteboard field to note private things like tracking numbers and other stuff that is not meant to ever become public. It is an absolute no-go for publicizing the status whiteboard entries for security or any other bugs contained within groups.
I think the question is not whether whiteboard tags are OK or not, but what content contained in whiteboard tags should be removed (or allowed via a whitelist). We tokenize whiteboard information so we can scrub it at that level. Any instances of "sg:crit" or the like would be removed and replaced with something like "private". It sounds like there could be random sentences in the whiteboard tags, which makes me think the only way to go would be a whitelist for the programs we currently want to track. How does this strike you?
I should also mention that we can do the same for keywords.
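A minimal sketch of the whitelist idea proposed here, assuming the whiteboard and keyword fields have already been split into tokens by the ETL. The `scrub` function, the `PLACEHOLDER` name, and both allow-lists are hypothetical illustrations, not the project's actual code or agreed lists.

```python
# Replacement value for anything not explicitly allowed.
PLACEHOLDER = "private"


def scrub(tokens, allowed):
    """Keep whitelisted tokens; replace every other token with a placeholder.

    Matching is case-insensitive. Anything not on the list, including
    free-form sentences, is treated as sensitive by default.
    """
    allowed_lower = {t.lower() for t in allowed}
    return [t if t.lower() in allowed_lower else PLACEHOLDER for t in tokens]


# Hypothetical allow-lists; the real ones would need to be agreed in review.
ALLOWED_WHITEBOARD = {"pending secreview", "triage needed"}
ALLOWED_KEYWORDS = {"regression", "perf"}

whiteboard = scrub(
    ["pending secreview", "sg:crit", "embargoed until release"],
    ALLOWED_WHITEBOARD,
)
keywords = scrub(["regression", "sec-critical"], ALLOWED_KEYWORDS)
```

The point of a whitelist over a blacklist is that unknown content fails closed: a new or free-form entry is masked until someone adds it to the list. The presence of the placeholder may itself still hint that something was masked, which is a trade-off this sketch does not address.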
It sounds like there is a lot more in the whiteboard tags than I have seen in my limited experience. Should I work on an explicit list of allowed whiteboard tags and allowed keywords?
Whiteboard: [pending secreview][triage needed] → [pending secreview]
We are missing a great deal of context here to make decisions, and as a result I have a feeling that we are making this unnecessarily complicated.

(In reply to Martin Best (:mbest) from comment #10)
> The current question is, how do we white wash the data so we can let 3rd
> parties query the rest interface without compromising security information?
Martin, can you explain who these third parties are and what REST API you are talking about? Is there a high level overview of what this project is about?
Attachment #694131 - Attachment mime type: text/html → text/plain
Dan Veditz:
1) I will make a white list of whiteboard tags and keywords.
2) I agree the bug_group and isPrivate are not required.
3) "not analyzed" is a technicality of elastic search: meaning the field is not parsed into individual 'words' before being indexed.
4) I had not looked at the URL before. I will look at it, but I suspect I do not need it either.
5) The attached schema is a complete list of fields for all history of all bugs. There are no additional unmentioned fields that may have been used in the past. Only changes to the fields in the schema are recorded as changes.

It is clear to me now that I must better describe the domains for each of the fields. I will start that today. Stefan Arentz has already asked to meet with me to clarify this further.
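For illustration only, dropping the fields named in point 2 from a bug snapshot before publication might look like the sketch below. The field names `bug_group` and `isPrivate` come from the comment above; the `strip_fields` helper and the sample snapshot (including its other field names) are hypothetical and do not reflect the attached schema.

```python
# Fields agreed above as not required for the public data.
DROPPED_FIELDS = frozenset({"bug_group", "isPrivate"})


def strip_fields(snapshot, dropped=DROPPED_FIELDS):
    """Return a copy of the snapshot without the dropped top-level fields."""
    return {key: value for key, value in snapshot.items() if key not in dropped}


# Hypothetical snapshot; real snapshots follow the attached schema.
snapshot = {
    "bug_id": 1,
    "bug_group": "some-group",
    "isPrivate": True,
    "product": "Core",
}
public = strip_fields(snapshot)
```

Because the datastore keeps previous values alongside each snapshot, a real implementation would also need to remove these fields from the recorded change history, which this sketch does not cover.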
Assigning to Stefan to speed the process along
Assignee: mgoodwin → sarentz
Assignee: nobody → sarentz
Given that we have two other bugs for reviews (that have been completed): https://bugzilla.mozilla.org/show_bug.cgi?id=939081 (esFrontLine) https://bugzilla.mozilla.org/show_bug.cgi?id=930081 (Store Bugzilla data in public ElasticSearch) Should this bug still exist? Can we close this one?
Yes, we can close this one. Marked as WONTFIX because making the metadata public may expose more than intended.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX