Closed Bug 1274024 Opened 8 years ago Closed 8 years ago

getting a csv of bugzilla data for analysis, in anticipation of London

Categories

(bugzilla.mozilla.org :: Administration, task)

Production
task
Not set
normal

RESOLVED FIXED

People

(Reporter: hulmer, Assigned: dylan)

Attachments

(1 file)

I am hoping to get a CSV of Bugzilla data to perform some basic analysis, organized as one bug per row, with each column representing an appropriate field. The complete spec of the CSV would be as follows:

Filtered Date Range: January 1st, 2010 onward
Filtered Products: Core, Desktop, Android, IOS, Toolkit
Columns of CSV:
- INT: bug ID
- DATESTRING: Date bug was filed
- STRING: the bug resolution (fixed, wontfix, anything other than those two)
- STRING: product
- STRING: keywords
- STRING: status flag
- STRING: platform
- STRING: product 
- STRING: component of product
- TRUE / FALSE: user story is present
- TRUE / FALSE: did the bug start in the General component of a product?
- STRING: release version when added by a code sheriff
- STRING: priority if added by staff
- TRUE / FALSE: whether there is an unresolved needinfo
- TRUE / FALSE: has an attachment
- INT: # of comments
- TRUE / FALSE: whether any comment is marked as abuse / spam / non-pertinent
- STRING: severity
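
For illustration only, here is a minimal Python sketch of how the spec above could map onto a CSV writer. The column names, the `in_scope` and `write_csv` helpers, and the shape of the input dictionaries are all assumptions made for this example; the spec only describes what each column means, and the product names are copied as written there.

```python
import csv
import io
from datetime import date

# Hypothetical column names; the spec describes meanings, not identifiers.
COLUMNS = [
    "bug_id", "filed_on", "resolution", "product", "component", "keywords",
    "status_flags", "platform", "has_user_story", "started_in_general",
    "release_version", "priority", "has_open_needinfo", "has_attachment",
    "comment_count", "has_flagged_comment", "severity",
]

START_DATE = date(2010, 1, 1)  # "January 1st, 2010 onward"
PRODUCTS = {"Core", "Desktop", "Android", "IOS", "Toolkit"}  # as listed in the spec


def in_scope(bug):
    """Apply the date-range and product filters from the spec."""
    return bug["filed_on"] >= START_DATE and bug["product"] in PRODUCTS


def write_csv(bugs, out):
    """Write one row per in-scope bug; fields missing from a bug become empty cells."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS, restval="", extrasaction="ignore")
    writer.writeheader()
    for bug in bugs:
        if in_scope(bug):
            # Serialize the filing date as an ISO date string.
            writer.writerow(dict(bug, filed_on=bug["filed_on"].isoformat()))
```

Calling `write_csv(bugs, open_file)` with a list of such dictionaries would produce the one-bug-per-row layout; how each field is actually extracted from the database is not covered here.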
Assignee: administration → nobody
Product: Bugzilla → bugzilla.mozilla.org
QA Contact: default-qa
Version: unspecified → Production
Emma Humphries has alerted me that "there should already be an extract available for you with all the fields, that's the one mcote can connect you to."

These fields are based (mostly) on her list of "what we think does / doesn't make a good bug":

1. keywords
2. status flags
3. regression range (check comments, or the "has regression range" field; the latter is not in heavy use yet)
4. platform
5. user story (if a feature request)

Fields that we think indicate success as they are added

1. release version when added by a code sheriff
2. priority if added by staff
3. approved review flags (this is a flag on an attachment, and you should query bugs for attachments)
4. patches (older bugs that don't use mozreview will have patches as atta

Fields that we think indicate failure

1. too many comments (expect a non-linear relationship)
2. jumping across product boundaries
3. jumping between components in a product
4. starting in the General component of a product
5. needinfos open too long 
6. comments flagged as abuse/spam/non-pertinent

Fields that we think have no bearing 

1. severity 
2. release version if set with bug filing
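
Several of the failure indicators above (jumping across products, jumping between components, starting in General) can only be recovered from a bug's change history, not from its current state. A rough Python sketch of that derivation follows. It assumes each history entry is shaped like `{"changes": [{"field_name": ..., "removed": ..., "added": ...}]}` and is in chronological order; the `movement_flags` function and its output keys are hypothetical names for this example.

```python
def movement_flags(current_component, history):
    """Derive product/component movement indicators from a bug's change history.

    `history` is assumed to be a chronological list of entries, each holding
    a "changes" list of {"field_name", "removed", "added"} dictionaries.
    """
    product_moves = 0
    component_moves = 0  # component changes that stay within the same product
    original_component = None

    for entry in history:
        fields = {change["field_name"]: change for change in entry["changes"]}
        if "component" in fields and original_component is None:
            # The value removed by the first component change is where the bug started.
            original_component = fields["component"]["removed"]
        if "product" in fields:
            product_moves += 1
        elif "component" in fields:
            component_moves += 1

    if original_component is None:
        # The component never changed, so the bug is still where it started.
        original_component = current_component

    return {
        "started_in_general": original_component == "General",
        "product_moves": product_moves,
        "component_moves": component_moves,
    }
```

Whether other triage components should count alongside "General" is a policy question this sketch does not answer; it only checks the literal name.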
P2 because I will not immediately be working on this, but I'll follow up with questions shortly. Briefly, here are the difficulties with these fields. Anything omitted is trivial.

> STRING: the bug resolution
this is trivial, but do you want only resolved bugs or open ones too?

> STRING: platform
I assume this means platform *and* OS

> TRUE / FALSE: did the bug start in the General component of a product?
Specifically General or other Triage components? I'll talk to some people that know about this process

> STRING: release version when added by a code sheriff
> STRING: priority if added by staff

These are the most difficult, and it will take me some time to explain. There are a few options, so don't worry too much about how long this will take; it's more about the trade-offs in accuracy. More on that later!
Assignee: nobody → dylan
> STRING: the bug resolution

Yes, both resolved & open ones. The analysis centers around the correlates of a bug getting resolved.

>> STRING: platform
> I assume this means platform *and* OS

Yes, I think that'd be a fair assumption - a separate column for OS would be great.

>> TRUE / FALSE: did the bug start in the General component of a product?
> Specifically General or other Triage components? I'll talk to some people that know about this process

I am not sure about this piece.

To reiterate, according to Emma it appears there may already be an extract of this data as per https://bugzilla.mozilla.org/show_bug.cgi?id=1274024#c1. If that's the case, then there might be very little work to get these data, especially the difficult parts.
(In reply to Hamilton from comment #1)
> Fields that we think indicate failure
> 
> 1. too many comments (expect a non-linear relationship)

One note here: the volume of comments has sometimes dramatically gone up since the introduction of MozReview, because (a) MozReview encourages splitting up work more, so bugs will now sometimes get 10 or 20 attachments instead of 1, and (b) updates to commits cause updates to the corresponding attachment *and* the attachments for all subsequent commits, due to rebasing (e.g. if you have 10 commits and update commit 2, you will get updates to commits 3-10 as well).

Just something to consider.  Unfortunately we aren't yet tagging these comments (we will be in the future), but we could separate out bugs that have MozReview attachments versus those that don't.
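
As a sketch of the separation mentioned above, bugs could be partitioned by whether any attachment looks like a MozReview request. The content type `text/x-review-board-request` used here is an assumption about how MozReview attachments are marked, and the `attachments` / `content_type` keys and both function names are hypothetical.

```python
# Assumed marker for MozReview attachments; verify against real data before relying on it.
MOZREVIEW_CONTENT_TYPE = "text/x-review-board-request"


def has_mozreview_attachment(attachments):
    """Return True if any attachment carries the assumed MozReview content type."""
    return any(a.get("content_type") == MOZREVIEW_CONTENT_TYPE for a in attachments)


def split_by_mozreview(bugs):
    """Partition bugs into (with MozReview attachments, without)."""
    with_mozreview, without_mozreview = [], []
    for bug in bugs:
        if has_mozreview_attachment(bug.get("attachments", [])):
            with_mozreview.append(bug)
        else:
            without_mozreview.append(bug)
    return with_mozreview, without_mozreview
```

Comment counts could then be compared within each group rather than across the whole dataset.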
(In reply to Hamilton from comment #3)
> To reiterate, according to Emma it appears there may already be an extract
> of this data as per https://bugzilla.mozilla.org/show_bug.cgi?id=1274024#c1.
> If that's the case, then there might be very little work to get these data,
> especially the difficult parts.

That's actually just a sanitized database dump, similar (but smaller) to the current database.  It would require a similar amount of processing to get into the format you're looking for.
(In reply to Mark Côté [:mcote] from comment #5)

> That's actually just a sanitized database dump, similar (but smaller) to the
> current database.  It would require a similar amount of processing to get
> into the format you're looking for.


Got it - sounds like we're on the right track then.


(In reply to Mark Côté [:mcote] from comment #4)
 
> One note here: the volume of comments has sometimes dramatically gone up
> since the introduction of MozReview, because (a) MozReview encourages
> splitting up work more, so bugs will now sometimes get 10 or 20 attachments
> instead of 1, and (b) updates to commits cause updates to the corresponding
> attachment *and* the attachments for all subsequent commits, due to rebasing
> (e.g. if you have 10 commits and update commit 2, you will get updates to
> commits 3-10 as well).

That's a great point. This is the sort of thing I will likely be able to tease out by "interacting" date + # comments in my statistical model, so that we can control for that.
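
To make the "interaction" idea concrete: one common form is a model such as logit P(resolved) = b0 + b1*years + b2*comments + b3*(years*comments), where the product term lets the effect of comment count vary with filing date. The actual model is not specified in this bug, so the Python sketch below is only an assumption about what the feature row could look like; `design_row` and its `origin` parameter are hypothetical.

```python
from datetime import date


def design_row(filed_on, comment_count, origin=date(2010, 1, 1)):
    """Build one feature row: [intercept, years since origin, comments, years * comments].

    The last entry is the interaction term, which lets a model estimate a
    different comment-count effect for bugs filed at different times.
    """
    years = (filed_on - origin).days / 365.25
    return [1.0, years, float(comment_count), years * comment_count]
```

Rows built this way could be passed to whichever regression routine is used for the analysis.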
:dylan I think it would also suffice, if possible, to just get me a version of this dataset with the trivial-to-collect stuff, so I can get started on some of the basic data analysis while the harder-to-work-out parts get implemented. Would this be possible?
Flags: needinfo?(dylan)
Definitely. I can get you a CSV at the beginning of next week.
Flags: needinfo?(dylan)
Could the curious audience in this public issue tracker be told what's going to happen in London? Thanks.
I don't think anyone knows yet. :)  Hamilton and team want to analyze usage patterns in BMO to determine if there are ways we can streamline and/or enhance the UI and our bug-tracking processes, but I don't think anyone knows what that will look like yet.  We're in an exploratory phase right now, as far as I know.
Just checking in: I'll be getting the report together and sharing it with Hamilton today.
Confirming a detail with :mcote before sending the data via an out-of-band channel.
I have given a first round of data to the reporter. It contains only public bugs (from a May 3rd sanitized dump).

Needinfo me for further requests -- for fresh data we'll conduct a security review.
Attachment #8756434 - Attachment mime type: text/x-sql → text/plain
Flags: needinfo?(dylan)
I'm guessing this can be closed.
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(dylan)
Resolution: --- → FIXED
Blocks: 1294503