Closed Bug 838947 Opened 13 years ago Closed 11 years ago

researcher requesting data and paper review

Categories

(bugzilla.mozilla.org :: Administration, task)

Production
task
Not set
normal

RESOLVED FIXED

People

(Reporter: zhmh, Assigned: mhoye)

Attachments

(1 file)

262.53 KB, application/pdf
Details
Attached file willenv.pdf
User Agent: Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:18.0) Gecko/20100101 Firefox/18.0 Build ID: 20130108033621 Steps to reproduce: Why am I requesting the data? I am trying to quantify the factors that make a successful open source community. In particular, I focus on people, e.g., contributors' willingness, their expertise, their interaction with the project context, and how these are associated with their performance in the community. Our approach is to investigate the traces people leave in issue tracking systems and version control systems. Mozilla is one of our target communities; the other is Gnome. My co-author and I have obtained some interesting results on this topic and finished a paper (see the attached file). We would like you to review the paper, and could you please offer us a copy of the database for later validation? It would be great if I could get it within one week. What is my institution? I'm Minghui Zhou, a professor at Peking University, interested in measuring how developers live their lives, in the hope that this could help us understand and control large complex software systems. My contact information: zhmh@pku.edu.cn. Hope you enjoy reading the paper. --------------- Minghui
Assignee: nobody → gerv
Assignee: gerv → mhoye
Hello, Professor - My name is Mike Hoye, and I'm the Engineering Community Manager for Mozilla. Thank you for your interest; while I don't know that I'm qualified to evaluate your paper, I think that I can make the data you've requested available by the end of the month. Unfortunately it will take some time to sanitize the data, so I cannot promise it sooner than that. I'll update this bug with additional information as soon as it's available. Thanks for your interest, and I'm looking forward to reading your results.
Mike, thanks a lot. Also, is it possible to share the process used to sanitize the Bugzilla data?
Mike, First, thank you for helping us out with the Mozilla data. If you could share how the sanitization process proceeds (or even the scripts used), we will try to help other communities, e.g., Gnome, institute a similar process. >>"while I don't know that I'm qualified to evaluate your paper," As an "Engineering Community Manager" you are in an excellent position to comment on whether the description of the issue tracking procedures is accurate and whether the findings about who may contribute for an extended period appear reasonable and useful to you. Obviously, any other comments you may have would be much appreciated.
Hello, Audris - The sanitization process removes bugs or comments that are flagged as private or confidential - HR/Legal and open security bugs are the two big categories there, though there are (I think?) a few others. I'll see if I can get the scripts used to sanitize the database made available, though I can't promise anything, and I intend to be as transparent as I can about the amount of data removed during the sanitization process.
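Editor's sketch of the filtering step described in the comment above: drop bugs that belong to a restricted category, drop comments flagged private, and report how much was removed. This is not Mozilla's sanitization script, which is not shown in this bug. The `Bug` and `Comment` classes, the `restricted_groups` and `is_private` fields, and the group names are hypothetical stand-ins for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Comment:
    text: str
    is_private: bool = False  # hypothetical flag for a private comment


@dataclass
class Bug:
    bug_id: int
    # Hypothetical: names of restricted categories, e.g. {"security"}.
    restricted_groups: frozenset = frozenset()
    comments: list = field(default_factory=list)


def sanitize(bugs):
    """Return (public_bugs, stats) with restricted bugs and private comments removed."""
    kept = []
    removed_bugs = 0
    removed_comments = 0
    for bug in bugs:
        if bug.restricted_groups:
            # The whole bug is confidential: drop it entirely.
            removed_bugs += 1
            continue
        public = [c for c in bug.comments if not c.is_private]
        # Counts only private comments on bugs that are otherwise public.
        removed_comments += len(bug.comments) - len(public)
        kept.append(Bug(bug.bug_id, bug.restricted_groups, public))
    stats = {
        "bugs_total": len(bugs),
        "bugs_removed": removed_bugs,
        "comments_removed": removed_comments,
    }
    return kept, stats


# Toy data, invented for this example.
bugs = [
    Bug(1, comments=[Comment("public note"), Comment("internal note", True)]),
    Bug(2, frozenset({"security"}), [Comment("exploit details")]),
    Bug(3),
]
kept, stats = sanitize(bugs)
print([b.bug_id for b in kept], stats)
```

Returning the counts alongside the filtered data is one way to stay transparent about how much was removed.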
Hello Clare and Audris! I'm working with mhoye and others at Mozilla, as the "Bugmaster", specifically to improve the community for bug triaging. I really enjoyed reading your paper. What kind of feedback are you looking for? The points about clustered peer groups leading to encouragement, trust, and less confusion were particularly interesting. Would you mind if I blog and link to your paper, either after it is published, or now in its draft form?
Status: UNCONFIRMED → NEW
Ever confirmed: true
Hello Liz! I'm really happy that you have enjoyed reading the working draft. At this point in time we have submitted it to IEEE Transactions on Software Engineering for a scholarly peer review. The main purpose of the scholarly peer review is to ensure that the methods used to conduct the analysis are sound and appropriate and that the findings are relevant and new. At the same time, we would like to hear directly from the projects involved to improve upon the working draft. Thus, it would be extremely useful if you could point out any issues or anything that does not appear to make sense. In particular, we'd like to make sure that the facts presented (including the description and interpretation of the process and data) are as accurate as possible. We also want to understand better whether the findings appear to be sensible and/or useful. Finally, it would be of great interest to hear feedback about what more could be done to make the work more relevant and useful from the perspective of Mozilla and other projects. Therefore, if you do not see any clear issues, it is fine to elicit wider feedback (by including a reference in your blog), but if you do see any issues, we would appreciate you sharing them with us, so we can fix them first. In any case, please note that this is just a working draft and the results are not finalized yet. Thank you!
Hello! I have good news: a sanitized MySQL dump of the Bugzilla database is available here: http://people.mozilla.org/~mhoye/bugzilla/ This is a MySQL dump of the contents of Bugzilla, as of January 2nd of this year. The above link goes to temporary hosting until a permanent solution can be put in place. Though the contents of this data set have been sanitized, with HR, legal and open security issues removed (~6.3% of the total), this dataset is otherwise Bugzilla's entire contents, minus attachments. I will update this bug with the relevant information once permanent hosting is found. Thanks, - mhoye
This seems to go against our current process for handling data dumps. Have you run this by Legal?
It's been cleared by legal, privacy and security.
So instead of requesting this specially, now anyone who wants the data dump can have it? Was it limited before because of worry about people harvesting email addresses?
In the past you needed to sign a researcher agreement. That's no longer the case.
OK! Good to know. Thanks Mike!
Many thanks!
Make my world pretty today! Mike, tons of thanks! I was wondering how the decision was made between worrying about the harvesting of email addresses and satisfying researchers or other similar requirements?
Clare - The information in the dump is already public-facing - anyone scraping Bugzilla could already get at it, and the agreement we asked people to sign didn't really prevent any bad things from happening. Since it was unreasonable to think we'd decide to sue somebody for breaching the researchers' agreement, much less gain anything by doing so, we decided that it wasn't serving much purpose beyond being a barrier to participation. So, we dropped it.
Status: NEW → ASSIGNED
I compared the bugs in this dump (bugs13) with the bugs we retrieved in May 2012 (bugs12). Here is a brief report. 1. The number of bugs that are in bugs13 but not in bugs12 is 65291. These bugs include: a. bugs reported since May 2012, which number 64925; b. 366 bugs reported before May 2012 that were not in bugs12 but are in bugs13. For example: 290720 320101 324074 334190 344494 344495 344751 350433 355214. Among these 366, 157 were bugs we were not authorized to access at retrieval time (though they have since gone public, e.g., 177647). That leaves 209 bugs that we failed to retrieve in May 2012. 209/709587 = 0.0003, which is a small proportion. 2. The number of bugs that are in bugs12 but not in bugs13 is 69. Apparently these bugs are sensitive now, e.g., Bug 385081. I can understand a bug changing from confidential to non-confidential, but what is the reason for the opposite direction? Finally, there is no profiles_activity table. Is it possible to offer this table as well?
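Editor's sketch of the comparison described in the comment above: two set differences over bug IDs, with the "only in the newer snapshot" side split by filing date. This is not the authors' script. The function name, the input shapes, and the toy IDs and dates are assumptions; only the figures in the final three assertions come from the comment.

```python
from datetime import date


def compare_snapshots(old_ids, new_bugs, old_snapshot_date):
    """Compare an older set of bug IDs against a newer {bug_id: creation_date} map."""
    new_ids = set(new_bugs)
    only_in_new = new_ids - old_ids
    only_in_old = old_ids - new_ids
    # Bugs filed on or after the older snapshot date could not have been in it.
    filed_later = {b for b in only_in_new if new_bugs[b] >= old_snapshot_date}
    # Bugs that already existed at the older snapshot date but were not captured.
    missed_earlier = only_in_new - filed_later
    return {
        "filed_later": filed_later,
        "missed_earlier": missed_earlier,
        "only_in_old": only_in_old,
    }


# Toy data, invented for this example.
old_ids = {1, 2, 4}
new_bugs = {
    1: date(2012, 1, 10),
    3: date(2012, 2, 1),
    4: date(2012, 3, 5),
    5: date(2012, 6, 1),
}
report = compare_snapshots(old_ids, new_bugs, date(2012, 5, 1))
print(report)

# Consistency check on the figures reported in the comment.
assert 64925 + 366 == 65291
assert 366 - 157 == 209
assert round(209 / 709587, 4) == 0.0003
```

In the toy data, bug 2 plays the role of the 69 bugs that were public in the older snapshot and are absent from the newer one.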
Hi, Clare - Thanks for your feedback. One of the things about having a public-facing issue tracker is that occasionally bugs get filed that should be marked confidential, but for whatever reason aren't right away. Sometimes that's because they aren't correctly filed, other times because it only becomes clear that this is a legal, HR or Security issue after the issue has been worked on for a while. So there's a race condition between the database getting dumped and the bug getting flagged that can result in the DB dump (or a list compiled by scraping Bugzilla directly) containing incorrect information. These days we try to minimize that risk by dumping the DB, waiting for a week and re-sanitizing the database dump of all bugs that had been marked confidential in the interim. That will be our practice for the foreseeable future as well. The profiles_activity table contains a log of user activity/changes, and along with a few others, is removed for security/confidentiality reasons. Consequently, we will not be making that table available. Thanks again for your interest, and good luck with your work! - mhoye
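Editor's sketch of the "dump, wait a week, re-sanitize" practice described in the comment above. This is not Mozilla's tooling. The function name, the `flagged_on` map from bug ID to the date the bug was marked confidential, and the toy dates are assumptions; the one-week wait is the only detail taken from the comment.

```python
from datetime import date, timedelta


def resanitize(dump_ids, flagged_on, dump_date, wait=timedelta(days=7)):
    """Remove bugs that were flagged confidential during the waiting window.

    dump_ids:   set of bug IDs in the already-sanitized dump
    flagged_on: {bug_id: date the bug was marked confidential} (hypothetical input)
    Returns (ids safe to release, earliest release date).
    """
    release_date = dump_date + wait
    late_flagged = {
        bug_id
        for bug_id, flagged in flagged_on.items()
        if dump_date < flagged <= release_date
    }
    return dump_ids - late_flagged, release_date


# Toy data, invented for this example.
dump_ids = {1, 2, 3, 4}
flagged_on = {
    2: date(2013, 1, 5),   # flagged during the waiting week: removed
    4: date(2013, 1, 20),  # flagged after release: the wait cannot catch this
    9: date(2013, 1, 3),   # not in the dump at all: no effect
}
released, release_date = resanitize(dump_ids, flagged_on, date(2013, 1, 2))
print(sorted(released), release_date)
```

The waiting period narrows the race window but does not close it, which bug 4 in the toy data illustrates.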
looks like no further action is required here; closing bug.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED