During July, approx 1100 commits were made by an automated account (firstname.lastname@example.org) to collect metrics. Unfortunately, that data is now throwing off complex reports. Any commit made by user email@example.com did not generate any load on the build system, and skews projections. We'd like to purge those records from the database, if it can be done reasonably quickly. In more detail: This is not *OMG_NEED_IT_NOW*, but it does oddly play into reports I'm (:joduinn) doing about 2013 budgets, AWS billing and general infrastructure load. I was in the middle of doing these reports when I hit this bug. So ...ummm... the sooner the better? Worst case, I could spend time manually hacking numbers in excel, but that'd take a day/two manually, which I dont really have right now. Background: 1) Every time RelEng see a developer checkin, we add entries to a scheduler db... this db is used to schedule jobs out to the various different buildbot masters. 2) As part of internal RelEng experiment this month, :hwine + :zeller in RelEng setup a cron job to do commits which intentionally would not cause any builds/tests to be run. The idea was to see if this would allow measuring performance of tryserver repo. We killed this experiment after it had pushed ~1,100 jobs as "firstname.lastname@example.org" to try, each of which correctly didn't cause builds/tests to be run. No wasted CPU cycles there, all good, but the fear was that it was throwing off reporting. 3) I'm attempting to generate a report showing our monthly checkins for July, and noticed these ~1,100 no-op tryserver checkins were actually being counted in our reporting as "valid" checkins. You can see the report here (needs MozillaVPN, which I think you have as IT?) https://secure.pub.build.mozilla.org/buildapi/reports/pushes?starttime=1372662000&endtime=1375254000&int_size=86400 4) Including these "email@example.com" pushes throws off all the charts, and means I'll now have to spend a bunch of time generating charts by hand from hand-modified data.... ...unless it is possible to remove all ~1,100 entries in the scheduler db which were created by a specific "firstname.lastname@example.org" user, pushing to tryserver? If yes, you can remove those entries, that would be a great help to me personally, just let me know and I'll file bug with all the above details to make this official-and-above-board. If not, no worries, just let me know and I'll figure out how to manually wrestle the data and then regenerate the charts. Further details: It looks like emails are typically being stored in the author or who columns (it varies). A quick search from email@example.com in scheduler.changes came up with some results (in an older database copy we have). If need be, we can also extract the branch (always 'try') and revision information from the pushlog database for the try repository. Branch/Revision for scheduler are fields in the following tables: Changes and Sourcestamps. There may be some additional 'firstname.lastname@example.org' entries prior to July. It does not matter if those other entries are removed or kept. Please ask any questions -- there may be other ways to identify the data.
So, for the record, the email is only stored in the buildbot_schedulers.changes table under the author column. The who column in the buildbot table does contain emails, but never once does it contain email@example.com I've been working on this for a bit and I think I'm pretty close to figuring out all the items that need to be removed. Will backup and do some tests shortly.
All done! Notes: These are the potentially affected tables by changes that don't push any builds, the changes.author is where the e-mail is stored: mysql> select table_schema,table_name, column_name from information_schema.columns where table_schema like 'buildbot%' and column_name in ('sourcestampid', 'changeid'); +---------------------+---------------------+---------------+ | table_schema | table_name | column_name | +---------------------+---------------------+---------------+ | buildbot_schedulers | buildsets | sourcestampid | | buildbot_schedulers | change_files | changeid | | buildbot_schedulers | change_links | changeid | | buildbot_schedulers | change_properties | changeid | | buildbot_schedulers | changes | changeid | | buildbot_schedulers | scheduler_changes | changeid | | buildbot_schedulers | sourcestamp_changes | sourcestampid | | buildbot_schedulers | sourcestamp_changes | changeid | +---------------------+---------------------+---------------+ 8 rows in set (0.01 sec) In this scenaro, only 4 of the tables actually had data after checking them, so I wiped all 4 of the trylander data. here are the delete statements I tested on a backup, checked for validity, then executed. mysql> delete bs.* from buildsets bs join sourcestamp_changes ssc on ssc.sourcestampid = bs.sourcestampid join changes c on c.changeid = ssc.changeid and c.author = 'firstname.lastname@example.org'; Query OK, 2208 rows affected, 1 warning (0.10 sec) mysql> delete cf.* from change_files cf join changes c on cf.changeid = c.changeid and c.author = 'email@example.com'; Query OK, 1103 rows affected (0.03 sec) mysql> delete ssc.* from sourcestamp_changes ssc join changes c on c.changeid = ssc.changeid where c.author = 'firstname.lastname@example.org'; Query OK, 2208 rows affected (0.04 sec) mysql> delete from changes where author = 'email@example.com'; Query OK, 1104 rows affected (0.50 sec) Afterwards I checked the graphs, and they were indeed reduced from ~4500 to 3477 try attempts. Let me know if you need anything else!
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Product: mozilla.org → Data & BI Services Team
You need to log in before you can comment on or make changes to this bug.