176002 - Move duplicate statistics into the db

Reporter

Description

•

22 years ago

I'm sure I filed a bug on this earlier, but I can't find it now, either in bz, or in my mail archives. Oh well. If someone does find it, its probbaly better to dupe that against this one, which actually has some thoughts We need to move the duplicate code into the db. bmo currently has about 52,000 duplicate bugs (based on the reports.cgi graph). I don't know how big that makes bmo's dbm file (since it only has an entry for every unique bug), but I'd guess that its at least 40,000 and probably closer to 45,000. When loading duplicates.cgi, this means that we read in these 45000 entries from disk, copy them into memory to avoid modifying the tied version, then SELECT these 45000-odd entries from the database based on the resolution/bug_status of the target bug, and optionally the product. For each of these, we run CanSeeBug on each of them. Individually. It that passes, we then discard the closed ones (if requested) [why can't we do that in the query, at least, or before CanSeeBug?], and then chuck the entire list to the template, which then sorts these 45000 elements. This takes 10-20 seconds on bmo for each page load, which really sucks, although I must say that I'm impressed that it only takes 10 seconds to run CanSeeBug 45000 times. The fix for this is: a) move duplicate stats from the dbm files into the db (This also fixes the fact that my perl can no longer read the old dupe files from rh7.3 w/o running db_upgrade, because the formats are not compatable) b) Change the query to do a SELECT with the approriate WHERE clause, LIMIT $numwanted. Run CanSeeBug on each of those, and if it fails, run with LIMIT $numfailed OFFSET $oldnumwanted, and repeat until none fail, or we run out of entries. c) push that data to the tempate (b) will be slower in the case where the user can't see most of the bugs, since we'd run the query lots of times. We don't have to - on a test db of 100,000 bugs + ORDER BY, LIMIT 1000 takes 0.05 seconds, as opposed to 1.34s without the limit, so we could just do the one query, and then stop caring after we have enough passes from CanSeeBug.. I don't know if we care about that, though - its unlikly to be an issue on a real installation. I know that part of the reason we do some of this stuff this way is that in the future gerv wanted us to sort/filter in the page dynamically, w/o requiring us to reload the data. We can deal with this when we add that support, I think, given this _massive_ perf issue. The hard part is coming up with a db model which doesn't have 45,000 rows for each day - if we do that, then we'll probably lose most of the perf advantage. The simple way is to only keep the last week's results, or something. The better idea is to only store a new row when something changes. To find the top 100 bugs, we then do: SELECT bug_id, dupe_count FROM dupe_stats a WHERE changed_date=(SELECT MAX(b.changed_date) FROM dupe_stats b WHERE a.bug_id=b.bug_id) ORDER BY dupe_count LIMIT 100 except we can't, because thats a subselect :) The alternate is: CREATE TEMPORARY TABLE tmp ( ... ); INSERT INTO tmp SELECT bug_id, MAX(dupe_count) AS cnt FROM bugs GROUP BY bug_id SELECT bugs.bug_id, tmp.cnt FROM bugs, tmp WHERE bugs.bug_id = tmp.bug_id AND <other conds> ORDER BY tmp.cnt LIMIT 100 DROP TABLE tmp; We can avoid the temp table only if we pull everything out of the db, and sort in perl for the order-by-dupeCnt case. In that case, its still more efficient than the current scheme, but a temp table is probably faster anyway. For order-by-changed-in-last-n-days, we just ahve two temp tables, the second with an additional |date <= (now - 7 days)| added, and ORDER BY (foo.dupe_count-bar.dupe_count). The alternative, for when we're not sorting based on that, is to select the vals in perl WHERE bug_id IN (whatever we want to display), and then do the merge in perl. No idea which way is better/faster/etc. Anyway, that makes the db schema something like: DATE last_changed, INT bug_id, INT dupe_count with indexes on last_changed and bug_id. collectstats then grabs the current vals via the above query, grabs todays vals by scanning the dupes table + consolidating, and for any non equal ones, it INSERTs into the table. Theres a race condition with half updated vals, but we have that currently with the dbm stuff (although the window is smaller). That can't be fixed until we have transactions, though. Thoughts?

Bradley Baetz (:bbaetz)

Reporter

Updated

•

22 years ago

Blocks: bz-perf

Keywords: perf, topperf

Summary: Move duplicates into the db → Move duplicate statistics into the db

Target Milestone: --- → Bugzilla 2.18

Joel Peshkin

Comment 1

•

22 years ago

Rather than fetching a large number of items from the DB and then running CanSeeBug() on each of them, I'd suggest integrating the access check directly into the original query the same way as is (being) done in Search.pm.

Gervase Markham [:gerv]

Comment 2

•

22 years ago

bbaetz's original analysis is slightly dodgy. After reading the file into memory (although today's file is almost certainly in cache anyway), we throw away all dupes with a count < the threshold, which is 5 on b.m.o. This will reduce the number of bugs considered from 50,000 to far fewer, probably about 5000. Moving the stats into the DB is desireable, but an effort. I think we should immediately make the following changes: - Incorporate the "openonly" functionality into the query. - Incorporate the access check into the query - Incorporate the "numwanted" functionality into the query. Because the access check is also incorporated, we won't need to iterate. This should make a significant difference. Then, we can work out what schema we need to move the stats into the DB. Gerv

Bradley Baetz (:bbaetz)

Reporter

Comment 3

•

22 years ago

Ah, I missed the prelimiting, and that does explain why its so fast, relativly speaking. Whilst teh file may be in cache, we copy it,and I'm not sure whether solaris' Copy-On-Write code mixes with the internal refcnting perl must end up doing. In any event, 5000 is still way too many queries to do. Moving the open part into there will help, though. Maybe w eshould split off another bug for that, and keep this one open? I can get pg doing the subselect I mentioned earlier on 100000 duplicates in 0.8 seconds, and the second form (which we'd need for mysql) in 1.1 seconds on my PIII-500. Can you beat that :) Note that gerv's plan won't work for the case where we're sorting by '# changed in last n days'. Thats not the default, though, so it still should be a perf increase in teh general case.

Gervase Markham [:gerv]

Updated

•

22 years ago

Depends on: 176599

Gervase Markham [:gerv]

Updated

•

22 years ago

Priority: -- → P2

Gervase Markham [:gerv]

Comment 4

•

21 years ago

OK, so I'm having a look at this. Given the trouble we had migrating b.m.o., I'd like to eliminate the pesky dbm files if at all possible. However, bbaetz' suggested DB implementation method scares me (as someone not incredibly well versed in SQL-fu.) bbaetz: just to check, is that still your best idea? Gerv

Bradley Baetz (:bbaetz)

Reporter

Comment 5

•

21 years ago

Umm. Its been almost a year; let me think about this a bit more.

Myk Melez [:myk] [@mykmelez]

Comment 6

•

21 years ago

Note that the data/duplicates directory on b.m.o is currently 3.0GB and growing by 4.5MB a day, so this would be useful for space reasons alone (if MySQL is more space efficient than DB, or if it can be made so).

Bradley Baetz (:bbaetz)

Reporter

Comment 7

•

21 years ago

If we add indexes, it would take a fair ammount of space. I also want subselects to try to do this using my schema. If we just import stuff in as-is, I don't know if we get much benefit. Someone should try it and see....

Dave Miller [:justdave]

Comment 8

•

21 years ago

Pushing out bugs that aren't blockers. If someone's working on one of these, we can move it back.

Target Milestone: Bugzilla 2.18 → Bugzilla 2.20

Gervase Markham [:gerv]

Updated

•

20 years ago

Severity: critical → enhancement

Dave Miller [:justdave]

Comment 9

•

20 years ago

Bugzilla 2.20 feature set is now frozen as of 15 Sept 2004. Anything flagged enhancement that hasn't already landed is being pushed out. If this bug is otherwise ready to land, we'll handle it on a case-by-case basis, please set the blocking2.20 flag to '?' if you think it qualifies.

Target Milestone: Bugzilla 2.20 → Bugzilla 2.22

timeless

Updated

•

19 years ago

QA Contact: mattyt-bugzilla → default-qa

Target Milestone: Bugzilla 2.22 → ---

Vlad Dascalu

Updated

•

19 years ago

OS: Linux → All

Hardware: PC → All

Kristis Makris

Comment 10

•

18 years ago

(In reply to comment #6) > Note that the data/duplicates directory on b.m.o is currently 3.0GB and growing > by 4.5MB a day, so this would be useful for space reasons alone (if MySQL is > more space efficient than DB, or if it can be made so). I'm facing the space problem now, too. I'm backing up our Bugzilla instance and I'm not sure how those data/duplicates/dupes* files were created, or if they are even necessary. The manual does not talk about them. In the meantime, is it safe to delete them ? Delete the older ones at least ?

Gervase Markham [:gerv]

Comment 11

•

18 years ago

You can delete them if you don't want historical duplicates statistics in duplicates.cgi. If you delete say half, you don't get comparative stats from those dates. Gerv

Kristis Makris

Comment 12

•

18 years ago

(In reply to comment #11) > You can delete them if you don't want historical duplicates statistics in > duplicates.cgi. If you delete say half, you don't get comparative stats from > those dates. Could all of them be regenerated if needed using "collectstats.pl --regenerate"? Or would I be permanently losing valuable information ?

Gervase Markham [:gerv]

Comment 13

•

18 years ago

No, collectstats.pl --regenerate does not regenerate this information. Gerv

v1 18 years ago Max Kanat-Alexander 11.40 KB, patch		Details \| Diff \| Splinter Review
v2 18 years ago Max Kanat-Alexander 12.47 KB, patch	LpSolit : review-	Details \| Diff \| Splinter Review
v3 15 years ago Max Kanat-Alexander 12.57 KB, patch	LpSolit : review-	Details \| Diff \| Splinter Review
v4 15 years ago Max Kanat-Alexander 14.98 KB, patch	LpSolit : review-	Details \| Diff \| Splinter Review
screenshot, showing wrong reported duplicates 15 years ago Frédéric Buclin 122.63 KB, image/png		Details
v5 15 years ago Max Kanat-Alexander 17.16 KB, patch	LpSolit : review-	Details \| Diff \| Splinter Review
v6 15 years ago Max Kanat-Alexander 17.14 KB, patch	LpSolit : review+	Details \| Diff \| Splinter Review
v7 15 years ago Max Kanat-Alexander 17.20 KB, patch	mkanat : review+	Details \| Diff \| Splinter Review