Open Bug 281354 Opened 20 years ago Updated 11 years ago

Create a bug_update table

Categories

(Bugzilla :: Bugzilla-General, enhancement)

2.19.2
enhancement
Not set
normal

Tracking

()

People

(Reporter: bugzilla-mozilla, Unassigned)

References

(Blocks 2 open bugs)

Details

I think this proposed enhancement is best described by a pseudo-patch to the 
table creation part of checksetup.pl. This would also add a unique / primary 
key to the longdescs and bugs_activity tables (solving Bug 225221), and split 
comments and timetracking info into separate tables (allowing possibility of 
fix for Bug 271913 - currently WONTFIXed merely because db schema does not 
allow). Possibly also useful for bug 11368. It also avoids excessive 
duplication of bug_when fields etc. (and could even be referred to for 
delta_ts, creation_ts etc. on a bug?)

+$table{bug_updates} =
+   'update_id mediumint not null auto_increment primary key,
+    bug_id mediumint not null,
+    who mediumint not null,
+    bug_when datetime not null,  # ... or possibly timestamp?
+    index(bug_id),
+    index(who),
+    index(bug_when)';

 $table{longdescs} = 
-   'bug_id mediumint not null,
-    who mediumint not null,
-    bug_when datetime not null,
+   'update_id mediumint not null,
-    work_time decimal(5,2) not null default 0,
     thetext mediumtext,
     isprivate tinyint not null default 0,
-    index(bug_id),
-    index(who),
-    index(bug_when),
+    unique(update_id),
     fulltext (thetext)';

+$table{work_time} =
+   'update_id mediumint not null,
+    work_time decimal(5,2) not null default 0,
+    unique(update_id)';

 $table{bugs_activity} = 
-   'bug_id mediumint not null,
+   'update_id mediumint not null,
     attach_id mediumint null,  # perhaps this could move too?
-    who mediumint not null,
-    bug_when datetime not null,
     fieldid mediumint not null,
     added tinytext,
     removed tinytext,

-    index (bug_id),
-    index (bug_when),
+    unique(update_id,fieldid),
     index (fieldid)';


# similar change to attachements added as an afterthought...

 $table{attachments} =
    'attach_id mediumint not null auto_increment primary key,
-   'bug_id mediumint not null,
-    creation_ts datetime not null,
+    update_id mediumint not null,
     description mediumtext not null,
     mimetype mediumtext not null,
     ispatch tinyint,
     filename varchar(100) not null,
     thedata longblob not null,
-    submitter_id mediumint not null,
     isobsolete tinyint not null default 0, 
     isprivate tinyint not null default 0,
 
-    index(bug_id),
-    index(creation_ts)';
+    unique(update_id)';


# possible changes in line with this to bugs table?

 $table{bugs} =
...
-    creation_ts datetime not null,
-    delta_ts timestamp not null,
+    first_update_id mediumint not null,
+    last_update_id mediumint not null,
...
-    reporter mediumint not null,  # -> who from first_update_id ???
...
-    lastdiffed datetime not null,  # if this is always equal to a bug_when
+    diff_update_id mediumint not null,
...
-    index (creation_ts),
-    index (delta_ts),
+    unique(first_update_id),
+    unique(last_update_id),


Possibly also stuff with flag request / fulfilment could refer to the relevant 
update too...

Entering as UNCONFIRMED until someone more familiar with database design 
confirms whether:
a) this is a good change from a database POV (e.g. would it have perf 
advantages or perf hits?)
b) the (probably fairly wide-ranging) changes to code to use this layout are 
ever going to be worth it... or even possible!
I'd definitely have to think about this a lot, and look at the code, to
determine whether or not it was a good idea.

We already have the bugs_activity table -- all you've done is move four fields
from that table to the bug_updates table. Instead, we should use the
bugs_activity table as it is, now, because it represents the same thing that
that "bug_update" table would represent.

Perhaps the good part of the idea is to add an "update_id" to the bugs_activity
table, and then reference this from the other tables in JOINs. That certainly
would make certain logic easier for BugMail.
Severity: normal → enhancement
(In reply to comment #1)
> We already have the bugs_activity table -- all you've done is move four fields
> from that table to the bug_updates table.

Under the current schema, if you change 4 fields at once, the bugs_activity 
table has the bug_id, who, bug_when fields all replicated across all 4 rows, 
and these correspond with the bug_id, who, bug_when of a comment, etc... with 
show_activity.cgi then having to examine them to determine which to show as 
part of the same update.

The idea is that there is a single copy of the bug_id, who, bug_when triple 
referred to from all these different places. show_activity.cgi etc. would then 
use update_id to group rows by the same update.

I suggested a unique(update_id,fieldid) in bugs_activity on the assumption that 
a field cannot change twice in the same update, but perhaps this needs another 
field (attach_id ?) to guarantee uniqueness (e.g. more than one attachment can 
be marked obsolete simultaneously), or should be replaced simply with an index
(update_id).


If this is determined to be a Good Thing, then perhaps it might be possible to 
do this in stages, such that stage 1 adds the update_id and bug_updates table 
(with bug_id,who,bug_when continuing to be redundantly stored), stage 2 
migrates existing code to get this data from the bug_update table, and finally 
stage 3 removes the redundant data from the longdescs, bugs_activity tables.
Could this perhaps be broken down into a smaller target of update? Perhaps
targetting solely the separation of time from comments (longdescs). This would
make it easier to have a bug time listing (and modification perhaps).
There's already an open bug to give longdescs a primary key; even if we go 
through with this, that part of it has already been discussed and that bug 
already has an owner. You may want to coordinate efforts with him.
Depends on: 225221
(In reply to comment #4)
> There's already an open bug to give longdescs a primary key;

I did mention bug 225221 in comment 0... (although made no comment on the other 
bug yet. I'll go and do that shortly - had just thought I'd see whether it this 
got immediately shot down in flames first!)

I think it's more of a mutual "see also" rather than a dependency though - this 
enhancement would solve the original problem in bug 225221 as it adds a UNIQUE 
key that can be used as though it was PK. A genuine PK for longdescs could be 
made redundant by this enhancement, as it may lead to each row having two 
unique identifiers.
One thing that could be worth checking is if the following query returns any 
results on b.m.o, and if so whether it is enough to worry about (e.g. beyond 
possibility of manual fix-up):

SELECT bug_id,bug_when,who,count(*)
  FROM longdescs
  GROUP BY bug_id,bug_when,who
  HAVING count(*)>1;

It would be nice to be able to rely on the triple (bug_id,bug_when,who) to act 
as a unique identifier for a comment (and to pull out all bugs_activity 
associated with that comment). In principle this could be violated e.g. by a 
user marking bug B and bug C as duplicates of bug A simultaneously.
As above, a "phase 1" would simply be to add the additional columns in order to 
allow other code to migrate towards using the new table. In the short term this 
would duplicate the (bug_when, bug_id, who) data, but once code has been 
migrated to use the new table, these columns could be dropped from their 
original tables.

It may be that THIS bug just becomes the "phase 1", and that other code changes 
to actually make use of this new table / split work_time from longdescs, etc go 
into separate RFEs.


(Pseudo-code for / description of) changes for suggested first stage:

SELECT DISTINCT bug_id, who, bug_when FROM bugs_activity
 ORDER BY bug_when, bug_id, who;

SELECT bug_id, who, bug_when FROM longdescs ORDER BY bug_when, bug_id, who;

Merge results from both queries (should be no need to examine attachments table 
directly since an attachments creation_ts should always correspond with a 
comment containing the magic "created an attachment" string).

CREATE TABLE bug_updates ( [ as described in comment 0 ] );

insert the (bug_id, who, bug_when) triples into this new table, in the correct 
order (so that update_id order matches bug_when order). 

add update_id column to longdescs, attachments and bugs_activity, and populate 
based on matching the appropriate 3 fields.


Other changes necessary for phase 1:

[ in any place that new rows are inserted to these tables, insert a 
corresponding row into bug_updates, or add a reference to one created earlier 
during the same (bug_id, form post) ]

e.g. during process_bug.cgi, a row could be inserted into bug_updates just 
after these two lines:
1352     SendSQL("select now()");
1353     $timestamp = FetchOneColumn();
... and then the newly created update_id should be used everywhere that 
$timestamp is used in the following code for that bug.

Similarly in LogDependencyActivity a new row must be inserted, differing in 
bug_id from the row originally added.

Doubtless other locations exist that insert data into the longdescs, 
bugs_activity and/or attachments tables. All would need to be adjusted, but in 
a way that avoids adding two bug_updates rows for the same bug from the same 
form post.


[ probably add sanity checks to make sure that where there is a reference to an 
update_id in a table that the columns corresponding to bug_id, who, bug_when 
also have the correct values ]
I think something like this could be useful for custom fields, also, now that I
think of it. Because of database normalization, we want to have bug data in many
different tables, but we want to keep track of a bug update as an atomic unit.

Here's the way we'd want to go with it:

1) Resolve bug 225221.
2) Add an update_id field to bugs_activity.
3) Figure out what useful functionality-enhancing improvements we want, and how
a bug_update table could help us get there. Then we have a solid example in mind
as to where we want to go, and can plan from there. It may turn out that the
update_id field in bugs_activity could be enough...
Blocks: bz-majorarch
Status: UNCONFIRMED → NEW
Ever confirmed: true
Summary: Changes to database schema for longdescs and bugs_activity - primary key for a bug change/update → Create a bug_update table
I had been just about to remove the dep on 225221, because as I had originally 
intended, this would have been orthogonal to it.

However, I just thought of a slight change to the proposed schema that would 
have big advantages for bulk changes, and would make use of a comment_id added 
in bug 225221...


Rather than adding 'update_id mediumint not null' with a UNIQUE key to 
longdescs (making comment_id redundant), instead add 'comment_id medium null' 
to the bug_update table.

In cases where there is NO comment for a bug_update, then this would be null... 
but the big plus is that in a bulk change, the same row in longdescs can (in 
principle) be referenced by multiple bugs.

In this case, after all of the changes proposed here, all that would remain in 
the longdescs table is comment_id, is_private and thetext. This would imply 
that if a comment is private on one bug, it must be private on all of them...

This would seem reasonable for a bulk update, but it may be prudent to remove 
is_private, leaving the longdescs as simply comment_id and thetext, which would 
allow a new comment to merely add a reference to an existing comment_id. (We 
don't want someone maliciously copy-pasting a comment somewhere else and then 
marking it private after all!)


This should significantly reduce the number of rows in longdescs. Any thoughts?
(In reply to comment #9)
> but the big plus is that in a bulk change, the same row in longdescs can (in 
> principle) be referenced by multiple bugs.

If you do some searching, I recall seeing a bug already open with this as its 
initial description; the author bemoaned the fact that mass changes stored the 
same comment many times.

For DB normalization, though, you don't want this as another column on the 
bug_update table; you want a third table joining the two, with exactly two 
columns; bug_update_id (unique) and comment_id. This allows a one-to-n 
relationship between the update_id and the comment_id, without ever storing 
either duplicate information (i.e. the same update_id many times in the 
longdescs table, with the comment repeated many times) or nulls (i.e. rows in 
the bug_update table with no corresponding comment).

> it may be prudent to remove 
> is_private, leaving the longdescs as simply comment_id and thetext, which 
> would allow a new comment to merely add a reference to an existing
> comment_id. 

Not sure that you'll ever be able to create a UI to allow a user to simply copy 
an existing comment by pasting in a reference without it being rather arcane; I 
see this functionality as being useful almost exclusively for mass updates.

I can see arguments on both sides of the coin as far as keeping the isprivate 
boolean attached to the comment itself. One side is that comments should either 
be private or not; the privacy of a comment is not bug-dependent. The other 
side is that maybe a user WANTS it to be bug-dependent, especially if they have 
groups set up so that only certain people can see certain bugs... but one could 
then say that they shouldn't have done a mass-update. (Also, allowing privacy 
on a per-comment-per-bug basis would mean another table.) 

Either way, if you DO take isprivate out of longdescs, you'd definitely need to 
implement a check to see if the comment applies to more than one bug; whether 
you warn the user or prevent it from happening at all is a decision you can 
make later. At the beginning, I'd leave it in and simply prevent them from 
changing it; add the later functionality in an enhancement.

> This should significantly reduce the number of rows in longdescs.
I dunno if it would be *significant* -- and unless you're somehow going to do 
it retroactively, it would only affect future changes -- but I still see it 
as 'worth doing'.

The more I think about it, the more I like the idea of one atomic bug_update_id 
that stores all the unchanging information (who did it, when it was done, and 
what bug it was done to) in a single place. It's a *big* change, though, and 
touches a lot of things; you'd really have to have some drive to push through a 
change of this magnitude, even broken up into smaller pieces.

(In reply to comment #10)
> For DB normalization, though, you don't want this as another column on the 
> bug_update table; you want a third table joining the two, with exactly two 
> columns; bug_update_id (unique) and comment_id.

I remember talk of doing the opposite at one point - with a suggestion to move 
the duplicates table into additional columns in the bugs table...

Presumably there's a reason why an additional table is preferable to an 
additional column for which some values are NULL (corresponding with the 
minority (?) of occasions where a bug_update is done without a comment). Would 
the same apply to all other columns on other tables that are permitted to be 
NULL?


> Not sure that you'll ever be able to create a UI to allow a user to simply
> copy an existing comment by pasting in a reference without it being rather
> arcane;

That would be where the system pro-actively searches for an identical comment.

e.g. the automatically generated comment for a DUPE (where no user comment was 
made), a user that uses a stock comment frequently, or simply a "likely to be 
repeated" comment such as "OK" or "Yes".

Similarly for a mass update that is in fact split across several queries (with 
the same comment), or done a batch at a time.


> One side is that comments should either 
> be private or not; the privacy of a comment is not bug-dependent.

Almost certainly true if it came from the same mass-update. Perhaps not if 
different users made the same comment to different bugs.

> (Also, allowing privacy 
> on a per-comment-per-bug basis would mean another table.) 

In principle, couldn't it slip in to the table you suggested adding above, or 
does that literally need to be 2 columns for optimisation reasons?

> > This should significantly reduce the number of rows in longdescs.
> I dunno if it would be *significant*

Depends on your "significance" threshold! Especially when looking back at 
historic bugs, there seem to me to be a "signficant" number of mass changes.

> and unless you're somehow going to do 
> it retroactively, it would only affect future changes

Certainly this would take quite a long time on a huge DB. 


> you'd really have to have some drive to push through a 
> change of this magnitude, even broken up into smaller pieces.

Probably worth hunting around for bugs that this would enable / make easier... 
I mentioned a couple in comment 0.
I'd like to do this as the backend of Bugzilla::ChangeSet. At the least, the bugs_activity table can become enhanced.
Assignee: general → mkanat
Blocks: 301447
Target Milestone: --- → Bugzilla 2.24
Target Milestone: Bugzilla 3.0 → ---
Assignee: mkanat → general
You need to log in before you can comment on or make changes to this bug.