Closed Bug 806121 Opened 9 years ago Closed 9 years ago

Update TBPL schema

Categories

(Data & BI Services Team :: DB: MySQL, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: emorley, Assigned: emorley)

References

Details

Attachments

(6 files, 1 obsolete file)

In bug 799578 it was discovered that TBPL had a different schema for each of {checked-into-tree,production,dev} (!!). Bug 799578 made the checked-in schema match production so as to make it clearer where we were at - and now in this bug I'd like to clean up a number of issues that were found, as well as get dev matching production. (patches will be attached shortly)

Also, is there a specific place that I should document the process for TBPL schema changes (ie: please don't change production schema without CCing myself and/or ensuring bugs are filed to resolved) to ensure that we don't end up having to do bug 799578 again at a later date?
* Change needs applying to production only (dev doesn't have this additional index)
* Once the change is made to production, I will check this into the TBPL tree.
Attachment #675875 - Flags: review?(scabral)
Attachment #675875 - Attachment description: Remove duplicate index on builders.buildername → Part 1: Remove duplicate index on builders.buildername
* Change needs applying to production only (dev doesn't have this additional index)
* Once the change is made to production, I will check this into the TBPL tree.
Attachment #675880 - Flags: review?(scabral)
* Will need applying to dev & prod, once Swatinem confirms this change is ok (see bug 799578's production tbpl schema attachment to see current AUTO_INCREMENT row counts for each table; numbers are well under).
* Once the change is made to production, I will check this into the TBPL tree.
Attachment #675881 - Flags: review?(scabral)
Attachment #675881 - Flags: feedback?(arpad.borsos)
* Will need applying to dev & prod, once Swatinem confirms this change is ok.
* Once the change is made to production, I will check this into the TBPL tree.
Attachment #675885 - Flags: review?(scabral)
Attachment #675885 - Flags: feedback?(arpad.borsos)
* This will be used by bug 803009 (and once we've switched over to using it, we can then remove the index on content, but I'll leave that for another bug).
* Will need applying to dev & prod.
* Once the change is made to production, I will check this into the TBPL tree.
Attachment #675886 - Flags: review?(scabral)
Attachment #675886 - Flags: feedback?(arpad.borsos)
Blocks: 803009
Depends on: 799578
* Needs applying to dev & prod
Attachment #675887 - Flags: review?(scabral)
Attachment #675881 - Flags: feedback?(arpad.borsos) → feedback+
Comment on attachment 675885 [details] [diff] [review]
Part 4: Use more appropriate default values

Since when did we have DEFAULT NULLs in there?
Attachment #675885 - Flags: feedback?(arpad.borsos) → feedback+
(In reply to Arpad Borsos (Swatinem) from comment #7)
> Comment on attachment 675885 [details] [diff] [review]
> Part 4: Use more appropriate default values
> 
> Since when did we have DEFAULT NULLs in there?

Since bug 799578, which made the checked-in schema reflect production (which had implicit DEFAULT NULLs, but I changed the in-tree schema to list them, so as to remind me to follow up).
Attachment #675886 - Flags: feedback?(arpad.borsos) → feedback+
Comment on attachment 675887 [details] [diff] [review]
Part 6: Correct builders.builder_id index name

Review of attachment 675887 [details] [diff] [review]:
-----------------------------------------------------------------

This is merely a change in the name of an index. In MySQL, you can't rename an index, so it has to be dropped and re-created, and we should test whether this is an online operation or not.
Attachment #675887 - Flags: review?(scabral) → review+
Comment on attachment 675886 [details] [diff] [review]
Part 5: Add an 'in_progress' boolean to runs_logs

Review of attachment 675886 [details] [diff] [review]:
-----------------------------------------------------------------

This is OK; My only question is, are you sure you want to default to "in_progress" as "false"?

Also, note that MySQL will do REALLY well with searches that look for the 'needle in a haystack' - so if most of the values are "false", searches on "true" will be very fast. But searches on "false" won't even bother to use the index, it's faster to do a full table scan than to look at more than about 20% of the data.....So if the data will be 50% true, 50% false, an index is useless.
Comment on attachment 675875 [details] [diff] [review]
Part 1: Remove duplicate index on builders.buildername

Review of attachment 675875 [details] [diff] [review]:
-----------------------------------------------------------------

yay for removing duplicate indexes, especially ones that are on a UTF-8 varchar(255)!
Attachment #675875 - Flags: review?(scabral) → review+
Comment on attachment 675880 [details] [diff] [review]
Part 2: Only index runs_notes.run_id not (runs_notes.run_id, runs_notes.timestamp)

Review of attachment 675880 [details] [diff] [review]:
-----------------------------------------------------------------

Hrm, run_id needs to be leftmost in the index for the foreign key constraint. I can't see why timestamp is particularly needed, but I don't know the query profile of tbpl_mozilla_org. If there are queries that involve the run_id and the timestamp, keeping this index would be useful - for example, if there's a WHERE filter on run_id and then a sort by timestamp. I feel like we shouldn't remove the index unless we know for a fact that the query(ies) for which it was created are gone. I'd rather change dev to match prod in this case.
Comment on attachment 675881 [details] [diff] [review]
Part 3: Use more appropriate int types

Review of attachment 675881 [details] [diff] [review]:
-----------------------------------------------------------------

I approve wholeheartedly of the int -> int unsigned, and I assume you're aware of how much data you generate so that MEDIUMINT is appropriate. For reference, MEDIUMINT UNSIGNED has a range from 0-16,777,215.
Attachment #675881 - Flags: review?(scabral) → review+
Comment on attachment 675885 [details] [diff] [review]
Part 4: Use more appropriate default values

Review of attachment 675885 [details] [diff] [review]:
-----------------------------------------------------------------

buildername is changing from DEFAULT NULL to NOT NULL with no default. Just want to make sure that's what you want to do (as opposed to something like NOT NULL DEFAULT '' - though my feeling is that NOT NULL with no default is appropriate here).
Attachment #675885 - Flags: review?(scabral) → review+
(er, for comment 14, the same goes for all the other changes....)

Also, why change this:

`starttime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP

to the default being "zero time"? Is it so you can easily see what hasn't started yet?
Comment on attachment 675880 [details] [diff] [review]
Part 2: Only index runs_notes.run_id not (runs_notes.run_id, runs_notes.timestamp)

Oops, typo in attachment name.
Attachment #675880 - Attachment description: Part 2: Only index runs.run_id not (runs.run_id, runs.timestamp) → Part 2: Only index runs_notes.run_id not (runs_notes.run_id, runs_notes.timestamp)
Comment on attachment 675880 [details] [diff] [review]
Part 2: Only index runs_notes.run_id not (runs_notes.run_id, runs_notes.timestamp)

Review of attachment 675880 [details] [diff] [review]:
-----------------------------------------------------------------

Hrm, run_id needs to be leftmost in the index for the foreign key constraint. I can't see why timestamp is particularly needed, but I don't know the query profile of tbpl_mozilla_org. If there are queries that involve the run_id and the timestamp, keeping this index would be useful - for example, if there's a WHERE filter on run_id and then a sort by timestamp. I feel like we shouldn't remove the index unless we know for a fact that the query(ies) for which it was created are gone. I'd rather change dev to match prod in this case.
(In reply to Sheeri Cabral [:sheeri] from comment #9)
> This is merely a change in the name of an index. In MySQL, you can't rename
> an index, so it has to be dropped and re-created, and we should test whether
> this is an online operation or not.

Ah, I didn't know it had to be dropped. This table is one of the smaller ones, so hopefully wouldn't be too bad (AUTO_INCREMENT=15209, and there has been some purging, so guessing number of rows fair bit lower than this).


(In reply to Sheeri Cabral [:sheeri] from comment #10)
> Comment on attachment 675886 [details] [diff] [review]
> Part 5: Add an 'in_progress' boolean to runs_logs
>
> This is OK; My only question is, are you sure you want to default to
> "in_progress" as "false"?
> 
> Also, note that MySQL will do REALLY well with searches that look for the
> 'needle in a haystack' - so if most of the values are "false", searches on
> "true" will be very fast. But searches on "false" won't even bother to use
> the index, it's faster to do a full table scan than to look at more than
> about 20% of the data.....So if the data will be 50% true, 50% false, an
> index is useless.

Yes sorry 'true' would be more appropriate, since the logs will always be in_progress when first added, and then marked false shortly after.

Unless something goes wrong, we should only have double digits as in_progress=true, so the index should be pretty effective (AUTO_INCREMENT=2733054 for this table, though again, lots of purging).


(In reply to Sheeri Cabral [:sheeri] from comment #11)
> Comment on attachment 675875 [details] [diff] [review]
> Part 1: Remove duplicate index on builders.buildername
> 
> Review of attachment 675875 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> yay for removing duplicate indexes, especially ones that are on a UTF-8
> varchar(255)!

I'm pretty sure we don't need UTF-8 for anything other than the logs (runs_logs.content) and maybe the name of people starring (runs_notes.who), although the latter isn't exactly necessary anyway. Would switching benefit us much? I also suspect that many of the varchars we use are much larger than they need be - is there an easy way to find out the current max lengths used for each field?


(In reply to Sheeri Cabral [:sheeri] from comment #12)
> Comment on attachment 675880 [details] [diff] [review]
> Part 2: Only index runs.run_id not (runs.run_id, runs.timestamp)
> 
> Review of attachment 675880 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Hrm, run_id needs to be leftmost in the index for the foreign key
> constraint. I can't see why timestamp is particularly needed, but I don't
> know the query profile of tbpl_mozilla_org. If there are queries that
> involve the run_id and the timestamp, keeping this index would be useful -
> for example, if there's a WHERE filter on run_id and then a sort by
> timestamp. I feel like we shouldn't remove the index unless we know for a
> fact that the query(ies) for which it was created are gone. I'd rather
> change dev to match prod in this case.

Sorry I meant to add a comment explaining this one.

The only query we run on this is:
{
  $notes = $db->prepare("
    SELECT who, note,
      unix_timestamp(timestamp) AS timestamp
    FROM runs_notes
    WHERE run_id = :runid
    ORDER BY timestamp ASC;");
  $notes->execute(array(":runid" => $run['id']));
  $run["notes"] = $notes->fetchAll(PDO::FETCH_ASSOC);
  unset($run["id"]); // don’t need the sql internal id
  $result[] = $run;
}

Most of the time there will either be 0 or 1 results returned (95% the former). Perhaps 1% of the time there will be 2 (or rarely 3) results (the notes correspond to human added text to explain a failure; the only times we have duplicates is where multiple people have simultaneously added a note before their client updated). I'm presuming the index on timestamp wouldn't save much given how few results are returned?


(In reply to Sheeri Cabral [:sheeri] from comment #13)
> Comment on attachment 675881 [details] [diff] [review]
> Part 3: Use more appropriate int types
> 
> Review of attachment 675881 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I approve wholeheartedly of the int -> int unsigned, and I assume you're
> aware of how much data you generate so that MEDIUMINT is appropriate. For
> reference, MEDIUMINT UNSIGNED has a range from 0-16,777,215.

Yeah, builders and builders_history have AUTO_INCREMENT=15209 and AUTO_INCREMENT=21129 respectively (for 2 years worth of use).


(In reply to Sheeri Cabral [:sheeri] from comment #14)
> Comment on attachment 675885 [details] [diff] [review]
> Part 4: Use more appropriate default values
> 
> Review of attachment 675885 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> buildername is changing from DEFAULT NULL to NOT NULL with no default. Just
> want to make sure that's what you want to do (as opposed to something like
> NOT NULL DEFAULT '' - though my feeling is that NOT NULL with no default is
> appropriate here).

They are all mandatory values, so should all be specified. Is that an acceptable use-case of "NOT NULL with no default"? (excuse my mysql ignorance)


(In reply to Sheeri Cabral [:sheeri] from comment #15)
> (er, for comment 14, the same goes for all the other changes....)
> 
> Also, why change this:
> 
> `starttime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE
> CURRENT_TIMESTAMP
> 
> to the default being "zero time"? Is it so you can easily see what hasn't
> started yet?

The current time will always be incorrect for start time & the value should always be specified anyway. In the off-chance it is not, I'd rather an obvious failure rather than psuedo sorted correctly values. Is that ok?

Thank you for the feedback! :-)
(In reply to Ed Morley [:edmorley UTC+0] from comment #18)
> (In reply to Sheeri Cabral [:sheeri] from comment #9)
> > This is merely a change in the name of an index. In MySQL, you can't rename
> > an index, so it has to be dropped and re-created, and we should test whether
> > this is an online operation or not.
> 
> Ah, I didn't know it had to be dropped. This table is one of the smaller
> ones, so hopefully wouldn't be too bad (AUTO_INCREMENT=15209, and there has
> been some purging, so guessing number of rows fair bit lower than this).
> 

*nod* that should be small enough to not have a problem, yes.

> (In reply to Sheeri Cabral [:sheeri] from comment #10)
> > Comment on attachment 675886 [details] [diff] [review]
> > Part 5: Add an 'in_progress' boolean to runs_logs
> >
[snip]
> 
> Yes sorry 'true' would be more appropriate, since the logs will always be
> in_progress when first added, and then marked false shortly after.
> 
> Unless something goes wrong, we should only have double digits as
> in_progress=true, so the index should be pretty effective
> (AUTO_INCREMENT=2733054 for this table, though again, lots of purging).

OK, great. I'll r+ an attachment with a default of true.


> (In reply to Sheeri Cabral [:sheeri] from comment #11)
> > Comment on attachment 675875 [details] [diff] [review]
> > Part 1: Remove duplicate index on builders.buildername
> > 
> > Review of attachment 675875 [details] [diff] [review]:
> > -----------------------------------------------------------------
> > 
> > yay for removing duplicate indexes, especially ones that are on a UTF-8
> > varchar(255)!
> 
> I'm pretty sure we don't need UTF-8 for anything other than the logs
> (runs_logs.content) and maybe the name of people starring (runs_notes.who),
> although the latter isn't exactly necessary anyway. Would switching benefit
> us much? I also suspect that many of the varchars we use are much larger
> than they need be - is there an easy way to find out the current max lengths
> used for each field?

Sure, you can do SELECT (MAX(LENGTH(field1)) as size1, (MAX(LENGTH(field2)) as size2 FROM..... 

.....or you can do SELECT field1,field2 FROM.... PROCEDURE ANALYSE() and get min/max/avg lengths. (or you can have me do them if you'd like)


> (In reply to Sheeri Cabral [:sheeri] from comment #12)
> > Comment on attachment 675880 [details] [diff] [review]
> > Part 2: Only index runs.run_id not (runs.run_id, runs.timestamp)


> Most of the time there will either be 0 or 1 results returned (95% the
> former). Perhaps 1% of the time there will be 2 (or rarely 3) results (the
> notes correspond to human added text to explain a failure; the only times we
> have duplicates is where multiple people have simultaneously added a note
> before their client updated). I'm presuming the index on timestamp wouldn't
> save much given how few results are returned?

Indeed, the index won't save much, and it will cause more overhead on the writes. I agree with nixing it.


> (In reply to Sheeri Cabral [:sheeri] from comment #13)
> > Comment on attachment 675881 [details] [diff] [review]

> Yeah, builders and builders_history have AUTO_INCREMENT=15209 and
> AUTO_INCREMENT=21129 respectively (for 2 years worth of use).

OK.



> (In reply to Sheeri Cabral [:sheeri] from comment #14)
> > Comment on attachment 675885 [details] [diff] [review]
> > Part 4: Use more appropriate default values
> > 
> > Review of attachment 675885 [details] [diff] [review]:
> > -----------------------------------------------------------------
> > 
> > buildername is changing from DEFAULT NULL to NOT NULL with no default. Just
> > want to make sure that's what you want to do (as opposed to something like
> > NOT NULL DEFAULT '' - though my feeling is that NOT NULL with no default is
> > appropriate here).
> 
> They are all mandatory values, so should all be specified. Is that an
> acceptable use-case of "NOT NULL with no default"? (excuse my mysql
> ignorance)
> 

Indeed, this is exactly the right use-case of "NOT NULL with no default". (as for ignorance, you should see my code! we all have our strengths and weakneses)

> (In reply to Sheeri Cabral [:sheeri] from comment #15)
> > (er, for comment 14, the same goes for all the other changes....)
> > 
> > Also, why change this:
> > 
> > `starttime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE
> > CURRENT_TIMESTAMP
> > 
> > to the default being "zero time"? Is it so you can easily see what hasn't
> > started yet?
> 
> The current time will always be incorrect for start time & the value should
> always be specified anyway. In the off-chance it is not, I'd rather an
> obvious failure rather than psuedo sorted correctly values. Is that ok?

*nod* that's what I thought, and that's acceptable. r+
With default = true

(After the schema change, we'll need to mass change the existing rows to false as part of a staggered landing, but I'll save that to bug 803009)
Attachment #675886 - Attachment is obsolete: true
Attachment #675886 - Flags: review?(scabral)
Attachment #677155 - Flags: review?(scabral)
(In reply to Sheeri Cabral [:sheeri] from comment #19)
> > I'm pretty sure we don't need UTF-8 for anything other than the logs
> > (runs_logs.content) and maybe the name of people starring (runs_notes.who),
> > although the latter isn't exactly necessary anyway. Would switching benefit
> > us much? I also suspect that many of the varchars we use are much larger
> > than they need be - is there an easy way to find out the current max lengths
> > used for each field?
> 
> Sure, you can do SELECT (MAX(LENGTH(field1)) as size1, (MAX(LENGTH(field2))
> as size2 FROM..... 
> 
> .....or you can do SELECT field1,field2 FROM.... PROCEDURE ANALYSE() and get
> min/max/avg lengths. (or you can have me do them if you'd like)

I don't suppose you'd mind doing this?

I don't have access to the DB & when I test locally it's only feasible to import the last few days worth of data, so it doesn't really reflect production.
Comment on attachment 677155 [details] [diff] [review]
Part 5: Add an 'in_progress' boolean to runs_logs v2

Review of attachment 677155 [details] [diff] [review]:
-----------------------------------------------------------------

tested the syntax too, just in case, but it's working fine!
Attachment #677155 - Flags: review?(scabral) → review+
(In reply to Ed Morley [:edmorley UTC+0] from comment #21)
> (In reply to Sheeri Cabral [:sheeri] from comment #19)
> > > I'm pretty sure we don't need UTF-8 for anything other than the logs
> > > (runs_logs.content) and maybe the name of people starring (runs_notes.who),
> > > although the latter isn't exactly necessary anyway. Would switching benefit
> > > us much? I also suspect that many of the varchars we use are much larger
> > > than they need be - is there an easy way to find out the current max lengths
> > > used for each field?
> > 
> > Sure, you can do SELECT (MAX(LENGTH(field1)) as size1, (MAX(LENGTH(field2))
> > as size2 FROM..... 
> > 
> > .....or you can do SELECT field1,field2 FROM.... PROCEDURE ANALYSE() and get
> > min/max/avg lengths. (or you can have me do them if you'd like)
> 
> I don't suppose you'd mind doing this?
> 
> I don't have access to the DB & when I test locally it's only feasible to
> import the last few days worth of data, so it doesn't really reflect
> production.

I can do this, just not today as I'm in Eastern time and leaving shortly.
First thing's first - a bit of information about varchar... from http://dev.mysql.com/doc/refman/5.1/en/char.html:

"In contrast to CHAR, VARCHAR values are stored as a 1-byte or 2-byte length prefix plus data. The length prefix indicates the number of bytes in the value. A column uses one length byte if values require no more than 255 bytes, two length bytes if values may require more than 255 bytes. "

This means that if you have a 20-character value in a varchar(200) field, it only stores 21 bytes - 20 bytes plus one byte for the size.

There *is* something to worry about though, whenever the field is used in a temporary intermediate table (like for a complex query with sorting/group by or other processing) then MySQL allocates the full amount in memory for each field (see http://dba.stackexchange.com/questions/424/performance-implications-of-mysql-varchar-sizes/2029#2029).

So basically, depending on usage it might not be necessary to reduce the size, but it's not a bad idea either....

0) So, I can't speak to needing utf8, but I'm pretty sure you don't need it, since all this stuff is internally generated and I can't see how it might need anything outside latin-1 (there's no user data like "name" that might need it). But you know the data better....

From here, every table is going to be an item in this list:

####################################################################################
1) bugscache.filename is currently a VARCHAR(255), and json is a TEXT field (utf8). Summary - keep json as TEXT, change filename to TINYTEXT:


mysql> select filename,json from bugscache procedure analyse(0,0)\G
*************************** 1. row ***************************
             Field_name: tbpl_mozilla_org.bugscache.filename
              Min_value: (runtestlist.py)
              Max_value: TestRunner.js)
             Min_length: 7
             Max_length: 58
       Empties_or_zeros: 0
                  Nulls: 0
Avg_value_or_avg_length: 23.1702
                    Std: NULL
      Optimal_fieldtype: TINYTEXT NOT NULL
*************************** 2. row ***************************
             Field_name: tbpl_mozilla_org.bugscache.json
              Min_value: []
              Max_value: [{"status":"RESOLVED","summary":"TEST-UNEXPECTED-FAIL | test-install-xpi.js | timed out waiting for modal dialog","resolution":"FIXED","id":718812}]
             Min_length: 2
             Max_length: 7216
       Empties_or_zeros: 0
                  Nulls: 0
Avg_value_or_avg_length: 459.7128
                    Std: NULL
      Optimal_fieldtype: TEXT NOT NULL
2 rows in set (0.00 sec)


####################################################################################
2) in the "builders" table you have:
name varchar(128)
branch varchar(256)
buildername varchar(256)

Given the max lengths, I'd say name could be varchar(80), branch could be varchar(45), and buildername could be varchar(100):

mysql> select name,branch,buildername from builders procedure analyse (0,0)\G
*************************** 1. row ***************************
             Field_name: tbpl_mozilla_org.builders.name
              Min_value: 3
              Max_value: ux-win64-nightly
             Min_length: 1
             Max_length: 66
       Empties_or_zeros: 0
                  Nulls: 0
Avg_value_or_avg_length: 37.6592
                    Std: NULL
      Optimal_fieldtype: VARCHAR(66) NOT NULL
*************************** 2. row ***************************
             Field_name: tbpl_mozilla_org.builders.branch
              Min_value: addon-sdk
              Max_value: ux
             Min_length: 2
             Max_length: 23
       Empties_or_zeros: 0
                  Nulls: 0
Avg_value_or_avg_length: 11.2852
                    Std: NULL
      Optimal_fieldtype: VARCHAR(23) NOT NULL
*************************** 3. row ***************************
             Field_name: tbpl_mozilla_org.builders.buildername
              Min_value: Android alder build
              Max_value: WINNT 6.1 x86-64 ux nightly
             Min_length: 12
             Max_length: 88
       Empties_or_zeros: 0
                  Nulls: 0
Avg_value_or_avg_length: 50.7686
                    Std: NULL
      Optimal_fieldtype: VARCHAR(88) NOT NULL
3 rows in set (0.01 sec)

####################################################################################
3)  in the builders_history table:
action varchar(20)
who varchar(256)
reason varchar(256)
ip varchar(20)

At first glance, I'd take the IP field and convert it to an INT UNSIGNED, and use the INET_NTOA() and INET_ATON() functions to convert back and forth. If you're worried about IPv6 addresses or something and don't want to convert it, then certainly IP doesn't need to be utf8, it can be latin1.

you might change action to varchar(10), since the max length is 6, but I see no harm in leaving it as varchar(20). I'd recommend changing "who" to varchar(45), it certainly doesn't need to be varchar(256). reason can stay varchar(256), and again, probably no need for utf8 on these.

mysql> select action, who, reason, ip from builders_history procedure analyse(0,0)\G
*************************** 1. row ***************************
             Field_name: tbpl_mozilla_org.builders_history.action
              Min_value: hide
              Max_value: unhide
             Min_length: 4
             Max_length: 6
       Empties_or_zeros: 0
                  Nulls: 0
Avg_value_or_avg_length: 5.7174
                    Std: NULL
      Optimal_fieldtype: CHAR(6) NOT NULL
*************************** 2. row ***************************
             Field_name: tbpl_mozilla_org.builders_history.who
              Min_value: aki
              Max_value: Standard8
             Min_length: 3
             Max_length: 24
       Empties_or_zeros: 0
                  Nulls: 8180
Avg_value_or_avg_length: 7.0214
                    Std: NULL
      Optimal_fieldtype: VARCHAR(24)
*************************** 3. row ***************************
             Field_name: tbpl_mozilla_org.builders_history.reason
              Min_value: .
              Max_value: Your green day was actually quite a while ago
             Min_length: 1
             Max_length: 199
       Empties_or_zeros: 0
                  Nulls: 8180
Avg_value_or_avg_length: 40.8070
                    Std: NULL
      Optimal_fieldtype: TINYTEXT
*************************** 4. row ***************************
             Field_name: tbpl_mozilla_org.builders_history.ip
              Min_value: 10.8.33.240
              Max_value: 10.8.81.217
             Min_length: 11
             Max_length: 11
       Empties_or_zeros: 0
                  Nulls: 8180
Avg_value_or_avg_length: 11.0000
                    Std: NULL
      Optimal_fieldtype: CHAR(11)
4 rows in set (0.10 sec)

####################################################################################
4) In the runs table:

buildername varchar(256)
slave varchar(256)
revision varchar(40)
result varchar(20)
branch varchar(256)
log varchar(256)

buildername: change to varchar(100)
slave: change to varchar(45)
revision: change to varchar(20)

result: change to ENUM('busted','exception','retry','success','testfailed') NOT NULL -  adding a choice to an ENUM is an online operation in MySQL 5.1 and up, so there's no worry if you need more than these 5 choices.

branch: change to varchar(45)

log: keep at varchar(256) (just in case)

mysql> select buildername, slave, revision, result, branch, log from runs procedure analyse (0,0) \G
*************************** 1. row ***************************
             Field_name: tbpl_mozilla_org.runs.buildername
              Min_value: Android alder build
              Max_value: WINNT 6.1 x86-64 ux nightly
             Min_length: 14
             Max_length: 88
       Empties_or_zeros: 0
                  Nulls: 0
Avg_value_or_avg_length: 51.0229
                    Std: NULL
      Optimal_fieldtype: VARCHAR(88) NOT NULL
*************************** 2. row ***************************
             Field_name: tbpl_mozilla_org.runs.slave
              Min_value: bld-centos5-32-vmw-001
              Max_value: w64-ix-slave99
             Min_length: 9
             Max_length: 24
       Empties_or_zeros: 0
                  Nulls: 0
Avg_value_or_avg_length: 15.5687
                    Std: NULL
      Optimal_fieldtype: VARCHAR(24) NOT NULL
*************************** 3. row ***************************
             Field_name: tbpl_mozilla_org.runs.revision
              Min_value: 0008531f1429
              Max_value: THUNDERBIRD_
             Min_length: 7
             Max_length: 12
       Empties_or_zeros: 0
                  Nulls: 0
Avg_value_or_avg_length: 12.0000
                    Std: NULL
      Optimal_fieldtype: CHAR(12) NOT NULL
*************************** 4. row ***************************
             Field_name: tbpl_mozilla_org.runs.result
              Min_value: busted
              Max_value: testfailed
             Min_length: 5
             Max_length: 10
       Empties_or_zeros: 0
                  Nulls: 0
Avg_value_or_avg_length: 7.2239
                    Std: NULL
      Optimal_fieldtype: VARCHAR(10) NOT NULL
*************************** 5. row ***************************
             Field_name: tbpl_mozilla_org.runs.branch
              Min_value: addon-sdk
              Max_value: ux
             Min_length: 2
             Max_length: 23
       Empties_or_zeros: 0
                  Nulls: 0
Avg_value_or_avg_length: 10.8465
                    Std: NULL
      Optimal_fieldtype: VARCHAR(23) NOT NULL
*************************** 6. row ***************************
             Field_name: tbpl_mozilla_org.runs.log
              Min_value: http://ftp.mozilla.org/pub/mozilla.org/b2g/nightly/2012/09/2012-09-11-14-03-51-mozilla-central/mozilla-central-ics_armv7a_gecko-nightly-bm12-build1-build15.txt.gz
              Max_value: http://stage.mozilla.org/pub/mozilla.org/xulrunner/nightly/17.0b4-candidates/build1/logs/release-mozilla-beta-xulrunner_win32_build-bm12-build1-build2.txt.gz
             Min_length: 118
             Max_length: 218
       Empties_or_zeros: 0
                  Nulls: 0
Avg_value_or_avg_length: 172.9198
                    Std: NULL
      Optimal_fieldtype: VARCHAR(218) NOT NULL
6 rows in set (7.65 sec)

####################################################################################
5) runs_logs has type varchar(64) and content longblob, but blobs don't have charsets:

Looks like you can change type to a varchar(20)...trying to add "content" here made the query too long and timeout, so I think it's OK to leave it as a longblob.

*************************** 1. row ***************************
             Field_name: tbpl_mozilla_org.runs_logs.type
              Min_value: annotatedsummary
              Max_value: tinderbox_print
             Min_length: 3
             Max_length: 16
       Empties_or_zeros: 0
                  Nulls: 0
Avg_value_or_avg_length: 10.8334
                    Std: NULL
      Optimal_fieldtype: VARCHAR(16) NOT NULL
1 row in set (3.65 sec)

####################################################################################
6) runs_notes has who varchar(256), note text and ip varchar(20).

I have the same advice about IP addresses as I did before - although it looks like there are only 3 IPs that things are coming from, so it could be an ENUM, too. Looks like "runs_notes" should stay varchar(256) or change to TINYTEXT, and note is fine remaining as TEXT.

mysql> select who,note,ip from runs_notes procedure analyse(100,100)\G
*************************** 1. row ***************************
             Field_name: tbpl_mozilla_org.runs_notes.who
              Min_value: :emk
              Max_value: zackw@panix.com
             Min_length: 2
             Max_length: 37
       Empties_or_zeros: 0
                  Nulls: 0
Avg_value_or_avg_length: 7.7481
                    Std: NULL
      Optimal_fieldtype: TINYTEXT NOT NULL
*************************** 2. row ***************************
             Field_name: tbpl_mozilla_org.runs_notes.note
              Min_value: 
Creating Python environment
Traceback (most recent call last):
  File "/builds/slave/m-cen-lnx64/build/python/virtualenv/virtualenv.py", line 2471, in <module>
[...]
OSError: [Errno 17] File exists
[...]

Exception: Error creating virtualenv.
*** Fix above errors and then restart with               "make -f client.mk build"
make[2]: Leaving directory `/builds/slave/m-cen-lnx64/build'
make[2]: *** [configure] Error 1
              Max_value: you need a newer parent
             Min_length: 1
             Max_length: 420
       Empties_or_zeros: 0
                  Nulls: 0
Avg_value_or_avg_length: 11.0405
                    Std: NULL
      Optimal_fieldtype: TEXT NOT NULL
*************************** 3. row ***************************
             Field_name: tbpl_mozilla_org.runs_notes.ip
              Min_value: 10.8.81.215
              Max_value: 10.8.81.217
             Min_length: 11
             Max_length: 11
       Empties_or_zeros: 0
                  Nulls: 0
Avg_value_or_avg_length: 11.0000
                    Std: NULL
      Optimal_fieldtype: ENUM('10.8.81.215','10.8.81.216','10.8.81.217') NOT NULL
3 rows in set (0.13 sec)
####################################################################################
####################################################################################
Assignee: server-ops-database → scabral
Comment on attachment 675880 [details] [diff] [review]
Part 2: Only index runs_notes.run_id not (runs_notes.run_id, runs_notes.timestamp)

Review of attachment 675880 [details] [diff] [review]:
-----------------------------------------------------------------

Now that it's explained, I'm OK with this.
Attachment #675880 - Flags: review?(scabral) → review+
What more is there for this bug?
Assigning to me for now so I remember to come back to this :-)
Assignee: scabral → emorley
At this point in time I don't think it's worth fixing continuing with this bug, since the replacement for TBPL is now in the design phase.

The largest table by far is runs_logs, and most of that will be in the content longblob. So whilst changing the length of some of the other fields / removing unnecessary indexes might gain us a bit; compared to the monthly purge it's not going to win us much.

Thank you for all your help with this Sheeri :-)
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX
Please do let us know when tbpl is to be decommissioned. And if you want DB architectural help, please do include us in those meetings.
(In reply to Sheeri Cabral [:sheeri] from comment #29)
> Please do let us know when tbpl is to be decommissioned. And if you want DB
> architectural help, please do include us in those meetings.

I imagine TBPL will still be required at least until the end of the year (as we gradually transition).

Jeads is working on schema design for the replacement, and has plans for making sure everyone is in the loop (we've only had rough paper designs as of last week).
OK, great to know. We'll continue with the purge as normal.
Product: mozilla.org → Data & BI Services Team
You need to log in before you can comment on or make changes to this bug.