If you think a bug might affect users in the 57 release, please set the correct tracking and status flags for Release Management.

Prepare MySQL database for emoji by performing utf8mb4 conversion

NEW
Unassigned

Status

()

bugzilla.mozilla.org
Infrastructure
P3
normal
9 months ago
6 months ago

People

(Reporter: dylan, Unassigned)

Tracking

(Blocks: 1 bug)

Production

Details

User Story

- BMO will need to be hard-hatted (webheads, cronjobs and daemons off)
- set innodb_file_format=Barracuda
- set innodb_file_per_table=ON (this should already be the case)
- set innodb_large_prefix=1
- for all tables T, alter table $T ROW_FORMAT=COMPRESSED (or DYNAMIC, at the DBA's discretion.)
- for all columns that are utf8, convert to utf8mb4
- we un-hard-hat.
- bug 1253535 will be commited
- we follow a normal schema-only code push

From then on, everyone can use poop emoji.
(Reporter)

Description

9 months ago
We need to do the following operations to all the mysql servers:

- set innodb_file_format=Barracuda
- set innodb_file_per_table=ON (this should already be the case)
- set innodb_large_prefix=1
- for all tables T, alter table $T ROW_FORMAT=COMPRESSED (or DYNAMIC, at the DBA's discretion.)

More detailed planning will be deposited in the user story field.

I suspect it is safest to plan this for a TCW?
(Reporter)

Comment 1

9 months ago
I need two things from you on this bug:

1) Do the steps outlined in comment #0 seem correct and complete? 
The goal is to support full utf8mb4 in all fields, and support large prefixes so we don't run into the index limit problems for varchar(255).

2) Assuming we find a TCW to use, what does your availability look like in Q1?
Flags: needinfo?(mpressman)
Any estimate on how long BMO will be unavailable? What impact might this have on replication to AWS?

Next TCW looks to be around Feb 11th or 18th, FWIW.
(Reporter)

Comment 3

9 months ago
(In reply to Kendall Libby [:fubar] from comment #2)
> Any estimate on how long BMO will be unavailable? 
About as long a DB dump and reload, so the worst case is multiple hours.

> What impact might this
> have on replication to AWS?

I'm not sure, we'll want to find out.
(Reporter)

Comment 4

9 months ago
Ping. Still waiting to hear from :mpressman about this.
(Reporter)

Comment 5

8 months ago
So we've soft-announced bug 1253535 on dev-platform, but I'm still waiting on a reply to comment #1.

Natalie: Can we get some motion on this bug?
Flags: needinfo?(nhaywood)
Sorry for the delay here. Since this is prepare and not enact utf8mb4, what I see is fine for converting. I'm available for either the 11th or the 18th, whichever is the next TCW
Flags: needinfo?(mpressman)
Just to check, when are you then planning to convert to utf8mb4?
Flags: needinfo?(nhaywood)
(Reporter)

Comment 8

8 months ago
As we'd be commiting bug 1253535 at the same time, the utf8mb4 conversion would happen then. 
thinking about that, I'm concerned the push script may time out if checksetup.pl takes hours, so we can convert the charset
before hand.

I've updated the user story with the complete set of steps, and changed the summary of this bug to better reflect what we want.

Can you give the user-story a once over and let me know if I'm missing a step or if there are any other DB concerns?
Thanks!
User Story: (updated)
Flags: needinfo?(mpressman)
Summary: Prepare MySQL database for supporting utf8mb4 conversion → Prepare MySQL database for support emoji by performing utf8mb4 conversion
(Reporter)

Updated

8 months ago
Summary: Prepare MySQL database for support emoji by performing utf8mb4 conversion → Prepare MySQL database for emoji by performing utf8mb4 conversion
User story looks good. Just one question on an audit of the size of columns and indexes being used in the new utf8mb4 columns. utf to utf8mb4 conversion will use less characters which can cause problems as it uses 4 bytes per character instead of 3. I did a quick check of the schema and see there are tinytext columns. These can use 85 characters in utf8, but only 63 characters in utf8mb4. I didn't fully examine to see if any of these are used in indexes as the index size for text/varchar columns, but I see you are increasing the default index length to 3072 bytes from 767 with the innodb_large_prefix=1.
Flags: needinfo?(mpressman)

Updated

7 months ago
Flags: needinfo?(dylan)
(Reporter)

Comment 10

7 months ago
(In reply to Matt Pressman [:mpressman] from comment #9)
> User story looks good. Just one question on an audit of the size of columns
> and indexes being used in the new utf8mb4 columns. utf to utf8mb4 conversion
> will use less characters which can cause problems as it uses 4 bytes per
> character instead of 3. I did a quick check of the schema and see there are
> tinytext columns. These can use 85 characters in utf8, but only 63
> characters in utf8mb4. I didn't fully examine to see if any of these are
> used in indexes as the index size for text/varchar columns, but I see you
> are increasing the default index length to 3072 bytes from 767 with the
> innodb_large_prefix=1.

Just to be clear, it is my understanding that with large prefixes, tinytext can be longer than 63 characters.
If this is not the case -- if any fields will be truncated by this change, I need to know ASAP.
Flags: needinfo?(dylan) → needinfo?(mpressman)
Using the large_prefix increases the index key prefix length. Not the column size itself. A tinytext column can store 255 bytes, which means existing utf8 encoding can only store 85 three-byte characters. When converting to utf8mb4, it will use four-byte characters, thus limiting to only 63 characters in a tinytext column. So any tinytext columns that exceed 63 characters will need their data types changed in order to prevent it from being truncated
Flags: needinfo?(mpressman)
(Reporter)

Updated

6 months ago
Priority: P1 → P3
You need to log in before you can comment on or make changes to this bug.