Closed Bug 902395 Opened 11 years ago Closed 10 years ago

Enforce utf8 = true for all installations and remove the utf8 parameter

Categories

(Bugzilla :: Installation & Upgrading, enhancement)

enhancement
Not set
normal

Tracking

()

RESOLVED FIXED
Bugzilla 6.0

People

(Reporter: LpSolit, Assigned: LpSolit)

References

(Blocks 1 open bug)

Details

(Keywords: relnote)

Attachments

(1 file)

The utf8 parameter was implemented in Bugzilla 2.22, see bug 126266. It is set to true for all new installations, and once it's set, there is no way to turn it off again. Bugzilla 3.0 came with a tool named recode.pl which lets installations older than 2.22 convert their existing DB to UTF8, see bug 280633 and bug 304550. During the whole 3.x and 4.x cycles, admins had the possibility to convert their DB to UTF8, which is much better supported than installations with utf8 turned off, especially since Bugzilla 3.2, see bug 363153 and its dependency tree. With Bugzilla 5.0, I think it's time to remove this utf8 parameter entirely and force all installations to use UTF8. This would let us remove all our |if Bugzilla->params->{utf8}| checks from our codebase, and make the code a bit cleaner and faster (the penalty in url_quote() is very visible, see bug 898830). Before enforcing UTF8, I think we should fix bug 868867 first so that we don't lose possible characters outside the BMP when using MySQL, unless such characters cannot be set with utf8 turned off anyway, in which case the dependency can be removed.
I approve of this plan. 3.0 came out in 2007. 6 years is plenty of time.
Bug 405011 implemented a way of dealing with characters outside the BMP. So we can remove the dependency and get on with fixing this bug. So am I right that this bug involves:
* Adding a release note about this
* Changing checksetup.pl so it refuses to complete if the DB is not utf-8 and tells the admin to run recode.pl
* Removing all Bugzilla->params->{utf8} checks from the code and assuming the param is true
* Removing the Bugzilla->params->{utf8} param entirely?
Gerv
No longer depends on: 868867
(In reply to Gervase Markham [:gerv] from comment #2)
> Bug 405011 implemented a way of dealing with characters outside the BMP. So
> we can remove the dependency and get on with fixing this bug.
Did you check the last paragraph of my comment 0 before removing the dependency? Otherwise the other steps are correct and trivial to implement.
Attached patch: patch, v1
Assignee: installation → LpSolit
Status: NEW → ASSIGNED
Attachment #8567591 - Flags: review?(dkl)
Keywords: relnote
Target Milestone: --- → Bugzilla 6.0
Comment on attachment 8567591 (patch, v1): r=dkl
Attachment #8567591 - Flags: review?(dkl) → review+
Flags: approval?
(In reply to Frédéric Buclin from comment #3)
> (In reply to Gervase Markham [:gerv] from comment #2)
> > Bug 405011 implemented a way of dealing with characters outside the BMP. So
> > we can remove the dependency and get on with fixing this bug.
>
> Did you check the last paragraph of my comment 0 before removing the
> dependency? Otherwise the other steps are correct and trivial to implement.
i'd like to see an answer to this before approving (can outside BMP characters be set when utf-8 is off?)
Flags: needinfo?(gerv)
Blocks: 1139414
(In reply to Byron Jones ‹:glob› from comment #6)
> i'd like to see an answer to this before approving (can outside BMP
> characters be set when utf-8 is off?)
If utf-8 is off, then the DB will be using whatever legacy character encoding it was set up with. The only way the data could contain characters outside the BMP is if the original character encoding allowed the encoding of characters which, in Unicode, have been placed outside the BMP. Given that the BMP is supposed to allow loss-free migration from almost any existing character set, I find that possibility highly unlikely, although we could not reduce it to 0 without looking at all possible character sets that MySQL can use, and checking every character in each of them.
Gerv
Flags: needinfo?(gerv)
(In reply to Gervase Markham [:gerv] from comment #7)
> Given that the BMP is supposed to allow loss-free migration from almost any
> existing character set, I find that possibility highly unlikely
That's exactly why I set the dependency on bug 868867, because we don't yet support BMP characters with MySQL. So currently characters in the BMP would be dropped when moving to UTF8.
(In reply to Frédéric Buclin from comment #8)
> That's exactly why I set the dependency on bug 868867, because we don't yet
> support BMP characters with MySQL. So currently characters in the BMP would
> be dropped when moving to UTF8.
thanks gerv and frédéric. looks like we'll have to wait for the mysql version bump before we can commit this.
Depends on: 868867
Flags: approval?
(In reply to Frédéric Buclin from comment #8)
> That's exactly why I set the dependency on bug 868867, because we don't yet
> support BMP characters with MySQL. So currently characters in the BMP would
> be dropped when moving to UTF8.
Do you mean "non-BMP characters" and "characters outside the BMP"? The BMP is the Basic Multilingual Plane - it contains characters like A, $ and ç. I'm fairly sure we support characters in the BMP at the moment :-)
Gerv
(In reply to Gervase Markham [:gerv] from comment #10)
> Do you mean "non-BMP characters" and "characters outside the BMP"?
Ah sorry, yes. I didn't realize I missed "outside". :) Even if these characters are rare in our languages, they exist and should be considered anyway.
(In reply to Frédéric Buclin from comment #11)
> Even if these characters are rare in our languages, they exist and should be
> considered anyway.
Non-BMP characters are non-existent in "our languages". Let me give you some examples of languages from Plane 1: Linear B, Egyptian hieroglyphs, and cuneiform. Plane 2 is CJK ideographs which were not encoded in earlier standards - i.e. very obscure and historical ones. Plane 14 has a small number of meta-characters, like language tag indicators. Planes 15 and 16 are the private use area. The only non-BMP characters which might have some chance of being in a Bugzilla somewhere are some Emoji. My computer doesn't even have a font which renders them. The question is: are we going to let the possibility that there's a Bugzilla somewhere with Emoji in it stop us from making this significant improvement and code simplification? If we are going to worry about that possibility, why not change recode.pl to encode them using the workaround from bug 405011?
Gerv
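To make the plane distinction in the comment above concrete, here is a minimal sketch in Python (not Bugzilla code; the function names unicode_plane and non_bmp_chars are made up for illustration). It assumes only that a character's plane is its code point divided by 0x10000, so anything above U+FFFF is outside the BMP.

```python
def unicode_plane(ch: str) -> int:
    """Return the Unicode plane (0 to 16) of a single character.

    Plane 0 is the Basic Multilingual Plane (BMP).
    """
    return ord(ch) >> 16


def non_bmp_chars(text: str) -> list:
    """Return the characters of `text` that lie outside the BMP."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


# Illustrative samples; the descriptions are informal labels.
samples = {
    "A": "Latin letter",
    "\u00e7": "c with cedilla",
    "\U00010000": "Linear B syllable",
    "\U00013000": "Egyptian hieroglyph",
    "\U00020000": "CJK Extension B ideograph",
    "\U000E0001": "language tag",
    "\U0001F600": "emoji",
}

for ch, label in samples.items():
    print(f"U+{ord(ch):04X}  plane {unicode_plane(ch):2d}  {label}")
```

A scan like non_bmp_chars() over exported comment text would be one hypothetical way for an admin to check whether an installation holds any such characters before converting.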
Hum, per the MySQL 5.5 documentation: "The utf8mb4, utf16, and utf32 character sets were added in MySQL 5.5.3." Those are the only 3 character sets that support 4-byte characters. All other character sets (especially those available in MySQL 5.1 and older) support a maximum of 3 bytes per character, like utf8. http://dev.mysql.com/doc/refman/5.5/en/charset-unicode.html If I understand correctly, this means that you couldn't store any non-BMP character in MySQL 5.1 and older at all. MySQL 5.5.3 was released in March 2010, but Bugzilla has fully supported UTF8 since Bugzilla 3.2, released in November 2008. So my guess would be the following: either an old Bugzilla installation still uses its old non-UTF8 character set which supports 3 or fewer bytes per character, or the installation has already been converted to UTF8 using recode.pl and so doesn't support non-BMP characters anyway. In both cases, there are no non-BMP characters in the DB as it's currently not possible to store them, even with another character set. So forcing utf8 = 1 should be harmless... unless some character set causes conversion to Unicode to go outside the BMP. But I couldn't find any documentation about this.
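The link between "3 bytes per character" and "BMP only" in the comment above can be checked with a short Python sketch. It says nothing about MySQL internals; it only relies on the standard UTF-8 encoding, where code points up to U+FFFF take at most 3 bytes and everything above takes 4. The helper names utf8_len and fits_3byte_utf8 are hypothetical.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def fits_3byte_utf8(text: str) -> bool:
    """True if every character of `text` needs at most 3 UTF-8 bytes,
    which is the case exactly when the text has no non-BMP characters."""
    return all(utf8_len(ch) <= 3 for ch in text)


for ch in ["A", "\u00e7", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X} needs {utf8_len(ch)} byte(s) in UTF-8")
```

Under that assumption, a 3-byte character set such as MySQL's utf8 has room for every BMP character and for nothing beyond it.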
cyborgshadow, could you help us here? See comment 13: is it possible that in MySQL, a character set which is none of utf8mb4, utf16, or utf32 could generate non-BMP characters when converting strings from that character set to the utf8 character set? No other character set supports 4-byte characters besides the three character sets mentioned above, but I wondered if a character from a 3-byte (or smaller) character set could be converted into a non-BMP character when converting to the utf8 character set. If not, then we don't need to fix bug 868867 first as the conversion to utf8 would be safe.
cyborgshadow no longer works for mozilla, and his bmo account is disabled. redirecting the question in comment 14 to sheeri.
Flags: needinfo?(scabral)
Frédéric, I've had problems with odd characters converting from latin1 to utf8 - which is mysqldump's fault, see http://www.pythian.com/blog/beware-default-charset-for-mysqldump-is-utf8-regardless-of-server-default-charset-2/ This particular instance was causing problems because Russian and Serbian Cyrillic characters were being stored in columns that had a latin1 charset. Inserting, updating, and selecting posed no problems, and the Cyrillic characters looked fine. But they turned into mojibake (http://en.wikipedia.org/wiki/Mojibake) when the conversion took place. So I'd say it's probably not a good idea to assume that MySQL won't barf when converting charsets....it's too "friendly". But I agree with Gerv in comment 12. I think you can proceed with minimal risk.
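The kind of mojibake described in the comment above can be reproduced outside MySQL with a small Python sketch. It is a simplified model, not what mysqldump does internally: it assumes the client sent UTF-8 bytes that the column treated as latin1, and it uses Python's "latin-1" codec (ISO-8859-1), which is close to but not identical to MySQL's latin1. The function names are made up for illustration.

```python
def naive_latin1_to_utf8(text: str) -> str:
    """Simulate converting a 'latin1' column that really holds UTF-8 bytes."""
    stored = text.encode("utf-8")        # bytes the client actually wrote
    misread = stored.decode("latin-1")   # column believes each byte is one latin1 char
    converted = misread.encode("utf-8")  # charset conversion re-encodes every "char"
    return converted.decode("utf-8")     # what a UTF-8 client now sees


def repair(garbled: str) -> str:
    """Undo the double encoding, assuming nothing was lost along the way."""
    return garbled.encode("latin-1").decode("utf-8")


original = "\u041f\u0440\u0438\u0432\u0435\u0442"  # Cyrillic "Privet"
garbled = naive_latin1_to_utf8(original)
print(len(original), len(garbled))   # every Cyrillic letter becomes two characters
print(repair(garbled) == original)
```

In this model the damage is reversible because every byte survives; a real conversion that replaces or drops bytes would not be.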
Flags: needinfo?(scabral)
(In reply to Sheeri Cabral [:sheeri] from comment #16)
> But I agree with Gerv in comment 12. I think you can proceed with minimal
> risk.
Ok, so let's go... Thanks for the info.
No longer depends on: 868867
Flags: approval?
Flags: approval? → approval+
To ssh://gitolite3@git.mozilla.org/bugzilla/bugzilla.git 1d96fa1..2ccf81d master -> master
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED