MDN truncates pages with Unicode supplementary characters (codepoint above U+FFFF)

NEW
Unassigned

Status

--
major
6 years ago
2 months ago

People

(Reporter: mozillabugs, Unassigned)

Tracking

(Depends on: 2 bugs)

Details

(Whiteboard: [specification][type:bug])

(Reporter)

Description

6 years ago
What did you do?
================
I edited the page
https://developer.mozilla.org/en-US/docs/JavaScript/Guide/Values,_variables,_and_literals
to document 21st century Unicode, and used an emoji character (Unicode code point U+1F604) in a sample string.

What happened?
==============
The page got truncated at the emoji character.

What should have happened?
==========================
The page should be shown completely. Duh.

Is there anything else we should know?
======================================
I can't show the emoji character here because bugzilla is equally broken. However, you can find it here:
http://www.fileformat.info/info/unicode/char/1f604/browsertest.htm
(Reporter)

Comment 1

6 years ago
Janet tried to revert the page to its state before my edits; the history shows success, but the generated page is still truncated.
Severity: normal → major
The original page has been restored. I just did a hard refresh (Ctrl + Shift + R). Not sure why that's necessary, but it should make a good workaround for now.
Emoji require 4-byte unicode characters, which are not supported by the utf8 character set used by our tables in MySQL.  There is a utf8mb4 character set:

http://stackoverflow.com/questions/7814293/how-to-insert-utf-8-mb4-characteremoji-in-ios5-in-mysql#8109948

We're planning on moving to PostgreSQL, and will ensure that 4-byte unicode is supported in the new database.
Depends on: 1159930
Duplicate of this bug: 1385724
Summary: MDN truncates pages with Unicode supplementary characters → MDN truncates pages with Unicode supplementary characters (codepoint above U+FFFF)

Comment 5

2 years ago
Since before the turn of the century, it's been a violation of the Unicode standard to destroy data simply because it contains code points beyond U+FFFF. Now in the year 2017, it's amazing that a known bug this fundamental and severe has gone unfixed for four years, in a website as important and technically oriented as MDN.

Evidently switching from MySQL to PostgreSQL is complicated and MDN has been working on that transition for years.

In the meantime, how hard could it be to upgrade the existing MySQL databases to utf8mb4 or varbinary? Maybe a few hours of work? None of the existing text data needs to change, only the metadata, to enable insertion of 21st-century Unicode text.

MDN should treat this bug as its highest priority. It's insane to focus on redesigning the superficial appearance of the site when the database is broken.

This page that I edited is currently truncated:

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Text_formatting

I apologize for my role in its truncation. Who would have guessed that insertion of 21st-century Unicode text would have that effect. Is there a way to recover the previous content of that page?

Comment 6

2 years ago
I did manage to revert the Text_formatting page and make a few corrections to it.

Comment 7

2 years ago
This bug isn't assigned to anyone. I'd like to help.

The importance should be changed from "major" to "critical", since it causes loss of data and prevents documenting modern technology.

This bug currently "depends on 1159930". I think that turns out to be a mistake. Switching from MySQL to PostgreSQL is evidently difficult and doesn't appear to have critical importance or to have anyone assigned to it. There doesn't appear to be any expectation for bug 1159930 to be resolved within any time frame. Bug 1159930 may be a worthwhile long-term goal if anyone is sufficiently motivated, but it's counterproductive to treat an easy-to-solve critical bug as dependent on a hard-to-solve non-critical enhancement.

This bug can be solved with a few MySQL commands, like "ALTER TABLE mytable charset=utf8mb4, MODIFY COLUMN textfield1 VARCHAR(255) CHARACTER SET utf8mb4..." as described in the stackoverflow link referred to above. Figuring out the right MySQL commands should take a few hours at most. Of course, the webmaster will need to make backups, upgrade to a newer version of MySQL if a version before 5.5 is currently in use, and be in charge of actually running the commands while minimizing disruption of service.

Could anyone please add a comment here indicating which MySQL version is currently in use, and references to the table structures? Also, this same bug exists on both developer.mozilla.org and bugzilla.mozilla.org. Are there two separate databases for the two subdomains? Do they run on different servers? Is there a mechanism in place for duplicating the websites for test purposes?
I'm sorry you were affected by this bug, and thank you for advocating for it.

I believe major is the correct severity, due to the frequency this limitation is encountered. In any case, Bugzilla severity is one of several factors in deciding what work to do, and changing it will not change the priority.

If you'd like to work on this, a good place to start is the Kuma development environment.  It uses MySQL 5.6 and has the same table structure as production:

https://kuma.readthedocs.io/en/latest/

Yes, bugzilla.mozilla.org (BMO) has a different database and development process. If there's an existing bug on that issue, you can reach the BMO team by following and commenting on that bug. If not, you can file it and refer to this bug.

Comment 9

2 years ago
Thank you, John.

I've installed Docker and Kuma. The Kuma installation has problems running, similar to those reported at https://github.com/puckel/docker-airflow/issues/18 ("Can't connect to MySQL server on 'mysql' (111)"); I'll try to solve that.

Anyway, the source of bug 857438 is evident in kuma/mdn_sample_db.sql. It has 138 occurrences of 'utf8' and zero occurrences of 'utf8mb4'.

The solution should be to replace all occurrences in kuma/mdn_sample_db.sql of 'utf8' with 'utf8mb4', with the exception of one line that begins "INSERT INTO ..." and includes URLs like "https://github.com/search?utf8=...".

Comment 10

2 years ago
BTW, the corresponding BMO bug is: https://bugzilla.mozilla.org/show_bug.cgi?id=868867
This appears to be a good guide for updating to utfmb4: https://mathiasbynens.be/notes/mysql-utf8mb4

It mentions a potential problem:

> The InnoDB storage engine has a maximum index length of 767 bytes, so for utf8 or utf8mb4 columns, you can index a maximum of 255 or 191 characters, respectively. If you currently have utf8 columns with indexes longer than 191 characters, you will need to index a smaller number of characters when using utf8mb4. (Because of this, I had to change some indexed VARCHAR(255) columns to VARCHAR(191).)

We have a unique index on email fields, defined as VARCHAR(255), and reducing that to VARCHAR(191) would mean that some valid email addresses would not be allowed:

https://stackoverflow.com/questions/386294/what-is-the-maximum-length-of-a-valid-email-address
https://stackoverflow.com/questions/2049502/what-characters-are-allowed-in-an-email-address
https://tools.ietf.org/html/rfc6531

It's not clear to me if supplementary unicode characters are allowed in email addresses, and may require a closer deeper reading of RFC 6531.

To keep the unique index on emails, we may need to keep the utf8 encoding, or wait for InnoDB to support 1026-key indexes. I'm not sure if there is an open bug for that.

Or, switch to postgres.
bug 1253535 is specific to bugzilla.mozilla.org, and comment 46 [1] has some of the requirements to get it working (MySQL 5.7, DYNAMIC/COMPRESSED row formats, innodb_large_prefix setting).

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1253535#c46

Comment 13

2 years ago
Yes, innodb_large_prefix should solve the VARCHAR(255) problem, and DYNAMIC/COMPRESSED row formats may be beneficial for efficiency. The complete solution should be to make those two changes along with changing utf8 to utf8mb4.
Duplicate of this bug: 1508878
Depends on: 1525165
You need to log in before you can comment on or make changes to this bug.