857438 - MDN truncates pages with Unicode supplementary characters (codepoint above U+FFFF)

Reporter

Description

•

13 years ago

What did you do? ================ I edited the page https://developer.mozilla.org/en-US/docs/JavaScript/Guide/Values,_variables,_and_literals to document 21st century Unicode, and used an emoji character (Unicode code point U+1F604) in a sample string. What happened? ============== The page got truncated at the emoji character. What should have happened? ========================== The page should be shown completely. Duh. Is there anything else we should know? ====================================== I can't show the emoji character here because bugzilla is equally broken. However, you can find it here: http://www.fileformat.info/info/unicode/char/1f604/browsertest.htm

Norbert Lindenberg

Reporter

Comment 1

•

13 years ago

Janet tried to revert the page to its state before my edits; the history shows success, but the generated page is still truncated.

Severity: normal → major

John Karahalis [:openjck]

Comment 2

•

13 years ago

The original page has been restored. I just did a hard refresh (Ctrl + Shift + R). Not sure why that's necessary, but it should make a good workaround for now.

John Whitlock [:jwhitlock]

Comment 3

•

8 years ago

Emoji require 4-byte unicode characters, which are not supported by the utf8 character set used by our tables in MySQL. There is a utf8mb4 character set: http://stackoverflow.com/questions/7814293/how-to-insert-utf-8-mb4-characteremoji-in-ios5-in-mysql#8109948 We're planning on moving to PostgreSQL, and will ensure that 4-byte unicode is supported in the new database.

Depends on: 1159930

John Whitlock [:jwhitlock]

Updated

•

8 years ago

Summary: MDN truncates pages with Unicode supplementary characters → MDN truncates pages with Unicode supplementary characters (codepoint above U+FFFF)

tangmu

Comment 5

•

8 years ago

Since before the turn of the century, it's been a violation of the Unicode standard to destroy data simply because it contains code points beyond U+FFFF. Now in the year 2017, it's amazing that a known bug this fundamental and severe has gone unfixed for four years, in a website as important and technically oriented as MDN. Evidently switching from MySQL to PostgreSQL is complicated and MDN has been working on that transition for years. In the meantime, how hard could it be to upgrade the existing MySQL databases to utf8mb4 or varbinary? Maybe a few hours of work? None of the existing text data needs to change, only the metadata, to enable insertion of 21st-century Unicode text. MDN should treat this bug as its highest priority. It's insane to focus on redesigning the superficial appearance of the site when the database is broken. This page that I edited is currently truncated: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Text_formatting I apologize for my role in its truncation. Who would have guessed that insertion of 21st-century Unicode text would have that effect. Is there a way to recover the previous content of that page?

tangmu

Comment 6

•

8 years ago

I did manage to revert the Text_formatting page and make a few corrections to it.

tangmu

Comment 7

•

8 years ago

This bug isn't assigned to anyone. I'd like to help. The importance should be changed from "major" to "critical", since it causes loss of data and prevents documenting modern technology. This bug currently "depends on 1159930". I think that turns out to be a mistake. Switching from MySQL to PostgreSQL is evidently difficult and doesn't appear to have critical importance or to have anyone assigned to it. There doesn't appear to be any expectation for bug 1159930 to be resolved within any time frame. Bug 1159930 may be a worthwhile long-term goal if anyone is sufficiently motivated, but it's counterproductive to treat an easy-to-solve critical bug as dependent on a hard-to-solve non-critical enhancement. This bug can be solved with a few MySQL commands, like "ALTER TABLE mytable charset=utf8mb4, MODIFY COLUMN textfield1 VARCHAR(255) CHARACTER SET utf8mb4..." as described in the stackoverflow link referred to above. Figuring out the right MySQL commands should take a few hours at most. Of course, the webmaster will need to make backups, upgrade to a newer version of MySQL if a version before 5.5 is currently in use, and be in charge of actually running the commands while minimizing disruption of service. Could anyone please add a comment here indicating which MySQL version is currently in use, and references to the table structures? Also, this same bug exists on both developer.mozilla.org and bugzilla.mozilla.org. Are there two separate databases for the two subdomains? Do they run on different servers? Is there a mechanism in place for duplicating the websites for test purposes?

John Whitlock [:jwhitlock]

Comment 8

•

8 years ago

I'm sorry you were affected by this bug, and thank you for advocating for it. I believe major is the correct severity, due to the frequency this limitation is encountered. In any case, Bugzilla severity is one of several factors in deciding what work to do, and changing it will not change the priority. If you'd like to work on this, a good place to start is the Kuma development environment. It uses MySQL 5.6 and has the same table structure as production: https://kuma.readthedocs.io/en/latest/ Yes, bugzilla.mozilla.org (BMO) has a different database and development process. If there's an existing bug on that issue, you can reach the BMO team by following and commenting on that bug. If not, you can file it and refer to this bug.

tangmu

Comment 9

•

8 years ago

Thank you, John. I've installed Docker and Kuma. The Kuma installation has problems running, similar to those reported at https://github.com/puckel/docker-airflow/issues/18 ("Can't connect to MySQL server on 'mysql' (111)"); I'll try to solve that. Anyway, the source of bug 857438 is evident in kuma/mdn_sample_db.sql. It has 138 occurrences of 'utf8' and zero occurrences of 'utf8mb4'. The solution should be to replace all occurrences in kuma/mdn_sample_db.sql of 'utf8' with 'utf8mb4', with the exception of one line that begins "INSERT INTO ..." and includes URLs like "https://github.com/search?utf8=...".

tangmu

Comment 10

•

8 years ago

BTW, the corresponding BMO bug is: https://bugzilla.mozilla.org/show_bug.cgi?id=868867

John Whitlock [:jwhitlock]

Comment 11

•

8 years ago

This appears to be a good guide for updating to utfmb4: https://mathiasbynens.be/notes/mysql-utf8mb4 It mentions a potential problem: > The InnoDB storage engine has a maximum index length of 767 bytes, so for utf8 or utf8mb4 columns, you can index a maximum of 255 or 191 characters, respectively. If you currently have utf8 columns with indexes longer than 191 characters, you will need to index a smaller number of characters when using utf8mb4. (Because of this, I had to change some indexed VARCHAR(255) columns to VARCHAR(191).) We have a unique index on email fields, defined as VARCHAR(255), and reducing that to VARCHAR(191) would mean that some valid email addresses would not be allowed: https://stackoverflow.com/questions/386294/what-is-the-maximum-length-of-a-valid-email-address https://stackoverflow.com/questions/2049502/what-characters-are-allowed-in-an-email-address https://tools.ietf.org/html/rfc6531 It's not clear to me if supplementary unicode characters are allowed in email addresses, and may require a closer deeper reading of RFC 6531. To keep the unique index on emails, we may need to keep the utf8 encoding, or wait for InnoDB to support 1026-key indexes. I'm not sure if there is an open bug for that. Or, switch to postgres.

John Whitlock [:jwhitlock]

Comment 12

•

8 years ago

bug 1253535 is specific to bugzilla.mozilla.org, and comment 46 [1] has some of the requirements to get it working (MySQL 5.7, DYNAMIC/COMPRESSED row formats, innodb_large_prefix setting). [1] https://bugzilla.mozilla.org/show_bug.cgi?id=1253535#c46

tangmu

Comment 13

•

8 years ago

Yes, innodb_large_prefix should solve the VARCHAR(255) problem, and DYNAMIC/COMPRESSED row formats may be beneficial for efficiency. The complete solution should be to make those two changes along with changing utf8 to utf8mb4.

John Whitlock [:jwhitlock]

Updated

•

7 years ago

Depends on: 1525165

BMO Automation

Comment 15

•

5 years ago

MDN Web Docs' bug reporting has now moved to GitHub. From now on, please file content bugs at https://github.com/mdn/sprints/issues/ and platform bugs at https://github.com/mdn/kuma/issues/.

Status: NEW → RESOLVED

Closed: 5 years ago

Resolution: --- → WONTFIX

BMO Automation

Updated

•

5 years ago

Product: developer.mozilla.org → developer.mozilla.org Graveyard

Bugzilla

MDN truncates pages with Unicode supplementary characters (codepoint above U+FFFF)

Categories

(developer.mozilla.org Graveyard :: General, defect)

Tracking

(Not tracked)

People

(Reporter: mozillabugs, Unassigned)

References

Details

(Whiteboard: [specification][type:bug])

Crash Data

Security

(public)

User Story

Description

Comment 1

Comment 2

Comment 3

Updated

Comment 5

Comment 6

Comment 7

Comment 8

Comment 9

Comment 10

Comment 11

Comment 12

Comment 13

Updated

Comment 15

Updated