Closed Bug 1275425 Opened 8 years ago Closed 8 years ago

OperationalError(1366, "Incorrect string value: '\\xF0\\x9D\\x90\\x80\\xF0\\x9D...' for column 'message' at row 1")

Categories

(Tree Management :: Treeherder: Data Ingestion, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: jgraham)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

<jgraham> In [3]: print "\xF0\x9D\x90\x80".decode("utf8")
<jgraham> 
Flags: needinfo?(james)
Eugh the unicode broke Bugzilla.

<jgraham> In [3]: print "\xF0\x9D\x90\x80".decode("utf8")
<jgraham> <REDACTED since it breaks bugzilla>
<jgraham> Well that didn't show up so well here but it's totally a legit character
<jgraham> So this is just the usual MySQL terribleness
<jgraham> We need to switch that field to be utf8mb4
<jgraham> https://mathiasbynens.be/notes/mysql-utf8mb4 is the canonical write up. But https://code.djangoproject.com/ticket/18392 makes it sound like django itself doesn't support this correctly, or something
<jgraham> Which is pretty unbelievable
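
To make the point above concrete, here is a small sketch (Python 3 syntax, unlike the Python 2 session quoted above) that decodes the four bytes from the error message. It only uses the standard library; the variable names are illustrative.

```python
import unicodedata

# The first four bytes from the "Incorrect string value" error message.
raw = b"\xF0\x9D\x90\x80"
char = raw.decode("utf-8")
code_point = ord(char)

print("U+{:04X}".format(code_point))  # U+1D400
print(unicodedata.name(char))         # MATHEMATICAL BOLD CAPITAL A
print(len(raw))                       # 4 bytes in UTF-8

# Code points above U+FFFF lie outside the Basic Multilingual Plane and
# need four bytes in UTF-8, one more than MySQL's "utf8" charset stores.
is_astral = code_point > 0xFFFF
print(is_astral)                      # True
```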

Things we can do:
1) Most urgent: Do not perform a Celery retry for errors of this type, since we know the retry will fail in the same way. This would at least stop the exception spam and the extra load/backlog that ensues.
2) Filter the offending characters out of the messages in the meantime
3) File another bug for the long-term fix of changing to utf8mb4
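
A minimal sketch of option 1, assuming the driver's exception carries the MySQL error code as its first argument (as the OperationalError(1366, ...) in the summary suggests). `FakeOperationalError` and `should_retry` are hypothetical names for illustration, not Treeherder or Celery APIs; a real task would consult something like this before asking Celery to retry.

```python
INCORRECT_STRING_VALUE = 1366  # MySQL error code from the bug summary


class FakeOperationalError(Exception):
    """Hypothetical stand-in for the database driver's OperationalError."""


def should_retry(exc):
    """Return False for errors that would fail identically on every attempt."""
    code = exc.args[0] if exc.args else None
    return code != INCORRECT_STRING_VALUE


charset_error = FakeOperationalError(1366, "Incorrect string value: ...")
transient_error = FakeOperationalError(2006, "MySQL server has gone away")

print(should_retry(charset_error))    # False
print(should_retry(transient_error))  # True
```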

https://rpm.newrelic.com/accounts/677903/applications/4180461/filterable_errors#/show/4f5109-f43cfb99-2203-11e6-b947-b82a72d22a14/stack_trace?top_facet=transactionUiName&primary_facet=error.class&barchart=barchart&_k=f6psad
Flags: needinfo?(james)
Attachment #8756424 - Flags: review?(emorley)
Attachment #8756424 - Flags: review?(emorley) → review+
Assignee: nobody → james
Blocks: 1115608
Commits pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/18d0fdc8aab8c18d9830e03269efc33342b6e5e7
Bug 1275425 - Hacky workaround for issues storing astral characters.

Test names, messages, etc. may contain UTF8 characters from beyond the
Basic Multilingual Plane ("astral" characters). Unfortunately MySQL's
"utf8" character set is nothing of the sort and will only store a
maximum of three bytes per character, thus restricting it to BMP
characters. The correct fix for this is to switch to the utf8mb4
character set. Since such a change is somewhat involved, however, we
address the immediate problem with a hack.

When storing failure lines, if the operation fails for character set
related reasons, try again with any non-BMP characters replaced by a
marker of the form <U+codepoint> e.g. <U+10FFFF>.
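
The replace-and-retry idea described above could be sketched roughly as follows. This is not the code from the commit: `replace_astral`, `store_with_fallback` and the `store` callable are hypothetical names, and the sketch assumes the storage layer raises an exception whose first argument is the MySQL error code 1366.

```python
import re

# Matches any code point outside the Basic Multilingual Plane.
ASTRAL_RE = re.compile("[\U00010000-\U0010FFFF]")


def replace_astral(text):
    """Replace each non-BMP character with a <U+codepoint> marker."""
    return ASTRAL_RE.sub(
        lambda match: "<U+{:X}>".format(ord(match.group(0))), text
    )


def store_with_fallback(store, message):
    """Try store(message); on a character set error, retry with markers."""
    try:
        store(message)
    except Exception as exc:
        # Re-raise anything that is not the "Incorrect string value" error.
        if not exc.args or exc.args[0] != 1366:
            raise
        store(replace_astral(message))


print(replace_astral("\U0010FFFF"))    # <U+10FFFF>
print(replace_astral("a\U0001D400b"))  # a<U+1D400>b
```

Trying the unmodified message first means rows without astral characters are stored untouched, and the filter only runs on the failure path.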

Note further that whether MySQL fails here or silently replaces
each byte of the original character with a U+FFFD replacement character
depends on the value of the sql_mode setting. If this is set to
STRICT_ALL_TABLES, we get an error; otherwise we get silent data
loss. It is therefore important that this setting is consistent across
all environments.
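
One way to pin sql_mode per connection in a Django project is the MySQL backend's `init_command` option, sketched below. The database name is a placeholder, and whether Treeherder configures the setting this way (rather than in the server configuration) is an assumption.

```python
# Fragment of a Django settings module (illustrative values only).
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "example_db",  # hypothetical database name
        "OPTIONS": {
            # Runs on every new connection, so each environment raises an
            # error on unstorable characters instead of losing data silently.
            "init_command": "SET sql_mode='STRICT_ALL_TABLES'",
        },
    }
}
```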

https://github.com/mozilla/treeherder/commit/c3128c01fb65a60b102ba3779290e3c2c5043a42
Merge pull request #1512 from mozilla/utf8_astral_hack

Bug 1275425 - Hacky workaround for issues storing astral characters.
Blocks: 1277300
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Depends on: 1306976
Blocks: 1343630