OperationalError(1366, "Incorrect string value: '\\xF0\\x9D\\x90\\x80\\xF0\\x9D...' for column 'message' at row 1")

RESOLVED FIXED

Status

Product: Tree Management
Component: Treeherder: Data Ingestion
Priority: P1
Severity: normal
Status: RESOLVED FIXED
Reported: 2 years ago
Modified: 9 months ago

People

(Reporter: emorley, Assigned: jgraham)

Tracking

(Blocks: 2 bugs)

Details

Attachments

(1 attachment)

(Reporter)

Description

2 years ago
<jgraham> In [3]: print "\xF0\x9D\x90\x80".decode("utf8")
<jgraham> 
Flags: needinfo?(james)
(Reporter)

Comment 1

2 years ago
Eugh the unicode broke Bugzilla.

<jgraham> In [3]: print "\xF0\x9D\x90\x80".decode("utf8")
<jgraham> <REDACTED since it breaks bugzilla>
<jgraham> Well that didn't show up so well here but it's totally a legit character
<jgraham> So this is just the usual MySQL terribleness
<jgraham> We need to switch that field to be utf8mb4
<jgraham> https://mathiasbynens.be/notes/mysql-utf8mb4 is the canonical write-up. But https://code.djangoproject.com/ticket/18392 makes it sound like Django itself doesn't support this correctly, or something
<jgraham> Which is pretty unbelievable
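For reference, a minimal Python 3 sketch (the IRC log above uses Python 2 syntax) of why this character trips MySQL's "utf8" character set: the four bytes quoted in the error decode to a single code point above U+FFFF, which needs four bytes in UTF-8. The variable names are just for illustration.

```python
# The four bytes quoted in the error message.
raw = b"\xF0\x9D\x90\x80"

char = raw.decode("utf-8")
code_point = ord(char)

print(len(char))                  # number of code points decoded
print(hex(code_point))            # the code point value
print(len(char.encode("utf-8")))  # bytes needed to store it as UTF-8
print(code_point > 0xFFFF)        # True means it lies outside the BMP
```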

Things we can do:
1) Most urgent: Do not perform a Celery retry for errors of this type, since we know the retry is going to fail too. This would at least stop the exception spam and the extra load/backlog that ensues.
2) Filter the affected messages in the meantime.
3) File another bug for the long-term fix of changing to utf8mb4.
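Option 1 could look roughly like the sketch below. It is not the Treeherder code: `FakeOperationalError`, `should_retry` and `store_with_retry` are hypothetical names, and a plain loop stands in for Celery's retry machinery. The only assumption about the real driver is that the error code is the first element of the exception's `args`, as in the 1366 error in this bug's summary.

```python
# MySQL error code for "Incorrect string value", as in this bug's summary.
INCORRECT_STRING_VALUE = 1366


class FakeOperationalError(Exception):
    """Hypothetical stand-in for the driver's OperationalError.

    args[0] is assumed to hold the numeric MySQL error code.
    """


def should_retry(exc):
    """Return False for errors that would fail identically on every retry."""
    code = exc.args[0] if exc.args else None
    return code != INCORRECT_STRING_VALUE


def store_with_retry(store, message, max_retries=3):
    """Call store(message), retrying only errors that are worth retrying."""
    for attempt in range(max_retries + 1):
        try:
            return store(message)
        except FakeOperationalError as exc:
            # Give up at once on charset errors, or when retries run out.
            if not should_retry(exc) or attempt == max_retries:
                raise
```

With this shape, a charset error surfaces after a single attempt, while transient errors still get the usual number of retries.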

https://rpm.newrelic.com/accounts/677903/applications/4180461/filterable_errors#/show/4f5109-f43cfb99-2203-11e6-b947-b82a72d22a14/stack_trace?top_facet=transactionUiName&primary_facet=error.class&barchart=barchart&_k=f6psad

Comment 2

2 years ago
Created attachment 8756424
[treeherder] mozilla:utf8_astral_hack > mozilla:master
(Assignee)

Updated

2 years ago
Flags: needinfo?(james)
Attachment #8756424 - Flags: review?(emorley)
(Reporter)

Updated

2 years ago
Attachment #8756424 - Flags: review?(emorley) → review+
(Reporter)

Updated

2 years ago
Assignee: nobody → james
(Reporter)

Updated

2 years ago
Blocks: 1115608

Comment 3

2 years ago
Commits pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/18d0fdc8aab8c18d9830e03269efc33342b6e5e7
Bug 1275425 - Hacky workaround for issues storing astral characters.

Test names, messages, etc. may contain UTF8 characters from beyond the
Basic Multilingual Plane ("astral" characters). Unfortunately MySQL's
"utf8" character set is nothing of the sort and will only store a
maximum of three bytes per character, thus restricting it to BMP
characters. The correct fix for this is to switch to the utf8mb4
character set. Since such a change is somewhat involved, however, we
address the immediate problem with a hack.

When storing failure lines, if the operation fails for character set
related reasons, try again with any non-BMP characters replaced by a
marker of the form <U+codepoint> e.g. <U+10FFFF>.
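A minimal Python 3 sketch of the replacement step described above, for illustration only. `ASTRAL_RE` and `replace_astral` are hypothetical names, and the exact marker formatting in the actual commit (for example, any zero-padding of the code point) may differ.

```python
import re

# Matches any single code point outside the Basic Multilingual Plane.
ASTRAL_RE = re.compile("[\U00010000-\U0010FFFF]")


def replace_astral(text):
    """Replace each non-BMP character with a <U+codepoint> marker."""
    return ASTRAL_RE.sub(
        lambda match: "<U+{:X}>".format(ord(match.group(0))), text
    )
```

In the workflow the commit message describes, this would only be applied on the second attempt, after the first insert has failed for character set reasons, so BMP-only failure lines are stored untouched.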

Note further that whether MySQL fails here or silently replaces each
byte of the original character with a U+FFFD replacement character
depends on the value of the sql_mode setting. If this is set to
STRICT_ALL_TABLES, we get an error; otherwise we get silent data
loss. It is therefore important that this setting is consistent across
all environments.
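One way to keep sql_mode consistent, sketched as a Django settings fragment. This is an assumption about how it could be done, not how Treeherder configures it: the database name is a placeholder, and other connection settings are omitted. The `init_command` option is passed through to the MySQL driver and runs on each new connection.

```python
# Hypothetical settings fragment; "example_db" is a placeholder name.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "example_db",
        "OPTIONS": {
            # Run on every new connection so that all environments
            # raise an error instead of silently losing data.
            "init_command": "SET sql_mode='STRICT_ALL_TABLES'",
        },
    }
}
```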

https://github.com/mozilla/treeherder/commit/c3128c01fb65a60b102ba3779290e3c2c5043a42
Merge pull request #1512 from mozilla/utf8_astral_hack

Bug 1275425 - Hacky workaround for issues storing astral characters.
(Assignee)

Updated

2 years ago
Blocks: 1277300
(Reporter)

Updated

2 years ago
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED
(Reporter)

Updated

a year ago
Depends on: 1306976
(Reporter)

Updated

9 months ago
Blocks: 1343630