Closed Bug 1115608 Opened 10 years ago Closed 8 years ago

Pushlog ingestion fails if commit messages contain astral unicode characters ("OperationalError: (1366, "Incorrect string value:...")

Categories

(Tree Management :: Treeherder: Data Ingestion, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Assigned: emorley)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

https://tbpl.mozilla.org/?tree=Try&rev=fe405d217531 shows my pile-of-poo try push, but 40 minutes after I pushed, https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=fe405d217531 continues to say that it's still Waiting for a push with revision fe405d217531
Mmm, if someone pushes something more likely to actually be used to mozilla-inbound, like "Bug 1234567 - Fix display of
Severity: normal → blocker
Priority: -- → P1
Heh, pretty awesome that bugzilla ate the rest of that comment. I suspect, having just pushed a couple of things with try syntax which contained no non-ASCII, that what I actually meant to file was "try pushes that don't produce any builds claim in treeherder to be perpetually 'Waiting for a push with revision bb847e6159c2' instead of showing the lack of any builds produced." Which is still fairly severe and should block killing tbpl, but much less severe than having a way to hide pushes on any tree.
Severity: blocker → normal
Priority: P1 → --
Whichever it is, I think we should still take a look sooner rather than later, so making this a P1 again.
Priority: -- → P1
Assignee: nobody → cdawson
Status: NEW → ASSIGNED
Priority: P1 → P2
Un-assigning from myself. I'm not going to get to this soon.
Assignee: cdawson → nobody
Priority: P2 → P3
Blocks: 1191934
Status: ASSIGNED → NEW
Guess we'll need to take the approach used in bug 1275425 here too. (Strip astral unicode characters; will move that to a common file.)
Assignee: nobody → emorley
Depends on: 1275425
Well we could just fix it properly :) I don't know if there will be breakage though; some of the stuff about index lengths did scare me a little.
Assignee: emorley → nobody
Production is currently failing to ingest autoland and mozilla-central pushes due to: django.db.utils:OperationalError: (1366, "Incorrect string value: '\\xF0\\x9F\\x92\\xA3 (...' for column 'comments' at row 1") The issue started on autoland, but the problematic commit has now been merged to mozilla-central. I've asked :tomcat to not merge mozilla-central anywhere else for now. I've tracked the problematic commit back to: https://hg.mozilla.org/integration/autoland/rev/ad8230a681a8 ...which comes from servo vendoring: https://github.com/servo/servo/pull/15769 The first line of the commit message includes an astral unicode character. Options for us: 1) Strip out astral unicode during ingestion 2) Convert the table to utf8mb4 (viable, since it's a small table)
Assignee: nobody → emorley
Priority: P3 → P1
Summary: Treeherder failed to ingest a try push with U+1F4A9 in the commit message → Pushlog ingestion fails if commit messages contain astral unicode characters ("OperationalError: (1366, "Incorrect string value:...")
AFAIR from my days working at a MySQL-powered dating website, given that the (MySQL) utf8 charset is a subset of utf8mb4, converting the charset to binary and only then to utf8mb4 should be fast and not spend pointless time doing noop conversions.
Converting to utf8mb4 seemed preferable, since: * Didn't appear to require code changes to quicker to fix prod * This table is small so quick to run / easy to switch back if we decide stripping characters is preferred * It looks like it's the direction we might take in the future for all tables So against stage I ran: `ALTER TABLE treeherder.commit CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;` ...which completed in 54s. However exceptions continued on stage, which I've tracked down to Django hardcoding the connection charset to utf8: https://github.com/django/django/blob/1.10.5/django/db/backends/mysql/base.py#L239 We can either: 1) Override this by modifying settings.py's DATABASES['default']['OPTIONS'] (though not sure if this will have any implications for other tables) 2) Use the stripping approach for now and decide what our overall utf8 strategy should be later
Let's make "later" not be 2 years again.
Can't be worse than Django itself though, I guess. https://code.djangoproject.com/ticket/18392
Comment on attachment 8842396 [details] [review] [treeherder] mozilla:django-utf8mb4-connection > mozilla:master This worked fine on stage (which has the converted commit table). I'll pre-emptively cherrypick to prod prior to review.
Attachment #8842396 - Flags: review?(james)
I've converted the `commit` table on prod/prototype to utf8mb4 (using the SQL in comment 9), and a prod push is in progress (2a778372 plus this PR and an unrelated exception fix [6b0b8585] cherrypicked).
Comment on attachment 8842396 [details] [review] [treeherder] mozilla:django-utf8mb4-connection > mozilla:master r+'ing this so :emorley can land.
Attachment #8842396 - Flags: review?(james) → review+
Commit pushed to master at https://github.com/mozilla/treeherder https://github.com/mozilla/treeherder/commit/d37cece20b8eb1a37c82a64b0e90261e81af3613 Bug 1115608 - Set Django DB connection charset to utf8mb4 This allows non-BMP unicode values to be sent by the client to MySQL server, to allow insertion into tables that have been converted to utf8mb4 (currently only the `commit` table).
Blocks: 1343630
No issues since this morning. Plus side of the conversion rather than stripping the characters: Treeherder now supports emoji \o/ eg: https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=378fa5c177f1c4055803e2f445cc8ea5d89d2eea I've filed bug 1343630 for handling the longer-term utf8mb4 discussion.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: