Closed Bug 1277300 Opened 8 years ago Closed 8 years ago

Fix astral plane hack to work with UCS2 python

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: jgraham, Assigned: jgraham)

References

Details

Attachments

(1 file)

[treeherder] mozilla:astral_hack_ucs2 > mozilla:master (8 years ago, GitHub Autolander Bot; 47 bytes, text/x-github-pull-request; emorley: review+)

James Graham [:jgraham]

Assignee

Description

•

8 years ago

The hack in bug 1275425 doesn't work when Python is compiled with UCS2 (which happens to be the case on Heroku). To fix this, a separate code path is needed to cover cases where non-BMP characters have a length greater than 1.

GitHub Autolander Bot

Comment 1

•

8 years ago

Attached file: [treeherder] mozilla:astral_hack_ucs2 > mozilla:master

Ed Morley [:emorley]

Updated

•

8 years ago

Blocks: treeherder-heroku

Ed Morley [:emorley]

Updated

•

8 years ago

Attachment #8758786 - Flags: review+

Ed Morley [:emorley]

Updated

•

8 years ago

Assignee: nobody → james

Ed Morley [:emorley]

Updated

•

8 years ago

Blocks: 1277304

Treeherder GitHub Bugbot

Comment 2

•

8 years ago

Commits pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/5162728d63b8a6e36b0602d727768d8822b64d35
Bug 1277300 - Fix astral plane character hack to work with UCS2 python

Python can be compiled in either UCS2 mode or UCS4 mode, which differ in
how non-BMP characters are represented. In the former they are
represented by a (user-exposed) surrogate pair, so that a single astral
codepoint is encoded in a string of length 2. In the latter every
character is represented by a single 4-byte value and has a string
length of 1. For example, len(u"\U0010FFFF") is 2 in UCS2 Python and 1
in UCS4 Python.
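
As an illustration of the surrogate-pair representation described above, here is a minimal sketch. It is not code from the Treeherder commit; the names `IS_WIDE_BUILD`, `surrogate_pair` and `expected_len` are hypothetical. It relies on `sys.maxunicode`, which is 0xFFFF on a UCS2 build and 0x10FFFF on a UCS4 build (and always 0x10FFFF on Python 3.3 and later, where the compile-time option no longer exists).

```python
import sys

# True on a UCS4 ("wide") build, False on a UCS2 ("narrow") build.
IS_WIDE_BUILD = sys.maxunicode > 0xFFFF


def surrogate_pair(codepoint):
    """Return the (high, low) surrogate code units for a non-BMP codepoint."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("not a non-BMP codepoint: %r" % (codepoint,))
    offset = codepoint - 0x10000
    # The top 10 bits go in the high surrogate, the bottom 10 in the low one.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def expected_len(codepoint):
    """Length that len() reports for a one-codepoint string on this build."""
    if IS_WIDE_BUILD or codepoint <= 0xFFFF:
        return 1
    return 2
```

On a UCS2 build, u"\U0010FFFF" would be stored as the two code units returned by surrogate_pair(0x10FFFF), which is why its len() is 2 there.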

Several functions don't work identically between the two variants.
For example, ord() only accepts a single Python character, so
ord(u"\U0010FFFF") will either succeed or fail depending on the
compile-time options. Similarly, regexps can't directly work with
characters outside the BMP in UCS2 Python, and one must instead match
the individual parts of the surrogate pair.
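
A rough sketch of what such per-variant handling could look like follows. This is an assumption-laden illustration, not the patch that landed; `ASTRAL_RE` and `to_codepoint` are hypothetical names. The pattern is chosen according to `sys.maxunicode`, and the helper stands in for ord() on a matched string of either length.

```python
import re
import sys

if sys.maxunicode > 0xFFFF:
    # UCS4 build: an astral character is one code point, so a range works.
    ASTRAL_RE = re.compile(u"[\U00010000-\U0010FFFF]")
else:
    # UCS2 build: match a high surrogate followed by a low surrogate.
    ASTRAL_RE = re.compile(u"[\uD800-\uDBFF][\uDC00-\uDFFF]")


def to_codepoint(s):
    """Return the codepoint of a single character or a surrogate pair."""
    if len(s) == 1:
        return ord(s)
    if len(s) != 2:
        raise ValueError("expected one character or one surrogate pair")
    high, low = ord(s[0]), ord(s[1])
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    # Reverse the surrogate split: 10 bits from each half, plus 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

With this, ASTRAL_RE.findall(text) returns each non-BMP character as a string of length 1 or 2 depending on the build, and to_codepoint() gives the same integer in both cases.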

This is obviously a terrible design mistake, but it's important to us
because, although most environments use the forgiving/sane UCS4
configuration, Heroku currently uses the edge-case-happy UCS2
configuration. Therefore, wherever we deal with non-BMP characters, we
must add an individual codepath for each variant.

There is probably a moral in this story about why "just make it an
option" is generally a terrible idea.

https://github.com/mozilla/treeherder/commit/46224c12b38873dca776ba06610208b6f611454a
Merge pull request #1547 from mozilla/astral_hack_ucs2

Bug 1277300 - Fix astral plane character hack to work with UCS2 python

Ed Morley [:emorley]

Updated

•

8 years ago

Status: NEW → RESOLVED

Closed: 8 years ago

Resolution: --- → FIXED

Ed Morley [:emorley]

Updated

•

8 years ago

Blocks: 1292720

Categories

(Tree Management :: Treeherder: Data Ingestion, defect)

Security

(public)
