The hack in bug 1275425 doesn't work when Python is compiled with UCS2 (which happens to be the case on Heroku). To fix this, a separate codepath is needed to cover the case where a non-BMP character has a string length greater than 1.
Created attachment 8758786 [treeherder] mozilla:astral_hack_ucs2 > mozilla:master
Commits pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/5162728d63b8a6e36b0602d727768d8822b64d35
Bug 1277300 - Fix astral plan character hack to work with UCS2 python

Python can either be compiled in UCS2 mode or UCS4 mode, which differ in how non-BMP characters are represented. In the former they are represented by a (user-exposed) surrogate pair so that a single astral codepoint is encoded in a string of length 2. In the latter all characters are represented by a single 4 byte value and all characters have a string length of 1 e.g. len(u"\U0010FFFF") is 2 in UCS2 python and 1 in UCS4 python.

Several functions don't work identically between the two variants e.g. ord() will only accept a single python character and ord(u"\U0010FFFF") will either succeed or fail depending on the compile time options. Similarly, regexps can't directly work with characters outside the BMP in UCS2 Python, and one must instead match the individual parts of the surrogate pair.

This is obviously a terrible design mistake, but it's important to us because although most environments use the forgiving/sane UCS4 configuration, Heroku currently uses edge-case-happy UCS2 configuration. Therefore where we are dealing with non-BMP characters we must add in individual codepaths for each variant.

There is probably a moral in this story about why "just make it an option" is generally a terrible idea.

https://github.com/mozilla/treeherder/commit/46224c12b38873dca776ba06610208b6f611454a
Merge pull request #1547 from mozilla/astral_hack_ucs2
Bug 1277300 - Fix astral plan character hack to work with UCS2 python
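
For illustration, here is a minimal sketch of what "individual codepaths for each variant" can look like. It is not the code from the commit above: the names IS_NARROW_BUILD, ASTRAL_RE, codepoint and escape_astral, and the "<U+XXXX>" placeholder format, are hypothetical. It assumes the build type can be detected with sys.maxunicode, which is 0xFFFF on UCS2 builds and 0x10FFFF on UCS4 builds (and on every Python 3.3+).

```python
import re
import sys

# Narrow (UCS2) builds report 0xFFFF; wide (UCS4) builds and Python 3.3+
# report 0x10FFFF.
IS_NARROW_BUILD = sys.maxunicode == 0xFFFF

if IS_NARROW_BUILD:
    # UCS2: a non-BMP character is a high surrogate followed by a low one,
    # so the regexp has to match the two halves separately.
    ASTRAL_RE = re.compile(u"[\ud800-\udbff][\udc00-\udfff]")
else:
    # UCS4: a non-BMP character is a single item and can be matched directly.
    ASTRAL_RE = re.compile(u"[\U00010000-\U0010ffff]")


def codepoint(char):
    """Return the code point of one character (hypothetical helper).

    Accepts either a single character or a two-item surrogate pair, because
    ord() rejects the pair on a UCS2 build.
    """
    if len(char) == 2:
        high, low = ord(char[0]), ord(char[1])
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
    return ord(char)


def escape_astral(text):
    """Replace each non-BMP character with a <U+XXXX> placeholder.

    The placeholder format is an assumption made for this example.
    """
    return ASTRAL_RE.sub(lambda m: u"<U+%04X>" % codepoint(m.group(0)), text)
```

Calling escape_astral(u"a\U0001F600b") should give the same result on either build, because only the regexp and the code point calculation differ between the two variants.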