Textual diffs containing non-UTF-8 text cannot be represented on Phabricator
Categories
(Conduit :: General, defect)
Tracking
(Not tracked)
People
(Reporter: hsivonen, Unassigned)
References
Details
It appears that Lando silently replaces bytes that are invalid UTF-8 with the REPLACEMENT CHARACTER. This is bad when landing test cases that try to test non-UTF-8 encodings.
Here's a changeset landed with Lando:
https://hg.mozilla.org/integration/autoland/rev/f593045cc48f
Here's the same changeset without data loss landed by pushing to inbound directly:
https://hg.mozilla.org/integration/mozilla-inbound/rev/76dbcd8529c4
(Use e.g. curl
to pipe to less
to see the difference. The two pages look the same in Firefox.)
Comment 1•5 years ago
|
||
(In reply to Henri Sivonen (:hsivonen) from comment #0)
It appears that Lando silently replaces bytes that are invalid UTF-8 with the REPLACEMENT CHARACTER.
This isn't Lando / transplant's doing, the problem is with Phabricator and the tool used to upload the diff (in this case, phlay).
From https://secure.phabricator.com/book/phabricator/article/utf8/:
Phabricator stores all internal text data as UTF-8, processes all text data as UTF-8, outputs in UTF-8, and expects all inputs to be UTF-8. Principally, this means that you should write your source code in UTF-8. In most cases this does not require you to change anything, because ASCII text is a subset of UTF-8.
I believe we might be able to work around this by treating files that aren't valid UTF-8 as binary when uploading. The reviewing experience will be poor, requiring the reviewer to inspect the file outside the browser, but I think it should make it through Lando / Transplant properly.
Comment 2•5 years ago
|
||
I vaguely remember this coming up in the past. :zalun, do you know if moz-phab is doing the correct thing here?
Updated•5 years ago
|
Comment 3•5 years ago
|
||
I think this should be fixed for the beta version of phabsend-moz now, thanks for the reminder smacleod.
(@ probably fails, but differently to phlay, since it's still using the same createrawdiff
call as upstream phabsend, which looks okay in the Phab UI but based on that automated failure probably can't be applied by lando)
I vaguely remember this coming up in the past. :zalun, do you know if moz-phab is doing the correct thing here?
moz-phab and arc will mark non-utf8 files as binary (arc with confirmation, moz-phab automatically):
Invalid Content Encoding (Non-UTF8)
This diff includes files which are not valid UTF-8 (they contain invalid byte
sequences). You can either stop this workflow and fix these files, or
continue. If you continue, these files will be marked as binary.
You can learn more about how Phabricator handles character encodings (and how
to configure encoding settings and detect and correct encoding problems) by
reading 'User Guide: UTF-8 and Character Encoding' in the Phabricator
documentation.
AFFECTED FILES
modules/freetype2/builds/amiga/README
modules/freetype2/builds/amiga/include/config/ftconfig.h
modules/freetype2/builds/amiga/include/config/ftmodule.h
modules/freetype2/builds/amiga/makefile
modules/freetype2/builds/amiga/makefile.os4
modules/freetype2/builds/amiga/smakefile
modules/freetype2/builds/amiga/src/base/ftdebug.c
modules/freetype2/builds/amiga/src/base/ftsystem.c
Do you want to mark these files as binary and continue? Yes
Uploaded binary data for "README".
Uploaded binary data for "README".
Uploaded binary data for "ftconfig.h".
Uploaded binary data for "ftconfig.h".
Uploaded binary data for "ftmodule.h".
Uploaded binary data for "ftmodule.h".
Uploaded binary data for "makefile".
Uploaded binary data for "makefile".
Uploaded binary data for "makefile.os4".
Uploaded binary data for "makefile.os4".
Uploaded binary data for "smakefile".
Uploaded binary data for "smakefile".
Uploaded binary data for "ftdebug.c".
Uploaded binary data for "ftdebug.c".
Uploaded binary data for "ftsystem.c".
Uploaded binary data for "ftsystem.c".
Upload complete.
the files in the above case are latin1 in-tree.
steven - i think this is a WONTFIX: phabricator is UTF8 only, and the official clients do the right thing.
Comment 5•5 years ago
|
||
(In reply to Byron Jones ‹:glob› 🎈 from comment #4)
steven - i think this is a WONTFIX: phabricator is UTF8 only, and the official clients do the right thing.
Yup.
Description
•