Closed Bug 1625258 Opened 5 years ago Closed 4 years ago

BOMless UTF-16LE not autodetected if the first 1024 bytes contain non-Latin1 characters

Categories

(Core :: DOM: HTML Parser, defect, P3)

75 Branch
defect

RESOLVED WONTFIX
Tracking Status
firefox74 --- affected
firefox75 --- affected
firefox76 --- affected

People

(Reporter: didec3662, Unassigned, NeedInfo)

Attachments

(2 files)

Attached file file.zip

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0

Actual results:

The exported HTML file is displayed as if it were plain text (the HTML syntax is ignored). Google Chrome opens the file the same way as Firefox (incorrectly). But freaking Microsoft Edge the file shows correct. The attached ZIP file contains part of the file and PNG previews from the different browsers.

Hi,

Thanks for the details. I was able to reproduce this on Windows 10 Pro with the following versions:

Release 74.0 (64-bit)
Beta 75.0b11 (64-bit)
Firefox Nightly 76.0a1 (2020-03-31) (64-bit)

I will move this over to a component so developers can take a look at it. If it is not the correct component, please feel free to change it to an appropriate one.

Thanks for the report.

Best regards, Clara.

Component: Untriaged → DOM: Core & HTML
Product: Firefox → Core
Status: UNCONFIRMED → NEW
Ever confirmed: true

Removing some special characters from the original test htm file.

(In reply to Alphan Chen [:alchen] from comment #2)

Created attachment 9138557 [details]
PartOfTheExportedFileNew.htm

Removing some special characters from the original test htm file.

I think the problem is related to encoding.
If I remove some special characters (attachment 9138557 [details]), it can be viewed normally.
e.g. ř, ž, ů, ě

Hi Henri, could you leave some comments on this?

Flags: needinfo?(hsivonen)
Priority: -- → P3

The file is a BOMless UTF-16LE document. We detect this case only if the code points in the first 1024 bytes are all below U+0100. (The function name suggests below U+0080, but it looks like the name doesn't match what the function does.) The detection was added in bug 631751.
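
To make the condition above concrete, here is a minimal Python sketch. It is not Gecko's code: the helper name `looks_like_bomless_utf16le` and the constant `SNIFF_LIMIT` are hypothetical, and the sketch assumes the check amounts to "every UTF-16LE code unit in the first 1024 bytes has a zero high byte". The real function may apply further conditions.

```python
SNIFF_LIMIT = 1024  # number of leading bytes inspected, per the comment above


def looks_like_bomless_utf16le(data: bytes) -> bool:
    """Hypothetical helper: True if every UTF-16LE code unit in the
    first SNIFF_LIMIT bytes is below U+0100 (its high byte is zero)."""
    head = data[:SNIFF_LIMIT]
    # Drop a trailing odd byte so only complete 2-byte code units remain.
    head = head[: len(head) - len(head) % 2]
    if not head:
        return False
    # In little-endian order the high byte of each unit sits at odd offsets.
    return all(b == 0 for b in head[1::2])


latin1_doc = "<html><body>caf\u00e9</body></html>".encode("utf-16-le")
czech_doc = "<html><body>\u0159</body></html>".encode("utf-16-le")  # contains ř
late_czech_doc = ("a" * 600 + "\u0159").encode("utf-16-le")  # ř after byte 1024

print(looks_like_bomless_utf16le(latin1_doc))
print(looks_like_bomless_utf16le(czech_doc))
print(looks_like_bomless_utf16le(late_czech_doc))
```

Under this assumption, a character such as ř (U+0159) anywhere in the inspected prefix produces a nonzero high byte and defeats the check, while the same character after the first 1024 bytes does not.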

Detecting BOMless UTF-16[LE|BE] in the general case is problematic. See https://en.wikipedia.org/wiki/Bush_hid_the_facts
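
The ambiguity behind that link can be reproduced in a few lines of Python. This is only an illustration of why a byte stream can be valid in two encodings at once, not a description of any browser's detector: the ASCII bytes of the sentence from the article also form a well-formed UTF-16LE sequence.

```python
# Illustration only: ASCII bytes that are also well-formed UTF-16LE.
ascii_bytes = b"Bush hid the facts"  # 18 bytes, so 9 UTF-16 code units

# Strict decoding succeeds because no byte pair forms a lone surrogate.
as_utf16 = ascii_bytes.decode("utf-16-le")

# Check whether every resulting code point lies in the
# CJK Unified Ideographs block (U+4E00..U+9FFF).
all_cjk = all(0x4E00 <= ord(ch) <= 0x9FFF for ch in as_utf16)
print(len(as_utf16), all_cjk)
```

A detector that sees only these bytes has no structural way to tell which reading was intended, which is why the general case is risky.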

Detecting it assuming the content has HTML tags is less problematic, but I'm still inclined to treat this as WONTFIX unless there's a very good reason to do otherwise.

Reporter, where did the file come from?

Flags: needinfo?(hsivonen) → needinfo?(didec3662)
Summary: Incorrect view of the exported HTML file → BOMless UTF-16LE not autodetected if the first 1024 bytes contain non-Latin1 characters
See Also: → 631751
Component: DOM: Core & HTML → DOM: HTML Parser

The component has been changed since the backlog priority was decided, so we're resetting it.
For more information, please visit the auto_nag documentation.

Priority: P3 → --

Setting back to P3 for now, although this is most likely WONTFIX.

Priority: -- → P3

Because this bug's Severity has not been changed from the default since it was filed, and its Priority is P3 (Backlog), indicating it has been triaged, the bug's Severity is being updated to S3 (normal).

Severity: normal → S3

But freaking Microsoft Edge the file shows correct.

New Edge shows it like Chrome.

See also bug 1727491.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → WONTFIX