Closed Bug 1279982 Opened 8 years ago Closed 8 years ago

pages are sometimes interpreted as windows-1252 instead of UTF-8 until a reload is done

Categories

(Core :: DOM: Core & HTML, defect)

45 Branch
defect
Not set
normal

RESOLVED WORKSFORME

People

(Reporter: vincent-moz, Unassigned)

Attachments

(4 files)

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0
Build ID: 20160607223741

Steps to reproduce:

1. Start firefox -safe-mode -no-remote
2. Create a fresh profile
3. Open https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=827249
4. From it, open https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=807528


Actual results:

For both pages, the text encoding is windows-1252, so that accented characters are incorrect.


Expected results:

The correct encoding is UTF-8, which is obtained when I do Ctrl-Shift-R to force a reload.

This may be a cache issue, because I did the following with my main Firefox profile:

When I reloaded

  https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=807528

directly, it was in UTF-8 (confirmed with Live HTTP Headers). Then I opened this URL via the link on:

  https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=790825#84

and Live HTTP Headers showed nothing (so I assume that it came from the cache), but the accented characters are now incorrect and "View Page Info" says windows-1252 for the text encoding.
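
The symptom described above (accented characters shown incorrectly) is what one would expect when UTF-8 bytes are decoded as windows-1252. A minimal Python sketch of that effect; the sample word is an assumption chosen for illustration, not text from the Debian pages:

```python
# Illustrative only: "café" is a made-up sample, not content from the bug pages.
text = "café"

utf8_bytes = text.encode("utf-8")       # bytes a UTF-8 page sends on the wire
garbled = utf8_bytes.decode("cp1252")   # what a windows-1252 decoder would display
restored = utf8_bytes.decode("utf-8")   # what a UTF-8 decoder would display

print(garbled)    # the single accented letter turns into two characters
print(restored)
```

Each accented letter takes two bytes in UTF-8, so a single-byte decoder such as windows-1252 shows two characters in its place.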
I can't reproduce this problem with the current Nightly: 50.0a1 (2016-06-13).
(In reply to Vincent Lefevre from comment #1)
> I can't reproduce this problem with the current Nightly: 50.0a1 (2016-06-13).

Actually I could reproduce it with Nightly.
When I load the page initially (or via shift+reload), it's served as UTF-8 by the server, but when I reload, the server instead claims it's "ISO-8859-1".
More precisely, by using Web Developer → Network:

1. I open https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=827249 and get a page with incorrect accented characters and in the response headers:

    Content-Type: "text/html; charset=ISO-8859-1"

2. I do Ctrl-Shift-R to force a reload, and the page is now correct. In the response headers:

    Content-Type: "text/html; charset=utf-8"

3. I open https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=827249 again and I get the page from the cache, which is correct (contrary to what happened with Firefox 45.2.0).

So, there are two problems:

1. For the initial URL open, I get charset=ISO-8859-1, which is incorrect. This is specific to Firefox: no such problem with wget, lynx, w3m and Opera. This happens with both Firefox 45.2.0 and Nightly. This problem disappears with Ctrl-Shift-R.

2. When the page is obtained from the cache after a Ctrl-Shift-R, the charset is incorrect with Firefox 45.2.0, but I couldn't reproduce this problem with Nightly.
(In reply to Alex from comment #3)
> When I load the page initially (or via shift+reload) it's served as UTF-8 by
> the server, but when I reload the server instead claims it's "ISO-8859-1"

Yes, I confirm that a simple Ctrl-R in Firefox gives ISO-8859-1 (but with lynx, a Ctrl-R still gives UTF-8).
I now see the same problem with w3m when restarting w3m with the same URL. So, problem (1) is a server issue, while problem (2) seems to be an issue with Firefox 45.2.0.
Chrome, Safari and Firefox all behave the same for me (And none of them seem to use the cache, the server reports 200 with the same etag instead of a 304)

Does lynx do re-validation? That's when the server returns the incorrect headers for me.
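
One way to compare the two cases outside a browser is to send a forced-reload style request and a revalidation style request and look at the charset in each response. The sketch below is a rough approximation under stated assumptions: the header sets only imitate what a browser might send, and the function names are hypothetical helpers, not part of any tool mentioned in this bug.

```python
import urllib.error
import urllib.request
from email.message import Message


def charset_of(content_type):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


# Assumption: roughly what a forced reload (Ctrl-Shift-R) sends.
FORCED_RELOAD = {"Cache-Control": "no-cache", "Pragma": "no-cache"}


def revalidation_headers(etag):
    """Assumption: roughly what a plain reload sends to revalidate a cached copy."""
    return {"If-None-Match": etag, "Cache-Control": "max-age=0"}


def fetch_charset(url, headers):
    """GET the URL with the given headers and return (status, charset)."""
    req = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status, charset_of(resp.headers.get("Content-Type", ""))
    except urllib.error.HTTPError as err:
        # urllib raises for non-2xx answers such as 304 Not Modified.
        return err.code, charset_of(err.headers.get("Content-Type", ""))
```

To try it, call fetch_charset(url, FORCED_RELOAD) first, then fetch_charset(url, revalidation_headers(etag)) with the ETag value copied from the first response, and compare the two charsets.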
(In reply to Alex from comment #7)
> Chrome, Safari and Firefox all behave the same for me (And none of them seem
> to use the cache, the server reports 200 with the same etag instead of a 304)

Firefox uses the cache when one opens the URL very shortly after a (forced) reload.

> Does lynx do re-validation? That's when the server returns the incorrect
> headers for me.

Like w3m, same problem with lynx when I restart it on the same URL. After a Ctrl-R, I get the correct charset. I don't know what Ctrl-R does exactly; it is just documented as "Reload current file and refresh the screen". Perhaps it's like Ctrl-Shift-R in Firefox, which would explain the behavior.
Component: Untriaged → DOM
Product: Firefox → Core
It seems that I can no longer reproduce problem (2).
Thanks for the report and details. If you can reproduce, please re-open this bug.
Status: UNCONFIRMED → RESOLVED
Closed: 8 years ago
Resolution: --- → WORKSFORME
Component: DOM → DOM: Core & HTML

Looks like this problem has reappeared in 100.0.1 mint001 on Linux Mint, 64 Bit. Also seen on Windows 7, 64 Bit.

Visit http://biblio.aktionsgruppe.de/obiblio/opac/index.php and search for Marek. The site is an instance of Obiblio library software.

The pages that it generates are clearly marked <META http-equiv="content-type" content="text/html; charset=iso-8859-1">, which Firefox also reports in Page Info (Ctrl-I). There Firefox states

content-type text/html; charset=iso-8859-1
but above that, it reads

Text encoding Windows1252
and all accented characters are wrong. They are correct in the database (checked with phpMyAdmin), the web page is correctly marked, but still...

Possibly linked to this behaviour is an odd problem I have seen for a couple of weeks on various systems, Obiblio being one of them, but also on instances of Tiki Wiki:

These are sites that worked flawlessly for years and did NOT get an update of the PHP software or a change of collation in the database.

When a user fills in a text entry field that the PHP program will use to perform a search: is any collation information sent from Firefox to the website?

When you enter text to be searched, the (PHP) programs now throw an error "illegal mix of collations". It looks like, in text entry fields, Firefox does not pass the page's own encoding as collation (which would be latin1_german2_ci for iso-8859-1) but utf8_general_ci, if accented characters are present in the user input. It does not happen when there are no accented characters in the user input.

Thanks
hman
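
On the collation question above: an HTML form submission carries no collation information. Collation is a database-side concept; the browser only sends the entered text as percent-encoded bytes, normally in the page's encoding. The sketch below shows how the same search term looks on the wire in the two encodings. The search term is a hypothetical example, and this does not establish what causes the "illegal mix of collations" error on those sites.

```python
from urllib.parse import quote

term = "Müller"  # hypothetical search term containing one accented letter

as_latin1 = quote(term, encoding="iso-8859-1")  # one byte for the accented letter
as_utf8 = quote(term, encoding="utf-8")         # two bytes for the accented letter

print(as_latin1)
print(as_utf8)

# Pure-ASCII input is byte-for-byte identical in both encodings.
ascii_same = quote("Marek", encoding="iso-8859-1") == quote("Marek", encoding="utf-8")
```

Input without accented characters produces the same bytes either way, which would be consistent with the error appearing only when accented characters are present.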

The HTTP response headers contain:

Content-Type: text/html; charset=OBIB_CHARSET

OBIB_CHARSET is not a valid charset. So the issue would be the configuration of the web server, not Firefox.
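
A quick way to check whether a charset label is recognised at all is sketched below. It uses Python's codec registry, which is only an approximation: it is not the label list that browsers use, so treat the result as a hint. The helper name is hypothetical.

```python
import codecs


def is_known_charset(label):
    """True if Python's codec registry recognises the label (an approximation only)."""
    try:
        codecs.lookup(label)
        return True
    except LookupError:
        return False


for label in ("utf-8", "iso-8859-1", "OBIB_CHARSET"):
    print(label, is_known_charset(label))
```

OBIB_CHARSET looks like an unexpanded placeholder name, which no registry would recognise as an encoding.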

In HTML the meta statement is <META http-equiv="content-type" content="text/html; charset=iso-8859-1">

And Firefox detects ISO 8859-1, but does not use it, see screenshot. Hm, cannot attach a screenshot here?

Codepage detection info page.

Uh, yes, despite the META, network analysis showed indeed Content-Type: text/html; charset=OBIB_CHARSET. Thanks for pointing me to that.

And I got lucky. I did not write Open Biblio software, but this was the result of a bug that was quickly found and corrected, thank you.

Now the correct Content-Type text/html; charset=iso-8859-1 is written in the HTTP response header.

But - Firefox still doesn't do it right. Firefox still correctly recognizes iso-8859-1, but still uses windows-1252 and produces dysfunctional diacritics...

Concerning windows-1252 while the page is declared as iso-8859-1 is related to this bug. This is bug 897302 / bug 890478.

I meant is unrelated to this bug.
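
For context on the windows-1252 label: the WHATWG Encoding Standard maps the label iso-8859-1 to the windows-1252 encoding, which is presumably why Page Info shows windows-1252 for a page declared as iso-8859-1. The sketch below, using Python's own codecs as a stand-in, shows that the two encodings agree on the accented Western European letters and differ only in the 0x80-0x9F range.

```python
# Bytes 0xA0-0xFF hold the accented Western European letters in both encodings.
high = bytes(range(0xA0, 0x100))
same_for_accents = high.decode("latin-1") == high.decode("cp1252")

# The encodings differ only in 0x80-0x9F: control codes in ISO-8859-1,
# printable characters such as the euro sign in windows-1252.
byte_80_cp1252 = b"\x80".decode("cp1252")
byte_80_latin1 = b"\x80".decode("latin-1")

print(same_for_accents)
print(repr(byte_80_cp1252), repr(byte_80_latin1))
```

So decoding a genuinely ISO-8859-1 page as windows-1252 would not by itself damage accented letters.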

That bug was closed as invalid. Declaring windows1252 can never be correct on a Linux machine, because that code page does not exist. Also, the actual text rendering is obviously wrong and NOT iso8859-1, just look at it.

Table is set to biblio_status_dm InnoDB 10 Dynamic 9 1820 16384 0 0 0 NULL 2022-05-29 11:57:30 NULL NULL latin1_german2_ci NULL row_format=DYNAMIC
latin1_german2_ci = iso-8859-1 character set. And this is the rendered result:

Render result.

Text content in the table.

(In reply to Oliver Kluge from comment #22)

Bildschirmfoto vom 2022-06-11 02-22-06.png

This looks incorrect because the characters are encoded in UTF-8 while the server declares iso-8859-1. You need to fix the configuration of the server so that it declares utf-8 (at the same time, this will avoid the windows-1252 nonsense).
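
To check which charset a server ought to declare, one can test whether the bytes it sends are valid UTF-8. A minimal sketch; the helper name and the sample word are hypothetical, and a real check would run on the raw response body:

```python
def looks_like_utf8(data: bytes) -> bool:
    """True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


sample = "Müller"  # hypothetical text with one accented letter

print(looks_like_utf8(sample.encode("utf-8")))    # bytes as stored in UTF-8
print(looks_like_utf8(sample.encode("latin-1")))  # bytes as stored in ISO-8859-1
```

Non-ASCII text encoded in ISO-8859-1 is rarely valid UTF-8, so a body that passes this check and contains accented letters is very likely UTF-8 and should be declared as such.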
