Closed Bug 527140 Opened 15 years ago Closed 14 years ago

Special characters are converted into HTML entities in the translator forms

Categories

(support.mozilla.org :: Localization, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE
Future

People

(Reporter: thomas.lendo, Assigned: jsocol)

References

()

Details

Attachments

(1 file, 6 obsolete files)

If you re-translate an already translated string that contains special characters like Umlauts then SUMO changes the special characters to HTML entities. E.g. "ö" turns into "ö".

I tested this issue only with German Umlauts but maybe this is a more general problem.

Steps to reproduce:
1) Translate a string with the online translator. The translation should contain a German Umlaut. (e.g. "2) Get Personal Help" -> "2) Holen Sie sich persönliche Hilfe" at https://support.mozilla.com/de/kb/)
2) Submit the translation. All's well.
3) Reopen the same string with the online translator. You will see that Umlaut is shown as HTML entity. If you save this then the entity will be shown on the page instead of the correct character.
Target Milestone: --- → 1.6
Please fix this bug asap. I have to correct broken strings of the SUMO interface with ü's and ä's every day.
Nominating for 1.5.1 per comment 1.
Target Milestone: 1.6 → 1.5.1
Attached patch escape only single quotes (obsolete) — Splinter Review
This should be okay security-wise, I think.
Assignee: nobody → paulc
Attachment #418302 - Flags: review?(james)
Comment on attachment 418302 [details] [diff] [review]
escape only single quotes

The 'quotes' option does not HTML encode single quotes, it adds a slash to them.

I'm wondering if, rather than only escaping single quotes, setting the fourth argument (double-encode?) of htmlentities() and htmlspecialchars() to false is safe. I didn't see any examples where it wasn't but I'm not 100% on this. The only way I think it could be a problem is if there was actually an entity in the translation. Does that occur?
Attachment #418302 - Flags: review?(james) → review-
By "an entity in the translation" I mean, for example, typing the following into the interactive translator:

"This & That"

rather than typing

"This & That"
Attached patch double-encode=false (obsolete) — Splinter Review
Here's my example implementation of this with double-encode set to false.

My concern isn't really the translations, but other places where this filter is used. The code is so schizophrenic with it's escaping that I'm wary of making a change like this.

Maybe this could be implemented as a new mode for the {$var|escape} modifier?
(In reply to comment #6)
> Maybe this could be implemented as a new mode for the {$var|escape} modifier?
Yeah I was thinking that too. It doesn't seem like the risk of modifying something used site-wide is worth the gain for this instance alone.
Attachment #418302 - Attachment is obsolete: true
Attachment #419952 - Flags: review?(james)
Attachment #418448 - Attachment is obsolete: true
Comment on attachment 419952 [details] [diff] [review]
adds new escape modifier which avoids double-encoding

WFM, thx.
Attachment #419952 - Flags: review?(james) → review+
r59233
To test this:
1. Your user language (in "My account") must be set to something other than english (e.g. German)
2. Toggle interface translation on.
3. Translate a string with special characters (e.g. for German, the sidebar menu on the homepage -- see comment 0)
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Summary: Special characters are converted into HTML entities → Special characters are converted into HTML entities in the interactive translator
I translated a string into fake French, including an ampersand, then opened the translator again. The special characters are typically just UTF-8 characters.
Thomas, would you be interested in helping us test this out on http://support-stage.mozilla.org/de/kb/?
Hi Stephen. I've translated something at the start page of the staging site and in the language table. It seems to work fine.

But if you search for a string in the language table, the string in the search field will be false encoded after search:

1) Go to https://support-stage.mozilla.org/tiki-edit_languages.php?locale=en-US
2) Select "de" (German) as language to edit
3) Click "Edit language table"
3) Search for "Menüpfad" in the search field under "Edit translations:"
4) The search result list shows right encoding of the string, but the string in the search field is false encoded - see my attached screenshot
Attached image Remaining problem (obsolete) —
Thomas, if you replace the ü with a regular, unicode "ü" and save it, does the problem go away on subsequent edits?
James, tthe translation itself is properly encoded. The problem exists only in the search text field above the translation table.

If I click "translate", the translation itself will be saved and displayed properly. But the search text field entry will turn from "Menüpfad" to "Menüpfad".
Paul, would you mind patching up the tiki-edit_languages.php template to fix this? Ideally if I put in Unicode characters, it should just output Unicode.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Summary: Special characters are converted into HTML entities in the interactive translator → Special characters are converted into HTML entities in the translator forms
Attached patch fix search field double-encoding (obsolete) — Splinter Review
This should do it.
James: should we flag this as tiki_bug for upstreaming the smarty filter?
Attachment #421319 - Flags: review?(james)
Looking at this a little more closely, I think our approach was flawed in the first place.

For some reason, the filter defaults to ISO-8859-1 (webroot/lib/smarty/libs/plugins/modifier.escape.php:22). Rather than adding a new type to the filter, I think we should consider changing the filter to default to UTF-8.

I've also noticed that webroot/tiki-interactive_trans.php uses htmlentities() on new translations, instead of htmlspecialchars(). So we can type ü into the interactive translator, but it gets saved as 'ü'. The template escaping is then (mostly) working as intended (except for the character set thing.

The search field double-encoding is also a case of htmlentities() where htmlspecialchars() should probably be used. (tiki-edit_languages.php:199)

Laura, would that switch from htmlentities() to htmlspecialchars() cause any problems you can think of? It seems to me we should be encoding only htmlspecialchars(), and only on the way out (ie, not saving encoded entities in the database). Does that seem right?

If so, I can file a new patch to back out the first fix, and make the correct changes here.
Comment on attachment 421319 [details] [diff] [review]
fix search field double-encoding

Per comment 19.
Attachment #421319 - Flags: review?(james) → review-
It should definitely be encoded on the way out, and I can't see any reason not to use htmlspecialchars(), probably with doubleencode turned off.
Attached patch revert attachment 419952 (obsolete) — Splinter Review
This reverts the previous fix. Then we can apply James' initial patch, which fixes both problems (I will un-obsolete it).
Attachment #421510 - Flags: review?(james)
Attachment #418448 - Attachment is obsolete: false
Attachment #421510 - Flags: review?(james) → review+
r59810 (reverts)
This patch is not a 100% solution--that's very difficult considering the circumstances--some HTML in interactive translator blocks, already-escaped strings in the database, etc--but this will help.

This just replaces htmlentities in three places with htmlspecialchars. Things like <>& that were already being encoded will still be encoded, but things like ü will not be.

When editing a translation via the interactive translator, the string will be encoded: that's how it's stored in the database. However, when you change the entities back to characters--which you have to do now, anyway--they will stay as characters in the database this time.

Not perfect, but it works for now. We can revisit this problem when we look at the l10n workflow this year.
Assignee: paulc → james
Attachment #418448 - Attachment is obsolete: true
Attachment #419952 - Attachment is obsolete: true
Attachment #420892 - Attachment is obsolete: true
Attachment #421319 - Attachment is obsolete: true
Attachment #421510 - Attachment is obsolete: true
Attachment #421976 - Flags: review?(paulc)
Comment on attachment 421976 [details] [diff] [review]
replace htmlentities with htmlspecialchars

The first part of this bug worked for me with this patch. I.e. using the interface translator, strings now display on edit just as they do as they are saved. This is great.

However, the second part didn't work. Searching for "Menüpfad" in tiki-edit_languages.php does not return any results -- because the translation is stored already-escaped in the database. However, even when I tried searching for "Men&uuml;pfad", it was double escaped into "Men&amp;uuml;pfad" and, again, no results were returned.

To fix this, maybe disable double-encoding.

Bottom line however, I don't like the fact that translators have to search for the escaped version of the string with this patch. We could fix this by running a script to re-encode all of the translated strings (~125K rows in tiki_language). What do you think, James?
Attachment #421976 - Flags: review?(paulc) → review-
Target Milestone: 1.5.1 → Future
Status: REOPENED → RESOLVED
Closed: 15 years ago14 years ago
Resolution: --- → INCOMPLETE
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: