Tags Unicode Problem

RESOLVED FIXED in 4.x (triaged)

Status

P2
normal
RESOLVED FIXED
9 years ago
3 years ago

People

(Reporter: barisderin, Assigned: davedash)

Tracking

unspecified
4.x (triaged)
x86
Windows XP

Details

(Whiteboard: [z])

(Reporter)

Description

9 years ago
User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.1.4) Gecko/20091016 Firefox/3.5.4
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.1.4) Gecko/20091016 Firefox/3.5.4

Tag strings are shown as raw UTF-8 encoded bytes. They need to be decoded and shown as Unicode characters.

Reproducible: Always

Steps to Reproduce:
1. Visit https://addons.mozilla.org/en-US/firefox/addon/11448/
2. On the right of the page, below the Tags section, you will notice the UTF-8 encoded characters in the tags.
3. Instead, they need to be shown as Unicode.
Actual Results:
Tags contain UTF-8 encoded characters.

Expected Results:
Tags need to be shown in Unicode.
My guess is those strings are double encoded somewhere along the line, but I don't think it's a bug on AMO.  There are plenty of examples of UTF-8 strings working properly:

https://addons.mozilla.org/en-US/firefox/tag/搜索
https://addons.mozilla.org/en-US/firefox/tag/शब्दकोश
https://addons.mozilla.org/en-US/firefox/tag/فارسي

Are these tags that you added?  If so, can you tell me what you were trying to add?
(Reporter)

Comment 2

9 years ago
Those tags were working fine until about a week ago. They were in Unicode. The second and third tags were:

# Ekşi
# ekşi sözlük

but they suddenly turned into a UTF-8 encoded form, as below:

# EkÅŸi
# ekşi sözlük 

So the problem seems specific to these tags, but something on the server or in the database caused the change from Unicode to the encoded form. Just wanted to report it.
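The garbling described above is consistent with UTF-8 bytes being decoded once as a single-byte charset. A minimal sketch, assuming Windows-1252 as the wrong charset (MySQL's latin1 is based on it); the helper names `garble` and `repair` are hypothetical and not part of AMO:

```python
def garble(text: str) -> str:
    """Encode as UTF-8, then wrongly decode the bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def repair(text: str) -> str:
    """Undo one round of the mis-decoding above."""
    return text.encode("cp1252").decode("utf-8")


if __name__ == "__main__":
    broken = garble("Ekşi")
    print(broken)          # the two-character garbage that replaces "ş"
    print(repair(broken))  # round-trips back to the original tag
```

Each non-ASCII character becomes two or more Latin characters, which matches what the tag list shows.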
davedash: fall out from the db conversion?
Interesting... I'll dig into this... the mysql changes we made would be a likely candidate for this not working.
Wil - So I ran,

UPDATE tags SET tag_text = CONVERT(CONVERT(CONVERT(tag_text USING latin1) USING binary) USING utf8) WHERE LENGTH(tag_text) > CHAR_LENGTH(tag_text) AND id = 969;

ERROR 1062 (23000): Duplicate entry 'ekşi sözlük' for key 2

Which means we already have a tag where the encoding is correct.

We've got 3 options:

* Remove all these from the tags table (and the user_addons_tags)
* Fix them case by case as they come up, like in this case.
* Write a script to attempt to fix them, but I think that might be time-consuming
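For illustration only, the third option might look roughly like the sketch below. It works on an in-memory mapping of tag id to text rather than the real tags table, compares strings exactly (MySQL's unique key compares by collation, so real collisions could be broader), and assumes Windows-1252 as the wrong charset. The function names and every id except 969 are hypothetical.

```python
def try_repair(text):
    """Return the repaired text if it looks double encoded, else None."""
    try:
        fixed = text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None  # not a clean mis-decoding, leave it alone
    return fixed if fixed != text else None


def plan_repairs(tags):
    """Split broken tags into safe updates and collisions with existing tags.

    tags: dict mapping tag id -> tag text.
    Returns (updates, collisions): updates maps id -> repaired text,
    collisions maps broken id -> id of the tag that already holds that text.
    """
    existing = {text: tag_id for tag_id, text in tags.items()}
    updates, collisions = {}, {}
    for tag_id, text in tags.items():
        fixed = try_repair(text)
        if fixed is None:
            continue
        if fixed in existing:
            collisions[tag_id] = existing[fixed]  # would raise a duplicate-key error
        else:
            updates[tag_id] = fixed
            existing[fixed] = tag_id
    return updates, collisions


if __name__ == "__main__":
    sample = {
        969: "ekşi sözlük".encode("utf-8").decode("cp1252"),
        970: "Ekşi".encode("utf-8").decode("cp1252"),
        1200: "ekşi sözlük",
    }
    print(plan_repairs(sample))
```

Collisions are the cases that need a decision (merge or delete), which is where most of the effort for such a script would go.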

I'm leaning toward the second option, as this is something that anybody can quickly do if they see the tags on their addon - or if they tagged an addon, they can just remove it and retype the tag again.

Baris, feel free to do that now, and we can see if it's worth it for the other tags.
Status: UNCONFIRMED → NEW
Ever confirmed: true
(Reporter)

Comment 6

9 years ago
Well, I removed them and added:

# ekşi sözlük 
# ekşi

tags, but they were converted to

# Eksi Sozluk
# Eksi

Has UTF-8 support been removed from tags?
UTF8 support is there.  My suspicion is that it's matching an existing tag "eksi sozluk" in the database.

Our tagging system is relatively primitive, so here's what happens:

Someone somewhere tags an addon as:

YAHoö

in the future... anybody who tags something as:

yahoo
Yahoo
yAhoo

will get their tag written as "YAHoö".
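The behaviour described above can be approximated in a few lines. This is a sketch, not AMO's code: `fold` is a hypothetical stand-in for a case- and accent-insensitive comparison (it is not an exact model of any MySQL collation), and `TagStore` is an in-memory substitute for the tags table.

```python
import unicodedata


def fold(text):
    """Strip combining marks and case so 'YAHoö' and 'yahoo' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return without_marks.casefold()


class TagStore:
    """Keeps the first spelling seen for each folded key."""

    def __init__(self):
        self._by_key = {}

    def get_or_create(self, text):
        # An existing row matching under the folded comparison wins,
        # so later spellings are silently replaced by the first one.
        return self._by_key.setdefault(fold(text), text)


if __name__ == "__main__":
    store = TagStore()
    store.get_or_create("YAHoö")
    for attempt in ("yahoo", "Yahoo", "yAhoo"):
        print(attempt, "->", store.get_or_create(attempt))
```

The same mechanism would explain why a newly typed "ekşi sözlük" comes back as an earlier, accent-free spelling.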

Which is actually quite frustrating, since tag casing and diacritics such as umlauts are important to some.

Unfortunately that's a big task and involves some rearchitecting of the tags system... I'll file a bug regarding that.
This issue is tracked in bug 525271
Er, that shouldn't happen.  That's a mysql problem.

SELECT * FROM tags WHERE tag_text = 'YAHoö'

is returning the row with 'yahoo' in it.
That's expected:

http://dev.mysql.com/doc/refman/5.0/en/case-sensitivity.html

You'd need to define the table differently for mysql to treat casing differently.
-> davedash for ideas.  It'd be nice to figure out an answer in 5.5/5.6 at least, even if we don't get it fixed in that timeframe.
Assignee: nobody → dd
Priority: -- → P3
Target Milestone: --- → 5.5
We need to come up with a plan for this
Target Milestone: 5.5 → 5.6
Alright, the plan is to create a second column in the db to hold clean values (after any substitution or normalization, similar to the translations table).

In addition (and more to the heart of this problem) we'll need to ALTER the mysql table to be binary so it doesn't do ridiculous things like comment 9.
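A rough sketch of that plan, under stated assumptions: rows are matched on their exact UTF-8 bytes (standing in for a binary column), and a second `clean` field holds a normalized value. The normalization used here (strip combining marks, casefold) is only a guess at what "clean" would mean, and `Tag`, `TagTable` and `clean_value` are hypothetical names.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    tag_text: str  # exactly what the user typed
    clean: str     # normalized value kept in the second column


def clean_value(text):
    """Assumed normalization: drop combining marks, then casefold."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()


class TagTable:
    def __init__(self):
        self._rows = {}  # keyed by exact UTF-8 bytes, like a binary column

    def add(self, text):
        key = text.encode("utf-8")
        return self._rows.setdefault(key, Tag(text, clean_value(text)))

    def find_exact(self, text):
        """Byte-for-byte lookup: 'YAHoö' no longer matches 'yahoo'."""
        return self._rows.get(text.encode("utf-8"))

    def find_similar(self, text):
        """Lookup through the clean column, for grouping or suggestions."""
        target = clean_value(text)
        return [row for row in self._rows.values() if row.clean == target]


if __name__ == "__main__":
    table = TagTable()
    table.add("yahoo")
    print(table.find_exact("YAHoö"))  # None: no exact row exists
    table.add("YAHoö")
    print([row.tag_text for row in table.find_similar("Yahoo")])
```

Keeping both values lets exact spellings survive while still allowing related tags to be grouped through the clean column.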
Priority: P3 → P2
Whiteboard: [z]
Target Milestone: 5.6 → 4.x (triaged)
I think we fixed this elsewhere:

mysql> SELECT * FROM tags WHERE tag_text = 'yahoo';
+-------+----------+-------------+---------------------+---------------------+------------+
| id    | tag_text | blacklisted | created             | modified            | restricted |
+-------+----------+-------------+---------------------+---------------------+------------+
| 25318 | yahoo    |           0 | 2010-09-21 06:36:30 | 2010-09-21 06:36:30 |          0 |
+-------+----------+-------------+---------------------+---------------------+------------+
1 row in set (0.00 sec)

mysql> SELECT * FROM tags WHERE tag_text = 'YAHoö';
Empty set (0.00 sec)
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED
Product: addons.mozilla.org → addons.mozilla.org Graveyard