Closed Bug 776048 Opened 12 years ago Closed 9 years ago

Localized tag appears spontaneously, cannot be deleted

Categories

(developer.mozilla.org Graveyard :: Editing, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: teoli, Unassigned)

References

Details

(Keywords: regression, Whiteboard: [LOE:5])

Attachments

(2 files)

User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:17.0) Gecko/17.0 Firefox/17.0
Build ID: 20120718030544

Steps to reproduce:

Go to https://developer-new.mozilla.org/en-US/docs/CSS/word-break
Edit it
Delete the tag "CSS Référence"
It disappears from the editing interface
"Save Changes"


Actual results:

Nothing, the "CSS Référence" tag is still there (even after refreshing the cache)


Expected results:

The tag should have been removed.

One specific thing: there also is a "CSS Reference" which is different and has to stay.

Priority: we can likely live with it for launch, but it is annoying when trying to create a correct tag structure.
I created a new page: https://developer-new.mozilla.org/en-US/docs/CSS/break-after

I added several tags, among them "CSS Reference", but not "CSS Référence".

At the end, both are in the list.


Maybe the tag system believes that both are the same (é = e) ?
Summary: KUMA: Cannot delete some tag → KUMA: Cannot delete some tag and spontaneous creation of it
The spontaneous creation of it is a regression: it wasn't the case on articles edited earlier in the week.
Keywords: regression
Priority: -- → P1
I think this one was fixed just before launch. Luke probably forgot to R/F it.

Anyway, this has been OK for more than a week.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Version: Kuma → unspecified
Component: Docs Platform → Editing
This just started to happen again.
https://developer.mozilla.org/en-US/docs/Web/API/Range.Range

Pretty sure it was OK last weekend, so something has regressed recently.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Oh, the tag on that page is Reference (which makes Référence appear and be impossible to remove).
I'm afraid the regression is older: July 25th or earlier.
I see the problem here: https://developer.mozilla.org/en-US/docs/Web/CSS/border-image-source (and last edit is in May :-( )

So either we forgot to fix it last time, or the regression is that old.
Summary: KUMA: Cannot delete some tag and spontaneous creation of it → Some tags cannot be deleted
John: they are also spontaneously created. Add "Reference" and you'll get "Reference" and "Référence"...
Thanks for the clarification.
Summary: Some tags cannot be deleted → Localized tag appears spontaneously, cannot be deleted
I don't have a fix yet, but here's a potential lead:

mysql> select * from  wiki_documenttag where name = 'CSS Reference';
+-------+---------------+-----------------+
| id    | name          | slug            |
+-------+---------------+-----------------+
|     5 | CSS Reference | css-reference   |
| 15943 | CSS Référence | css-reference_1 |
+-------+---------------+-----------------+
2 rows in set (0.00 sec)

Apparently, the database is configured such that accented characters like this can be fuzzily matched. More investigation needed.
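
To illustrate the suspected fuzzy matching, here is a minimal Python sketch. The `accent_fold` helper is hypothetical and only approximates an accent- and case-insensitive collation; it is not MySQL's actual comparison algorithm. The two rows are the ones from the query output above.

```python
import unicodedata


def accent_fold(text):
    """Strip combining marks and case: a rough stand-in for an
    accent- and case-insensitive collation (not MySQL's real algorithm)."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


# Rows taken from the wiki_documenttag query output above.
rows = [(5, "CSS Reference"), (15943, "CSS Référence")]


def lookup(name):
    """Return the ids of all rows whose name matches under accent folding."""
    wanted = accent_fold(name)
    return [row_id for row_id, row_name in rows if accent_fold(row_name) == wanted]


matches = lookup("CSS Reference")
print(matches)
```

Under a comparison like this, a lookup for one tag name returns both rows, which would be consistent with the unaccented tag dragging the accented one along.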
Pinging :ubernostrum and :jezdez for some Django-expert second opinions. I think there are two things that could fix this issue:

1) We need to change to a more strict collation setting for the database, so something like this:

    alter table wiki_documenttag convert to character set utf8 collate utf8_bin;

2) But, that breaks the site unless we upgrade the MySQLdb python module to 1.2.4 (currently at 1.2.3c1)

Upgrading MySQLdb and then altering the DB collation are best done as IT tasks, assuming a MySQLdb upgrade doesn't break anything.
Flags: needinfo?(jezdez)
Flags: needinfo?(jbennett)
(In reply to Les Orchard [:lorchard] from comment #13)
> Pinging :ubernostrum and :jezdez for some Django-expert second opinions. I
> think there are two things that could fix this issue:
> 
> 1) We need to change to a more strict collation setting for the database, so
> something like this:
> 
>     alter table wiki_documenttag convert to character set utf8 collate
> utf8_bin;
> 
> 2) But, that breaks the site unless we upgrade the MySQLdb python module to
> 1.2.4 (currently at 1.2.3c1)
> 
> Upgrading MySQLdb and then altering the DB collation are best done as IT
> tasks, assuming a MySQLdb upgrade doesn't break anything.

Can you elaborate how this breaks the site? Looking at the Django docs (https://docs.djangoproject.com/en/1.5/ref/databases/#collation-settings), the switch in behavior for binary data was done in MySQLdb 1.2.2, also see https://github.com/farcepest/MySQLdb1/blob/master/HISTORY.
Flags: needinfo?(jezdez) → needinfo?(lorchard)
(In reply to Jannis Leidel [:jezdez] from comment #14)

> Can you elaborate how this breaks the site? Looking at the Django docs
> (https://docs.djangoproject.com/en/1.5/ref/databases/#collation-settings),
> the switch in behavior for binary data was done in MySQLdb 1.2.2, also see
> https://github.com/farcepest/MySQLdb1/blob/master/HISTORY.

Yeah, actually I was surprised because I thought I'd read that the binary fix was done in that version. That's part of why I ping'd you & james, because something seems odd here.

When I altered the table, admin and doc pages with tags threw unicode exceptions (eg. default ascii encoding yadda yadda). I should have copied the errors, since it's now fixed on my dev VM & will take some finagling to reproduce again.

After upgrading to MySQLdb 1.2.4, though, the errors went away and this bug was fixed.
Flags: needinfo?(lorchard)
Here we go... I downgraded to MySQLdb 1.2.3c1 and added a French language tag to a document. Not sure whether it's in the url() helper or the link text itself, but here's the error I got:

Traceback (most recent call last):
  File "/usr/lib/python2.7/wsgiref/handlers.py", line 85, in run
    self.result = application(self.environ, self.start_response)
  File "/vagrant/vendor/src/django/django/contrib/staticfiles/handlers.py", line 67, in __call__
    return self.application(environ, start_response)
  File "/vagrant/vendor/src/django/django/core/handlers/wsgi.py", line 241, in __call__
    response = self.get_response(request)
  File "/vagrant/vendor/src/django/django/core/handlers/base.py", line 179, in get_response
    response = self.handle_uncaught_exception(request, resolver, sys.exc_info())
  File "/vagrant/vendor/src/django/django/core/handlers/base.py", line 111, in get_response
    response = callback(request, *callback_args, **callback_kwargs)
  File "/vagrant/vendor/src/django/django/views/decorators/csrf.py", line 77, in wrapped_view
    return view_func(*args, **kwargs)
  File "/vagrant/vendor/src/django/django/views/decorators/http.py", line 41, in inner
    return func(request, *args, **kwargs)
  File "/vagrant/apps/wiki/views.py", line 219, in _added_header
    response = func(request, *args, **kwargs)
  File "/vagrant/apps/authkeys/decorators.py", line 35, in process
    return func(request, *args, **kwargs)
  File "/vagrant/apps/wiki/views.py", line 175, in process
    return func(request, *args, **kwargs)
  File "/vagrant/vendor/src/django/django/views/decorators/http.py", line 147, in inner
    response = func(request, *args, **kwargs)
  File "/vagrant/vendor/src/django/django/db/transaction.py", line 224, in inner
    return func(*args, **kwargs)
  File "/vagrant/apps/wiki/views.py", line 571, in document
    response = render(request, 'wiki/document.html', data)
  File "/vagrant/vendor/src/django/django/shortcuts/__init__.py", line 44, in render
    return HttpResponse(loader.render_to_string(*args, **kwargs),
  File "/vagrant/vendor/src/django/django/template/loader.py", line 176, in render_to_string
    return t.render(context_instance)
  File "/vagrant/vendor/src/jingo/jingo/__init__.py", line 191, in render
    return super(Template, self).render(context_dict)
  File "/usr/local/lib/python2.7/dist-packages/jinja2/environment.py", line 891, in render
    return self.environment.handle_exception(exc_info, True)
  File "/vagrant/apps/wiki/templates/wiki/document.html", line 17, in top-level template code
    {% set help_link = url('wiki.translate', document_path=document.parent.full_path, locale=document.parent.locale)|urlparams(tolocale=request.locale) %}
  File "/vagrant/apps/wiki/templates/wiki/base.html", line 17, in top-level template code
    {% set scripts = ('wiki',) %}
  File "/vagrant/templates/base.html", line 153, in top-level template code
    {% block content %}{% endblock %}
  File "/vagrant/apps/wiki/templates/wiki/document.html", line 289, in block "content"
    <a class="text tagit-label" href="{{url('wiki.tag', tag.name)}}">{{ tag.name }}</a>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 5: ordinal not in range(128)
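
One possible reading of this error, not confirmed in this bug: the tag name reached the template as a UTF-8 byte string instead of a unicode string, and Python 2 then tried to decode it implicitly as ASCII. The sketch below reproduces that kind of failure in Python 3, where the decode has to be explicit. The byte string and the `decode_as_ascii` helper are assumptions for illustration only.

```python
# Assumption: the tag name arrives as UTF-8 bytes rather than as text.
raw_name = "CSS Référence".encode("utf-8")


def decode_as_ascii(data):
    """Try an ASCII decode and describe where it fails, if it does."""
    try:
        return data.decode("ascii")
    except UnicodeDecodeError as exc:
        return "failed at byte offset %d (0x%02x)" % (exc.start, data[exc.start])


ascii_result = decode_as_ascii(raw_name)  # the first byte of "é" is not ASCII
utf8_result = raw_name.decode("utf-8")    # decoding as UTF-8 works
print(ascii_result)
print(utf8_result)
```

For this particular tag name, the failing byte and offset line up with the ones in the traceback, which makes a bytes-versus-unicode mix-up a plausible lead.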
This is going stale on the board. Any objections to me filing the IT bugs to get MySQLdb upgraded and to make this database change? Going once... going twice...
Flags: needinfo?(jbennett)
So, it looks like we won't get a MySQL module upgrade any time soon. Still, I'm confused - from everything I've read, 1.2.3c1 *should* already be fixed with regard to the database collation change. But, on my dev VM, the site breaks per comment 16.

Is there another way to fix this that doesn't involve upgrading the DB module?
OMG - does that mean we can finally fix this!?
:sheppy: I see no activity, what makes you think that?
Flags: needinfo?(eshepherd)
(In reply to Jean-Yves Perrier [:teoli] from comment #21)
> :sheppy: I see no activity, what makes you think that?

The blocker (bug 905362) has been fixed. I got email on that last night.
Flags: needinfo?(eshepherd)
Oh, this is one of the most annoying things we have.

This week I had to clean 50 pages because a contributor mistakenly created a tag "Sélécteurs" that propagated very quickly as similar pages carrying the correct tag "Sélecteurs" were edited.

Two years of this bug means that we have more than 2200 articles to clean up:
https://developer.mozilla.org/en-US/docs/tag/R%C3%A9f%C3%A9rence
and more generally https://developer.mozilla.org/<locale_different_than_fr>/docs/tag/R%C3%A9f%C3%A9rence

Other occurrences are minor and we can deal with them manually once the core bug is fixed.

Flagging :hoosteeno so that he can prioritize this quite highly, including the cleaning of the articles.
Flags: needinfo?(hoosteeno)
(In reply to Jean-Yves Perrier [:teoli] from comment #23)

> Flagging :hoosteeno so that he can prioritize this quite highly, including
> the cleaning of the articles.

I think the main remaining step is to apply the ALTER TABLE statement in comment 13:

    alter table wiki_documenttag convert to character set utf8 collate utf8_bin;

But, as a dry run: update Kuma dependencies to match MySQLdb 1.2.5, grab a recent DB dump, try the ALTER TABLE in a dev VM, spot check & tests.

Then, a dev could make this happen for dev/stage/prod in a DB migration. But, I think it would be better to have someone from IT (and/or maybe sheeri) on deck to do the ALTER manually instead, and possibly help recover if something goes awry.

Then, for doc clean up, I suspect this could be a pretty quick SQL query: Just delete all the rows associating "CSS Référence" with documents
(In reply to Les Orchard [:lorchard] from comment #24)

> Then, for doc clean up, I suspect this could be a pretty quick SQL query:
> Just delete all the rows associating "CSS Référence" with documents

Just delete all the rows associating "CSS Référence" with *non-fr* documents ;-)
This will be considered as part of bug 1032321.
Flags: needinfo?(hoosteeno)
Priority: P1 → P2
Whiteboard: [LOE:?]
5 days minimum for the devops work mentioned in https://bugzilla.mozilla.org/show_bug.cgi?id=776048#c24 because these are very low level changes that could affect many things above.

As for automatically cleaning up ... we cannot automatically remove all translated tags from erroneous documents. :( Our only way to find translated tags is based on slug matching:

select slug from wiki_documenttag where slug in (select left(slug, length(slug)-2) as check_slug from wiki_documenttag);
(See attachment: translated_tags_slugs.txt)

select name, slug from wiki_documenttag where slug in ('xpcom', 'xpcom_1');
(See attachment: translated_tag_example.txt)

But even that's not 100% accurate, because there are some tags with partial slug matches that *aren't* translations. E.g., '.userconfig' (slug userconfig) and 'userconfig' (slug userconfig_1).
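
As a rough Python sketch of the slug heuristic and its false positive: the tag rows below are the ones quoted in this bug, while the `fold` filter is a hypothetical refinement, not something Kuma is known to implement.

```python
import unicodedata

# (name, slug) pairs quoted in this bug.
tags = [
    ("CSS Reference", "css-reference"),
    ("CSS Référence", "css-reference_1"),
    (".userconfig", "userconfig"),
    ("userconfig", "userconfig_1"),
]


def fold(text):
    """Hypothetical accent- and case-insensitive comparison key."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()


def slug_pairs(rows):
    """Mirror the SQL idea: pair each slug with the slug that equals it
    minus its last two characters, when such a slug exists."""
    by_slug = {slug: name for name, slug in rows}
    return [(by_slug[slug[:-2]], name)
            for slug, name in by_slug.items() if slug[:-2] in by_slug]


pairs = slug_pairs(tags)
# Keep only the pairs whose names differ by accents or case alone.
likely_translations = [p for p in pairs if fold(p[0]) == fold(p[1])]
print(pairs)
print(likely_translations)
```

Slug matching alone flags both pairs. Comparing the names under accent folding drops the '.userconfig' pair, but it would also miss real translations whose names differ by more than accents.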

So, I would add LOE:1 for partial automatic data cleanup. Surely if we spent a whole day with the SQL we could find a way to clean at least some stuff up.

But the first step is to convert wiki_documenttag to the utf8_bin collation (the ALTER TABLE in comment 24). That will at least stop the problem from getting worse. And that's LOE:5.
Whiteboard: [LOE:?] → [LOE:5]
I'm not sure I understand what these requests are about…

What we need is to remove the duplicated tags that got inserted in the tag lists of articles, like Elément on non-fr pages and Element on en-US pages.
We can try to make a list of about 5-10 of these tags that will cover 90-95% of all the problems. The rest can be done manually. 

We don't want to remove tags from the DB, only their use on some articles where they got added by this bug.
Ah, yeah. If we have a list of specific translated tags, we can probably clean them out after the ALTER TABLE statement. That should let us query for the tags by translated *name* rather than slug. A list of 5-10 tags on this bug will be great.
Setting needinfo on myself to track that I have to build this list.
Flags: needinfo?(jypenator)
We may need to involve subject matter experts in the various languages. For example, the Japanese language pages seem to use English language tags but I'd first want to confirm that. Can we assume the French language tags that now appear on them were introduced by bug 776048? For example, https://developer.mozilla.org/ja/docs/Web/HTML/Element/frame is tagged with "Élément","Element","Reference","Référence","Web","HTML". That page only has one revision. I assume it was the result of a translation effort and that person did not translate the tags but rather used the same English and French tags as in the source article. At the time of the translation the English language version was tagged "HTML","Web","Reference","Element" while the French article had "HTML". Thus, it appears the translation was from the English edition and that "Élément" plus "Référence" were introduced by bug 776048.

This seems safe:
For US/English and Japanese language pages
    Report and review pages tagged only with "Élément" - were those by design or should we replace that with "Element"?
    If the page is tagged for both "Élément" and "Element" then delete "Élément".
    Do the same for "Référence" and "Reference".

For French language pages
    Report and review pages tagged only with "Element" - were those by design or should we replace that with "Élément"?
    If the page is tagged for both "Élément" and "Element" then delete "Element".
    Do the same for "Référence" and "Reference".

I assume the list we need to build is those tags where the non-binary comparison is the same, "Référence" and "Reference" for example, and where both tags appear on at least one article.
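
The rules above could be sketched in Python roughly as follows. This is only an illustration: `PAIRS`, `NON_FRENCH` and `clean_tags` are hypothetical names, and the real cleanup would have to run against the tag associations in the database.

```python
# (unaccented, accented) tag pairs named in the rules above.
PAIRS = [("Element", "Élément"), ("Reference", "Référence")]
NON_FRENCH = {"en-US", "ja"}


def clean_tags(locale, tags):
    """Return (kept_tags, needs_review) for one page.

    If both forms are present, drop the one that does not belong to the
    page's language. If only the wrong form is present, keep it and
    report it for manual review."""
    kept = list(tags)
    review = []
    if locale != "fr" and locale not in NON_FRENCH:
        return kept, review  # the rules above don't cover other locales
    for plain, accented in PAIRS:
        wanted, unwanted = (accented, plain) if locale == "fr" else (plain, accented)
        if unwanted in kept:
            if wanted in kept:
                kept.remove(unwanted)
            else:
                review.append(unwanted)
    return kept, review


# Tags of the ja frame page mentioned above.
ja_kept, ja_review = clean_tags(
    "ja", ["Élément", "Element", "Reference", "Référence", "Web", "HTML"])
fr_kept, fr_review = clean_tags("fr", ["HTML", "Element"])
print(ja_kept, ja_review)
print(fr_kept, fr_review)
```

Nothing is deleted when only one form is present, so the "by design or replace?" question stays with a human reviewer.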
Note: This bug is part of the overall L10n UX project. [1] During #mdndev planning today, we decided we won't be able to fix this bug by the end of September unless we seriously disrupt other projects like GitHub, Welcome Emails, or Compat Data Tables.

So, unless there's a time-sensitive need to fix this by end of September, we won't get back to it until then.

[1] https://wiki.mozilla.org/MDN/Projects/Development/Localization_UX_Improvements
Removing the P2 after a conversation with :teoli -- this bug is still valid, but we're not going to push to fix it in 2014.
No longer blocks: mdn-l10n
Priority: P2 → --
(In reply to Marc Kupper from comment #33)
> For US/English and Japanese language pages
>     Report and review pages tagged only with "Élément" - were those by
> design or should we replace that with "Element"?
As Japanese doesn't have 'Élément' in its vocabulary, it could safely be replaced by 'Element' (while a Japanese translation would probably be better).

>     If the page is tagged for both "Élément" and "Element" then delete
> "Élément".
>     Do the same for "Référence" and "Reference".
This probably applies to all non-French pages. At least I can say that German and Spanish don't have these words with accents.

> For French language pages
>     Report and review pages tagged only with "Element" - were those by
> design or should we replace that with "Élément"?
They should obviously be replaced by their French equivalent 'Élément'.
Btw. there is also another tag named 'Elément', which seems to be misspelled.

> I assume the list we need to build is those tags where the non-binary
> comparison is the same, "Référence" and "Reference" for example, and where
> both tags appear on at least one article.

@teoli, that sounds like a good starting point for your list.

Sebastian
See Also: → 671721
Using the steps to reproduce in comment 0, this bug is fixed:

https://developer.allizom.org/en-US/docs/Chrome_Registration$compare?locale=en-US&to=491360&from=477547

It was fixed as part of the Q1/Q2 MDN code cleanup project, and specifically I think by https://github.com/mozilla/kuma/pull/3073.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 9 years ago
Resolution: --- → FIXED
Oh yes, I confirm.

This is awesome. Really appreciated that it is fixed!
Flags: needinfo?(jypenator)
Yeah, we had to add a custom MySQL collation in bug 1161107 because the code assumed a case-insensitive but Unicode-respecting comparison for tags, which the previous MySQL collation, utf8_general_ci, didn't offer. While it allowed Unicode characters, it failed when doing string comparisons between tags and would basically lead to uniqueness errors when handling tags like "Référence" and "Reference". MySQL just couldn't see a difference between those two (d'oh!). So please file this under "under-specified, ill-developed features" ;)
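
To make the intended behavior concrete, here is a small Python sketch of a comparison key that ignores case but keeps accents distinct. `tag_key` and `add_tag` are hypothetical names; this is not the collation definition added in bug 1161107, only an illustration of the property the code was assuming.

```python
import unicodedata


def tag_key(name):
    """Case-insensitive but accent-sensitive comparison key."""
    return unicodedata.normalize("NFC", name).casefold()


def add_tag(existing, name):
    """Store name unless a tag with the same key already exists;
    return the name that is stored for that key."""
    return existing.setdefault(tag_key(name), name)


tags = {}
add_tag(tags, "Reference")
add_tag(tags, "reference")   # same key as "Reference", not stored again
add_tag(tags, "Référence")   # different key, stored as a separate tag
stored = sorted(tags.values())
print(stored)
```

With a key like this, "Reference" and "reference" collapse into one tag while "Référence" stays a separate one.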

More info about the collation we used here: https://mariadb.com/blog/adding-case-insensitive-distinct-unicode-collation
I can also confirm it's finally working! \o/

I remember a discussion earlier (with Luke, Jean-Yves and others I guess?) whether the localized tags could be removed automatically. Is there already a bug filed for that?

Sebastian
Flags: needinfo?(jypenator)
We have the ability to delete (or rename) a tag globally (any admin can do it). Just ping me.

Rename works, but no merge possible.
Flags: needinfo?(jypenator)
See Also: → 1159930
(In reply to Justin Crawford [:hoosteeno] [:jcrawford] from comment #37)
> Using the steps to reproduce in comment 0, this bug is fixed:
> 
> https://developer.allizom.org/en-US/docs/
> Chrome_Registration$compare?locale=en-US&to=491360&from=477547

ZOMG this is fabulous news! Thanks, guys!
See Also: → 1271552
Product: developer.mozilla.org → developer.mozilla.org Graveyard