Closed Bug 1206820 Opened 9 years ago Closed 9 years ago

Tags are duplicated when other case variants exist

Categories

(developer.mozilla.org Graveyard :: General, defect)

All
Other
defect
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: sphinx_knight, Assigned: robhudson)

References

Details

(Keywords: in-triage, Whiteboard: [specification][type:bug])

Attachments

(2 files)

What did you do? ================ 1. Went to https://developer.mozilla.org/fr/docs/Web/JavaScript/Reference/Instructions/import 2. I updated the page's content only (no touch on tags) 3. I saved the page What happened? ============== New tags appeared which are case-variants of existing tags. In other words, tags are duplicating themselves if a case-variant exists. What should have happened? ========================== The tags should have been left untouched since no edit occured here. Tags should be left case sensitive or else non-proper tags (like "Javascript" without capital S) will "poison" some sections. Is there anything else we should know? ====================================== This happened at least with tags like - JavaScript (which duplicated into "Javascript" and "javascript") - Guide (which duplicated into "guide") This might somehow be related to https://bugzilla.mozilla.org/show_bug.cgi?id=776048
Summary: Tags are duplicated when other case insensitive variants exist → Tags are duplicated when other case variants exist
"Howto" generates "HowTo" and "HOWTO". If you delete them, they come back. Zombie tags! Ayeee!
This may be a regression of when the diacritics problem has been fixed. Tags weren't case sensitive at some point in the past.
Severity: normal → major
Keywords: in-triage
Watching the SQL as I edit a page with tags I'm seeing why this is happening. The fix, while straightforward in the SQL, isn't so straightforward in the code. When a revision is saved it checks the tags for changes and if there are changes it deletes all tags and re-adds them. This seems like a simple way to ensure the current list of tags provided in the form matches what's in the database versus figuring out what was added or removed and updating the tags in a more precise way. The problem is it queries the tag table to get the tag IDs using `IN`. For example: SELECT `wiki_documenttag`.`id`, `wiki_documenttag`.`name`, `wiki_documenttag`.`slug` FROM `wiki_documenttag` WHERE `wiki_documenttag`.`name` IN ('Web', 'JavaScript', 'Reference'); Because of our database settings string comparisons are case insensitive. The above query finds a tag that matches both 'JavaScript' and 'Javascript': +--------+-------------+---------------+ | id | name | slug | +--------+-------------+---------------+ | 42 | Reference | reference | | 279 | JavaScript | javascript_2 | | 2964 | Web | web_2 | | 636703 | Javascript | javascript_10 | +--------+-------------+---------------+ Based on this list it then adds the tag ID to the document (by creating a tag<->document record in a related table). There are probably a few ways to fix this... One would be to use BINARY string comparisons so they are case sensitive. But this is trickier than it should be because the source of this query is in a 3rd party library and using the Django ORM where it isn't simple to change the underlying SQL. Another might be to simply delete the "Javascript" tag so the above query doesn't find it using case insensitive string comparisons. Note: This is deleting the tag itself, not the tag<->revision record, which would just re-add it b/c it does a full deletion and re-adding based on the provided tags in the form. If we do this we need to be careful to clean tag<->revision records that use the tag being deleted. I'm curious how the "Javascript" tag got created in the first place if we intend to always use "JavaScript" instead. I'd have to dig further to see how new tags are created and why this was allowed to happen.
We automatically add tags to the table when users add them to pages. So unless/until we use case sensitive string comparisons, the app will re-create the dupe tags into the wiki_documenttag table. I just verified on stage: mysql> select * from wiki_documenttag where name in ('JavaScript'); +--------+------------+---------------+ | id | name | slug | +--------+------------+---------------+ | 279 | JavaScript | javascript_2 | | 636457 | javascript | javascript_10 | +--------+------------+---------------+ 2 rows in set (0.02 sec) mysql> delete from wiki_documenttag where id = 636457; Query OK, 1 row affected (0.01 sec) mysql> select * from wiki_documenttag where name in ('JavaScript'); +-----+------------+--------------+ | id | name | slug | +-----+------------+--------------+ | 279 | JavaScript | javascript_2 | +-----+------------+--------------+ 1 row in set (0.00 sec) ======= <Added 'javascript' tag to https://developer.allizom.org/en-US/docs/Web/HTTP/Access_control_CORS via the site UI.> ======= mysql> select * from wiki_documenttag where name in ('JavaScript'); +--------+------------+---------------+ | id | name | slug | +--------+------------+---------------+ | 279 | JavaScript | javascript_2 | | 636459 | javascript | javascript_10 | +--------+------------+---------------+ 2 rows in set (0.01 sec) Could we put an extra tag-checking query in our view code before the 3rd-party library gets to it?
Flags: needinfo?(robhudson)
(In reply to Luke Crouch [:groovecoder] from comment #6) > Could we put an extra tag-checking query in our view code before the > 3rd-party library gets to it? I dug around some more... Taggit has an option settings.TAGGIT_CASE_INSENSITIVE which will iterate over all tags and do an "iexact" query. I don't think this is what we want. We want case-sensitive tags. Regardless I tried it because I don't think there's a way to do case-sensitive "IN" queries. It ended up resulting in a 500 error b/c it performed a LIKE query on each tag, and "JavaScript" results in ["JavaScript", "Javascript"] being returned to the `get()` query. There are many bugs in both Django and django-taggit about collation and broken MySQL case-sensitive queries. The only possible solution I can think of now is to write our own tag manager class and use custom SQL to do what we need here. This sounds fragile to me, however.
Flags: needinfo?(robhudson)
After a conversation on IRC with :jezdez and :groovecoder it was decided that the only fix here is considered a "hack" and not something we want. We feel that the tagging system should be rewritten to support the wanted features (unicode-insensitive, case-sensitive tagging, preferably with localized tags). Our options are either: 1. Wait for the Postgresql transition, which at least solves the unicode-insensitive, case-sensitive problems, but not the localized tags. 2. Rewrite the tagging system or find an alternative solution that fits our needs. Julien: How urgent would you consider this bug? Is it something we can work around for awhile until one of those two options can be worked on?
Flags: needinfo?(sphinx_knight)
Thanks for the feedback. Actually, if I recall correctly (and I am no admin) there is a tool to "nuke" a given tag. If this tool exists, then "we" could nuke the most frequent duplicated tags now and it would work for some time but eventually the unwanted tags will re-appear. I don't know if some really rough dashboard of tags exists in the admin panel but it could be handy to have one in the meantime to be able to detect the unwanted tags and nuke them before proliferation. So that would be my workaround before having a more robust solution. (on a personal note, I'd love to have localized tags;)). I'm ni-ing Jean-Yves as he might have more insight than me on this nuking admin panel.
Flags: needinfo?(sphinx_knight) → needinfo?(jypenator)
This is a difficult call. Our tag management system (from an editor perspective) has always being weak. So let's write down some points, to state thing clearly: * Tags are used by search and sidebars to provide correct information and are therefore important. * Ghost tags are (very) annoying editors (they look like an error, we regularly got either bug reported or comments in the Revision Dashboard like "Trying to remove bogus tag"), but doesn't affect search/sidebars. Manually nuking duplicate ghost tags is really an extreme case: I don't really like this, but if there is a (concrete) plan to have the tag system revamped once for all, I could be ok to survive with it for a couple months (but not years). Nuking duplicate tags means going through 69 (!) pages of admin panel manually and selecting the "bad cases", then deleting them. I count 30'- 1h each time, and they reappear quite quickly. [This is not a very good use of my time, and I will not be doing it indefinitively] Our tagging system has quite a few limitations (l10n mostly) that could be worth a complete revamp. I think we don't have enough information to make a call. I would like to know at least 1) What is the schedule of the postgressql migration. 2) What would be the scope and the schedule of a complete revamp. Globally, and tentatively, I would be ready to go with a (somewhat longer) solution that fix tags once for all, as long as we have a concrete committed plan (I'm not in to manually delete tags as a workaround for a regression for an indefinite amount of time). I think we should set up a quick chat to discuss all this. I have the impression that both teams are not sync about long term needs (of both teams: feature for us, maintenance for you). (I'm not available this week though, feel free to set one up for next week).
Flags: needinfo?(robhudson)
Flags: needinfo?(lcrouch)
Flags: needinfo?(jypenator)
Assignee: nobody → robhudson
Flags: needinfo?(robhudson)
Commits pushed to master at https://github.com/mozilla/kuma https://github.com/mozilla/kuma/commit/655bca6e7e85c0f5bee69daf3ddb152465d9698a Bug 1206820: Make .errorlist class available on all pages. https://github.com/mozilla/kuma/commit/cd0fc684a36a6c95a0c6ed94d898ebc4c9cb42ba Merge pull request #3538 from stephaniehobson/1206820-errorlist-styles Bug 1206820: Make .errorlist class available on all pages.
In the short term here's what we're working on. 1. Clean out existing case similar tags. 2. Add form checking to not allow new case similar tags so they won't come back.
Sounds like a plan :-) Do you need us to clean the tags before or after the form checking is introduced? As a side note, this bug has a side effect: it looks like we can save empty revisions in some cases now, as the automatic addition of the "ghost" tag count as a modification: https://developer.mozilla.org/en-US/docs/Web/Tutorials$compare?locale=en-US&to=927925&from=927523
Commits pushed to master at https://github.com/mozilla/kuma https://github.com/mozilla/kuma/commit/e2113bd95d5a9921f4aa4b4be2dedf95e727aa17 Fix bug 1206820 - Don't allow case similar tags https://github.com/mozilla/kuma/commit/55aa55b31df7cb8630a5ee1e43cbabbed9eb2265 Merge pull request #3539 from mozilla/1206820-dupe-tags Fix bug 1206820 - Don't allow case similar tags
Status: UNCONFIRMED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
(In reply to Jean-Yves Perrier [:teoli] from comment #13) > Sounds like a plan :-) > > Do you need us to clean the tags before or after the form checking is > introduced? The tagging python library will still add these tags to pages if they exist in the database, so the sooner we clean them the better. But we can do it little by little also -- focus on the most annoying ones and work our way down the list. In the code that was just merged there is a logging statement to help try to find these tags. I can grep the logs every so often to see what the tags are.
I've made a first pass on the list.
Flags: needinfo?(lcrouch)
Depends on: 1213229
Product: developer.mozilla.org → developer.mozilla.org Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: