Closed Bug 1271552 Opened 9 years ago Closed 8 years ago

Duplicate case-sensitive tags appear spontaneously, cannot be deleted

Categories

(developer.mozilla.org Graveyard :: Tags / flags, defect)

All
Other
defect
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: sebo, Unassigned)

References

Details

(Keywords: in-triage, Whiteboard: [specification][type:bug])

What did you do? ================ 1. Edited https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/@@match 2. Removed duplicate tags of 'JavaScript' using different case sensitivity 3. Savedd the article What happened? ============== The duplicated tags stayed. What should have happened? ========================== Only the 'JavaScript' tag using the correct case-sensitive writing should have been saved. Is there anything else we should know? ====================================== I think this appeared once bug 776048 was fixed, though I don't know if there is any relation between the two.
See Also: → 776048
Tags can be deleted by admins (from all articles). But they shouldn't appear spontanouously. I think this is a regression, Rob Hudson fixed this last December. ni/ rob hudson and kadir as this is pretty severe.
Flags: needinfo?(robhudson)
Flags: needinfo?(a.topal)
Severity: normal → major
Keywords: in-triage
Flags: needinfo?(robhudson)
While updating the user profile page I found myself fighting a bit with the jQuery UI tag plug-in. We also use that on articles. That should be looked at as part of the investigation into this. I think I fixed but just lowercase()ing everything.
The trick we have is that we want the tags to have a specific case on them (so "JavaScript" not "Javascript" or "javascript" or "JAVASCRIPT" but those four works should all be identical and should automatically be replaced with the correct "JavaScript". If a user adds a tag to a page, we need to look to see if a case-insensitive match for that name already exists in the tag list. If so, use that existing capitalization. Otherwise, add the new tag and treat the letter casing used for the new tag as the default for that case-insensitive matching.
(In reply to Eric Shepherd [:sheppy] from comment #5) > If a user adds a tag to a page, we need to look to see if a case-insensitive > match for that name already exists in the tag list. If so, use that existing > capitalization. Otherwise, add the new tag and treat the letter casing used > for the new tag as the default for that case-insensitive matching. The point is that the tags get added automatically when the article is saved or displayed the next time, they are not added by the user. I.e. it's a server-side issue. So, to solve this I suggest: 1. Remove all the incorrectly capitalized tags from the DB. 2. When entering a tag it should be case-insensitively checked against the ones already existing in the list of tags. 2.1 If one already exists, it should be discarded. 2.2 If it doesn't exist, it should be case-insensitively checked against the ones saved in the DB. 2.2.1 If one already exists in the DB, use that one for the list. 2.2.2 Otherwise add it to the DB and the list. Doing so we ensure that only one variant of capitalization of a tag is saved in the DB. Initially incorrectly entered tags need to be corrected by the admins then. Sebastian
Looking into the bug. Some history, mixed with backend gibberish. PR 3539 [1] to fix bug 1206820 was deployed way back in October 2015. It tries to find a match under the utf8_distinct_ci collation rules (which both lowercases the tag and tries to normalize accented characters like á as a), and uses that if available [2]. It looks like the tag matching behavior relies on the collation. However, the unique index for name is not applied in production, so there may be a broken piece to the algorithm. While investigating this, there are several missing indexes (unique or otherwise) in the production database. My best guess is that they were introduced in the Django 1.7 upgrade, and lost in the transition from South to Django migrations. I'm opening a new blocking bug to catalog and manually apply the indexes, which will involve removing some duplicates from the database. [1] https://github.com/mozilla/kuma/pull/3539 [2] https://github.com/mozilla/kuma/blob/e2113bd95d5a9921f4aa4b4be2dedf95e727aa17/kuma/wiki/forms.py#L309-L337
Depends on: 1293749
The missing indexes and schema issues are worse than I thought. The issue most directly causing this problem is the missing UNIQUE KEY index on the name, which means that the tagging code is not seeing "duplicate" tags. Because of the missing UNIQUE KEY index, the custom collation is not helping as intended. The problem affects all the kuma tagging systems (review tags, profile interests, search tags). None have the expected UNIQUE KEY index in production, or in the anonymized database (so fixes can be tested and verified in staff developer VMs). My plan for fixing this is: 1. Detect and remove bad identifiers in the foreign key relations between Documents, DocumentTags, and the TaggedDocument many-to-many table that relate them. Add the CONSTRAINTS needed to enforce these relations in the future. 2. Delete all unused tags 3. Merge tags with duplicate names, using this automated process: - Collect all the tags with the same name - If one tag is used in more documents than the others, then it is the canonical one - If multiple tags are tied for most used, the one with the lowest ID (first created) wins - Replace the duplicate tags in documents with the "winner" - Delete the duplicate losers 4. Add the UNIQUE KEY index in production with a migration 5. Add basic features to the Django admin to see documents using a tag 6. Fix other schema issues in these tables, to synchronize production with the code-based data models This is probably 6 different pull requests, and will just fix these three tables. There's a lot more in bug 1293718. Having stared at the code all day, I don't think the custom collation is necessary. I think it will work, and prevent someone duplicating the 'Firefox' tag with 'firefox', 'FireFox', 'Ḟirefox', or 'ḞḯŘḛfflồⅫ'. But, I think the 'normalization' could be done Django-side as well, extending the taggit models and managers, and freeing us to use standard databases, instead of having to install the custom collation. That's a stretch task though - first, the schema needs to be fixed, and I'm stopping code bandaids to work around the broken schema.
Bug 1293718 is fixed, which means unused tags have been deleted, and "Duplicate" tags, like JAVASCRIPT and javascript, have been de-duplicated. A "preferred" form was chosen, by selecting the one used the most, or the oldest. If the "wrong" one was chosen, the tag name can be fixed in the admin: Document tags: https://developer.mozilla.org/admin/wiki/documenttag/ Other tags: https://developer.mozilla.org/admin/taggit/tag/ (profile interest and expertise tags, search filter tags) Comment 8 is wrong about the custom collation. For the latest, see: https://github.com/mozilla/kuma/pull/3981#discussion_r80166189 With the custom collation utf8_distinct_ci , 'FireFox' will collide with 'firefox' but not with 'Ḟirefox', which is I think what is desired for the current implementation of translated tags. The normalization can still be done Django-side instead, and actual tag translation implemented, but that's another bug. If duplicate tags arises again. please open a new bug and reference this one.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Great news! You made my day, John! \o/ Sebastian
Product: developer.mozilla.org → developer.mozilla.org Graveyard
You need to log in before you can comment on or make changes to this bug.