Change tag names to use utf8_general_ci collation

RESOLVED FIXED

Status

RESOLVED FIXED
a year ago
a year ago

People

(Reporter: jwhitlock, Assigned: jwhitlock)

Tracking

({in-triage})

Details

(Whiteboard: [specification][type:change])

Description

a year ago
What feature should be changed? Please provide the URL of the feature if possible.
==================================================================================
Tag names use a custom collation, utf8_distinct_ci, rather than the utf8_general_ci collation used by other fields and directly supported by MySQL. utf8_distinct_ci was introduced in a MariaDB blog post [1], and was applied in June 2015.

[1] https://mariadb.com/resources/blog/adding-case-insensitive-distinct-unicode-collation

What problems would this solve?
===============================
The custom collation requires customizing the MySQL server, which is not possible with AWS-managed MySQL servers (RDS).

If we can switch to a built-in collation, we gain AWS's monitoring infrastructure, automated updates, fast backups, replication, etc., and do not have to manage the database servers ourselves.

Who would use this?
===================
MDN users, MDN editors, SREs, and MDN developers

What would users see?
=====================
MDN users will see document tags whose accented letters do not match the spelling in their language, such as "Reference" instead of the French "Référence".

MDN editors may try to add tags that differ only in accented characters, and will get the existing alternate spelling instead.

SREs will use standard AWS tools for provisioning, monitoring, and maintaining databases.

MDN developers will have simpler MySQL installs in development environments for development and testing.

What would users do? What would happen as a result?
===================================================
Some MDN users and editors will be upset or angry at incorrectly spelled tags.

SREs will spend much less time installing, monitoring, maintaining, and upgrading database servers.

MDN developers can focus on a true translated tags feature (bug 671721), and upgrading to Postgres (bug 1159930) may be easier. 

Is there anything else we should know?
======================================
I've written much more on this issue in a Google doc:

https://docs.google.com/document/d/1xGeFQuRZa_aJ_obpgKHgGXT-8NQg61lFyD98eKedyz0/edit#

There's a related spreadsheet with all the tags as of today, and some suggested name changes to make them fit in utf8_general_ci:

https://docs.google.com/spreadsheets/d/1QjuT5vj1-yLcVa-XNt9JFHNKR3lKryOd162PkY5SLD4/edit

Updated

a year ago
Assignee: nobody → jwhitlock
Blocks: 110799

Updated

a year ago
Blocks: 1110799
No longer blocks: 110799

Updated

a year ago
See Also: → bug 1159930
Keywords: in-triage

Comment 1

a year ago
Commits pushed to master at https://github.com/mozilla/kuma

https://github.com/mozilla/kuma/commit/f510674c247a64792748a3510f54ee0951acce72
bug 1391084: Disable duplicate tag test

This test was already marked xfail because it fails with an
IntegrityError when run with --no-migrations. After the collation is
changed from utf8_distinct_ci to utf8_general_ci, it will always fail,
because the unique index will disallow the duplicate tag.
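
To illustrate why the duplicate tag is rejected once the collation changes, here is a minimal sketch using an in-memory SQLite database rather than MySQL. The `fold` helper and the `general_ci_approx` collation are assumptions made for this example: they only approximate utf8_general_ci by ignoring case and combining accents, and the `tag` table is a stand-in, not the Kuma schema.

```python
import sqlite3
import unicodedata


def fold(name):
    """Rough stand-in for utf8_general_ci: drop combining accents, ignore case."""
    decomposed = unicodedata.normalize("NFD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


def general_ci_approx(a, b):
    """Three-way comparison in the form sqlite3's create_collation expects."""
    ka, kb = fold(a), fold(b)
    return (ka > kb) - (ka < kb)


def try_duplicate_tag():
    """Insert two tags that differ only by accents into a unique column."""
    conn = sqlite3.connect(":memory:")
    conn.create_collation("general_ci_approx", general_ci_approx)
    # Hypothetical table: the unique index compares names with the collation.
    conn.execute(
        "CREATE TABLE tag (name TEXT COLLATE general_ci_approx UNIQUE)"
    )
    conn.execute("INSERT INTO tag (name) VALUES (?)", ("Reference",))
    try:
        conn.execute("INSERT INTO tag (name) VALUES (?)", ("Référence",))
    except sqlite3.IntegrityError:
        return "rejected"
    finally:
        conn.close()
    return "accepted"
```

Under these assumptions `try_duplicate_tag()` returns "rejected", because the two names compare as equal and the unique index refuses the second row. With a collation that distinguishes accents, as utf8_distinct_ci does, both rows would be accepted.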

https://github.com/mozilla/kuma/commit/bf945397d5e1472feb05645facf1be2245e2517f
bug 1391084: Migrate tags to utf8_general_ci

Split into a data and schema migration that should be run together:

1. Change tags that will collide in utf8_general_ci, by keeping the name of
the oldest tag and adding (2), (3), etc. to the later tag names.

2. Update the schema to use utf8_general_ci for tag names.

Changes around 100 tags, and runs in 8 seconds locally, so no site
downtime expected.
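
The data migration step described above could be sketched roughly as follows. This is not the code from the commit: the `disambiguate` function, the `(tag_id, name)` input shape, the assumption that a lower id means an older tag, and the exact " (2)" suffix format are all hypothetical, and `fold` only approximates how utf8_general_ci compares names.

```python
import unicodedata
from collections import defaultdict


def fold(name):
    """Rough stand-in for utf8_general_ci: drop combining accents, ignore case."""
    decomposed = unicodedata.normalize("NFD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


def disambiguate(tags):
    """Return {tag_id: new_name} for tags that would collide after the change.

    tags is an iterable of (tag_id, name) pairs; a lower id is assumed to
    mean an older tag. The oldest tag in each colliding group keeps its
    name, and later ones get " (2)", " (3)", ... appended.
    """
    groups = defaultdict(list)
    for tag_id, name in sorted(tags):
        groups[fold(name)].append((tag_id, name))

    renames = {}
    for members in groups.values():
        # members[0] is the oldest tag and is left untouched.
        for position, (tag_id, name) in enumerate(members[1:], start=2):
            renames[tag_id] = f"{name} ({position})"
    return renames


example = [(3, "référence"), (1, "Reference"), (2, "Référence"), (4, "Guide")]
print(disambiguate(example))
```

A real migration would also need to check that a generated name such as "Référence (2)" does not itself collide with an existing tag, which this sketch does not handle.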

https://github.com/mozilla/kuma/commit/b63e30855010ea4a984c99663aaf8e00bee70474
Merge pull request #4376 from jwhitlock/utf8_general_ci_1391084

bug 1391084: Switch tag names from utf8_distinct_ci to utf8_general_ci

Comment 2

a year ago
Commits pushed to master at https://github.com/mozilla/kuma

https://github.com/mozilla/kuma/commit/adfe31e4c7b2d0267c1e2209c9331e4ec7e03b90
bug 1391084: Upgrade sqlparse and related reqs

Move sqlparse from constraints to default, because it is needed for
RunSQL data migrations. Also update it and some other requirements:

* sqlparse 0.1.19 → 0.2.3 - Cleanup, refactoring, bug fixes
* django-debug-toolbar 1.4 → 1.8 - sqlparse 0.2 compat, Django 1.11
  compatibility, manual setup required (with code changes)
* hashin 0.9.0 → 0.11.2 - Update how latest version is determined

https://github.com/mozilla/kuma/commit/fc03aad8490991553c149633c2f0280693b6b9ec
Merge pull request #4383 from jwhitlock/upgrade-deps-1391084

bug 1391084: Upgrade sqlparse and related reqs

Comment 3

a year ago
Commits pushed to master at https://github.com/mozilla/kuma

https://github.com/mozilla/kuma/commit/b0f3b3495442a505cdd0921a2fa9a25448f23dcb
bug 1391084: Update sample database resources

* Add the Interactive Editor content experiment
* Add waffle flags wiki_samples, redesign_beta, redesign_live,
  line_length, and sample_frame
* Remove waffle flag iperceptions
* Add waffle switches foundation_callout and helpful-survey-2

The regenerated sample database reflects removing the custom collation,
disambiguating tags, and the homepage with fewer links to
Mozilla-specific documentation.

https://github.com/mozilla/kuma/commit/f6664cc58154b082e6657c980eabb60f027de44c
Merge pull request #4422 from jwhitlock/sample-db-1391084

bug 1391084: Update sample database resources

Updated

a year ago
See Also: → bug 1401253

Comment 4

a year ago
The run-time dependency on utf8_distinct_ci has been removed. More work is needed to remove this from testing environments and the code. This is tracked in bug 1401253.
Status: NEW → RESOLVED
Last Resolved: a year ago
Resolution: --- → FIXED