Closed Bug 710753 Opened 13 years ago Closed 13 years ago

Handle MindTouch page namespace prefixes in migration to Kuma

Categories

(developer.mozilla.org Graveyard :: Editing, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: lorchard, Assigned: lorchard)

References

Details

(Whiteboard: u=user c=wiki p=3)

Attachments

(1 file)

In MindTouch, a page can have a namespace. In the page URL, this is reflected by the use of a colon-delimited keyword. For example:

* Project:en/How_to_Help
* User:*
* User_talk:*
* Talk:*

In the database, page records have a numeric column indicating the namespace. Here are some references to the mapping between the namespace keyword in URLs and the numeric value in the database:

https://svn.mindtouch.com/source/public/dekiwiki/trunk/web/includes/Defines.php
https://github.com/mozilla/kuma/blob/mdn/apps/devmo/models.py#L327

This is the breakdown of page namespaces, in terms of page count:

(none)          36618
Talk:            2210
User:           39941
User_talk:        901
Project:          460
Project_talk:      63
Template:         646
Template_talk:     17
Help:              58
Help_talk:          9
Special:           24

So, issues to settle as part of this in migration:

* Do we want to carry over the same MindTouch keyword-to-number mapping in the database? (Probably not. It's kind of awkward, and the prefixes are not in the DB for joins.)
* This tangles up with locales a bit, since the locale is in the slug *after* the keyword prefix.
* Are there any namespaces we can just ignore? (e.g. Special, Help, Help_talk, Template_talk, Project_talk?)
* Nearly 1/2 of our pages are in the User: namespace. A quick glance at them shows that quite a large number of them are just the "Welcome to MindTouch" page that gets created for users on registration. We can probably delete these!
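To make the prefix/locale tangle above concrete, here is a minimal sketch of splitting a MindTouch page path into namespace, locale, and remaining slug. The function name `split_page_path` and the `KNOWN_LOCALES` set are hypothetical, made up for illustration; this is not the code in the Kuma migration. The namespace keywords are the ones listed in the breakdown above.

```python
# Namespace keywords taken from the page-count breakdown in this bug.
KNOWN_NAMESPACES = (
    "Talk", "User", "User_talk", "Project", "Project_talk",
    "Template", "Template_talk", "Help", "Help_talk", "Special",
)

# Hypothetical sample of locale codes; the real list would come from the site config.
KNOWN_LOCALES = {"en", "fr", "de", "ja"}


def split_page_path(path):
    """Return (namespace, locale, slug) for a MindTouch-style page path.

    namespace is "" for pages without a prefix; locale is None when the
    first path segment is not a known locale.
    """
    namespace = ""
    head, sep, tail = path.partition(":")
    # Only treat the text before the first colon as a namespace if it is a
    # known keyword, so titles that merely contain a colon are left alone.
    if sep and head in KNOWN_NAMESPACES:
        namespace, path = head, tail

    locale = None
    first, sep, rest = path.partition("/")
    # The locale, when present, comes *after* the namespace prefix.
    if sep and first in KNOWN_LOCALES:
        locale, path = first, rest

    return namespace, locale, path
```

For the example path from this bug, `split_page_path("Project:en/How_to_Help")` separates the `Project` prefix first and only then looks for the `en` locale segment.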
2.1, maybe?
Target Milestone: --- → 2.1
Ensure all namespace pages make it into django.

Migrate:
* Talk
* User
* User_talk
* Project
* Project_talk

Exclude:
* Help
* Help_talk
* Special
* Template
* Template_talk

Filter:
* User - don't bring over default pages
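The migrate/exclude/filter plan above could be expressed as a small predicate. This is only a sketch: `should_migrate` and its `is_default_user_page` flag are hypothetical names, and treating un-prefixed pages (namespace `""`) as migrated is an assumption, since the list above only covers prefixed namespaces.

```python
# Namespaces to carry over. "" (no prefix) is an assumption: the plan above
# only lists prefixed namespaces, but regular pages are presumably migrated.
MIGRATE = {"", "Talk", "User", "User_talk", "Project", "Project_talk"}

# Namespaces to leave behind, per the plan above.
EXCLUDE = {"Help", "Help_talk", "Special", "Template", "Template_talk"}


def should_migrate(namespace, is_default_user_page=False):
    """Decide whether a page in the given namespace should be migrated.

    is_default_user_page is a hypothetical flag the caller would set when a
    User: page only contains the auto-generated welcome content.
    """
    if namespace in EXCLUDE:
        return False
    if namespace == "User" and is_default_user_page:
        return False
    # Anything not explicitly listed is skipped rather than guessed at.
    return namespace in MIGRATE
```

Keeping the decision in one function makes it easy to change which namespaces are excluded without touching the rest of the migration loop.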
Whiteboard: u=user c=wiki p=3
Assignee: nobody → lorchard
Just because this might be interesting to share, and because I don't want to forget it:

mysql> create temporary table hashtmp select page_title, md5(page_text) from pages where page_namespace=2;
Query OK, 39941 rows affected (4 min 53.93 sec)
Records: 39941  Duplicates: 0  Warnings: 0

mysql> select `md5(page_text)` as hash, count(*) as ct from hashtmp group by hash order by ct desc limit 20;
+----------------------------------+-------+
| hash                             | ct    |
+----------------------------------+-------+
| 7479e8f30d5ab0e9202195a1bddec69d | 18778 |
| 698141d0c92776d60d884ebce6d64d82 |  6927 |
| ca0c3622cdb213281cf2dc698b15c357 |  4888 |
| ce33312f48b8ce8a68c587173e276f3a |  3578 |
| 9ba3b75ba5e3ba82cfad83a50186ab35 |  1988 |
| e931344938b19ea93865568712c2b2de |   799 |
| a40f1d06233eef791bcf8b55df46cace |   133 |
| 14d2e3e51d704084503f67eaaf47dc72 |    68 |
| d41d8cd98f00b204e9800998ecf8427e |    55 |
| 74ced08578951e424aff4e7a90f2b48b |    47 |
| 55abb153d6e5d1bc22dae9938074f38d |    43 |
| 43d1c34c5556ebf12e9d0601863eb752 |    35 |
| f53c0981035e2378c8e8692a1e7f9649 |    32 |
| 68b329da9893e34099c7d8ad5cb9c940 |    22 |
| 8766b3552715bed94c106f6824efb535 |    17 |
| 7dbb4512068edc202eda2b853c415cb7 |    15 |
| 63f484aade7cfab43340bd001370c132 |     9 |
| f71abdf1a61d4fbcf7a96c484f602434 |     7 |
| baf848927342e7fa737b14277fa566f8 |     6 |
| 83c7ff527035fe0dd78c2330e08d6747 |     6 |
+----------------------------------+-------+
20 rows in set (0.04 sec)

Still need to see what actual content those MD5 hashes represent, but that seems like a huge number of User: pages with similar or identical content.
More analysis. If correct, then we have only 2389 of 39941 user pages with unique content. The top 15 or so duplicated hashes I saw were "welcome to MDN" default content, in various languages. This tempts me to skip migration for User:* pages whose content hash matches any from a list of duplicates.

mysql> create temporary table hashtmp2 select `md5(page_text)` as hash, page_text, count(*) as ct from hashtmp group by hash;
Query OK, 2447 rows affected (0.69 sec)
Records: 2447  Duplicates: 0  Warnings: 0

mysql> select sum(ct) from hashtmp2 where ct = 1;
+---------+
| sum(ct) |
+---------+
|    2389 |
+---------+
1 row in set (0.00 sec)

mysql> select sum(ct) from hashtmp2 where ct > 1;
+---------+
| sum(ct) |
+---------+
|   37552 |
+---------+
1 row in set (0.00 sec)
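The same duplicate analysis can be sketched outside MySQL. The snippet below is an illustration, not the code that landed in Kuma: the helper names `content_hash` and `duplicate_hashes` are hypothetical, and hashing the UTF-8 encoding of the text is an assumption about how `page_text` is stored. One known fixed point: `d41d8cd98f00b204e9800998ecf8427e`, which shows up 55 times in the table above, is the MD5 of the empty string, so those are empty pages.

```python
import hashlib
from collections import Counter


def content_hash(text):
    """MD5 hex digest of page text (UTF-8 encoding is an assumption)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def duplicate_hashes(page_texts, min_count=2):
    """Return the set of content hashes shared by at least min_count pages.

    Mirrors the "group by hash ... where ct > 1" queries above.
    """
    counts = Counter(content_hash(text) for text in page_texts)
    return {digest for digest, n in counts.items() if n >= min_count}


def is_skippable_user_page(text, skip_hashes):
    """True when a User: page's content matches a known duplicate hash."""
    return content_hash(text) in skip_hashes
```

A migration would presumably build `skip_hashes` once, review the content behind each hash by hand, and then call `is_skippable_user_page` per User: page.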
Adding the transcript of the MySQL queries I used to come up with hashes for User: page content to skip during migration, just so others can check my work if necessary.
Commits pushed to https://github.com/mozilla/kuma

https://github.com/mozilla/kuma/commit/3783d850ab167fcaec8c52bbf9d16eb1f2a71bf4
bug 710753: Handle MindTouch page namespace prefixes in migration to Kuma
* Include additional namespaces in migration - Talk, User, User_talk, Project, Project_talk
* Exclude User namespace pages whose content matches known garbage content.
* Progress metrics in count, rate, and duration. (TODO: eta)
* Misc bugfixes

https://github.com/mozilla/kuma/commit/fa281883c6d7d9c69f633451c0ae9af9fa663c2b
Merge pull request #93 from lmorchard/bug-710753-migrate-namespaces
Fix Bug 710753 migrate namespaces
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Version: MDN → unspecified
Component: Docs Platform → Editing
Product: developer.mozilla.org → developer.mozilla.org Graveyard