Closed Bug 710753 Opened 13 years ago Closed 12 years ago

Handle MindTouch page namespace prefixes in migration to Kuma

Categories

(developer.mozilla.org Graveyard :: Editing, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: lorchard, Assigned: lorchard)

References

Details

(Whiteboard: u=user c=wiki p=3)

Attachments

(1 file)

In MindTouch, a page can have a namespace. In the page URL, this is reflected by the use of a colon-delimited keyword. For example:

* Project:en/How_to_Help
* User:* 
* User_talk:*
* Talk:*
 
In the database, page records have a numeric column indicating the namespace. Here are some references to the mapping between the namespace keyword in URLs and the numeric value in the database:

https://svn.mindtouch.com/source/public/dekiwiki/trunk/web/includes/Defines.php
https://github.com/mozilla/kuma/blob/mdn/apps/devmo/models.py#L327

This is the breakdown in page namespaces, in terms of page count:

(none) 36618
Talk: 2210
User: 39941
User_talk: 901
Project: 460
Project_talk: 63
Template: 646
Template_talk: 17
Help: 58
Help_talk: 9
Special: 24

So, issues as part of this in migration:

* Do we want to carry over the same MindTouch keyword-to-number mapping in the database? (Probably not. It's kind of awkward, and the keyword prefixes are not stored in the DB, so they can't be used in joins.)

* This tangles up with locales a bit, since the locale appears in the slug *after* the keyword prefix

* Are there any namespaces we can just ignore? (eg. Special, Help, Help_talk, Template_talk, Project_talk?)

* Nearly 1/2 of our pages are in the User: namespace. A quick glance at them shows that quite a large number of them are just the "Welcome to MindTouch" page that gets created for users on registration. We can probably delete these!
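
To make the prefix/locale tangle above concrete, here is a minimal Python sketch of splitting a MindTouch-style title such as Project:en/How_to_Help into namespace, locale, and slug. The function name `split_title` and the tiny `KNOWN_LOCALES` set are hypothetical, introduced only for illustration; the namespace keywords come from the breakdown above. This is not the code used in the actual migration.

```python
# Namespace keywords from the page-count breakdown in this bug.
KNOWN_NAMESPACES = {
    "Talk", "User", "User_talk", "Project", "Project_talk",
    "Template", "Template_talk", "Help", "Help_talk", "Special",
}

# Hypothetical sample of locale codes; a real list would be much longer.
KNOWN_LOCALES = {"en", "fr", "ja"}


def split_title(title):
    """Split a title into (namespace, locale, slug).

    Assumption: the namespace keyword, if any, comes first and ends with
    a colon; the locale, if any, is the first path segment after it.
    """
    namespace = ""
    prefix, sep, rest = title.partition(":")
    # Only treat the text before the first colon as a namespace if it is
    # a known keyword, so titles that merely contain a colon are left alone.
    if sep and prefix in KNOWN_NAMESPACES:
        namespace, title = prefix, rest
    first, sep, remainder = title.partition("/")
    if sep and first in KNOWN_LOCALES:
        return namespace, first, remainder
    return namespace, "", title
```

With this sketch, a title like User:lorchard has no locale segment at all, which is one reason the locale can't simply be assumed to lead the slug.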
2.1, maybe?
Target Milestone: --- → 2.1
Ensure all namespace pages make it into Django.

Migrate:
Talk
User
User_talk
Project
Project_talk

Exclude:
Help
Help_talk
Special
Template
Template_talk

Filter:
User - don't bring over default pages
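
The Migrate / Exclude / Filter plan above could be expressed roughly as follows. This is only a sketch: the names `MIGRATE`, `EXCLUDE`, and `should_migrate` are hypothetical, and treating the main (no-prefix) namespace as migrated is an assumption, since the list above only covers the prefixed namespaces.

```python
# Prefixed namespaces to carry over, per the list above. The empty string
# stands for the main namespace (assumed to be migrated as well).
MIGRATE = {"", "Talk", "User", "User_talk", "Project", "Project_talk"}

# Namespaces to leave behind, per the list above.
EXCLUDE = {"Help", "Help_talk", "Special", "Template", "Template_talk"}


def should_migrate(namespace, is_default_user_page=False):
    """Decide whether a page in the given namespace should be migrated.

    `is_default_user_page` is a hypothetical flag meaning "this User: page
    only contains the default content created on registration".
    """
    if namespace in EXCLUDE:
        return False
    if namespace not in MIGRATE:
        # Fail loudly on a namespace the plan does not mention.
        raise ValueError("unknown namespace: %r" % (namespace,))
    if namespace == "User" and is_default_user_page:
        return False
    return True
```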
Whiteboard: u=user c=wiki p=3
Assignee: nobody → lorchard
Just because this might be interesting to share, and because I don't want to forget it:

mysql> create temporary table hashtmp select page_title, md5(page_text) from pages where page_namespace=2;

Query OK, 39941 rows affected (4 min 53.93 sec)
Records: 39941  Duplicates: 0  Warnings: 0

mysql> select `md5(page_text)` as hash, count(*) as ct from hashtmp group by hash order by ct desc limit 20;
+----------------------------------+-------+
| hash                             | ct    |
+----------------------------------+-------+
| 7479e8f30d5ab0e9202195a1bddec69d | 18778 | 
| 698141d0c92776d60d884ebce6d64d82 |  6927 | 
| ca0c3622cdb213281cf2dc698b15c357 |  4888 | 
| ce33312f48b8ce8a68c587173e276f3a |  3578 | 
| 9ba3b75ba5e3ba82cfad83a50186ab35 |  1988 | 
| e931344938b19ea93865568712c2b2de |   799 | 
| a40f1d06233eef791bcf8b55df46cace |   133 | 
| 14d2e3e51d704084503f67eaaf47dc72 |    68 | 
| d41d8cd98f00b204e9800998ecf8427e |    55 | 
| 74ced08578951e424aff4e7a90f2b48b |    47 | 
| 55abb153d6e5d1bc22dae9938074f38d |    43 | 
| 43d1c34c5556ebf12e9d0601863eb752 |    35 | 
| f53c0981035e2378c8e8692a1e7f9649 |    32 | 
| 68b329da9893e34099c7d8ad5cb9c940 |    22 | 
| 8766b3552715bed94c106f6824efb535 |    17 | 
| 7dbb4512068edc202eda2b853c415cb7 |    15 | 
| 63f484aade7cfab43340bd001370c132 |     9 | 
| f71abdf1a61d4fbcf7a96c484f602434 |     7 | 
| baf848927342e7fa737b14277fa566f8 |     6 | 
| 83c7ff527035fe0dd78c2330e08d6747 |     6 | 
+----------------------------------+-------+
20 rows in set (0.04 sec)

Still need to see what actual content those MD5 hashes represent, but that looks like a huge number of User: pages with similar or identical content.
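
For anyone who wants to repeat this kind of count outside MySQL, a rough Python equivalent of the group-by query above might look like the sketch below. The function name `top_duplicate_hashes` is hypothetical, and encoding the text as UTF-8 before hashing is an assumption about how the page text is stored.

```python
import hashlib
from collections import Counter


def top_duplicate_hashes(texts, limit=20):
    """Return the `limit` most common (md5_hex, count) pairs for the texts.

    Assumes page text is UTF-8; a different column encoding would give
    different hashes than MySQL's md5().
    """
    counts = Counter(
        hashlib.md5(text.encode("utf-8")).hexdigest() for text in texts
    )
    return counts.most_common(limit)
```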
More analysis. If correct, then we have only 2389 of 39941 user pages with unique content. The top 15 or so duplicated hashes I saw were "welcome to MDN" default content, in various languages.

This tempts me to skip migration for User:* pages whose content hash matches any from a list of duplicates.

mysql> create temporary table hashtmp2 select `md5(page_text)` as hash, page_text, count(*) as ct from hashtmp group by hash;

Query OK, 2447 rows affected (0.69 sec)
Records: 2447  Duplicates: 0  Warnings: 0

mysql> select sum(ct) from hashtmp2 where ct = 1;
+---------+
| sum(ct) |
+---------+
|    2389 | 
+---------+
1 row in set (0.00 sec)

mysql> select sum(ct) from hashtmp2 where ct > 1;
+---------+
| sum(ct) |
+---------+
|   37552 | 
+---------+
1 row in set (0.00 sec)
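
The numbers above are consistent: 2389 + 37552 = 39941 pages, and 2447 distinct hashes minus the 2389 singletons leaves 58 hashes shared by all the duplicated pages. A skip-by-hash filter along the lines described could be sketched in Python as below. The names `content_hash`, `build_skip_hashes`, and `pages_to_migrate` are hypothetical, UTF-8 encoding is an assumption, and this is not necessarily how the migration ended up doing it.

```python
import hashlib


def content_hash(page_text):
    """Hex MD5 of the page text (assumes UTF-8 encoded content)."""
    return hashlib.md5(page_text.encode("utf-8")).hexdigest()


def build_skip_hashes(default_texts):
    """Build the set of hashes to skip from known default page texts."""
    return {content_hash(text) for text in default_texts}


def pages_to_migrate(pages, skip_hashes):
    """Yield (title, text) pairs whose content hash is not in skip_hashes."""
    for title, text in pages:
        if content_hash(text) not in skip_hashes:
            yield title, text
```

One detail worth noting from the earlier table: d41d8cd98f00b204e9800998ecf8427e is the MD5 of the empty string, so those 55 pages are simply empty.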
Adding the transcript of the MySQL queries I used to come up with hashes for User: page content to skip during migration, just so others can check my work if necessary.
Commits pushed to https://github.com/mozilla/kuma

https://github.com/mozilla/kuma/commit/3783d850ab167fcaec8c52bbf9d16eb1f2a71bf4
bug 710753: Handle MindTouch page namespace prefixes in migration to Kuma

* Include additional namespaces in migration - Talk, User, User_talk,
  Project, Project_talk

* Exclude User namespace pages whose content matches known garbage
  content.

* Progress metrics in count, rate, and duration. (TODO: eta)

* Misc bugfixes

https://github.com/mozilla/kuma/commit/fa281883c6d7d9c69f633451c0ae9af9fa663c2b
Merge pull request #93 from lmorchard/bug-710753-migrate-namespaces

Fix Bug 710753 migrate namespaces
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Version: MDN → unspecified
Component: Docs Platform → Editing
Product: developer.mozilla.org → developer.mozilla.org Graveyard