Bug 710753
Opened 13 years ago
Closed 13 years ago
Handle MindTouch page namespace prefixes in migration to Kuma
Categories
(developer.mozilla.org Graveyard :: Editing, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
2.1
People
(Reporter: lorchard, Assigned: lorchard)
References
Details
(Whiteboard: u=user c=wiki p=3)
Attachments
(1 file)
In MindTouch, a page can have a namespace. In the page URL, this is reflected by the use of a colon-delimited keyword. For example:
* Project:en/How_to_Help
* User:*
* User_talk:*
* Talk:*
In the database, page records have a numeric column indicating the namespace. Here are some references to the mapping between the namespace keyword in URLs and the numeric value in the database:
https://svn.mindtouch.com/source/public/dekiwiki/trunk/web/includes/Defines.php
https://github.com/mozilla/kuma/blob/mdn/apps/devmo/models.py#L327
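For illustration, here is a minimal sketch of what such a keyword-to-number lookup might look like in Python. Only User = 2 is confirmed in this bug (the queries further down use page_namespace=2 for User pages). The other numbers are assumptions based on the MediaWiki-style convention and should be checked against Defines.php. Special is left out because its numeric value is not given here. The names NAMESPACE_PREFIXES and full_title are hypothetical.

```python
# Hypothetical lookup table. Only 2 -> "User" is confirmed by this bug;
# the other numbers are ASSUMED from MediaWiki-style conventions and must be
# verified against Defines.php before being relied on.
NAMESPACE_PREFIXES = {
    0: "",               # main namespace, no prefix
    1: "Talk",
    2: "User",
    3: "User_talk",
    4: "Project",
    5: "Project_talk",
    10: "Template",
    11: "Template_talk",
    12: "Help",
    13: "Help_talk",
}


def full_title(namespace_id, title):
    """Rebuild the URL-style title from a numeric namespace and a bare title."""
    if namespace_id not in NAMESPACE_PREFIXES:
        raise KeyError(f"unknown namespace id: {namespace_id}")
    prefix = NAMESPACE_PREFIXES[namespace_id]
    # Pages in the main namespace carry no "Prefix:" part.
    return f"{prefix}:{title}" if prefix else title
```

For example, full_title(4, "en/How_to_Help") would rebuild the Project:en/How_to_Help title from the list above, assuming Project really is 4.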
This is the breakdown in page namespaces, in terms of page count:
(none) 36618
Talk: 2210
User: 39941
User_talk: 901
Project: 460
Project_talk: 63
Template: 646
Template_talk: 17
Help: 58
Help_talk: 9
Special: 24
So, the issues this raises for migration:
* Do we want to carry over the same MindTouch keyword-to-number mapping in the database? (Probably not. It's kind of awkward and the prefixes are not in the DB for joins)
* This tangles up with locales a bit, since the locale sits in the slug *after* the keyword prefix (e.g. Project:en/How_to_Help)
* Are there any namespaces we can just ignore? (eg. Special, Help, Help_talk, Template_talk, Project_talk?)
* Nearly 1/2 of our pages are in the User: namespace. A quick glance at them shows that quite a large number of them are just the "Welcome to MindTouch" page that gets created for users on registration. We can probably delete these!
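To make the prefix/locale tangle concrete, here is a rough sketch of splitting a MindTouch path into namespace, locale, and slug. This is not the migration code. The function name, the KNOWN_LOCALES set, and the rule "the first path segment is a locale only if it is in that set" are all assumptions made for illustration.

```python
# Namespace keywords taken from the breakdown above.
KNOWN_PREFIXES = {
    "Talk", "User", "User_talk", "Project", "Project_talk",
    "Template", "Template_talk", "Help", "Help_talk", "Special",
}

# Hypothetical, deliberately tiny locale list; the real set would be larger.
KNOWN_LOCALES = {"en", "fr", "ja"}


def split_mindtouch_path(path, locales=KNOWN_LOCALES):
    """Return (namespace, locale, slug) for a MindTouch page path."""
    namespace = ""
    head, sep, rest = path.partition(":")
    # Only treat the part before the first colon as a namespace if it is a
    # known keyword, so titles that merely contain a colon are left alone.
    if sep and head in KNOWN_PREFIXES:
        namespace, path = head, rest
    first, sep, remainder = path.partition("/")
    if sep and first in locales:
        return namespace, first, remainder
    # No recognizable locale segment (e.g. "User:someone").
    return namespace, "", path
```

With this, Project:en/How_to_Help splits into ("Project", "en", "How_to_Help"), while a path like User:someone has no locale at all, which is one of the cases the migration has to decide how to handle.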
Comment 2•13 years ago
Ensure all namespace pages make it into Django.
Migrate:
Talk
User
User_talk
Project
Project_talk
Exclude:
Help
Help_talk
Special
Template
Template_talk
Filter:
User - don't bring over default pages
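The migrate/exclude/filter plan above could be expressed roughly like this. It is only a sketch: should_migrate and the is_default_user_page flag are hypothetical names, and treating un-prefixed pages as migrated is an assumption (the plan above only lists the prefixed namespaces).

```python
MIGRATE = {"Talk", "User", "User_talk", "Project", "Project_talk"}
EXCLUDE = {"Help", "Help_talk", "Special", "Template", "Template_talk"}


def should_migrate(namespace, is_default_user_page=False):
    """Decide whether a page in the given namespace keyword gets migrated."""
    if namespace == "":
        # ASSUMPTION: pages without a namespace prefix are always migrated.
        return True
    if namespace in EXCLUDE:
        return False
    if namespace == "User" and is_default_user_page:
        # Filter: skip the auto-generated default user pages.
        return False
    # Anything not explicitly listed is skipped rather than guessed at.
    return namespace in MIGRATE
```

How a page gets flagged as a default user page is left open here; the later comments work that out by comparing content hashes.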
Whiteboard: u=user c=wiki p=3
Updated•13 years ago
Assignee: nobody → lorchard
Comment 3•13 years ago
Just because this might be interesting to share, and because I don't want to forget it:
mysql> create temporary table hashtmp select page_title, md5(page_text) from pages where page_namespace=2;
Query OK, 39941 rows affected (4 min 53.93 sec)
Records: 39941 Duplicates: 0 Warnings: 0
mysql> select `md5(page_text)` as hash, count(*) as ct from hashtmp group by hash order by ct desc limit 20;
+----------------------------------+-------+
| hash | ct |
+----------------------------------+-------+
| 7479e8f30d5ab0e9202195a1bddec69d | 18778 |
| 698141d0c92776d60d884ebce6d64d82 | 6927 |
| ca0c3622cdb213281cf2dc698b15c357 | 4888 |
| ce33312f48b8ce8a68c587173e276f3a | 3578 |
| 9ba3b75ba5e3ba82cfad83a50186ab35 | 1988 |
| e931344938b19ea93865568712c2b2de | 799 |
| a40f1d06233eef791bcf8b55df46cace | 133 |
| 14d2e3e51d704084503f67eaaf47dc72 | 68 |
| d41d8cd98f00b204e9800998ecf8427e | 55 |
| 74ced08578951e424aff4e7a90f2b48b | 47 |
| 55abb153d6e5d1bc22dae9938074f38d | 43 |
| 43d1c34c5556ebf12e9d0601863eb752 | 35 |
| f53c0981035e2378c8e8692a1e7f9649 | 32 |
| 68b329da9893e34099c7d8ad5cb9c940 | 22 |
| 8766b3552715bed94c106f6824efb535 | 17 |
| 7dbb4512068edc202eda2b853c415cb7 | 15 |
| 63f484aade7cfab43340bd001370c132 | 9 |
| f71abdf1a61d4fbcf7a96c484f602434 | 7 |
| baf848927342e7fa737b14277fa566f8 | 6 |
| 83c7ff527035fe0dd78c2330e08d6747 | 6 |
+----------------------------------+-------+
20 rows in set (0.04 sec)
Still need to see what actual content those MD5 hashes represent, but that looks like a very large number of User: pages with identical content: the top hash alone covers 18778 of the 39941 pages.
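The same duplicate count can be sketched outside MySQL. This is an illustrative Python equivalent, not what was actually run. It assumes the page texts are available as strings and that hashing their UTF-8 bytes matches what MySQL's md5() saw, which depends on the column encoding. The function names are made up.

```python
import hashlib
from collections import Counter


def content_hash(text):
    """MD5 hex digest of a page's text (UTF-8 encoding is an assumption)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def top_duplicates(page_texts, limit=20):
    """Return the `limit` most common content hashes as (hash, count) pairs."""
    counts = Counter(content_hash(text) for text in page_texts)
    return counts.most_common(limit)
```

One hash in the table above can be identified without looking at the data: d41d8cd98f00b204e9800998ecf8427e is the MD5 of the empty string, so those 55 User: pages are simply empty.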
Comment 4•13 years ago
More analysis. If correct, then we have only 2389 of 39941 user pages with unique content. The top 15 or so duplicated hashes I saw were "welcome to MDN" default content, in various languages.
This tempts me to skip migration for User:* pages whose content hash matches any from a list of duplicates.
mysql> create temporary table hashtmp2 select `md5(page_text)` as hash, page_text, count(*) as ct from hashtmp group by hash;
Query OK, 2447 rows affected (0.69 sec)
Records: 2447 Duplicates: 0 Warnings: 0
mysql> select sum(ct) from hashtmp2 where ct = 1;
+---------+
| sum(ct) |
+---------+
| 2389 |
+---------+
1 row in set (0.00 sec)
mysql> select sum(ct) from hashtmp2 where ct > 1;
+---------+
| sum(ct) |
+---------+
| 37552 |
+---------+
1 row in set (0.00 sec)
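As a cross-check on the two sums above, here is a small sketch that splits a hash-to-count mapping into unique and duplicated pages and collects the duplicated hashes as skip candidates. The function name and return shape are hypothetical, and the input is assumed to be a plain dict of hash to page count.

```python
def summarize(hash_counts):
    """Split a {hash: page_count} mapping into unique vs. duplicated pages.

    Returns (unique_pages, duplicated_pages, skip_candidates), where
    skip_candidates is the set of hashes shared by more than one page.
    """
    unique_pages = sum(ct for ct in hash_counts.values() if ct == 1)
    duplicated_pages = sum(ct for ct in hash_counts.values() if ct > 1)
    skip_candidates = {h for h, ct in hash_counts.items() if ct > 1}
    return unique_pages, duplicated_pages, skip_candidates
```

The numbers in the transcript are consistent: 2389 + 37552 = 39941, the full User: page count, and 2447 distinct hashes minus 2389 unique ones leaves 58 hashes that account for all the duplicates. A duplicated hash is not automatically garbage, so a skip list built this way would still need a manual look before use.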
Comment 5•13 years ago
Comment 6•13 years ago
Adding the transcript of the MySQL queries I used to come up with hashes for User: page content to skip during migration, just so others can check my work if necessary.
Comment 7•13 years ago
Commits pushed to https://github.com/mozilla/kuma
https://github.com/mozilla/kuma/commit/3783d850ab167fcaec8c52bbf9d16eb1f2a71bf4
bug 710753: Handle MindTouch page namespace prefixes in migration to Kuma
* Include additional namespaces in migration - Talk, User, User_talk,
Project, Project_talk
* Exclude User namespace pages whose content matches known garbage
content.
* Progress metrics in count, rate, and duration. (TODO: eta)
* Misc bugfixes
https://github.com/mozilla/kuma/commit/fa281883c6d7d9c69f633451c0ae9af9fa663c2b
Merge pull request #93 from lmorchard/bug-710753-migrate-namespaces
Fix Bug 710753 migrate namespaces
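The first commit above mentions progress metrics in count, rate, and duration. A minimal sketch of that kind of tracker is below. It is not taken from the Kuma commit. The Progress class and its methods are hypothetical, and the ETA left as a TODO in the commit is left out here too.

```python
import time


class Progress:
    """Track how many items were processed, how fast, and for how long."""

    def __init__(self, clock=time.monotonic):
        # The clock is injectable so the tracker can be tested without sleeping.
        self._clock = clock
        self._start = clock()
        self.count = 0

    def tick(self, n=1):
        """Record that n more items have been processed."""
        self.count += n

    def snapshot(self):
        """Return count, items per second, and elapsed seconds so far."""
        duration = self._clock() - self._start
        rate = self.count / duration if duration > 0 else 0.0
        return {"count": self.count, "rate": rate, "duration": duration}
```

A migration loop would call tick() once per page and log snapshot() at intervals.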
Updated•13 years ago
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Updated•13 years ago
Version: MDN → unspecified
Updated•13 years ago
Component: Docs Platform → Editing
Updated•5 years ago
Product: developer.mozilla.org → developer.mozilla.org Graveyard