Closed Bug 561470 Opened 14 years ago Closed 10 years ago

Need dumps of MDC content as raw source data

Categories

(developer.mozilla.org Graveyard :: General, defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: BenB, Unassigned, NeedInfo)

References

Details

MDC has excessively bad response times (upwards of 20 *seconds*! per page load) and uptime (down every other week it seems).
I totally depend on it, because all the XUL, JS, DOM etc. documentation is there. When MDC is down, it's often a serious problem for our work. When MDC takes 20s to respond, that time adds up to significant amounts.

Also, I often want to work offline (notebook, in a cafe).

So, I'd like a dump of MDC (of certain sections or all) to store on my computer, so that I have the documentation locally and it's loading in 50ms and I can develop offline. Please provide a way to download dumps.

Please note that:
- XULPlanet provided ZIP files to download the documentation, and MDC is maintaining the very same XULPlanet documentation now
- Wikipedia also allows you to download dumps, of much bigger size, and that's been very useful.
- Most importantly, mozilla.org is an open-source project, so mozilla.org's output is free. MDC is made up of contributions (including mine) under the assumption that this is a free project. MDC is the documentation for important Web standards, so it's important that it's free.
Ben - this is a good idea.  We'll definitely consider this for future updates.  

Right now we're focused on fixing the response time problems and we'll balance the download feature with other priorities after we resolve the current performance problems.  Sound ok to you?
Sounds OK, if the dumps are provided in reasonable time (1-3 months)
Thanks!
morgamic, I've been patient, but I'd like this now. Thank you.
Comment 1 Michael Morgan [:morgamic] 2010-04-23 14:54:08 PDT

> the download feature with other priorities after we resolve the current
> performance problems.  Sound ok to you?

Comment 2 Ben Bucksch (:BenB) 2010-04-23 14:55:49 PDT

> Sounds OK, if the dumps are provided in reasonable time (1-3 months)
> Thanks!

That's now almost a year ago.
Sheppy,

Do you know if MindTouch/DekiWiki has a data dump feature already?
Only in that you can save a PDF of any given page.
A "dump" means that I get the raw database text data so that I can run an exact copy of the site on my server.
What would you do, if you wanted to switch to MediaWiki? How would you get the data out of MindTouch? That's how we need it.
Ben,

We're switching from MindTouch to an in-house django-based wiki system. We're still learning how to dump and migrate data out of MindTouch. If you want to use something now you can use the MindTouch API (http://developer.mindtouch.com/Deki/API_Reference) to get pages in XML.
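For example, listing pages and fetching one page's source as XML looks roughly like this (I'm writing the endpoint paths from memory, so double-check them against the API reference; the page ID is made up):

curl 'https://developer.mozilla.org/@api/deki/pages'                # list pages
curl 'https://developer.mozilla.org/@api/deki/pages/1234/contents'  # one page's content as XML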

James,

Does kitsune give us a raw data dump feature?
Getting at the raw database data is often a very involved problem. Databases usually have foreign key constraints that require additional data--in particular user data--to come along with the document information you want.

Kitsune does not have a raw data dump feature because the information would contain things like email addresses and hashed passwords. You need more than just a mysqldump: you either need to dump the data in another format that does not require the foreign key relationships, or you need to securely scrub and anonymize the entire database.
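To give a feel for what scrubbing means in practice, a pass like the following would be the bare minimum before any dump went out (the table and column names are invented for illustration; they are not Kitsune's or Kuma's actual schema):

# Overwrite personal fields in place, then dump the sanitized copy.
mysql devmo <<'SQL'
UPDATE auth_user SET email = CONCAT('user', id, '@example.invalid'), password = 'SCRUBBED';
UPDATE revision  SET creator_ip = '0.0.0.0', user_agent = 'SCRUBBED';
SQL
mysqldump devmo | bzip2 > devmo_sanitized.sql.bz2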
Yeah, this is a complicated feature. I know Mozilla WebDev is working on a database-scrubbing tool, but that would mostly apply to the new wiki database, not MindTouch/DekiWiki. For now the only option is to use the existing MindTouch API. In the future we may support our own data dump feature.
Sorry, I don't understand the problem. Can't you just skip all fields in the user DB table, apart from ID and username?
That's effectively what the API does.

Mozilla has a thorough security review process to cover anything like this. I don't know the specifics; the security team will want to audit and analyze the dekiwiki user tables to know for sure that exporting the raw database data won't violate our users' private data. I'm relatively new to Mozilla, but I've learned that user data and privacy is a paramount value here. Nothing like this can be done lightly.

For now the best way to get wiki data is to use the API.
Again, I am not interested in any user data. I just want to run a mirror.

Where is the API that allows me to fetch the whole site content (not page by page), so that it's still working (including XUL:foo -> XUL/foo and similar stuff)?
There's a MindTouch Desktop suite for Windows that would allow you to copy the wiki site into another instance of MindTouch/DekiWiki - the only platform which supports the DekiScript templates. We have never supported the MindTouch tool so I don't know if it will work for you. And even then, exporting to a mirror would export user data, and we won't do that. 

This isn't as small a request as it seems, and it competes with other MDN priorities. (See https://wiki.mozilla.org/MDN#Current_Projects) Jay, where could this potentially land in the roadmap?
Component: Administration → Website
Priority: -- → P2
QA Contact: administration → website
Hardware: x86 → All
Summary: Need dump of MDC content → Export MDN content for offline reading
Whiteboard: u=user c=wiki p=
Version: unspecified → Kuma
Jay unmarked this as a dupe. Reverting title.
Summary: Export MDN content for offline reading → Need dump of MDC content
Blocks: 710713
> Ben Bucksch (:BenB) 2010-04-23 14:55:49 PDT
> Sounds OK, if the dumps are provided in reasonable time (1-3 months)

Now 2 years ago.

> Luke Crouch [:groovecoder] 2011-03-21 06:34:11 PDT 
> We're switching from MindTouch to an in-house django-based wiki system.

Now 1 year ago.

What's so difficult about giving a download URL for all HTML files in a ZIP? It's the same content I can "wget -m" from any Internet computer. This can't take 2 years.
That sounds like a good way to do it! Very Taco Bell programming, which I love. I'm curious to know if that works.

If it does, we may file a webops bug to do that once a day and publish the result.
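To make "once a day" concrete, the webops job could be as small as something like this; the paths, schedule and exact wget flags here are only a sketch, nothing is decided:

# mdn-snapshot.sh (hypothetical): mirror the rendered site, then publish a dated tarball.
set -e
WORK=/srv/mdn-snapshots/work
OUT=/srv/mdn-snapshots/public
wget -m -p -k -np -T 5 -t 3 -P "$WORK" https://developer.mozilla.org/en/
tar -cjf "$OUT/developer.mozilla.org-$(date +%F).tar.bz2" -C "$WORK" developer.mozilla.org
# cron entry, e.g. in /etc/cron.d/mdn-snapshot:
# 0 3 * * *  www-data  /usr/local/bin/mdn-snapshot.sh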
This sounds like a post-launch enhancement (bug 756266) and won't itself help us get to finishing off bug 756263.

(In reply to Ben Bucksch (:BenB) from comment #18)
> What's so difficult about giving a download URL for all HTML files in a ZIP?
> It's the same content I can "wget -m" from any Internet computer. This can't
> take 2 years.

FWIW, you're welcome to join us and help out. You might learn about all the headaches we've had over the past 2 years. You'd *think* it wouldn't be so difficult, but you'd be wrong.
> FWIW, you're welcome to join us and help out.

Yeah, I've said several times I'd like to help, but I don't have access to the servers and don't know how stuff is set up.

FWIW, there are 2 types of dumps: Raw data and output. Most important is the first, but the second is also very useful for practical reasons.

The first just gives what is stored on the server: you copy the relevant parts of the server disk, and the format depends on the software you use. It makes sense to wait with that until after the switch away from MindTouch.

The second should be the same as what the frontend HTTP servers deliver, and it *should* be possible to use wget -m. You said it's difficult, but didn't say why. I actually did that when I filed the bug, and the only things I ran into were duplicate pages and/or a lack of redirects.
Blocks: 756266
Ben, the in-development codebase for the wiki is the kuma project, which is on github: https://github.com/mozilla/kuma/

You don't need to set up a server or know how Mozilla's servers are organized or anything of the sort to hack on it if you want to. All you need is a clone of the git repo; there are instructions in there for setting up your own local instance, or for setting up a VirtualBox VM.

Meanwhile, one of the big difficulties is that we're very constrained by the wiki infrastructure we're trying to migrate away from. Getting off that, and onto the kuma wiki system, will be a big win, and any sort of data-dump feature really needs to be implemented in kuma in order to be available once we switch.

Another difficulty is what's already been mentioned with respect to security and privacy concerns, since raw data access raises questions about how to lock down parts of the data that shouldn't be made publicly available (like user information).

Finally, there are a whole bunch of decisions that'd need to be made when implementing this. For example, the wiki supports a scriptable templating system; should a "raw data" dump just give you the source there, or should it interpret that and give you the output of any templates included in the page? What about things like file references that may not be portable? There's a lot of this stuff that would need to be solved in order to make any sort of data-dump feature available, and right now it isn't solved. If you've got ideas on those, though, especially with respect to accommodating a broad swath of common use cases, we'd be happy to hear them :)
I just noticed: my mirror had about 5 million (!) hits, without bots/spiders. Clearly, there was demand.

I did wget below and it worked fairly well. I am only missing the main (overview/index) pages.
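# -m mirror, -p page requisites (images/CSS), -k convert links for local browsing,
# -T 5 / -t 3 timeout and retries, -R skip archive downloads, -np don't ascend above en/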
wget -m -p -k -T 5 -t 3 -R gz,bz2,zip,exe -np https://developer.mozilla.org/en/
http://download.beonex.com/mirror/developer.mozilla.org-2012-05-19.tar.bz2
> any sort of data-dump feature really needs to be implemented in kuma in order to be available
> once we switch.

Agreed!

> parts of the data that shouldn't be made publicly available (like user information).

Agreed.

> the wiki supports a scriptable templating system; should a "raw data" dump just give you the source
> there, or should it interpret that and give you the output of any templates included in the page?

I think that's answered by my comment 21: We need 1) a result dump, which has just HTML, and 2) a data dump, which has the raw source that we edit. The source would include the scripts and the invocations of the scripts, not their output. To use the data dump, I'd either need the same software that you use, or I'd extract only parts of the data via my own scripts (similar to what people do with Wikipedia dumps).

> What about things like file references that may not be portable?

Not sure what you mean, but the wget -k above converts the HTTP links in the result pages.
I would hope that the source doesn't contain absolute file system references.
No longer blocks: 710713, 756266
(Deps: Sorry, software hiccup.)
Blocks: 710713, 756266
deps conflict.

We've learned not to hope for much with MindTouch. :ubernostrum knows how messy it is with file attachments.

That wget mirror trick is cool. Can we just link to a community-hosted/maintained offline archive like yours?
No longer blocks: 710713
(In reply to Luke Crouch [:groovecoder] from comment #26)
> deps conflict.
> 
> We've learned not to hope for much with MindTouch. :ubernostrum knows how
> messy it is with file attachments.
> 
> That wget mirror trick is cool. Can we just link to a
> community-hosted/maintained offline archive like yours?

Some notes on that wget-sourced tarball:

* There are 80000 documents in the DB, but only about 10000 in the tarball.

* Admittedly, a non-trivial percentage of the 80000 documents in the wiki are spam or boilerplate User:* pages with "Welcome to Mindtouch!" text. IIRC, that leaves about 40000 or so pages after my last migration run.

* The tarball covers only the English pages, of which there are around 50000 in the DB. There are about 2-5k each in the Japanese, Polish, Spanish, and French translation sets.

* There are no extensions on the captured files, so there are some paths that are both files and directories (e.g. en/DOM/element). I had trouble unpacking those.

I haven't spent time trying to get a complete wget-sourced tarball out of the site, so I couldn't tell you what the fixes would be.

FWIW: What I *have* spent time working on is getting a DB dump that's safe for public release. That would partly address this bug. The other part might be the automated virtual machine setup we've been working on for local development.

The core MDN devs have had access to production DB dumps for about 7 months now. In that time, I've spent a few hours here or there auditing the data and trying to find all the places that would expose personal information. There's a lot of it, in a lot of places, and the site depends on it or something fake that looks like it (e.g. user names, passwords, IP addresses, browser user agent strings - our definition of personally identifying information is very broad).

For a while, I did have some dumps quietly available for download. But some automated scans from our security team caught them and found some more spots where I'd leaked personal data. So, back to the drawing board. I'm hoping to take another shot at it in the next week or two.
Also, bug 756547 might be interesting to watch with respect to the issues here.

I'd like to look into moving all the MDN docs out of a private, app-specific MySQL database and into a public git repo with more open formats. But that's probably something we won't have a chance to even think much about until Fall or Winter of this year.
> That wget mirror trick is cool.

:-)

> Can we just link to a community-hosted/maintained offline archive like yours?

Not really, because my fetch took something around 10 hours, I wouldn't want to cron that, and we want a moderately current dump (daily or at minimum weekly). It should be a lot faster when coming from your own servers. Most of the time was the SSL handshake for every page, due to the forced https and the lack of persistent connections in wget, so if you can bypass the SSL stuff and be on the same network, that'd be a lot faster.

> * There are 80000 documents in the DB, but only about 10000 in the tarball.
> ... about 40000 or so pages after my last migration run
> * The tarball covers only the English pages, of which there are around 50000 in the DB.

Yes, I was fetching only en/ . Not sure whether that explains the whole difference or whether English main articles are missing (apart from the index).

> * There are no extensions on the captured files, so there are some paths that are both files and
> directories (eg en/DOM/element). I had trouble unpacking those

Ah, that's why I am missing the index pages.

> FWIW: What I *have* spent time working on is getting a DB dump that's safe for public release.

Thanks! That'd be great.

> expose personal information.
> There's a lot of it, in a lot of places, and the site depends on it or something fake that
> looks like it. (eg. user names, passwords, IP addresses, browser user agent strings -
> our definition of personally identifying information is very broad)

If that makes the question simpler, any information that's already on the public website should be safe to include. That means user names are OK, and user names are important for the change history. Further details about the user are not important and shouldn't be included.
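For illustration, a whitelist approach would export only what's already public; the table and column names below are made up, since I don't know your schema:

# Dump only public user fields, plus the content tables themselves.
mysql devmo -e "SELECT id, username FROM auth_user" > users_public.tsv
mysqldump devmo pages revisions | bzip2 > content_only.sql.bz2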

BTW: It makes sense for you to also structure the data internally so that sensitive information lives in clearly defined and easily separable places, in order to make security, admin, syncs, this dump and other things easier for yourself. You'd also need this for:

> I'd like to look into moving all the MDN docs out of a private, app-specific MySQL database
> and into a public git repo with more open formats. But, that's probably something we won't
> have a chance to even think much about until Fall or Winter of this year

That'd be oh so great! Yes, please!
Depends on: 710713
Summary: Need dump of MDC content → Need dumps of MDC content as 1) raw source data and 2) result HTML pages
> > * There are no extensions on the captured files, so there are some paths that are both files and
> > directories (eg en/DOM/element). I had trouble unpacking those

> Ah, that's why I am missing the index pages.

I am trying with wget -E now, but that appends .html to all pages, which changes all the URLs. Ideally, I'd like the same URLs and results as on devmo. Configuring the browser or server to default to HTML shouldn't be too hard. But that leaves the index pages.

We have the problem that there is both .../DOM and .../DOM/foo, where foo is a leaf. .../DOM should be saved as .../DOM/index.html and .../DOM/foo should be saved as .../DOM/foo (it makes no sense to have .../DOM/foo/index.html for every single leaf page), but wget can't know whether .../DOM is a leaf or also a directory. At least not at the start, only after it has seen all pages.
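One possible cleanup after a crawl with -E (untested sketch, paths assume the tarball layout above): wherever a saved page and a directory share a path, move DOM.html into DOM/index.html:

# For every crawled directory, fold its sibling page file into index.html.
find developer.mozilla.org/en -type d | while read -r d; do
  [ -f "$d.html" ] && mv "$d.html" "$d/index.html"
done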
I've put it on my mirror <http://mdn.beonex.com>

Apache config for the above tarball:
<VirtualHost *>
    ServerName mdn.beonex.com
    ...
    <Location />
        DefaultType text/html
    </Location>
    <Location /skins/>
        DefaultType text/css
    </Location>
    <Location /media/css/>
        DefaultType text/css
    </Location>
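    # DefaultType above: the crawled files have no extensions, so serve them as HTML
    # (and as CSS under /skins/ and /media/css/); the rewrites below map "/" and
    # trailing-slash URLs onto the saved .html files.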
    RewriteEngine on
    RewriteRule   ^/$  en/index.html  [R]
    RewriteRule   ^(.*)/$  $1.html   [R]
</VirtualHost>
skins/ was posing a problem, because it's in robots.txt (why? can you remove that, please?), so I had to fetch it manually (added to the new tarball) and convert the links manually. Also, skins/common/js.php?foo isn't nice, because it's delivering CSS, so I had to add an Apache rule. New Apache config:

<VirtualHost *>
    ServerName mdn.beonex.com
    ...
    <Location />
        DefaultType text/html
    </Location>
    <Location /skins/>
        DefaultType text/css
        AddType text/css php
    </Location>
    <Location /media/css/>
        DefaultType text/css
    </Location>
    RewriteEngine on
    RewriteRule   ^/$  en/index.html  [R]
    RewriteRule   ^/docs$  en-US/docs.html  [R]
    RewriteRule   ^/learn$  en/index.html  [R]
    RewriteRule   ^(.*)/$  $1.html   [R]
</VirtualHost>
Depends on: 757461
Luke, thanks for filing bug 757461, so this bug here can focus on the raw data.
Summary: Need dumps of MDC content as 1) raw source data and 2) result HTML pages → Need dumps of MDC content as raw source data
I have been through the exact same pain and decided to write a script that migrates MindTouch DekiWikis to MediaWiki, and I am offering it as a service at ElectricLetters.com .
Version: Kuma → unspecified
Component: Website → Landing pages
No longer blocks: 756266
Any news about this?
In MDN's UserVoice forum there is an answer claiming that this feature would be released alongside Kuma (which was released on the 3rd of August 2012, almost half a year ago...)
(In reply to aklp08 from comment #37)
> Any news about this?
> In MDN's UserVoice forum there is an answer claiming that this feature would
> be released alongside Kuma (which was released on the 3rd of August 2012,
> almost half a year ago...)
There is a weekly snapshot at https://developer.mozilla.org/media/developer.mozilla.org.tar.gz  (1.6GB)

Marking as a duplicate of bug 757461. Re-open if I'm mistaken.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → DUPLICATE
This isn't a DUP. We specifically had 2 separate bugs: Bug 757461 is the HTML output that I can put on a static Apache without software, and this one here is the raw database / plaintext data that is being edited. Same difference as between an executable and its source code. This one here is about the source, in the form that is being edited.
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
Status: REOPENED → NEW
Depends on: 665084
(In reply to Ben Bucksch (:BenB) from comment #39)
> This isn't a DUP. We had specifically 2 separate bugs: Bug 757461 is the
> HTML output that I can put on a static Apache without software, and this
> one here is the raw database / plaintext data that is being edited. Same
> difference as between EXEcutable and source code. This one here is about the
> source, in the form that is being edited.

For what it's worth: HTML is what's being edited on Kuma. We don't use plaintext with wiki formatting. 

So, there are 2 differences between a tarball of the site and HTML straight from the DB:

* It's not wrapped in site chrome
* KumaScript macros are not rendered
Ah, I see.

So, would a dump of the edited HTML as .html files be useful, or would making that useful end up being the database dump?
Component: Landing pages → General
Whiteboard: u=user c=wiki p=
Both this bug here and bug 561470 are important for classical open source values: You cannot be an open source project, ask the community to contribute to the documentation, and then keep all these docs for yourself, usable only on your website. The information must be free and copyable for everybody, both in source code form (bug 561470) and in resulting form (this bug). This is highly important for very basic open source reasons and preservation of information.
Sorry, correction:
in source code form (this bug) and in resulting form (bug 665750)
We have reliable DB dumps now:

https://developer.allizom.org/landfill/
https://developer.allizom.org/landfill/devmo_sanitized-latest.sql.bz2
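To use it locally, something along these lines should work (the database name and MySQL credentials are up to you):

wget https://developer.allizom.org/landfill/devmo_sanitized-latest.sql.bz2
bunzip2 devmo_sanitized-latest.sql.bz2
mysql -u root -p -e 'CREATE DATABASE devmo CHARACTER SET utf8'
mysql -u root -p devmo < devmo_sanitized-latest.sql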
Status: NEW → RESOLVED
Closed: 11 years ago → 10 years ago
Resolution: --- → FIXED
Thanks to everybody who helped with it!
This dump was taken down in response to a personal information disclosure:

https://blog.mozilla.org/security/2014/08/01/mdn-database-disclosure/

As far as I'm aware, this dump is gone permanently. Another solution that doesn't run the risk of leaking personal data needs to be found.
:lorchard +1. Publishing the sanitized DB dump risks leaking private information, so we've disabled it. rogerio - how did you find the links? In the docs? Do you want to submit a pull request that removes that part of the docs?
Flags: needinfo?(rogeriopradoj)
I have updated https://developer.mozilla.org/en-US/docs/MDN/About to remove mention of these downloads.
Depends on: 1224953
Product: developer.mozilla.org → developer.mozilla.org Graveyard