Closed Bug 757461 Opened 12 years ago Closed 12 years ago

MDN: generate and host a tarball mirror of MDN

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task, P4)

x86
macOS

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: groovecoder, Assigned: nmaul, NeedInfo)

Details

(Whiteboard: [triaged 20120910])

In bug 561470, :BenB devised a wget command and Apache config that make it possible to host a downloadable tarball of MDC content.

If it isn't too much work, we should see if we can set this up while we develop a solid "offline" feature for Kuma.
http://download.beonex.com/mirror/developer.mozilla.org-2012-05-21.tar.bz2

I did:
wget -m -p -k -E -T 5 -t 3 -R gz,bz2,zip,exe,download -np https://developer.mozilla.org/en/
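(For reference, the same invocation spelled out with long option names; behaviour is per the wget manual, and --adjust-extension is simply the newer name for -E, which older releases call --html-extension.)

    wget --mirror --page-requisites --convert-links --adjust-extension \
         --timeout=5 --tries=3 \
         --reject=gz,bz2,zip,exe,download \
         --no-parent \
         https://developer.mozilla.org/en/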

skins/ was posing a problem, because it's in robots.txt (why? can you remove that, please?), so I had to fetch it manually ([1], added to the new tarball) and convert the links manually.

Furthermore, I ran:
find . -type f -name "*.html" -print0 | xargs -0 -n 1 sed -i -e "s|https://developer.mozilla.org|http://mdn.beonex.com|g" -e "s|//www.google.com|//no.google.anymore|g"
The latter is important for privacy and EU laws, because Google keeps and analyzes logs about where users were.

Apache config:
<VirtualHost *>
    ServerName mdn.beonex.com
    ...
    <Location />
        DefaultType text/html
    </Location>
    <Location /skins/>
        DefaultType text/css
        # for /skins/common/js.php?perms...
        AddType text/css php
    </Location>
    <Location /media/css/>
        DefaultType text/css
    </Location>
    RewriteEngine on
    RewriteRule   ^/$  en/index.html  [R]
    RewriteRule   ^/docs$  en-US/docs.html  [R]
    RewriteRule   ^/learn$  en/index.html  [R]
    RewriteRule   ^(.*)/$  $1.html   [R]
</VirtualHost>
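As a worked example of the last rule (path purely illustrative), a directory-style request is redirected onto the .html file that wget -E actually wrote. A quick way to check once the vhost answers as mdn.beonex.com:

    # expect a redirect from the directory-style URL to the saved .html file
    curl -sI http://mdn.beonex.com/en/DOM/element/   # Location: .../en/DOM/element.html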

[1]
https://developer.mozilla.org/skins/common/css.php
https://developer.mozilla.org/skins/common/custom_css.php
https://developer.mozilla.org/skins/common/icons/icon-trans.gif
https://developer.mozilla.org/skins/common/js.php?perms=
https://developer.mozilla.org/skins/common/js.php?perms=LOGIN,BROWSE,READ,SUBSCRIBE
https://developer.mozilla.org/skins/common/print.css
https://developer.mozilla.org/skins/mdn/Transitional/css.php
https://developer.mozilla.org/skins/mdn/Transitional/img/mdn-logo-sm.png
https://developer.mozilla.org/skins/mdn/Transitional/img/mdn-logo-tiny.png
https://developer.mozilla.org/skins/mdn/Transitional/js/javascript.min.js
https://developer.mozilla.org/skins/mdn/Transitional/print.css
The tarball can also be used offline, FWIW.
Can we get opsec to sign off on this request? Kick it back to us and we'll get 'er done.
Assignee: server-ops → nobody
Component: Server Operations: Web Operations → Security Assurance: Operations
QA Contact: cshields → security-assurance
I believe the robots.txt list was created to alleviate the load issues mentioned, not as a security measure. Which means it's fine for us / signed off :)

If this is going to be a once-a-day thing, I believe you should use -e robots=off --wait 5 (seconds, or less), but I would ask webops about that, so back to them. Just make sure there is a wait time so that the crawl isn't flagged as flooding and doesn't kill the site if it's already under heavy load.
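For concreteness, that would mean something like this on top of the command from comment 1 (illustrative only; the exact wait value is up to webops):

    wget -e robots=off --wait=5 -m -p -k -E -T 5 -t 3 \
         -R gz,bz2,zip,exe,download -np https://developer.mozilla.org/en/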
Assignee: nobody → server-ops
Component: Security Assurance: Operations → Server Operations: Web Operations
QA Contact: security-assurance → cshields
kang, a once-a-day fetch with --wait 5 wouldn't work. There are some 20,000-80,000 documents, there are only 86,400 seconds in a day, and the server already takes 1-2 s to respond to each request, at least from Germany (on a fast uplink).
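Rough numbers, for illustration (taking ~1.5 s average response time on top of the wait):

    80,000 docs × (5 s wait + 1.5 s response) ≈ 520,000 s ≈ 6 days
    20,000 docs × 6.5 s                       ≈ 130,000 s ≈ 1.5 days
    seconds in a day                           =  86,400 s

Even the low end of the estimate doesn't fit into a daily window.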
In that case I'm OK with zero delay. Up to webops to agree as well, however (due to the load). It might be better to have a tarball created locally, but that's also up to you/webops.
> It might be better to have a tarball created locally

Yes, that's the point of this bug :)
might be a dup.
Assignee: server-ops → nmaul
(In reply to Phong Tran [:phong] from comment #8)
> might be a dup.

dup of which bug?
Blocks: 756266
2 questions:

1) Are we happy with the wget command in comment 1?

2) Where should this be hosted? Perhaps somewhere in /media/?


Once we're agreed on this, it should be an easy cronjob.
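For the record, a minimal sketch of what the cron entry could look like (script path and schedule are placeholders, not a real setup):

    # /etc/cron.d/mdn-tarball -- hypothetical; weekly run, Sunday 02:00
    0 2 * * 0 root /usr/local/bin/mdn-mirror.sh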
Priority: -- → P4
Whiteboard: [waiting][webdev]
1) IDK. :BenB?

2) Yes, and we can link to it from the site somewhere.
I think so, apart from the skins/ problem I mentioned. I am not aware of other problems at the moment, but it's been many months since I tried it, and that was before Kuma.
You can see the result of that (old, pre-kuma!) fetch on http://mdn.beonex.com/ and try for yourself.
As far as I can tell, there no longer *is* any robots.txt file at all... so /skins/ should get pulled in just like anything else.

I've written the basic script and it's running now. The tarball should be available here when it completes:

https://developer.mozilla.org/media/developer.mozilla.org.tar.gz

No idea how big it'll be or how long it'll take. I'm tentatively scheduling it for weekly processing. When the first run completes, we can take stock of those things and adjust as necessary.


For the record, the wget command I'm using is:
wget -q -m -p -k -E -T 5 -t 3 -R gz,bz2,zip,exe,download -np https://developer.mozilla.org/en-US/

Basically the same, but 'en' became 'en-US', and I added -q so cron output would be useful. If we should do other locales too, let me know... I imagine they could go into the same archive... depending on size that may or may not be desirable though.
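If we do want other locales, a simple (untested) variant would be to loop over a locale list and mirror each into the same tree before tarring; the locale list below is purely illustrative:

    # hypothetical sketch: mirror several locales into one tree
    for locale in en-US de fr ja; do
        wget -q -m -p -k -E -T 5 -t 3 -R gz,bz2,zip,exe,download -np \
            "https://developer.mozilla.org/${locale}/"
    done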
Whiteboard: [waiting][webdev] → [triaged 20120910]
Bringing in :openjck so he knows what's up and can help us file/plan/schedule a bug to add a link to the .tgz download on the site somewhere.
> For the record, the wget command I'm using is:
> wget -q -m -p -k -E -T 5 -t 3 -R gz,bz2,zip,exe,download -np
> https://developer.mozilla.org/en-US/

Thanks!

The URL 
<https://developer.mozilla.org/media/developer.mozilla.org.tar.gz> gives me 403 Forbidden, probably a file permissions/ownership issue.
Yeah, files in /media/ that don't exist apparently throw a 403. It didn't exist until the job finished... it works now, and is apparently around 2.3GB. It took almost 20 hours to generate, and includes only /en-US/ stuff. That's long enough that I'm concerned about running it weekly... monthly seems more feasible. Also I did not do anything with respect to content filtering (like the find/sed in comment 1)... not sure how long that would take, or how much we would want to filter.


But the real issue is, I am afraid this might be a very unclean dataset to work with. At a glance:


1)
I see about 66,000 files like this:

developer.mozilla.org/en-US/users/login?next=%2Fen-US%2Fdocs%2Fwindow.forward$locales.html
developer.mozilla.org/en-US/users/login?next=%2Fen-US%2Fdocs%2FXPCOM_Interface_Reference%2FamIInstallCallback$locales.html
developer.mozilla.org/en-US/users/login?next=%2Fen-US%2Fdocs%2Fnew?slug=Javadoc.html

Repeated calls to the login page with different arguments. This probably happens because the link to the login page is present (with a different argument) on most of the pages wget visits.


2)
I also see files like this:

developer.mozilla.org/en-US/demos/detail/walking-with-css3/flag?flag_type=inappropriate.html
developer.mozilla.org/en-US/demos/detail/walking-with-css3/flag?flag_type=notworking.html
developer.mozilla.org/en-US/demos/detail/walking-with-css3/flag?flag_type=plagarised.html

This makes me very nervous... did we just end up submitting a whole bunch of feedback for every single demo with this single wget?


3)
And then there's chunks like this:

developer.mozilla.org/en-US/docs/Components.utils.reportError$revision/94969.html
developer.mozilla.org/en-US/docs/Components.utils.reportError$revision/94979.html
developer.mozilla.org/en-US/docs/Components.utils.reportError$revision/94978.html
developer.mozilla.org/en-US/docs/Components.utils.reportError$revision/94982.html
developer.mozilla.org/en-US/docs/Components.utils.reportError$revision/94975.html

It appears that in many cases we're getting one file *per revision* of a document.


4)
There are also files like this:

developer.mozilla.org/en-US/profiles/darkyndy?sort=likes.html
developer.mozilla.org/en-US/profiles/cxmhiiemn00.html
developer.mozilla.org/en-US/profiles/CristianTincu.html
developer.mozilla.org/en-US/profiles/mediafirresp.html
developer.mozilla.org/en-US/profiles/petef.html
developer.mozilla.org/en-US/profiles/Bchristie.html
developer.mozilla.org/en-US/profiles/Steve McFarland.html
developer.mozilla.org/en-US/profiles/Jcubed.html
developer.mozilla.org/en-US/profiles/StevenGarrity.html
developer.mozilla.org/en-US/profiles/Alexdong.html
developer.mozilla.org/en-US/profiles/OcyTez.html
developer.mozilla.org/en-US/profiles/BrendanMcKeon.html

That is, we appear to have one file per profile... or at least, one file for each profile that's linked to from somewhere else. This info is already public, but I worry about making it available in a tarball like this. Perhaps I'm being too cautious... ?


5)
There are also miscellaneous files from places other than developer.mozilla.org pulled in by this. They don't wind up in the tarball, but wget does spend time fetching them.


In total, there are 357,000 files in the archive.



I think we'll need to do better on this. Any ideas?

I think we can fix at least some of these problems with a better wget:

"-D developer.mozilla.org" might solve problem #5.
"-X" might be able to solve problem #1, #2, and #4, given appropriate directory names to exclude.
"-R" might be able to solve problem #3 with an appropriate pattern.
Priority: P4 → --
I've changed the wget around a bit to hopefully counteract some/all of the errors from comment 16:

wget -q -m -p -k -E -T 5 -t 3 -R 'gz,bz2,zip,exe,download,flag*,login*,*\$history' -D developer.mozilla.org -X '*/profiles' -np https://developer.mozilla.org/en-US/

At a glance this seems to fix problems #1, #4, and #5. I'm not yet sure about #2... I'll know for sure when the run finishes. It seems to fix #3 in an indirect fashion (by prohibiting files like "/docs/SVG$history", which is where all the links to the individual revisions come from). I haven't had any luck restricting the individual revision links themselves... if something on the site links directly to a particular revision, this wget will still pull that revision in... but that's probably okay.

Note on problem #2... I think we did not actually submit feedback on those... clicking those "flag" links brings up a form window. Still, these contribute nothing towards the goal of this tarball, so I think it's still good to exclude them.
Priority: -- → P4
*Much* better this time around... took 323min to run (down from ~1200). The resulting tarball is 1.6GB (down from 2.3GB). Total file count is about 65,000 (down from 357,000).

For some reason there are still a fair number of 'login?next=...' page visits, but instead of 66,000 there are only 84 of them. I'm content to ignore this... although I'm puzzled as to why it happens in the first place.


One thing we still get a lot of that could probably be excluded is this:

developer.mozilla.org/en-US/docs/Eclipse_CDT.html (okay)
developer.mozilla.org/en-US/docs/Eclipse_CDT$json (not needed?)

The .html file is obviously good, but the $json one seems unnecessary for the stated purpose of this tarball. These $json files comprise over 21,000 of the files in the archive... almost a full 1/3rd of the file count (although not nearly that much of the total archive size... they seem to be very small).

Any thoughts on these files?
What's in them? If they don't add anything really useful, I'd remove them.
You can see them on the live site:
https://developer.mozilla.org/en-US/docs/Eclipse_CDT$json

I don't know where they're linked to that causes wget to find and download them.

[root@developeradm.private.scl3 docs]# wc -l Eclipse_CDT.html 
568 Eclipse_CDT.html

[root@developeradm.private.scl3 docs]# wc -l Eclipse_CDT\$json 
0 Eclipse_CDT$json

[root@developeradm.private.scl3 docs]# cat Eclipse_CDT\$json 
{"slug": "Eclipse_CDT", "title": "Eclipse CDT", "locale": "en-US", "summary": "", "url": "/en-US/docs/Eclipse_CDT", "id": 159}

(no line break at the end of the line, so the line count is zero)

I haven't examined these files exhaustively, but the couple I've looked at look approximately like this... one line of JSON-formatted metadata.
Thanks. I'd say remove them.
Generating this now...

wget -q -m -p -k -E -T 5 -t 3 -R 'gz,bz2,zip,exe,download,flag*,login*,*\$history,*\$json' -D developer.mozilla.org -X '*/profiles' -np https://developer.mozilla.org/en-US/
The run from comment 22 took 278 min, and the resulting tarball is 1.6GB, ~44,000 files.

I'm content with this as-is. ~4.6 hours is still far too long for a daily job IMO, but we can definitely run it weekly. It's already set up that way, and will run again on Sunday. Due to a permissions issue, it hasn't run automatically since my last manual run in comment 22... that's fixed now also.

Calling this one completed. If there are any problems, it should email them to cron-mdn@, as usual.
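For reference, the job boils down to something like the following (the staging path and docroot are placeholders here, not the actual script):

    #!/bin/bash
    # hypothetical sketch of the weekly tarball job
    set -e
    cd /data/mdn-mirror    # hypothetical staging directory
    wget -q -m -p -k -E -T 5 -t 3 \
         -R 'gz,bz2,zip,exe,download,flag*,login*,*\$history,*\$json' \
         -D developer.mozilla.org -X '*/profiles' -np \
         https://developer.mozilla.org/en-US/
    # build the tarball to a temp name, then publish atomically
    tar -czf developer.mozilla.org.tar.gz.tmp developer.mozilla.org
    mv developer.mozilla.org.tar.gz.tmp /srv/media/developer.mozilla.org.tar.gz   # hypothetical docroot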
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
No longer blocks: 756266
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Great job! However, I wonder if we really need _all_ the .MOVs and .OGVs in there.

Most of them seem to be in /presentations/, and each one is 200+ MB. I wonder whether everybody needs these.
But you decide.
Andreas, you're right. Tarballs shouldn't include video files, just the pages.
Exactly. Well, I guess they were just overlooked the first time.

Nice, so with the upcoming builds we can expect *significantly* smaller tarballs, since these alone account for several hundred MB. Looking forward to checking out the slimmed-down version.

Ben, do you want me to open a new "bug" for it, or can you do this "in-house"? :)
jakem is doing them.

Andreas, is this the only large thing that sticks out, or are there more classes of files we should exclude? (Only if it saves significant space)
I can tell you in the blink of an eye (8 MB threshold):

andy@andy-lubuntubox:/media/devdocs $ find developer.mozilla.org -size +8000000c

developer.mozilla.org/presentations/dbaron_architecture_14_dec_2006.mov
developer.mozilla.org/presentations/design_challenge_session_2-extension_bootcamp.ogv
developer.mozilla.org/presentations/jst_architecture_8_dec_2006.mov
developer.mozilla.org/presentations/screencasts/jresig-digg-firebug-jquery.mp4
developer.mozilla.org/presentations/seneca/MozillaLecture1Part1_Broadband.mov
developer.mozilla.org/presentations/seneca/MozillaLecture1Part2_Broadband.mov
developer.mozilla.org/presentations/seneca/MozillaLecture2Part1_Broadband.mov
developer.mozilla.org/presentations/seneca/MozillaLecture2Part2b_Broadband.mov
developer.mozilla.org/presentations/seneca/MozillaLecture2Part2_Broadband.mov
developer.mozilla.org/presentations/seneca/MozillaLecture3Part1_Broadband.mov
developer.mozilla.org/presentations/seneca/MozillaLecture3Part2_Broadband.mov
developer.mozilla.org/presentations/seneca/MozillaLecture4Part1_Broadband.mov
developer.mozilla.org/presentations/seneca/MozillaLecture4Part2_Broadband.mov
developer.mozilla.org/presentations/seneca/MozillaLecture5Part1_Broadband.mov
developer.mozilla.org/presentations/seneca/MozillaLecture5Part2_Broadband.mov

It was a good thing that you asked! Missed the *.mp4 in /presentations/screencasts.
I've updated the script to also exclude mp4, ogv, and mov files. Next time it runs, it should be much smaller.

For reference, the current size is 1.7GB. I don't know what % of that will be excluded here, but I'd imagine quite a bit... those files probably don't compress very well, and thus likely account for a very large % of the total tarball size.
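Roughly, the reject list now looks like this (a sketch of the change, not a verbatim copy of the deployed script):

    -R 'gz,bz2,zip,exe,download,mp4,ogv,mov,flag*,login*,*\$history,*\$json'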
QA Contact: cshields → nmaul
Downloaded the tarball tonight and had trouble extracting it. I ran 

    tar -xzvf ./developer.mozilla.org.tar.gz

but it failed with numerous errors like:

    tar: developer.mozilla.org/en-US/search?locale=*&kumascript_macros=outdated&page=3.html: Cannot open: Invalid argument

The cause is all the files with names like `search?locale=*&kumascript_macros=XULAttrInherited&page=28.html`, which are not valid on certain file systems, such as FAT.

The local fix: I extracted onto ext4 and then copied everything I could to the FAT drive.

Are these files useful? Could they be removed or renamed to allow extraction on FAT?
No, I don't think search? pages are useful. We should filter them out.
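One way to do that (untested suggestion) would be to add a 'search*' pattern to the existing -R reject list, or, if the wget on that host is 1.14 or newer, reject them by URL:

    # either extend the reject list...
    -R 'gz,bz2,zip,exe,download,flag*,login*,*\$history,*\$json,search*'
    # ...or reject any /search? URL outright (wget >= 1.14)
    --reject-regex '/search\?'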
Note: if you're using this MDN tarball feature, please see https://bugzilla.mozilla.org/show_bug.cgi?id=1041871#c5 as it may change the contents of the tarball.
Hi, I'm looking for a tarball version with all languages, and it seems this tarball has only the en-US language.
Can you make a tarball with all languages, or a tarball for each language?
Flags: needinfo?(nmaul)
Flags: needinfo?(lcrouch)
That's a question for :jakem if/when he has a chance to get back to the script.
Flags: needinfo?(lcrouch)
All languages?! The next important message you Mozilla guys will get on your servers will be "No space left on device" ;)
An additional comment on the languages issue:

e.g.

developer.mozilla.org/en-US/docs/Web/JavaScript/Dokumentacja_j\304%99zyka_JavaScript_1.5/Obiekty/Packages/netscape$revision/610611.html

This is VERY strange. en-US means "English [USA]", right? So what are these Polish-language documents doing inside this tree??
It looks like the whole structure is messed up quite a lot...
Update:
Deleted all the $revision stuff from my local copy, and did a find en-US/docs -type d -empty -delete. Much tidier now.
Apparently these foreign-language sub-trees mainly consist of EMPTY folders.
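(For reference, the $revision cleanup can be done with a single GNU find along these lines, run from inside the extracted developer.mozilla.org directory; the pattern is illustrative:)

    # remove per-revision copies, then the now-empty directories they leave behind
    find en-US/docs -depth -path '*$revision*' -delete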
Note: https://kapeli.com/mdn_offline is now available. We're going to find all the links from MDN to this tarball mirror and redirect them to that page. If we can do that and it works for our uses, we can likely kill this script.
I need to get the entire MDN, including docs for Firefox OS, SpiderMonkey and other Mozilla products. https://kapeli.com/mdn_offline only offers HTML, CSS, JavaScript, SVG and XSLT docs. Should I download https://developer.mozilla.org/media/developer.mozilla.org.tar.gz? Does this tarball contain up-to-date docs?
Yes, the tarball should include all MDN docs including Mozilla products. It should be updated by the cron job every week.
Thanks a lot, Luke. Another question: will it come as simple HTML pages, or as server-side scripts?
It will be an offline HTML package.
Thanks. I'm gonna download it with GNU Wget. :-)
Blocks: 1224953
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard