Closed Bug 757461 Opened 12 years ago Closed 12 years ago

MDN: generate and host a tarball mirror of MDN

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task, P4)

x86
macOS

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: groovecoder, Assigned: nmaul, NeedInfo)

Details

(Whiteboard: [triaged 20120910])

In bug 561470, :BenB devised a wget command and Apache config that make it possible to host a downloadable tarball of MDC content.

If it isn't too much work, we should see if we can set this up while we develop a solid "offline" feature for Kuma.
http://download.beonex.com/mirror/developer.mozilla.org-2012-05-21.tar.bz2

I did:
wget -m -p -k -E -T 5 -t 3 -R gz,bz2,zip,exe,download -np https://developer.mozilla.org/en/
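(For reference, the same invocation spelled out with long option names; behaviour is per the wget manual, and --adjust-extension is simply the newer name for -E, which older releases call --html-extension.)

    wget --mirror --page-requisites --convert-links --adjust-extension \
         --timeout=5 --tries=3 \
         --reject=gz,bz2,zip,exe,download \
         --no-parent \
         https://developer.mozilla.org/en/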

skins/ was posing a problem, because it's in robots.txt (why? can you remove that, please?), so I had to fetch it manually ([1], added to the new tarball) and convert the links manually.

Furthermore, I ran:
find . -type f -name "*.html" -print0 | xargs -0 -n 1 sed -i -e "s|https://developer.mozilla.org|http://mdn.beonex.com|g" -e "s|//www.google.com|//no.google.anymore|g"
The latter is important for privacy and EU laws, because Google keeps and analyzes logs about where users were.

Apache config:
<VirtualHost *>
    ServerName mdn.beonex.com
    ...
    <Location />
        DefaultType text/html
    </Location>
    <Location /skins/>
        DefaultType text/css
        # for /skins/common/js.php?perms...
        AddType text/css php
    </Location>
    <Location /media/css/>
        DefaultType text/css
    </Location>
    RewriteEngine on
    RewriteRule   ^/$  en/index.html  [R]
    RewriteRule   ^/docs$  en-US/docs.html  [R]
    RewriteRule   ^/learn$  en/index.html  [R]
    RewriteRule   ^(.*)/$  $1.html   [R]
</VirtualHost>
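As a worked example of the last rule (path purely illustrative), a directory-style request is redirected onto the .html file that wget -E actually wrote. A quick way to check once the vhost answers as mdn.beonex.com:

    # expect a redirect from the directory-style URL to the saved .html file
    curl -sI http://mdn.beonex.com/en/DOM/element/   # Location: .../en/DOM/element.html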

[1]
https://developer.mozilla.org/skins/common/css.php
https://developer.mozilla.org/skins/common/custom_css.php
https://developer.mozilla.org/skins/common/icons/icon-trans.gif
https://developer.mozilla.org/skins/common/js.php?perms=
https://developer.mozilla.org/skins/common/js.php?perms=LOGIN,BROWSE,READ,SUBSCRIBE
https://developer.mozilla.org/skins/common/print.css
https://developer.mozilla.org/skins/mdn/Transitional/css.php
https://developer.mozilla.org/skins/mdn/Transitional/img/mdn-logo-sm.png
https://developer.mozilla.org/skins/mdn/Transitional/img/mdn-logo-tiny.png
https://developer.mozilla.org/skins/mdn/Transitional/js/javascript.min.js
https://developer.mozilla.org/skins/mdn/Transitional/print.css
The tarball can also be used offline, FWIW.
Can we get opsec to sign off on this request? Kick it back to us and we'll get 'er done.
Assignee: server-ops → nobody
Component: Server Operations: Web Operations → Security Assurance: Operations
QA Contact: cshields → security-assurance
I believe the robots.txt list was created to alleviate the load issues mentioned, not as a security measure. Which means it's fine for us / signed off :)

If this is going to be a once-a-day thing, I believe you should use -e robots=off --wait 5 (seconds, or less), but I would ask webops about that, so back to them. Just make sure there is a wait time so that the crawl isn't flagged as flooding and doesn't kill the site if it's already under heavy load.
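For concreteness, that would mean something like this on top of the command from comment 1 (illustrative only; the exact wait value is up to webops):

    wget -e robots=off --wait=5 -m -p -k -E -T 5 -t 3 \
         -R gz,bz2,zip,exe,download -np https://developer.mozilla.org/en/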
Assignee: nobody → server-ops
Component: Security Assurance: Operations → Server Operations: Web Operations
QA Contact: security-assurance → cshields
kang, a once-a-day fetch with --wait 5 wouldn't work. There are some 20,000-80,000 documents, there are only 86,400 seconds in a day, and the server already takes 1-2 s to respond to each request, at least from Germany (on a fast uplink).
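Rough numbers, for illustration (taking ~1.5 s average response time on top of the wait):

    80,000 docs × (5 s wait + 1.5 s response) ≈ 520,000 s ≈ 6 days
    20,000 docs × 6.5 s                       ≈ 130,000 s ≈ 1.5 days
    seconds in a day                           =  86,400 s

Even the low end of the estimate doesn't fit into a daily window.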
In that case I'm OK with zero delay. Up to webops to agree as well, however (due to the load). It might be better to have a tarball created locally, but that's also up to you/webops.
> It might be better to have a tarball created locally

Yes, that's the point of this bug :)
might be a dup.
Assignee: server-ops → nmaul
(In reply to Phong Tran [:phong] from comment #8)
> might be a dup.

dup of which bug?
Blocks: 756266
2 questions:

1) Are we happy with the wget command in comment 1?

2) Where should this be hosted? Perhaps somewhere in /media/?


Once we're agreed on this, it should be an easy cronjob.
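For the record, a minimal sketch of what the cron entry could look like (script path and schedule are placeholders, not a real setup):

    # /etc/cron.d/mdn-tarball -- hypothetical; weekly run, Sunday 02:00
    0 2 * * 0 root /usr/local/bin/mdn-mirror.sh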
Priority: -- → P4
Whiteboard: [waiting][webdev]
1) IDK. :BenB?

2) Yes, and we can link to it from the site somewhere.
I think so, apart from the skins/ problem I mentioned. I am not aware of other problems at the moment, but it's been many months since I tried it, and that was before Kuma.
You can see the result of that (old, pre-kuma!) fetch on http://mdn.beonex.com/ and try for yourself.
As far as I can tell, there no longer *is* any robots.txt file at all... so /skins/ should get pulled in just like anything else.

I've written the basic script and it's running now. The tarball should be available here when it completes:

https://developer.mozilla.org/media/developer.mozilla.org.tar.gz

No idea how big it'll be or how long it'll take. I'm tentatively scheduling it for weekly processing. When the first run completes, we can take stock of those things and adjust as necessary.


For the record, the wget command I'm using is:
wget -q -m -p -k -E -T 5 -t 3 -R gz,bz2,zip,exe,download -np https://developer.mozilla.org/en-US/

Basically the same, but 'en' became 'en-US', and I added -q so cron output would be useful. If we should do other locales too, let me know... I imagine they could go into the same archive... depending on size that may or may not be desirable though.
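If we do want other locales, a simple (untested) variant would be to loop over a locale list and mirror each into the same tree before tarring; the locale list below is purely illustrative:

    # hypothetical sketch: mirror several locales into one tree
    for locale in en-US de fr ja; do
        wget -q -m -p -k -E -T 5 -t 3 -R gz,bz2,zip,exe,download -np \
            "https://developer.mozilla.org/${locale}/"
    done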
Whiteboard: [waiting][webdev] → [triaged 20120910]
Bringing in :openjck so he knows what's up and can help us file/plan/schedule a bug to add a link to the .tgz download on the site somewhere.
> For the record, the wget command I'm using is:
> wget -q -m -p -k -E -T 5 -t 3 -R gz,bz2,zip,exe,download -np
> https://developer.mozilla.org/en-US/

Thanks!

The URL 
<https://developer.mozilla.org/media/developer.mozilla.org.tar.gz> gives me 403 Forbidden, probably a file permissions/ownership issue.
Yeah, files in /media/ that don't exist apparently throw a 403. It didn't exist until the job finished... it works now, and is apparently around 2.3GB. It took almost 20 hours to generate, and includes only /en-US/ stuff. That's long enough that I'm concerned about running it weekly... monthly seems more feasible. Also I did not do anything with respect to content filtering (like the find/sed in comment 1)... not sure how long that would take, or how much we would want to filter.


But the real issue is, I am afraid this might be a very unclean dataset to work with. At a glance:


1)
I see about 66,000 files like this:

developer.mozilla.org/en-US/users/login?next=%2Fen-US%2Fdocs%2Fwindow.forward$locales.html
developer.mozilla.org/en-US/users/login?next=%2Fen-US%2Fdocs%2FXPCOM_Interface_Reference%2FamIInstallCallback$locales.html
developer.mozilla.org/en-US/users/login?next=%2Fen-US%2Fdocs%2Fnew?slug=Javadoc.html

Repeated calls to the login page with different arguments. This probably happens because the link to the login page is present (with a different argument) on most of the pages wget visits.


2)
I also see files like this:

developer.mozilla.org/en-US/demos/detail/walking-with-css3/flag?flag_type=inappropriate.html
developer.mozilla.org/en-US/demos/detail/walking-with-css3/flag?flag_type=notworking.html
developer.mozilla.org/en-US/demos/detail/walking-with-css3/flag?flag_type=plagarised.html

This makes me very nervous... did we just end up submitting a whole bunch of feedback for every single demo with this single wget?


3)
And then there's chunks like this:

developer.mozilla.org/en-US/docs/Components.utils.reportError$revision/94969.html
developer.mozilla.org/en-US/docs/Components.utils.reportError$revision/94979.html
developer.mozilla.org/en-US/docs/Components.utils.reportError$revision/94978.html
developer.mozilla.org/en-US/docs/Components.utils.reportError$revision/94982.html
developer.mozilla.org/en-US/docs/Components.utils.reportError$revision/94975.html

It appears that in many cases we're getting one file *per revision* of a document.


4)
There are also files like this:

developer.mozilla.org/en-US/profiles/darkyndy?sort=likes.html
developer.mozilla.org/en-US/profiles/cxmhiiemn00.html
developer.mozilla.org/en-US/profiles/CristianTincu.html
developer.mozilla.org/en-US/profiles/mediafirresp.html
developer.mozilla.org/en-US/profiles/petef.html
developer.mozilla.org/en-US/profiles/Bchristie.html
developer.mozilla.org/en-US/profiles/Steve McFarland.html
developer.mozilla.org/en-US/profiles/Jcubed.html
developer.mozilla.org/en-US/profiles/StevenGarrity.html
developer.mozilla.org/en-US/profiles/Alexdong.html
developer.mozilla.org/en-US/profiles/OcyTez.html
developer.mozilla.org/en-US/profiles/BrendanMcKeon.html

That is, we appear to have one file per profile... or at least, one file for each profile that's linked to from somewhere else. This info is already public, but I worry about making it available in a tarball like this. Perhaps I'm being too cautious... ?


5)
There are also miscellaneous files from places other than developer.mozilla.org pulled in by this. They don't wind up in the tarball, but wget does spend time fetching them.


In total, there are 357,000 files in the archive.



I think we'll need to do better on this. Any ideas?

I think we can fix at least some of these problems with a better wget:

"-D developer.mozilla.org" might solve problem #5.
"-X" might be able to solve problem #1, #2, and #4, given appropriate directory names to exclude.
"-R" might be able to solve problem #3 with an appropriate pattern.
Priority: P4 → --
I've changed the wget around a bit to hopefully counteract some/all of the errors from comment 16:

wget -q -m -p -k -E -T 5 -t 3 -R 'gz,bz2,zip,exe,download,flag*,login*,*\$history' -D developer.mozilla.org -X '*/profiles' -np https://developer.mozilla.org/en-US/

At a glance this seems to fix problems #1, #4, and #5. I'm not yet sure about #2... I'll know for sure when the run finishes. It seems to fix #3 in an indirect fashion (by prohibiting files like "/docs/SVG$history", which is where all the links to the individual revisions come from). I haven't had any luck restricting the individual revision links themselves... if something on the site links directly to a particular revision, this wget will still pull that revision in... but that's probably okay.

Note on problem #2... I think we did not actually submit feedback on those... clicking those "flag" links brings up a form window. Still, these contribute nothing towards the goal of this tarball, so I think it's still good to exclude them.
Priority: -- → P4
*Much* better this time around... took 323min to run (down from ~1200). The resulting tarball is 1.6GB (down from 2.3GB). Total file count is about 65,000 (down from 357,000).

For some reason there are still a fair number of 'login?next=...' page visits, but instead of 66,000 there are only 84 of them. I'm content to ignore this... although I'm puzzled as to why it happens in the first place.


One thing we still get a lot of that could probably be excluded is this:

developer.mozilla.org/en-US/docs/Eclipse_CDT.html (okay)
developer.mozilla.org/en-US/docs/Eclipse_CDT$json (not needed?)

The .html file is obviously good, but the $json one seems unnecessary for the stated purpose of this tarball. These $json files comprise over 21,000 of the files in the archive... almost a full 1/3rd of the file count (although not nearly that much of the total archive size... they seem to be very small).

Any thoughts on these files?
What's in them? If they don't add anything really useful, I'd remove them.
You can see them on the live site:
https://developer.mozilla.org/en-US/docs/Eclipse_CDT$json

I don't know where they're linked to that causes wget to find and download them.

[root@developeradm.private.scl3 docs]# wc -l Eclipse_CDT.html 
568 Eclipse_CDT.html

[root@developeradm.private.scl3 docs]# wc -l Eclipse_CDT\$json 
0 Eclipse_CDT$json

[root@developeradm.private.scl3 docs]# cat Eclipse_CDT\$json 
{"slug": "Eclipse_CDT", "title": "Eclipse CDT", "locale": "en-US", "summary": "", "url": "/en-US/docs/Eclipse_CDT", "id": 159}

(no line break at the end of the line, so the line count is zero)

I haven't examined these files exhaustively, but the couple I've looked at look approximately like this... one line of JSON-formatted metadata.
Thanks. I'd say remove them.
Generating this now...

wget -q -m -p -k -E -T 5 -t 3 -R 'gz,bz2,zip,exe,download,flag*,login*,*\$history,*\$json' -D developer.mozilla.org -X '*/profiles' -np https://developer.mozilla.org/en-US/
The run from comment 22 took 278 min, and the resulting tarball is 1.6GB, ~44,000 files.

I'm content with this as-is. ~4.6 hours is still far too long for a daily job IMO, but we can definitely run it weekly. It's already set up that way, and will run again on Sunday. Due to a permissions issue, it hasn't run automatically since my last manual run in comment 22... that's fixed now also.

Calling this one completed. If there are any problems, it should email them to cron-mdn@, as usual.
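For reference, the job boils down to something like the following (the staging path and docroot are placeholders here, not the actual script):

    #!/bin/bash
    # hypothetical sketch of the weekly tarball job
    set -e
    cd /data/mdn-mirror    # hypothetical staging directory
    wget -q -m -p -k -E -T 5 -t 3 \
         -R 'gz,bz2,zip,exe,download,flag*,login*,*\$history,*\$json' \
         -D developer.mozilla.org -X '*/profiles' -np \
         https://developer.mozilla.org/en-US/
    # build the tarball to a temp name, then publish atomically
    tar -czf developer.mozilla.org.tar.gz.tmp developer.mozilla.org
    mv developer.mozilla.org.tar.gz.tmp /srv/media/developer.mozilla.org.tar.gz   # hypothetical docroot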
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
No longer blocks: 756266
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Great job! However, I wonder if we really need _all_ the .MOVs and .OGVs in there.

Most of them seem to be in /presentations/, and each one is 200+ MB. I wonder whether everybody needs these.
But you decide.
Andreas, you're right. Tarballs shouldn't include video files, just the pages.
Exactly. Well, I guess they were just overlooked the first time.

Nice, so with the upcoming builds we can expect *significantly* smaller tarballs, since these alone account for several hundred MB. Looking forward to checking out the slimmed-down version.

Ben, do you want me to open a new "bug" for it, or can you do this "in-house"? :)
jakem is doing them.

Andreas, is this the only large thing that sticks out, or are there more classes of files we should exclude? (Only if it saves significant space)
I can tell you in the blink of an eye (8 MB threshold):

andy@andy-lubuntubox:/media/devdocs $ find developer.mozilla.org -size +8000000c

developer.mozilla.org/presentations/dbaron_architecture_14_dec_2006.mov
developer.mozilla.org/presentations/design_challenge_session_2-extension_bootcamp.ogv
developer.mozilla.org/presentations/jst_architecture_8_dec_2006.mov
developer.mozilla.org/presentations/screencasts/jresig-digg-firebug-jquery.mp4
developer.mozilla.org/presentations/seneca/MozillaLecture1Part1_Broadband.mov
developer.mozilla.org/presentations/seneca/MozillaLecture1Part2_Broadband.mov
developer.mozilla.org/presentations/seneca/MozillaLecture2Part1_Broadband.mov
developer.mozilla.org/presentations/seneca/MozillaLecture2Part2b_Broadband.mov
developer.mozilla.org/presentations/seneca/MozillaLecture2Part2_Broadband.mov
developer.mozilla.org/presentations/seneca/MozillaLecture3Part1_Broadband.mov
developer.mozilla.org/presentations/seneca/MozillaLecture3Part2_Broadband.mov
developer.mozilla.org/presentations/seneca/MozillaLecture4Part1_Broadband.mov
developer.mozilla.org/presentations/seneca/MozillaLecture4Part2_Broadband.mov
developer.mozilla.org/presentations/seneca/MozillaLecture5Part1_Broadband.mov
developer.mozilla.org/presentations/seneca/MozillaLecture5Part2_Broadband.mov

It was a good thing that you asked! Missed the *.mp4 in /presentations/screencasts.
I've updated the script to also exclude mp4, ogv, and mov files. Next time it runs, it should be much smaller.

For reference, the current size is 1.7GB. I don't know what % of that will be excluded here, but I'd imagine quite a bit... those files probably don't compress very well, and thus likely account for a very large % of the total tarball size.
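Roughly, the reject list now looks like this (a sketch of the change, not a verbatim copy of the deployed script):

    -R 'gz,bz2,zip,exe,download,mp4,ogv,mov,flag*,login*,*\$history,*\$json'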
QA Contact: cshields → nmaul
Downloaded the tarball tonight and had trouble extracting it. I ran 

    tar -xzvf ./developer.mozilla.org.tar.gz

but it failed with numerous errors like:

    tar: developer.mozilla.org/en-US/search?locale=*&kumascript_macros=outdated&page=3.html: Cannot open: Invalid argument

The cause is all the files with names like `search?locale=*&kumascript_macros=XULAttrInherited&page=28.html`, which are not valid on certain file systems, such as FAT.

The local fix: I extracted onto ext4 and then copied everything I could to the FAT drive.

Are these files useful? Could they be removed or renamed to allow extraction on FAT?
No, I don't think search? pages are useful. We should filter them out.
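One way to do that (untested suggestion) would be to add a 'search*' pattern to the existing -R reject list, or, if the wget on that host is 1.14 or newer, reject them by URL:

    # either extend the reject list...
    -R 'gz,bz2,zip,exe,download,flag*,login*,*\$history,*\$json,search*'
    # ...or reject any /search? URL outright (wget >= 1.14)
    --reject-regex '/search\?'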
Note: if you're using this MDN tarball feature, please see https://bugzilla.mozilla.org/show_bug.cgi?id=1041871#c5 as it may change the contents of the tarball.
Hi, I'm looking for a tarball version with all languages, and it seems this tarball has only the en-US language.
Can you make a tarball with all languages, or a tarball for each language?
Flags: needinfo?(nmaul)
Flags: needinfo?(lcrouch)
That's a question for :jakem if/when he has a chance to get back to the script.
Flags: needinfo?(lcrouch)
All languages?! The next important message you Mozilla guys will get on your servers will be "No space left on device" ;)
An additional comment on the languages issue:

e.g.

developer.mozilla.org/en-US/docs/Web/JavaScript/Dokumentacja_j\304%99zyka_JavaScript_1.5/Obiekty/Packages/netscape$revision/610611.html

This is VERY strange. en-US means "English [USA]", right? So what are these Polish-language documents doing inside this tree??
It looks like the whole structure is messed up quite a lot...
Update:
Deleted all the $revision stuff from my local copy, and did a find en-US/docs -type d -empty -delete. Much tidier now.
Apparently these foreign-language sub-trees mainly consist of EMPTY folders.
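(For reference, the $revision cleanup can be done with a single GNU find along these lines, run from inside the extracted developer.mozilla.org directory; the pattern is illustrative:)

    # remove per-revision copies, then the now-empty directories they leave behind
    find en-US/docs -depth -path '*$revision*' -delete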
Note: https://kapeli.com/mdn_offline is now available. We're going to find all the links from MDN to this tarball mirror and redirect them to that page. If we can do that and it works for our uses, we can likely kill this script.
I need to get the entire MDN, including docs for Firefox OS, SpiderMonkey and other Mozilla products. https://kapeli.com/mdn_offline only offers HTML, CSS, JavaScript, SVG and XSLT docs. Should I download https://developer.mozilla.org/media/developer.mozilla.org.tar.gz? Does this tarball contain up-to-date docs?
Yes, the tarball should include all MDN docs including Mozilla products. It should be updated by the cron job every week.
Thanks a lot, Luke. Another question: will it come as simple HTML pages, or as server-side scripts?
It will be an offline HTML package.
Thanks. I'm gonna download it with GNU Wget. :-)
Blocks: 1224953
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard