Closed
Bug 757461
Opened 12 years ago
Closed 12 years ago
MDN: generate and host a tarball mirror of MDN
Categories
(Infrastructure & Operations Graveyard :: WebOps: Other, task, P4)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: groovecoder, Assigned: nmaul, NeedInfo)
References
Details
(Whiteboard: [triaged 20120910])
In bug 561470, :BenB has devised a wget mode and Apache config to make it possible to host a downloadable tarball of MDC content. If it isn't too much work, we should see if we can set this up while we develop a solid "offline" feature for Kuma.
Comment 1•12 years ago
http://download.beonex.com/mirror/developer.mozilla.org-2012-05-21.tar.bz2

I did:
wget -m -p -k -E -T 5 -t 3 -R gz,bz2,zip,exe,download -np https://developer.mozilla.org/en/

skins/ was posing a problem, because it's in robots.txt (why? can you remove that, please?), so I had to fetch it manually ([1], added to the new tarball) and convert the links manually.

Furthermore, I did:
find . -type f -name "*.html" -print0 | xargs -0 -n 1 sed -e "s|https://developer.mozilla.org|http://mdn.beonex.com|" -e "s|//www.google.com|//no.google.anymore|"

The latter is important for privacy and EU laws, because Google keeps and analyzes logs about where users were.

Apache config:
<VirtualHost *>
  ServerName mdn.beonex.com
  ...
  <Location />
    DefaultType text/html
  </Location>
  <Location /skins/>
    DefaultType text/css
    # for /skins/common/js.php?perms...
    AddType text/css php
  </Location>
  <Location /media/css/>
    DefaultType text/css
  </Location>
  RewriteEngine on
  RewriteRule ^/$ en/index.html [R]
  RewriteRule ^/docs$ en-US/docs.html [R]
  RewriteRule ^/learn$ en/index.html [R]
  RewriteRule ^(.*)/$ $1.html [R]
</VirtualHost>

[1]
https://developer.mozilla.org/skins/common/css.php
https://developer.mozilla.org/skins/common/custom_css.php
https://developer.mozilla.org/skins/common/icons/icon-trans.gif
https://developer.mozilla.org/skins/common/js.php?perms=
https://developer.mozilla.org/skins/common/js.php?perms=LOGIN,BROWSE,READ,SUBSCRIBE
https://developer.mozilla.org/skins/common/print.css
https://developer.mozilla.org/skins/mdn/Transitional/css.php
https://developer.mozilla.org/skins/mdn/Transitional/img/mdn-logo-sm.png
https://developer.mozilla.org/skins/mdn/Transitional/img/mdn-logo-tiny.png
https://developer.mozilla.org/skins/mdn/Transitional/js/javascript.min.js
https://developer.mozilla.org/skins/mdn/Transitional/print.css
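For anyone reproducing the find/sed rewrite above, here is a minimal, self-contained sketch on a throwaway directory. Two assumptions: the `-print 0` in the quoted command is corrected to `-print0`, and GNU `sed -i` is used so the edits land in the files (as quoted, sed would only write to stdout); the sample page content is invented for illustration.

```shell
# Sample page resembling mirrored output (hypothetical content).
mkdir -p /tmp/mdn-demo
cat > /tmp/mdn-demo/index.html <<'EOF'
<a href="https://developer.mozilla.org/en/docs">Docs</a>
<script src="//www.google.com/urchin.js"></script>
EOF

# Same rewrite as in this comment: repoint absolute links at the
# mirror host and neutralize the Google reference.
find /tmp/mdn-demo -type f -name "*.html" -print0 | xargs -0 -n 1 sed -i \
  -e "s|https://developer.mozilla.org|http://mdn.beonex.com|" \
  -e "s|//www.google.com|//no.google.anymore|"

result=$(cat /tmp/mdn-demo/index.html)
echo "$result"
```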
Comment 2•12 years ago
The tarball can also be used offline, FWIW.
Comment 3•12 years ago
Can we get opsec to sign off on this request and kick it back to us, and we'll get 'er done.
Assignee: server-ops → nobody
Component: Server Operations: Web Operations → Security Assurance: Operations
QA Contact: cshields → security-assurance
I would believe that the robots.txt list has been created to alleviate the said load issues, not as a security measure. Which means it's fine for us / signed off :) If this is going to be a once-a-day thing, I believe you should use -e robots=off --wait 5s (or less), but I would ask webops for that, so back to them. Just make sure you have a wait time, so that it's not registered as flooding and doesn't kill the site if it's already under heavy load.
Assignee: nobody → server-ops
Component: Security Assurance: Operations → Server Operations: Web Operations
QA Contact: security-assurance → cshields
Comment 5•12 years ago
kang, a once-a-day fetch with --wait 5 wouldn't work. There are some 20-80,000 documents. There are only 86400 seconds in a day, and the server already needs 1-2s to respond to requests, at least from Germany (fast uplink).
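A back-of-the-envelope check of that claim, using only the numbers in this comment (a sketch; 20,000 is the low end of the document-count estimate):

```shell
docs=20000     # low end of the 20-80,000 estimate
wait_s=5       # the proposed --wait 5
resp_s=2       # upper bound of the 1-2s response time cited
day_s=86400    # seconds in a day

total_s=$(( docs * (wait_s + resp_s) ))
fits_in_a_day=$(( total_s <= day_s ))   # 0 = does not fit, even at the low end
echo "estimated crawl: ${total_s}s vs ${day_s}s in a day"
```

Even with the smallest document count, the crawl needs 140,000 seconds, well over a day, so a once-a-day fetch with that delay is infeasible.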
In that case I'm ok with zero delay. Up to webops to also agree, however (due to the load). It might be better to have a tarball created locally, but that's also up to you/webops.
Comment 7•12 years ago
> It might be better to have a tarball created locally
Yes, that's the point of this bug :)
Comment 9•12 years ago
(In reply to Phong Tran [:phong] from comment #8)
> might be a dup.

dup of which bug?
Assignee
Comment 10•12 years ago
2 questions:
1) Are we happy with the wget command in comment 1?
2) Where should this be hosted? Perhaps somewhere in /media/?

Once we're agreed on this, it should be an easy cronjob.
Priority: -- → P4
Whiteboard: [waiting][webdev]
Reporter
Comment 11•12 years ago
1) IDK. :BenB?
2) Yes, and we can link to it from the site somewhere.
Comment 12•12 years ago
I think so, apart from the skins/ problem I mentioned. I am not aware of other problems at the moment, but it was many months ago that I tried it, and that was before Kuma. You can see the result of that (old, pre-Kuma!) fetch on http://mdn.beonex.com/ and try for yourself.
Assignee
Comment 13•12 years ago
As far as I can tell, there no longer *is* any robots.txt file at all... so /skins/ should get pulled in just like anything else.

I've written the basic script and it's running now. The tarball should be available here when it completes:
https://developer.mozilla.org/media/developer.mozilla.org.tar.gz

No idea how big it'll be or how long it'll take. I'm tentatively scheduling it for weekly processing. When the first run completes, we can take stock of those things and adjust as necessary.

For the record, the wget command I'm using is:
wget -q -m -p -k -E -T 5 -t 3 -R gz,bz2,zip,exe,download -np https://developer.mozilla.org/en-US/

Basically the same, but 'en' became 'en-US', and I added -q so cron output would be useful. If we should do other locales too, let me know... I imagine they could go into the same archive... depending on size, that may or may not be desirable though.
Assignee
Updated•12 years ago
Whiteboard: [waiting][webdev] → [triaged 20120910]
Reporter
Comment 14•12 years ago
Bringing in :openjck so he knows what's up and can help us file/plan/schedule a bug to add a link to the .tgz download on the site somewhere.
Comment 15•12 years ago
> For the record, the wget command I'm using is:
> wget -q -m -p -k -E -T 5 -t 3 -R gz,bz2,zip,exe,download -np
> https://developer.mozilla.org/en-US/

Thanks!

The URL <https://developer.mozilla.org/media/developer.mozilla.org.tar.gz> gives me 403 Forbidden, probably a file permissions/ownership issue.
Assignee
Comment 16•12 years ago
Yeah, files in /media/ that don't exist apparently throw a 403. It didn't exist until the job finished... it works now, and is apparently around 2.3GB. It took almost 20 hours to generate, and includes only /en-US/ stuff. That's long enough that I'm concerned about running it weekly... monthly seems more feasible.

Also, I did not do anything with respect to content filtering (like the find/sed in comment 1)... not sure how long that would take, or how much we would want to filter.

But the real issue is, I am afraid this might be a very unclean dataset to work with. At a glance:

1) I see about 66,000 files like this:
developer.mozilla.org/en-US/users/login?next=%2Fen-US%2Fdocs%2Fwindow.forward$locales.html
developer.mozilla.org/en-US/users/login?next=%2Fen-US%2Fdocs%2FXPCOM_Interface_Reference%2FamIInstallCallback$locales.html
developer.mozilla.org/en-US/users/login?next=%2Fen-US%2Fdocs%2Fnew?slug=Javadoc.html

Repeated calls to the login page with different arguments. This probably happens because a link to the login page is present (with a different argument) on most pages wget visits.

2) I also see files like this:
developer.mozilla.org/en-US/demos/detail/walking-with-css3/flag?flag_type=inappropriate.html
developer.mozilla.org/en-US/demos/detail/walking-with-css3/flag?flag_type=notworking.html
developer.mozilla.org/en-US/demos/detail/walking-with-css3/flag?flag_type=plagarised.html

This makes me very nervous... did we just end up submitting a whole bunch of feedback for every single demo with this single wget?

3) And then there's chunks like this:
developer.mozilla.org/en-US/docs/Components.utils.reportError$revision/94969.html
developer.mozilla.org/en-US/docs/Components.utils.reportError$revision/94979.html
developer.mozilla.org/en-US/docs/Components.utils.reportError$revision/94978.html
developer.mozilla.org/en-US/docs/Components.utils.reportError$revision/94982.html
developer.mozilla.org/en-US/docs/Components.utils.reportError$revision/94975.html

It appears that in many cases we're getting one file *per revision* of a document.

4) There's also files like this:
developer.mozilla.org/en-US/profiles/darkyndy?sort=likes.html
developer.mozilla.org/en-US/profiles/cxmhiiemn00.html
developer.mozilla.org/en-US/profiles/CristianTincu.html
developer.mozilla.org/en-US/profiles/mediafirresp.html
developer.mozilla.org/en-US/profiles/petef.html
developer.mozilla.org/en-US/profiles/Bchristie.html
developer.mozilla.org/en-US/profiles/Steve McFarland.html
developer.mozilla.org/en-US/profiles/Jcubed.html
developer.mozilla.org/en-US/profiles/StevenGarrity.html
developer.mozilla.org/en-US/profiles/Alexdong.html
developer.mozilla.org/en-US/profiles/OcyTez.html
developer.mozilla.org/en-US/profiles/BrendanMcKeon.html

That is, we appear to have one file per profile... or at least, one file for each profile that's linked to from somewhere else. This info is already public, but I worry about making it available in a tarball like this. Perhaps I'm being too cautious...?

5) There are also miscellaneous files from places other than developer.mozilla.org pulled in by this. They don't wind up in the tarball, but wget does spend time fetching them.

In total, there are 357,000 files in the archive. I think we'll need to do better on this. Any ideas?

I think we can fix at least some of these problems with a better wget:
"-D developer.mozilla.org" might solve problem #5.
"-X" might be able to solve problems #1, #2, and #4, given appropriate directory names to exclude.
"-R" might be able to solve problem #3 with an appropriate pattern.
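The four URL-shaped problem classes above can be spotted mechanically. Here is a hypothetical helper (the function name and bucket labels are invented for illustration), checked against the example paths quoted in this comment:

```shell
# Classify a mirrored path into the problem buckets from comment 16.
classify() {
  case "$1" in
    */users/login\?*)   echo login    ;;  # problem 1
    *flag\?flag_type=*) echo flag     ;;  # problem 2
    *\$revision/*)      echo revision ;;  # problem 3
    */profiles/*)       echo profile  ;;  # problem 4
    *)                  echo keep     ;;
  esac
}

a=$(classify 'developer.mozilla.org/en-US/users/login?next=%2Fen-US%2Fdocs.html')
b=$(classify 'developer.mozilla.org/en-US/demos/detail/walking-with-css3/flag?flag_type=notworking.html')
c=$(classify 'developer.mozilla.org/en-US/docs/Components.utils.reportError$revision/94969.html')
d=$(classify 'developer.mozilla.org/en-US/profiles/petef.html')
e=$(classify 'developer.mozilla.org/en-US/docs/Eclipse_CDT.html')
```

The same patterns map naturally onto wget's -R/-X options in the later comments.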
Priority: P4 → --
Assignee
Comment 17•12 years ago
I've changed the wget around a bit to hopefully counteract some/all of the errors from comment 16:
wget -q -m -p -k -E -T 5 -t 3 -R 'gz,bz2,zip,exe,download,flag*,login*,*\$history' -D developer.mozilla.org -X '*/profiles' -np https://developer.mozilla.org/en-US/

At a glance this seems to fix problems #1, #4, and #5. I'm not yet sure about #2... will know for sure when the run finishes. It seems to fix #3 in an indirect fashion (by prohibiting files like "/docs/SVG$history", which is where all the links to the individual revisions come from). I haven't had any luck in restricting the individual revision links themselves... if something on the site links directly to a particular revision, this wget will still pull that revision in... but that's probably okay.

Note on problem #2... I think we did not actually submit feedback on those... clicking those "flag" links brings up a form window. Still, these contribute nothing towards the goal of this tarball, so I think it's still good to exclude them.
Priority: -- → P4
Assignee
Comment 18•12 years ago
*Much* better this time around... took 323 min to run (down from ~1200). The resulting tarball is 1.6GB (down from 2.3GB). Total file count is about 65,000 (down from 357,000).

For some reason there's still a good amount of 'login?next=...' page visits, but instead of 66,000 there's only 84 of them. I'm content to ignore this... although I'm puzzled by why this should be the case in the first place.

One thing we still get a lot of that could probably be excluded is this:
developer.mozilla.org/en-US/docs/Eclipse_CDT.html (okay)
developer.mozilla.org/en-US/docs/Eclipse_CDT$json (not needed?)

The .html file is obviously good, but the $json one seems unnecessary for the stated purpose of this tarball. These $json files comprise over 21,000 of the files in the archive... almost a full 1/3rd of the file count (although not nearly that much of the total archive size... they seem to be very small).

Any thoughts on these files?
Comment 19•12 years ago
What's in them? If they don't add anything really useful, I'd remove them.
Assignee
Comment 20•12 years ago
You can see them on the live site:
https://developer.mozilla.org/en-US/docs/Eclipse_CDT$json

I don't know where they're linked to that causes wget to find and download them.

[root@developeradm.private.scl3 docs]# wc -l Eclipse_CDT.html
568 Eclipse_CDT.html
[root@developeradm.private.scl3 docs]# wc -l Eclipse_CDT\$json
0 Eclipse_CDT$json
[root@developeradm.private.scl3 docs]# cat Eclipse_CDT\$json
{"slug": "Eclipse_CDT", "title": "Eclipse CDT", "locale": "en-US", "summary": "", "url": "/en-US/docs/Eclipse_CDT", "id": 159}

(no line break at the end of the file, so the line count is zero)

I haven't examined these files exhaustively, but the couple I've looked at look approximately like this... one line of JSON-formatted metadata.
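Since each $json file is a single flat key-value line like the one shown above, individual fields can be pulled out with sed. A minimal sketch (the helper name is invented; this relies on the flat one-line layout and is not a general JSON parser):

```shell
# The exact metadata line quoted in this comment.
meta='{"slug": "Eclipse_CDT", "title": "Eclipse CDT", "locale": "en-US", "summary": "", "url": "/en-US/docs/Eclipse_CDT", "id": 159}'

# Crude string-field extraction: grab the quoted value after a given key.
json_field() {
  printf '%s' "$2" | sed -n "s/.*\"$1\": \"\([^\"]*\)\".*/\1/p"
}

slug=$(json_field slug "$meta")
locale=$(json_field locale "$meta")
echo "$slug / $locale"
```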
Comment 21•12 years ago
Thanks. I'd say remove them.
Assignee
Comment 22•12 years ago
Generating this now...
wget -q -m -p -k -E -T 5 -t 3 -R 'gz,bz2,zip,exe,download,flag*,login*,*\$history,*\$json' -D developer.mozilla.org -X '*/profiles' -np https://developer.mozilla.org/en-US/
Assignee
Comment 23•12 years ago
The generation run in comment 22 took 278 min, and the resulting tarball is 1.6GB, ~44,000 files. I'm content with this as-is. ~4.6 hours is still far too long for a daily job IMO, but we can definitely run it weekly. It's already in place this way, and will run again on Sunday.

Due to a permissions issue, it hasn't run automatically since my last manual run in comment 22... that's fixed now also.

Calling this one completed. If there are any problems, it should email them to cron-mdn@, as usual.
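The weekly schedule itself isn't shown in the bug; a crontab entry for it might look like the following sketch (the script path, exact time, and mail domain are assumptions for illustration; only the cron-mdn@ alias is from the comment):

```
# Hypothetical crontab fragment: run the mirror job early Sunday morning.
# wget runs with -q, so any mailed output should be errors only.
MAILTO=cron-mdn@example.com   # real alias is cron-mdn@ (domain not shown in the bug)
0 2 * * 0  /usr/local/bin/mdn-tarball-mirror.sh
```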
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Updated•11 years ago
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Comment 25•11 years ago
Great job! However, I wonder if we really need _all_ the .MOVs and .OGVs in there. Most of them seem to be in /presentations/, and each of them is 200+ MB. I wonder if everybody would need these. But you decide.
Comment 26•11 years ago
Andreas, you're right. Tarballs shouldn't include video files, just the pages.
Comment 27•11 years ago
Exactly. Well, I guess they were just overlooked the first time. Nice, so with the upcoming builds we might expect some *significantly* smaller tarballs, since these alone take up several hundred MB. Looking forward to checking out the slimmed-down version. Ben, do you want me to open a new "bug" for it, or can you do this "in-house"? :)
Comment 28•11 years ago
jakem is doing them. Andreas, is this the only large thing that sticks out, or are there more classes of files we should exclude? (Only if it saves significant space)
Comment 29•11 years ago
I can tell you in the blink of an eye (8 MB threshold):

andy@andy-lubuntubox:/media/devdocs $ find developer.mozilla.org -size +8000000c
developer.mozilla.org/presentations/dbaron_architecture_14_dec_2006.mov
developer.mozilla.org/presentations/design_challenge_session_2-extension_bootcamp.ogv
developer.mozilla.org/presentations/jst_architecture_8_dec_2006.mov
developer.mozilla.org/presentations/screencasts/jresig-digg-firebug-jquery.mp4
developer.mozilla.org/presentations/seneca/MozillaLecture1Part1_Broadband.mov
developer.mozilla.org/presentations/seneca/MozillaLecture1Part2_Broadband.mov
developer.mozilla.org/presentations/seneca/MozillaLecture2Part1_Broadband.mov
developer.mozilla.org/presentations/seneca/MozillaLecture2Part2b_Broadband.mov
developer.mozilla.org/presentations/seneca/MozillaLecture2Part2_Broadband.mov
developer.mozilla.org/presentations/seneca/MozillaLecture3Part1_Broadband.mov
developer.mozilla.org/presentations/seneca/MozillaLecture3Part2_Broadband.mov
developer.mozilla.org/presentations/seneca/MozillaLecture4Part1_Broadband.mov
developer.mozilla.org/presentations/seneca/MozillaLecture4Part2_Broadband.mov
developer.mozilla.org/presentations/seneca/MozillaLecture5Part1_Broadband.mov
developer.mozilla.org/presentations/seneca/MozillaLecture5Part2_Broadband.mov

It was a good thing that you asked! I had missed the *.mp4 in /presentations/screencasts.
Assignee
Comment 30•11 years ago
I've updated the script to also exclude mp4, ogv, and mov files. Next time it runs, it should be much smaller. For reference, the current size is 1.7GB. I don't know what % of that will be excluded here, but I'd imagine quite a bit... I expect those files probably don't compress very well, and thus probably comprise a very large % of the total tarball size.
QA Contact: cshields → nmaul
Comment 31•10 years ago
Downloaded the tarball tonight and had trouble extracting it. I ran:
tar -xzvf ./developer.mozilla.org.tar.gz

but it failed with numerous errors like:
tar: developer.mozilla.org/en-US/search?locale=*&kumascript_macros=outdated&page=3.html: Cannot open: Invalid argument

The cause is all the files with names like `search?locale=*&kumascript_macros=XULAttrInherited&page=28.html`; these names are not valid on certain file systems, such as FAT. The local fix: I extracted to ext4 and copied everything I could to the FAT drive.

Are these files useful? Could they be removed or renamed to allow extraction on FAT?
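A rename along the lines suggested here could look like this minimal sketch (the helper name is invented; it replaces `?` and `*`, which FAT forbids, plus `&` for tidiness; a fuller version would also handle `: " < > | \`):

```shell
# Map FAT-hostile characters in a filename to underscores.
fat_safe() {
  printf '%s' "$1" | tr '?*&' '___'
}

# One of the failing names from this comment:
safe=$(fat_safe 'search?locale=*&kumascript_macros=outdated&page=3.html')
echo "$safe"
```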
Reporter
Comment 32•10 years ago
No, I don't think search? pages are useful. We should filter them out.
Reporter
Comment 33•10 years ago
Note: if you're using this MDN tarball feature, please see https://bugzilla.mozilla.org/show_bug.cgi?id=1041871#c5 as it may change the contents of the tarball.
Comment 34•9 years ago
Hi, I'm looking for a tarball version with all languages, and it seems this tarball has only the en-US language. Can you make a tarball with all languages, or tarballs for each language?
Flags: needinfo?(nmaul)
Flags: needinfo?(lcrouch)
Reporter
Comment 35•9 years ago
That's a question for :jakem if/when he has a chance to get back to the script.
Flags: needinfo?(lcrouch)
Comment 36•9 years ago
All languages?! The next important message you Mozilla guys will get on your servers will be "No space left on device" ;)
Comment 37•9 years ago
Additional comment on the languages thing, e.g.:
developer.mozilla.org/en-US/docs/Web/JavaScript/Dokumentacja_j\304%99zyka_JavaScript_1.5/Obiekty/Packages/netscape$revision/610611.html

This is VERY strange. en-US means "English [USA]", right? So what are these Polish-language documents doing inside this tree? It looks like the whole organization is messed up quite a lot...
Comment 38•9 years ago
Update: Deleted all the $revision stuff from my local copy, and did a find en-US/docs -type d -empty -delete. Much more tidied-up now. Apparently these foreign-language sub-trees mainly consist of EMPTY folders.
Reporter
Comment 39•9 years ago
Note: https://kapeli.com/mdn_offline is now available. We're going to find all the links from MDN to this tarball mirror and redirect them to that page. If we can do that and it works for our uses, we can likely kill this script.
Comment 40•9 years ago
I need to get the entire MDN, including docs for Firefox OS, SpiderMonkey and other Mozilla products. https://kapeli.com/mdn_offline is just offering HTML, CSS, JavaScript, SVG and XSLT docs. Should I download https://developer.mozilla.org/media/developer.mozilla.org.tar.gz? Does this tarball contain up-to-date docs?
Reporter
Comment 41•9 years ago
Yes, the tarball should include all MDN docs including Mozilla products. It should be updated by the cron job every week.
Comment 42•9 years ago
Thanks a lot, Luke. Another question: will it come as simple HTML pages, or as server-side scripts?
Reporter
Comment 43•9 years ago
It will be an offline HTML package.
Comment 44•9 years ago
Thanks. I'm gonna download it with GNU Wget. :-)
Updated•5 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard