Closed
Bug 362152
Opened 18 years ago
Closed 17 years ago
code for grabbing/cleaning set of web pages.
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: anodelman, Assigned: anodelman)
References
()
Details
Attachments
(1 file, 3 obsolete files)
I've created a set of two scripts to handle grabbing and cleaning web pages. The first script collects the web pages using wget, and the second (heavily based upon a script by Darin Fisher) ensures that all links within the collected pages have been localized.

The code works best with this custom-built wget, which includes a CSS parser: http://www.mail-archive.com/wget@sunsite.dk/msg09142.html
A vanilla wget install will still collect web pages, but the collected pages will be missing most or all of their CSS. This information would probably be best covered in a README.

This code belongs in mozilla/testing, possibly in something like mozilla/testing/tools/grabber.
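The attached scripts are not reproduced in this report. As a rough sketch of the two steps described above, assuming a stock wget and using hypothetical function names (`grab_pages`, `list_remote_refs`), the collection step and a check for links that were not localized might look like this:

```shell
#!/bin/sh
# Sketch only: grab_pages and list_remote_refs are hypothetical names,
# not the functions used in the attached scripts.

# Download every URL listed in file $1 (one per line) into directory $2,
# together with the images and stylesheets each page needs.
grab_pages() {
    wget --page-requisites --convert-links --span-hosts \
         --no-verbose --directory-prefix="$2" --input-file="$1"
}

# Print the files under directory $1 that still reference a remote
# http(s) URL in a src or href attribute, i.e. links not yet localized.
list_remote_refs() {
    grep -rlE "(src|href)=[\"']https?://" "$1"
}
```

For example, `grab_pages urls.txt pages && list_remote_refs pages` would fetch the set and then report any page that still points at the network. `urls.txt` and `pages` are placeholder names.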
Assignee
Comment 1•18 years ago
Assignee
Comment 2•18 years ago
Assignee
Updated•18 years ago
Attachment #246859 -
Flags: review?(rhelmer)
Assignee
Updated•18 years ago
Attachment #246860 -
Flags: review?(rhelmer)
Updated•18 years ago
Attachment #246859 -
Flags: review?(rhelmer) → review+
Comment 3•18 years ago
Comment on attachment 246860 [details]
script to clean web pages
might want to do this one in perl if you are going to call out to perl so much, think it'd solve the perf problem.
Updated•18 years ago
Attachment #246860 -
Flags: review?(rhelmer) → review+
Comment 4•18 years ago
Nice to see someone using my hacked wget :)
Comment 5•18 years ago
(In reply to comment #4)
> Nice to see someone using my hacked wget :)

It seems like a useful thing to have. :)

(In reply to comment #3)
> might want to do this one in perl if you are going to call out to perl so
> much, think it'd solve the perf problem.

I agree. This should probably be rewritten in perl rather than calling perl a bunch of times to evaluate a single expression each time. Other than that, great stuff.

Where should we put this? In mozilla/testing/performance/tools? In a more general mozilla/testing/tools directory?
Assignee
Comment 6•18 years ago
I combined the two scripts and reduced the multiple perl invocations to a single perl command. I think this makes the code cleaner, and it should also perform better.
Attachment #246859 -
Attachment is obsolete: true
Attachment #246860 -
Attachment is obsolete: true
Attachment #246999 -
Flags: review?(rhelmer)
Comment 7•18 years ago
Comment on attachment 246999 [details]
getpages.sh attempt #2
yes, much better.
Attachment #246999 -
Flags: review?(rhelmer) → review+
Assignee
Comment 8•18 years ago
Bug fix for getpages.sh. Previously, the script couldn't handle cleaning downloaded files with spaces in their names (e.g., "+ addimg +.html"); using xargs -0 fixes this issue.
Attachment #246999 -
Attachment is obsolete: true
Attachment #247488 -
Flags: review?(rhelmer)
Updated•18 years ago
Attachment #247488 -
Flags: review?(rhelmer) → review+
Comment 9•18 years ago
Landed on trunk:
RCS file: /cvsroot/mozilla/testing/tools/grabber/getpages.sh,v
done
Checking in getpages.sh;
/cvsroot/mozilla/testing/tools/grabber/getpages.sh,v  <--  getpages.sh
initial revision: 1.1
done
Comment 10•17 years ago
Can this be marked as FIXED now?
Comment 12•16 years ago
The wget patch mentioned in comment 0 totally landed on wget trunk a while ago. Just thought I'd mention that!
Assignee: nobody → anodelman
Component: Testing → Release Engineering
Product: Core → mozilla.org
QA Contact: testing → release
Version: Trunk → other
Updated•11 years ago
Product: mozilla.org → Release Engineering