Closed Bug 362152 Opened 18 years ago Closed 17 years ago

code for grabbing/cleaning set of web pages.

Categories

(Release Engineering :: General, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: anodelman, Assigned: anodelman)

Attachments

(1 file, 3 obsolete files)

I've created a pair of scripts to handle grabbing and cleaning web pages. The first script collects the web pages using wget; the second (heavily based on a script by Darin Fisher) ensures that all links within the collected pages have been localized.

The code works best with this custom-built wget, which includes a CSS parser:
http://www.mail-archive.com/wget@sunsite.dk/msg09142.html
The scripts will still collect web pages with a vanilla wget install, but the collected pages will be missing most or all of their CSS. This information would probably be best covered in a README.
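
For illustration, the collection step with a stock wget might look roughly like the sketch below. The function name grab_page, the WGET override variable, and the particular choice of flags are assumptions made for this example, not taken from the attached scripts; the flags themselves are standard wget options.

```shell
# Hypothetical helper: fetch one page plus the images, CSS, and scripts
# it needs, and store everything flat in a single directory.
grab_page() {
    url=$1
    dest=$2
    # ${WGET:-wget} lets a caller point at a CSS-aware build (or a stub).
    "${WGET:-wget}" \
        --page-requisites \
        --convert-links \
        --span-hosts \
        --no-directories \
        --directory-prefix="$dest" \
        "$url"
}
```

Calling `grab_page http://example.com/ pages` would place the page and its requisites under `pages/`. A vanilla wget does not follow references inside stylesheets, which is the gap the CSS-parsing build is meant to fill.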

This code belongs in mozilla/testing, possibly in something like mozilla/testing/tools/grabber.
Attached file script for grabbing web page sets (obsolete)
Attached file script to clean web pages (obsolete)
Attachment #246859 - Flags: review?(rhelmer)
Attachment #246860 - Flags: review?(rhelmer)
Attachment #246859 - Flags: review?(rhelmer) → review+
Comment on attachment 246860 [details]
script to clean web pages

might want to do this one in perl if you are going to call out to perl so much, think it'd solve the perf problem.
Attachment #246860 - Flags: review?(rhelmer) → review+
Nice to see someone using my hacked wget :)
(In reply to comment #4)
> Nice to see someone using my hacked wget :)

It seems like a useful thing to have. :)

(In reply to comment #3)
> might want to do this one in perl if you are going to call out to perl so
> much, think it'd solve the perf problem.

I agree. This should probably be rewritten in perl rather than invoking perl a bunch of times, each time to evaluate a single expression. Other than that, great stuff.

Where should we put this? In mozilla/testing/performance/tools? In a more general mozilla/testing/tools directory?
Attached file getpages.sh attempt #2 (obsolete)
I combined the two scripts and reduced the multiple perl invocations to a single perl command. I think this makes the code cleaner, and it should also perform better.
Attachment #246859 - Attachment is obsolete: true
Attachment #246860 - Attachment is obsolete: true
Attachment #246999 - Flags: review?(rhelmer)
Comment on attachment 246999 [details]
getpages.sh attempt #2

yes, much better.
Attachment #246999 - Flags: review?(rhelmer) → review+
Attached file getpages.sh attempt #3
Bug fix for getpages.sh. Previously the script couldn't clean downloaded files with spaces in their names (e.g., "+ addimg +.html"); using xargs -0 fixes this.
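
A minimal sketch of the general technique, assuming a find that supports -print0: file names are passed NUL-delimited, so xargs no longer splits them on whitespace. The helper name for_each_html and the '*.html' pattern are made up for this example and are not from getpages.sh.

```shell
# Hypothetical helper: run a command on every .html file under a directory.
# -print0 and -0 delimit names with NUL bytes, so spaces in names survive.
for_each_html() {
    dir=$1
    shift
    find "$dir" -type f -name '*.html' -print0 | xargs -0 "$@"
}
```

For example, `for_each_html pages ls -l` lists a file named "+ addimg +.html" as a single entry, where a plain `find | xargs` pipeline would hand ls three separate words.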
Attachment #246999 - Attachment is obsolete: true
Attachment #247488 - Flags: review?(rhelmer)
Attachment #247488 - Flags: review?(rhelmer) → review+
Landed on trunk:

RCS file: /cvsroot/mozilla/testing/tools/grabber/getpages.sh,v
done
Checking in getpages.sh;
/cvsroot/mozilla/testing/tools/grabber/getpages.sh,v  <--  getpages.sh
initial revision: 1.1
done
Can this be marked as FIXED now?
yes!
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
The wget patch mentioned in comment 0 totally landed on wget trunk a while ago. Just thought I'd mention that!
Assignee: nobody → anodelman
Component: Testing → Release Engineering
Product: Core → mozilla.org
QA Contact: testing → release
Version: Trunk → other
Product: mozilla.org → Release Engineering