Closed Bug 353208 Opened 18 years ago Closed 16 years ago

Nutch indexes complete HTML pages instead of just content part

Categories

(developer.mozilla.org Graveyard :: General, defect)

Hardware: x86
OS: Windows XP
Type: defect
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: asqueella, Assigned: wenzel)

Attachments

(2 files)

...that results in useless snippets on search results and probably irrelevant results. E.g. http://developer.mozilla.org/en/docs/Special:Nutch?query=chrome&language=en&start=0&hitsPerPage=10 "... to main content Visit Mozilla.org Mozilla Developer Network Create an account or ... PageInfoOverlay.xul application=seamonkey@applications.mozilla. ... developer.mozilla.org/en/docs/Chrome_Registration"
Severity: normal → blocker
Depends on: 407050
For this, we probably have to write a little parsing plugin for Nutch; I am thinking about something along the lines of filtering out the content part of the pages and passing only this on to the actual content parsing plugin. Java ftw.
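The filtering step described above can be sketched outside of Nutch. The following is only an illustration in Python rather than an actual Nutch (Java) parse plugin; it assumes the page body lives in an element with id "content", which is a hypothetical id, and keeps only the text inside it.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ContentExtractor(HTMLParser):
    """Collect the text found inside the element with a given id."""

    def __init__(self, content_id="content"):  # "content" is an assumed id
        super().__init__(convert_charrefs=True)
        self.content_id = content_id
        self.depth = 0    # nesting depth inside the content element; 0 = outside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.content_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_content(html, content_id="content"):
    parser = ContentExtractor(content_id)
    parser.feed(html)
    parser.close()
    # Join with spaces and collapse whitespace so the result reads like a snippet.
    return " ".join(" ".join(parser.chunks).split())


# Made-up sample page: boilerplate around a content element.
page = """<html><body>
<div id="nav">Skip to main content <a href="/">Visit Mozilla.org</a></div>
<div id="content"><h1>Chrome Registration</h1>
<p>The chrome registry maps <code>chrome://</code> URLs.<br>See below.</p></div>
<div id="footer">Create an account</div>
</body></html>"""

print(extract_content(page))
```

Only the text inside the content element would reach the indexer this way; the navigation and footer text that pollutes the snippets is dropped. A real plugin would hand the filtered HTML on to Nutch's HTML parser rather than flattening it to text.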
An alternative would be to use a different MediaWiki skin for indexing, one that omits the boilerplate and is selected by the UA or an explicit header. Does Nutch support any explicit "don't index this" markup? If so, we could wedge it into the skin...
Looks like the markup issue is only being discussed at the moment and is not implemented yet: https://issues.apache.org/jira/browse/NUTCH-585 I am not sure how the skin could be any different, since Nutch is crawling the page autonomously. We could possibly do some magic based on the user agent? But I am not sure if this would work, considering the Netscaler.
Yeah, we'd have to make the Nutch crawl look like it comes from a logged-in user, by sending some bogus auth cookie, or by using a real auth cookie and just making the "purist" skin the one in use there.
I think you can change Nutch's user agent with
  <property>
    <name>http.agent.name</name>
    <value></value>
  </property>
and you can force Nutch to use skin 'xxx' with
  RewriteCond %{HTTP_USER_AGENT} ^nutch-user-agent
  RewriteRule ^/en/docs/(.*)$ /en/docs/index.php?title=$1&useskin=xxx [L]
in the Apache config (using mod_rewrite). I think this might be the easiest way.
(In reply to comment #5)
> and you can force nutch to use skin 'xxx' with
> RewriteCond %{HTTP_USER_AGENT} ^nutch-user-agent
> RewriteRule ^/en/docs/(.*)$ /en/docs/index.php?title=$1&useskin=xxx [L]
> in apache config (using mod_rewrite).
Or simply, title=$1&action=raw might be good?
CCing mrz: Matt, if we want to do some mod_rewrite magic on the production MDC based on the user agent, do we need special settings on the Netscalers supporting this? Note: This redirect only has to work for the production Nutch box; there's no need for it to work for anyone from the outside, as far as I can tell.
(In reply to comment #6)
> or simply, title=$1&action=raw might be good??
action=render, rather. action=raw gives you the wikitext, which does not contain HTML links, so the crawler would be sad.
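To make the intended effect of such a rule concrete, here is a small Python sketch that mimics the decision: only requests whose User-Agent matches the crawler pattern get mapped to the action=render URL. It is not the rule from the attachment; the pattern "nutch-user-agent" is the placeholder used in the comments above, and the function name is made up for illustration.

```python
import re

# Placeholder pattern taken from the comments above, not the real crawler UA.
CRAWLER_UA = re.compile(r"^nutch-user-agent")
DOCS_PATH = re.compile(r"^/en/docs/(.*)$")


def rewrite_for_crawler(user_agent, path):
    """Return the internally rewritten URL for the crawler, else the path unchanged."""
    if not CRAWLER_UA.search(user_agent):
        return path  # regular visitors keep the normal skinned page
    match = DOCS_PATH.match(path)
    if match is None:
        return path  # the rule only covers /en/docs/ pages
    return "/en/docs/index.php?title=" + match.group(1) + "&action=render"


print(rewrite_for_crawler("nutch-user-agent/1.0", "/en/docs/Chrome_Registration"))
print(rewrite_for_crawler("Mozilla/5.0", "/en/docs/Chrome_Registration"))
```

The first call yields the index.php URL with action=render, the second leaves the path alone, which is the split between crawler and regular visitors that the rest of the discussion is about.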
This is the rewrite rule that should do the trick. Note, though, that as of now the Netscaler question remains unresolved. In other words, I assume a config change has to be made on the NS too.
Assignee: nobody → fwenzel
Status: NEW → ASSIGNED
(In reply to comment #7)
> CCing mrz:
> Matt, if we want to do some mod_rewrite magic on the production MDC based on
> the user agent, do we need special settings on the Netscalers supporting this?
No - the Netscaler doesn't touch the user agent.
Thanks, Matt! But please clarify: since we want to serve different content (lo-fi vs. hi-fi, so to speak) for the same request URI based on the User-Agent, do I need to do anything (send headers or something?) to ensure that the wrong version (say, the lo-fi one) is not cached and sent out to clients who should get the other version (the hi-fi one), and vice versa?
What you're asking for is an Apache thing, not a Netscaler thing (it could be, I guess, but it's not currently set up for content switching).
Yes, we'll do that in Apache. Just wanted to make sure that User-Agent based mod_rewrite magic won't result in innocent (yet unlucky) end users being presented with plain text websites that are supposed to be seen by the Nutch crawler only. Thanks, that's all for now!
Okay, we clarified this more in bug 332818 comment 30 (sorry for the bughopping). The mod_rewrite rules from attachment 293836 lead to different content being served for one user agent than for anyone else (response code 200 in both cases). If I get it right, this will likely be cached by the NS and served to other people as well, which is unintended. I think I can add no-cache headers in the Apache config file too (using mod_headers and mod_setenvif or so), but while this may keep the lo-fi version from being served to regular users, do you have a suggestion on how to keep the public content from being served to the crawler? As a test case, I could write you a simple PHP file that echoes different content based on the UA? Or what were you looking for?
Another thing I could do is send a 302 (temporary) redirect (along with no-cache headers) to the crawler, provided it still stores the right URI then; I will have to test this. This should keep public users from seeing crawler content. But can we find out whether this would be sufficient to keep the public page from being sent to the crawler as well?
If you have a test case, I can look at the cache policy rules or even look at doing content filtering on the load balancer. So make me something that returns different content and I can test on my end.
I made you a test case with real-world data: The test.php script will return a complete HTML file if your UA is whatever, but if you have one that starts with "MDC Nutch", it'll serve you the content part only. Is that okay?
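The test.php script itself is not shown here. As a rough equivalent, the following Python WSGI sketch serves a full page by default and only the content part when the User-Agent starts with "MDC Nutch". The page bodies are made up, and the "Vary: User-Agent" header is an assumption that is not part of the original test case: it is the standard HTTP way to tell a cache that the response depends on that request header, but whether the Netscaler honors it would still have to be tested.

```python
from wsgiref.util import setup_testing_defaults

# Made-up page bodies standing in for a real wiki page.
FULL_PAGE = (b"<html><body><div id='nav'>Navigation</div>"
             b"<div id='content'>Article</div></body></html>")
CONTENT_ONLY = b"<div id='content'>Article</div>"
CRAWLER_PREFIX = "MDC Nutch"  # UA prefix named in the comment above


def app(environ, start_response):
    """Serve the content part to the crawler and the full page to everyone else."""
    user_agent = environ.get("HTTP_USER_AGENT", "")
    body = CONTENT_ONLY if user_agent.startswith(CRAWLER_PREFIX) else FULL_PAGE
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
        # Assumption: signal to caches that the body differs per User-Agent.
        ("Vary", "User-Agent"),
    ]
    start_response("200 OK", headers)
    return [body]


def fetch(user_agent):
    """Call the app in-process with the given User-Agent (no network needed)."""
    environ = {}
    setup_testing_defaults(environ)
    environ["HTTP_USER_AGENT"] = user_agent
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = dict(headers)

    body = b"".join(app(environ, start_response))
    return captured["status"], captured["headers"], body


print(fetch("MDC Nutch 0.9")[2])
print(fetch("Mozilla/5.0")[2])
```

Both requests get a 200 for the same URI, which is exactly the situation a shared cache can get wrong, so a test behind the load balancer would alternate the two user agents and check that neither ever receives the other's body.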
Is this online anywhere?
http://developer-test.mozilla.org/ua-filter-testcase/test.php I just realized you can test directly with developer-test.m.o too; any wiki page there will return a nice and fancy version for any user agent but for /MDC Nutch.*/ it'll serve the content part only, in raw HTML.
This site's not behind the netscaler - do you have something similar you can push out to a site that is?
(In reply to comment #20)
> This site's not behind the netscaler - do you have something similar you can
> push out to a site that is?
Actually, in order to be able to test code in its desired environment we should put developer-test behind the Netscaler. I filed bug 409360.
Depends on: 409360
MDC doesn't use nutch anymore for its search.
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Resolution: --- → WORKSFORME
Component: Deki Infrastructure → Other
Product: developer.mozilla.org → developer.mozilla.org Graveyard