Closed
Bug 353208
Opened 18 years ago
Closed 16 years ago
Nutch indexes complete HTML pages instead of just content part
Categories
(developer.mozilla.org Graveyard :: General, defect)
Tracking
(Not tracked)
RESOLVED
WORKSFORME
People
(Reporter: asqueella, Assigned: wenzel)
References
Details
Attachments
(2 files)
...that results in useless snippets on search results and probably irrelevant results.
E.g. http://developer.mozilla.org/en/docs/Special:Nutch?query=chrome&language=en&start=0&hitsPerPage=10
"... to main content Visit Mozilla.org Mozilla Developer Network Create an account or ... PageInfoOverlay.xul application=seamonkey@applications.mozilla. ...
developer.mozilla.org/en/docs/Chrome_Registration"
Updated•17 years ago
Severity: normal → blocker
Comment 1•17 years ago (Assignee)
For this, we probably have to write a little parsing plugin for Nutch; I am thinking about something along the lines of extracting the content part of each page and passing only that on to the actual content parsing plugin.
Java ftw.
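The idea in this comment can be sketched roughly as follows. This is a minimal illustration in Python, not an actual Nutch plugin (which would be written in Java against Nutch's plugin API). The element id "content" and the names ContentExtractor and extract_content are assumptions made up for the example; the real skin's markup may differ.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ContentExtractor(HTMLParser):
    """Collect only the text found inside the element with a given id."""

    def __init__(self, content_id="content"):  # "content" is an assumed id
        super().__init__()
        self.content_id = content_id
        self.depth = 0    # nesting depth inside the content element, 0 = outside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.content_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth and text:
            self.chunks.append(text)


def extract_content(html, content_id="content"):
    """Return the text of the content element, dropping navigation and footer."""
    parser = ContentExtractor(content_id)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Only the text returned by extract_content would then be handed to the indexer, so boilerplate such as "Skip to main content" no longer ends up in the snippets.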
Comment 2•17 years ago
An alternative option would be to use a different mediawiki skin for indexing, which didn't put any of the boilerplate on it, and was selected by the UA or an explicit header.
Does nutch support any explicit "don't index this" markup? If so, we could wedge it into the skin...
Comment 3•17 years ago (Assignee)
Looks like the markup issue is still under discussion and not implemented yet: https://issues.apache.org/jira/browse/NUTCH-585
I am not sure how the skin could be any different since Nutch is crawling the page autonomously. We could possibly do some magic based on the user agent? But I am not sure if this would work, considering the Netscaler.
Comment 4•17 years ago
Yeah, we'd have to make the Nutch crawl look like it comes from a logged-in user, either by sending some bogus auth cookie or by using a real auth cookie, and just make the "purist" skin the one in use for that user.
Comment 5•17 years ago
I think you can change Nutch's user agent with
<property>
<name>http.agent.name</name>
<value></value>
</property>
and you can force nutch to use skin 'xxx' with
RewriteCond %{HTTP_USER_AGENT} ^nutch-user-agent
RewriteRule ^/en/docs/(.*)$ /en/docs/index.php?title=$1&useskin=xxx [L]
in apache config (using mod_rewrite).
I think this might be the easiest way..
Comment 6•17 years ago
(In reply to comment #5)
> and you can force nutch to use skin 'xxx' with
> RewriteCond %{HTTP_USER_AGENT} ^nutch-user-agent
> RewriteRule ^/en/docs/(.*)$ /en/docs/index.php?title=$1&useskin=xxx [L]
> in apache config (using mod_rewrite).
or simply, title=$1&action=raw might be good??
Comment 7•17 years ago (Assignee)
CCing mrz:
Matt, if we want to do some mod_rewrite magic on the production MDC based on the user agent, do we need special settings on the Netscalers supporting this? Note: This redirect only has to work for the production Nutch box; there's no need for it to work for anyone from the outside, as far as I can tell.
Comment 8•17 years ago (Assignee)
(In reply to comment #6)
> or simply, title=$1&action=raw might be good??
action=render, rather. action=raw gives you the wikitext, which does not contain HTML links, so the crawler would be sad.
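To make the intended behaviour concrete, here is a small Python sketch of what a User-Agent conditional rewrite to action=render would do. It only mimics the mapping and is not the Apache configuration itself. The "MDC Nutch" prefix is the crawler User-Agent used for the test case later in this bug; the function name rewrite_for_crawler is hypothetical.

```python
import re
from urllib.parse import quote

NUTCH_UA = re.compile(r"^MDC Nutch")       # crawler User-Agent prefix (illustrative)
DOC_PATH = re.compile(r"^/en/docs/(.*)$")  # same pattern as the RewriteRule


def rewrite_for_crawler(path, user_agent):
    """Return the internal URL to serve for this request.

    Crawler requests for wiki pages are mapped to action=render, which
    yields only the rendered page content; everything else is untouched.
    """
    match = DOC_PATH.match(path)
    if match is None or not NUTCH_UA.search(user_agent or ""):
        return path
    title = quote(match.group(1), safe="/:")
    return f"/en/docs/index.php?title={title}&action=render"
```

A regular browser requesting /en/docs/Chrome_Registration gets the path back unchanged, while a request whose User-Agent starts with "MDC Nutch" is mapped to the index.php URL with action=render.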
Comment 9•17 years ago (Assignee)
This is the rewrite rule that should do the trick. Note though that as of now, the Netscaler question remains unsolved. In other words, I assume a config change has to be made in the NS too.
Assignee: nobody → fwenzel
Status: NEW → ASSIGNED
Comment 10•17 years ago
(In reply to comment #7)
> CCing mrz:
> Matt, if we want to do some mod_rewrite magic on the production MDC based on
> the user agent, do we need special settings on the Netscalers supporting this?
No - the Netscaler doesn't touch the user agent.
Comment 11•17 years ago (Assignee)
Thanks, Matt! But please clarify: since we want to serve different content (lo-fi vs. hi-fi, so to say) for the same request URI based on the User-Agent, do I need to do anything (send headers or something?) to make sure the wrong version is not cached and sent out to clients who should get the other one, in either direction?
Comment 12•17 years ago
What you're asking for is an Apache thing, not a Netscaler thing (it could be, I guess, but it's not currently set up for content switching).
Comment 13•17 years ago (Assignee)
Yes, we'll do that in Apache. Just wanted to make sure that User-Agent based mod_rewrite magic won't result in innocent (yet unlucky) end users being presented with plain text websites that are supposed to be seen by the Nutch crawler only.
Thanks, that's all for now!
Comment 14•17 years ago (Assignee)
Okay, we clarified this more in bug 332818 comment 30 (sorry for the bughopping).
The mod_rewrite rules from attachment 293836 [details] lead to different content being served for one user agent than for anyone else (response code 200 in both cases). This will, if I get it right, likely be cached by the NS and served to other people also, which is unintended.
I think I can add no-cache headers in the Apache config file too (using mod_headers and mod_setenvif or so), but while this may keep the lo-fi version from being served to regular users, do you have a suggestion on how to keep the public content from being served to the crawler?
As a test case I could write you a simple PHP file that echoes different content based on UA? Or what were you looking for?
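The caching concern described here can be illustrated with a toy model in Python. It is only a sketch: origin and SharedCache are hypothetical stand-ins, and whether the Netscaler honors a Vary header is an assumption that this bug does not confirm. The point is that a shared cache keyed on the URL alone hands the first stored variant to everyone, in both directions.

```python
def origin(path, user_agent):
    """Toy origin server: the crawler gets content only, others the full page."""
    is_crawler = user_agent.startswith("MDC Nutch")
    body = "content only" if is_crawler else "full page with navigation"
    headers = {"Vary": "User-Agent"}  # declares that the body depends on the UA
    return headers, body


class SharedCache:
    """Very small stand-in for a caching proxy in front of the origin."""

    def __init__(self, honor_vary=True):
        self.honor_vary = honor_vary
        self.entries = {}  # path -> list of (User-Agent or None, body)

    def fetch(self, path, user_agent):
        # A stored entry matches if it is UA-independent (None) or has the same UA.
        for cached_ua, body in self.entries.get(path, []):
            if cached_ua is None or cached_ua == user_agent:
                return body
        headers, body = origin(path, user_agent)
        varies = self.honor_vary and "User-Agent" in headers.get("Vary", "")
        key_ua = user_agent if varies else None
        self.entries.setdefault(path, []).append((key_ua, body))
        return body
```

With honor_vary=False, a crawler request followed by a browser request for the same path returns the crawler's body to the browser. With the default, each User-Agent gets its own variant, at the cost of a poorer cache hit rate.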
Comment 15•17 years ago (Assignee)
Another thing I could do is send a 302 (temporary) redirect (along with no-cache headers) to the crawler (provided it still stores the right URI then; I will have to test this). This should keep public users from seeing crawler content. But can we find out whether this would also be sufficient to keep the public page from being sent to the crawler?
Comment 16•17 years ago
If you have a test case, I can look at the cache policy rules or even look at doing content filtering on the load balancer. So make me something that returns different content and I can test on my end.
Comment 17•17 years ago (Assignee)
I made you a test case with real-world data: The test.php script will return a complete HTML file if your UA is whatever, but if you have one that starts with "MDC Nutch", it'll serve you the content part only.
Is that okay?
Comment 18•17 years ago
Is this online anywhere?
Comment 19•17 years ago (Assignee)
http://developer-test.mozilla.org/ua-filter-testcase/test.php
I just realized you can test directly with developer-test.m.o too; any wiki page there will return a nice and fancy version for any user agent but for /MDC Nutch.*/ it'll serve the content part only, in raw HTML.
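A client-side check of such a test case could look roughly like the Python sketch below. It requests the same URL with two different User-Agent strings and compares the bodies. The helper names and the two User-Agent strings are made up for illustration; only the "MDC Nutch" prefix comes from this bug.

```python
import urllib.request


def build_request(url, user_agent):
    """Build a GET request that identifies itself with the given User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_body(url, user_agent, timeout=10):
    """Fetch the URL with the given User-Agent and return the raw body."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def variants_differ(url,
                    crawler_ua="MDC Nutch/test",       # hypothetical UA string
                    browser_ua="Mozilla/5.0 (test)"):  # hypothetical UA string
    """Return True if the two User-Agents receive different bodies."""
    return fetch_body(url, crawler_ua) != fetch_body(url, browser_ua)
```

Calling variants_differ repeatedly against a URL behind the load balancer, alternating which User-Agent goes first, would show whether a cached copy of one variant ever leaks to the other client.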
Comment 20•17 years ago
This site's not behind the netscaler - do you have something similar you can push out to a site that is?
Comment 21•17 years ago (Assignee)
(In reply to comment #20)
> This site's not behind the netscaler - do you have something similar you can
> push out to a site that is?
Actually, in order to be able to test code in its desired environment we should put developer-test behind the Netscaler. I filed bug 409360.
Depends on: 409360
Comment 22•16 years ago (Assignee)
MDC doesn't use nutch anymore for its search.
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Resolution: --- → WORKSFORME
Updated•12 years ago
Component: Deki Infrastructure → Other
Updated•5 years ago
Product: developer.mozilla.org → developer.mozilla.org Graveyard