Closed Bug 362482 — Opened 18 years ago, Closed 15 years ago

Nutch indexes skin files (controls.js)

Categories: developer.mozilla.org Graveyard :: General
Type: defect
Priority: Not set
Severity: major
Tracking: Not tracked
RESOLVED FIXED

People

(Reporter: asqueella, Assigned: aravind)

References

Details

http://developer.mozilla.org/en/docs/Special:Nutch?language=en&start=0&hitsPerPage=10&query=don%27t&fulltext=Search

The first result is developer.mozilla.org/en/docs/skins/common/wikibits.js

Skin files shouldn't be indexed.
We should rel=nofollow all the script and CSS stuff, yeah.
Severity: normal → critical
Severity: critical → blocker
I suspect, quite strongly, that adding a regex urlfilter for "*.js", or one matching the "skin" directories, will alleviate this. We could do that along with an upgrade to Nutch 0.9 (a bunch of bug fixes, massive performance gains).

Putting this in crawl-urlfilter.txt might well suffice for most of the JS cases; we should probably do CSS as well:

-\.js$
-gen=js$

If we could filter out text/javascript and application/x-javascript based on MIME type, that might work even better.  Not sure how to do that off the top of my head, but I recall reading that Nutch 0.9 could do so.
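A quick way to sanity-check the two proposed patterns before touching the crawler is to run them against known-good and known-bad URLs. This is plain Python `re`, not Nutch code, but Nutch's RegexURLFilter uses similar find-anywhere matching semantics:

```python
import re

# The two exclusion patterns proposed above for crawl-urlfilter.txt
# (the leading "-" in the config means "exclude"; only the regex is kept here).
patterns = [r"\.js$", r"gen=js$"]

def excluded(url):
    """Return True if any proposed exclusion pattern matches the URL."""
    return any(re.search(p, url) for p in patterns)

# URLs from this bug: skin scripts should be excluded, wiki pages should not.
print(excluded("http://developer.mozilla.org/en/docs/skins/common/wikibits.js"))  # True
print(excluded("http://developer.mozilla.org/en/docs/DOM"))                       # False
```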
Depends on: 407050
Upgrading to Nutch 0.9 may not work at the moment: I am running into this bug when indexing devmo:
https://issues.apache.org/jira/browse/NUTCH-554

Though the bug is fixed, the fix is only available in the Nutch nightlies and not in any release yet. Unless they plan on making a bugfix release (0.9.1 or so), we would have to wait until Nutch 1.0, judging by their roadmap: http://issues.apache.org/jira/browse/NUTCH?report=com.atlassian.jira.plugin.system.project:roadmap-panel

So for now, we could only move to Nutch 0.8.1, which may not bring the improvements you mentioned above.

Suggestions?
Are there any known issues that would prevent us from using a nightly of Nutch to deal with this problem?
(In reply to comment #5)
> Are there any known issues that would prevent us from using a nightly of Nutch
> to deal with this problem?

Besides the obvious (http://lucene.apache.org/nutch/nightly.html : "They might or might not be functional.")... Maybe I should try a current nightly and see if it works as expected. Since we do not do anything fancy besides indexing and search, I should be able to test whether the nightly works for us.
Let's try it and see if it works.  It seems like the least unpleasant alternative at this point. :)
Short update: Yesterday's Nutch nightly seems to work fine, though there are still required JAR files missing on the dev box (filed bug 407941), so Nutch's OpenSearch interface won't work yet. Judging by what shows up when I directly search nutch though, the additional URL filter for .js files worked fine. I'll look into the possibility of MIME type filtering too.
In bug 419676, I had IT update nutch to a current nightly. This fixed the issue. Thanks everyone.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
http://developer.mozilla.org/en/docs/Special:Nutch?query=don%27t&language=en&start=10&hitsPerPage=10
http://developer.mozilla.org/en/docs/Special:Nutch?language=en&start=0&hitsPerPage=10&query=controls.js&fulltext=Search

both list developer.mozilla.org/en/docs/skins/devmo/controls.js, so I'm reopening this. Let me know if you want a separate bug instead.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Summary: Nutch indexes skin files (wikibits.js) → Nutch indexes skin files (controls.js)
It's indexing jquery.js, too; Fred, can you take a look?  It's been broken for some time; I'm not quite sure what changed to make it happen.

(Reassigning to you; if there's a better way to put something into the general webdev pool, like there is for IT, please let me know!)
Assignee: nobody → fwenzel
Severity: blocker → major
Status: REOPENED → NEW
Argh, I think I found the reason: The rule excluding .js files was in the filter file *after* the rule including all devmo URLs. However, only the first matching filter counts.

I am just trying to confirm this by indexing MDC on the dev box.
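Concretely, since only the first matching rule counts, the fix is to move the exclusion above the catch-all include. A before/after sketch (the actual devmo include pattern in the file may differ; this is illustrative):

```
# Broken: the include matches .js URLs first, so the exclusion never fires
+^http://developer\.mozilla\.org/
-\.(js|JS)$

# Fixed: the exclusion comes first, so .js URLs are rejected before the include
-\.(js|JS)$
+^http://developer\.mozilla\.org/
```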
OS: Windows XP → All
Hardware: PC → All
Hm, Nutch on the dev box is broken :( I committed the line change anyway; I am quite confident that'll do it.

Aravind, could you please update the Nutch config file "crawl-urlfilter.txt" in production? (The SVN commit was r17004.) The next time Nutch indexes, it should not pick up JavaScript anymore.

Also (new bug, I guess), Aravind, do you remember what you did last time to get Nutch and Tomcat running on sm-devnutch01? It just returns an empty page for me :(
Assignee: fwenzel → aravind
Shaver: By the way, the general webdev "pool" is mozilla.org :: Webdev in bugzilla.
OK, thanks -- should this go over to the IT queue now to get the push done?
Ah yeah, can you do it please? I don't know what component etc. I need to assign it to.
Depends on: 444905
It's assigned to Aravind; I think that's enough.  Eager!
Don't need this anymore.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → INVALID
> Don't need this anymore.

It's still indexing .js files, so I'm not sure why we wouldn't need this anymore.  Comment #11 made this more generic than the original report, but the main problem isn't fixed.

Example: http://www.mozilla.com/en-US/search/?query=re_update&hits_per_page=10&hits_per_site=0
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
(In reply to comment #20)
> > Don't need this anymore.
> 
> It's still indexing .js files so I'm not sure why we wouldn't need this
> anymore?

Didn't I check in a config file change approximately ages ago? http://viewvc.svn.mozilla.org/vc?view=rev&revision=17004

Aravind, can you make sure the production nutch config represents this change?
Is there still something that needs to be done here?
Never mind, it's MDC...
Status: REOPENED → RESOLVED
Closed: 15 years ago
Resolution: --- → INVALID
This isn't invalid.  The example in comment 20 is still relevant.
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
Here is the updated crawl-urlfilter.txt file.  I did the svn update, but it looks like I already had the right text in it.  Either way, please reopen this if it isn't fixed in tomorrow's crawl.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# MDC: Do not parse javascript files
-\.(js|JS)$

# accept hosts in MY.DOMAIN.NAME
+^http://([A-Za-z0-9]*\.)*mozilla\.com/

# skip everything else
-.
Status: REOPENED → RESOLVED
Closed: 15 years ago15 years ago
Resolution: --- → FIXED
(In reply to comment #25)
> Here is the updated crawl-urlfilter.txt file.  I did the svn update, but it
> looks like I already had the right text in it.

So, did something change?
The svn update had a status of G.  According to the manual,

"G foo

    File foo received new changes from the repository, but your local copy of the file had your modifications. Either the changes did not intersect, or the changes were exactly the same as your local modifications, so Subversion has successfully merGed the repository's changes into the file without a problem."

So I don't know if I already had the change or if I had some partial change.
Component: Deki Infrastructure → Other
Product: developer.mozilla.org → developer.mozilla.org Graveyard