Closed
Bug 362482
Opened 18 years ago
Closed 15 years ago
Nutch indexes skin files (controls.js)
Categories
(developer.mozilla.org Graveyard :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: asqueella, Assigned: aravind)
References
Details
http://developer.mozilla.org/en/docs/Special:Nutch?language=en&start=0&hitsPerPage=10&query=don%27t&fulltext=Search
The first result is developer.mozilla.org/en/docs/skins/common/wikibits.js. Skin files shouldn't be indexed.
We should rel=nofollow all the script and CSS stuff, yeah.
Updated•17 years ago
Severity: normal → critical
Updated•17 years ago
Severity: critical → blocker
I suspect, quite strongly, that adding a regex urlfilter for "*.js" or matching the "skin" dirs will alleviate this, which we could do along with an upgrade to nutch 0.9 (bunch of bug fixes, massive perf gains). Putting this in crawl-urlfilter.txt might well suffice for most of the JS cases; we should probably do CSS as well:
-\.js$
-gen=js$
If we could filter out text/javascript and application/x-javascript based on MIME type, that might work even better. Not sure how to do that off the top of my head, but I recall reading that Nutch 0.9 could do so.
Comment 4•17 years ago
Upgrading to Nutch 0.9 may not work at the moment: I am running into this bug when indexing devmo: https://issues.apache.org/jira/browse/NUTCH-554
Though it is fixed, the fix is only available in the Nutch nightlies and not in any release yet. Unless they plan on making a bugfix release (0.9.1 or so), we would have to wait until Nutch 1.0, judging by their roadmap: http://issues.apache.org/jira/browse/NUTCH?report=com.atlassian.jira.plugin.system.project:roadmap-panel
So for now, we can only move to Nutch 0.8.1, which may not bring the improvements you mentioned above. Suggestions?
Comment 5•17 years ago
Are there any known issues that would prevent us from using a nightly of Nutch to deal with this problem?
Comment 6•17 years ago
(In reply to comment #5)
> Are there any known issues that would prevent us from using a nightly of Nutch
> to deal with this problem?
Besides the obvious (http://lucene.apache.org/nutch/nightly.html : "They might or might not be functional.")... Maybe I should try a current nightly and see if it works as expected. Since we do not do anything fancy besides indexing and search, I should be able to test if the nightly works for us.
Comment 7•17 years ago
Let's try it and see if it works. It seems like the least unpleasant alternative at this point. :)
Comment 8•17 years ago
Short update: Yesterday's Nutch nightly seems to work fine, though there are still required JAR files missing on the dev box (filed bug 407941), so Nutch's OpenSearch interface won't work yet. Judging by what shows up when I directly search nutch though, the additional URL filter for .js files worked fine. I'll look into the possibility of MIME type filtering too.
Comment 9•16 years ago
In bug 419676, I had IT update nutch to a current nightly. This fixed the issue. Thanks everyone.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Reporter
Comment 10•16 years ago
http://developer.mozilla.org/en/docs/Special:Nutch?query=don%27t&language=en&start=10&hitsPerPage=10
and
http://developer.mozilla.org/en/docs/Special:Nutch?language=en&start=0&hitsPerPage=10&query=controls.js&fulltext=Search
list developer.mozilla.org/en/docs/skins/devmo/controls.js, so I'm reopening this. Let me know if you want a separate bug instead.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Summary: Nutch indexes skin files (wikibits.js) → Nutch indexes skin files (controls.js)
It's indexing jquery.js, too; Fred: can you take a look? It's been broken for some time, not quite sure what changed to make it happen. (Reassigning to you; if there's a better way to put something into the general webdev pool, like there is for IT, please let me know!)
Assignee: nobody → fwenzel
Severity: blocker → major
Status: REOPENED → NEW
Comment 13•16 years ago
Argh, I think I found the reason: The rule excluding .js files was in the filter file *after* the rule including all devmo URLs. However, only the first matching filter counts. I am just trying to confirm this by indexing MDC on the dev box.
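The ordering problem described in this comment can be illustrated with a small Python sketch. It is not Nutch code: Python's `re.search` stands in for the Java regex matching, the accept pattern for the site is a hypothetical stand-in for "the rule including all devmo URLs", and rejecting a URL that matches no rule is an assumption about the filter's behavior.

```python
import re

def first_match_filter(rules, url):
    """Return True if url is accepted. The first rule whose pattern is found decides."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is rejected

# Hypothetical accept rule standing in for "the rule including all devmo URLs".
ACCEPT_SITE = ("+", r"^http://developer\.mozilla\.org/")
# Exclusion rule for JavaScript files, as discussed in this bug.
SKIP_JS = ("-", r"\.(js|JS)$")

broken_order = [ACCEPT_SITE, SKIP_JS]  # exclusion listed after the accept rule
fixed_order = [SKIP_JS, ACCEPT_SITE]   # exclusion listed first

js_url = "http://developer.mozilla.org/en/docs/skins/devmo/controls.js"
page_url = "http://developer.mozilla.org/en/docs/Main_Page"

print(first_match_filter(broken_order, js_url))   # accept rule matches first
print(first_match_filter(fixed_order, js_url))    # exclusion rule matches first
print(first_match_filter(fixed_order, page_url))  # ordinary pages are unaffected
```

With the exclusion placed after the accept rule, the .js URL is accepted before the exclusion is ever consulted; moving the exclusion up changes only the .js result.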
OS: Windows XP → All
Hardware: PC → All
Comment 14•16 years ago
Hm, nutch on the dev box is broken :( I committed the line change anyway; I am quite confident that'll do it. Aravind, could you please update the nutch config file "crawl-urlfilter.txt" in production? (The SVN commit was r17004.) The next time nutch indexes, it should not pick up JavaScript anymore. Also (new bug, I guess), Aravind, do you remember what you did last time in order to get nutch and tomcat to run on sm-devnutch01? It just returns an empty page for me :(
Assignee: fwenzel → aravind
Comment 15•16 years ago
Shaver: By the way, the general webdev "pool" is mozilla.org :: Webdev in bugzilla.
OK, thanks -- should this go over to the IT queue now to get the push done?
Comment 17•16 years ago
Ah yeah, can you do it please? I don't know what component etc. I need to assign it to.
It's assigned to Aravind, I think that's enough. Eager!
Assignee
Comment 19•16 years ago
Don't need this anymore.
Status: NEW → RESOLVED
Closed: 16 years ago → 16 years ago
Resolution: --- → INVALID
Comment 20•16 years ago
> Don't need this anymore.
It's still indexing .js files, so I'm not sure why we wouldn't need this anymore. Comment #11 made this more generic than the original report, but the main problem isn't fixed. Example: http://www.mozilla.com/en-US/search/?query=re_update&hits_per_page=10&hits_per_site=0
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
Comment 21•16 years ago
(In reply to comment #20)
> > Don't need this anymore.
>
> It's still indexing .js files so I'm not sure why we wouldn't need this
> anymore?
Didn't I check in a config file change approximately ages ago? http://viewvc.svn.mozilla.org/vc?view=rev&revision=17004
Aravind, can you make sure the production nutch config reflects this change?
Assignee
Comment 22•15 years ago
Is there still something that needs to be done here?
Assignee
Comment 23•15 years ago
Never mind, it's MDC...
Status: REOPENED → RESOLVED
Closed: 16 years ago → 15 years ago
Resolution: --- → INVALID
Comment 24•15 years ago
This isn't invalid. The example in comment 20 is still relevant.
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
Assignee
Comment 25•15 years ago
Here is the updated crawl-urlfilter.txt file. I did the svn update, but it looks like I already had the right text in it. Either way, please re-open if this isn't fixed in tomorrow's crawl.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/
# MDC: Do not parse javascript files
-\.(js|JS)$
# accept hosts in MY.DOMAIN.NAME
+^http://([A-Za-z0-9]*\.)*mozilla\.com/
# skip everything else
-.
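One way to sanity-check a filter file like the one posted in this comment, without waiting for the next crawl, is to replay its rules against a few URLs. The Python sketch below is only an approximation: it uses Python's `re` module rather than Nutch's Java regex engine, it assumes first-match-wins evaluation with "no match means reject", and the helper names `parse_rules` and `accepts` as well as the sample URLs other than the search URL from comment 20 are hypothetical.

```python
import re

# Rules copied from the crawl-urlfilter.txt text in this comment.
RULES_TEXT = r"""
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/
# MDC: Do not parse javascript files
-\.(js|JS)$
# accept hosts in MY.DOMAIN.NAME
+^http://([A-Za-z0-9]*\.)*mozilla\.com/
# skip everything else
-.
"""

def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules

def accepts(rules, url):
    """First matching rule decides; a URL matching nothing is rejected (assumption)."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False

rules = parse_rules(RULES_TEXT)

for url in [
    "http://www.mozilla.com/en-US/firefox/",                 # hypothetical page URL
    "http://www.mozilla.com/js/util.js",                     # hypothetical script URL
    "http://www.mozilla.com/en-US/search/?query=re_update",  # query URL from comment 20, shortened
]:
    print(url, accepts(rules, url))
```

Under these assumptions the page URL is accepted, while the script URL and the query URL are rejected by the `-\.(js|JS)$` and `-[?*!@=]` rules respectively. Note that the accept rule as written covers mozilla.com hosts only, so any other host falls through to the final `-.` rule.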
Status: REOPENED → RESOLVED
Closed: 15 years ago → 15 years ago
Resolution: --- → FIXED
Comment 26•15 years ago
(In reply to comment #25)
> Here is the updated crawl-urlfilter.txt file. I did the svn update, but it
> looks like I already had the right text in it.
So, did something change?
Assignee
Comment 27•15 years ago
The svn update had a status of G. According to the manual, "G foo File foo received new changes from the repository, but your local copy of the file had your modifications. Either the changes did not intersect, or the changes were exactly the same as your local modifications, so Subversion has successfully merGed the repository's changes into the file without a problem." So I don't know if I already had the change or if I had some partial change.
Updated•12 years ago
Component: Deki Infrastructure → Other
Updated•4 years ago
Product: developer.mozilla.org → developer.mozilla.org Graveyard