Closed Bug 362482 — Opened 18 years ago, Closed 15 years ago

Nutch indexes skin files (controls.js)

Categories: developer.mozilla.org Graveyard :: General
Type: defect
Priority: Not set
Severity: major
Tracking: Not tracked
RESOLVED FIXED

People

(Reporter: asqueella, Assigned: aravind)

References

Details

http://developer.mozilla.org/en/docs/Special:Nutch?language=en&start=0&hitsPerPage=10&query=don%27t&fulltext=Search

The first result is developer.mozilla.org/en/docs/skins/common/wikibits.js

Skin files shouldn't be indexed.
We should rel=nofollow all the script and CSS stuff, yeah.
Severity: normal → critical
Severity: critical → blocker
I suspect, quite strongly, that adding a regex urlfilter for "*.js", or one matching the "skin" directories, will alleviate this. We could do that along with an upgrade to Nutch 0.9 (a bunch of bug fixes, massive performance gains).

Putting this in crawl-urlfilter.txt might well suffice for most of the JS cases; we should probably do CSS as well:

-\.js$
-gen=js$

If we could filter out text/javascript and application/x-javascript based on MIME type, that might work even better.  Not sure how to do that off the top of my head, but I recall reading that Nutch 0.9 could do so.
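A quick way to sanity-check the two proposed patterns before touching the crawler is to run them against known-good and known-bad URLs. This is plain Python `re`, not Nutch code, but Nutch's RegexURLFilter uses similar find-anywhere matching semantics:

```python
import re

# The two exclusion patterns proposed above for crawl-urlfilter.txt
# (the leading "-" in the config means "exclude"; only the regex is kept here).
patterns = [r"\.js$", r"gen=js$"]

def excluded(url):
    """Return True if any proposed exclusion pattern matches the URL."""
    return any(re.search(p, url) for p in patterns)

# URLs from this bug: skin scripts should be excluded, wiki pages should not.
print(excluded("http://developer.mozilla.org/en/docs/skins/common/wikibits.js"))  # True
print(excluded("http://developer.mozilla.org/en/docs/DOM"))                       # False
```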
Depends on: 407050
Upgrading to Nutch 0.9 may not work at the moment: I am running into this bug when indexing devmo:
https://issues.apache.org/jira/browse/NUTCH-554

Though the bug is fixed, the fix is only available in the Nutch nightlies and not in any release yet. Unless they plan on making a bugfix release (0.9.1 or so), we would have to wait until Nutch 1.0, judging by their roadmap: http://issues.apache.org/jira/browse/NUTCH?report=com.atlassian.jira.plugin.system.project:roadmap-panel

So for now, we could only move to Nutch 0.8.1, which may not bring the improvements you mentioned above.

Suggestions?
Are there any known issues that would prevent us from using a nightly of Nutch to deal with this problem?
(In reply to comment #5)
> Are there any known issues that would prevent us from using a nightly of Nutch
> to deal with this problem?

Besides the obvious (http://lucene.apache.org/nutch/nightly.html : "They might or might not be functional.")... Maybe I should try a current nightly and see if it works as expected. Since we do not do anything fancy besides indexing and search, I should be able to test whether the nightly works for us.
Let's try it and see if it works.  It seems like the least unpleasant alternative at this point. :)
Short update: Yesterday's Nutch nightly seems to work fine, though there are still required JAR files missing on the dev box (filed bug 407941), so Nutch's OpenSearch interface won't work yet. Judging by what shows up when I directly search nutch though, the additional URL filter for .js files worked fine. I'll look into the possibility of MIME type filtering too.
In bug 419676, I had IT update nutch to a current nightly. This fixed the issue. Thanks everyone.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
http://developer.mozilla.org/en/docs/Special:Nutch?query=don%27t&language=en&start=10&hitsPerPage=10
http://developer.mozilla.org/en/docs/Special:Nutch?language=en&start=0&hitsPerPage=10&query=controls.js&fulltext=Search

both list developer.mozilla.org/en/docs/skins/devmo/controls.js, so I'm reopening this. Let me know if you want a separate bug instead.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Summary: Nutch indexes skin files (wikibits.js) → Nutch indexes skin files (controls.js)
It's indexing jquery.js, too; Fred, can you take a look?  It's been broken for some time; I'm not quite sure what changed to make it happen.

(Reassigning to you; if there's a better way to put something into the general webdev pool, like there is for IT, please let me know!)
Assignee: nobody → fwenzel
Severity: blocker → major
Status: REOPENED → NEW
Argh, I think I found the reason: The rule excluding .js files was in the filter file *after* the rule including all devmo URLs. However, only the first matching filter counts.

I am just trying to confirm this by indexing MDC on the dev box.
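Concretely, since only the first matching rule counts, the fix is to move the exclusion above the catch-all include. A before/after sketch (the actual devmo include pattern in the file may differ; this is illustrative):

```
# Broken: the include matches .js URLs first, so the exclusion never fires
+^http://developer\.mozilla\.org/
-\.(js|JS)$

# Fixed: the exclusion comes first, so .js URLs are rejected before the include
-\.(js|JS)$
+^http://developer\.mozilla\.org/
```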
OS: Windows XP → All
Hardware: PC → All
Hm, Nutch on the dev box is broken :( I committed the line change anyway; I am quite confident that'll do it.

Aravind, could you please update the Nutch config file "crawl-urlfilter.txt" in production? (The SVN commit was r17004.) The next time Nutch indexes, it should not pick up JavaScript anymore.

Also (new bug, I guess), Aravind, do you remember what you did last time to get Nutch and Tomcat running on sm-devnutch01? It just returns an empty page for me :(
Assignee: fwenzel → aravind
Shaver: By the way, the general webdev "pool" is mozilla.org :: Webdev in bugzilla.
OK, thanks -- should this go over to the IT queue now to get the push done?
Ah yeah, can you do it please? I don't know what component etc. I need to assign it to.
Depends on: 444905
It's assigned to Aravind; I think that's enough.  Eager!
Don't need this anymore.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → INVALID
> Don't need this anymore.

It's still indexing .js files, so I'm not sure why we wouldn't need this anymore.  Comment #11 made this more generic than the original report, but the main problem isn't fixed.

Example: http://www.mozilla.com/en-US/search/?query=re_update&hits_per_page=10&hits_per_site=0
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
(In reply to comment #20)
> > Don't need this anymore.
> 
> It's still indexing .js files so I'm not sure why we wouldn't need this
> anymore?

Didn't I check in a config file change approximately ages ago? http://viewvc.svn.mozilla.org/vc?view=rev&revision=17004

Aravind, can you make sure the production nutch config represents this change?
Is there still something that needs to be done here?
Never mind, it's MDC...
Status: REOPENED → RESOLVED
Closed: 15 years ago
Resolution: --- → INVALID
This isn't invalid.  The example in comment 20 is still relevant.
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
Here is the updated crawl-urlfilter.txt file.  I did the svn update, but it looks like I already had the right text in it.  Either way, please reopen this if it isn't fixed in tomorrow's crawl.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# MDC: Do not parse javascript files
-\.(js|JS)$

# accept hosts in MY.DOMAIN.NAME
+^http://([A-Za-z0-9]*\.)*mozilla\.com/

# skip everything else
-.
Status: REOPENED → RESOLVED
Closed: 15 years ago15 years ago
Resolution: --- → FIXED
(In reply to comment #25)
> Here is the updated crawl-urlfilter.txt file.  I did the svn update, but it
> looks like I already had the right text in it.

So, did something change?
The svn update had a status of G.  According to the manual,

"G foo

    File foo received new changes from the repository, but your local copy of the file had your modifications. Either the changes did not intersect, or the changes were exactly the same as your local modifications, so Subversion has successfully merGed the repository's changes into the file without a problem."

So I don't know if I already had the change or if I had some partial change.
Component: Deki Infrastructure → Other
Product: developer.mozilla.org → developer.mozilla.org Graveyard