Closed Bug 503591 Opened 16 years ago Closed 16 years ago

Mozbot badly parses extended characters in search results

Categories

(Webtools Graveyard :: Mozbot, defect)

x86
Windows XP
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Noah, Assigned: cww)

Details

Attachments

(1 file)

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2a1pre) Gecko/20090626 Firefox/3.0.10
Build Identifier:

The bots have encoding/parsing issues when using the google module to search and return links such as:
http://answers.yahoo.com/question/index%3Fqid%3D20090320050357AA4z7c3
instead of:
http://answers.yahoo.com/question/index?qid=20090320050357AA4z7c3
causing them to fail when used.

Apostrophes are shown as ' in link titles:
Symantec kills 'broken' NAV script blocker: News - Security ... -- http://www.zdnet.com.au/news/security/soa/Symantec-kills-broken-NAV-script-blocker/0,130061744,139234250,00.htm

Reproducible: Always
I tested with firebot, which runs on XP, doing a search for "Symantec kills broken NAV script" and it rendered properly. Is this related to the UTF-8/encoding issue in the infamous Bug 490052? I ask only because that bug shows a similar encoding issue on Ubuntu while firebot rendered correctly, but that wouldn't explain why this shows up on an XP install of mozbot. Noah: Is this on XP or another OS? Wolf, Cww: Any ideas?
Yes, XP. I retested the "Symantec kills broken NAV script" search and it does indeed work. I had originally seen this consistently around 7/14/2008 and afterward. Firebot did update its Google module earlier this year, so that must have fixed it. That'll teach me to retest before posting like that again. But the other issue does still remain. google norton yahoo answers <- for examples.
This is not a duplicate of Bug 490052, having talked with Cww, and I am assigning it to Cww. Also with firebot "google norton yahoo answer" returns "What is the difference between mcafee and norton? - Yahoo! Answers -- http://answers.yahoo.com/question/index%3Fqid%3D20081205115645AAupIjR" which appears to be the "buggy" behavior.
Assignee: nobody → cwwmozilla
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Ok, so what's going on is Google seems to have changed their REST API to double-%-escape URLs (probably for things like IE compatibility). I'm going to try to manually unescape it first but a more subtle approach is needed if that ends up with weird unicode.
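For illustration only, here is a minimal Python sketch of the "manually unescape it first" idea. It is not the attached patch (Google.bm is a mozbot module, and the patch itself is not reproduced in this bug). The function name unescape_result_url is hypothetical. It does a single percent-decoding pass, which is enough to turn the URL reported above into a working one.

```python
from urllib.parse import unquote


def unescape_result_url(url: str) -> str:
    """Percent-decode a search-result URL once (hypothetical helper).

    Assumption: the search API returns reserved characters such as
    '?' and '=' in escaped form (%3F, %3D), so one decoding pass
    restores a usable link. Bytes that are not valid UTF-8 are
    replaced rather than raising, which is unquote()'s default.
    """
    return unquote(url)


if __name__ == "__main__":
    broken = ("http://answers.yahoo.com/question/"
              "index%3Fqid%3D20081205115645AAupIjR")
    print(unescape_result_url(broken))
```

One caveat with this approach: decoding the whole URL also decodes any escapes that were meant to stay escaped (for example an encoded '&' inside a query value), so a more selective approach may be needed if results start coming back mangled.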
Attached patch: patch v1
Attachment #388931 - Flags: review?
Attachment #388931 - Flags: review? → review?(bugtrap)
QA Contact: mozbot → mozilla.bugs
QA Contact: mozilla.bugs → mozbot
Comment on attachment 388931: patch v1

Before:
What is the difference between mcafee and norton? - Yahoo! Answers -- http://answers.yahoo.com/question/index%3Fqid%3D20081205115645AAupIjR

After:
What is the difference between mcafee and norton? - Yahoo! Answers -- http://answers.yahoo.com/question/index?qid=20081205115645AAupIjR

Looks ok to me. r+
Attachment #388931 - Flags: review?(bugtrap) → review+
Keywords: checkin-needed
Checking in BotModules/Google.bm;
/cvsroot/mozilla/webtools/mozbot/BotModules/Google.bm,v  <--  Google.bm
new revision: 1.5; previous revision: 1.4
done
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Product: Webtools → Webtools Graveyard