Add UTF-8 support in mozbot

RESOLVED FIXED in 2.8

Status

Webtools
Mozbot
--
critical
RESOLVED FIXED
9 years ago
7 years ago

People

(Reporter: tmyoung, Assigned: Cww)

Tracking

Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment, 4 obsolete attachments)

(Reporter)

Description

9 years ago
User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1b3) Gecko/20090305 Firefox/3.1b3
Build Identifier: Mozbot 2.6

When the Google Module outputs a series of search results with a UTF-8 character (using the function "google"), perl crashes and cites a "wide character error."

In the steps to reproduce I have tested this with a search string that will load my website "Al Young Studios" as the first result, in the title is a (TM) character.

Reproducible: Always

Steps to Reproduce:
1. In IRC as mozbot admin, "mozobt: load Google" (if it not already loaded)
2. type "mozbot: google Al Young Studios"
Actual Results:  
Perl suffered a fatal error and closed the bot script.

Expected Results:  
The bot should have printed the full list search results, but stopped before it loaded the page title with the UTF-8 char.

Script is running on an Ubuntu 8.10 server with Perl 5.10.0.  The IRC server is InspIRCD.
(Reporter)

Comment 1

9 years ago
In trying to fix the module, I added "use Encode;" and changed "my $title = $result->title;" to "my $title = encode('ascii',$result->title,Encode::FB_HTMLCREF);" which prevents the crash; however UTF-8 Characters are output as &#xxx.

The other option I have tried is loading "HTML::Entities" and trying "my $title = HTML::Entities::encode($result->title);"  This returns the characters as their HTML, value (e.g. ™ or ’), but it shows the bold tags from the Google API.  Any attempt I've made to decode or recode the chars in another format seems to return the same error.
(Reporter)

Comment 2

9 years ago
I have been able to reproduce it on another install of Ubuntu 8.10 with "Perl 5.10.0 built for i486-linux-gnu-thread-multi" with the same settings and same testcase.
Severity: major → normal
OS: All → Linux
Hardware: All → x86
(Assignee)

Comment 3

9 years ago
REST::Google::Search probably doesn't handle encoding right and Net::IRC has a nasty habit of crashing fatally when it sees it.  I can take this since I've had lots of practice with UTF-8 working with Net::IRC.
Assignee: nobody → cwwmozilla
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
(Assignee)

Comment 4

9 years ago
Created attachment 379063 [details] [diff] [review]
Patch to encode into utf8

You were on the right track... the solution is to use encode utf8 which handles pretty much everything sanely and if it doesn't will fall back sanely.  Since (TM) isn't an ascii character Encode has to fall back to &# entities to represent it.
Attachment #379063 - Flags: review?
(Assignee)

Updated

9 years ago
Attachment #379063 - Flags: review? → review?(bugtrap)
(Assignee)

Comment 5

9 years ago
I guess I should clarify my comment.  I mean it has to fall back to &# when using encode ascii but won't fall back when using encode utf8.  Most IRC clients can display utf8 just fine.  I'm also upping the severity (crash bugs are critical).
Severity: normal → critical

Comment 6

9 years ago
Comment on attachment 379063 [details] [diff] [review]
Patch to encode into utf8

The changes look simple enough. but the patch fails to apply for me. Has it been tested on a mozbot that can reliably reproduce the crash? (firebot and firewolfbot (my test bot) don't crash since they're on windows.) 

Patch gives this output (patch 2.5.8): 
Hmm...  Looks like a unified diff to me...
can't find file to patch at input line 8
Perhaps you should have used the -p or --strip option?
The text leading up to this was:
--------------------------
|Index: BotModules/Google.bm
|===================================================================
|RCS file: /cvsroot/mozilla/webtools/mozbot/BotModules/Google.bm,v
|retrieving revision 1.2
|diff -u -3 -w -r1.2 Google.bm
|--- BotModules/Google.bm       21 Feb 2009 02:15:14 -0000      1.2
|+++ BotModules/Google.bm       22 May 2009 06:33:07 -0000
--------------------------
File to patch: BotModules/Google.bm
Patching file BotModules/Google.bm using Plan A...
Hunk #1 succeeded at 17.
patch: **** malformed patch at line 26:                 my $term2 = $3;
(Reporter)

Comment 7

9 years ago
I have had another user and myself test the issue, and we have confirmed the problem on two more Ubuntu versions, 8.04 and 9.04.

On one system (9.04), it caused the bot to crash, on the other (8.04) this message was sent as a warning:

Wide character in print at ./mozbot.pl line 1348 (#2)
    (W utf8) Perl met a wide character (>255) when it wasn't expecting
    one.  This warning is by default on for I/O (like print).  The easiest
    way to quiet this warning is simply to add the :utf8 layer to the
    output, e.g. binmode STDOUT, ':utf8'.  Another way to turn off the
    warning is to add no warnings 'utf8'; but that is often closer to
    cheating.  In general, you are supposed to explicitly mark the
    filehandle with an encoding, see open and perlfunc/binmode.

But the bot recovered from the error and displayed the entry (though it appeared while using the WWW module, which I will file as a separate bug, since it caused crashes on the other two systems).
(Reporter)

Updated

9 years ago
Blocks: 494554
(Reporter)

Comment 8

9 years ago
Created attachment 379625 [details] [diff] [review]
Alternate export of the utf-8 patch

I have exported another version of the patch duplicating Cww's changes which should have the correct cvs diff flags.
Attachment #379063 - Attachment is obsolete: true
Attachment #379063 - Flags: review?(bugtrap)
(Reporter)

Updated

9 years ago
Attachment #379625 - Flags: review?(bugtrap)

Comment 9

9 years ago
Comment on attachment 379625 [details] [diff] [review]
Alternate export of the utf-8 patch

>+use Encode;

>-                $self->say($event, $event->{'from'} . ': ' . $result);
>+                $self->say($event, $event->{'from'} . ': ' . Encode::encode('utf-8', $result));
>             }

Is there any reason we can't/shouldn't do the Encode call from within the ->say() method (http://mxr.mozilla.org/webtools/source/mozbot/mozbot.pl#1948), instead of encoding the msg in each module? (See also Bug 494820 Attachment 379624 [details] [diff]; Bug 494554 Attachment 379621 [details] [diff]; which also use say().) It seems like we'd fix more instances of this crash that way.

Updated

9 years ago
Blocks: 494820
(Reporter)

Comment 10

9 years ago
Created attachment 382431 [details] [diff] [review]
Updated say function in mozbot.pl

I agree with Wolf.  His solution in Comment #9 is much more elegant and scalable.

I will continue running tests, but so far this solves all UTF-8 crashes.
Attachment #379625 - Attachment is obsolete: true
Attachment #379625 - Flags: review?(bugtrap)
(Reporter)

Updated

9 years ago
Attachment #382431 - Flags: review?(bugtrap)
(Reporter)

Comment 11

9 years ago
I haven't run into any regression with attachment 382431 [details] [diff] [review], so unless, Wolf finds any regressions, it should fix things very well.
(Reporter)

Comment 12

9 years ago
Changing bug name to "Add UTF-8 support in mozbot" since it is more inclusive now.
Summary: Google Module crash when UTF-8 character in search results output → Add UTF-8 support in mozbot
(Assignee)

Comment 13

9 years ago
(In reply to comment #9)
This would break all modules that currently have an Encode as well as all modules that pipe content pre-encoded without unencoding and then reencoding (I think infobot does) as it would double-encode.  So all open mozbot installs so there's a lot of work involved.  We can't take this patch as is.

I agree it's the elegant solution, though.

Instead, we can do a try/catch (although that would be a small performance hit) and then encode only if we have to OR we could just patch individual modules.
(Assignee)

Updated

9 years ago
Attachment #382431 - Flags: review-
(Assignee)

Comment 14

9 years ago
Comment on attachment 382431 [details] [diff] [review]
Updated say function in mozbot.pl

r- from me.

Comment 15

9 years ago
Comment on attachment 382431 [details] [diff] [review]
Updated say function in mozbot.pl

Per Comment #13
Attachment #382431 - Flags: review?(bugtrap) → review-

Comment 16

9 years ago
(In reply to comment #13)
> Instead, we can do a try/catch (although that would be a small performance hit)
> and then encode only if we have to OR we could just patch individual modules.
Would be nice if we could quantify how much of a perf hit it would be. I'm ok with leaving it as an exercise for the module though. (Though, if only infobot currently does it, couldn't we just fix infobot?)
(Assignee)

Comment 17

9 years ago
I have no idea which modules encode but it's every module that handles UTF-8 wide characters correctly without crashing.  What I'm more worried about is modules that are written by other mozbot users -- we're essentially drastically changing the API with no input whatsoever.  Also we'd have to change all the other "outputs" that is do the same to notices and emotes since it would suck to have to encode for emotes but NOT for say and be thoroughly confusing.  With a try/catch, you avoid all of these since it only does it if it has to.

(And now that I think about it, I'm not sure it's actually THAT MUCH of a perf hit because we're not passing in wide characters all too often.)  I'll test this when I write it.
(Assignee)

Comment 18

9 years ago
Created attachment 393072 [details] [diff] [review]
patch v1

patch doesn't have a terrible perf hit but I haven't really tested it hard.
Attachment #393072 - Flags: review?
(Assignee)

Updated

9 years ago
Attachment #393072 - Flags: review? → review?(bugtrap)
(Reporter)

Updated

9 years ago
QA Contact: mozbot → mozilla.bugs
QA Contact: mozilla.bugs → mozbot
(Assignee)

Comment 19

9 years ago
Created attachment 393078 [details] [diff] [review]
patch v1.1

Oooops... forgot to add the use Encode at the top.
Attachment #393072 - Attachment is obsolete: true
Attachment #393078 - Flags: review?(bugtrap)
Attachment #393072 - Flags: review?(bugtrap)

Comment 20

9 years ago
Is the perf hit end-user noticeable?

Would be nice to have an idea how much this slows things down, if its at all noticeable. (its probably negligible compared to the throttling mozbot already does.. but..)
(Assignee)

Comment 21

9 years ago
I didn't notice it at all.
(Reporter)

Updated

9 years ago
Duplicate of this bug: 494820
(Reporter)

Updated

9 years ago
Duplicate of this bug: 494554
(Reporter)

Comment 24

8 years ago
I just thoroughly tested the patch, and there is no perceptible performance hit.  

If there is any change, than it is a few ms, not anything that I have tools to measure.  Consistently with Google.bm, a query run unpatched takes 2 sec., and a query with the patch takes 2 sec.  I don't have anything that I am aware that would allow me to test this to any finer degree, but if there is, let me know, but since this difference is less than .5 sec, there shouldn't be any end-user perf hit.

I would say this is ready for check-in, with Wolf's approval.
(Reporter)

Updated

8 years ago
Target Milestone: --- → 2.8
(Reporter)

Updated

8 years ago
Attachment #382431 - Attachment is obsolete: true

Updated

8 years ago
Attachment #393078 - Flags: review?(bugtrap) → review+
(Reporter)

Comment 25

8 years ago
Cww, Wolf: Which of you plan to land this?
Keywords: checkin-needed
Whiteboard: [c-n cvs]

Comment 26

7 years ago
Checking in mozbot.pl;
/cvsroot/mozilla/webtools/mozbot/mozbot.pl,v  <--  mozbot.pl
new revision: 2.63; previous revision: 2.62
done
Status: ASSIGNED → RESOLVED
Last Resolved: 7 years ago
Keywords: checkin-needed
Resolution: --- → FIXED
Whiteboard: [c-n cvs]
You need to log in before you can comment on or make changes to this bug.