Closed Bug 132583 Opened 22 years ago Closed 22 years ago

Anchors that contain non-UTF-8 characters may hang Mozilla

Categories

(Core :: XPCOM, defect, P4)

x86
Windows 2000
defect

Tracking

()

VERIFIED FIXED
mozilla1.0

People

(Reporter: emaijala+moz, Assigned: jag+mozilla)

References

Details

(Keywords: intl, Whiteboard: [adt2])

Attachments

(2 files, 2 obsolete files)

Clicking anchor that contains scandinavian characters (anchor is "Origo
käytössä") on a page with Windows-1252 character set hangs Mozilla. This is
because the conversion from UTF8 to USC2 goes to an infinite loop if given wrong
input. 

I'll attach a patch that checks the validity of the string before passing it to
the conversion. 

I hope that the component is correct.
I added a small function to string/obsolete/nsString2.h. It is used in
docshell/base/nsDocShell.cpp to check if the string is valid UTF-8 before
passing it to NS_ConvertUTF8toUCS2().

Now I'm not sure if this is the best way to do it or if the functions are in
correct places. Input requested.
Don't try this in the middle of something important :)
isn't the real issue that NS_ConvertUTF8toUCS2 will loop on invalid input?
Summary: Clicking anchor that contains scandinavian characters may hang Mozilla → Clicking anchor that contains scandinavian characters may hang Mozilla
Well, in the conversion code there are lines such as this: 
NS_ERROR("Not a UTF-8 string. This code should only be used for converting from
known UTF-8 strings.");

That's why I assumed it's better to avoid calling it..
Oh, I forgot to say that it would be a simple fix to make it not loop forever.
It will already fail an assertion in the loop. Anyway, I'm not sure if it's
proper, because then the result might be unexpected.

Sorry for the spam. 
After investigating more I've found that there are more places where this can
happen (nsHTMLContentSink, nsXMLContentSink at least). Changing summary to
reflect this better.
Assignee: adamlock → jaggernaut
Component: Embedding: Docshell → String
QA Contact: adamlock → scc
Summary: Clicking anchor that contains scandinavian characters may hang Mozilla → Anchors that contain non-UTF-8 characters may hang Mozilla
Comment on attachment 75387 [details] [diff] [review]
Patch to avoid trying the conversion if the string isn't UTF-8

Does not fix all situations. Need to find better place for isValidUTF8() also.
Attachment #75387 - Attachment is obsolete: true
If we don't get anything else into 1.0, I suggest we use this. This will
prevent the string iterator from staying in the infinite loop. Practically I've
just done what there was in comments (breaking out if !one_hop).
*** Bug 134714 has been marked as a duplicate of this bug. ***
adding a test case from bug 134714 and nominating for nsbeta1

const PRUnichar * str = NS_ConvertUTF8toUCS2("\xE0\xE0\xE0").get();

Keywords: nsbeta1
Nav triage team: nsbeta1+, adt2
Keywords: nsbeta1nsbeta1+
Whiteboard: [adt2]
-> 1.0
Target Milestone: --- → mozilla1.0
Priority: -- → P4
This has to be fixed soon. Bug 139842, bug 142930 are caused by this.
cc to waterson, I think NS_ConvertUTF8toUCS2 should validate the input and
detect illegal UTF-8.
This is not limited to the corrupted data outside from Mozilla.
It can be caused by defrating PRUnichar to char then calling
NS_ConvertUTF8toUCS2 later.
This will set the string to an empty string if the input isn't UTF8.
Attachment #77060 - Attachment is obsolete: true
Can the code put 0xFFFD for invalid bytes until it sees any valid bytes?
That way, the user can see something like "???abc" instead of a blank string.
> Can the code put 0xFFFD for invalid bytes until it sees any valid bytes?
> That way, the user can see something like "???abc" instead of a blank string.
> 
see the patch of bug 128896
http://bugzilla.mozilla.org/attachment.cgi?id=76351&action=view
Yes, I see your solution there, but there is (or was) code out there that went
something like:

NS_ConvertUTF8toUCS2 temp(someUnknownEncodingStr);
if (temp.IsEmpty()) {
  // Looks like we're not UTF8, try something else ...
  ...
}

Which the old version of this code supported and I broke when I refactored parts
of it.

What I really want (and that's why I've put NS_ERRORs in place) is that people
only call this code when they know that it's a UTF8 string, and to use the intl
converter code when they don't know what they're dealing with.
Comment on attachment 82858 [details] [diff] [review]
Make NS_ConvertUTF8toUCS2 deal better with faulty input

>Index: nsString2.cpp

>-    SetCapacity(count);
>+    SetLength(count);
>       // |SetCapacity| normally doesn't guarantee the use we are putting it to here (see its interface comment in nsAString.h),
>       //  we can only use it since our local implementation, |nsString::SetCapacity|, is known to do what we want

seems like this comment no longer applies.

otherwise, the patch looks good to me.	r/sr=darin
Attachment #82858 - Flags: superreview+
Comment on attachment 82858 [details] [diff] [review]
Make NS_ConvertUTF8toUCS2 deal better with faulty input

r=scc.	It looks reasonable to me.
Attachment #82858 - Flags: review+
Checked in on trunk. Will check in on branch when I get the appropriate approvals.
Status: NEW → RESOLVED
Closed: 22 years ago
Keywords: adt1.0.0
Resolution: --- → FIXED
Blocks: 141008
Keywords: intl
Let's get this verified on the trunk first.
Verified. Very good :)
Status: RESOLVED → VERIFIED
adding adt1.0.0+.  Please get drivers approval and then checkin to the 1.0 Branch.
Keywords: adt1.0.0adt1.0.0+
Comment on attachment 82858 [details] [diff] [review]
Make NS_ConvertUTF8toUCS2 deal better with faulty input

a=chofmann,dbaron,valeski for the 1.0 branch
Attachment #82858 - Flags: approval+
Checked into the branch.
Keywords: fixed1.0.0
*** Bug 145479 has been marked as a duplicate of this bug. ***
Blocks: 146292
No longer blocks: 141008
Component: String → XPCOM
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: