Open Bug 538946 Opened 15 years ago Updated 9 years ago

The XML file for bugs is broken if a comment contains nul characters

Categories

(Bugzilla :: Bug Import/Export & Moving, defect)

defect
Not set
minor

People

(Reporter: gerv, Unassigned)

Details

This URL (which is the XML output for bug 53703):
https://bugzilla.mozilla.org/show_bug.cgi?id=53703&ctype=xml
gives an XML parsing error when loaded in Firefox. That's due to the random strange characters in comment #1. Not sure how they got there, but our XML outputting code should do the right thing with them and escape them correctly if necessary.

Either that, or it's a Firefox XML parser bug. But let's see if it's us first :-)

Gerv
(bug 53703 comment #0)
> mention the use of <> to separate the name from the e-mail address.  I������t
> does 
> emphasize the importance of making sure the user is aware of what's going on.

These are six zero bytes.  Broken UTF-8, unescaped.  In IE the same character looks like a newline, and 'Show html source' stops just before it.

IE 8 does not parse it either.
Although calling "ctype=xml" on _this_ bug, into which you have inserted those characters, does work OK. So there is some difference between comment 1 here, and the relevant comment on the other bug.

Gerv
(In reply to comment #2)
> Although calling "ctype=xml" on _this_ bug, into which you have inserted those
> characters, does work OK. So there is some difference between comment 1 here,
> and the relevant comment on the other bug.

Each zero byte was rendered by Firefox as the UTF-8 sequence 0xEF 0xBF 0xBD, which is the valid Unicode character U+FFFD REPLACEMENT CHARACTER:

http://www.fileformat.info/info/unicode/char/fffd/index.htm

Copying and pasting that from the HTML page into a textarea results in the same UTF-8 sequence.
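
To illustrate why the pasted characters behave differently from the original ones, here is a minimal Python sketch. It checks that 0xEF 0xBF 0xBD is the UTF-8 encoding of U+FFFD and that a parser accepts that character in element content. The element name `thetext` is only an illustrative assumption here, and Python's standard-library parser stands in for the browser's.

```python
import xml.etree.ElementTree as ET

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER

# The three bytes quoted above are the UTF-8 encoding of U+FFFD.
encoded = REPLACEMENT.encode("utf-8")
print(encoded.hex(" "))  # ef bf bd

# A made-up comment element containing six pasted replacement characters.
doc = "<thetext>I" + REPLACEMENT * 6 + "t</thetext>"
parsed = ET.fromstring(doc)  # parses without error: U+FFFD is legal in XML
print(len(parsed.text))  # 8
```

So a comment holding pasted U+FFFD characters is harmless to XML output, which would be consistent with ctype=xml working on this bug.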

Perhaps this situation should be addressed by checksetup.pl or sanitycheck.cgi, not XML output.
We need to figure out:

1) How they got in there (can you paste a bunch of NULs into a textarea and have them end up in the database?)

2) Whether we think those characters should be valid in a comment, and if not, why not, and where should they be stripped out?

3) The NUL character is excluded from XML 1.0 entirely, even as an escaped character reference such as &#0;. So if we do allow NUL characters in comments, we need to filter them out before writing the comment out to XML.

Gerv
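
Point 3 can be checked with a short Python sketch. It uses the standard-library `xml.etree.ElementTree` parser as a stand-in for a browser's XML parser; the element name `c` and the helper `is_well_formed` are hypothetical, made up for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the standard-library parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


raw_nul = "<c>a\x00b</c>"      # literal NUL in the content
escaped_nul = "<c>a&#0;b</c>"  # NUL written as a character reference
escaped_tab = "<c>a&#9;b</c>"  # TAB is an allowed character, for comparison

for label, doc in [("raw NUL", raw_nul),
                   ("escaped NUL", escaped_nul),
                   ("escaped TAB", escaped_tab)]:
    print(label, is_well_formed(doc))
```

Both NUL variants are rejected, so escaping alone cannot rescue the output; the character has to be removed or replaced before serialization.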
Oh, and:

4) If we do decide to exclude them, what should we do about databases in which they are already present?

I would try and put some Nul characters in here, but I can't find a way to copy and paste one :-(

Gerv
(In reply to comment #4)
> We need to figure out:

> 1) How they got in there (can you paste a bunch of NULs into a textarea and
> have them end up in the database)

They went in well before bug 126266 and somehow were not fixed by bug 280633.

> 2) Whether we think those characters should be valid in a comment, and if not,
> why not, and where should they be stripped out?

These are invalid, from UTF-8 standpoint.

> 3) The Nul character is excluded from XML, even when escaping. So if we do
> allow nul characters in comments, we need to filter them out before writing the
> comment out to XML.

IMHO we don't allow them, at least intentionally :-)

> I would try and put some Nul characters in here, but I can't find a way to copy
> and paste one :-(

You would need to use a command-line web client with low-level capabilities to do so.
Wow, this is definitely pretty weird. I've never seen this happen in a Bugzilla before, but it's also possible that it's just that nobody's noticed it.

I think modifying the XML filter to remove them would be a reasonable solution, unless it's actually currently the XML filter mis-parsing them and causing some problem.
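
As a rough idea of what such a filter could look like, here is a Python sketch. It is not Bugzilla's actual XML filter; the function name `xml_text` and the element name `thetext` are assumptions made for this example. It drops every character outside the XML 1.0 `Char` production and then escapes the markup characters.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Complement of the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def xml_text(value: str) -> str:
    """Drop characters XML 1.0 cannot represent, then escape &, < and >."""
    return escape(_INVALID_XML_CHARS.sub("", value))


comment = "I\x00\x00\x00\x00\x00\x00t uses <> & more"
doc = "<thetext>" + xml_text(comment) + "</thetext>"
print(ET.fromstring(doc).text)  # It uses <> & more
```

Whether to drop the characters silently or substitute something visible such as U+FFFD is a policy choice this sketch does not settle.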
Severity: normal → minor
Summary: Broken XML output by ctype=xml for a particular b.m.o. bug → The XML file for bugs is broken if a comment contains nul characters
Is there a workaround to dump the file to xml, excluding the bugs that have illegal bytes in them?

The XML dump is quite useful for statistical studies :)
Mihai: if you are a researcher and want a sanitized copy of the Bugzilla database, ask for one. Don't scrape all 600,000 bugs.

Gerv
This also affects characters other than NUL that are invalid in XML.

See https://landfill.bugzilla.org/bugzilla-3.6-branch/show_bug.cgi?ctype=xml&id=11565 as an example.

Also occurs when making the Bugs.comment RPC call.

  -- simon
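
For reference, the set of characters involved is wider than NUL. The following Python sketch encodes the XML 1.0 `Char` production as a predicate and lists the offending code points in a string; the helper names `is_xml_char` and `invalid_xml_chars` are hypothetical.

```python
def is_xml_char(cp: int) -> bool:
    """True if code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def invalid_xml_chars(text: str) -> list:
    """List the distinct offending code points in text as U+XXXX labels."""
    bad = sorted({ord(ch) for ch in text if not is_xml_char(ord(ch))})
    return [f"U+{cp:04X}" for cp in bad]


print(invalid_xml_chars("ok\x0bvertical tab, \x1b escape, \uffff"))
# ['U+000B', 'U+001B', 'U+FFFF']
```

Of the 32 code points below U+0020, only TAB, LF and CR pass this check, so any of the other 29 control characters in a comment would break the output in the same way.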
(In reply to comment #11)
> Also occurs when making the Bugs.comment RPC call.

  I bet that'd be resolved by using the JSON-RPC interface, BTW, for now as a workaround.
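
The reason JSON is likely to cope is that it can escape control characters, including NUL, as `\uXXXX` inside a string. Here is a small Python sketch of that property. It says nothing about how Bugzilla's JSON-RPC interface is actually implemented, and the `text` key is only an assumed field name.

```python
import json

comment = "I\x00\x00t and \x0b too"

# Control characters are serialized as \uXXXX escapes, so the payload
# itself contains no raw NUL bytes.
wire = json.dumps({"text": comment})
print(wire)  # {"text": "I\u0000\u0000t and \u000b too"}

# The original string survives the round trip unchanged.
assert json.loads(wire)["text"] == comment
```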