Closed Bug 105960 Opened 23 years ago Closed 16 years ago

xml.cgi and other future xml pages generate invalid XML

Categories

(Bugzilla :: Query/Bug List, defect, P3)

2.14
defect

Tracking

()

RESOLVED FIXED
Bugzilla 3.2

People

(Reporter: luis, Assigned: khampton)

References

()

Details

Attachments

(3 files, 3 obsolete files)

Because xml.cgi basically hand crafts xml instead of using the standard perl XML
libraries, it doesn't escape/encode umlauts (for example), so the XML it creates
is invalid. This means that sane XML libraries looking to parse the code are...
well, displeased. 

Because we are moving towards templates for all data output (not just xml.cgi),
it seems unlikely that this problem is going to be solved the 'correct' way-
i.e., by using the perl XML libraries to create properly encoded XML.
  
So... I'm not entirely sure that there is any elegant solution to this, except
maybe to provide a 'encode_this_like_XML_likes()' function that can be called
iff xml is chosen as the output format. Not sure how that would work, though.
But it is a problem, and will continue to be if people want to use standard
system libraries (on both linux and windows) to create interaction with bugzilla.
We expect proper escaping for HTML templates, we should do the same for XML
templates.  Template Toolkit has the 'html' filter, does it have an 'xml'
filter, or are they the same?
Priority: -- → P3
Target Milestone: --- → Bugzilla 2.18
Even for cases that have been anticipated, it looks like something weird is
going on here.  Go to http://bugzilla.mozilla.org/xml.cgi?id=1000 and browse the
output and notice the invalid entities being generated for known cases.

QuoteXMLChars() in Bug.pm is supposed to be taking care of &, <, >, ', and ",
but the output above is somehow translating & into &amp;amp;, < into &amp;lt;,
and so on.

It looks as though translation is somehow happening twice, once expanding < to
&lt; for example, then expanding that to &amp;lt; .  But QuoteXMLChars() doesn't
seem to be getting called twice within the <text /> tags output, and it
translates & characters before <, not after, as my guess would indicate.  If it
is because of a recursive substitution, why does it stop?  I am completely
baffled.  Perhaps it is an escaping problem in QuoteXMLChars()?

Templates might be able to fix this, but without knowing what the problem is or
what the html filter covers, I can't be certain.  IIRC, the html filter didn't
cover &quot;, much less umlauts, and I didn't see an xml filter.
To handle the accented characters, would it make sense to set the encoding of
the xml output to latin1?  Something like this:
  <?xml version="1.0" encoding="iso8859-1" standalone="no"?>

Alternatively, the data could be reencoded to UTF-8 (which is the default
encoding for XML files).

The first option is probably the easiest one, and is probably correct if
bugzilla is assuming things work in latin1.

The encoding could easily be part of the template, so that bugzillas in other
countries which have bugs using a different encoding could be exported correctly.

I think a compromise like this would be necessary, given that the encoding of
the text returned by the bugzilla user isn't known (or isn't recorded).
An example of the accented chars causing problems is:
  http://bugzilla.mozilla.org/xml.cgi?id=384

Setting the encoding in the <?xml?> PI makes the file parseable.
note that xml.cgi sends the wrong mimetype, too.
The current version on b.m.o sends text/plain, and lxr said it would send
text/html. tststs, text/xml it should be.

Talking about entities, the xml produces by xml.cgi should be readable for
Mozilla, IMHO, so it needs to be standalone. No dtds are loaded over the
network at all.
If you want to use entities, declare them in the document, or stick to the basic
XML entities plus numerical.
lxr is showing cvs bugzilla - that first content-type happens if you don't put
an id in at all, in which case we do generate an html page.

The dtd is just for validation - there are no entities used.

justdave: should we fix the mimetype for 2.16? Its a simple, obviously correct
fix. (Which will mean that we can't see the output in mozilla directly, w/o
using view source, but....)
If the DTD is only good for validation, then the xml pi should have
<?xml version="1.0" standalone="yes"?>, but xml.cgi serves "no".
If you want to keep validation information, give the right hint that this is 
only needed for validation, but not for the creation of the content.
http://www.w3.org/TR/2000/REC-xml-20001006#sec-rmd
Brad: if you guys agree it needs to be done, I won't stop it from getting
checked in. :-)  Although I learned a lot about DTD-writing recently I'm still
generally clueless on the way the XML gets transmitted.
An easy way to test if the output is formal XML is with the xmllint program that
comes with the libxml2 library (http://www.xmlsoft.org).  Just download the
output of xml.cgi, then run xmllint on it like this:

  xmllint -noout file.xml

If it prints anything, then the file has character set or tag balancing issues.
 You can also perform validity checks (it should be able to download the DTD) with:

  xmllint -noout -valid file.xml

At the moment, it looks like the output from bugzilla.mozilla.org has formality
problems (isn't utf-8, but doesn't specify its character set), and validity
problems (looks like some elements are out of order).

For my purposes (my python module for talking to a bugzilla installation), only
the formality issue is a problem.  It might be worth fixing the validity bug though.

Here is the output from a validity check after setting the character set to
iso8859-1 in the <?xml?> PI:

$ xmllint -noout -valid 384.xml 
384.xml:20: validity error: No declaration for element resolution
  <resolution>INVALID</resolution>
                                 ^
384.xml:199: validity error: Element bug content doesn't follow the DTD
Expecting (bug_id , exporter , urlbase , bug_status , resolution? , product ,
priority , version , rep_platform , assigned_to , delta_ts , component ,
reporter , target_milestone? , bug_severity , creation_ts , qa_contact? ,
status_whiteboard? , op_sys , short_desc? , keywords* , dependson* , blocks* ,
cc* , long_desc? , attachment*), got (bug_id bug_status product priority version
rep_platform assigned_to delta_ts component reporter target_milestone
bug_severity creation_ts qa_contact op_sys resolution short_desc long_desc
long_desc long_desc long_desc long_desc long_desc long_desc long_desc long_desc )
</bug>
     ^
Axel: to summarise: for 2.16, you want:

a) The XML served as text/xml
b) The standalone="no" to become standalone="yes"?

Gerv
Target Milestone: Bugzilla 2.18 → Bugzilla 2.16
Attached patch patch (obsolete) — Splinter Review
That seems like the correct thing to do
Keywords: patch, review
Comment on attachment 81244 [details] [diff] [review]
patch

2r=caillon.  This change is fine.  (is there anything else we need to do though
before closing this bug?  Comment 9 hints that we don't validate per the
bugzilla DTD)
Attachment #81244 - Flags: review+
we validate now - that was a separate bug.

Charset encoding stuff is covered by a separate bug (bug 126255) Won't using
utf8 require us to use perl 5.6? That bug has a patch on it, but it doesn't
really cover whats needed here, it just allows an encoding to be set.

Leaving open, -> 2.18, and depending on bug 126255. I'll mark the other patch as
obsolete, too, so that it doesn't affect queries.
Assignee: endico → bbaetz
Depends on: 126255
Keywords: patch, review
Target Milestone: Bugzilla 2.16 → Bugzilla 2.18
Attachment #81244 - Attachment is obsolete: true
Bug 126255 seems to be something about logging into amazon.com, rather than a
bugzilla bug.  I think you meant bug 126266.  That patches for that bug seems to
cover setting the encoding of the html output of xml.cgi (ie. when you don't
give any bug numbers).  Not sure which bug the charset issue fits into better
(we definitely aren't producing UTF-8 at the moment, so we should be setting it
to _some_ encoding, if we want it to validate).

As for the charset issue, specifying the encoding as iso8859-1 is probably the
best option at the moment.
I forgot to mention that the reason I marked my patch obsolete, and moved this
to 2.18, was because I checked that patch in. Oops.

Yes, I did mean bug 126266, sorry. If that patch goes in as is, we could just
add the use of the Param to the xml header. In any event, working out the
encoding for the xml output implies working out the encoding for the rest of
bugzilla.

We only really support ascii at the moment, I believe, so UTF8 is sort of
correct. The problem in that other bug and related issues is that we don't have
a way of defined what encoding is being used.
I just tried on http://landfill.tequilarista.org/bugzilla-tip/xml.cgi?id=267,
that looks pretty much like what you would expect.
It works in document inspector, entities are as they should be.

Rossi, would you take a look and see if the output works for you?
yes, this one works great. thanks for checking with me!
I found another problem caused by the QuoteXMLChars function in Bug.pm regarding
the '&' -> '&amp;' conversion.

If I have for example a sequence like '->' in a bug description it is converted
to '-&amp;gt;' when exporting this bug with xml.cgi.

Seems to be caused by QuoteXMLChars to be applied more than once on the same
field, thus first converting '>' to '&gt' and then '&gt' to '&amp;gt;'.

The solution is an additional look-ahead in QuoteXMLChars:
  $_[0] =~ s/&/&amp;/g;
should be replaced with
  $_[0] =~ s/&(?![#A-Za-z][0-9A-Za-z]+;)/&amp;/g;

This way, a '&' is not touched if it is used as a character prefix (as in &gt;
or &Auml; or &#8482;).
What verson are you using? I have a recollection of our double (and in some
cases triple) XML Escaping being fixed a while back. See bug 109530.

This is fixed in CVS where the whole xml thing got redone to just be a display
format.
Unloved bugs targetted for 2.18 but untouched since 9-15-2003 are being
retargeted to 2.20
If you plan to act on one immediately, go ahead and pull it back to 2.18.
Target Milestone: Bugzilla 2.18 → Bugzilla 2.20
It still generates invalid xml if you use things like smart quotes, etc. It 
should convert them to &<ordinal value>; type deals. 
I've attached a patch that fixes it. 
 
Attached patch XML quoting fixSplinter Review
Attachment #158828 - Flags: review?
Sigh, i was up too late that night.
What i meant is that you still have to remove the unrepresentable characters.
There are some characters that xml simply doesn't allow.
However, they can still make it into the db occassionally.
I believe they are 0x01-0x08, 0x0b-0x0c 0x0e-0x19. THe XML Spec 2.2 says:

[2]   	Char	   ::=   	#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
[#x10000-#x10FFFF]

Those not in here need to be simply dropped from the output.

ie $var =~ s<([\x01\x02\x03\x04\x05\x06\x07\x08\x0b\x0c\x0e-\x19])><>seg;
Attachment #158828 - Flags: review?(kiko)
Whiteboard: patch awaiting review
This bug has not been touched by its owner in over six months, even though it is
targeted to 2.20, for which the freeze is 10 days away. Unsetting the target
milestone, on the assumption that nobody is actually working on it or has any
plans to soon.

If you are the owner, and you plan to work on the bug, please give it a real
target milestone. If you are the owner, and you do *not* plan to work on it,
please reassign it to nobody@bugzilla.org or a .bugs component owner. If you are
*anybody*, and you get this comment, and *you* plan to work on the bug, please
reassign it to yourself if you have the ability.
Target Milestone: Bugzilla 2.20 → ---
Comment on attachment 158828 [details] [diff] [review]
XML quoting fix

Canceling review requests since comment 23 indicates this patch needs more
work.  Daniel, are you still around to update your patch?
Attachment #158828 - Flags: review?(kiko)
Attachment #158828 - Flags: review?
*** Bug 291503 has been marked as a duplicate of this bug. ***
Whiteboard: patch awaiting review
Poking dberlin, who AFAICT was not on the cc list.
I'll update it. It's a one liner.
QA Contact: mattyt-bugzilla → default-qa
It's still possible to generate bad XML. See for example http://landfill.bugzilla.org/bugzilla-tip/show_bug.cgi?ctype=xml&id=5032 (plus real-world example at http://bugzilla.gnome.org/show_bug.cgi?id=417196&ctype=xml)
I stumbled across another real-world example of this today; the following link results in invalid XML...

https://bugs.eclipse.org/bugs/show_bug.cgi?ctype=xml&id=140108
Attached patch V1 (obsolete) — Splinter Review
The proposed patch addresses two issues, both related to generating well-formed XML:

1) It replaces instances of '&' with '&amp;' while skipping over valid character entities (so that '&trade;' would *not* become '&amp;trade;' for example).

2) It knocks out characters from disallowed character ranges (control characters, etc) per the XML 1.0 Spec (production 2.2). Note that the more verbose form of expressing the hex characters is used in the regex because the abbreviated form appearing in character classes evidently makes certain perl's lexers cry bitter tears of failure.
Attachment #354687 - Flags: review?(mkanat)
Comment on attachment 354687 [details] [diff] [review]
V1

>+    # substitute &amp; for & unless it is already 
>+    # used in a character entity.
>+    $var =~ s/&(?![#A-Za-z][0-9A-Za-z]+;)/&amp;/g;

  That's getting too complex. The way the filter is used, it should be displaying "&trade;" if somebody writes "&trade;".

>+    # the following nukes characters disallowed by the XML 1.0
>+    # spec, Production 2.2. 1.0 declares that only the following 
>+    # are valid:
>+    # (#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF])
>+    $var =~ s/([\x{0001}-\x{0008}]|
>+               [\x{000B}-\x{000C}]|
>+               [\x{00E}-\x{0019}]|
>+               [\x{D800}-\x{DFFF}]|
>+               [\x{FFFE}-\x{FFFF}])//gx;

  I'd rather replace them with HTML entities, is that possible? People export data via XML, and theoretically some of these characters could be in comments (as unlikely as it seems).
Attachment #354687 - Flags: review?(mkanat) → review-
html entities have nothing to do in xml output. You could use numeric xml entities, fwiw, http://www.w3.org/TR/2006/REC-xml11-20060816/#sec-references.
Except that you can't, see the fourth line in that section (or any of the other thousands of places googling "form feed xml" will show you people being shocked to find they can't "just use XML" to wrap any data). You simply *cannot* represent a form-feed in XML 1.0 content. You can use &amp;#x12;, to turn an actual form-feed character into the literal string "&x12;" if you find that noisy dataloss is better than silent dataloss, but you can no more use &#x12; than you can use the actual character.

(Thanks to its desire to be as incompatible and unused as possible, you actually can have &#x12; in your XML 1.1, though good luck finding anything to consume XML 1.1 :)
(In reply to comment #34)

> >+    $var =~ s/&(?![#A-Za-z][0-9A-Za-z]+;)/&amp;/g;
> 
>   That's getting too complex. The way the filter is used, it should be
> displaying "&trade;" if somebody writes "&trade;".

Well, that's the real question, I suppose: "display" vs. "consume". If I'm only looking at a bug "in XML" in a browser I'd expect what you expect; if I'm passing that data to some kind of XML tool chain in order to do something with it, I'd expect the entities to be preserved as-is.
> 
> >+    # the following nukes characters disallowed by the XML 1.0
> >+    # spec, Production 2.2.

::snipping fulgy regex::
 
>   I'd rather replace them with HTML entities, is that possible? People export
> data via XML, and theoretically some of these characters could be in comments
> (as unlikely as it seems).

As Phil rightly points out below, any instance of one of the characters not allowed in Production 2.2-- whether expressed by cutting and pasting from a Word doc or written by hand as entity ref to the code point-- instantly makes the document not-well-formed and all XML 1.0 parsers are /required/ to throw a fatal error. I realize that just nuking them is lossy and might break expectations, but there's simply no way to doll them up to make XML parsers happy and still maintain data integrity.
(In reply to comment #37)
> Well, that's the real question, I suppose: "display" vs. "consume".

If I type "&amp; means & in HTML" in a comment, I want it to be set back to this exact string after the XML parser has read the XML file. I don't want the XML parser to convert it back to "& means & in HTML". So when you generate the XML file, you should encode & in all cases, and not assume that the input data was already escaped.

About illegal characters, as philor says we have no way to pass them due to limitations of the XML 1.0 spec, I think it's better to silently skip them as it's very unlikely that we will "see" them anyway.

So simply remove the first part of your patch about & and re-request review.
Assignee: bbaetz → khampton
OS: Other → All
(In reply to comment #36)
> Except that you can't, see the fourth line in that section (or any of the other
> thousands of places googling "form feed xml" will show you people being shocked
> to find they can't "just use XML" to wrap any data). You simply *cannot*
> represent a form-feed in XML 1.0 content. 

Exactly.

> You can use &amp;#x12;, to turn an
> actual form-feed character into the literal string "&x12;" if you find that
> noisy dataloss is better than silent dataloss, but you can no more use &#x12;
> than you can use the actual character.

Okay, that's interesting. It doesn't buy much but at least it keep the code
from silently eating parts of the users' data. Personally, I favor just nuking them since the most common case in my experience is that these characters are artifacts of copy/paste and are not intentional.

> (Thanks to its desire to be as incompatible and unused as possible, you
> actually can have &#x12; in your XML 1.1, though good luck finding anything to
> consume XML 1.1 :)

Heh, yeah. I expect to have my flying car /long/ before any 1.1 parsers make it
out of the lab.
(In reply to comment #35)
> html entities have nothing to do in xml output. You could use numeric xml
> entities, fwiw, http://www.w3.org/TR/2006/REC-xml11-20060816/#sec-references.

Check the version on that spec. XML 1.1 does indeed allow for certain characters that XML 1.0 does not, but, to my knowledge, there are /no/ 1.1 parsers in the wild.
(In reply to comment #38)

> So simply remove the first part of your patch about & and re-request review.

Thanks, Frédéric. Will do.
Attached patch Character Stripping V2 (obsolete) — Splinter Review
Resubmitted with suggested changes. The patch now only nukes non-XML 1.0-friendly characters.
Attachment #354687 - Attachment is obsolete: true
Attachment #354764 - Flags: review?(mkanat)
Comment on attachment 354764 [details] [diff] [review]
Character Stripping V2

>+               [\x{00E}-\x{0019}]|

For consistency, should you write 000E instead of 00E? That's minor, though.
(In reply to comment #43)
> For consistency, should you write 000E instead of 00E? That's minor, though.
 
Yep, typo. Good catch. Resubmitting.
Fixed typo.
Attachment #354764 - Attachment is obsolete: true
Attachment #354766 - Flags: review?(mkanat)
Attachment #354764 - Flags: review?(mkanat)
Comment on attachment 354766 [details] [diff] [review]
Character Stripping V3

This looks fine to me. :-)
Attachment #354766 - Flags: review?(mkanat) → review+
Flags: approval3.2+
Flags: approval+
Target Milestone: --- → Bugzilla 3.2
tip:

Checking in Bugzilla/Util.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/Util.pm,v  <--  Util.pm
new revision: 1.78; previous revision: 1.77
done

3.2:

Checking in Bugzilla/Util.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/Util.pm,v  <--  Util.pm
new revision: 1.69.2.6; previous revision: 1.69.2.5
done
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Attached patch patch to V3Splinter Review
I've made this same mistake twice, publicly, in the last 24 hours!

The character that precedes 0x20 is 0x1f, not 0x19.
Attachment #365723 - Flags: review?(mkanat)
ubu: Is dwm correct?
(In reply to comment #49)
> ubu: Is dwm correct?

Yes.
Comment on attachment 365723 [details] [diff] [review]
patch to V3

Yes, this change is correct, per the spec. r=LpSolit
Attachment #365723 - Flags: review?(mkanat) → review+
tip:

Checking in Bugzilla/Util.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/Util.pm,v  <--  Util.pm
new revision: 1.86; previous revision: 1.85
done

3.2.2:

Checking in Bugzilla/Util.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/Util.pm,v  <--  Util.pm
new revision: 1.69.2.9; previous revision: 1.69.2.8
done
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: