Closed
Bug 368071
Opened 18 years ago
Closed 18 years ago
importxml fails to import if there is no newline before <?xml
Categories
(Bugzilla :: Bug Import/Export & Moving, defect)
Tracking
()
RESOLVED
FIXED
Bugzilla 2.22
People
(Reporter: valorbugzilla, Assigned: valorbugzilla)
Details
Attachments
(3 files, 4 obsolete files)
1.41 KB,
text/plain
|
Details | |
1.14 KB,
patch
|
gregaryh
:
review+
|
Details | Diff | Splinter Review |
1.14 KB,
patch
|
gregaryh
:
review+
|
Details | Diff | Splinter Review |
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.8) Gecko/20061113 Debian/1.7.8-1sarge8
Build Identifier: 2.22.1
Hi,
From MIME::Parser help:
> Warning: it is assumed that, once the files are cat'ed together,
> there will be a blank line separating the head part and the body part.
So if we try to import a file which starts with <?xml when importxml.pl throw the file to MIME::Parser, MIME::Parser silently returns nothing when we ask for the parsed string.
Reproducible: Always
Steps to Reproduce:
1. Create a xml file to import starting on first line with <?xml
2. Try to import it.
3.
Actual Results:
XML:: complains because there are no tag.
This is because there are no text at all in $xml as MIME::Parser returned an empty string.
Expected Results:
Parse it as usual.
Assignee | ||
Comment 1•18 years ago
|
||
This affects 2.22.1 and the tip.
I'll submit patch for both, in a minute for 2.22.1 and in a couple of hours when I get to my home for 2.23.1.
Version: unspecified → 2.22.1
Assignee | ||
Comment 2•18 years ago
|
||
Regexp to check if $xml start with xml.
Attachment #252626 -
Flags: review?(LpSolit)
![]() |
||
Comment 3•18 years ago
|
||
No such problem with version 5.420 of MIME::Parser:
perl -MMIME::Parser -we 'print $MIME::Parser::VERSION;'
5.420
Assignee | ||
Comment 4•18 years ago
|
||
I'm using 5.420 too where I tryed the problem. (!)
Are you sure you tryed to import a file which started on first line with the xml file without any new line before "<?xml .... ?>" ?
The xml file attached is what I used for testing, proovided by Nestor Vigo, the one who found the bug.
![]() |
||
Comment 5•18 years ago
|
||
Comment on attachment 252643 [details]
Xml file to test the bug.
><?xml version="1.0" standalone="yes" ?>
><!DOCTYPE bugzilla SYSTEM "http://arspi/bugzilla/bugzilla.dtd">
><bugzilla version="2.22" urlbase="http://arspi/bugzilla/"
>maintainer="Nestor.Vigo@pearson.com" exporter="Nestor.Vigo@pearson.com">
This file is not the one generated by Bugzilla! Mines have:
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "http://localhost/bugzilla222/bugzilla.dtd">
<bugzilla version="2.22.1+"
urlbase="http://localhost/bugzilla222/"
maintainer="LpSolit@netscape.net"
exporter="LpSolit@netscape.net"
>
Note the empty line before <bugzilla.
Comment 6•18 years ago
|
||
this may be a dup of bug 366672
![]() |
||
Comment 7•18 years ago
|
||
Per discussion with ghendricks on IRC, we agree that this issue is not present in 2.22.2. So marking this bug as WFM. If the issue is real in 2.22.1, then you will have to upgrade to 2.22.2 when it's released.
Status: UNCONFIRMED → RESOLVED
Closed: 18 years ago
Resolution: --- → WORKSFORME
Comment 8•18 years ago
|
||
I am not able to reproduce it with any xml exported directly from Bugzilla, but all those files have a newline break either right after the first line or before the <bugzilla tag.
Your example above does not have any newlines at all and I _AM_ able to reproduce your problem with that file on every version since 2.22.1.
Since we state that we can accept any xml that conforms to the bugzilla DTD (and this appears to do so) we should account for this.
importxml.pl only cares if it sees a <bugzilla> tag. This means a newline after the first <?xml line would still work (which is what Bugzilla exports). I am not familiar enough with MIME::Parser to know if there is a way to have it ignore xml. That would be more ideal.
Status: RESOLVED → UNCONFIRMED
Resolution: WORKSFORME → ---
Assignee | ||
Comment 9•18 years ago
|
||
Hi!
Today I spend near two hours with this learning about MIME::Parser, MIME::Empty and why this happens.
After read bug 366672 I can confirm this is a dup.
The problem indeed is very simple, when we call MIME::Parser to parse it, no matter which parse_xxxx function you use, as soon everything is read, it looks for an empty line to split headers from body, and this is not a bug, EVERY mime compliant message must have a blank line between header and body. It does not even care about what's left on the header side or in the body side, it doesn't matter if it's xml, txt, a zip file, whatever, it only looks for an empty line, wherever it is.
So, if a bug is exported from bugzilla, this can't be reproduced because as Frédéric said, bugzilla exports leaving an empty line between DOCTYPE and <bugzilla>. So MIME::Parser is happy and everything continue as expected.
But if a bug is made with whatever software you use to export to xml your information, allowing or not to give the dtd to the software to automatically export it, as far I can see on bugzilla.dtd you will have an xml file without empty lines. Look, I'm not a guru reading dtd files but it seems that to me.
So I'm agree with Greg that we should account for this.
If you want I can change the patch regexp to add the newline not before <?xml but I think that doesn't matter.
Comment 10•18 years ago
|
||
Comment on attachment 252626 [details] [diff] [review]
Patch 1.1 for 2.22-BRANCH
This will suffice. Thanks for looking into it.
Can you provide a patch for the tip as well?
Attachment #252626 -
Flags: review?(LpSolit) → review+
Updated•18 years ago
|
Flags: approval2.22?
Updated•18 years ago
|
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Assignee | ||
Comment 11•18 years ago
|
||
Yeah I can, I will do it asap.
By the way I was checking the patch I submitted for 2.22-BRANCH and I saw a mistake in indent, inside conditional I left only two spaces instead of four.
Should I submit a new one for 2.22-BRANCH with correct indents inside the conditional or whoever commit the code can change it?
Comment 12•18 years ago
|
||
That can be fixed on checkin.
Assignee | ||
Comment 13•18 years ago
|
||
Ok regarding indent of patch to 2.22
Here it is for the tip.
Attachment #252783 -
Flags: review?(ghendricks)
Comment 14•18 years ago
|
||
Since I don't see it mentioned here explicitly, and several people seem to have been confused by it... importxml.pl as written currently is intended to receive an *email* on standard input. The email contains the xml document in the message body. This means the first thing it does when it processes the input is to strip the email header off. What's the easiest way to strip the email headers? Discard everything before the first blank line. This is why it works if there's a blank line.
So basically, if we're trying to make it so you can paste a raw XML file in and not have to have it in an email, then the correct way to fix this is to do a basic sanity check to see if there's really an email header there or not before trying to strip it, and if there's no email headers, just skip all the MIME::Parser stuff entirely. There's also currently a regexp after the MIME::Parser stuff that strips everything before the <?xml, with a comment stating that it's stripping email headers. I think this is a leftover from before the MIME::Parser stuff was added, and shouldn't be needed at all; that or the comment is wrong.
We're still missing a patch for the tip here, and I'd prefer they both go in at once.
Flags: approval2.22?
Comment 15•18 years ago
|
||
(In reply to comment #14)
> We're still missing a patch for the tip here, and I'd prefer they both go in at
> once.
Er, the patch for tip is here now, but not reviewed yet. :) I got midaired adding that comment.
Updated•18 years ago
|
Attachment #252783 -
Flags: review?(ghendricks) → review+
Comment 16•18 years ago
|
||
Comment on attachment 252626 [details] [diff] [review]
Patch 1.1 for 2.22-BRANCH
>+if ($xml =~ /\A.*<\?xml/) {
What's \A do in a regexp? I can't find it listed in perldoc perlre.
Comment 17•18 years ago
|
||
(In reply to comment #14)
> Since I don't see it mentioned here explicitly, and several people seem to have
> been confused by it... importxml.pl as written currently is intended to
> receive an *email* on standard input.
Actually, as part of the rewrite that I did, I intended for it to take either an email or a generated xml file from the command line. This is stated in the POD.
> The email contains the xml document in
> the message body. This means the first thing it does when it processes the
> input is to strip the email header off. What's the easiest way to strip the
> email headers? Discard everything before the first blank line. This is why it
> works if there's a blank line.
>
> So basically, if we're trying to make it so you can paste a raw XML file in and
> not have to have it in an email, then the correct way to fix this is to do a
> basic sanity check to see if there's really an email header there or not before
> trying to strip it, and if there's no email headers, just skip all the
> MIME::Parser stuff entirely. There's also currently a regexp after the
> MIME::Parser stuff that strips everything before the <?xml, with a comment
> stating that it's stripping email headers. I think this is a leftover from
> before the MIME::Parser stuff was added, and shouldn't be needed at all; that
> or the comment is wrong.
>
So I think the point of having the MIME::Parser stuff was to decode the body in case it was encoded somehow. I am not familiar with how this works (I avoid email as a matter of principle ;-) ) and I believe it was added last minute by mkanat. Perhaps he can shed some light on the situation.
So what you are saying is, we need a valid check to see if the input is an email and if not, skip the email processing entirely? I agree. I just don't know how I can be 100% sure that it is an email.
> We're still missing a patch for the tip here, and I'd prefer they both go in at
> once.
>
Done. :)
Flags: approval?
Flags: approval2.22?
Comment 18•18 years ago
|
||
(In reply to comment #16)
> What's \A do in a regexp? I can't find it listed in perldoc perlre.
Ah, I found it. Have to scroll down quite a ways from the rest of the character definitions. Looks like it's equivalent to how ^ behaves in /s mode without having to invoke /s.
Comment 19•18 years ago
|
||
(In reply to comment #17)
> So what you are saying is, we need a valid check to see if the input is an
> email and if not, skip the email processing entirely? I agree. I just don't
> know how I can be 100% sure that it is an email.
Well, if the first line starts with <?xml then it's probably not an email. ;)
The fix that's here works (albeit by ensuring that the data looks like an email by the time MIME::Parser sees it), so no reason not to put it in.
Flags: approval?
Flags: approval2.22?
Flags: approval2.22+
Flags: approval+
Assignee | ||
Comment 20•18 years ago
|
||
Ok, I can submit new patches to detect if it's a mail or not.
Are we sure if first line isn't <?xml then it's a mail?
Could not another software export xml with something before <?xml like software sign, comments, whatever?. Or that's not possible?
Tell me then how you want the patch, checking if first line have <?xml or checking if there is a mail header before <?xml in $xml ?
In case of second option I would do this regexp:
if ( $xml =~ /Message-ID(.*)<\?xml/si ) {
its a mail
} else {
its not a mail;
}
justdave?
Assignee | ||
Comment 21•18 years ago
|
||
By the way, I've tested value of $xml before
> $xml =~ s/^.+(\?xml version.+)$/$1/s;
and after it with all supported inputs and input was the same.
So in the patch I'm going to submit I will remove those lines too.
As soon someone points me out which of both suggested regexp to do to define if a mail is comming I will submit the patch for 2.22-BRANCH and the tip.
Flags: approval?
Flags: approval2.22?
Flags: approval2.22+
Flags: approval+
Assignee | ||
Comment 22•18 years ago
|
||
I'm sorry but I'm not familiar yet with review system.
In the emails I saw that in my last comment together with it I removed both approvals. In this case I know it's right because patch will be made again, but It was not my intention to do it.
So, it was a browser or a mistake by me with the tab button or is something that happens automatically?
![]() |
||
Updated•18 years ago
|
Assignee: import-export → gbn
Status: ASSIGNED → NEW
Target Milestone: --- → Bugzilla 2.22
Assignee | ||
Comment 23•18 years ago
|
||
Ok, because of talk with ghendricks on IRC the way will be that if there's anything non white space before xml, it's a mail. Else MIME::Parser code will be avoided.
Also I'm removing the regexp after MIME code which was old and useless.
Status: NEW → ASSIGNED
Assignee | ||
Comment 24•18 years ago
|
||
Attachment #252783 -
Attachment is obsolete: true
Assignee | ||
Comment 25•18 years ago
|
||
Attachment #252626 -
Attachment is obsolete: true
Assignee | ||
Updated•18 years ago
|
Attachment #253860 -
Flags: review?(ghendricks)
Assignee | ||
Updated•18 years ago
|
Attachment #253859 -
Flags: review?(ghendricks)
![]() |
||
Comment 26•18 years ago
|
||
Comment on attachment 253859 [details] [diff] [review]
Patch 1.1 for the tip.
>-# remove everything in file before xml header (i.e. remove the mail header)
>-$xml =~ s/^.+(<\?xml version.+)$/$1/s;
Why removing these lines?
Assignee | ||
Comment 27•18 years ago
|
||
(In reply to comment #26)
> (From update of attachment 253859 [details] [diff] [review])
> >-# remove everything in file before xml header (i.e. remove the mail header)
> >-$xml =~ s/^.+(<\?xml version.+)$/$1/s;
>
> Why removing these lines?
>
From justdave comment #14:
> There's also currently a regexp after the MIME::Parser stuff
> that strips everything before the <?xml, with a comment stating
> that it's stripping email headers. I think this is a leftover
> from before the MIME::Parser stuff was added, and shouldn't be
> needed at all; that or the comment is wrong.
And then from comment #21:
> By the way, I've tested value of $xml before
> > $xml =~ s/^.+(\?xml version.+)$/$1/s;
> and after it with all supported inputs and input was the same.
I wrote it wrong the last line, sorry:
and after doing it with all supported inputs, output of the regexp was the same.
As justdave said that regexp should be removed whenever MIME::Parser code was checked in. After MIME::Parser runs there are no headers in $xml.
And if MIME::Parser do not run thanks to this patch, it means there is already nothing before <?xml.
Comment 28•18 years ago
|
||
Also, if you want to switch it to using Email::MIME instead of MIME::Parser, that would be good, since importxml.pl is the last thing using MIME::Parser. Although I don't know if we should really do that when we're frozen. (It's up to ghendricks.)
![]() |
||
Updated•18 years ago
|
Flags: approval?
Flags: approval2.22?
Assignee | ||
Comment 29•18 years ago
|
||
Ok, I can do it if ghendricks want it.
Else we can submit a new bug for it and do it for 3.2 or 3.0.1
Comment 30•18 years ago
|
||
I can't really comment as to which is better, Email::Mime or MIME::Parser.
I think, being frozen, we should consider that change separately and just get this in for now.
Comment 31•18 years ago
|
||
Comment on attachment 253859 [details] [diff] [review]
Patch 1.1 for the tip.
>-# remove everything in file before xml header (i.e. remove the mail header)
>-$xml =~ s/^.+(<\?xml version.+)$/$1/s;
>+}
>
This is still necessary since the parser is expecting the <?xml as the first character it sees. If you have whitespace above it, it will fail.
Attachment #253859 -
Flags: review?(ghendricks) → review-
Assignee | ||
Comment 32•18 years ago
|
||
Ok I can remove it, but what a strange parser we have, I see it complains if it have whitespaces but do not complain if there is no <?xml at all.
Before this bug was found, Bugzilla exports by mail were leaving $xml starting with <?xml. But this bug was never found before because bugzilla exports by hand were also working. Why they were working? Because there was a blank line between <!DOCTYPE ...> and <bugzilla ...>.
So before this bug was found, every one who used to copy xml output from xml.cgi or an export e-mail, and paste it to importxml, the content of $xml that was given to the parser started with <bugzilla ...> as MIME::Parser was taking that empty line as the split line between headers and body.
I saw this when testing about this bug.
I'll submit the patch again without removing those lines.
Assignee | ||
Comment 33•18 years ago
|
||
Attachment #253859 -
Attachment is obsolete: true
Attachment #254048 -
Flags: review?(ghendricks)
Assignee | ||
Comment 34•18 years ago
|
||
Attachment #253860 -
Attachment is obsolete: true
Attachment #254049 -
Flags: review?(ghendricks)
Attachment #253860 -
Flags: review?(ghendricks)
Updated•18 years ago
|
Attachment #254048 -
Flags: review?(ghendricks) → review+
Updated•18 years ago
|
Attachment #254049 -
Flags: review?(ghendricks) → review+
Comment 35•18 years ago
|
||
What it is really looking for is a < as the first character. It doesn't really care what comes after that as long as it is xml.
Flags: approval?
Flags: approval2.22?
![]() |
||
Updated•18 years ago
|
Flags: approval?
Flags: approval2.22?
Flags: approval2.22+
Flags: approval+
Comment 36•18 years ago
|
||
Checking in importxml.pl;
/cvsroot/mozilla/webtools/bugzilla/importxml.pl,v <-- importxml.pl
new revision: 1.73; previous revision: 1.72
done
Checking in importxml.pl;
/cvsroot/mozilla/webtools/bugzilla/importxml.pl,v <-- importxml.pl
new revision: 1.47.2.8; previous revision: 1.47.2.7
done
Status: ASSIGNED → RESOLVED
Closed: 18 years ago → 18 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•