368071 - importxml fails to import if there is no newline before <?xml

Assignee

Description

•

18 years ago

User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.8) Gecko/20061113 Debian/1.7.8-1sarge8 Build Identifier: 2.22.1 Hi, From MIME::Parser help: > Warning: it is assumed that, once the files are cat'ed together, > there will be a blank line separating the head part and the body part. So if we try to import a file which starts with <?xml when importxml.pl throw the file to MIME::Parser, MIME::Parser silently returns nothing when we ask for the parsed string. Reproducible: Always Steps to Reproduce: 1. Create a xml file to import starting on first line with <?xml 2. Try to import it. 3. Actual Results: XML:: complains because there are no tag. This is because there are no text at all in $xml as MIME::Parser returned an empty string. Expected Results: Parse it as usual.

Guzmán Brasó

Assignee

Comment 1

•

18 years ago

This affects 2.22.1 and the tip. I'll submit patch for both, in a minute for 2.22.1 and in a couple of hours when I get to my home for 2.23.1.

Version: unspecified → 2.22.1

Guzmán Brasó

Assignee

Comment 2

•

18 years ago

Attached patch Patch 1.1 for 2.22-BRANCH (obsolete) — Details — Splinter Review

Regexp to check if $xml start with xml.

Attachment #252626 - Flags: review?(LpSolit)

Frédéric Buclin

Comment 3

•

18 years ago

No such problem with version 5.420 of MIME::Parser: perl -MMIME::Parser -we 'print $MIME::Parser::VERSION;' 5.420

Guzmán Brasó

Assignee

Comment 4

•

18 years ago

Attached file Xml file to test the bug. — Details

I'm using 5.420 too where I tryed the problem. (!) Are you sure you tryed to import a file which started on first line with the xml file without any new line before "<?xml .... ?>" ? The xml file attached is what I used for testing, proovided by Nestor Vigo, the one who found the bug.

Frédéric Buclin

Comment 5

•

18 years ago

Comment on attachment 252643 [details] Xml file to test the bug. ><?xml version="1.0" standalone="yes" ?> ><!DOCTYPE bugzilla SYSTEM "http://arspi/bugzilla/bugzilla.dtd"> ><bugzilla version="2.22" urlbase="http://arspi/bugzilla/" >maintainer="Nestor.Vigo@pearson.com" exporter="Nestor.Vigo@pearson.com"> This file is not the one generated by Bugzilla! Mines have: <?xml version="1.0" encoding="UTF-8" standalone="yes" ?> <!DOCTYPE bugzilla SYSTEM "http://localhost/bugzilla222/bugzilla.dtd"> <bugzilla version="2.22.1+" urlbase="http://localhost/bugzilla222/" maintainer="LpSolit@netscape.net" exporter="LpSolit@netscape.net" > Note the empty line before <bugzilla.

Greg Hendricks

Comment 6

•

18 years ago

this may be a dup of bug 366672

Frédéric Buclin

Comment 7

•

18 years ago

Per discussion with ghendricks on IRC, we agree that this issue is not present in 2.22.2. So marking this bug as WFM. If the issue is real in 2.22.1, then you will have to upgrade to 2.22.2 when it's released.

Status: UNCONFIRMED → RESOLVED

Closed: 18 years ago

Resolution: --- → WORKSFORME

Greg Hendricks

Comment 8

•

18 years ago

I am not able to reproduce it with any xml exported directly from Bugzilla, but all those files have a newline break either right after the first line or before the <bugzilla tag. Your example above does not have any newlines at all and I _AM_ able to reproduce your problem with that file on every version since 2.22.1. Since we state that we can accept any xml that conforms to the bugzilla DTD (and this appears to do so) we should account for this. importxml.pl only cares if it sees a <bugzilla> tag. This means a newline after the first <?xml line would still work (which is what Bugzilla exports). I am not familiar enough with MIME::Parser to know if there is a way to have it ignore xml. That would be more ideal.

Status: RESOLVED → UNCONFIRMED

Resolution: WORKSFORME → ---

Guzmán Brasó

Assignee

Comment 9

•

18 years ago

Hi! Today I spend near two hours with this learning about MIME::Parser, MIME::Empty and why this happens. After read bug 366672 I can confirm this is a dup. The problem indeed is very simple, when we call MIME::Parser to parse it, no matter which parse_xxxx function you use, as soon everything is read, it looks for an empty line to split headers from body, and this is not a bug, EVERY mime compliant message must have a blank line between header and body. It does not even care about what's left on the header side or in the body side, it doesn't matter if it's xml, txt, a zip file, whatever, it only looks for an empty line, wherever it is. So, if a bug is exported from bugzilla, this can't be reproduced because as Frédéric said, bugzilla exports leaving an empty line between DOCTYPE and <bugzilla>. So MIME::Parser is happy and everything continue as expected. But if a bug is made with whatever software you use to export to xml your information, allowing or not to give the dtd to the software to automatically export it, as far I can see on bugzilla.dtd you will have an xml file without empty lines. Look, I'm not a guru reading dtd files but it seems that to me. So I'm agree with Greg that we should account for this. If you want I can change the patch regexp to add the newline not before <?xml but I think that doesn't matter.

Greg Hendricks

Comment 10

•

18 years ago

Comment on attachment 252626 [details] [diff] [review] Patch 1.1 for 2.22-BRANCH This will suffice. Thanks for looking into it. Can you provide a patch for the tip as well?

Attachment #252626 - Flags: review?(LpSolit) → review+

Greg Hendricks

Updated

•

18 years ago

Flags: approval2.22?

Greg Hendricks

Updated

•

18 years ago

Status: UNCONFIRMED → ASSIGNED

Ever confirmed: true

Guzmán Brasó

Assignee

Comment 11

•

18 years ago

Yeah I can, I will do it asap. By the way I was checking the patch I submitted for 2.22-BRANCH and I saw a mistake in indent, inside conditional I left only two spaces instead of four. Should I submit a new one for 2.22-BRANCH with correct indents inside the conditional or whoever commit the code can change it?

Greg Hendricks

Comment 12

•

18 years ago

That can be fixed on checkin.

Guzmán Brasó

Assignee

Comment 13

•

18 years ago

Attached patch Patch 1.0 for 2.23+ (obsolete) — Details — Splinter Review

Ok regarding indent of patch to 2.22 Here it is for the tip.

Attachment #252783 - Flags: review?(ghendricks)

Dave Miller [:justdave]

Comment 14

•

18 years ago

Since I don't see it mentioned here explicitly, and several people seem to have been confused by it... importxml.pl as written currently is intended to receive an *email* on standard input. The email contains the xml document in the message body. This means the first thing it does when it processes the input is to strip the email header off. What's the easiest way to strip the email headers? Discard everything before the first blank line. This is why it works if there's a blank line. So basically, if we're trying to make it so you can paste a raw XML file in and not have to have it in an email, then the correct way to fix this is to do a basic sanity check to see if there's really an email header there or not before trying to strip it, and if there's no email headers, just skip all the MIME::Parser stuff entirely. There's also currently a regexp after the MIME::Parser stuff that strips everything before the <?xml, with a comment stating that it's stripping email headers. I think this is a leftover from before the MIME::Parser stuff was added, and shouldn't be needed at all; that or the comment is wrong. We're still missing a patch for the tip here, and I'd prefer they both go in at once.

Flags: approval2.22?

Dave Miller [:justdave]

Comment 15

•

18 years ago

(In reply to comment #14) > We're still missing a patch for the tip here, and I'd prefer they both go in at > once. Er, the patch for tip is here now, but not reviewed yet. :) I got midaired adding that comment.

Greg Hendricks

Updated

•

18 years ago

Attachment #252783 - Flags: review?(ghendricks) → review+

Dave Miller [:justdave]

Comment 16

•

18 years ago

Comment on attachment 252626 [details] [diff] [review] Patch 1.1 for 2.22-BRANCH >+if ($xml =~ /\A.*<\?xml/) { What's \A do in a regexp? I can't find it listed in perldoc perlre.

Greg Hendricks

Comment 17

•

18 years ago

(In reply to comment #14) > Since I don't see it mentioned here explicitly, and several people seem to have > been confused by it... importxml.pl as written currently is intended to > receive an *email* on standard input. Actually, as part of the rewrite that I did, I intended for it to take either an email or a generated xml file from the command line. This is stated in the POD. > The email contains the xml document in > the message body. This means the first thing it does when it processes the > input is to strip the email header off. What's the easiest way to strip the > email headers? Discard everything before the first blank line. This is why it > works if there's a blank line. > > So basically, if we're trying to make it so you can paste a raw XML file in and > not have to have it in an email, then the correct way to fix this is to do a > basic sanity check to see if there's really an email header there or not before > trying to strip it, and if there's no email headers, just skip all the > MIME::Parser stuff entirely. There's also currently a regexp after the > MIME::Parser stuff that strips everything before the <?xml, with a comment > stating that it's stripping email headers. I think this is a leftover from > before the MIME::Parser stuff was added, and shouldn't be needed at all; that > or the comment is wrong. > So I think the point of having the MIME::Parser stuff was to decode the body in case it was encoded somehow. I am not familiar with how this works (I avoid email as a matter of principle ;-) ) and I believe it was added last minute by mkanat. Perhaps he can shed some light on the situation. So what you are saying is, we need a valid check to see if the input is an email and if not, skip the email processing entirely? I agree. I just don't know how I can be 100% sure that it is an email. > We're still missing a patch for the tip here, and I'd prefer they both go in at > once. > Done. :)

Flags: approval?

Flags: approval2.22?

Dave Miller [:justdave]

Comment 18

•

18 years ago

(In reply to comment #16) > What's \A do in a regexp? I can't find it listed in perldoc perlre. Ah, I found it. Have to scroll down quite a ways from the rest of the character definitions. Looks like it's equivalent to how ^ behaves in /s mode without having to invoke /s.

Dave Miller [:justdave]

Comment 19

•

18 years ago

(In reply to comment #17) > So what you are saying is, we need a valid check to see if the input is an > email and if not, skip the email processing entirely? I agree. I just don't > know how I can be 100% sure that it is an email. Well, if the first line starts with <?xml then it's probably not an email. ;) The fix that's here works (albeit by ensuring that the data looks like an email by the time MIME::Parser sees it), so no reason not to put it in.

Flags: approval?

Flags: approval2.22?

Flags: approval2.22+

Flags: approval+

Guzmán Brasó

Assignee

Comment 20

•

18 years ago

Ok, I can submit new patches to detect if it's a mail or not. Are we sure if first line isn't <?xml then it's a mail? Could not another software export xml with something before <?xml like software sign, comments, whatever?. Or that's not possible? Tell me then how you want the patch, checking if first line have <?xml or checking if there is a mail header before <?xml in $xml ? In case of second option I would do this regexp: if ( $xml =~ /Message-ID(.*)<\?xml/si ) { its a mail } else { its not a mail; } justdave?

Guzmán Brasó

Assignee

Comment 21

•

18 years ago

By the way, I've tested value of $xml before > $xml =~ s/^.+(\?xml version.+)$/$1/s; and after it with all supported inputs and input was the same. So in the patch I'm going to submit I will remove those lines too. As soon someone points me out which of both suggested regexp to do to define if a mail is comming I will submit the patch for 2.22-BRANCH and the tip.

Flags: approval?

Flags: approval2.22?

Flags: approval2.22+

Flags: approval+

Guzmán Brasó

Assignee

Comment 22

•

18 years ago

I'm sorry but I'm not familiar yet with review system. In the emails I saw that in my last comment together with it I removed both approvals. In this case I know it's right because patch will be made again, but It was not my intention to do it. So, it was a browser or a mistake by me with the tab button or is something that happens automatically?

Frédéric Buclin

Updated

•

18 years ago

Assignee: import-export → gbn

Status: ASSIGNED → NEW

Target Milestone: --- → Bugzilla 2.22

Guzmán Brasó

Assignee

Comment 23

•

18 years ago

Ok, because of talk with ghendricks on IRC the way will be that if there's anything non white space before xml, it's a mail. Else MIME::Parser code will be avoided. Also I'm removing the regexp after MIME code which was old and useless.

Status: NEW → ASSIGNED

Guzmán Brasó

Assignee

Comment 24

•

18 years ago

Attached patch Patch 1.1 for the tip. (obsolete) — Details — Splinter Review

Attachment #252783 - Attachment is obsolete: true

Guzmán Brasó

Assignee

Comment 25

•

18 years ago

Attached patch Patch 1.2 for 2.22-BRANCH (obsolete) — Details — Splinter Review

Attachment #252626 - Attachment is obsolete: true

Guzmán Brasó

Assignee

Updated

•

18 years ago

Attachment #253860 - Flags: review?(ghendricks)

Guzmán Brasó

Assignee

Updated

•

18 years ago

Attachment #253859 - Flags: review?(ghendricks)

Frédéric Buclin

Comment 26

•

18 years ago

Comment on attachment 253859 [details] [diff] [review] Patch 1.1 for the tip. >-# remove everything in file before xml header (i.e. remove the mail header) >-$xml =~ s/^.+(<\?xml version.+)$/$1/s; Why removing these lines?

Guzmán Brasó

Assignee

Comment 27

•

18 years ago

(In reply to comment #26) > (From update of attachment 253859 [details] [diff] [review]) > >-# remove everything in file before xml header (i.e. remove the mail header) > >-$xml =~ s/^.+(<\?xml version.+)$/$1/s; > > Why removing these lines? > From justdave comment #14: > There's also currently a regexp after the MIME::Parser stuff > that strips everything before the <?xml, with a comment stating > that it's stripping email headers. I think this is a leftover > from before the MIME::Parser stuff was added, and shouldn't be > needed at all; that or the comment is wrong. And then from comment #21: > By the way, I've tested value of $xml before > > $xml =~ s/^.+(\?xml version.+)$/$1/s; > and after it with all supported inputs and input was the same. I wrote it wrong the last line, sorry: and after doing it with all supported inputs, output of the regexp was the same. As justdave said that regexp should be removed whenever MIME::Parser code was checked in. After MIME::Parser runs there are no headers in $xml. And if MIME::Parser do not run thanks to this patch, it means there is already nothing before <?xml.

Max Kanat-Alexander

Comment 28

•

18 years ago

Also, if you want to switch it to using Email::MIME instead of MIME::Parser, that would be good, since importxml.pl is the last thing using MIME::Parser. Although I don't know if we should really do that when we're frozen. (It's up to ghendricks.)

Frédéric Buclin

Updated

•

18 years ago

Flags: approval?

Flags: approval2.22?

Guzmán Brasó

Assignee

Comment 29

•

18 years ago

Ok, I can do it if ghendricks want it. Else we can submit a new bug for it and do it for 3.2 or 3.0.1

Greg Hendricks

Comment 30

•

18 years ago

I can't really comment as to which is better, Email::Mime or MIME::Parser. I think, being frozen, we should consider that change separately and just get this in for now.

Greg Hendricks

Comment 31

•

18 years ago

Comment on attachment 253859 [details] [diff] [review] Patch 1.1 for the tip. >-# remove everything in file before xml header (i.e. remove the mail header) >-$xml =~ s/^.+(<\?xml version.+)$/$1/s; >+} > This is still necessary since the parser is expecting the <?xml as the first character it sees. If you have whitespace above it, it will fail.

Attachment #253859 - Flags: review?(ghendricks) → review-

Guzmán Brasó

Assignee

Comment 32

•

18 years ago

Ok I can remove it, but what a strange parser we have, I see it complains if it have whitespaces but do not complain if there is no <?xml at all. Before this bug was found, Bugzilla exports by mail were leaving $xml starting with <?xml. But this bug was never found before because bugzilla exports by hand were also working. Why they were working? Because there was a blank line between <!DOCTYPE ...> and <bugzilla ...>. So before this bug was found, every one who used to copy xml output from xml.cgi or an export e-mail, and paste it to importxml, the content of $xml that was given to the parser started with <bugzilla ...> as MIME::Parser was taking that empty line as the split line between headers and body. I saw this when testing about this bug. I'll submit the patch again without removing those lines.

Guzmán Brasó

Assignee

Comment 33

•

18 years ago

Attached patch Patch 1.2 for the tip — Details — Splinter Review

Attachment #253859 - Attachment is obsolete: true

Attachment #254048 - Flags: review?(ghendricks)

Guzmán Brasó

Assignee

Comment 34

•

18 years ago

Attached patch Patch 1.3 for 2.22-BRANCH — Details — Splinter Review

Attachment #253860 - Attachment is obsolete: true

Attachment #254049 - Flags: review?(ghendricks)

Attachment #253860 - Flags: review?(ghendricks)

Greg Hendricks

Updated

•

18 years ago

Attachment #254048 - Flags: review?(ghendricks) → review+

Greg Hendricks

Updated

•

18 years ago

Attachment #254049 - Flags: review?(ghendricks) → review+

Greg Hendricks

Comment 35

•

18 years ago

What it is really looking for is a < as the first character. It doesn't really care what comes after that as long as it is xml.

Flags: approval?

Flags: approval2.22?

Frédéric Buclin

Updated

•

18 years ago

Flags: approval?

Flags: approval2.22?

Flags: approval2.22+

Flags: approval+

Greg Hendricks

Comment 36

•

18 years ago

Checking in importxml.pl; /cvsroot/mozilla/webtools/bugzilla/importxml.pl,v <-- importxml.pl new revision: 1.73; previous revision: 1.72 done Checking in importxml.pl; /cvsroot/mozilla/webtools/bugzilla/importxml.pl,v <-- importxml.pl new revision: 1.47.2.8; previous revision: 1.47.2.7 done

Status: ASSIGNED → RESOLVED

Closed: 18 years ago → 18 years ago

Resolution: --- → FIXED

Patch 1.1 for 2.22-BRANCH 18 years ago Guzmán Brasó 535 bytes, patch	gregaryh : review+	Details \| Diff \| Splinter Review
Xml file to test the bug. 18 years ago Guzmán Brasó 1.41 KB, text/plain		Details
Patch 1.0 for 2.23+ 18 years ago Guzmán Brasó 540 bytes, patch	gregaryh : review+	Details \| Diff \| Splinter Review
Patch 1.1 for the tip. 18 years ago Guzmán Brasó 1.21 KB, patch	gregaryh : review-	Details \| Diff \| Splinter Review
Patch 1.2 for 2.22-BRANCH 18 years ago Guzmán Brasó 1.22 KB, patch		Details \| Diff \| Splinter Review
Patch 1.2 for the tip 18 years ago Guzmán Brasó 1.14 KB, patch	gregaryh : review+	Details \| Diff \| Splinter Review
Patch 1.3 for 2.22-BRANCH 18 years ago Guzmán Brasó 1.14 KB, patch	gregaryh : review+	Details \| Diff \| Splinter Review