Closed Bug 179264 Opened 22 years ago Closed 22 years ago

csv output for the buglist (see url inside) is invalid (too many "s)

Categories

(Bugzilla :: Query/Bug List, defect, P2)

2.17.1
defect

Tracking

()

RESOLVED FIXED
Bugzilla 2.18

People

(Reporter: mcsmurf, Assigned: bugreport)

References

()

Details

Attachments

(2 files, 2 obsolete files)

it doesn't follow the rules for "s, when you store the file as CSV; it makes in 
HTML
178933 nor -- PC justdave@netscape.com NEW create user error includes "Content-
type: text/html"

and in CSV it gets 178933,normal,--,PC,justdave@netscape.com,NEW,,create user 
error includes ""Content-type: text/html",

You see, in HTML it is 2 "s and in CSV it is 3 "s. But thats wrong because the  
http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm lists the rules as:

#A field that contains double quote characters must be surounded by double- 
#quotes, and the embedded double-quotes must each be represented by a pair of  
#consecutive double quotes.

the other rule the csv output probably breaks is:

# Fields with embedded commas must be delimited with double-quote characters.
Summary: csv output for the buglist (see url inside) is invalid (too many "s and ) → csv output for the buglist (see url inside) is invalid (too many "s)
CSV output for the buglist also fails to properly quote or escape "," in the bug
summary.

The same fix should surround the entire field with double-quotes and escape any
double-quotes that are within it.
yep, have made a seperate bug on that, Bug 179265 (sure releated to this)
Attached patch Patch (obsolete) — Splinter Review
Changes CSV filter to quote any string that has " or , in it and use a
double-quote to escape any double-quotes that appear in the string.
Blocks: 179265
Taking assignment
Assignee: endico → bugreport
OS: Windows 2000 → All
Priority: -- → P2
Hardware: PC → All
Target Milestone: --- → Bugzilla 2.18
Version: unspecified → 2.17.1
Attachment #105718 - Flags: review?
Comment on attachment 105718 [details] [diff] [review]
Patch

r=gerv; but please use cvs diff -u, because it makes patches much, much easier
to read :-)

Gerv
Attachment #105718 - Flags: review? → review+
*** Bug 179265 has been marked as a duplicate of this bug. ***

I'm going to update this.....  from IRC...

In general, you can escape a string for display in MS Excel CSV format
by using this algorithm
if string contains separator-character or double-quote or line-break
 replace each double quote in string by 2 double quotes
 replace each line-break in string by one LF character

 wrap string in duoble quotes
else
 leave string as is
Comment on attachment 105718 [details] [diff] [review]
Patch

needs-work.

This needs to surround the value with double-quotes if the value contains
linebreaks, too.
Attachment #105718 - Flags: review-
Still no good example of a line-break, but aggressively quoting everything.
Attachment #105718 - Attachment is obsolete: true
Attachment #105734 - Flags: review?
Comment on attachment 105734 [details] [diff] [review]
patch - quotes everything

tested on landfill, with bugs in the list "loaded" with the stuff that broke it
before...  Excel likes it. :-)

r=justdave
a=justdave
Attachment #105734 - Flags: review? → review+
Blocks: 179176

Checking in globals.pl;                                     
/cvsroot/mozilla/webtools/bugzilla/globals.pl,v  <--  globals.pl
new revision: 1.215; previous revision: 1.214
done                                                            
Status: NEW → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
I don't think we should be quoting values unless they contain quotes or commas
(or linebreaks; but would that ever happen?). The current solution does work,
admittedly, but may cause some spreadsheets to assume our numbers are strings,
when they aren't.

This could be fixed merely by reinstating the check, rather than quoting
unconditionally.

Gerv
I haven't found any documents or reasonably current examples where quoting a
number causes it to be handled differently.

Perhaps a subsequent patch should identify pure-numeric values and skip quoting
those, while leaving anything other than ^[0-9\.]*$ quoted.
Is there any reason why we can't use the original (albeit buggily implemented)
algorithm, that if it contains a quote or a comma, then quote it, otherwise don't?

Gerv
We could use the corrected original, that will actually quote if there are
quotes or commas in it AND escape ALL double-quotes.  However, there was an
insistance on handline newlines, etc...

The original would fail with commas as well as anything with more than one
double-quote.

> The original would fail with commas as well as anything with more than one
> double-quote.

Yes, hence "buggily implemented." 

> We could use the corrected original, that will actually quote if there are
> quotes or commas in it AND escape ALL double-quotes.  However, there was an
> insistance on handline newlines, etc...

Then we can add newlines to " and , in the regexp. Not hard. Anyway, as far as I
can see, no values we'd be displaying in buglists have embedded newlines.

Gerv
Attached patch Patch v.1 (obsolete) — Splinter Review
Trivial patch to change the quoting to happen only when necessary.

Gerv
Comment on attachment 106291 [details] [diff] [review]
Patch v.1

Joel - could you please sign off on this?

Gerv
Attachment #106291 - Flags: review?(bugreport)
Comment on attachment 106291 [details] [diff] [review]
Patch v.1


This needs to handle leading or trailing whitespace differently.  The way this
would work right now, it would be inconsistent.

If I had 
"This is a bug ending in whitespace    " in a string, the whitespace would be
trimmed.
If I has "However, this one gets preserved    " the whitespace would be
presserved.

This should be consistent.

The consensus on the IRC channel is that the whitespace should never be trimmed
rather than always trimmed.
Attachment #106291 - Flags: review?(bugreport) → review-
Er, joel... which part of this function does any whitespace trimming at all?

Gerv
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I should have been clearer on the comment....

If a string has leading or trailing whitespace, it is lost in CSV format unless
the string is quoted.  If we quote everything, then the whitespaces is always
preserved.  If we quote only strings with certain characters, then strings that
contain those characters will preserve whitespace while strings not conatining
those characters will have the whitespace lost.

There was a fair amount of discussion about this on IRC and the consensus was
that the leading/trailing whitespace should be preserved.  This implies that, if
we are not quoting everything, that we still quote any string with
leading/trailing whitespace.
> If a string has leading or trailing whitespace, it is lost in CSV format unless
> the string is quoted.

Where is that written? :-) According to a book on good programming style I read,
which had a CSV parser as an example of a fuzzy problem, there is no central
definition of the format, and implementatations vary.

If we don't quote a CSV value with leading/trailing whitespace, and whatever
application reads it decides to discard that whitespace, it's their problem. IMO :-)

Gerv
The reference in the initial comment states that.  It is true, though, that
there is no such thing as an authoritative reference.  I still think we should
either trim all leadin/trailing whitespace or quote anything that has it. 
Inconsistenacy stinks.

Why is it inconsistent to say:
"Any value with a quote or comma gets quoted, any value without one doesn't"? 

Even if whitespace is discarded by a given consumer of CSV, who is to say that
they won't do that step _after_ the remove the quotes and unescape the string?
Then both forms would have their whitespace discarded.

http://ostermiller.org/utils/CSVParser.java.html (Java CSV parser):
"* If field includes a comma or a new line, the whole field must be surrounded
with double quotes."
That's the only documentation I can find on the web, but it fits with my book
and my understanding.

On a side note, I note from this page:
http://ostermiller.org/utils/CSVLexer.html
that we are using the Excel standard for escaping CSV, rather than the standard
standard. What should we do about that?

Gerv
Well, if I ask OpenOffice to generate a CSV, it quotes any non-numeric, non-null
field.  Personally, I think that is the reasonable thing to do.  


"No space here","Trailing space on this     ","   Leading space here","More
text","Hmm... This, that, theother",,12345.67

OK, I'll buy that.

if ($x !~ /^\d*$/) {
   $x = "\"$x\"";
}

?

Gerv
probably ...

$x !~ /^-?\d+\.?\d*$/;

So negative numbers and non-integers don't get quoted.
Attached patch Patch v.2Splinter Review
Here's v.2. A slight change from Joel's excellent suggestion, to make sure it
still matches the empty string.

Gerv
Attachment #106291 - Attachment is obsolete: true
Comment on attachment 106470 [details] [diff] [review]
Patch v.2

r=joel
Attachment #106470 - Flags: review+
Dave - can I get an a= for this?

Gerv
Status: REOPENED → RESOLVED
Closed: 22 years ago22 years ago
Resolution: --- → FIXED
Oops.

Gerv
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
a=justdave
Really fixed this time.

Checking in globals.pl;
/cvsroot/mozilla/webtools/bugzilla/globals.pl,v  <--  globals.pl
new revision: 1.216; previous revision: 1.215
done

Gerv
Status: REOPENED → RESOLVED
Closed: 22 years ago22 years ago
Resolution: --- → FIXED
QA Contact: matty_is_a_geek → default-qa
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: