Open Bug 334657 Opened 18 years ago Updated 2 years ago

Make ICS provider more tolerate of charset (eg UTF8) decoding errors

Categories

(Calendar :: General, enhancement)

enhancement

Tracking

(Not tracked)

People

(Reporter: dbo, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(4 files, 2 obsolete files)

I recently got a google calendar account. Subscribing to my calendar's private URL via ICS provider (remote->webdav, readonly), I figured out that the returned content is not/improperly encoded, e.g. ...CN=Daniel Bölzle.
One can argue that this bug is at google's, but I would suggest making the ICS provider being more error tolerant. Actually, this is very easy using the unichar-stream-loader which replaces improperly encoded sequences with a '?'; and IMO it is much more straight-forward to use it for loading unicode data...
Severity: normal → enhancement
Status: NEW → ASSIGNED
Bug 315672?
Attached patch using unichar-stream-loader (obsolete) — — Splinter Review
Assignee: nobody → daniel.boelzle
Attachment #219000 - Flags: first-review?(mvl)
(In reply to comment #1)
> Bug 315672?

Whoa, fast reply!

IMO different issue, because it relates to import code, but in essence similar problem.
Having a look at import-export/calIcsImportExport.js, there are other fixes missing, too, e.g. exception-handling as recently fixed in ics provider (issue 327890).
OT: 
Would it be possible to unify ICS import and ICS provider code in some way? Both have to read and parse ICS data from a stream, handle encoding errors, etc.
I can't find the documentation that nsIUnicharStreamLoader converts unknowns data to question marks. So relying on that behaviour doesn't sounds like a good idea.
The unichar-stream-loader uses the converter-input-stream service and sets its default replacement character. We can do it the same way, if we don't want to rely on the current implementation of unichar-stream-loader.
The other option is to try to get the behaviour documented in the .idl, and thus specified.
I like the idea of getting this behavior documented in the IDL and thus made part of the contract.  Probably the thing to do is file another bug on that, and make this one depend on it.  I'd be surprised if the owner(s) weren't amenable to this idea.
How about extending the API allowing the user to define the replacement character? nsIUnicharStreamLoaderObserver may fit, because it is called to determine the charset.
Anyway, documentation is always good, +1.
> Created an attachment (id=219000)

-    onStreamComplete: function(loader, ctxt, status, resultLength, result)
+    onDetermineCharset: function( loader, context, firstSegment, length )
+    {
+        // ics files are always utf8
+        return "UTF-8";
+    },

If we receive an encoding in the Content Type header other than UTF-8, we should respect that, regardless of rfc2445's decree that .ics == UTF-8



         try {
-            str = unicodeConverter.convertFromByteArray(result, result.length);
+            var str_ = {};
+            while (unicharData.readString( -1, str_ )) {
+                str += str_.value;
+            }
         } catch(e) {
             this.mObserver.onError(calIErrors.CAL_UTF8_DECODING_FAILED, e.toString());
             this.unlock();
             return;
         }

Given the large number of problems we've been having with non-ASCII characters, I think we should do something other than die if we're handed a non- UTF-8 .ics file. Firefox has the ability to try and guess a file's encoding, so it isn't code we'd have to write from scratch.

See http://www.mozilla.org/projects/intl/ChardetInterface.htm

(In reply to comment #10)
> If we receive an encoding in the Content Type header other than UTF-8, we
> should respect that, regardless of rfc2445's decree that .ics == UTF-8

+1

>          try {
> -            str = unicodeConverter.convertFromByteArray(result,
> result.length);
> +            var str_ = {};
> +            while (unicharData.readString( -1, str_ )) {
> +                str += str_.value;
> +            }
>          } catch(e) {
>              this.mObserver.onError(calIErrors.CAL_UTF8_DECODING_FAILED,
> e.toString());
>              this.unlock();
>              return;
>          }
> 
> Given the large number of problems we've been having with non-ASCII characters,
> I think we should do something other than die if we're handed a non- UTF-8 .ics
> file. Firefox has the ability to try and guess a file's encoding, so it isn't
> code we'd have to write from scratch.

The (undocumented) trick with using the unichar stream loader is that it does _not_ die (like the unicode converter): it replaces those chars, currently with a question mark. I agree that a sophisticated encoding detection would do it even better, and I hope that you/somebody has already worked it out. But for the short term, the unichar stream loader fits exactly: It reads unicode (thus makes the code clearer) and BTW tolerates UTF-8 errors.
Attached patch using unichar-stream-loader (obsolete) — — Splinter Review
+ using the detected charset from channel now

Michiel, can you please push this and review the patch? (I finally want to see my google ics subscription in Lightning...) IMO there is no more to discuss here: this patch just improves the current code (i.e. using the unichar stream loader just fits better, no need to parse unicode, less code).
Documenting the current behaviour in nsIUnicharStreamLoader can be shifted to another bug.
Attachment #219000 - Attachment is obsolete: true
Attachment #230115 - Flags: first-review?(mvl)
Attachment #219000 - Flags: first-review?(mvl)
Using the channel's charset is a good thing.

But thinking about this more, ignoring utf8 decoding errors makes me worry about dataloss. For a language like german, this is not a huge issue. But what about chinese, russian or anything that isn't using the latin characters? You will get only questionmarks, and when editing an event, you will replace all the data in your ics file with questionmarks. That's a lot of dataloss i'm not happy about.

I need to think more about a better solution...
(In reply to comment #13)
> Using the channel's charset is a good thing.
> 
> But thinking about this more, ignoring utf8 decoding errors makes me worry
> about dataloss. For a language like german, this is not a huge issue. But what
> about chinese, russian or anything that isn't using the latin characters? You
> will get only questionmarks, and when editing an event, you will replace all
> the data in your ics file with questionmarks. That's a lot of dataloss i'm not
> happy about.

I totally agree and see three options:
- User can configure strict/weak error handling for UTF-8 decoding (needs UI)
- make calendar read-only if UTF-8 decoding errors occurred (a bit hacky)
- Google finally fixes that bug :) so we can stick to the current code (i.e. strict error handling)... Any Google employee reading this?
Would only the use-channel-charset part of the patch help with google calendar (I don't have a gcal account, so can't test) Does it send the right charset header?
(In reply to comment #15)
> Would only the use-channel-charset part of the patch help with google calendar
> (I don't have a gcal account, so can't test) Does it send the right charset
> header?

gcal doesn't send any header at the moment. This is the problem. Often the data is in a non-UTF-8 encoding, but we aren't even told what charset it is.

Regardless of RFC2445, I feel we should:
a) silently handle non-UTF-8 files if we're also provided a charset
b) use the charset detector to guess the encoding if it's not UTF-8 and we have no header. This probably should include some UI telling the user that we've got a malformed/goofy file, and here's our best guess. Here's some example data using our guessed charset. [Use the guess] [Select ISO-8859-1 instead  v] [Cancel altogether]
*** Bug 356397 has been marked as a duplicate of this bug. ***
*** Bug 360906 has been marked as a duplicate of this bug. ***
Comment on attachment 230115 [details] [diff] [review]
using unichar-stream-loader

getting rid of this patch; BTW: google seems to have fixed encoding.
Attachment #230115 - Attachment is obsolete: true
Attachment #230115 - Flags: first-review?(mvl)
Is this still an issue or should we just stay with strict error handling?
In comment #13, mvl stated concerns about dataloss if ignoring eoncoding errors.
Assignee: daniel.boelzle → nobody
Status: ASSIGNED → NEW
just a thought, can the converting to utf8 be done by the OS? So if you encounter a non-utf charset in a calendar, write the calendar to the filesystem in utf8 and read the data from the temp-file?
Keywords: helpwanted
I'm not satisfied with the state of this bug at all.

IMO decoding a non-UTF-8 .ics calendar in a lossy manner is still much better than not showing anything at all and give the user a very technical error message and no option for him to fix this on his end (if the server is not under his control).

I would propose the following:
- we de-bitrot Daniel's patch from 2006
- we also add a dialog (yeah, I know, those suck) where we tell the user, that 
  opening this calendar will alter the file by replacing special characters 
  with question marks and give him the option to abort
- Optionally we could also implement a calendar preview, that would show the 
  user in a short unifinder-like list how the first 3-5 events look. That 
  might be especially useful for international users (e.g. from China or 
  Japan), where the text in the .ics file would most likely be completely 
  overwritten with question marks.

Anyone up for the challenge?
Flags: wanted-calendar1.0+
I wonder if there are other affordances than a dialog that would be less painful here.  In any case, I agree that it's important that we do better here, since it's likely that the user may have little or no immediate influence over the server.  I ran into this problem at meetup.com over the weekend.
Summary: Make ICS provider more error tolerant → Make ICS provider more tolerate of charset (eg UTF8) decoding errors
In case there's been a decoding problem, we may replace those sequences (like the patch is doing), but enforce the calendar to be read-only so the data stays as is. Additionally, we should signal the problem (e.g. activity manager, IMO less painful).
It seems that there's a patch for this issue that's being ignored because converting non-UTF-8 chars is presumed data loss...  (what 

well, what's the big deal if it's a read-only calendar?  lose a few characters or lose the whole @*(&^@$#*&*@ calendar?  I wholly agree with comment #26...log the problem, but let the conversions slide for read-only calendars.  

The utter inability of lightning to load calendars from facebook.com and meetup.com is pretty damning...
(In reply to comment #28)
> well, what's the big deal if it's a read-only calendar?  lose a few characters
> or lose the whole @*(&^@$#*&*@ calendar?  I wholly agree with comment #26...log
> the problem, but let the conversions slide for read-only calendars.

That seems like sound reasoning to me.

> The utter inability of lightning to load calendars from facebook.com and
> meetup.com is pretty damning...

Interestingly, I see the problem you're describing with meetup.com, but not with facebook.com, which I've never had a problem with.
An .ics file with \n line endings (instead of \r\n) also produces UTF8_DECODING_FAILED. I'm having this problem with a custom calendar from berlinale.de . When I download the file and fix the line endings in an editor, it gets imported fine.

While RFC 2445 §4.1 demands \r\n, Calendar could be more input tolerant here. Webapps often get the line endings wrong.

The error message is also a bit misleading, because the file is valid UTF8. I couldn't figure what the problem was until I used the online validator at http://severinghaus.org/projects/icv/ .
bug 354951 contains some useful information on how part of this bug could possibly be fixed, using the universal charset detector.
Attached two .ics files that could serve as a good test case. attachment 618393 [details] is a simple .ics file encoded in ANSI that contains a German umlaut (ü). Lightning 1.4 outright ignores this file when imported. It's not added to the calendar, there's no error dialog whatsoever, nothing in then calendar debug log, and nothing in the calendar verbose debug log. After fiddling around with the .ics and stumbling upon this bug report I tried two things that allowed me to import the file:
- convert the file from ANSI to UTF-8
- replace 'ü' in the ANSI version with 'ue'

Lightning should at least tell me (or give me a hint) why the import failed
Some problem here. That's very annoying cause I cannot open ICS files containing my train connection details mailed by Deutsche Bahn AG. :(
If you are desperate, the following did the trick for me (with calendar data in French): in file calICSCalendar.js, where it says:

        // ics files are always utf8
        unicodeConverter.charset = "UTF-8";

change "UTF-8" for "ISO-8859-1". Both my google agenda and this other calendar I needed to access show up properly. I know, its bad, but remote/institutional calendar providers don't always listen to us. 

(In my OSX 10.8.4/TB 17.0.8/Lightning 1.9.1 configuration, the file is located here:
/Users/<username>/Library/Thunderbird/Profiles/<profile>/extensions/{...}/components/calICSCalendar.js)
This is a iCal file that fails to import with no error message on Lightning 2.6.2
Attachment #827971 - Flags: review+
This is part of a more general issue with the ics import.

At the moment the ics import drops anything that does not work with the library.  It should show that something went wrong and ask the user if they want to fix or use the information it was able to import to create a new event/task

My suggestion is 

Message invalid ICS - please check the results

Drop the lines that do not parse

Add the lines that parse into a new event, with the full ics text in the notes filed of the event so the person importing and work out what has gone wrong.
Attachment #827971 - Attachment mime type: text/calendar → text/plain
Attachment #827971 - Flags: review+

I got hit by this issue recently. Is there a chance that it will get fixed at least for the readonly calendars?

I'll try to attach my problematic ICS, but I don't see the attachment button right now...

Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: