Closed Bug 618714 Opened 9 years ago Closed 9 years ago

Switch feed parser to using Date object's support for ISO8601 dates


(Toolkit :: General, defect, minor)

Not set





(Reporter: philor, Assigned: philor)




(1 file, 2 obsolete files)

Attached patch fix (obsolete) — Splinter Review
The test change is because I was making sure we did have tests for ISO8601 -> RFC2822, and the wrong description jumped out at me, but then it had DOS line endings, so I wound up having to touch the whole thing.
Attachment #497167 - Flags: review?(sayrer)
Assignee: nobody → philringnalda
Whiteboard: [needs review]
Attached patch unrotted (obsolete) — Splinter Review
That was an unexpected form of bitrot, where the file disappeared out from under me. Good thing it just went to .., so I could find it again.

The me of six months ago seems to have been saying that he looked at the existing tests and found that we were already testing parsing of these dates, so let's just assume he was right.
Attachment #497167 - Attachment is obsolete: true
Attachment #542065 - Flags: review?(dtownsend)
Attachment #497167 - Flags: review?(sayrer)
Date can parse more than ISO8601 so I wonder if there is a problem with allowing that. If not I wonder if ISO8601DateUtils should just use Date rather than its longer approach.
Attachment #542065 - Flags: review?(dtownsend) → review?(mak77)
The question of what ISO8601DateUtils should do should be moot: the purpose of this bug and bug 618726 is to remove all two in-tree uses of it, so bug 543535 can remove it entirely.

The question of what to worry about with invalid date formats is a rathole, and you should run away right now, far and fast.

There are only two valid sorts of datetimes in feed formats that we support which have an implementable spec (which lets out RSS 1.0, which didn't actually read W3CDTF, so it didn't realize that you can't just say "use W3CDTF" so nobody knows what datetime the valid RSS 1.0 date "2011" is because it didn't define which granularities were valid and it didn't define the meaning of missing parts in less granular dates): RFC3339 (a subset of ISO8601 which requires a full date and a full time), and RFC822+4-digit-years. If something is an RFC3339 datetime, it starts with 4 digits, we'll pass it to Date(), and Date() will parse it correctly. If something is an RFC822+4-digit-years datetime, it does not start with 4 digits, it will match our regex, and we'll pass it on through.

Other things? There's an infinite variety, and nothing which is The Right Thing to be done with them. ISO8601DateUtils, which is just Thunderbird's feed reader's date parsing code, which is just Myk's Forumzilla, chooses to believe that something which starts with four digits and which does not include a "-" should have a couple inserted and be treated as an ISO8601 datetime. So (despite the way we have deadcode which pretends it's there to deal with timestamps) if someone sticks a current Unix timestamp in a feed date element, it will split it up into 1309, 21, 96, 91, 0, 0, 0, and create a date for Thu, 08 Jan 1311 03:00:00 GMT. If it was a number, Date() would decide that it's Fri, 16 Jan 1970 03:40:19 GMT, except that we're passing it a string so it decides it's "Invalid Date." None of those three is wrong, because nothing has ever said that any feed date element containing a string of ten digits was a Unix timestamp.

Using this patch will cause us to switch from treating "20110627T00:00:00Z" (without the dashes that ISO8601 requires between year, month, and day) as "Mon, 27 Jun 2011 00:00:00 GMT" to treating it as "Invalid Date," but despite the comment in ISO8601DateUtils claiming that it had a "Workaround for server sending dates such as: 20030530T11:18:50-08:00" it did not, it had a workaround for server sending dates such as: 20030530T11:18:50+08:00. Currently, we treat the latter as "Fri, 30 May 2003 03:18:50 GMT" and the former as "Invalid Date", and with this patch we treat both as "Invalid Date" which I find considerably more comprehensible behavior.

If you want to have defined and *tested* handling of invalid feed date element content (note that there is no way I should have discovered the "20030530T11:18:50+08:00 is a datetime, 20030530T11:18:50-08:00 is not" thing by reading the code just now, it should have been discovered by tests which uncovered it and then tested the fix for that bug and failed when I tried switching to using Date()), and you want to devote engineering resources to it, I'll be delighted, in my most evil way, to file a followup bug and write test cases for it until your engineer's eyes and ears bleed. That's something I did in a former life, plus I can fall back on the testcases for Mark Pilgrim's Universal Feed Parser, which is quite proud of the way that it can parse Greek dates like Κυρ, 11 Ιούλ 2004 12:00:00 EST and Hungarian dates like július-13T9:15-05:00 and Korean dates like 2004-05-25 오 11:23:17, along with the bits of ISO8601 that neither ISO8601DateUtils nor Date() touched, like the yyo form where 03335 is the 335th day of 2003. (Or the 335th day of some year, I've never bought a copy of the non-free ISO8601 to read about the handling of two-digit years in the forms which allow them.)

Did I mention that handling of date elements in feeds was a rathole?
Adds in removing the unreachable timestamp code: the only thing that would actually hit it (by not matching the "starts with 4 digits" regex) and succeed is <dateUpdated>lol timestamp in aRSS!! 1309240947</dateUpdate>, which... isn't actually worth supporting.
Attachment #542065 - Attachment is obsolete: true
Attachment #542382 - Flags: review?(mak77)
Attachment #542065 - Flags: review?(mak77)
I lost my mind reading comment 3, you know too much about this, we'll have to do something about that ;)
I did try doing something about it a few years back, but it turns out that after lobotomies got a bad name, they stopped trying to improve their ability to localize them.
Comment on attachment 542382 [details] [diff] [review]
kill the timestamp cruft, too

Review of attachment 542382 [details] [diff] [review]:

Let me say, after reading its code, I'm absolutely happy we get rid of that module.
It uses regexes in some fancy way, doesn't use them when it may.
Plus, the "-8:00" thing you found, plus I'm really scared by those internal calculations on timezones.

The only thing that the old code _seems_ to handle better is timestamps expressed in "seconds from the epoch", but as you pointed out the timestamp code was unreacheable, so we are actually not losing anything interesting. The broken case of missing "-" doesn't seem that interesting as well, we don't deserve blame for out-of-spec feeds, by not showing anything wrong we are just delaying authors fixing their code.

Also notice that your example "lol timestamp in aRSS!! 1309240947" would not be parsed by the old code, since parseInt would return NaN, it may parse "1309240947 bacon" though, that is still uninteresting.
What may be interesting to do instead, is workarounding a possible case where a timestamp, or even a valid rss date format, may have trailing spaces, does the xml parser strip them for us? Otherwise I'd suggest doing immediately a dateString = dateString.trim();

I assume you ran the 200 tests in the feedparser, and they passed.

::: toolkit/components/feeds/FeedProcessor.js
@@ +895,5 @@
>   */
>  function dateParse(dateString) {
>    var date = dateString.trim();
>    if (^\d\d\d\d/) != -1) //Could be a ISO8601/W3C date

Care to bring this in the 20th century?

let's if(/^\d{4}/.test(date))

Move the comment before the if, and add a space after //

Could make sense to change comment to something like "Could be a ISO8601/W3C date or a timestamp. We officially support only per-spec datetimes, but if Date() can make sense of the value, it's fine for this use-case."
Attachment #542382 - Flags: review?(mak77) → review+
For my future self, when I get back from vacation and wonder why I didn't land this yet:

We're fine on whitespace, correctly doing a trim() as the first step in dateParse(), but we don't have a test for it like we should.

Unfortunately, looking at all the date tests points out that we sort of lie: and say that the output of dateParse() is "in RFC822 form. Parsable by JS and mail code" which is two or more likely three different things, none of which are necessarily true.

We actually have tests like which verify that we will produce an .updated which is not RFC822 from input that's vaguely but not quite RFC822, and since both the existing code and my patch will produce "Invalid Date" from things that start with digits but are not parsable by JS, you really have to stretch the definition of parsable to say that the string "Invalid Date" is.

Probably it's enough to null out the "Invalid Date" cases, without bothering to actually run the things that match our RFC822+ regex through Date().toUTCString() to make them actually RFC822, and actually doing that string->Date->string cycle when any reasonable user of the API will be doing it again would be excessive. Probably.
Whiteboard: [needs review] → [needs updated patch, moar tests]
Pre-vacation-me should have promised less. I can't usefully test for dateString.trim(), because somebody upstream is already trimming it, and I can't say what I'm testing until I figure out who it is. Shoving in dealing with "Invalid Date" (which we already produce, it's not a regression) here would just be blame confusing, so I'll do that and the RFC822 thing (which does matter, because right now in the process of allowing for "Tues, 25 April 2006 08:00:00 GMT" we will also claim that "Tuemonkey, 25 Aprmonkey 2006 08:00:00 GMT" is an RFC822/Parsable by JS date) in followup bugs.
Closed: 9 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Whiteboard: [needs updated patch, moar tests]
Target Milestone: --- → mozilla8
Depends on: 677135
You need to log in before you can comment on or make changes to this bug.