Feed parser should return either an RFC822 date or null from dateParse(), not "Invalid Date" or a vaguely-RFC822 thing

RESOLVED FIXED in mozilla11

Status

()

defect
RESOLVED FIXED
8 years ago
8 years ago

People

(Reporter: philor, Assigned: philor)

Tracking

Trunk
mozilla11
Points:
---
Dependency tree / graph
Bug Flags:
in-testsuite +

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment, 2 obsolete attachments)

Two related problems I noticed during bug 618714:

* if you do dateParse("1234ThisIsn'tADate") then you get back "Invalid Date" rather than the null you should get

* if you do dateParse("Tuesda, 25 Apri 2006 08:00:00 GMT") then you get back "Tuesda, 25 Apri 2006 08:00:00 GMT" which is not an RFC822 date, but can be parsed by JS, so it doesn't matter too much that you aren't getting back "Tue, 25 Apr 2006 08:00:00 GMT"; if you do dateParse("Tuesmonkey, 25 Aprmonkey 2006 08:00:00 GMT") you get back "Tuesmonkey, 25 Aprmonkey 2006 08:00:00 GMT" which cannot be parsed by JS, which is what the IDLs for nsIFeedContainer and nsIFeedEntry promise it will be.

The former case is simple - with things that look like an ISO8601 date by starting with four digits, we have JS parse it as a date, so we just need to check that what we got back for a Date object actually has a value before we toUTCString() it.

The latter case, where we allow some things which aren't RFC822 but are parsable by JS past our regex, just calls for the slightly ugly step of going string->Date->.value()?toUTCString():null.
Posted patch Fix v.1 (obsolete) — Splinter Review
Most of the test changes are just changing to expecting a corrected date in GMT, but my favorite is the last one, where the test was previously testing to be sure that we didn't know that 2 Sep 2003 was actually on a Tuesday rather than a Saturday.
Attachment #551781 - Flags: review?(mak77)
Comment on attachment 551781 [details] [diff] [review]
Fix v.1

Review of attachment 551781 [details] [diff] [review]:
-----------------------------------------------------------------

The approach seems fine, something may be improved, I think.
Could you avoid hg copying test files to create new ones? Reviewing these copies is confusing, since it appears as you changed the original test (mostly a bugzilla's fault), I also don't think the original blame is misleading here, since those original bugs and authors didn't obviously create these test files.

::: toolkit/components/feeds/FeedProcessor.js
@@ +911,5 @@
> +  if (isValidRFC822Date(date)) {
> +    let maybeDate = new Date(date);
> +    if (!isNaN(maybeDate.valueOf()))
> +      return maybeDate.toUTCString();
> +  }

The 2 code branches here are now doing exactly the same thing, so we can most likely refactor this in a more compact and maintenable way.
If we can come up with a clear enough comment, wouldn't be bad to just merge the 2 ifs, otherwise you may create a local helper function like ensureValidDate(aDate) that returns toUTCString or null, and keep the ifs separated.

Looking at existing code, isValidRFC822Date is lying based on these changes, sine it may still return true and then be not parsable. I'd probably rename it to matchesRFC822Date and create also a matchesISO8601Date for the other regexp test.
Then you may just:
if (matchesISO || matchesRFC) {
  // Make sure we return a meaningful date trying to convert the potential valid value.
  ...
}
return null;

what do you think about this?

::: toolkit/components/feeds/test/xml/rss2/feed_pubDate.xml
@@ +1,4 @@
>  <?xml version="1.0" encoding="iso-8859-1"?>
>  <!--
>  
> +Description: non-RFC822 date that matches our regex but can't be parsed produces null .updated

what's that " .updated" suffix?
Attachment #551781 - Flags: review?(mak77) → feedback+
I meant "I also think the original blame is misleading"
Pushed a dos2unix in http://hg.mozilla.org/integration/mozilla-inbound/rev/a273f980f5a1 which should help make future patches where I touch existing tests more clear.

Right now, I'm considering a radical refactoring:

function dateParse(dateString) {
  var date = new Date(dateString.trim());
  isNaN(date.value()) ? return null : return date.toUTCString();
}

There are three useful things we could do by looking at dates before handing them to JS to parse:

* correct errors that JS refuses to have a fallback for

* implement formats JS refuses to implement

* use our out-of-band knowledge to resolve ambiguous formats

The only errors I can think of that anyone has cared enough about to file and that are reasonable enough to have a fallback for are bug 394113/bug 394119, a missing colon in an ISO8601 timezone, and bug 512307, support for the broken single-letter military timezone abbreviations in RFC822 (all of which are the same as UTC according to RFC2822 because 822 got the sign wrong and made all of them other than Z ambiguous). The former is so easy to fix in JS that I took bug 682754; the latter probably only seems more difficult because jsdate.cpp was written by people who must have had to put a dollar in a jar every time they wrote a comment.

I don't have any persuasive examples of formats we want to implement that JS does not. The Universal Feed Parser implements formats like ISO8601's "-03-12-31" and "július-13T9:15-05:00" and MSSQL's timezone-free "2004-07-08 23:56:58.0" and whatnot, but I think that's more an obscure political statement I never quite understood than an actual example to follow.

The only example I have for resolving ambiguous formats is the timestamp thing that the toolkit parser incorrectly copied from the mailnews parser, where http://mxr.mozilla.org/comm-central/source/mailnews/extensions/newsblog/content/utils.js#400 figured that while <pubDate>2010</pubDate> could mean the timestamp 2010 seconds after the start of the epoch, or could mean the year 2004, that it probably is only a timestamp if it's within 3 years of the present.

In any case, if at some time in the future someone decides to implement error correction, or a new datetime format, or massaging of ambiguous things, they can do so by just making that

function dateParse(aDateString) {
  var possibleDateString = dateString.trim();
  possibleDateString = fixMissing8601T(possibleDateString);
  possibleDateString = convertMSSQLDateToReasonableFormat(possibleDateString);
  date = new Date(possibleDateString);
  isNaN(date.value()) ? return null : return date.toUTCString();
}

with minimal rerefactoring, and just have the massaging functions return a properly-formatted string if it's something they recognize, or the original string if it isn't. But I bet nobody ever will.
Well, there is one more thing we are currently doing by looking at dates before we let JS parse them: we're throwing out bathwater like "30/8/10" which probably means Aug 30, 2010 but JS parses as Sat, 08 Jun 1912 07:00:00 GMT because that's the 30th month of 1910, and at the same time throwing out baby like "8/30/2010 10:00:00 UTC" which JS could parse perfectly.
(In reply to Phil Ringnalda (:philor) from comment #4)
> I don't have any persuasive examples of formats we want to implement that JS
> does not. The Universal Feed Parser implements formats like ISO8601's
> "-03-12-31" and "július-13T9:15-05:00" and MSSQL's timezone-free "2004-07-08
> 23:56:58.0" and whatnot, but I think that's more an obscure political
> statement I never quite understood than an actual example to follow.

I agree, we should not care about all exotic stuff, we are not a fully equipped feed reader and we don't want to be, we are just a simple helper that tries to do something useful with feeds.

> figured that while <pubDate>2010</pubDate> could mean
> the timestamp 2010 seconds after the start of the epoch, or could mean the
> year 2004, that it probably is only a timestamp if it's within 3 years of
> the present.

right, I doubt anyone pushed a rss feed entry in 1970, not me for sure.
Posted patch fix v.2 (obsolete) — Splinter Review
I've long since passed the point of staring at this patch thinking "no, I don't know of anything else I'm not considering that would make this a bad idea" and gone to staring at it thinking "now, what was the point of this again?" Which, to recap, is to not return things which can't actually be parsed by JS/mailnews just because they match our regex, and to not return "Invalid Date" when we should instead be returning null.
Attachment #551781 - Attachment is obsolete: true
Attachment #570514 - Flags: review?(mak77)
Comment on attachment 570514 [details] [diff] [review]
fix v.2

Review of attachment 570514 [details] [diff] [review]:
-----------------------------------------------------------------

So, what's the status regarding bug 512307? From what I understand there, those dates just don't work even before these changes? Or should we special case that by converting Z to UTC to keep that working?

::: toolkit/components/feeds/FeedProcessor.js
@@ +868,2 @@
>  /**
> + * Tries parsing a string with 'new Date()'.

nit: "through the javascript Date object"?

@@ +870,3 @@
>   * @param dateString
>   *        A string that is supposedly an RFC822 or RFC3339 date.
> + * @returns A Date.toUTCString or null

@return (without the s)
please specify when null is returned

@@ +873,3 @@
>   */
>  function dateParse(dateString) {
> +  var date = new Date(dateString.trim());

nit: while here, use let

@@ +873,4 @@
>   */
>  function dateParse(dateString) {
> +  var date = new Date(dateString.trim());
> +  if (!isNaN(date.valueOf()))

is there a specific case valueOf is handling here? Since I suspect JS is already doing the conversion implicitly (through it) due to the NaN comparison.
Since it's possible someone in future may add an else to handle exotic cases, I'd probably suggest to brace this if, so future changes can preserve blame.

::: toolkit/components/feeds/test/xml/rss2/feed_pubDate_nonRFC822_2.xml
@@ +6,5 @@
>  
>  -->
>  <rss version="2.0" >
>  <channel>
>  <pubDate>Saturday 25 November 2006 3:12:45 +1000</pubDate>

checking whether the parser corrects a wrong day seems out of the scope of this test, so you should probably fix the day here, and maybe add a separate test for that, if you care.
Posted patch fix v.3Splinter Review
The status of bug 512307 is that it has always been that way (I don't actually know about Netscape 2.0, but always in that in the code Netscape open sourced, Z was not parsed), and bug 682781 is unlikely to actually be fixed, so I stuck in a little more verbosity, doing the trim() on one line and the new Date() on another, so that when I give up and take bug 512307, I can stick a call to dateRescueTimezoneZ() in between the two.

The origin of using valueOf() is lost in an IRC log (that I don't have) from last June or July, but I don't think there was a persuasive reason for it. As likely as anything, I misphrased my question and got that as an answer when I didn't need it.

I can't say I really know *what* feed_pubDate_nonRFC822_2.xml actually intended to test (a reasonable guess would be "all of non-abbreviated weekdays, not having a comma after the weekday, non-abbreviated months, single-digit hours without zero-padding, and + timezones"), but it isn't now testing that we correct a wrong day, it just happens to have a datetime that's in a different day locally than it is in UTC. That's not especially exciting to test, so I just shifted the hour to one that's in the same day, losing the test for single-digit hours which isn't our problem to test anyway.
Attachment #570514 - Attachment is obsolete: true
Attachment #570514 - Flags: review?(mak77)
Attachment #572346 - Flags: review?(mak77)
Attachment #572346 - Flags: review?(mak77) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/c3f33adc6080
Flags: in-testsuite+
Target Milestone: --- → mozilla11
https://hg.mozilla.org/mozilla-central/rev/c3f33adc6080
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.