Closed Bug 451737 Opened 14 years ago Closed 13 years ago

RSS messages should not be removed from cache immediately after disappearing from feed

Categories

(MailNews Core :: Feed Reader, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
Thunderbird 3.0b3

People

(Reporter: tuukka.tolvanen, Assigned: 31415-bz)

References

()

Details

Attachments

(1 file, 2 obsolete files)

specific case spun off bug 258465

--------8<--------

bug 258465 Comment #100 Rimas Kudelis  2006-09-25 05:31:07 PDT

Anyone want to try this feed for testing: http://www.vu.lt/RQ/testfeed.php ? It
consists of three items, and should generate new duplicates every five minutes
(so make sure you update often enough).

The catch here is: Every five minutes the feed changes its HTTP response code
from 200 (OK) to Error 503 (server failure). When simulating a failure, it
doesn't present the user with any entries at all, and this seems like enough to
trigger the duplication when the entries are back after five minutes.

You can see the colorized sourcecode of the file here:
http://www.vu.lt/RQ/testfeed.phps .

bug 258465 Comment #101 Dan Wing  2006-09-25 09:38:48 PDT

I am able to duplicate the problem using the test feed at
http://www.vu.lt/RQ/testfeed.php using Thunderbird 1.5.0.7.  I configured a new
"RSS News & Blogs" account that checks for new articles every 2 minutes and
contains only this test feed.

bug 258465 Comment #102 Magnus Melin  2006-09-25 11:29:22 PDT

I also get duplicate items for that feed (using Mozilla/5.0 (X11; U; Linux
i686; en-US; rv:1.8.1) Gecko/20060925 Thunderbird/2.0b1pre ID:2006092503)

-------->8--------

Also reproducible in 20080821225349 Thunderbird/3.0b1pre. The two hints that the feed fetch is bad are the error response and the feed content having 0 items; the latter may be a more generally useful hint that whatever index flushing happens shouldn't happen (if that's roughly how it goes, haven't looked at this part of the code).
Flags: wanted-thunderbird3?
marking wanted
Flags: wanted-thunderbird3? → wanted-thunderbird3+
Tuukka, thanks for reporting this as a separate bug! :)
Well, this is because of invalidateItems().  You see, how it works is this, step-by-step, as I understand:

1. Feed added to Thunderbird.

2. On first check:
   a. every item is parsed.
   b. one by one added to the datastore (with valid and stored arcs.)

3. On next and subsequent checks:
   a. "valid" is unasserted for each item in datastore.
   b. items are added back into the datastore (by setting valid.)
   c. duplicates are checked for by existance in datastore, with "stored" set.
   d. items that still don't have "valid" are removed.

So, what's happening (easier to reproduce with a feed that alternates between 503/empty and 200/not empty) is that it gets the empty list, clears all the items out:

http://mxr.mozilla.org/thunderbird/source/mail/extensions/newsblog/content/feed-parser.js#117

And then flushes it (3d above.)  Next time it checks, its duplicate cache is entirely empty, and so it assumes all items are new.

To fix this, I think the following might help:

1. AFAICT Feed seems to expect that a 503 will call onerror instead of onload, but I'm sure it doesn't check the status code.  A good place might be in onDownloaded or here:

http://mxr.mozilla.org/mozilla/source/mail/extensions/newsblog/content/Feed.js#357

2. The clearing I mentioned (invalidateItems call) happens even if the feed is entirely empty.  I assume this is done to not store things indefinitely, so not clearing it for an empty feed would help.

3. Feeds can very - they might come from load balanced servers, future items may be removed (bringing back older ones), etc.  It would probably solve *a lot* of duplication problems to simply do expiration by date (e.g. wait a day or a week.)  I'm not sure what's best there, though... doing it immediately seems clearly brittle to me.

I thought I'd make this comment in case it helps.  I'm not familiar with the code really at all, but if anyone can confirm my understanding a bit (top level) I'm happy to submit a patch (just have to figure out mercurial...)

I won't have time to test and write solution #3 above in the next week (I'm a busy guy like most) but #1 and #2 are very easy.

-[Unknown]
I still can't understand why (In reply to comment #3)
> Well, this is because of invalidateItems().  You see, how it works is this,
> step-by-step, as I understand:
> 
> 1. Feed added to Thunderbird.
> 
> 2. On first check:
>    a. every item is parsed.
>    b. one by one added to the datastore (with valid and stored arcs.)
> 
> 3. On next and subsequent checks:
>    a. "valid" is unasserted for each item in datastore.
>    b. items are added back into the datastore (by setting valid.)
>    c. duplicates are checked for by existance in datastore, with "stored" set.
>    d. items that still don't have "valid" are removed.

Unless I misunderstood you, I think you're a little wrong.

Considering the fact that for a feed of, say, 6 items, you can see more than a few hundred UNIQUE items (after being subscribed to it long enough), I don't think items are being removed from store. I think they stay there, in Thunderbird's cache/datastore, until you remove them manually.

And this raises another question which is - why would Thunderbird ignore items with identical GUID's when checking for dupes?
(In reply to comment #4)
> I still can't understand why (In reply to comment #3)

Sorry, I was perhaps unclear.  Thunderbird keeps two stores: a store of MESSAGES, and a CACHE of metadata.  I'm talking about the metadata store/cache.

In other words, it never looks at the actual messages to determine uniqueness, instead it looks at a completely separate (and much smaller) list of recently-downloaded meta data.  This meta data contains only the guid and two attributes (stored and valid), nothing else.

So, it clears the meta data (assuming it's no longer useful, e.g. just wasting memory) but does not clear the actual message data.

Does that make more sense?

To answer your question, it ignores items with identical guids because it forgot it has them - it's already removed them (the guids I'm talking about here, not the messages) from its memory/history.

-[Unknown]
Thanks, I understand now.

I wonder if it's reasonable to clear cache at all... But I guess it is, after not seeing an item in the feed for quite some time...
I couldn't take sorting through duplicate items anymore so I started investigating the cause, capturing the raw feed (RSS in my case) and reading the feed downloading code.  The attached patch is a quick & dirty hack that adds a timestamp to feed items and will remove them from the cache after the item has not been included in the feed for 1 day.  I'm not well versed in JavaScript or RDF (how the cache is stored) or the Thunderbird codebase in general.  Hopefully the inadequacies of the patch will inspire someone to do the right thing. :)

The patch appears to apply without rejections on both 2.0.0.21 and 3.0b3pre.  The patch doesn't assume a directory structure, so the correct directory will have to be found.  This is because the location is a bit of a moving target; 2.0.0.21 and 3.0b3pre changed locations of the code, and the deployed JavaScript is in an archive that uses a different structure.  Here are the locations:

2.0.0.21: mail/extensions/newsblog/content
3.0b3pre: mailnews/extensions/newsblog/content
within newsblog.jar (installed file): content/messenger-newsblog

The installed newsblog.jar on my system is located in /usr/share/thunderbird/chrome.

Hopefully this was helpful.
Comment on attachment 375511 [details] [diff] [review]
Patchs that expires feed items after not being included in the feed for one day.

Requesting review.
Attachment #375511 - Flags: review?(philringnalda)
Attachment #375511 - Flags: review?(philringnalda) → review?(myk)
Comment on attachment 375511 [details] [diff] [review]
Patchs that expires feed items after not being included in the feed for one day.

Requesting *better* review ;)
(In reply to comment #3)
> 1. AFAICT Feed seems to expect that a 503 will call onerror instead of onload,
> but I'm sure it doesn't check the status code.  A good place might be in
> onDownloaded or here:

That'd work for 503 errors, and we should probably do it, but it won't solve the problem for other cases of intermittent feed items that don't register an HTTP failure, which I've noticed in a variety of feeds (particularly Google News ones).


> 2. The clearing I mentioned (invalidateItems call) happens even if the feed is
> entirely empty.  I assume this is done to not store things indefinitely, so not
> clearing it for an empty feed would help.

Hmm, that seems problematic given that feeds can be legitimately empty.


> 3. Feeds can very - they might come from load balanced servers, future items
> may be removed (bringing back older ones), etc.  It would probably solve *a
> lot* of duplication problems to simply do expiration by date (e.g. wait a day
> or a week.)  I'm not sure what's best there, though... doing it immediately
> seems clearly brittle to me.

Indeed, it is brittle, and this is a reasonable solution.  I did this a while back with Forumzilla, expiring items from the cache after a week.


> I thought I'd make this comment in case it helps.  I'm not familiar with the
> code really at all, but if anyone can confirm my understanding a bit (top
> level) I'm happy to submit a patch (just have to figure out mercurial...)

Your understanding is perfect!



(In reply to comment #5)
> In other words, it never looks at the actual messages to determine uniqueness,

Indeed, and another solution to this problem might be to also check actual messages, although that has its own problems, since users can delete messages after reading them, and we don't want those messages to reappear.  Forumzilla used to do an additional check like this, but I'm not sure it was worth it.


(In reply to comment #6)
> I wonder if it's reasonable to clear cache at all... But I guess it is, after
> not seeing an item in the feed for quite some time...

It's important to clear the cache because the cache is RDF, and RDF doesn't scale very well (we have to read the entire cache into memory), so we need to keep the cache smallish.
Comment on attachment 375511 [details] [diff] [review]
Patchs that expires feed items after not being included in the feed for one day.

(In reply to comment #10)
> Indeed, it is brittle, and this is a reasonable solution.  I did this a while
> back with Forumzilla, expiring items from the cache after a week.

Erm, looking back on the code, it looks like Forumzilla actually expires them after one day.


(In reply to comment #7)
> Hopefully the inadequacies of the patch will inspire someone to do
> the right thing. :)

This looks quite good, actually!


> The patch appears to apply without rejections on both 2.0.0.21 and 3.0b3pre. 
> The patch doesn't assume a directory structure, so the correct directory will
> have to be found.

mailnews/extensions/newsblog/content is the ideal location to alter, as it is the location on the main development trunk.


>+const EXPIRATION_TIME = 86400000;

Nit: let's call this something more descriptive, like INVALID_ITEM_PURGE_DELAY, and include a comment explaining that 8640000 is one day.

Otherwise, this looks great! r=myk
Attachment #375511 - Flags: review?(myk) → review+
Good that it turned out fairly reasonable. :)

I've changed the name of EXPIRATION_TIME to the name suggested (INVALID_ITEM_PURGE_DELAY), added a comment that indicates how the value is used, and expanded the delay time into a calculation with values that should be a little more recognizable.

Also, the patch is now relative to mailnews/extensions/newsblog/content.

Hopefully I'm following the work flow semi-correctly; it's the first time I've gone through the process.
Attachment #375511 - Attachment is obsolete: true
Attachment #375758 - Flags: superreview?
Attachment #375758 - Flags: review+
Comment on attachment 375758 [details] [diff] [review]
Update to patch 375511 (expiring feed items from cache after one day)

> Hopefully I'm following the work flow semi-correctly; it's the first time I've
> gone through the process.

So far, so good!  Except that you should request super-review from a specific person (just as with review); I've added bienvenu to this super-review request.
Attachment #375758 - Flags: superreview? → superreview?(bienvenu)
Assignee: nobody → 31415-bz
Status: NEW → ASSIGNED
I think expiring after one day is a little too quick. May I suggest it to expire after a week?
Component: RSS → Feed Reader
Product: Thunderbird → MailNews Core
(In reply to comment #14)
> I think expiring after one day is a little too quick. May I suggest it to
> expire after a week?

It's certainly easy to adjust.  However, I'm not sure that extending it to a week will actually be much better.

The most problematic feeds for me are those with some volume (20 or more new items per day) from sites that probably have a large amount of traffic (e.g. reddit).  I suspect (though don't actually know for sure) that the duplicate entries created by these feeds result from a combination of load balancing and mirroring delays.  That is, I'm imagining it could take some time before a new item would populate to all places that service requests.  The ultimate result is that an item could appear in an update, disappear in the next, only to reappear again, all depending on the state of the servers that happened to be handling the requests.

For the above scenario, the expiration time only needs to be long enough to cover the gaps caused by load balancing/mirroring.  One day seemed like plenty of time to cover those gaps.  It also seemed long enough to smooth over occasional periods of 503 errors (assuming that a site is able respond at least once in a given 24 hour period).  It's quite possible that a shorter length of time would adequately cover these scenarios.

The reason to keep the time short is to keep the RDF cache small.  It'd be nice to have a scenario that indicates that a longer time is needed before making the cache 7(ish) times larger.

However, I'm no expert and this is just my two cents. :)
Attachment #375758 - Flags: superreview?(bienvenu) → superreview+
Comment on attachment 375758 [details] [diff] [review]
Update to patch 375511 (expiring feed items from cache after one day)

thx for doing this, looks good, except for a few style nits:

this line is exceptionally long - we try to keep them around 80 chars or so:

+      lastSeenTime = lastSeenTime ? parseInt(lastSeenTime.QueryInterface(Components.interfaces.nsIRDFLiteral).Value) : 0; 

lose the spaces after ( and before )

+      if ( (currentTime - lastSeenTime) < INVALID_ITEM_PURGE_DELAY )

no need for the braces here:

+    if (currentTimeStamp)
+    {
+	ds.Change(resource, FZ_LAST_SEEN_TIMESTAMP, currentTimeStamp, newTimeStamp);
+    }
+    else
+    {
+      ds.Assert(resource, FZ_LAST_SEEN_TIMESTAMP, newTimeStamp, true);
+    }
+
(In reply to comment #15)
> However, I'm no expert and this is just my two cents. :)

One day sounds like the right place to start, since it resolves the issues that we know folks are having.  We can always increase the delay further in the future in response to specific issues that folks continue to have.
(In reply to comment #15)
> For the above scenario, the expiration time only needs to be long enough to
> cover the gaps caused by load balancing/mirroring.  One day seemed like plenty
> of time to cover those gaps. 

Fair enough.

> It also seemed long enough to smooth over
> occasional periods of 503 errors (assuming that a site is able respond at least
> once in a given 24 hour period).  

Not sure about that though. While enough to cover load balancing inadequacies, one day may be too short for feeds that actually fail.

However, if such feeds would actually be treated as errors (see Comment #3), this would probably be solved too. Could we do anything about that?
(In reply to comment #18)
> However, if such feeds would actually be treated as errors (see Comment #3),
> this would probably be solved too. Could we do anything about that?

Let's file and track that as a separate bug.
heh, this bug was about it :)
(In reply to comment #20)
> heh, this bug was about it :)

Yes, but since then it's been repurposed to address a related issue, and it's better to address one issue per bug, so I think it would be better to address the related issue here and then file another bug for the original and remaining issue.

Alternately, we can keep this bug open pending the fix for the original issue, which preserves some context about that problem, but at the cost of making it less clear what has and hasn't been fixed.
OK, it's now bug 491720. Also adjusting Summary here.
Summary: feed that fails intermittently as empty feed with 503 causes duplicate items → messages should not be removed from cache immediately after disappearing from feed
(In reply to comment #16)
> (From update of attachment 375758 [details] [diff] [review])
> thx for doing this, looks good, except for a few style nits:

I have a new patch that cleans up the style issues, but I'm not sure how to flag it properly... Do I need to re-request superreview?  It doesn't appear that I can set it to '+' like I could with the review.
(In reply to comment #23)
> I have a new patch that cleans up the style issues, but I'm not sure how to
> flag it properly... Do I need to re-request superreview?  It doesn't appear
> that I can set it to '+' like I could with the review.

You could re-request superreview from :bienvenu, but folks usually assume that a super-review with nits carries forward to a patch that only fixes the nits.
I've made the suggested style changes.  Hopefully they're OK. 

I decided that I didn't trust myself enough with JavaScript not to do anything silly and therefore requested re-review.
Attachment #375758 - Attachment is obsolete: true
Attachment #376099 - Flags: superreview?(bienvenu)
Attachment #376099 - Flags: review+
(In reply to comment #18)
> Not sure about that though. While enough to cover load balancing inadequacies,
> one day may be too short for feeds that actually fail.
> 
> However, if such feeds would actually be treated as errors (see Comment #3),
> this would probably be solved too. Could we do anything about that?

Apologies for hijacking this bug; it seemed like the underlying causes were the same so I posted here.  I will be more careful about this in the future.

FWIW, I did subscribe to the test feed from comment #1 that produces intermittent 503 errors and haven't experienced duplicate items with the expiration logic in place.  

However, it does seem reasonable that errors get handled explicitly as you're suggesting.
Attachment #376099 - Flags: superreview?(bienvenu) → superreview+
Keywords: checkin-needed
(In reply to comment #26)
> Apologies for hijacking this bug; it seemed like the underlying causes were the
> same so I posted here.  I will be more careful about this in the future.

No problem, really. :)

> FWIW, I did subscribe to the test feed from comment #1 that produces
> intermittent 503 errors and haven't experienced duplicate items with the
> expiration logic in place.  

It's because the 503 error is only being generated for 5 minutes, and then for another 5 it's all great. With your patch in place, I would have to adjust that to 1 or 2 days.

> However, it does seem reasonable that errors get handled explicitly as you're
> suggesting.

Hence the other bug. :)
Summary: messages should not be removed from cache immediately after disappearing from feed → RSS messages should not be removed from cache immediately after disappearing from feed
Thanks for fixing this.

When I was preparing to check it in, I noticed a couple of blank lines that you added that also had whitespace on them - we normally try to avoid that with blank lines, so I removed the extra whitespace before checkin in.

http://hg.mozilla.org/comm-central/rev/1d7bfe5ae518
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Keywords: checkin-needed
Resolution: --- → FIXED
Target Milestone: --- → Thunderbird 3.0b3
Attachment #376099 - Attachment description: Update to patch 375758, fixed the style nits (expiring feed items from cache after one day) → [checked-in] Update to patch 375758, fixed the style nits (expiring feed items from cache after one day)
You need to log in before you can comment on or make changes to this bug.