Open Bug 1170646 Opened 5 years ago Updated 3 months ago

A few M-C fixes to handle short read in Cache code ( from bug 1170564 Failure to deal with short read)

Categories

(Core :: Networking: Cache, defect, P5, major)

All
Linux
defect

Tracking

()

REOPENED
mozilla42

People

(Reporter: ishikawa, Assigned: ishikawa)

References

(Blocks 1 open bug)

Details

(Whiteboard: [necko-would-take])

Attachments

(2 files, 3 obsolete files)

This blocks bug 1170564.

This bugzilla entry is about

>3. A few M-C fixes.

in the bug 1170564 .

+++ This bug was initially created as a clone of Bug #1170564 +++

After spending some time to weed out the failure to detect I/O
errors from C-C TB code during message download [1], I have
come to realize there are issues regarding short read.

Short read refers to |read| system call returning without the
full number of requested octets although the subsequent |read|
calls would return the missing octets. That is, short read
returns prematurely.

Short read can happen during a read operation against a
remotely mounted file. High-load on the remote server, or a
transient network issue can cause such short read.

It is not a stranger to FF developers. Network code is full of
cautious programming to take care of such short read.

Unfortunately, C-C TB (under linux at least) fails to handle
the short read very well and as a result we see errors which I
noticed during tests.

So I set out to identify and fix these issues and this is a
meta entry to track such errors.

It turns out that performing additional writes after a
short-write is automagically handled by |pt_Write|. Isn't it
great?

Then why not pt_Read? As I explain later in this post, we
sometimes don't know how many we should read in
advance. That's why.

Short read simulated:

I have come to learn that mounting one's profile to remote
file server seems to cause strange errors.  Since I don't have
a CIFS server that seems to cause such short read very often
under heavy I/O work load, I resort to simulation.

I inject short read by preparing some functions such as |read|
and |read64| that simulates short read and preload them using

"export LD_PRELOAD=that-file.so" 

before running C-C thunderbird to find errors caused by short
read.


How to FIX after identifying the trouble spot:

When you deal with short read issues there are a few things
that we must keep in mind.

At the level of |read| system call, we have no idea whether a
short read ought to be followed by additional |read| or not.

Then where should we make the decision?  It depends on the
type of |read| operation.

There are two types of |read| operations.

(a) # of octets to read is known in advance.

    The caller of |read| knows exactly that there are certain
    number of octets to be read in advance.

    E.g. 

     (1) we check the size of JSON file by calling |stat| in
         advance and try to read that many octets in one
         |read|.  [As it turns out this is a real-world
         problem.]
    
     (2) we try to read a "block" of a known data structure
         and sizeof(block) is already known in advance (at
         compile time, etc).
         A case in point is the "block" for CACHE operation.
	 (There was an issue and it was fixed.)

In these cases, we ought to perform additional |read|
operations until all the expected number of octets are read
(or until error condition or EOF is encountered.) in the face
of short read. So we must provide a mechanism for such
additional reads.

In my patch, I have introduced a couple of variants of |Read|
operation and |PR_Read| that performs additional read
operations if we failed to read the desired # of octets in one
run until all the octets are read.


(b)  # of octets to read is not known in advance.

     E.g.,

     (1) There is a mime-encoded data in a file (as in
     attachment), and the program needs to read as many data
     as possible until the end of the data marker line is
     reached.

     (2) In general, suppose there is a data structure written
     in ASCII representation in a file, and also suppose the
     caller of |read| can know the end of the logical data
     structure by reading a certain textual construct that
     signals the end of the data. In this case, the caller of
     |read| needs to perform the |read| repeatedly until this
     end of data marker is read.
     Case (1) is obviously a particular case of (2).

     (3) There are similar variants.

In these cases, we can NOT rely on the number of octets to
read in advance. The size of the buffer passed to |read| is
just an estimate, a ball park figure. We may have more data to
read or much less data to read actually.  We ought to perform
additional |read| until the end of data marker is reached (or
the error is signaled or EOF is reached.)

Between (a) and (b), the main difference is case (a) knows the
fixed number of octets to be read in advance, and case (b)
does not know it.

In practice, |read| calls of case (b) tend to happen inside
loops and are relatively easy to spot, and no modification is
necessary most of the time even if short read has to be
considered.  (Well there are cases where the ERROR condition
was not detected at all in the current code and I tried to fix
them.)

The short read in case (a) needs to be handled at the spot
where the required number of octets is known in the call
chain.  At the |read| system call level, we have no idea
whether a particular short |read| ought to be followed by
additional reads at all.

To figure out where such problematic short read is
encountered, and to learn whether the |read| is type (a) or
type (b) above, I had to dump stack trace to figure out where
such short read is NOT followed by necessary additional read
operations. Using the stack trace, I tried to figure out where
the decision to perform additional reads can be made and, if
necessary, replaced the read with a version that performs such
additional read operations.

I think I explained enough background.

There will be four bugzilla entries I intend to submit today and
block this meta entriy.

1. NSS "dbm" code used for certificate and key handling.
   (Already reported in newsgroup/mailing list.)

2. Mostly C-C T-B fixes to take care of short read.

3. A few M-C fixes.

   Yes, there *ARE* places where short read is not handled in
   M-C!  I was surprised myself.

   Maybe storing one's profile under remote file server was an
   arcane idea when mozilla code base was written years ago.
   But today, in corporate setup, PC is often used like a thin
   client and everything including your profile may be stored
   on a central server.

4. Case of reading JSON file from JavaScript.

There is a known bug which I don't know how to fix.  It is
filed as a separate bugzilla. It is concerned with a reading
of JSON file from JavaScript.

There are a few more unresolved cases that manifest themselves
during local |make mozmill| test suite operation with
simulated short read injection, but I am not sure if they are real and not timing-dependent random errors that are seen before. [Yes,
there are such random errors in |make mozmill|.]

I will add them to known C-C T-B fixes and M-C fixes *iff*
they turn out to be real bugs in the future.

TIA

[1] Bug 1121842 - [META] RFC: C-C Thunderbird - Cleaning of
incorrect Close, unchecked Flush, Write etc. in nsPop3Sink.cpp
and friends.
Assignee: nobody → ishikawa
The patch has been tested with other patches using |make mozmill| with or without the injection of simulated short reads.
It did not produce new errors that were not there before.

Re: other patches.

Please read the following to learn what "short read" means.

There are many files in C-C tree to fix I/O error handling (or the lack of it) mentioned in [1] and other bugzilla entries
mentioned there in.
The patch uploaded here was tested with these patches and
one in [2].

[1] Bug 1121842 - [META] RFC: C-C Thunderbird - Cleaning of
incorrect Close, unchecked Flush, Write etc. in nsPop3Sink.cpp
and friends.
[2] Bug 1170606 - C-C T-B fixes to take care of short read (part of [META] Failure to deal with short read)
Blocks: 1170564
No longer depends on: 1170564, 1170578, 1170668
This can go in independently of other patches, and so I am requesting a review now.
TIA
Comment on attachment 8614151 [details] [diff] [review]
A couple of fixes to Cache routines to handle short read.

I wonder if deleting cache/* (in favor of cache2) is on the roadmap.. or do we need to keep it around for appcache?
Attachment #8614151 - Flags: review?(mcmanus) → review?(honzab.moz)
(In reply to Patrick McManus [:mcmanus] from comment #3)
> Comment on attachment 8614151 [details] [diff] [review]
> A couple of fixes to Cache routines to handle short read.
> 
> I wonder if deleting cache/* (in favor of cache2) is on the roadmap.. or do
> we need to keep it around for appcache?

We so far need cache/ for appcache.  There is more a plan to remove appcache altogether one day (blocked on SW) than to cut-off cache/ bits needed for appcache only and throw the rest away.
Comment on attachment 8614151 [details] [diff] [review]
A couple of fixes to Cache routines to handle short read.

Review of attachment 8614151 [details] [diff] [review]:
-----------------------------------------------------------------

Forwarding to Michal, he know this code better than me.
Attachment #8614151 - Flags: review?(honzab.moz) → review?(michal.novotny)
(In reply to Patrick McManus [:mcmanus] from comment #3)
> Comment on attachment 8614151 [details] [diff] [review]
> A couple of fixes to Cache routines to handle short read.
> 
> I wonder if deleting cache/* (in favor of cache2) is on the roadmap.. or do
> we need to keep it around for appcache?

Just a reminder: the bug was noticed by C-C TB |make mozmill| test suite run.
So C-C TB definitely uses the code today.

TIA
Comment on attachment 8614151 [details] [diff] [review]
A couple of fixes to Cache routines to handle short read.

Review of attachment 8614151 [details] [diff] [review]:
-----------------------------------------------------------------

(In reply to ISHIKAWA, Chiaki from comment #6)
> Just a reminder: the bug was noticed by C-C TB |make mozmill| test suite run.
> So C-C TB definitely uses the code today.

C-C is obviously using the old cache. It should switch to the new cache because the old one will be removed in the future.

::: netwerk/cache/nsDiskCacheBlockFile.cpp
@@ +26,5 @@
> +
> +    while (remaining > 0)
> +    {
> +        n = PR_Read(fd, start, remaining);
> +        if (n < 0)

Please use the style common for this file. I.e.

if () {
} else {
}

@@ +28,5 @@
> +    {
> +        n = PR_Read(fd, start, remaining);
> +        if (n < 0)
> +        {
> +            /* may want to repeat if errno == EINTR */

If it's needed please add it to the patch, if not then remove this comment.

@@ +287,3 @@
>      
> +    /* xxx I wonder why no I/O error check was done on the above read and NS_OK
> +     * was returned always. Returning NS_ERROR_UNEXPECTED seems a good idea.*/

Because the return value was checked by the callers of this method. Anyway, such comment should be in the bugzilla comment describing the patch, not in the patch itself.

@@ +288,5 @@
> +    /* xxx I wonder why no I/O error check was done on the above read and NS_OK
> +     * was returned always. Returning NS_ERROR_UNEXPECTED seems a good idea.*/
> +
> +    if(*bytesRead == -1)
> +        return NS_ERROR_UNEXPECTED;

Move the check below the CACHE_LOG_DEBUG(), otherwise the failed read won't be logged.
Attachment #8614151 - Flags: review?(michal.novotny) → review-
(In reply to Michal Novotny (:michal) from comment #7)
> Comment on attachment 8614151 [details] [diff] [review]
> A couple of fixes to Cache routines to handle short read.
> 
> Review of attachment 8614151 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> (In reply to ISHIKAWA, Chiaki from comment #6)
> > Just a reminder: the bug was noticed by C-C TB |make mozmill| test suite run.
> > So C-C TB definitely uses the code today.
> 
> C-C is obviously using the old cache. It should switch to the new cache
> because the old one will be removed in the future.
> 
> ::: netwerk/cache/nsDiskCacheBlockFile.cpp
> @@ +26,5 @@
> > +
> > +    while (remaining > 0)
> > +    {
> > +        n = PR_Read(fd, start, remaining);
> > +        if (n < 0)
> 
> Please use the style common for this file. I.e.
> 
> if () {
> } else {
> }
> 
> @@ +28,5 @@
> > +    {
> > +        n = PR_Read(fd, start, remaining);
> > +        if (n < 0)
> > +        {
> > +            /* may want to repeat if errno == EINTR */
> 
> If it's needed please add it to the patch, if not then remove this comment.
> 
> @@ +287,3 @@
> >      
> > +    /* xxx I wonder why no I/O error check was done on the above read and NS_OK
> > +     * was returned always. Returning NS_ERROR_UNEXPECTED seems a good idea.*/
> 
> Because the return value was checked by the callers of this method. Anyway,
> such comment should be in the bugzilla comment describing the patch, not in
> the patch itself.
> 
> @@ +288,5 @@
> > +    /* xxx I wonder why no I/O error check was done on the above read and NS_OK
> > +     * was returned always. Returning NS_ERROR_UNEXPECTED seems a good idea.*/
> > +
> > +    if(*bytesRead == -1)
> > +        return NS_ERROR_UNEXPECTED;
> 
> Move the check below the CACHE_LOG_DEBUG(), otherwise the failed read won't
> be logged.

Thank you for the comment. I will re-create the patch per your suggestions.

TIA
I modified the patch per suggestion.

I left the comment regarding EINTR in the comment preceding the message.

It is not clear to me whether we should handle it or not.
Regarding this I have no idea. 

There are conflicting information as far as I am concerned.
I was advised that EINTR should not percolate up to this level, but then I notice that it *may* under some circumstances (in bug Bug 1170668).
To ask for a future attention on this matter, I think leaving a single line or two should not hurt IMHO.

TIA
Attachment #8625144 - Flags: review?(michal.novotny)
Attachment #8614151 - Attachment is obsolete: true
I noticed a typo in the comment in patch Take-2 version. 
So the patch is replaced.

TIA
Attachment #8625144 - Attachment is obsolete: true
Attachment #8625144 - Flags: review?(michal.novotny)
Attachment #8625146 - Flags: review?(michal.novotny)
Attachment #8625146 - Flags: review?(michal.novotny) → review+
Thank you.

I will put checkin-needed keyword.
sorry had to back this out for bustage like https://treeherder.mozilla.org/logviewer.html#?job_id=11423087&repo=mozilla-inbound - for a relanding can you also provide a new try run? thanks!
Flags: needinfo?(ishikawa)
(In reply to Carsten Book [:Tomcat] from comment #13)
> sorry had to back this out for bustage like
> https://treeherder.mozilla.org/logviewer.html#?job_id=11423087&repo=mozilla-
> inbound - for a relanding can you also provide a new try run? thanks!

Oops. Sorry.
I have no idea what happened.
Quite likely to be the local tree got out of sync with the 
trunk somewhat.

Will check with tryserver.
Flags: needinfo?(ishikawa)
(In reply to Carsten Book [:Tomcat] from comment #13)
> sorry had to back this out for bustage like
> https://treeherder.mozilla.org/logviewer.html#?job_id=11423087&repo=mozilla-
> inbound - for a relanding can you also provide a new try run? thanks!

Now I realize it is the windows version compilation.
For some reason, my attempt to compile for windows on tryserver is a hit and miss event. I have no idea why sometimes I get a compilation error.

I am afraid I might have missed something here due to my failures to test windows compilation thoroughly.

I will try to see if the particular job submission has anything different from my past windows compilation efforts on tryserver.

TIA
I found out the problem.
I used |ssize_t| for a function return type. (My local test machine runs linux.)
As it turns out, this is not good for Windows compilation: we need to use |PRInt32| instead.

I have submitted a tryserver job for C-C TB compilation.
(I have not been able to check windows compilation very well 
for a few weeks there.)
Let me see if it goes well this time.

https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=7983cac21e61
Attachment #8625146 - Attachment is obsolete: true
Attachment #8630603 - Flags: review+
(In reply to Carsten Book [:Tomcat] from comment #13)
> sorry had to back this out for bustage like
> https://treeherder.mozilla.org/logviewer.html#?job_id=11423087&repo=mozilla-
> inbound - for a relanding can you also provide a new try run? thanks!

I could finally get a successful tryserver run.
(I don't know why but try jobs failed for a few days this week. I think it is an infrastructure problem.)

https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=e0d1e9cd52fc

It compiles fine now, and
I don't believe  it has introduced any new errors.

TIA
https://hg.mozilla.org/mozilla-central/rev/f600f0cd7bb3
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla42
Depends on: 1199957
The evidence in bug 1199957 is that this patch is causing serious regressions in Thunderbird which is blocking our current beta. AFAIK, /netwerk/cache/ is no longer used outside of Thunderbird, as Firefox has switched to cache2.

So I believe that we can just back this out of mozilla-central, mozilla-aurora, and mozilla-beta based on what is optimal for Thunderbird. Is that correct? Who should I ask for this permission? Not sure how to do a permission flag for a backout.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Michal, could you please respond to comment 21? Thanks.
Flags: needinfo?(michal.novotny)
(In reply to Kent James (:rkent) from comment #21)
> The evidence in bug 1199957 is that this patch is causing serious
> regressions in Thunderbird which is blocking our current beta. AFAIK,
> /netwerk/cache/ is no longer used outside of Thunderbird, as Firefox has
> switched to cache2.

Parts of /netwerk/cache are still used in firefox (appCache), but this patch changes only file nsDiskCacheBlockFile.cpp which isn't used by firefox anymore. So from the perspective of firefox, it doesn't matter whether the patch is applied or not.


> So I believe that we can just back this out of mozilla-central,
> mozilla-aurora, and mozilla-beta based on what is optimal for Thunderbird.
> Is that correct? Who should I ask for this permission? Not sure how to do a
> permission flag for a backout.

Create a backout patch and ask for approval.
Flags: needinfo?(michal.novotny)
OK, I'll go through normal review steps for the backout. See bug 1199957 comment 20 for justification.
Attachment #8667359 - Flags: review?(michal.novotny)
Attachment #8667359 - Flags: review?(michal.novotny) → review+
Depends on: 1204381
Comment on attachment 8667359 [details] [diff] [review]
Backout f600f0cd7bb3 because of Thunderbird regressions with OSX

Approval Request Comment
[Feature/regressing bug #]: This bug, as this is a backout.
[User impact if declined]: Severe Thunderbird regression on OSX systems, see dependent bugs.
[Describe test coverage new/current, TreeHerder]: Unknown.
[Risks and why]: Code is not used in Firefox, only in Thunderbird. Minimal risk as this just returns us to the state we were in for years.
[String/UUID change made/needed]: None.
Attachment #8667359 - Flags: approval-mozilla-beta?
Attachment #8667359 - Flags: approval-mozilla-aurora?
Comment on attachment 8667359 [details] [diff] [review]
Backout f600f0cd7bb3 because of Thunderbird regressions with OSX

Next time, please request the backout in a separate bug.
Attachment #8667359 - Flags: approval-mozilla-beta?
Attachment #8667359 - Flags: approval-mozilla-beta+
Attachment #8667359 - Flags: approval-mozilla-aurora?
Attachment #8667359 - Flags: approval-mozilla-aurora+
Whiteboard: [necko-would-take]
Bulk change to priority: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: -- → P5

AIUI Thunderbird is no longer using cache v1, so this is no longer an issue for us?

Flags: needinfo?(ishikawa)

This isn't a meta

Keywords: meta
Summary: A few M-C fixes to handle short read in Cache code ( from [META] Failure to deal with short read) → A few M-C fixes to handle short read in Cache code ( from bug 1170564 Failure to deal with short read)

(In reply to Wayne Mery (:wsmwk) from comment #31)

AIUI Thunderbird is no longer using cache v1, so this is no longer an issue for us?

Wayne, you have asked several questions including this one via bugzilla.

I will try to answer them over the weekend: I run linux inside virtualbox that is hosted under Windows 10 on home PC. That linux environment is my main development ground. Win10 introduces new semi-annual (?) glitzy build 1903 and my windows machine picked it up early this week. Disaster. Ever since, host win10 cannot run more than half an hour before it hits KERNEL_MODE_HEAP_CORRUPTION error. I had to disable and downgrade some device drivers until win10 runs overnight without crashing. (Before finding the final stable condition, it simply crashed without my having logged in !).
I am in the process of formating (clang-format) my mozilla M-C patches for local compilation with very strict compiler flag:these patches may be of interest.
Anyway, now that I think my PC can function without my worrying about crashing any minute, I can look into the issues you raised
here and elsewhere.
I also know that there is a pending request: I have been meaning to get to it, but I have had a health problem that kept me from doing it. (Not life-threatening, but uncomfortable enough that I could not sit in front of PC for a prolonged debug/programming session.)

Flags: needinfo?(ishikawa)
You need to log in before you can comment on or make changes to this bug.