Closed Bug 835920 Opened 7 years ago Closed 7 years ago

UnixSocket and Ril mis-interpret read() returning 0 as failure and enter infinite reconnect loop

Categories

(Firefox OS Graveyard :: General, defect)

x86_64
Linux
defect
Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bjacob, Assigned: bjacob)

References

Details

Attachments

(3 files, 2 obsolete files)

Attached patch patch (obsolete) — Splinter Review
read returning 0 indicates end-of-file, accoring to the man page I'm reading. But Ril.cpp seems to be interpreting it as an error condition and then tries to reconnect.

http://hg.mozilla.org/mozilla-central/file/6cca454559c8/ipc/ril/Ril.cpp#l256

This is giving me an infinite loop with this repeated many times in logcat:

I/Gonk    (  367): Cannot read from network, error 0
I/Gonk    (  367): Socket open for RIL

The attached patch makes it only interpret strictly negative return values as errors; in the zero case, we then seem to be correctly acting on EOF at this line:

http://hg.mozilla.org/mozilla-central/file/6cca454559c8/ipc/ril/Ril.cpp#l284

At least, this removes the infinite reconnecting problem for me.
Attachment #707720 - Flags: review?(jones.chris.g)
Here's an attempt at a theory for why this is happening only for me now and not for everyone everytime. The read size here, RilRawData::MAX_DATA_SIZE, is defined in Ril.h to be 1024. So IIUC we will only hit this bug if the size of the file being read here is an exact multiple of 1024 bytes.
RIL.cpp is going to be rewritten with UnixSocket in bug 826931. However, the implementation of UnixSocket has the same problem: http://hg.mozilla.org/mozilla-central/file/6cca454559c8/ipc/unixsocket/UnixSocket.cpp#l656
Benoit, could you fix UnixSocket instead?
Depends on: 826931
Attached patch unixsocket patch (obsolete) — Splinter Review
Thanks, here's a patch fixing it there.
Attachment #707731 - Flags: review?(jones.chris.g)
Attached patch unixsocket patchSplinter Review
This version avoids sending an empty task if read() returned 0. I assume that's better, but I don't really know for sure and that's untested.
Attachment #707731 - Attachment is obsolete: true
Attachment #707731 - Flags: review?(jones.chris.g)
Attachment #707733 - Flags: review?(jones.chris.g)
Attached patch ril patchSplinter Review
And same for the RIL patch, avoid sending empty data if read() returned 0.

There's some trickiness around here: the if (!mIncoming) condition means it matters a lot that .forget() sets the pointer to null. So when we if() this out, we have to make sure that we still null the pointer.

Hope I got this right; seems to be working here, but I can't be very confident, as I'm in the middle of bigger changes and my build is non-functional anyways.
Attachment #707720 - Attachment is obsolete: true
Attachment #707720 - Flags: review?(jones.chris.g)
Attachment #707741 - Flags: review?(jones.chris.g)
(In reply to Benoit Jacob [:bjacob] from comment #6)
> Created attachment 707741 [details] [diff] [review]
> ril patch

The whole content of this file is going to be removed. Fixing anything here is probably meaningless.
Well, how soon is it going to be removed? I seem to be hitting this with the gfx-layers-refactoring branch, so I'm blocked on either fixing Ril.cpp or having it replaced.
Summary: Ril mis-interprets read() returning 0 as failure and enters infinite reconnect loop → UnixSocket and Ril mis-interpret read() returning 0 as failure and enter infinite reconnect loop
(In reply to Benoit Jacob [:bjacob] from comment #8)
> Well, how soon is it going to be removed? I seem to be hitting this with the
> gfx-layers-refactoring branch, so I'm blocked on either fixing Ril.cpp or
> having it replaced.

I'll update my patches tomorrow (actually 6 hr later or "this morning"), and hopefully I can land them tonight (20+ hr later I guess).
Great, that works for me (anything that lands before the end of next week is good enough for me).
Comment on attachment 707741 [details] [diff] [review]
ril patch

cancelling this review request per above comments.
Attachment #707741 - Flags: review?(jones.chris.g)
Attachment #707733 - Flags: review?(jones.chris.g) → review?(vyang)
Comment on attachment 707733 [details] [diff] [review]
unixsocket patch

Actually I'm not a IPC peer or the author of UnixSocket, but in my opinion, I would rather have:

  ssize_t ret = read(aFd, data, MAX_READ_SIZE);
  if (!ret) {
    // EOF found.
    return;
  }
  if (ret < 0) {
    ...
Attachment #707733 - Flags: review?(vyang) → review?(kyle)
Attachment #707733 - Flags: review?(kyle) → review+
Crap. I didn't see vicamo's comment before I r+'d. I agree on the cleanliness, can be filed as a followup.
OK, will do asap.
The patch doesn't look correct to me. If read() returns EOF this means the peer close()'d the FD (e.g. the peer crashed or has been killed). For RIL.cpp we should treat this as an error and reconnect. read() returns 0 as not EOF only when we pass 0 as the buffer size, but this is not the case in Ril.cpp or UnixSocket.cpp.

This is non-blocking IO. If the peer sends multiple of 1024 bytes then we should get -1 and errno = EAGAIN on the last read() call, not 0 as EOF. As a test I applied the ril patch and stopped rilproxy to trigger the error, then I didn't see reconnection, even after rilproxy was then restarted. The result is: I couldn't dial out anymore. I didn't test UnixSocket part, but I think the result is similar.
I thought it still made sense to put the EOF handling _after_ the error handling: while read() returning both as separate return values makes this not matter in practice, _if_ the API was slightly different so that we could have both an error and EOF, I think that we would want to handle the error.

Also added an assertion to guard our assumption below that ret > 0.
Attachment #708574 - Flags: review?(kyle)
Comment on attachment 708574 [details] [diff] [review]
follow-up patch based on Vicamo's suggestion

Review of attachment 708574 [details] [diff] [review]:
-----------------------------------------------------------------

::: ipc/unixsocket/UnixSocket.cpp
@@ +676,5 @@
>          return;
>        }
> +      if (ret == 0) {
> +        // EOF found, zero bytes read
> +        return;

As Cervantes' comment #16, we should fix rilproxy crash/socket missing instead.

@@ +681,2 @@
>        }
> +      MOZ_ASSERT(ret > 0); // other cases handled above

This assertion seems to be meaningless.

@@ +688,2 @@
>        if (ret < ssize_t(MAX_READ_SIZE)) {
> +        // EOF found

It's not EOF here actually.
Comment on attachment 708574 [details] [diff] [review]
follow-up patch based on Vicamo's suggestion

Review of attachment 708574 [details] [diff] [review]:
-----------------------------------------------------------------

Agreeing with vicamo and cervantes' reviews here. Go ahead and throw vicamo on as addl. review for patches from here on out, since he's seeing UnixSocket more than anyone else on the team these days.
Attachment #708574 - Flags: review?(kyle) → review-
https://hg.mozilla.org/mozilla-central/rev/f85c0ab34738
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Hi Benoit, may I have your build config?  I tend to revert the changes made in this bug because reconnection on EOF is actually nothing wrong as Cervantes had described.  When you --enable-b2g-ril, then there is supposed to be ril socket open for RilConsumer.  We can't tell the cases you don't have rild at all or the rild dies.  As a result, reconnect on such event is necessary and a correct behaviour.  Please comment more on your case, config and let's whether can we find a solution for both.
Flags: needinfo?(bjacob)
Sorry for the delay replying. My .userconfig is:

GECKO_PATH=/hack/mozilla-graphics
export B2G_DEBUG=1
export B2G_NOOPT=1

Where mozilla-graphics does not have any relevant change. So I suppose that the only thing in my .userconfig that could make a difference is B2G_DEBUG.

Do as you wish -- I hold off from making an update patch according to review comments, until you've decided whether you want to revert this.
Flags: needinfo?(bjacob)
You need to log in before you can comment on or make changes to this bug.