Closed Bug 711315 Opened 10 years ago Closed 10 years ago

RIL: reset state upon socket reconnect

Categories

(Core :: General, defect, P1)

ARM
Gonk (Firefox OS)
defect

Tracking

()

RESOLVED FIXED
mozilla18
blocking-kilimanjaro +
blocking-basecamp +

People

(Reporter: philikon, Assigned: cyu)

References

Details

Attachments

(1 file, 2 obsolete files)

The RIL IPC thread is going to have to send a special events for disconnect/reconnect. The RIL worker picks those up, resets all state/boots up the radio in the right places.
Assignee: nobody → kyle
Any idea if this has changed with the landing of Bug 717451?
And by changed, I mean does it need to relay in a way different than it would have before dealing with multiple page sync.
Nothing has changed, I'd say. WebTelephony isn't the only consumer of RIL events. This will be an "interesting" case to deal with for all of them. I wonder if we should add an event to all those APIs to the effect of "forget everything, device driver crashed." This seems like a concept that might transcend all device-based APIs, not just the ones based on the RIL.
First planned step so far is to just start kill -9'ing rild during different events (in a call, sending an sms, etc...) and seeing what happens. Has that been done yet?

Also, do we want to open another bug for dealing with rilproxy crashing? If rilproxy crashes we don't actually lose radio state, just our connection, so we /should/ be ok just picking up and going as we were I think?
Dropping since I'm not really on RIL much anymore.
Assignee: kyle → nobody
blocking-basecamp: --- → +
blocking-kilimanjaro: --- → +
Vincent, would you mind taking a look at this? If you have any questions, I'm sure qDot will be happy to answer them. :p
Assignee: nobody → vchang
(In reply to Philipp von Weitershausen [:philikon] from comment #6)
> Vincent, would you mind taking a look at this? If you have any questions,
> I'm sure qDot will be happy to answer them. :p

In my understanding, the message passing between ril worker and ril daemon is 
ril_worker <--> I/O thread(IPC) <--> rilproxy <--> Android ril daemon

When will RIL IPC thread need to send a special events to ril worker for disconnect/reconnect ? When rilproxy or ril is being crashed ?
(In reply to Vincent Chang from comment #7)
> In my understanding, the message passing between ril worker and ril daemon
> is  ril_worker <--> I/O thread(IPC) <--> rilproxy <--> Android ril daemon

Yes.

> When will RIL IPC thread need to send a special events to ril worker for
> disconnect/reconnect ? When rilproxy or ril is being crashed ?

That is my understanding, yes.
(In reply to Philipp von Weitershausen [:philikon] from comment #8)
> (In reply to Vincent Chang from comment #7)
> > In my understanding, the message passing between ril worker and ril daemon
> > is  ril_worker <--> I/O thread(IPC) <--> rilproxy <--> Android ril daemon
> 
> Yes.
> 
> > When will RIL IPC thread need to send a special events to ril worker for
> > disconnect/reconnect ? When rilproxy or ril is being crashed ?
> 
> That is my understanding, yes.

I tried to kill rilproxy and rild when doing incoming/outgoing call and talking. After that, I can't dial out or receive a call anymore. It seems that the call state is wrong. Will try to find out if we can get any events when rild is crashed.
Hi Philikon, 

We can send a message to ril_worker when we detect read error in  ipc/Ril.cpp function RilClient::OnFileCanReadWithoutBlocking.  We can create the other unsolicited message like   UNSOLICITED_RIL_SOCKET_RECONNECT  or use JS_CallFunctionName to create the other function to be called in ril_worker, like onRILSocketReset.  

Case 1, rild crash 
Rilproxy detects read socket error when rild is restarting. Then, rilproxy closes opened sockets, one is for rild and the other one is for IPC, and create sockets for rild and IPC again. So IPC call could detect read error because Rilproxy re-creates the sockets. 
Case 2, rilproxy crash 
Rilproxy is crashed and is restarted. Because IPC socket is re-created, so IPC call could detect read error.
Assignee: vchang → cyu
To handle B2G crash issue first.
On b2g/rild/rilproxy restart, the inconsistencies in RIL call state are:

1) b2g restart: while there are active calls, if b2g restart, ril worker won't know there are active calls. We need to query rild about the current calls and hang them up (but don't notify DOM).

2) rild restart: while there are active calls, if rild restart, the active calls will be disconnected. We need to clean up RIL.currentCalls and notify DOM about this event.

3) rilproxy restart: rild and b2g think each other disconnected, but neither actually did. Will go through cleanup actions in both 1) and 2).

Since in any of the above, connection to rild will be reestablished, the handler for UNSOLICITED_RIL_CONNECTED is a good place to place cleanup code in. The plan is to issue REQUEST_GET_CURRENT_CALLS on ril connected and do cleanup work in the wrapped REQUEST_GET_CURRENT_CALLS handler. The REQUEST_GET_CURRENT_CALLS handler will be restored to default on completion.
Attachment #649194 - Flags: review?(philipp)
Test on try: https://tbpl.mozilla.org/?tree=Try&rev=00ee42eac202

Only manual tests to verify:
- Kill b2g on active phone call. After b2g restarted verify by making another call.
- Kill rild on active phone call. After rild restarted verify by making another call.
- Kill rilproxy on active phone call. After rilproxy restarted verify by making another call.

There is some quirk on otoro: if you kill rild, the phone will enter "Emergency call only" state and you need to kill rild again to make recover. It seems some hardware resource is not freed and the rild is restarted too quickly. When this happens, kill rild again will recover the phone from the "Emergency call only" state.
Comment on attachment 649194 [details] [diff] [review]
Clean up current calls on RIL socket reconnect

Review of attachment 649194 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/system/gonk/ril_worker.js
@@ +3759,5 @@
>      debug("Detected RIL version " + version);
>      debug("RILQUIRKS_V5_LEGACY is " + RILQUIRKS_V5_LEGACY);
>    }
> +
> +  this._cleanUpCurrentCalls();

This isn't nearly enough. There's a lot of other state that needs to be invalidated (data calls, etc.).

I'm inclined to go with the nuclear option here: shut the radio down, reset *all* local state (to do that I suggest moving all the RIL member initialization [1] into a method called `RIL.initState()` so that we can reinitialize everything), and bring the radio back into the state that it was before. I realize that that will mean ongoing phone calls will be interrupted, but I think that's there's no way around that anyway, even with your current patch, right?

[1] https://mxr.mozilla.org/mozilla-central/source/dom/system/gonk/ril_worker.js#584
Attachment #649194 - Flags: review?(philipp) → review-
P.S.: I do think it's very elegant to use the RIL[UNSOLICITED_RIL_CONNECTED] parcel as a way to detect whether RIL has been (re)started.
There is no way around without some dirty work on b2g restart. On rild restart, the call will be disconnected anyway. So I will go with the nuke approach.

I will add reset of other states like 3g data call. We have 3g data call reset on rild restart, which needs to be consolidated with the next version of patch.
Priority: -- → P1
Changes to v1:

1. Reset (most) RIL data members on socket reconnection. Note that SMS segments are not handled per discussion with Vicamo. If rild ever crashes, we could still receive remaining segments when rild comes back.
2. Don't hang up current calls on b2g restart. This is handled in bug 775375, in which powering off ril will nuke any existing data calls or voice calls (and in a cleaner way).
Attachment #649194 - Attachment is obsolete: true
Attachment #651319 - Flags: review?(philipp)
Tests on try: https://tbpl.mozilla.org/?tree=Try&rev=6c8d0ff599ae

Manual tests executed:
1. kill b2g or rild when there is an active call. Verify that b2g and rild should be in a consistent disconnected state.
2. kill b2g or rild when there is an active data call. Verify that b2g and rild should be in a consistent disconnected state.

No new automatic test cases.
(In reply to Cervantes Yu from comment #18)
> 1. Reset (most) RIL data members on socket reconnection. Note that SMS
> segments are not handled per discussion with Vicamo. If rild ever crashes,
> we could still receive remaining segments when rild comes back.

That's a good point. Please add a comment for this.

> 2. Don't hang up current calls on b2g restart. This is handled in bug
> 775375, in which powering off ril will nuke any existing data calls or voice
> calls (and in a cleaner way).

Wait, bug 775375 handles unexpected radio states at Gecko startup. But here we have a different case: Gecko continues running while rild restarts. I don't see how your patch addresses that.
Comment on attachment 651319 [details] [diff] [review]
Reset ril states on RIL socket reconnect (v2)

Review of attachment 651319 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/system/gonk/ril_worker.js
@@ +581,5 @@
>   * This object communicates with rild via parcels and with the main thread
>   * via post messages. It maintains state about the radio, ICC, calls, etc.
>   * and acts upon state changes accordingly.
>   */
> +function _RIL() {

This seems unnecessary to me. Why go to a constructor? The RIL object still is just a singleton "bag".

@@ +614,5 @@
> +  this._initRILStates();
> +}
> +
> +_RIL.prototype = {
> +  _initRILStates: function _initRILStates() {

Not sure why you added the underscore and made it plural. What was wrong with `initRILState`?

@@ +674,5 @@
> +    for each (let currentCall in this.currentCalls) {
> +      delete this.currentCalls[currentCall.callIndex];
> +      this.getFailCauseCode(currentCall);
> +    }
> +    this.currentCallsLength = 0;

`getFailCauseCode` doesn't work with specific calls, it only gets the last error.

I can't see how this keeps data consistency. Your patch doesn't power down the radio, so we can't be sure what state the calls will be in. We delete them from our internal list, but don't really update the main thread.

I think it's much much cleaner if we just shutdown the radio, tell the DOM that all calls (voice and data) are now gone (loop through them, set state = disconnected, and send messages to main thread), and then start with a clean slate.

(Also note: bug 782778 is getting rid of `this.currentCallsLength`.)
Attachment #651319 - Flags: review?(philipp) → review-
> `getFailCauseCode` doesn't work with specific calls, it only gets the last
> error.
> 
I will use _handleDisconnectedCall() instead to notify DOM that the voice call is disconnected.

> I can't see how this keeps data consistency. Your patch doesn't power down
> the radio, so we can't be sure what state the calls will be in. We delete
> them from our internal list, but don't really update the main thread.
> 
Powering down the radio is done in function UNSOLICITED_RESPONSE_RADIO_STATE_CHANGED(),
which will be run after socket connected. It's already done in bug 775375. Voice and data
calls will be disconnected by it so I don't power off the radio again here.

> I think it's much much cleaner if we just shutdown the radio, tell the DOM
> that all calls (voice and data) are now gone (loop through them, set state =
> disconnected, and send messages to main thread), and then start with a clean
> slate.
> 
The 2 loops in the init function are for telling the DOM that calls are gone.
Here are how the ril state is reset in b2g or rild crash:

1. b2g crash: the RIL object is automatically initialized when ril_worker.js runs. Powering off
is done in function UNSOLICITED_RESPONSE_RADIO_STATE_CHANGED(). No need to notify DOM.
2. rild crash: call initRILState() in function UNSOLICITED_RILD_CONNECTED() to reset the data
members. Data and voice calls will be broken anyway. Use _handleDisconnectedCall() and deactivateDataCall()
to notify DOM.

Powering down radio is in UNSOLICITED_RESPONSE_RADIO_STATE_CHANGED() because it needs the response
from rild. One alternative is move it to initRILState() and do it unconditionally.
But we may end up powering down unnecessarily. Philipp, do you prefer doing this? Thanks.
Please note that this revision doesn't power off the radio in initRILState() because doing this during init needs postRILMessage(), which is still unavailable when ril worker is initialized.

The calls will be disconnected on rild crash, so initRILState() only needs to notify the DOM. Powering off the radio is better handled as is now, in UNSOLICITED_RESPONSE_RADIO_STATE_CHANGED.
Attachment #651319 - Attachment is obsolete: true
Attachment #654185 - Flags: review?(philipp)
(In reply to Cervantes Yu from comment #23)
> Please note that this revision doesn't power off the radio in initRILState()
> because doing this during init needs postRILMessage(), which is still
> unavailable when ril worker is initialized.

That could easily be fixed, but ok. I'm fine with the solution outline below.

> The calls will be disconnected on rild crash, so initRILState() only needs
> to notify the DOM. Powering off the radio is better handled as is now, in
> UNSOLICITED_RESPONSE_RADIO_STATE_CHANGED.

I understand now. Reconnecting to the RIL will give us another radio state changed, and it will be the initial one, so we will power cycle the radio. Good!
Attachment #654185 - Flags: review?(philipp) → review+
Cervantes: is this ready to land?
Yes, let's and it.
Target Milestone: --- → mozilla17
Target Milestone: mozilla17 → mozilla18
https://hg.mozilla.org/mozilla-central/rev/7c5550c63a24
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.