Closed Bug 1152730 Opened 5 years ago Closed 5 years ago

[Messages] We should retry sending SMS for some time if there is no network available

Categories

(Firefox OS Graveyard :: RIL, defect)

ARM
Gonk (Firefox OS)
defect
Not set

Tracking

(tracking-b2g:+, firefox41 fixed)

RESOLVED FIXED
2.2 S13 (29may)
tracking-b2g +
Tracking Status
firefox41 --- fixed

People

(Reporter: julienw, Assigned: freesamael)

References

Details

(Keywords: foxfood, Whiteboard: [sms-most-wanted])

Attachments

(4 files, 11 obsolete files)

14.28 KB, text/plain
Details
11.56 KB, patch
bevis
: review+
Details | Diff | Splinter Review
8.63 KB, patch
bevis
: review+
Details | Diff | Splinter Review
18.27 KB, patch
freesamael
: review+
Details | Diff | Splinter Review
From Alexandre Lissy, it looks like we don't retry sending SMS like we do with MMS.
Bevis, can you please describe the retry mechanism we have with SMS when network is unavailable ?
Flags: needinfo?(btseng)
For SMS retry,
It relies on the response from modem of the REQUEST_SEND_SMS.
Here are the possible error from modem:
 *  RADIO_NOT_AVAILABLE
 *  SMS_SEND_FAIL_RETRY
 *  FDN_CHECK_FAILURE
 *  GENERIC_FAILURE
If the error cause is E_SMS_SEND_FAIL_RETRY then we will resend SMS in ril_worker, and the maximal retry of the same SMS is set to 3.

That means we depends on modem to tell us if it is necessary to resend the SMS according to the network condition.
Flags: needinfo?(btseng)
I think it makes sense to send the error to Gaia for FDN_CHECK_FAILURE and GENERIC_FAILURE, but for RADIO_NOT_AVAILABLE we could reschedule a resend automatically. We should know when radio is available and send what's waiting. After ~5 minutes we could return the error to Gaia.

What do you think?
Flags: needinfo?(btseng)
I'm proposing this because this is the behavior we have with MMS (I know it's using a completely different code).
(In reply to Julien Wajsberg [:julienw] (PTO April 6th) from comment #4)
> I think it makes sense to send the error to Gaia for FDN_CHECK_FAILURE and
> GENERIC_FAILURE, but for RADIO_NOT_AVAILABLE we could reschedule a resend
> automatically. We should know when radio is available and send what's
> waiting. After ~5 minutes we could return the error to Gaia.
> 
> What do you think?


For FDN_CHECK_FAILURE & RADIO_NOT_AVAILABLE, the corresponding error cause are reported to gaia.
https://dxr.mozilla.org/mozilla-central/source/dom/mobilemessage/gonk/SmsService.js#273-276

IMO, these 2 scenario are permanent failure for sending SMS.
For RADIO_NOT_AVAILABLE, we don't know when the network will be available if user keep staying in a place out of network coverage. It's more clear to tell user the failure immediately and precisely.
Flags: needinfo?(btseng)
Bevis, when we move (especially using the train, or subway), we usually face moments where we have coverage and moments where we don't. That's why in my opinion it makes sense to have this behavior for RADIO_NOT_AVAILABLE. I'm not saying we should retry for hours, but we should retry for some minutes.

And I think this should be in Gecko because MMS behaves like this. If MMS didn't behave like this, I'd be fine with doing it in Gaia, but I think we need some consistency here.
Flags: needinfo?(btseng)
I want to add that this happens to Alexandre Lissy every day when he travels by train, and tis happens to me at least once every week when I travel with subway.

I think we should do a limited retry (limited to ~5min) in Gecko, and possibly implement in Gaia a larger retry ("please retry failed sendings every 15min") that the user could configure.
(In reply to Julien Wajsberg [:julienw] (PTO April 6th) from comment #8)
> I want to add that this happens to Alexandre Lissy every day when he travels
> by train, and tis happens to me at least once every week when I travel with
> subway.

Moreover, it's not something that happens with Android because thet keep retrying for quite some time before really failing like we do. Hence, our UX is very crappy.
(In reply to Julien Wajsberg [:julienw] (PTO April 6th) from comment #7)
> Bevis, when we move (especially using the train, or subway), we usually face
> moments where we have coverage and moments where we don't. That's why in my
> opinion it makes sense to have this behavior for RADIO_NOT_AVAILABLE. I'm
> not saying we should retry for hours, but we should retry for some minutes.
> 
I think we shall identify the cause firstly to see how we can improve this sending experience in a better way.
If the failure cause is SMS_SEND_FAIL_RETRY, I think a better idea to improve this is to have a timer instead of re-sending SMS immediately at [1].
IMO, if SMS_SEND_FAIL_RETRY were reported immediately after REQUEST_SEND_SMS in the symptom you mentioned, a timer is required to improve this.

For RADIO_NOT_AVAILABLE or other errors, my concern is that in 3GPP 24.011, there is a maximal number of retransmission set to 3 and SMS_SEND_FAIL_RETRY is the only way for modem to ask platform layer to trigger the retransmission, if we add more retransmissions into the implementation, this might cause  GCF/PTCRB test case failure. :(

3GPP 24.011 - 5.3.2.1          RPDU transfer for circuit switched service
"
The maximum number of CP-DATA message retransmissions is an implementation option but shall be either 1, 2 or 3.
"

3GPP TS 51.010-1 - 34.2.2     SMS mobile originated
"
The SS is configured not to send the CP-ACK message or to respond to any other CM requests. Then maximum 3 CP-DATA retransmissions may occur. After a duration of 60 s after the first CP-DATA not acknowledged the SS initiates channel release. The 60 s is sufficient time to wait to verify that the MS does not send more than the maximum CP-DATA retransmissions
"

> And I think this should be in Gecko because MMS behaves like this. If MMS
> didn't behave like this, I'd be fine with doing it in Gaia, but I think we
> need some consistency here.

[1] https://hg.mozilla.org/mozilla-central/file/dd32e3ff3717/dom/system/gonk/ril_worker.js#l3960
Flags: needinfo?(btseng)
(In reply to Alexandre LISSY :gerard-majax from comment #9)
> (In reply to Julien Wajsberg [:julienw] (PTO April 6th) from comment #8)
> > I want to add that this happens to Alexandre Lissy every day when he travels
> > by train, and tis happens to me at least once every week when I travel with
> > subway.
> 
> Moreover, it's not something that happens with Android because thet keep
> retrying for quite some time before really failing like we do. Hence, our UX
> is very crappy.

Just check the implementation in AOSP, the only difference of handling SMS_FAIL_RETRY is as what I mentioned in comment 10 that AOSP introduce a 2-second timer for next retry instead of sending it immediately. [1] 
I think this could be a better way to improve the sending experience.

[1] http://androidxref.com/5.1.0_r1/xref/frameworks/opt/telephony/src/java/com/android/internal/telephony/SMSDispatcher.java#667
Component: Gaia::SMS → RIL
Given the amount of retry is 3 and the 60sec delay, maybe we can wait 20sec instead of 2sec?
(In reply to Alexandre LISSY :gerard-majax from comment #9)
> (In reply to Julien Wajsberg [:julienw] (PTO April 6th) from comment #8)
> > I want to add that this happens to Alexandre Lissy every day when he travels
> > by train, and tis happens to me at least once every week when I travel with
> > subway.
> 
> Moreover, it's not something that happens with Android because thet keep
> retrying for quite some time before really failing like we do. Hence, our UX
> is very crappy.

After further comparison of the implementation in Android & the 3GPP standards.
I don't think there is too much difference for sending SMS between Android & Gecko.
The 2-sec delay in AOSP doesn't help too much.
It's more about modem's capability when the network coverage is bad.
The reasons are:
1. The SMS_SEND_FAIL_RETRY is mapped from the error code of "332" in +CMS Error according to ril.h in AOSP which means "network timeout" in +CMS Error according to 3GPP 27.005.
2. In 3GPP 24.011, there is a timer TR1M in RP-layer and is expected to be implemented by modem.
   The suggested value of this timer is in range of 35-45 sec.
3. That means when we got SMS_SEND_FAIL_RETRY, the outgoing SMS PDU has been sent to the network but was timeout at least 35 seconds.
4. We retransmit 3 times at most in framework layer in both Android[1]/Gecko[2] when receiving SMS_SEND_FAIL_RETRY from modem.
5. For other error causes, failure shall be notified immediately [1][2].

[1] http://androidxref.com/5.1.0_r1/xref/frameworks/opt/telephony/src/java/com/android/internal/telephony/SMSDispatcher.java#667
[2] https://hg.mozilla.org/mozilla-central/file/dd32e3ff3717/dom/system/gonk/ril_worker.js#l3960
Do you know which error we have when we have no network? (network is enabled but we are in a "no coverage" zone ?

What are the risks of willfully breaking the spec ?
(In reply to Julien Wajsberg [:julienw] (PTO April 6th) from comment #14)
> Do you know which error we have when we have no network? (network is enabled
> but we are in a "no coverage" zone ?
+CMS Error in 3GPP 27.005 defines lots of error cause in SMS-TL/RL.
For no coverage, it could be "331 - no network service."
In AOSP, all the error causes are defined as mentioned in comment 2:
 *  RADIO_NOT_AVAILABLE
 *  SMS_SEND_FAIL_RETRY - means retry (i.e. error cause is 332)
 *  FDN_CHECK_FAILURE
 *  GENERIC_FAILURE - means no retry (i.e. error cause is 500)
For RADIO_NOT_AVAILABLE, after checking in advance, it's not related to "no coverage", it means the modem(radio) is off/resetting instead.
Hence, the only option for framework to retry the SMS is SMS_SEND_FAIL_RETRY.
In addition, for "no coverage", in AOSP, the implementation just reject the send request according to the voice state without sending the request to the modem. [1]

> What are the risks of willfully breaking the spec ?
This could possibly break GCF/PTCRB test cases.
IMO, the retry mechanism is already part of 3GPP standard in 3GPP 24.011 (SMS-RL) and shall be supported by each certified modem and additional retry might break the times of retry restricted in the spec.
That's why I said that the failure of sending SMS in bad network condition is more related to modem & RF capability.

[1] http://androidxref.com/5.1.0_r1/xref/frameworks/opt/telephony/src/java/com/android/internal/telephony/SMSDispatcher.java#1256
Bevis, is it possible that we don't do the first sending if we detect that the voice data is not available? So that we can wait to have coverage for ~2min to do the first sending?

If we can do that we would not break the spec and this would be a better behavior for the user.
(In reply to Julien Wajsberg [:julienw] (PTO April 6th) from comment #16)
> Bevis, is it possible that we don't do the first sending if we detect that
> the voice data is not available? So that we can wait to have coverage for
> ~2min to do the first sending?
> 
> If we can do that we would not break the spec and this would be a better
> behavior for the user.

I'd suggest to have this be decided by modem.
Otherwise, we remove the possibility for the requests to be available at lower protocol layers in this condition.
Seriously, I don't care how you implement this. All I see is that the user experience on the same device (Z3 Compact) is crap on B2G on this point, compared to Android.
(In reply to Alexandre LISSY :gerard-majax from comment #18)
> Seriously, I don't care how you implement this. All I see is that the user
> experience on the same device (Z3 Compact) is crap on B2G on this point,
> compared to Android.

Are you saying that the SMS sending by Z3 Compact with B2G is worse than the same Z3 Compact with Android?

This is strange because there is no too much difference according to the comparison of the implementation in comment 13. Unless Z3 has their own modification different from AOSP.

In addition, is the SMS sending is also that bad with Flame in the same environment?
If no, this problem is more specific to Z3 compact instead.
(In reply to Bevis Tseng[:bevistseng][:btseng] from comment #19)
> (In reply to Alexandre LISSY :gerard-majax from comment #18)
> > Seriously, I don't care how you implement this. All I see is that the user
> > experience on the same device (Z3 Compact) is crap on B2G on this point,
> > compared to Android.
> 
> Are you saying that the SMS sending by Z3 Compact with B2G is worse than the
> same Z3 Compact with Android?
> 
> This is strange because there is no too much difference according to the
> comparison of the implementation in comment 13. Unless Z3 has their own
> modification different from AOSP.
> 
> In addition, is the SMS sending is also that bad with Flame in the same
> environment?
> If no, this problem is more specific to Z3 compact instead.

Not pretty sure if it's possible to have logcat with RIL debugging enabled for further analysis.
Frankly, I'm tired: 20 comments and you still do not want to understand.
(In reply to Bevis Tseng[:bevistseng][:btseng] from comment #19)
> (In reply to Alexandre LISSY :gerard-majax from comment #18)
> > Seriously, I don't care how you implement this. All I see is that the user
> > experience on the same device (Z3 Compact) is crap on B2G on this point,
> > compared to Android.
> 
> Are you saying that the SMS sending by Z3 Compact with B2G is worse than the
> same Z3 Compact with Android?
> 
> This is strange because there is no too much difference according to the
> comparison of the implementation in comment 13. Unless Z3 has their own
> modification different from AOSP.

Even my HTC Desire Z running Gingerbread was not causing any issue in the same conditions ...

> 
> In addition, is the SMS sending is also that bad with Flame in the same
> environment?
> If no, this problem is more specific to Z3 compact instead.

It's not limited to Z3 Compact. I had the same behavior on B2G in general, whatever the device.
And sorry, I don't use my Flame for this so I cannot state how it works.

And *yes* I might send SMS often when I'm transiently out of coverage. But I'm 100% convinced it's the job of the Gecko-level APIs to handle this for us, and only pop up errors in last resort to Gaia.
(In reply to Alexandre LISSY :gerard-majax from comment #22)
> (In reply to Bevis Tseng[:bevistseng][:btseng] from comment #19)
> > (In reply to Alexandre LISSY :gerard-majax from comment #18)
> > > Seriously, I don't care how you implement this. All I see is that the user
> > > experience on the same device (Z3 Compact) is crap on B2G on this point,
> > > compared to Android.
> > 
> > Are you saying that the SMS sending by Z3 Compact with B2G is worse than the
> > same Z3 Compact with Android?
> > 
> > This is strange because there is no too much difference according to the
> > comparison of the implementation in comment 13. Unless Z3 has their own
> > modification different from AOSP.
> 
> Even my HTC Desire Z running Gingerbread was not causing any issue in the
> same conditions ...
I was there when HTC Deire Z is under development so I know that the logic of sending SMS is the same.....
> 
> > 
> > In addition, is the SMS sending is also that bad with Flame in the same
> > environment?
> > If no, this problem is more specific to Z3 compact instead.
> 
> It's not limited to Z3 Compact. I had the same behavior on B2G in general,
> whatever the device.
> And sorry, I don't use my Flame for this so I cannot state how it works.
> 
> And *yes* I might send SMS often when I'm transiently out of coverage. But
> I'm 100% convinced it's the job of the Gecko-level APIs to handle this for
> us, and only pop up errors in last resort to Gaia.

I know it's good to have additional retry logic in gecko if there is no risk.
However, this could breaks the retry rules that have been done by modem as mentioned earlier....
Seems like nothing can be done, we will keep a crappy user experience.
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → INVALID
Bevis, I also have the issue sometimes when sending SMS in the subway
(The subway in Paris lacks coverage in some places but I don't know exactly where).

If everything has been done in Gecko and we can't do more, I guess we should do it in Gaia. But for this we'll need a local DB so that we can save more message states.
Status: RESOLVED → REOPENED
Component: RIL → Gaia::SMS
Resolution: INVALID → ---
Whiteboard: [sms-most-wanted]
(In reply to Julien Wajsberg [:julienw] from comment #25)
> Bevis, I also have the issue sometimes when sending SMS in the subway
> (The subway in Paris lacks coverage in some places but I don't know exactly
> where).
> 
> If everything has been done in Gecko and we can't do more, I guess we should
> do it in Gaia. But for this we'll need a local DB so that we can save more
> message states.

Hi Julien,

IMHO, to improve this sms sending experience, what we have to do firstly is to know why the SMS was failed to sent instead of retry sending SMS anytime when SMS was failed to send.
It will be very helpful to have logcat with RIL related debugging flags enabled for further analysis.

The reason for the logs is to see if there is any additional information available from the log for us to identify the scenario of "lacks coverage" you mentioned when symptom happened.

Then, we could improve it more precisely instead of retry the SMS every time when sms is failed to send.

In addition, the detailed reasons why I didn't suggest to retry SMS automatically in comment 15 is that
1. Usually, the retry has already been done by the error cause of SMS_SEND_FAIL_RETRY, and
2. For GENERIC_FAILURE, according to its documentation which means we should not do any retry automatically.
3. I understand that it's good the align the retry we have done in MMS. 
   For MMS, we can design the retry we want because all the transactions are handled in gecko.
   However, for SMS, the retry condition is actually handled by modem. :(

Hence, to prevent potentially breaking any GCF/PTCRB test cases, 
another solution is that 
we could display a confirmation dialog to the user for next retry with the new system messages of "sms-failed" we have done in bug 1138264 instead of auto-retry and this can be applied for both MMS/SMS.

How do you think? :)
(In reply to Bevis Tseng[:bevistseng][:btseng] from comment #26)
> (In reply to Julien Wajsberg [:julienw] from comment #25)
> > Bevis, I also have the issue sometimes when sending SMS in the subway
> > (The subway in Paris lacks coverage in some places but I don't know exactly
> > where).
> > 
> > If everything has been done in Gecko and we can't do more, I guess we should
> > do it in Gaia. But for this we'll need a local DB so that we can save more
> > message states.
> 
> Hi Julien,
> 
> IMHO, to improve this sms sending experience, what we have to do firstly is
> to know why the SMS was failed to sent instead of retry sending SMS anytime
> when SMS was failed to send.
> It will be very helpful to have logcat with RIL related debugging flags
> enabled for further analysis.
>
This can be enabled in m-c without rebooting the device by toggling "RIL Output in ADB" in "Developer" Menu. (ril_worker related logs won't be enabled after device reboot, because setWorkerDebugFlag() is only invoked when "ril.debugging.enabled" is changed [1], not at initialization stage. We should fire another bug to address this.)

[1] https://dxr.mozilla.org/mozilla-central/source/dom/system/gonk/RadioInterfaceLayer.js#1464
In some other industries, I know they use a "certification-compliant" specific configuration to run the tests. I think that we could put such auto-retry behind such pref, so that the user can enable it.
Here's my feedback from carefully testing this morning. Devices were two Z3 Compact running a KitKat base system (one Android, one current Gecko master), with Free Mobile SIM card.

The protocol was the following: find a moment/place were network coverage is bad and forces off-network for some time. That's the case at some places inside the high speed train along the line I'm commuting with. That's also the case when getting off the subway toward the office.

First run was in the high speed train, with Z3 Compact running Android. At the moment of bad network coverage, sending a SMS.
What we can observe is:
 0. Network level is very very low (no bar)
 1. SMS is marked as Sending
 2. Network drops (red cross icon)
 3. SMS is still marked as Sending
 4. For the next 90+ secs (I have not checked at the precise second but what is important is the timeframe), network is missing because the SIM card is searching either a home network or a roaming network. Right around this place, there is no Free Mobile (208-15) coverage, so the SIM will search for Orange F (208-01) to get back network
 5. Network gets back
 6. Over a few seconds the SMS is marked as "Sent" (removing of the "Sending" label)

The second run was while getting out of the subway, in Paris, with Z3 Compact running B2G master.
What we can observe is:
 0. Network level is low (1-0 bar)
 1. SMS is marked as Sending
 2. Network drops
 3. SMS is still marked as Sending for a couple of seconds (10-20, but much shorter than step 4 in the previous run)
 4. SMS is marked as in Failed sending
 5. Forcing resend.
 6. Repeat steps 3-5 twice
 7. On the third or fourth attempt, network gets back and the SMS is sent.

While in the subway, we are also under Orange F coverage. When getting out, we are loosing the network and the SIM starts searching again for its home or roaming network. The office is on an edge of Free Mobile coverage and so when you're getting out of the subway, you might get better network from Orange F cells, hence forcing the system to attach to it rather than to get back to its home plmn.
Bevis, what do you think? Same device running on Android and B2G, 2 different behaviors...

Do you think the SMS application is defering the initial sending because there is no network?
Flags: needinfo?(btseng)
(In reply to Julien Wajsberg [:julienw] from comment #30)
> Bevis, what do you think? Same device running on Android and B2G, 2
> different behaviors...
> 
> Do you think the SMS application is defering the initial sending because
> there is no network?

No, from Lissy's description, for both devices, the SMS was sent when signal strength is poor(Step 1) before network is dropped (Step 2-4).

However, for the one running B2G, it's still hard to tell if the error of SMS_SEND_FAIL_RETRY was reported from modem or not and how many times it was reported.
We expect to take as long time (more than 90 seconds) as Z3C takes to retry the SMS if SMS_SEND_FAIL_RETRY was reported 3 times and the timer in modem is set to in range of 35-45 secs per suggested in 3GPP 24.011.

Unfortunately, the Z3C was unavailable yet in TPE office.
Otherwise, I could try to get some ril-related logs in a static place where network condition is bad or put the device into multiple metal boxes to see what's actually going on in Z3C. :(

I'll do some test on flame first to see how the SMS retry works with our implementation.

Keep this NI to have more feedback later.
If the current logging of RIL is not enough you know you can provide me a patch ... but given the STR it will take quite some time ...
(In reply to Alexandre LISSY :gerard-majax from comment #32)
> If the current logging of RIL is not enough you know you can provide me a
> patch ... but given the STR it will take quite some time ...

Hi Lissy,

The logging in RIL shall be enough, all you need is to set DEBUG_ALL to true in ril_consts.js.
https://dxr.mozilla.org/mozilla-central/source/dom/system/gonk/ril_consts.js#17

Then, 
For B2G, capture the main log with "adb logcat -v threadtime -b main" all the time with the steps in comment 29.
For the Android, ensure that the radio log is enabled (That means you can see the RIL_REQUEST_XXX/UNSOL_XXX with "adb logcat -v threadtime -b radio") and capture the radio log with the same steps in comment 29. (Hope these RIL messages are enabled in Z3C.)
(It will be more easier if you can reproduce this is a static place with bad network condition instead of the subway.) :p

With these 2 logs, we might be able to see what's the difference between Z3C(Android) and Z3C(B2G) more clearly.

Thanks!
Flags: needinfo?(btseng)
(In reply to Bevis Tseng[:bevistseng][:btseng] from comment #33)
> Then, 
> For B2G, capture the main log with "adb logcat -v threadtime -b main" all
> the time with the steps in comment 29.
> For the Android, ensure that the radio log is enabled (That means you can
> see the RIL_REQUEST_XXX/UNSOL_XXX with "adb logcat -v threadtime -b radio")
> and capture the radio log with the same steps in comment 29. (Hope these RIL
> messages are enabled in Z3C.)

Moreover, I think we should capture both main log & radio log from both Android & B2G implementation 
on Z3C to see what's actually going on in application layer & modem.

> (It will be more easier if you can reproduce this is a static place with bad
> network condition instead of the subway.) :p
> 
> With these 2 logs, we might be able to see what's the difference between
> Z3C(Android) and Z3C(B2G) more clearly.
> 
> Thanks!
That won't be possible: I can probably get the B2G part if I'm lucky enough with LogShake, but I don't have any way to do the same capture on Android.
Alexandre, why can't you try this with the phone plugged in your computer using USB ? As I understand you do this on the train and you use your computer on the train too, and you control when you can send the message. So I don't see why you couldn't fetch the logs like this?
So I could reproduce on B2G: E_GENERIC_FAILURE. I was not able to reproduce on Android, the only moment I could somewhat get close to the same state was when I was already out of network coverage, and in this case the message got explicitely queued.
If E_GENERIC_FAILURE (which means "no more retry" according to the description in ril.h) is the only error cause when sending SMS in bad network condition (not no coverage), then there seem not much we can do in gecko. :(

But, I think we still can improve this by having a handy dialogue prompted to the user to trigger the retry directly without going into message app.
So from attachment 8597198 [details] which expose what happened when I tried to send a SMS from Android just after loosing network, we can see:
 - I'm getting a GENERIC_FAILURE too, I can't tell 100% if it's the same
 - SmsReceiver code is waiting a ServiceState-related intent
 - Once it receives the Intent, it will do a SEND_SMS
 - the SEND_SMS succeeeds

SmsReceiver code can be browsed at: https://android.googlesource.com/platform/packages/apps/Mms/+/master/src/com/android/mms/transaction/SmsReceiverService.java

We can specifically see the handleServiceStateChanged() method which waits for the ServiceState to be online (i.e., network works) to try to dequeue the SMS by calling sendFirstQueuedMessage().

This database is fed by SmsMessageSender code: https://android.googlesource.com/platform/packages/apps/Mms/+/master/src/com/android/mms/transaction/SmsMessageSender.java

Specifically, the sendMessage() method of SmsMessageSender does put everything in the queue via queueMessage() method.

Coming back at SmsReceiverService, we can also see that it listens for the ACTION_SEND_MESSAGE Intent, upon which it will also call the dequeuing method.

We can also check the handleSmsSent() method. Specifically, we can see that it will check if the message failed to be sent because of those errors:
 - ERROR_RADIO_OFF || ERROR_NO_SERVICE: it registers for listeninig to service state changes and be able to resend when networks gets back. The string being used is "R.string.message_queued" which matches my experience earlier,
 - ERROR_FDN_CHECK_FAILURE: it just fails and shows a toaster,
 - in other cases, it marks the message as failed and goes to the next one in queue.
[Tracking Requested - why for this release]:

(In reply to Alexandre LISSY :gerard-majax from comment #40)
> So from attachment 8597198 [details] which expose what happened when I tried
> to send a SMS from Android just after loosing network, we can see:
>  - I'm getting a GENERIC_FAILURE too, I can't tell 100% if it's the same
>  - SmsReceiver code is waiting a ServiceState-related intent
>  - Once it receives the Intent, it will do a SEND_SMS
>  - the SEND_SMS succeeeds
> 
> SmsReceiver code can be browsed at:
> https://android.googlesource.com/platform/packages/apps/Mms/+/master/src/com/
> android/mms/transaction/SmsReceiverService.java
> 
> We can specifically see the handleServiceStateChanged() method which waits
> for the ServiceState to be online (i.e., network works) to try to dequeue
> the SMS by calling sendFirstQueuedMessage().
> 
> This database is fed by SmsMessageSender code:
> https://android.googlesource.com/platform/packages/apps/Mms/+/master/src/com/
> android/mms/transaction/SmsMessageSender.java
> 
> Specifically, the sendMessage() method of SmsMessageSender does put
> everything in the queue via queueMessage() method.
> 
> Coming back at SmsReceiverService, we can also see that it listens for the
> ACTION_SEND_MESSAGE Intent, upon which it will also call the dequeuing
> method.
> 
> We can also check the handleSmsSent() method. Specifically, we can see that
> it will check if the message failed to be sent because of those errors:
>  - ERROR_RADIO_OFF || ERROR_NO_SERVICE: it registers for listeninig to
> service state changes and be able to resend when networks gets back. The
> string being used is "R.string.message_queued" which matches my experience
> earlier,
>  - ERROR_FDN_CHECK_FAILURE: it just fails and shows a toaster,
>  - in other cases, it marks the message as failed and goes to the next one
> in queue.

This is more like to be the case you observed.
But for Android, this is actually handled by the MMS App instead of the framework.

Because from the log, there is only one SEND_SMS request to the modem right after the network is registerred:
04-24 10:55:10.024  1306  1463 D RILJ    : [250405]< VOICE_REGISTRATION_STATE {2, null, null, 0, null, null, null, 0, null, null, null, null, null, 0, null}
04-24 10:55:39.314  1306  1463 D RILJ    : [250421]< VOICE_REGISTRATION_STATE {5, 3806, 004bfce4, 3, null, null, null, 0, null, null, null, null, null, 0, 85}
04-24 10:55:39.514  1306  7229 D RILJ    : [250431]> SEND_SMS
04-24 10:55:39.050  1306  1463 D RILJ    : [250434]< VOICE_REGISTRATION_STATE {5, 3806, 004bfce4, 3, null, null, null, 0, null, null, 0, null, null, 0, 85}
04-24 10:55:39.070  1306  1463 D RILJ    : [250439]< VOICE_REGISTRATION_STATE {5, 3806, 004bfce4, 3, null, null, null, 0, null, null, 0, null, null, 0, 85}
04-24 10:55:39.070  1306  1463 D RILJ    : [250444]< VOICE_REGISTRATION_STATE {5, 3806, 004bfce4, 3, null, null, null, 0, null, null, 0, null, null, 0, 85}
04-24 10:55:40.330  1306  1463 D RILJ    : [250431]< SEND_SMS { mMessageRef = 140, mErrorCode = -1, mAckPdu = null}

For B2G, I think this might be able to be improved in SmsService in gecko to get more information from MobileConnectionService before sending SMS.
Component: Gaia::SMS → RIL
Thanks Lissy & Julien for collecting this information to clarify how the retry was done in AOSP.
This helps a lot to move forward of the sms sending problem in no network condition. :)
Wesley, can you please see if we can get a fix for this asap? It is a hard blocker for the Z3C builds that need to be ready in the next few weeks.
Flags: needinfo?(whuang)
Let me keep my NI for now, before information clarified clearly.
This bug is about modifying current behavior to improve UX, so that seems to be a feature to me.
I would like to know the blocking criteria of Z3C build (Spark) which is being cooked as I know.
Would also like to see what the project timeline looks like.
(In reply to Wesley Huang [:wesley_huang] (EPM) (NI me) from comment #44)
> Let me keep my NI for now, before information clarified clearly.
> This bug is about modifying current behavior to improve UX, so that seems to
> be a feature to me.
> I would like to know the blocking criteria of Z3C build (Spark) which is
> being cooked as I know.
> Would also like to see what the project timeline looks like.

I'd like to take this chance to list what tasks has to be done here in details to improve 
the retry logic in gecko:
1. [ Move current retry logic to SmsService for better implementation of SMS retry ]
   Currently, the SEND_SMS request to modem comes in 2 different way:
   a. Requested from DOM. (Initial request)
   b. internally sent by ril_worker when SMS_SEND_FAIL_RETRY reported by modem in 
      previous SEND_SMS request. (Retrial request)
2. [ Postpone both initial and trial requests in 1) if voice registration state
   is not registered and, then, resume the requests until the registration state 
   is changed to 'registered' ]
   a. Inside SmsService, we need to introduce a new queue per radioInterface to 
      keep all suspended requests when voice registration state is not 'register'.
   b. MobileConnectionService provides APIs to get voice state and listen to the 
      change of voice state.
   c. when voice state changed to 'registered', we shall resume all the suspended
      requests to modem.
3. [ Possible Error handling for postponed sending requests ]
   a. When the suspended sending message is deleted by the user, we shall observe
      the deleted topic from MobileMessageDB to remove the suspended requests to
      prevent extra cost to the user.
   b. When air plane mode is turned off, we shall flush all the suspended request
      and set all delivery status to failed in to DB and report RADIO_DISABLED_ERROR
      to the caller.
   c. With this improvement, it means it could take longer time to get the response
      of a DOMRequest. Hence, we should robust our error handling when reporting a 
      response back to the DOMRequest whose parentWindow was dismissed.
(In reply to Bevis Tseng[:bevistseng][:btseng] from comment #45)
> 3. [ Possible Error handling for postponed sending requests ]
2 more error handling is required:
   d. An timer shall be introduced if voice state was not registered for a long time
      to cancel all these suspended requests and report error to the caller.
   e. We might have to bite a wake lock to ensure the retry is workable
      when device is suspended without USB cable connected. (More testing is needed)
[Blocking Requested - why for this release]:
nominate per comment 43
blocking-b2g: --- → spark?
(In reply to Candice Serran (:cserran) from comment #43)
> Wesley, can you please see if we can get a fix for this asap? It is a hard
> blocker for the Z3C builds that need to be ready in the next few weeks.

Is there more background to this request? Since we've never really had a scope to work off of I'm curious as to why it's a hard blocker.
Assignee: nobody → sawang
OS: Linux → Gonk (Firefox OS)
Hardware: x86_64 → ARM
Flags: needinfo?(cserran)
agreed not blocking based on comments and feedback. Removing flag. Will contact wesley and bruce separately via email for product backlog addition
blocking-b2g: spark? → ---
Flags: needinfo?(cserran)
tracking-b2g: --- → +
Flags: needinfo?(whuang)
Just so that things are clear for everyone, and given the noise on this bug already: this behavior is highly irritating. As a dogfooder, this is exactly the kind of issue that already pushed/refrained me from using my device as much as I want, and forcing me to use something else.
Keywords: dogfood
These 2 issues are IMO orthogonal to this issue.
No longer blocks: 1161519, 1161521
See Also: → 1161519, 1161521
Nice Samael! Thanks!
Attachment #8604483 - Attachment is obsolete: true
Attachment #8604493 - Attachment description: Part 1: Update retry logic in SmsService and remove the retry in ril_worker → Part 1(v2): Update retry logic in SmsService and remove the retry in ril_worker
Attachment #8604494 - Attachment description: Part 2: Add a maironette test case → Part 2(v2): Add a maironette test case
Attachment #8603982 - Attachment is obsolete: true
Attachment #8603983 - Attachment is obsolete: true
Attachment #8604493 - Flags: review?(btseng)
Attachment #8604494 - Flags: review?(btseng)
Attachment #8605038 - Flags: review?(btseng)
Comment on attachment 8604493 [details] [diff] [review]
Part 1(v2): Update retry logic in SmsService and remove the retry in ril_worker

Review of attachment 8604493 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for improving the SMS retry flow.

It seems better to move the retry logic to a new defined module and 
register listener to mobileconnection when needed to reduce annoying notification 
that we are not interested in.

Please see my comments below for more detailed explanation.

::: dom/mobilemessage/gonk/SmsService.js
@@ +131,5 @@
> +  this._connectionListeners = [];
> +  for (let i = 0; i < gMobileConnectionService.numItems; i++) {
> +    let connListener = new MobileConnectionListener(i, this);
> +    connListener.register();
> +    this._connectionListeners.push(connListener);

Per offlined discussion, please have a standalone module per service id, for example, gSmsRetryHandler with XPCOMUtils.defineLazyGetter() to have Ci.nsIMobileConnectionListener & nsIObserver implemented to
1. register/unregister MobileConnection when needed.
2. have the Deleted/Shutdown logic by itself.
3. Move _pendingMessageRequests[]/_sendPendingMessageRequests/_removePendingMessageRequests into this handler.

@@ +319,5 @@
> +        for (let j = 0; j < _pendingMessageRequests[i].length; j++) {
> +          for (let id of aMessageIds) {
> +            if (_pendingMessageRequests[i][j].message.id === id) {
> +              if (DEBUG) debug("Deleting message with id=" + id);
> +              _pendingMessageRequests[i].splice(j, 1);

This is a leak here for the pending requests.

We should notify failure of the request as what have done at [1] with error cause set to Ci.nsIMobileMessageCallback.NOT_FOUND_ERROR and broadcast system message for non-silent sms.

It would be nice if it's doable to unify the logic of notifying error into single implementation. :)

[1] https://dxr.mozilla.org/mozilla-central/source/dom/mobilemessage/gonk/SmsService.js#293-327

@@ +353,5 @@
> +      this._pendingMessageRequests[aServiceId]
> +        .push({ message: aDomMessage,
> +                silent: aSilent,
> +                options: aOptions,
> +                request: aRequest });

return here to bail out earlier without else {}

@@ +374,5 @@
>      let requestStatusReport = aOptions.requestStatusReport;
>  
> +    // Retry count for GECKO_ERROR_SMS_SEND_FAIL_RETRY
> +    if (!aOptions.retryCount) {
> +      aOptions.retryCount = 0;

I think we can have all required properties initiated at 
https://dxr.mozilla.org/mozilla-central/source/dom/mobilemessage/gonk/SmsService.js#865-873

@@ +1227,5 @@
> +        if (DEBUG) {
> +          debug("Observe " + kSmsDeletedObserverTopic + ": " + JSON.stringify(aSubject));
> +        }
> +        if (aSubject && aSubject.deletedMessageIds)
> +          this._removePendingMessageRequests(aSubject.deletedMessageIds);

Move this into SmsRetryHandler.

@@ +1239,5 @@
>          Services.obs.removeObserver(this, kDiskSpaceWatcherObserverTopic);
> +        Services.obs.removeObserver(this, kSmsDeletedObserverTopic);
> +
> +        for (let connListener of this._connectionListeners) {
> +          connListener.unregister();

Ditto.
Attachment #8604493 - Flags: review?(btseng)
Comment on attachment 8605038 [details] [diff] [review]
Part 3: Add owner window checks on DOM object operations. r=btseng

Review of attachment 8605038 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM. Thank you!
Attachment #8605038 - Flags: review?(btseng) → review+
Comment on attachment 8604494 [details] [diff] [review]
Part 2(v2): Add a maironette test case

Review of attachment 8604494 [details] [diff] [review]:
-----------------------------------------------------------------

There are couple of questions still need to be clarified inlined.

::: dom/mobilemessage/tests/marionette/test_outgoing_unstable_voice_connection.js
@@ +67,5 @@
> +        is(message.body, expectedBody, "message.body: expected '" + expectedBody
> +          + "'' but got '" + message.body + "'");
> +
> +        // resolve when all messages are appeared "sent"
> +        if (eventCount == aMessages.length) {

How can you predict that there won't be 4th sent event?

@@ +79,5 @@
> +    }
> +  });
> +}
> +
> +startTestBase(function testCaseMain() {

Is there any reason why startTestBase() is used instead of startTestCommon()?

@@ +83,5 @@
> +startTestBase(function testCaseMain() {
> +  let verifySentPromise;
> +  return ensureMobileMessage()
> +    .then(() => ensureMobileConnection())
> +    .then(() => pushPrefEnv({ set: [['dom.sms.requestStatusReport', true]] }))

pushPrefEnv() defined in head.js shall be used to construct the promise chain.

@@ +87,5 @@
> +    .then(() => pushPrefEnv({ set: [['dom.sms.requestStatusReport', true]] }))
> +    .then(() => setEmulatorVoiceStateAndWait("unregistered"))
> +    .then(() => sendMessagesAndVerifySending())
> +    // emulate that user deletes the second message
> +    .then((aMessageIds) => manager.delete(aMessageIds[1]))

deleteMessagesById() in head.js shall be used to construct the promise chain.

@@ +93,5 @@
> +    .then(() => {
> +      let messages = MESSAGES.slice();
> +      messages.splice(1, 1);
> +      verifySentPromise = verifyMessagesSent(messages);
> +      return Promise.resolve();

why don't we just have the following as last promise chain?
.then(() => verifyMessagesSent(MESSAGES.slice().splice(1,1));
Attachment #8604494 - Flags: review?(btseng)
Attachment #8604493 - Attachment is obsolete: true
Attachment #8604494 - Attachment is obsolete: true
Attachment #8608031 - Attachment description: Part 1: Update retry logic in SmsService and remove the retry in ril_worker → Part 1(v3): Update retry logic in SmsService and remove the retry in ril_worker
Attachment #8608031 - Flags: review?(btseng)
Attachment #8608032 - Attachment description: Part 2: Add a maironette test case → Part 2(v3): Add a maironette test case
Attachment #8608032 - Flags: review?(btseng)
Comment on attachment 8608031 [details] [diff] [review]
Part 1(v3): Update retry logic in SmsService and remove the retry in ril_worker

Review of attachment 8608031 [details] [diff] [review]:
-----------------------------------------------------------------

Overall, the design looks good to me now.
However, there are several topics to be improved as mentioned inline.

Thanks!

::: dom/mobilemessage/gonk/SmsService.js
@@ +317,5 @@
> +          },
> +          onCancel: () => {
> +            aRequest.notifySendMessageFailed(
> +              Ci.nsIMobileMessageCallback.NOT_FOUND_ERROR, aDomMessage);
> +            Services.obs.notifyObservers(aDomMessage, kSmsFailedObserverTopic, null);

As mentioned in previous patch, we should notify error with the DOMMessage in correct delivery status and also update the delivery state into MobileMessageDB if aSilent is false. [1]

Let's also try to unify the redundant logic in [1] and here.

[1] https://dxr.mozilla.org/mozilla-central/source/dom/mobilemessage/gonk/SmsService.js#293-327

@@ +325,5 @@
> +      return;
> +    }
> +
> +    // Send all pending requests before current request.
> +    gSmsRetryHandlers.getHandlerByServiceId(aServiceId).retry();

I expect this to be triggered in notifyVoiceChanged().

This make the _trySendToTheAir() becomes weird to me. :(

@@ +1349,5 @@
> +          }
> +        }
> +        break;
> +      case NS_XPCOM_SHUTDOWN_OBSERVER_ID:
> +        this._ensureMoboConnObserverUnregistration();

I think we should also flush all the pending requests here as well with error cause to NO_SIGNAL_ERROR when device is shutting down.
I think we can call the onCancel() with specified ERROR cause for all pending requests.

@@ +1356,5 @@
> +        break;
> +    }
> +  },
> +
> +  // Unused nsIMobileConnectionListener methods.

nits: To be more clear, put this all callbacks of nsIMobileConnectionListener together instead.
It looks strange to me to have nsIObserver implementation inserted between these MobileConnectionListener callbacks.
Attachment #8608031 - Flags: review?(btseng)
Comment on attachment 8608032 [details] [diff] [review]
Part 2(v3): Add a maironette test case. r=btseng

Review of attachment 8608032 [details] [diff] [review]:
-----------------------------------------------------------------

Well done! :)
Attachment #8608032 - Flags: review?(btseng) → review+
Attachment #8608601 - Attachment is obsolete: true
Attachment #8608627 - Attachment is obsolete: true
Attachment #8608031 - Attachment is obsolete: true
Attachment #8608633 - Attachment description: Part 1: Update retry logic in SmsService and remove the retry in ril_worker → Part 1(v4): Update retry logic in SmsService and remove the retry in ril_worker
Attachment #8608633 - Flags: review?(btseng)
Comment on attachment 8608633 [details] [diff] [review]
Part 1(v4): Update retry logic in SmsService and remove the retry in ril_worker

Review of attachment 8608633 [details] [diff] [review]:
-----------------------------------------------------------------

Some SMS Marionette tests failed on try server. Checking the reason.
Attachment #8608633 - Flags: review?(btseng)
Comment on attachment 8608633 [details] [diff] [review]
Part 1(v4): Update retry logic in SmsService and remove the retry in ril_worker

Review of attachment 8608633 [details] [diff] [review]:
-----------------------------------------------------------------

We are close to the target.
Please see my comment below the have all error handling unified in one place and 
notify failure with the DOMMessage that contains updated delivery status.

Thanks!

::: dom/mobilemessage/gonk/SmsService.js
@@ +290,5 @@
>    },
>  
> +  _notifySendingError: function(aErrorCode, aSendingMessage, aSilent, aRequest) {
> +    if (aSilent) {
> +      aRequest.notifySendMessageFailed(aErrorCode, aSendingMessage);

As mentioned in previous review, since aSendingMessage is read-only, we should recreate a new DOMMessage with failure delivery status for the notification.
https://dxr.mozilla.org/mozilla-central/source/dom/mobilemessage/gonk/SmsService.js#299-311

@@ +296,5 @@
> +    }
> +
> +    if (aErrorCode === Ci.nsIMobileMessageCallback.NOT_FOUND_ERROR) {
> +      aRequest.notifySendMessageFailed(aErrorCode, aSendingMessage);
> +      Services.obs.notifyObservers(aSendingMessage, kSmsFailedObserverTopic, null);

Ditto.

@@ +313,5 @@
> +        Ci.nsISmsMessenger.NOTIFICATION_TYPE_SENT_FAILED, aDomMessage);
> +      aRequest.notifySendMessageFailed(aErrorCode, aDomMessage);
> +      Services.obs.notifyObservers(aDomMessage, kSmsFailedObserverTopic, null);
> +    });
> +  },

As mentioned in previous review, the the error handling at |_sendToTheAir| shall also be refactored with this newly created method of _notifySendingError:
https://dxr.mozilla.org/mozilla-central/source/dom/mobilemessage/gonk/SmsService.js#293-328
Attachment #8609257 - Attachment is obsolete: true
Attachment #8610041 - Attachment description: Part 1: Update retry logic in SmsService and remove the retry in ril_worker → Part 1(v5): Update retry logic in SmsService and remove the retry in ril_worker
Attachment #8610041 - Flags: review?(btseng)
Attachment #8608633 - Attachment is obsolete: true
Comment on attachment 8610041 [details] [diff] [review]
Part 1(v5): Update retry logic in SmsService and remove the retry in ril_worker

Review of attachment 8610041 [details] [diff] [review]:
-----------------------------------------------------------------

r=me after the comment inlined addressed.

Thanks!

::: dom/mobilemessage/gonk/SmsService.js
@@ +327,5 @@
> +                                               0,
> +                                               aSendingMessage.read));
> +      Services.obs.notifyObservers(aSendingMessage, kSmsFailedObserverTopic, null);
> +      return;
> +    }

After thinking this twice, sorry to say that I'd like to change my mind to reduce the redundant code as followed.

How do you think?

    if (aSilent ||
      aErrorCode === Ci.nsIMobileMessageCallback.NOT_FOUND_ERROR) {
      // There is no way to modify nsIDOMMozSmsMessage attributes as they
      // are read only so we just create a new sms instance to send along
      // with the notification.
      let domMessage = 
        gMobileMessageService.createSmsMessage(aSendingMessage.id,
                                               aSendingMessage.threadId,
                                               aSendingMessage.iccId,
                                               DOM_MOBILE_MESSAGE_DELIVERY_ERROR,
                                               RIL.GECKO_SMS_DELIVERY_STATUS_ERROR,
                                               aSendingMessage.sender,
                                               aSendingMessage.receiver,
                                               aSendingMessage.body,
                                               aSendingMessage.messageClass,
                                               aSendingMessage.timestamp,
                                               0,
                                               0,
                                               aSendingMessage.read);

      aRequest.notifySendMessageFailed(aErrorCode, domMessage);

      if (!aSilent) {
        Services.obs.notifyObservers(aSendingMessage, kSmsFailedObserverTopic, null);
      }

      return;
    }

@@ +1208,5 @@
>    return tuples;
>  };
>  
> +function SmsSendingSchedular(aServiceId) {
> +  this._servicdeId = aServiceId;

s/_servicdeId/_serviceId

@@ +1246,5 @@
> +      }
> +    }
> +  },
> +
> +  _queue: [],

Sorry, I think I missed this one in previous review.

This shall be initiated at Constructor instead.
Otherwise, in multi-sim scenario, Multiple instances of SmsSendingSchedular shares the same queue when listening to the voice state from specified serviceId.
Attachment #8610041 - Flags: review?(btseng) → review+
Attachment #8610074 - Attachment description: Part 1: Update retry logic in SmsService and remove the retry in ril_worker → Part 1(v6): Update retry logic in SmsService and remove the retry in ril_worker. r=btseng
Attachment #8610074 - Flags: review+
Attachment #8610041 - Attachment is obsolete: true
Attachment #8608032 - Attachment description: Part 2(v3): Add a maironette test case → Part 2(v3): Add a maironette test case. r=btseng
Attachment #8605038 - Attachment description: Part 3: Add owner window checks on DOM object operations → Part 3: Add owner window checks on DOM object operations. r=btseng
This is working very well, I'm hitting the bad behavior much much much less often now. There may be some need for a little bit tunning because there are still cases where I get into the error message flow while we should be able to avoid it. But so far, this is already very coool :)
Keywords: dogfoodfoxfood
Depends on: 1221199
You need to log in before you can comment on or make changes to this bug.