Closed Bug 947928 Opened 10 years ago Closed 10 years ago

[RTSP] System crash when opening RTSP for the second time (NAT problem)

Categories

(Firefox OS Graveyard :: General, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

(blocking-b2g:1.3+, firefox27 wontfix, firefox28 fixed, firefox29 fixed, b2g-v1.3 fixed)

RESOLVED FIXED
1.3 C2/1.4 S2(17jan)
blocking-b2g 1.3+
Tracking Status
firefox27 --- wontfix
firefox28 --- fixed
firefox29 --- fixed
b2g-v1.3 --- fixed

People

(Reporter: ethan, Assigned: ethan)

References

Details

(Whiteboard: [FT:RIL])

Attachments

(10 files, 4 obsolete files)

Attached file rtsp_error.log
*** Reproduce Steps ***
1. Open a simple app using "rtsp" as video source. Such as:
   <video id=home_video controls preload=metadata autoplay=true width="100%">
     <source src="rtsp://184.72.239.149/vod/mp4:BigBuckBunny_115k.mov"/>
   </video>
2. Click the play button in the control panel. The video was played normally.
3. Close the app.
4. Open the app and click the play button again.
5. After a few seconds, the whole system crashed.

*** Note ***
1. This bug can be stably reproduced in certain network environment but not in others.
2. In the logcat, the PIDs of the simple app are: 4022 (normal) and 4085 (crashed).
When the app was opened for the second time, by touching the screen we can see the duration was already calculated and displayed, which means the metadata was loaded successfully.
*** GDB Backtrace Logs ***
Gdb attached to the b2g process.

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 109.649]
__libc_android_abort () at bionic/libc/unistd/abort.c:82
82	    *((char*)0xdeadbaad) = 39;
(gdb) bt
#0  __libc_android_abort () at bionic/libc/unistd/abort.c:82
#1  0x4000944c in __android_log_assert (cond=<value optimized out>, 
    tag=0x41e3909c "AMPEG4ElementaryAssembler", fmt=0x41de1a54 "%s")
    at system/core/liblog/logd_write.c:246
#2  0x40e6b914 in android::AMPEG4ElementaryAssembler::addPacket (this=0x475ffc18, 
    source=<value optimized out>)
    at /home/ethan/workspace/mozilla/mozilla-central/netwerk/protocol/rtsp/rtsp/AMPEG4ElementaryAssembler.cpp:239
#3  0x40e6bcc4 in android::AMPEG4ElementaryAssembler::assembleMore (this=0xdeadbaad, 
    source=...)
    at /home/ethan/workspace/mozilla/mozilla-central/netwerk/protocol/rtsp/rtsp/AMPEG4ElementaryAssembler.cpp:384
#4  0x40e6d672 in android::ARTPAssembler::onPacketReceived (this=0x4972e6d0, source=...)
    at /home/ethan/workspace/mozilla/mozilla-central/netwerk/protocol/rtsp/rtsp/ARTPAssembler.cpp:41
#5  0x40e6ee24 in android::ARTPSource::processRTPPacket (this=0x47368f60, 
    buffer=<value optimized out>)
    at /home/ethan/workspace/mozilla/mozilla-central/netwerk/protocol/rtsp/rtsp/ARTPSource.cpp:86
#6  0x40e6dcea in android::ARTPConnection::parseRTP (this=<value optimized out>, 
    s=<value optimized out>, buffer=...)
    at /home/ethan/workspace/mozilla/mozilla-central/netwerk/protocol/rtsp/rtsp/ARTPConnection.cpp:491
#7  0x40e6ded4 in android::ARTPConnection::receive (this=0x446d0b20, s=0x43951c40, 
    receiveRTP=true)
    at /home/ethan/workspace/mozilla/mozilla-central/netwerk/protocol/rtsp/rtsp/ARTPConnection.cpp:402
#8  0x40e6e130 in android::ARTPConnection::onPollStreams (this=0x446d0b20)
    at /home/ethan/workspace/mozilla/mozilla-central/netwerk/protocol/rtsp/rtsp/ARTPConnection.cpp:287
#9  0x40e6e636 in android::ARTPConnection::onMessageReceived (this=0x446d0b20, msg=...)
    at /home/ethan/workspace/mozilla/mozilla-central/netwerk/protocol/rtsp/rtsp/ARTPConnection.cpp:161
#10 0x408982bc in android::ALooperRoster::deliverMessage (this=0x4089b678, msg=...)
    at frameworks/base/media/libstagefright/foundation/ALooperRoster.cpp:133
#11 0x40897e26 in android::ALooper::loop (this=<value optimized out>)
    at frameworks/base/media/libstagefright/foundation/ALooper.cpp:212
#12 0x40897f9c in android::ALooper::LooperThread::threadLoop (this=<value optimized out>)
---Type <return> to continue, or q <return> to quit---
    at frameworks/base/media/libstagefright/foundation/ALooper.cpp:47
#13 0x40156e58 in android::Thread::_threadLoop (user=<value optimized out>)
    at frameworks/base/libs/utils/Threads.cpp:834
#14 0x4015749e in thread_data_t::trampoline (t=<value optimized out>)
    at frameworks/base/libs/utils/Threads.cpp:127
#15 0x40091e18 in __thread_entry (
    func=0x40157409 <thread_data_t::trampoline(thread_data_t const*)>, arg=0x461f3890, 
    tls=<value optimized out>) at bionic/libc/bionic/pthread.c:217
#16 0x4009196c in pthread_create (thread_out=<value optimized out>, attr=0xbefc4d1c, 
    start_routine=0x40157409 <thread_data_t::trampoline(thread_data_t const*)>, 
    arg=0x461f3890) at bionic/libc/bionic/pthread.c:357
#17 0x00000000 in ?? ()
#2  0x40e6b914 in android::AMPEG4ElementaryAssembler::addPacket (this=0x475ffc18, 
    source=<value optimized out>)
    at /home/ethan/workspace/mozilla/mozilla-central/netwerk/protocol/rtsp/rtsp/AMPEG4ElementaryAssembler.cpp:239

CHECK_GE(buffer->size(), 2u);

It looks like the buffer content size is less than 2.

Hi Ethan:
Please help to dump more detail information in |ARTPSource::queuePacket|.
In the function, we can dump the buffer->size() and the rtp sequence number. By using rtp sequence number, we can track the packet captured by wireshark.
Can you also put some debug message in line 490....
http://dxr.mozilla.org/mozilla-central/source/netwerk/protocol/rtsp/rtsp/ARTPConnection.cpp#490


uint32_t seqNum = (uint32_t)buffer->int32Data();
size_t size = buffer->size();
dump("buffer size = %d seqNum = %d range = %d", size, seqNum, size - payloadOffset);

Also here, 
http://dxr.mozilla.org/mozilla-central/source/netwerk/protocol/rtsp/rtsp/ARTPConnection.cpp#391

dump("errno = %s receive bytes = %d", strerror(errno), nbytes);
Assignee: nobody → ettseng
Attached file rtsp.pcap
This pcap file was captured while reproducing this bug.
Attached file rtsp_gdb.log
Gdb backtrace log. We printed the content of buffer and it looks fine.
Attached file rtsp_gdb.log.2
Another gdb backtrace log. Note that the function call stacks while segmentation fault are different from the previous log.
The only difference of reproducing bug is, this time we ran tcpdump at the same time.
Attached file rtsp_amx_music.pcap
This pcap was captured by slightly changing the reproduction steps:
1. Connect to another streaming source (music service for AMX).
   <a href="rtsp://m.streaming.imusica.com.br/vod/mp4:takemeout?=3529931==web.local=1262=0">
2. Open the media through the video app (the b2g system still crashed).
Can we set this to 1.3+, as it looks quite important?
Flags: needinfo?(ettseng)
Whiteboard: [FT:RIL]
Yes I think so. 
We just identified the concrete scenario, which will be described in detail in the comment below.
Although this bug would not happen in the normal case, however, when it happens, the result (crash) is undesirable. I agree to set it as 1.3+.
Flags: needinfo?(ettseng)
A very concrete scenario to reproduce this bug is to drop RTP packets, such as using iptables command:
(Suppose our RTSP client uses port 15550-15553 as RTP listen ports.)
iptables -I INPUT -i wlan0 -p udp --dport 15550 -j DROP
iptables -I INPUT -i wlan0 -p udp --dport 15551 -j DROP
iptables -I INPUT -i wlan0 -p udp --dport 15552 -j DROP
iptables -I INPUT -i wlan0 -p udp --dport 15553 -j DROP

When these firewall rules are set up, the behavior of RTSP app would be like this:
1) RTSP session establishment 
   C->S: Describe, S->C: Replay 200 OK
   C->S: Setup, S->: Reply 200 OK
2) RTSP play request
   C->S: Play, S->C: Replay 200 OK
   Meanwhile, RTSP app starts a timer to count down 10 seconds to determine RTP availability.
3) Since RTP packets are dropped, the timer timed out and the app would try to "reconnect".
   This "reconnect" would teardown the previous RTSP session and describe/setup a new one.
   However, at this moment an implementation bug would cause the entire parent process crash (segmentation fault).

In summary, when RTSP session is established and play is requested, if there is no RTP packet received, the app would definitely crash 10 seconds after play request was sent.

We got a quick solution to fix the reconnect bug. 
Unfortunately, even if this is fixed and reconnect succeeded, if somehow RTP packet is coming now, the app would still crash while parsing the RTP packet. This would happen under network environment. We suspect there is some factors related the network devices (Wirelsss AP, Routers, etc.)
As described above, we have to follow up with more tests, observations and analysis.
AFAIK, we should clarify at least the following issues:
1. Why RTP packets would not be received by app when we open the app for 2nd time?
2. After we fixed the timeout-reconnect bug, why would the app still crash while parsing RTP packet?

After we clarified these issue, another question would be:
Should we replace the assertions "CHECK()" by MOZ_ASSERT(), which will be turned off in release build?
blocking-b2g: --- → 1.3+
Attached file rtsp_crash_at_rtp.pcap
Packets captured when b2g system crashed by parsing RTP packet.
A patch to print debugging messages.
** Crash Point **
The crash point is here:
http://mxr.mozilla.org/mozilla-central/source/netwerk/protocol/rtsp/rtsp/AMPEG4ElementaryAssembler.cpp#333
In AMPEG4ElementaryAssembler::addPacket()
 CHECK_LE(offset + header.mSize, buffer->size());

(offset + header.mSize) is larger than buffer->size() which is the packet payload size, and the assertion CHECK_LE() causes the abort/crash.
The reason why we get here is because the ARTPSource constructor creates a wrong assembler. For video track, it should be AAVCAssembler, not AMPEG4ElementaryAssembler.

** Incorrect RTP Port Number **
We are not 100% sure why we create a wrong assembler. However, the captured packets provide some clues about this issue.
When the crash happens, we found something goes wrong with the "client_port" parameter in RTSP Setup/Response.
Note in the following packet snippets, RTSP client (our app) specified port 15552/15553 as RTP/RTCP ports. But RTSP server replied port 15550/15551.

1) RTSP session establishment 
   C->S: Describe, S->C: Reply 200 OK
   C->S: Setup for trackID 1 (audio)
   S->C: Reply 200 OK
   C->S: Setup for trackID 2 (video)
           192.168.1.115  184.72.239.149  RTSP  291  SETUP rtsp://184.72.239.149/vod/mp4:BigBuckBunny_115k.mov/trackID=2 RTSP/1.0
           Transport: RTP/AVP/UDP;unicast;client_port=15552-15553
   S->C: Reply 200 OK
           184.72.239.149  192.168.1.115  RTSP  389  Reply: RTSP/1.0 200 OK
           Transport: RTP/AVP/UDP;unicast;client_port=15550-15551;source=184.72.239.149;server_port=7194-7195;ssrc=68A7E392
2) RTSP play request
   C->S: Play, S->C: Reply 200 OK
3) RTP packet coming
   S->C: RTP (video)
        184.72.239.149  192.168.1.115  RTP  836  PT=DynamicRTP-Type-97, SSRC=0x68A7E392, Seq=1, Time=0, Mark
	User Datagram Protocol, Src Port: 7194 (7194), Dst Port: 15550 (15550)
	Payload type: DynamicRTP-Type-97 (97)
	Synchronization Source identifier: 0x68a7e392 (1755833234)
Attached file Rtsp ok for Galaxy S2
The captured packet for Samsung Galaxy S2 in the same nat network environment.
Can you add "srand((unsigned int)time(NULL));" before the line we use rand() to get the start port ?
http://dxr.mozilla.org/mozilla-central/source/netwerk/protocol/rtsp/rtsp/ARTPConnection.cpp#118

I would like to make sure that two RTSP sessions use different client_ports.
Summary: [RTSP] System crash when opening rtsp media element → [RTSP] System crash when opening RTSP for the second time (NAT problem)
Attached patch bug-947928-fix.patch (obsolete) — Splinter Review
This patch is just a trial.
Originally the RTP/RTCP client ports would stick on 15550-15553. We make these client ports as random integers between 15550 to 20000.
(In reply to Vincent Chang[:vchang] from comment #17)
> Can you add "srand((unsigned int)time(NULL));" before the line we use rand()
> to get the start port ?
> http://dxr.mozilla.org/mozilla-central/source/netwerk/protocol/rtsp/rtsp/
> ARTPConnection.cpp#118
> I would like to make sure that two RTSP sessions use different client_ports.

Vincent, I am almost sure this is a NAT traversal problem (that's why I edited the title).

First, in conclusion. Yes, using random client ports could resolve this bug!
Once we use random ports, the scenario of this bug never happen again.
Secondly, I think the issue arise with the implementation of NAT traversal. It seems the NAT implementation determines the 2nd-time "poke a hole" packets as invalid and replies those ICMP port unreachable messages itself.

Actually I think it's a bug in the original code. As you know, the intention of the code is to use random ports.
unsigned start = (rand() * 1000)/ RAND_MAX + 15550;
However, this line would not generate random results since (rand() * 1000) may overflow int type.

I found Android 4.4 already fixed this bug. We should just fix that. :)
BTW, we don't need to call srand() ourselves. It seems it was already done somewhere in our system.
Attached patch bug-947928-fix.patch (obsolete) — Splinter Review
Fix the bug of generating random client ports.
Attachment #8351226 - Attachment is obsolete: true
Attached patch bug-947928-fix.patch (obsolete) — Splinter Review
Attachment #8351308 - Attachment is obsolete: true
Comment on attachment 8351309 [details] [diff] [review]
bug-947928-fix.patch

Vincent, could you help to review this patch?
Attachment #8351309 - Flags: review?(vchang)
Comment on attachment 8351309 [details] [diff] [review]
bug-947928-fix.patch

Review of attachment 8351309 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/protocol/rtsp/rtsp/ARTPConnection.cpp
@@ +116,5 @@
>  
>      bumpSocketBufferSize(*rtcpSocket);
>  
> +    /* rand() * 1000 may overflow int type, use long long */
> +    unsigned start = (unsigned)((rand() * 1000ll) / RAND_MAX) + 15550;

Rather than call srand(seed) somewhere in the code, I would prefer to do it here.
So that we can reduce the chance to use duplicate port numbers....
Attachment #8351309 - Flags: review?(vchang)
Attached patch bug-947928-fix.patch (obsolete) — Splinter Review
Add srand() to reduce the chance of using duplicate port numbers.
Attachment #8351309 - Attachment is obsolete: true
Attachment #8351317 - Flags: review?(vchang)
Comment on attachment 8351317 [details] [diff] [review]
bug-947928-fix.patch

Review of attachment 8351317 [details] [diff] [review]:
-----------------------------------------------------------------

Hi sworkman, we found this NAT problem under the firewall gateway environment.
Not sure if you have time to review the patch.
Attachment #8351317 - Flags: review?(vchang)
Attachment #8351317 - Flags: review?(sworkman)
Attachment #8351317 - Flags: review+
Comment on attachment 8351317 [details] [diff] [review]
bug-947928-fix.patch

Review of attachment 8351317 [details] [diff] [review]:
-----------------------------------------------------------------

Oh this looked like a tricky one. Good job on the analysis! r=me.
Attachment #8351317 - Flags: review?(sworkman) → review+
Update the comment.
Attachment #8351317 - Attachment is obsolete: true
Attachment #8355800 - Flags: review+
(In reply to Steve Workman [:sworkman] from comment #27)
> Oh this looked like a tricky one. Good job on the analysis! r=me.
Steve, thanks for your review. :)
Keywords: checkin-needed
You need to log in before you can comment on or make changes to this bug.