Closed Bug 1049291 Opened 10 years ago Closed 10 years ago

b2g process crashes while calling select()

Categories

(Core :: WebRTC, defect, P1)

ARM
Gonk (Firefox OS)
defect

Tracking

()

RESOLVED FIXED
mozilla34
blocking-b2g 2.0+
Tracking Status
firefox32 --- wontfix
firefox33 --- wontfix
firefox34 --- fixed
b2g-v2.0 --- fixed
b2g-v2.1 --- fixed

People

(Reporter: tkundu, Assigned: ekr)

References

Details

(Keywords: crash, Whiteboard: [caf-crash 250][caf priority: p2][b2g-crash] [CR 705517])

Crash Data

Attachments

(3 files)

Attached file stacktrace.txt
b2g process uses more than 1024 fds. We should not use select syscall inside gecko code which is called by b2g process. Reason is that select syscall has limitation of 1024 fd and it will crash during stability test .

We are seeing webrtc is using select() and b2g is crashing
[Blocking Requested - why for this release]:
blocking-b2g: --- → 2.0?
Summary: Replace select syscall with poll() → b2g process crashes while calling select()
Randall, this crash seems WebRTC-related; can you take a look?
blocking-b2g: 2.0? → 2.0+
Component: Stability → WebRTC
Flags: needinfo?(rjesup)
Keywords: crash
Product: Firefox OS → Core
Can you describe the STR for this?

Is this is Mochitests? I wonder if we're repeatedly spinning up instances of SIPCC.
Whiteboard: [b2g-crash]
Flags: needinfo?(rjesup)
The issue here is simply the existence of the select() call.  It needs to be rewritten to poll() instead.  Using select() with more than 1024 open fds can cause memory corruption and hard-to-diagnose stability issues.
(In reply to Michael Vines [:m1] [:evilmachines] from comment #4)
> The issue here is simply the existence of the select() call.  It needs to be
> rewritten to poll() instead. Using select() with more than 1024 open fds
> can cause memory corruption and hard-to-diagnose stability issues.

Can you elaborate on this? select() shouldn't be causing memory
corruption with any arguments that point to valid memory regions.
FD_SET doesn't do any bounds checking.

If you pass in a file descriptor whose integer value is >= 1024, then it will corrupt memory.
Whiteboard: [b2g-crash] → [b2g-crash] [CR 705517]
Ekr's going to take this.  We realize we need a very fast turn around (less than 1 day), but this time frame for a fix is asking a lot.
Assignee: nobody → ekr
Upon deeper investigation (thanks :jesup), there is no bug here.  The select() in question passes in no fds so there's no risk of memory corruption by an FD_SET.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → INVALID
Hmm...

(gdb) l
427	           struct cpr_timeval * RESTRICT timeout)
428	{
429	    int16_t rc;
430	    struct timeval t, *t_p;
431	
432	    if (timeout != NULL) {
433	        t.tv_sec  = timeout->tv_sec;
434	        t.tv_usec = timeout->tv_usec;
435	        t_p       = &t;
436	    } else {
(gdb) n
437	        t_p       = NULL;
(gdb) n
440	    rc = (int16_t) select(nfds, read_fds, write_fds, except_fds, t_p);
(gdb) p read_fds
$5 = (fd_set *) 0x117c15e78
(gdb) p *write_fds
$6 = {
  fds_bits = {0 <repeats 32 times>}
}
(gdb) p *read_fds
$7 = {
  fds_bits = {0, 32, 0 <repeats 30 times>}
}
(gdb) p nfds
$8 = 38
(gdb)
(In reply to Michael Vines [:m1] [:evilmachines] from comment #8)
> Upon deeper investigation (thanks :jesup), there is no bug here.  The
> select() in question passes in no fds so there's no risk of memory
> corruption by an FD_SET.

That was one specific one in DataChannels; the one in cprSelect() does pass fd's, and that was the one that triggered the stacktrace on this bug
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
The WIP patch above removes the call to cprSelect() and the FD_* macros in sip_platform_task and replaces it with a single blocking read to the IPC socket. This should be OK since I believe the other sockets are used for SIP stuff and we are not doing SIP.

The other known use of select is in cpr_*_timers_using_select.c but that is in:

  start_timer_service_loop()

called from:

  timerThread()

called from:

  cpr_timer_pre_init()

called from:

  cprPreInit()
http://dxr.mozilla.org/mozilla-central/source/media/webrtc/signaling/src/sipcc/cpr/linux/cpr_linux_init.c#183

but only if CPR_TIMERS_ENABLED is set, which it is not:

http://dxr.mozilla.org/mozilla-central/source/media/webrtc/signaling/signaling.gyp?from=signaling.gyp&case=true#671


This is all pretty quick and dirty so it needs a bunch of testing and careful review to make sure
I didn't miss anything.
Comment on attachment 8468942 [details] [diff] [review]
Remove uses of select() in SIPCC SIP task

Review of attachment 8468942 [details] [diff] [review]:
-----------------------------------------------------------------

ehugg, can you please review. This works in pc_test.html and signaling_unittests pass.
Attachment #8468942 - Flags: review?(ethanhugg)
Attachment #8468942 - Flags: feedback?(tkundu)
Comment on attachment 8468942 [details] [diff] [review]
Remove uses of select() in SIPCC SIP task

Review of attachment 8468942 [details] [diff] [review]:
-----------------------------------------------------------------

::: media/webrtc/signaling/src/sipcc/core/sipstack/sip_platform_task.c
@@ +409,5 @@
>       * - Forever-loop exits in sip_process_int_msg()::THREAD_UNLOAD
>       */
>      while (TRUE) {
> +      sip_process_int_msg();
> +      continue;

I guess we don't actually need this continue. I wrote it before I decided to #if 0 out everything.
Comment on attachment 8468942 [details] [diff] [review]
Remove uses of select() in SIPCC SIP task

Review of attachment 8468942 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good to me. Built and ran unittests and a few ad-hoc tests.   I found instances of FD_SET and FD_CLR in ccsip_platform_tcp.c but I'm pretty sure that code is never called since we don't make our own tcp connections from this code.
Attachment #8468942 - Flags: review?(ethanhugg) → review+
Attachment #8468942 - Flags: feedback?(tkundu) → feedback+
Whiteboard: [b2g-crash] [CR 705517] → [caf priority: p2][b2g-crash] [CR 705517]
Whiteboard: [caf priority: p2][b2g-crash] [CR 705517] → [caf-crash 250][caf priority: p2][b2g-crash] [CR 705517]
https://hg.mozilla.org/mozilla-central/rev/7ab18eda31e4
Status: REOPENED → RESOLVED
Closed: 10 years ago10 years ago
Resolution: --- → FIXED
Observed on: 

Device: msm8610
Gonk Version: AU_LINUX_GECKO_B2G_KK_3.6.01.04.00.000.054
Moz BuildID: 20140804000204
B2G Version: 2.0
Gecko Version: 32.0
Gaia:  http://git.mozilla.org/?p=releases/gaia.git;a=commit;h=9e5907995c9327f14cb5d182cee5ff16b1743ed4
Gecko: http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=ed87f0a54baf646eb0b20b4debc090c8016d2104
(^ that was without the patch here)
Flags: in-moztrap?(bzumwalt)
No STR is present to create test case to address bug.
QA Whiteboard: [QAnalyst-Triage?]
Flags: needinfo?(ktucker)
QA Whiteboard: [QAnalyst-Triage?] → [QAnalyst-Triage+]
Flags: needinfo?(ktucker)
Flags: in-moztrap?(bzumwalt)
Flags: in-moztrap-
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: