Closed Bug 924729 Opened 11 years ago Closed 11 years ago

[Flatfish]:Socket bind() fail in UeventPoller initialization

Categories

(Firefox OS Graveyard :: Gaia::System, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

(blocking-b2g:1.3+)

RESOLVED FIXED
1.3 Sprint 6 - 12/6
blocking-b2g 1.3+

People

(Reporter: vliu, Assigned: vliu)

References

Details

(Whiteboard: [Flatfish only][developer+])

Attachments

(1 file, 8 obsolete files)

This bug is created according to the below link.

https://bugzilla.mozilla.org/show_bug.cgi?id=911117#c12

Here to collect some information.

UeventPoller failed to initialize itself because bind() returns -EADDRINUSE.  UeventPoller uses netlink socket to read uevents(same as Android) without the need of filling out address or port.  A possible EADDRINUSE error in netlink socket bind() is bind 2 or more netlink sockets with the same pid.  In the manpage of netlink(7), it says the following:

There are two ways to assign  nl_pid to a netlink socket. If the application sets nl_pid before calling bind(2), then it is up to the application to make sure that nl_pid is unique.  If the application sets it to 0, the kernel takes care of assigning it.  The kernel assigns the process ID to the first netlink socket the process opens and assigns a unique nl_pid to every netlink socket that the process  subsequently creates.
Blocks: 911117, flatfish
blocking-b2g: --- → koi?
Set |saddr.nl_pid = getpid()| or |saddr.nl_pid = pthread_self() << 16 | getpid()| both have risk if more than 2 socket connection in a process or in a thread. From netlink(7), set saddr.nl_pid to 0 and let kernel takes care of assigning it seems more safety. 

Hi Cervantes, do you have any comment for that?
Flags: needinfo?(cyu)
Letting the kernel set the address for us automatically sounds better. The problem is why would there be multiple threads using netlink within the b2g process?
Flags: needinfo?(cyu)
hi Cervantes,

What would the next steps be here? What is the user experience here if left without a fix?
Flags: needinfo?(cyu)
(In reply to Cervantes Yu from comment #2)
> Letting the kernel set the address for us automatically sounds better. The
> problem is why would there be multiple threads using netlink within the b2g
> process?

This is also the question in my mind. Do you have a better way to look into it? Actually I didn't see this problem in another device but only Flatfish.
(In reply to Preeti Raghunath(:Preeti) from comment #3)
> hi Cervantes,
> 
> What would the next steps be here? What is the user experience here if left
> without a fix?

I don't think there would be impact to user experience because this is only on flatfish, which is still not released. Other devices are not affected by this bug.

I think we need to investigate the problem using gdb to find out why we failed in bind()ing the netlink socket.
Flags: needinfo?(cyu)
Steeven can you help make a call on this ?

This looks like flatfish only ?
Flags: needinfo?(styang)
Whiteboard: flats
this is a function failed issue, according to triage result, it should be Koi+
Assignee: nobody → vliu
blocking-b2g: koi? → koi+
Flags: needinfo?(styang)
Whiteboard: flats → [Flatfish only]
Blocks: 934881
Whiteboard: [Flatfish only] → [Flatfish only][developer+]
Attached patch bug-924729-fix.patch (obsolete) — Splinter Review
(In reply to Vincent Liu[:vliu] from comment #0)
> This bug is created according to the below link.
> 
> https://bugzilla.mozilla.org/show_bug.cgi?id=911117#c12
> 
> Here to collect some information.
> 
> UeventPoller failed to initialize itself because bind() returns -EADDRINUSE.
> UeventPoller uses netlink socket to read uevents(same as Android) without
> the need of filling out address or port.  A possible EADDRINUSE error in
> netlink socket bind() is bind 2 or more netlink sockets with the same pid. 
> In the manpage of netlink(7), it says the following:
> 
> There are two ways to assign  nl_pid to a netlink socket. If the application
> sets nl_pid before calling bind(2), then it is up to the application to make
> sure that nl_pid is unique.  If the application sets it to 0, the kernel
> takes care of assigning it.  The kernel assigns the process ID to the first
> netlink socket the process opens and assigns a unique nl_pid to every
> netlink socket that the process  subsequently creates.

From the netlink(7), nl_pid should set to 0 and let kernel takes care of assigning it.
Attachment #827892 - Flags: review?(dhylands)
Comment on attachment 827892 [details] [diff] [review]
bug-924729-fix.patch

Review of attachment 827892 [details] [diff] [review]:
-----------------------------------------------------------------

r-'ing because I'd like to see the answers to my questions.

::: hal/gonk/UeventPoller.cpp
@@ +119,4 @@
>  
> +  if (bind(mSocket.get(), (struct sockaddr *)&saddr, sizeof(saddr)) == -1) {
> +    close(mSocket.get());
> +    mSocket.rwget() = -1;

Why did you do this just for the bind error path?
Why not jut return false and let the destructor release the socket?

If you were going to close the socket, you should call mSocket.dispose()
The above code doesn't deal with -EINTR return codes from close.
Attachment #827892 - Flags: review?(dhylands) → review-
Since the above is a ScopedClose, it will automatically call dispose() as soon as you return from the function. So you should just return false; and let ScopedClose do the right thing.

So the only line that should be changed is line 118, which should set saddr.nl_pid = 0;
Setting saddr_nl_pid = 0 is a more flexible, but we still haven't had the answer to why there are multiple threads accessing the netlink socket. If the netlink socket is opened from some vendor blob and this is inevitable, then we should make this change. Otherwise, I am afraid we are working around the problem and hide the real problem behind this patch.
as we confirmed that Flatfish will use v1.3 and we need to respect current testing cycle, set target milestone to 12/6 since 1.3FC tag is not created yet.
blocking-b2g: koi+ → 1.3+
Target Milestone: --- → 1.3 Sprint 6 - 12/6
I think this is caused by pid is used twice. 
So may be we should use pid=0 to bind a socket to prevent this kind of error.

From gdb log we can see that b2g pid 1765 was used by "hwcomposer.flatfish.so" to bind a socket; 
That's why UeventPoller in b2g can't use b2g pid to bind socket.

------------------------------
pid of b2g is 1765
------------------------------
$ adb shell b2g-ps
APPLICATION      USER     PID   PPID  VSIZE  RSS     WCHAN    PC         NAME
b2g              root     1765  1764  69328  19488 ffffffff 41a69004 t /system/b2g/b2g


------------------------------
gdb log
------------------------------
(gdb) print "%d", getpid()
$1 = 1765
(gdb) bt
#0  bind () at bionic/libc/arch-arm/syscalls/bind.S:8
#1  0x42ab2b10 in ?? () from /home/kaizhen/Project/a31-v1.2/out/target/product/flatfish/system/vendor/lib/hw/hwcomposer.flatfish.so
#2  0x42ab2e70 in ?? () from /home/kaizhen/Project/a31-v1.2/out/target/product/flatfish/system/vendor/lib/hw/hwcomposer.flatfish.so
#3  0x400b73dc in __thread_entry (func=0x42ab2e50, arg=0x42ab8084, tls=0x42bb8f00) at bionic/libc/bionic/pthread.c:204
#4  0x400b6ac8 in pthread_create (thread_out=0x42bb8f00, attr=0x400eaf80, start_routine=0x42ab2e50, arg=<optimized out>)
    at bionic/libc/bionic/pthread.c:348
#5  0x00000000 in ?? ()
(gdb) 




(gdb) print "%d", getpid()
$2 = 1765
(gdb) bt
#0  bind () at bionic/libc/arch-arm/syscalls/bind.S:8
#1  0x414257fa in mozilla::hal_impl::NetlinkPoller::OpenSocket (this=<optimized out>) at ../../gecko/hal/gonk/UeventPoller.cpp:120
#2  0x414258a6 in mozilla::hal_impl::UeventInitTask::Run (this=<optimized out>) at ../../gecko/hal/gonk/UeventPoller.cpp:151
#3  0x4160fa9a in MessageLoop::RunTask (this=0x430a5df8, task=0x40418628)
    at ../../../gecko/ipc/chromium/src/base/message_loop.cc:338
#4  0x4161040a in MessageLoop::DeferOrRunPendingTask (this=<optimized out>, pending_task=<optimized out>)
    at ../../../gecko/ipc/chromium/src/base/message_loop.cc:346
#5  0x41611476 in DoWork (this=<optimized out>) at ../../../gecko/ipc/chromium/src/base/message_loop.cc:446
#6  MessageLoop::DoWork (this=0x430a5df8) at ../../../gecko/ipc/chromium/src/base/message_loop.cc:425
#7  0x41612f76 in base::MessagePumpLibevent::Run (this=0x4044bc40, delegate=0x430a5df8)
    at ../../../gecko/ipc/chromium/src/base/message_pump_libevent.cc:311
#8  0x4160fa2e in MessageLoop::RunInternal (this=<optimized out>) at ../../../gecko/ipc/chromium/src/base/message_loop.cc:220
#9  0x4160fada in RunHandler (this=0x430a5df8) at ../../../gecko/ipc/chromium/src/base/message_loop.cc:213
#10 MessageLoop::Run (this=0x430a5df8) at ../../../gecko/ipc/chromium/src/base/message_loop.cc:187
#11 0x416127e2 in base::Thread::ThreadMain (this=0x4044c440) at ../../../gecko/ipc/chromium/src/base/thread.cc:160
#12 0x41613334 in ThreadFunc (closure=<optimized out>) at ../../../gecko/ipc/chromium/src/base/platform_thread_posix.cc:39
#13 0x400b73dc in __thread_entry (func=0x4161332d <ThreadFunc(void*)>, arg=0x4044c440, tls=0x430a5f00)
    at bionic/libc/bionic/pthread.c:204
#14 0x400b6ac8 in pthread_create (thread_out=0x430a5f00, attr=0xbed96928, start_routine=0x4161332d <ThreadFunc(void*)>, 
    arg=<optimized out>) at bionic/libc/bionic/pthread.c:348
#15 0x430a5df8 in ?? ()
Another option would be to use gettid()
(In reply to Kai-Zhen Li from comment #13)
> I think this is caused by pid is used twice. 
> (gdb) print "%d", getpid()
> $1 = 1765
Are you sure this is a netlink socket, *and* its nl_pid is current process id?
I don't see them from the above gdb session. After all you call getpid() from gdb, but there is no sign that hwcomposer.flatfish.so does the same.
(In reply to Cervantes Yu from comment #15)
> Are you sure this is a netlink socket, *and* its nl_pid is current process
> id?
> I don't see them from the above gdb session. After all you call getpid()
> from gdb, but there is no sign that hwcomposer.flatfish.so does the same.

We don't have the source of hwcomposer.flatfish.so, I'll ask the vendor to check what is done in it.
(In reply to Kai-Zhen Li from comment #16)
> (In reply to Cervantes Yu from comment #15)
> > Are you sure this is a netlink socket, *and* its nl_pid is current process
> > id?
> > I don't see them from the above gdb session. After all you call getpid()
> > from gdb, but there is no sign that hwcomposer.flatfish.so does the same.
> 
> We don't have the source of hwcomposer.flatfish.so, I'll ask the vendor to
> check what is done in it.

You don't need to. You can use gdb to verify by inspecting the parameter in calling bind().
(In reply to Cervantes Yu from comment #17)
> 
> You don't need to. You can use gdb to verify by inspecting the parameter in
> calling bind().

When I try to print something in gdb, I can see something like this.
So I think hwcomposer.flatfish.so bound a socket with b2g pid.
16 -> AF_NETLINK
1595 -> pid of b2g

-----
(gdb) print *0x42c56ea0@10
$4 = {16384, 0, 0, 16, 1595, -1, 0, 0, 0, 0}
(gdb) where
#0  bind () at bionic/libc/arch-arm/syscalls/bind.S:8
#1  0x42b52b10 in ?? ()
   from /home/kaizhen/Project/a31-v1.2/out/target/product/flatfish/system/vendor/lib/hw/hwcomposer.flatfish.so
#2  0x42b52e70 in ?? ()
   from /home/kaizhen/Project/a31-v1.2/out/target/product/flatfish/system/vendor/lib/hw/hwcomposer.flatfish.so
#3  0x401573dc in __thread_entry (func=0x42b52e50, arg=0x42b58084, tls=0x42c58f00) at bionic/libc/bionic/pthread.c:204
#4  0x40156ac8 in pthread_create (thread_out=0x42c58f00, attr=0x4018af80, start_routine=0x42b52e50, arg=<optimized out>)
    at bionic/libc/bionic/pthread.c:348
#5  0x00000000 in ?? ()
(gdb) c
I also tried to set break point at bind.S:6 and observed the the value of sa_family. I think it has enough information to point out that hwcomposer.flatfish.so had ever called bind(). If the root cause is confirmed, we should talk with partner to fix it in hwcomposer.flatfish.so.

----
Breakpoint 2, bind () at bionic/libc/arch-arm/syscalls/bind.S:8
8	    ldr     r7, =__NR_bind
(gdb) p *(struct sockaddr *) $r1
$1 = {sa_family = 16, sa_data = "\000\000\312\v\000\000\377\377\377\377\000\000\000"}
(gdb) bt
#0  bind () at bionic/libc/arch-arm/syscalls/bind.S:8
#1  0x42aecb10 in ?? ()
   from /home/vliu-laptop/projs/a31-422/out/target/product/flatfish/system/vendor/lib/hw/hwcomposer.flatfish.so
#2  0x42aece70 in ?? ()
   from /home/vliu-laptop/projs/a31-422/out/target/product/flatfish/system/vendor/lib/hw/hwcomposer.flatfish.so
#3  0x400fe3dc in __thread_entry (func=0x42aece50, arg=0x42af2084, tls=0x42bf2f00)
    at bionic/libc/bionic/pthread.c:204
#4  0x400fdac8 in pthread_create (thread_out=0x42bf2f00, attr=0x40131f80, start_routine=
    0x42aece50, arg=<optimized out>) at bionic/libc/bionic/pthread.c:348
#5  0x00000000 in ?? ()
(gdb) call getpid()
$2 = 3018
---------

(In reply to Dave Hylands [:dhylands] from comment #14)
> Another option would be to use gettid()

I'd ever tried {saddr.nl_pid = pthread_self() << 16 | getpid()} and it also works for me. Please see comment 1 for detail. But even so, I still think set |saddr.nl_pid = 0| is better than others.
I tried to create a netlink socket binding pid before hwcomposer init for checking whether hwcomposer is binding the socket with pid.(please refer to the attach patch 0001-The-test-code-to-verify-whether-hwcomposer-is-using-.patch).
In the experiment, hwcomposer could be initialized successfully even there was a netlink socket binding with b2g pid. This proved that hwcomposer is using nl_pid=0 to bind the socket.
Why it is binding with b2g pid in FxOS? It is because hwcomposer creates the 1st netlink socket in b2g process. According to the man page of netlink(http://manpages.ubuntu.com/manpages/hardy/man7/netlink.7.html):
       If the application sets it to 0, the kernel takes care of assigning it.
       The kernel assigns the process ID  to  the  first  netlink  socket  the
       process  opens and assigns a unique nl_pid to every netlink socket that
       the process subsequently creates.
Since hwcomposer creates the 1st socket in b2g process, it is reasonable the socket binds with b2g pid.
I think there is no change required in hwcomposer about the creation of netlink.
I use this patch to create a netlink socket binding pid before hwcomposer init for checking whether hwcomposer is binding the socket with pid
Hi, Jeff,

It's clear to see hwcomposer.flatfish.so bind the socket with pid from the following log.

---
Breakpoint 1, bind () at bionic/libc/arch-arm/syscalls/bind.S:8
8	    ldr     r7, =__NR_bind
(gdb) p *(struct sockaddr_nl*)$r1
$1 = {nl_family = 16, nl_pad = 0, nl_pid = 1949, nl_groups = 4294967295}
(gdb) where
#0  bind () at bionic/libc/arch-arm/syscalls/bind.S:8
#1  0x42af4b10 in ?? () from /home/kaizhen/Project/a31-v1.2/out/target/product/flatfish/system/vendor/lib/hw/hwcomposer.flatfish.so
#2  0x42af4e70 in ?? () from /home/kaizhen/Project/a31-v1.2/out/target/product/flatfish/system/vendor/lib/hw/hwcomposer.flatfish.so
#3  0x400f83dc in __thread_entry (func=0x42af4e50, arg=0x42afa084, tls=0x42bfaf00) at bionic/libc/bionic/pthread.c:204
#4  0x400f7ac8 in pthread_create (thread_out=0x42bfaf00, attr=0x4012bf80, start_routine=0x42af4e50, arg=<optimized out>)
    at bionic/libc/bionic/pthread.c:348
#5  0x00000000 in ?? ()
(gdb)
(In reply to Jeff Chuang from comment #20)
> I tried to create a netlink socket binding pid before hwcomposer init for
> checking whether hwcomposer is binding the socket with pid.(please refer to
> the attach patch
> 0001-The-test-code-to-verify-whether-hwcomposer-is-using-.patch).
> In the experiment, hwcomposer could be initialized successfully even there
> was a netlink socket binding with b2g pid. This proved that hwcomposer is
> using nl_pid=0 to bind the socket.
> Why it is binding with b2g pid in FxOS? It is because hwcomposer creates the
> 1st netlink socket in b2g process. According to the man page of
> netlink(http://manpages.ubuntu.com/manpages/hardy/man7/netlink.7.html):
>        If the application sets it to 0, the kernel takes care of assigning
> it.
>        The kernel assigns the process ID  to  the  first  netlink  socket 
> the
>        process  opens and assigns a unique nl_pid to every netlink socket
> that
>        the process subsequently creates.
> Since hwcomposer creates the 1st socket in b2g process, it is reasonable the
> socket binds with b2g pid.
> I think there is no change required in hwcomposer about the creation of
> netlink.

I don't think it can prove that hwcomposer is using nl_pid=0 because we don't know what will happen once hwcomposer binds fail. But from gdb info we can know that the bind() with nl_pid = pid_of_b2g is called in hwcomposer.flatfish.so. I think we still need to check more in hwcomposer.flatfish.so.
Since KaiZen provided the clear information about it, I have pass the suggestion to the vendor and checking the following 2 questions about how hwcomposer bind the netlink socket:
1)How the value of nl_pid is assigned when binding the netlink socket?
2)How hwcomposer handle the errors of binding the netlink socket?
The vendor's feedback is as the following:
1)How the value of nl_pid is assigned when binding the netlink socket?
  Ans: Hwcomposer does use getpid() to assign the value of nl_pid.
2)How hwcomposer handle the errors of binding the netlink socket?
  Ans: When hwcomposer failed to bind the socket, it will report errors and exit. The system shall not work in that situation.

About the release date of the modified version, it will be around the end of Nov
Attached patch bug-924729-fix-v2.patch (obsolete) — Splinter Review
(In reply to Jeff Chuang from comment #25)
> The vendor's feedback is as the following:
> 1)How the value of nl_pid is assigned when binding the netlink socket?
>   Ans: Hwcomposer does use getpid() to assign the value of nl_pid.
> 2)How hwcomposer handle the errors of binding the netlink socket?
>   Ans: When hwcomposer failed to bind the socket, it will report errors and
> exit. The system shall not work in that situation.
> 
> About the release date of the modified version, it will be around the end of
> Nov

Thanks for your verification. We will wait for the release from partner.

Besides the fix from partner side, b2g also need this fix and let the socket bind more flexible.
Attachment #827892 - Attachment is obsolete: true
Attachment #8333650 - Flags: review?(dhylands)
Attachment #8333650 - Flags: review?(cyu)
Comment on attachment 8333650 [details] [diff] [review]
bug-924729-fix-v2.patch

Review of attachment 8333650 [details] [diff] [review]:
-----------------------------------------------------------------

I'd prefer using some value on our own instead of letting kernel select one for us automatically. gettid() sounds like a good choice, unless we bind the netlink socket on the main thread (but is an error anyway). This makes it easier to detect potential errors in gecko.
Attachment #8333650 - Flags: review?(cyu)
Attachment #8333650 - Flags: review?(dhylands) → review?(dhylands)
Comment on attachment 8333650 [details] [diff] [review]
bug-924729-fix-v2.patch

Review of attachment 8333650 [details] [diff] [review]:
-----------------------------------------------------------------

I'd be happy with either using 0 or gettid()
Attachment #8333650 - Flags: review?(dhylands) → review+
Attached patch bug-924729-fix-v3.patch (obsolete) — Splinter Review
(In reply to Cervantes Yu from comment #27)
> Comment on attachment 8333650 [details] [diff] [review]
> bug-924729-fix-v2.patch
> 
> Review of attachment 8333650 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I'd prefer using some value on our own instead of letting kernel select one
> for us automatically. gettid() sounds like a good choice, unless we bind the
> netlink socket on the main thread (but is an error anyway). This makes it
> easier to detect potential errors in gecko.

Considering the concern Cervantes said in Comment 27, we should have a mechanism to know if anyone in the other place using |saddr.nl_pid = getpid()| to assign socket address. Do you think if it is better than bug-924729-fix-v2.patch? Thanks.
Attachment #8335144 - Flags: review?(dhylands)
(In reply to Vincent Liu[:vliu] from comment #29)
> Created attachment 8335144 [details] [diff] [review]
> bug-924729-fix-v3.patch
> 
The warning message is not clear enough, and we should say something from the high-level view like "The netlink socket address is in use. Retry with auto-assigned one."
And we should also bail out from the infinite while loop just in case other unexpected error should ever happen.
Attached patch bug-924729-fix-v3.patch (obsolete) — Splinter Review
(In reply to Cervantes Yu from comment #31)
> And we should also bail out from the infinite while loop just in case other
> unexpected error should ever happen.

As the suggestion from Cervantes, adding retryCnt to bail out from the infinite while loop.
Attachment #8335144 - Attachment is obsolete: true
Attachment #8335144 - Flags: review?(dhylands)
Attachment #8335211 - Flags: review?(dhylands)
Attached patch bug-924729-fix-v3.patch (obsolete) — Splinter Review
OOps!! There is a typo err in the patch file. A modified patch was attached.
Attachment #8335211 - Attachment is obsolete: true
Attachment #8335211 - Flags: review?(dhylands)
Attachment #8335214 - Flags: review?(dhylands)
Comment on attachment 8335214 [details] [diff] [review]
bug-924729-fix-v3.patch

>   struct sockaddr_nl saddr;
>+  int retryCnt = 0;
I don't think using a retry counter here a good idea.
>   bzero(&saddr, sizeof(saddr));
>   saddr.nl_family = AF_NETLINK;
>   saddr.nl_groups = 1;
>   saddr.nl_pid = getpid();
> 
>-  if (bind(mSocket.get(), (struct sockaddr *)&saddr, sizeof(saddr)) == -1)
>-    return false;
>+  do {
>+    if (bind(mSocket.get(), (struct sockaddr *)&saddr, sizeof(saddr)) == 0) {
>+      break;
>+    } else if (errno != EADDRINUSE) {
we don't need the 'else' here because we break if we fail to bind.
>+      return false;
>+    }
Instead I think we should bail out here if saddr.nl_pid is already 0, which means we already tried to use 0 but still failed. Retrying on this doesn't give us more chance to succeed.
>+    // Once there was any other place in the same process assigning saddr.nl_pid by
>+    // getpid(), we can detect it and print warning message.
>+    printf_stderr("[WARNING]:socket address is in use. Re-assign it by kernel.");
Generally, we don't need both square bracket and colon in messages like this. And please add one space after the colon if it is chose.
We should mention that this is a netlink socket, like "The netlink socket address is in use. Rebind with one assigned automatically."
>+    saddr.nl_pid = 0;
>+    retryCnt++;
>+  } while (retryCnt < 3);
3 is quite arbitrary. I think using while (true) should suffice.
>+
>+  if (retryCnt >= 3) {
>+    printf_stderr("[ERROR]:socket address is assigned fail by kernel.");
The kernel should not fail to assign a netlink socket address. We don't need the code here.
>+    return false;
>+  }
Comment on attachment 8335214 [details] [diff] [review]
bug-924729-fix-v3.patch

Review of attachment 8335214 [details] [diff] [review]:
-----------------------------------------------------------------

r+ with nits addressed.

::: hal/gonk/UeventPoller.cpp
@@ +117,4 @@
>    bzero(&saddr, sizeof(saddr));
>    saddr.nl_family = AF_NETLINK;
>    saddr.nl_groups = 1;
>    saddr.nl_pid = getpid();

Let's use gettid(), which, for the main thread, is the same thing as getpid(), but at least its different for each thread within a process.

To the kernel pid is actually what userspace calls tid.

I'd still keep the retry logic.

@@ +117,5 @@
>    bzero(&saddr, sizeof(saddr));
>    saddr.nl_family = AF_NETLINK;
>    saddr.nl_groups = 1;
>    saddr.nl_pid = getpid();
>  

Let's use gettid() which, for the main thread, is the same thing as getpid(), but at least its different for each thread within a process.

To the kernel pid is actually what userspace calls tid

@@ +126,5 @@
> +      return false;
> +    }
> +    // Once there was any other place in the same process assigning saddr.nl_pid by
> +    // getpid(), we can detect it and print warning message.
> +    printf_stderr("[WARNING]:socket address is in use. Re-assign it by kernel.");

nit: [WARNING]:socket address is in use. Let the kernel re-assign.

@@ +132,5 @@
> +    retryCnt++;
> +  } while (retryCnt < 3);
> +
> +  if (retryCnt >= 3) {
> +    printf_stderr("[ERROR]:socket address is assigned fail by kernel.");

nit: [ERROR]:kernel failed to assign socket address.

@@ +133,5 @@
> +  } while (retryCnt < 3);
> +
> +  if (retryCnt >= 3) {
> +    printf_stderr("[ERROR]:socket address is assigned fail by kernel.");
> +    return false;    

nit: trailing space
Attachment #8335214 - Flags: review?(dhylands) → review+
Comment on attachment 8335214 [details] [diff] [review]
bug-924729-fix-v3.patch

Review of attachment 8335214 [details] [diff] [review]:
-----------------------------------------------------------------

::: hal/gonk/UeventPoller.cpp
@@ +121,5 @@
>  
> +  do {
> +    if (bind(mSocket.get(), (struct sockaddr *)&saddr, sizeof(saddr)) == 0) {
> +      break;
> +    } else if (errno != EADDRINUSE) {

Remove else after break;

@@ +129,5 @@
> +    // getpid(), we can detect it and print warning message.
> +    printf_stderr("[WARNING]:socket address is in use. Re-assign it by kernel.");
> +    saddr.nl_pid = 0;
> +    retryCnt++;
> +  } while (retryCnt < 3);

So - I agree with Cervantes. Remove retryCnt and bail if saddr.nl_pid is already 0

There is no reason for the kernel to screw this up.
Attached patch bug-924729-fix-v4.patch (obsolete) — Splinter Review
Thanks for all of your good suggestions of patch reviewing. Here is my modified patch. If you still have any nit for the patch. please leave for Comment. Thanks.
Attachment #830599 - Attachment is obsolete: true
Attachment #8333650 - Attachment is obsolete: true
Attachment #8335214 - Attachment is obsolete: true
Attachment #8337484 - Flags: review?(dhylands)
Comment on attachment 8337484 [details] [diff] [review]
bug-924729-fix-v4.patch

Review of attachment 8337484 [details] [diff] [review]:
-----------------------------------------------------------------

::: hal/gonk/UeventPoller.cpp
@@ +120,4 @@
>  
> +  do {
> +    if (bind(mSocket.get(), (struct sockaddr *)&saddr, sizeof(saddr)) == 0)
> +      break;

nit: Add braces, even though this is only a one line statement.

See: https://developer.mozilla.org/en-US/docs/Developer_Guide/Coding_Style#Control_Structures

@@ +123,5 @@
> +      break;
> +
> +    if (errno != EADDRINUSE)
> +      return false;
> +

You should add:

if (saddr.nl_pid == 0) {
  break;
}
Attachment #8337484 - Flags: review?(dhylands) → review+
Attached patch bug-924729-fix-v5.patch (obsolete) — Splinter Review
(In reply to Dave Hylands [:dhylands] from comment #38)
> Comment on attachment 8337484 [details] [diff] [review]
> bug-924729-fix-v4.patch
> 
> Review of attachment 8337484 [details] [diff] [review]:
> -----------------------------------------------------------------
> 

> You should add:
> 
> if (saddr.nl_pid == 0) {
>   break;
> }

Thanks for good suggestion.
Attachment #8337484 - Attachment is obsolete: true
Attachment #8338967 - Flags: review?(dhylands)
A question in my mind is asking me why it is considered as an error while different parts of system are trying to use netlink socket at the same time. HAL(or other 3rd party library) has it's own reason to create netlink socket because it cannot use Gonk's UventPoller in upper level.  Using multiple netlink sockets in the same process/thread is perfect legal according to netlink specification.

I'm afraid we are adding overheads to the system.
(In reply to Terry Li from comment #40)
> A question in my mind is asking me why it is considered as an error while
> different parts of system are trying to use netlink socket at the same time.
> HAL(or other 3rd party library) has it's own reason to create netlink socket
> because it cannot use Gonk's UventPoller in upper level.  Using multiple
> netlink sockets in the same process/thread is perfect legal according to
> netlink specification.
> 
> I'm afraid we are adding overheads to the system.

So my understanding of this bug is to make it NOT be considered an error.

We'll try using gettid() and if that fails, then we'll use 0 which means that the kernel will assign an available socket.
Comment on attachment 8338967 [details] [diff] [review]
bug-924729-fix-v5.patch

Review of attachment 8338967 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry - minusing because I made a bad suggestion last time, and that needs to be fixed.

::: hal/gonk/UeventPoller.cpp
@@ +128,5 @@
> +    }
> +
> +    if (saddr.nl_pid == 0) {
> +      break;
> +    }

Bah - I screwed up. I suggested using break, which is obviously incorrect.

We should return false here. The fact that we got here and nl_pid == 0 means that we tried to bind already using a socket of 0 and it failed. So trying again is pointless.

@@ +132,5 @@
> +    }
> +
> +    // Once there was any other place in the same process assigning saddr.nl_pid by
> +    // gettid(), we can detect it and print warning message.
> +    printf_stderr("The netlink socket address is in use. Let the kernel re-assign.");

Since we're going to print something, we might as well print the nl_pid that failed. I also noticed that there is a \n missing from the end of the line.

printf_stderr("UeventPoller: netlink socket %d is in use. Let the kernel re-assign.\n", saddr.np_pid);
Attachment #8338967 - Flags: review?(dhylands) → review-
(In reply to Dave Hylands [:dhylands] from comment #41)
> (In reply to Terry Li from comment #40)
> > A question in my mind is asking me why it is considered as an error while
> > different parts of system are trying to use netlink socket at the same time.
> > HAL(or other 3rd party library) has it's own reason to create netlink socket
> > because it cannot use Gonk's UventPoller in upper level.  Using multiple
> > netlink sockets in the same process/thread is perfect legal according to
> > netlink specification.
> > 
> > I'm afraid we are adding overheads to the system.
> 
> So my understanding of this bug is to make it NOT be considered an error.
> 
> We'll try using gettid() and if that fails, then we'll use 0 which means
> that the kernel will assign an available socket.

This is not the case we should retry, loop, or do error handling. We should set nl_pid = 0 on the first bind() since the simplest solution is always the best. get_tid() does not make much difference from get_pid() while a 3rd party library or HAL running in the same thread of Gonk UventPoller.  Can we think twice why the detection logic of "netlink socket is in use" is necessary in this case?
So, personally, I'd be happy to use 0 right from the start.

Up until just recently we've been using getpid without issue, but on the flatfish there is a second thread which we don't control which also used getpid.

Perhaps Cervantes can weigh in on why he'd rather that we use our own value than just let the kernel assign a socket (which we know will always work) (see comment 27)
Flags: needinfo?(cyu)
(In reply to Terry Li from comment #43)
> This is not the case we should retry, loop, or do error handling. We should
> set nl_pid = 0 on the first bind() since the simplest solution is always the
> best. get_tid() does not make much difference from get_pid() while a 3rd
> party library or HAL running in the same thread of Gonk UventPoller.  Can we
> think twice why the detection logic of "netlink socket is in use" is
> necessary in this case?

It does make difference. Using gettid() allows another thread to open netlink socket in the same process, but on the IO thread is not allowed. 3rd party libraries that we can't control running on the IO thread, the thread UeventPoller is on, should be considered as an error, or at the very least highly dangerous.

The IO thread use epoll to watch several file descriptors, including the IPC channels. Anyone on that thread should receive notification from epoll, perform nonblocking IO or any actions that will not block. It will be disastrous if anyone performs blocking IO or do something that could possibly block for some indefinite time, and ALL apps will freeze because of of it. If anyone needs to do blocking actions, it needs to collect the information it needs nonblockingly, and then post a task to another thread to do it.

If another 3rd party library needs opens a netlink socket on the IO thread *and* uses gettid() as its socket address, we can write glue code to make it work if we have access to the source code. For HAL or 3rd party library that we can't control, we just don't allow it to run on the IO thread. It's just way too dangerous, and problems like this could be very hard to debug.
Flags: needinfo?(cyu)
(In reply to Cervantes Yu from comment #45)
> 
> It does make difference. Using gettid() allows another thread to open
> netlink socket in the same process, but on the IO thread is not allowed. 3rd
> party libraries that we can't control running on the IO thread, the thread
> UeventPoller is on, should be considered as an error, or at the very least
> highly dangerous.
> 
> The IO thread use epoll to watch several file descriptors, including the IPC
> channels. Anyone on that thread should receive notification from epoll,
> perform nonblocking IO or any actions that will not block. It will be
> disastrous if anyone performs blocking IO or do something that could
> possibly block for some indefinite time, and ALL apps will freeze because of
> of it. If anyone needs to do blocking actions, it needs to collect the
> information it needs nonblockingly, and then post a task to another thread
> to do it.
> 
> If another 3rd party library needs opens a netlink socket on the IO thread
> *and* uses gettid() as its socket address, we can write glue code to make it
> work if we have access to the source code. For HAL or 3rd party library that
> we can't control, we just don't allow it to run on the IO thread. It's just
> way too dangerous, and problems like this could be very hard to debug.
If this is the case, UeventPoller should not be implemented in IO thread since we cannot control what others do and cannot afford these risks.  Non-blocking Uevent reader is also a polling design compared to blocking IO.  Besides, I don't see how we can detect the use of multiple netlink sockets in the same thread.
(In reply to Terry Li from comment #46)
> If this is the case, UeventPoller should not be implemented in IO thread
> since we cannot control what others do and cannot afford these risks. 
> Non-blocking Uevent reader is also a polling design compared to blocking IO.
Why not? UeventPoller use the IO thread and it follows the rule of using the IO thread.
It's consumers are gecko hal functions like http://dxr.mozilla.org/mozilla-central/source/hal/Hal.h#98 and the fact that it's nonblocking on the IO thread is the implementation detail hidden behind its consumers like RegisterBatteryObserver() on the main thread. If there is a new module that wants to use UeventPoller, it should also follow the rules.

Also, we don't do the polling ourselves. Polling of the socket file descriptor is multiplexed using the epoll fd on this thread. We are doing nonblocking read and then post a task to another thread. This is different from blocking IO that we don't block other IPC activities if no netlink socket data is coming.

> Besides, I don't see how we can detect the use of multiple netlink sockets
> in the same thread.
We can't. But using gettid() implicitly states what thread we are on. If anyone tries to open another netlink socket using gettid() before us, we see the warning message and might probably need to investigate. If it's after UeventPoller, it will fail. This is not much, but still better than using 0, where someone else tries to open another netlink socket on the same thread and we don't know, which is considered at least to be duplicated functionality.
(In reply to Cervantes Yu from comment #47)
> (In reply to Terry Li from comment #46)
> > If this is the case, UeventPoller should not be implemented in IO thread
> > since we cannot control what others do and cannot afford these risks. 
> > Non-blocking Uevent reader is also a polling design compared to blocking IO.
> Why not? UeventPoller use the IO thread and it follows the rule of using the
> IO thread.
> It's consumers are gecko hal functions like
> http://dxr.mozilla.org/mozilla-central/source/hal/Hal.h#98 and the fact that
> it's nonblocking on the IO thread is the implementation detail hidden behind
> its consumers like RegisterBatteryObserver() on the main thread. If there is
> a new module that wants to use UeventPoller, it should also follow the rules.
If UeventPoller has risks we cannot control, why not using another thread rather than IO thread? It's fine to follow UeventPoller's rule when we use UeventPoller, but I'm thinking about the change of UeventPoller itself to a better and safer implementation.
> 
> Also, we don't do the polling ourselves. Polling of the socket file
> descriptor is multiplexed using the epoll fd on this thread. We are doing
> nonblocking read and then post a task to another thread. This is different
> from blocking IO that we don't block other IPC activities if no netlink
> socket data is coming.
> 
> > Besides, I don't see how we can detect the use of multiple netlink sockets
> > in the same thread.
> We can't. But using gettid() implicitly states what thread we are on. If
> anyone tries to open another netlink socket using gettid() before us, we see
> the warning message and might probably need to investigate. If it's after
> UeventPoller, it will fail. This is not much, but still better than using 0,
> where someone else tries to open another netlink socket on the same thread
> and we don't know, which is considered at least to be duplicated
> functionality.
I don't think we can force HAL or 3rd party library(and all other parts of system) to use gettid(). Because of that, will nl_pid be set to the same value either by getpid() or gettid() in some specific cases? If this happens, this could be another difficult situation to debug.

Regarding duplicated functionality, there are 2 cases:

1. Duplicated in Gecko/Gonk
2. Duplicated in other cases(HAL, 3rd party libs, no source code...)

In the case 2, there's no exactly good solution to prevent it, so we adapt it. In this case, nl_pid = 0 works the best.
In the case 1, your idea of gettid() works, but it doesn't look possible since each patch will be reviewed and no duplicated functionality is allowed. Besides, using some singleton technique to prevent multiple initialization/instances is also a good choice.
(In reply to Terry Li from comment #48)
> If UeventPoller has risks we cannot control, why not using another thread
> rather than IO thread? It's fine to follow UeventPoller's rule when we use
What kind of risks we can't control? Do you have one example of this?

> UeventPoller, but I'm thinking about the change of UeventPoller itself to a
> better and safer implementation.
Safer in terms of being able to block in reading the netlink socket? Or allowing the consumers to block? If we allow the consumers to block then we open the possibility that the consumers will step on each other's toes. If it's only that UeventPoller can block but not the consumers, then it's not different than running on the IO thread. Please note that what happens from the netlink socket is probably something the apps are interested in, and we will finally make it happen on the main thread of some app. 

Of course if you can show that we can have a better and safer solutions, we are always open to improvements. :)

> I don't think we can force HAL or 3rd party library(and all other parts of
> system) to use gettid(). Because of that, will nl_pid be set to the same
> value either by getpid() or gettid() in some specific cases? If this
> happens, this could be another difficult situation to debug.
> 
We already try our best to prevent the collision by using gettid(). If we see collision with an some HAL, that could mean some potential error: why would some other thread use the same value as another thread's tid? This might mean some potential problem that'd better be resolved.
(In reply to Cervantes Yu from comment #49)
> (In reply to Terry Li from comment #48)
> > If UeventPoller has risks we cannot control, why not using another thread
> > rather than IO thread? It's fine to follow UeventPoller's rule when we use
> What kind of risks we can't control? Do you have one example of this?
You have introduced risks of IPC interference, application hang-up, ..etc in prior discussion when UeventPoller exists in IO thread.
> 
> > UeventPoller, but I'm thinking about the change of UeventPoller itself to a
> > better and safer implementation.
> Safer in terms of being able to block in reading the netlink socket? Or
> allowing the consumers to block? If we allow the consumers to block then we
> open the possibility that the consumers will step on each other's toes. If
> it's only that UeventPoller can block but not the consumers, then it's not
> different than running on the IO thread. Please note that what happens from
> the netlink socket is probably something the apps are interested in, and we
> will finally make it happen on the main thread of some app. 
Netlink socket is used for the notification of battery status change, earphone plug change, etc.  If UeventPoller creates a thread when it initializes and running its own service by blocking read these events and dispatch(like Android and other systems do), it won't be possible to effect other processes in IO thread because it lives in its own thread.  In flatfish case, if HAL or 3rd party library creates another netlink socket in HWComposer, it won't be effected as well. Only thing needs to be noted is set nl_pid to 0 to make sure it always work.
> 
> Of course if you can show that we can have a better and safer solutions, we
> are always open to improvements. :)
Just like I described above.  And thanks to be open. I'm kind of comfort you say so.
> 
> > I don't think we can force HAL or 3rd party library(and all other parts of
> > system) to use gettid(). Because of that, will nl_pid be set to the same
> > value either by getpid() or gettid() in some specific cases? If this
> > happens, this could be another difficult situation to debug.
> > 
> We already try our best to prevent the collision by using gettid(). If we
> see collision with an some HAL, that could mean some potential error: why
> would some other thread use the same value as another thread's tid? This
> might mean some potential problem that'd better be resolved.
gettid() and getpid() do not use the same numeric system and no one can guarantee their values can be always different in any situation. Only Gecko/Gonk uses gettid() to bind netlink socket sounds like a dangerous thing to me.  And "the potential error detecting system" won't work if HAL or 3rd party library uses getpid() or zero rather than using gettid().

Set nl_pid to 0 for the first bind() should do all the work. Why make it complex?
> gettid() and getpid() do not use the same numeric system and no one can
> guarantee their values can be always different in any situation.

I beg to differ.

pthread thread identifiers are certainly from a different domain than the kernel process/thread ids.

But in the kernel, the thread-id and process-id come from the same domain.
gittid() == getpid() when you're running on the main thread.
Here's a reference: http://www.win.tue.nl/~aeb/linux/lk/lk-10.html (see section 10.4 near the very end)
Dave, thank you for the clarification.  But the gettid() == getpid() only works when CLONE_THREAD flag is set when using clone().  Check manage of clone(2) or some other results here,

http://stackoverflow.com/questions/12812144/i-have-questined-about-get-tid-on-linux-system-and-result-of-getpid

Depending on kernel/pthread implementation, if
1. CLONE_THREAD is set:
   1) getpid() == gettid() on the main thread or single thread.
   2) If UeventPoller(IO thread) is running on main thread and HAL bind a netlink socket by getpid(), it always fails. I don't know which thread IO thread is currently on, but I don't want to constrain the possible future that IO thread or UeventPoller will be run in a standalone service manager. 
2. CLONE_THREAD is not set:
   1) getpid() != gettid() in every case(and pray there's no portability issues).
   2) Since getpid() != gettid() all the time, setting nl_pid = gettid() does the same thing as nl_pid = 0. And we don't need complex logic to handle error status and retry.

To sum up, using gettid() will still fail in specific normal case, which is the example of case 1. And in case 2, nl_pid = 0 is the best choice.

What do you think, Dave and Cervantes?
(In reply to Terry Li from comment #52)
>    2) If UeventPoller(IO thread) is running on main thread and HAL bind a
UeventPoller is already in the IO thread (and you mentioned that, too). That is never the main thread. We will never have getpid() == gettid().
> netlink socket by getpid(), it always fails. I don't know which thread IO
> thread is currently on, but I don't want to constrain the possible future
> that IO thread or UeventPoller will be run in a standalone service manager.
(In reply to Terry Li from comment #50)

> > We already try our best to prevent the collision by using gettid(). If we
> > see collision with an some HAL, that could mean some potential error: why
> > would some other thread use the same value as another thread's tid? This
> > might mean some potential problem that'd better be resolved.
> gettid() and getpid() do not use the same numeric system and no one can
> guarantee their values can be always different in any situation. Only
> Gecko/Gonk uses gettid() to bind netlink socket sounds like a dangerous
> thing to me.  And "the potential error detecting system" won't work if HAL
> or 3rd party library uses getpid() or zero rather than using gettid().
> 
> Set nl_pid to 0 for the first bind() should do all the work. Why make it
> complex?

Actually there doesn't have a good way to 100% guarantee eliminating all risks when there is 3rd party library which we can't control. If we simply set saddr.nl_pid = 0 and I believe there still has dangerous when 3rd party library set any value conflicting we set in IO thread. Based on this, I thinks that's the reason why the patch did. We tried to fix saddr.nl_pid value which is assigned by gettid(). Any one wants to assign the same saddr.nl_pid will generate a warning message. Maybe it can't cover all cases we just discussed but at least this is what we can do best.
(In reply to Dave Hylands [:dhylands] from comment #42)
> Comment on attachment 8338967 [details] [diff] [review]
> bug-924729-fix-v5.patch
> 
> Review of attachment 8338967 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Sorry - minusing because I made a bad suggestion last time, and that needs
> to be fixed.
> 
> ::: hal/gonk/UeventPoller.cpp
> @@ +128,5 @@
> > +    }
> > +
> > +    if (saddr.nl_pid == 0) {
> > +      break;
> > +    }
> 
> Bah - I screwed up. I suggested using break, which is obviously incorrect.
> 
> We should return false here. The fact that we got here and nl_pid == 0 means
> that we tried to bind already using a socket of 0 and it failed. So trying
> again is pointless.
> 
> @@ +132,5 @@
> > +    }
> > +
> > +    // Once there was any other place in the same process assigning saddr.nl_pid by
> > +    // gettid(), we can detect it and print warning message.
> > +    printf_stderr("The netlink socket address is in use. Let the kernel re-assign.");
> 
> Since we're going to print something, we might as well print the nl_pid that
> failed. I also noticed that there is a \n missing from the end of the line.
> 
> printf_stderr("UeventPoller: netlink socket %d is in use. Let the kernel
> re-assign.\n", saddr.np_pid);

Thanks for your kindly suggestion. Can you please help me to review again? Thanks.
Attachment #8338967 - Attachment is obsolete: true
Attachment #8342098 - Flags: review?(dhylands)
Attachment #8342098 - Flags: review?(dhylands) → review+
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/50c0d05fa250
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: