Closed Bug 1196221 Opened 9 years ago Closed 8 years ago

[GonkSensor] Add system service for device sensors

Categories

(Firefox OS Graveyard :: General, defect)

All
Gonk (Firefox OS)
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: tzimmermann, Assigned: tzimmermann)

References

Details

Attachments

(13 files, 1 obsolete file)

62 bytes, text/x-github-pull-request
tzimmermann
: review+
dhylands
: feedback+
Details | Review
49 bytes, text/x-github-pull-request
dhylands
: review+
Details | Review
52 bytes, text/x-github-pull-request
dhylands
: review+
Details | Review
52 bytes, text/x-github-pull-request
dhylands
: review+
Details | Review
52 bytes, text/x-github-pull-request
dhylands
: review+
Details | Review
6.05 KB, text/plain
Details
7.44 KB, patch
Details | Diff | Splinter Review
11.04 KB, patch
Details | Diff | Splinter Review
18.79 KB, text/plain
Details
33.84 KB, text/plain
Details
28.19 KB, text/plain
Details
52 bytes, text/x-github-pull-request
gsvelto
: review+
Details | Review
2.36 KB, patch
gsvelto
: review+
Details | Diff | Splinter Review
This bug is about landing the sensors daemon of bug 1194721.
Attached file Github repository (obsolete) —
Hi Gabriele,

This is the system daemon that covers the sensors API. It provides sensors functionality to Gecko without exposing details of the underlying Android drivers. Therefore I moved the device workarounds from Gecko into the daemon's source code. The data that goes in and out of the daemon should look the same on any device with a specific version of Android.

Once we have a repository for the daemon, I'll create a pull request for the bug.

Some hints: You might also want to check bug 1196046 for the PDU helper library. The overall design and implementation might not be easy to grasp at first. If you want to see a complete implementation (Gecko + daemon), you should check out how Bluetooth works (dom/bluetooth/bluedroid).

I'll upload Gecko patches into bug 1194721. Let's talk if you have questions.
Attachment #8649830 - Flags: review?(gsvelto)
The IPC protocol is documented in doc/ipc.txt.
Depends on: 1196046
Depends on: 1196727
Comment on attachment 8649830 [details]
Github repository

Updated Github tree:

  - fixed compatibility with various Android versions, from ICS to L
  - minor cleanup of out-commented code
Before I begin reviewing this... +1000 to this approach; I have hoped to get rid of the myriad of Android-version-specific workarounds for ages and this is an excellent way to do it.
A few quick notes from a cursory look I've given at the repo (I'll do a more thorough review on Monday):

- This code needs more comments, you should at least document every function exposed in the headers, but comments to describe private functions within C files would also be very helpful both for a reviewer and future contributors to this code.

- ipc.txt gives a nicely detailed description of the protocol but it would be even better if you could add a top-level README.md with some details about the daemon, how it generally works (with a pointer to ipc.txt obviously for the IPC details), what's required for building it and what kind of coding conventions are used through the code.

- If I'm not mistaken the syntax is roughly following gecko guidelines but I might be mistaken. It might be nice if you added a note in the README.md file about what syntax formatting guidelines the code is using so that we keep it consistent in the future.
(In reply to Gabriele Svelto [:gsvelto] from comment #4)
> Before I begin reviewing this... +1000 to this approach; I have hoped to get
> rid of the myriad of Android-version-specific workarounds for ages and this
> is an excellent way to do it.

Thanks. :) The code in the system daemons is a bit #ifdef'y at times, but Gecko will be free from any ifdefs. And our experience with Bluetooth shows that we rarely interfere with existing ifdefs in the daemon. We usually just fix simple bugs or add additional, independent features.

I'll take care of the points you already noted. I'd just like to wait for a bit more feedback and fix issues in larger chunks.

Just one more question, should we return a protocol version when Gecko calls init? In Bluetooth, we look at the system's Android release to decide about the supported features, so we don't need a separate protocol version. What about Sensors?
It might be prudent to add a protocol version during initialization even if we don't actually use it for anything. I think it's good practice to version any protocol going over sockets just in case one needs to change stuff down the line.
Comment on attachment 8649830 [details]
Github repository

The code looks good in general and in fact it's probably good enough so I'm
minus-ing mostly because I'd like to see more in-code documentation, formatting
issues cleaned up and some minor issues addressed before we proceed. Here's my
review comments from the more generic ones to the more specific:

* Please put a blank line between each group of includes, e.g.:

io.h
<blank>
various system includes ...
<blank>
our includes ...

* Sometimes you leave some space between statement and control flow blocks, e.g.

  const struct sensor_t* sensors;

  len = sensors_module->get_sensors_list(sensors_module, &sensors);

  for (i = 0; i < len; ++i) {
    // ...
  }

  And sometimes you don't:

  if (!wbuf) {
    return;
  }
  init_pdu(&wbuf->buf.pdu, SERVICE_POLL, OPCODE_SENSOR_DETECTED_NTF);
  if (append_sensors_t(&wbuf->buf.pdu, sensor) < 0) {
    goto cleanup;
  }

  Either way works for me but it should be consistent across the code. I find
  the first style more readable thanks to the extra space but the choice's up
  to you.

* Always put braces around the statement of an if condition, even if it's a
  one-liner. Depending on the file I've noticed that you did in some places and
  didn't in others.

* In general I'd like to see a comment in each file describing briefly what it
  contains and a quick overview of the functionality. An additional comment for
  every function explaining what it does, what inputs it takes and what it
  returns (if it returns something) would also be greatly appreciated. In
  general I try to document stuff unless it's totally trivial having in mind
  people who've never read the code yet might want to understand and modify it.

* The ARRAY_LENGTH macro is defined in three different places, better put it in
  a header file instead.

* The copyright in certain files is for 2014 and in others for 2015. I don't
  think it makes much of a difference but it would be best to make them
  coherent since it's all new code from what I can tell.

* In compiler.h you're defining the UNUSED macro as expanding to __unused__ then
  you're using it with the ATTRIBS one which will yield
  __attribute__((__unused__)) which is wrong as the correct idiom is
  __attribute__((unused)). Let's define UNUSED as __attribute__((unused))
  directly and use that in the code. Also note that I found instances of both
  ATTRIBS(UNUSED) and ATTRIBS(unused) in the code so check that they're all
  correctly spelled out.

* PDU-handling code uses the 256 constant frequently (because of the 8-bit
  service, 8-bit opcode fields), I think it would be cleaner if this was
  defined as a constant/define in the pdu library.

* I found five instances of this code:

  init_pdu(&wbuf->buf.pdu, cmd->service, cmd->opcode);
  send_pdu(wbuf);

  It's probably worth adding an init_and_send_pdu(wbuf, service, opcode) helper

* I noticed that you're detecting and handling out-of-memory conditions by
  replying with error messages, however those code-paths are very hard to test.
  Since the daemon is automatically restarted upon failure wouldn't it be better
  to wrap malloc() and friends into unfallible handlers and exit(EXIT_FAILURE)
  when hitting an out-of-memory condition? I feel it would be safer.

* In io.c there's a global called io_state plus a number of local variables with
  the same name, it would be best to rename the global (or the locals) so that
  there's no shadowing. g_io_state would be fine but I leave it up to you since
  we're not really using gecko style here. I'd use the g_ prefix everywhere
  for globals as I found it helps me while reading the code.

* Not directly related to this code but I find that having *_cleanup() function
  free a buffer a bit confusing. At first I thought they would just clear it
  and the buffer would be reused but then I had a look at the libpdu code and
  realized that was not the case.

* In io.c:320 we have this line:

  assert(NAME_OFFSET + siz <= sizeof(addr.sun_path));

  I think it would be best to strncpy()/strlcpy() the name after the initial
  nul character and detect if we're overflowing by checking the return value
  instead of asserting before. Similarly there's a few places where you're
  allocating memory as a + b where b is a variable size. I'm not sure what are
  the current best practices for detecting integer overflows are but
  we should probably consider employing them thoroughly just in case. (I've
  become very wary of integer overflows lately, as in "dear libstagefright
  developers, are your devices are belong to us")

* In io.c:uninit_io() you're calling uninit_registry() potentially without
  having called init_registry() which is conditional to the first EPOLLOUT
  event. This seems harmless know but it should be documented as such
  ("it's safe to call uninit_registry() even when init_registry() hasn't been
  called" or something like that).

* In io.c:connected_cmd_socket() you're creating a receive buffer of 64KiB,
  isn't that a little bit on the large size for the intended use?

* In io.c:handle_pdu() you're creating and sending an error PDU in response to
  unhandled message, I think it would be more readable if you abstracted this
  bit into a separate function, there's good chances it might be re-used down
  the line.

* In poll.c you allocate write buffers with the size specified as a series of
  contants with comments describing them, it would be better if we could move
  that to a macro plus documentation, e.g.

  /* PDUs for sensor detected events are made of the following fields for a
   * total of 32 bytes:
   * - A 4 bytes sensor id
   * - A 4 bytes type
   * - A 4 bytes range
   * etc ...
   */
  #define OPCODE_SENSOR_DETECTED_NTF_PDU_SIZE (32UL)

  Or if you prefer to keep the sum to ease additions have them named:

  #define OPCODE_SENSOR_DETECTED_NTF_PDU_ID_FIELD_SIZE (4UL)
  #define OPCODE_SENSOR_DETECTED_NTF_PDU_TYPE_FIELD_SIZE (4UL)
  // etc ...
  #define OPCODE_SENSOR_DETECTED_NTF_PDU_SIZE ( \
    OPCODE_SENSOR_DETECTED_NTF_PDU_ID_FIELD_SIZE + \
    OPCODE_SENSOR_DETECTED_NTF_PDU_ID_FIELD_SIZE + \
    // etc ...
    )

* As a side note, too bad that append_to_pdu() does not use the same format as
  printf otherwise you might have added static checking for parameters using

  __attribute__ ((format (printf, 2, 3)))

* In poll.c append_sensors_event_t() is a rather large function, breaking out
  some of the case statements into separate functions might make it leaner and
  more readable.

* nit: In poll.c:270 put spaces around the multiply operators

* nit: poll.c 688 is longer than 80 characters
Attachment #8649830 - Flags: review?(gsvelto) → review-
Thanks so far for the review. I'll update the code during the next days. Below are some responses and rationales.


> * I noticed that you're detecting and handling out-of-memory conditions by
>   replying with error messages, however those code-paths are very hard to
> test.
>   Since the daemon is automatically restarted upon failure wouldn't it be
> better
>   to wrap malloc() and friends into unfallible handlers and
> exit(EXIT_FAILURE)
>   when hitting an out-of-memory condition? I feel it would be safer.

What do you mean by 'safer'?

I never agreed with this school of thought, especially for system programs.

 * I'd rather like to see fail-safe behavior, rather than fail-stop. Crashing a system service always has consequences in the upper layers of the software stack, because callers need to handle the crash. (Crashing near the top of the stack has fewer consequences and is therefore generally better.) I admit that we probably won't ever achieve full fail-safe support.

 * I don't think automatic restarts would solve this particular problem. These system daemons don't consume much memory. Most of the memory is probably consumed by Gecko and the apps. If we simply restart the service, not much would have changed about the memory consumption and we'd run into the same problem again. We need to give Gecko a chance to free up memory. The least thing for Gecko to do is to cleanly shut down the service and inform the user about the problem.

 * Sensors is a trivial case, but for the more complex systems, such as Bluetooth, we keep state in Gecko that we shouldn't lose easily. Android's Bluetooth API even exports an error code for OOM situations. Killing the service in this case feels like a regression (compared to the API that allows handling the error). If we handle one system service like that, we can also handle all of them this way.


> * Not directly related to this code but I find that having *_cleanup()
> function
>   free a buffer a bit confusing. At first I thought they would just clear it
>   and the buffer would be reused but then I had a look at the libpdu code and
>   realized that was not the case.

What do you think about 'destroy_'?


> * In io.c:320 we have this line:
> 
>   assert(NAME_OFFSET + siz <= sizeof(addr.sun_path));
> 
>   I think it would be best to strncpy()/strlcpy() the name after the initial
>   nul character and detect if we're overflowing by checking the return value
>   instead of asserting before.

Yes, calling |assert| is not a good way for handling this, returning an error is more appropriate.

But |strcpy| et al are often useless functions. We need the string's length and once we have it, we can call |memcpy| et al. That's faster and standardized.

I'll move the code into a separate function and make it a bit more readable.


> Similarly there's a few places where you're
>   allocating memory as a + b where b is a variable size. I'm not sure what
> are
>   the current best practices for detecting integer overflows are but
>   we should probably consider employing them thoroughly just in case. (I've
>   become very wary of integer overflows lately, as in "dear libstagefright
>   developers, are your devices are belong to us")

I'll search through the code and check for potential integer overflows.


> * As a side note, too bad that append_to_pdu() does not use the same format
> as
>   printf otherwise you might have added static checking for parameters using
> 
>   __attribute__ ((format (printf, 2, 3)))

I agreed that this would be a nice check. The problem is that format characters for printf usually don't specify a fixed type size. In the case of our protocol, we need to make sure that certain fields have a specific number of bytes. So it was better to introduce new format characters instead of shoehorning existing ones into the API.
(In reply to Thomas Zimmermann [:tzimmermann] [:tdz] from comment #9)
>  * I'd rather like to see fail-safe behavior, rather than fail-stop.
> Crashing a system service always has consequences in the upper layers of the
> software stack, because callers need to handle the crash. (Crashing near the
> top of the stack has fewer consequences and is therefore generally better.)
> I admit that we probably won't ever achieve full fail-safe support.

The problem with this is testing. It's practically impossible to test code that handles OOMs especially because the Linux kernel allows overcommit and then you usually end up in a scenario where memory allocations do not fail but stuff starts dying left and right. Besides if you have to back-trace from an OOM you have to make sure you don't allocate anything in the backwards path. This is the case here for now but it might not be in the future, chain-of-OOM-errors (e.g. because we want to print an error once we get to the top in response to an OOM) are a typical case that's also impossible to test.

>  * I don't think automatic restarts would solve this particular problem.
> These system daemons don't consume much memory. Most of the memory is
> probably consumed by Gecko and the apps. If we simply restart the service,
> not much would have changed about the memory consumption and we'd run into
> the same problem again. We need to give Gecko a chance to free up memory.
> The least thing for Gecko to do is to cleanly shut down the service and
> inform the user about the problem.
> 
>  * Sensors is a trivial case, but for the more complex systems, such as
> Bluetooth, we keep state in Gecko that we shouldn't lose easily. Android's
> Bluetooth API even exports an error code for OOM situations. Killing the
> service in this case feels like a regression (compared to the API that
> allows handling the error). If we handle one system service like that, we
> can also handle all of them this way.

I wasn't aware of this.

> What do you think about 'destroy_'?

That's fine, free_ is also fine since that's all we're doing.

> Yes, calling |assert| is not a good way for handling this, returning an
> error is more appropriate.
> 
> But |strcpy| et al are often useless functions. We need the string's length
> and once we have it, we can call |memcpy| et al. That's faster and
> standardized.

In this case NAME_OFFSET is 1 so we can overflow only with a string of size 2^32-1 which should not be a problem but in general if you use strlcpy() this will be much simpler and always safer (strlcpy is part of bionic so we haven't got to worry about it's availability). You could turn this code:

  len = strlen(socket_name);
  siz = len + 1; /* include trailing '\0' */

  addr.sun_family = AF_UNIX;
  assert(NAME_OFFSET + siz <= sizeof(addr.sun_path));
  memset(addr.sun_path, '\0', NAME_OFFSET); /* abstract socket */
  memcpy(addr.sun_path + NAME_OFFSET, socket_name, siz);

Into this:

  addr.sun_family = AF_UNIX;
  memset(addr.sun_path, '\0', NAME_OFFSET); /* abstract socket */
  size_t rlen = strlcpy(addr.sun_path + NAME_OFFSET, socket_name, sizeof(addr.sun_path) - NAME_OFFSET);

  if (rlen >= sizeof(addr.sun_path) - NAME_OFFSET) {
    // Truncation, report an error
  }

Which is shorter, has runtime detection of the overflow and you don't have to bother about integer overflows. This is also quite general.

> I agreed that this would be a nice check. The problem is that format
> characters for printf usually don't specify a fixed type size. In the case
> of our protocol, we need to make sure that certain fields have a specific
> number of bytes. So it was better to introduce new format characters instead
> of shoehorning existing ones into the API.

Yeah, some static checking would have been nice to have but not necessary.
Hi

(In reply to Gabriele Svelto [:gsvelto] from comment #10)
> (In reply to Thomas Zimmermann [:tzimmermann] [:tdz] from comment #9)
> >  * I'd rather like to see fail-safe behavior, rather than fail-stop.
> > Crashing a system service always has consequences in the upper layers of the
> > software stack, because callers need to handle the crash. (Crashing near the
> > top of the stack has fewer consequences and is therefore generally better.)
> > I admit that we probably won't ever achieve full fail-safe support.
> 
> The problem with this is testing. It's practically impossible to test code
> that handles OOMs especially because the Linux kernel allows overcommit and
> then you usually end up in a scenario where memory allocations do not fail
> but stuff starts dying left and right. Besides if you have to back-trace
> from an OOM you have to make sure you don't allocate anything in the
> backwards path. This is the case here for now but it might not be in the
> future, chain-of-OOM-errors (e.g. because we want to print an error once we
> get to the top in response to an OOM) are a typical case that's also
> impossible to test.

I will make this change if you insist. Just let me respond and sum up my points:

 * Your first argument is: testing is hard, so let's not try this approach. But his argument is also true for the case of killing the daemon. We won't know if that works in OOM situations until we try it. And there's again no good way of testing it.

 * Sending an error without allocating memory in the daemon('s user space) can be implemented; although it's not trivial.

 * An OOM in the daemon means that there wasn't enough physical memory to establish a mapping into the daemons address space. That doesn't mean that Gecko will see the OOM as well. If Gecko has enough memory pages allocated, it might be able to handle the failure.

 * If we kill the daemon, Gecko can try to restart it. But Gecko does not know that it should also try to free up memory. Without this information, restarting won't fix the problem. I even doubt that the daemon could be restarted. There's no physical memory left after all.

 * Linux' memory overcommitment and OOM killer is completely unpredictable to us; so speculating about it doesn't really help. It might work to our advantage, or not.


> Into this:
> 
>   addr.sun_family = AF_UNIX;
>   memset(addr.sun_path, '\0', NAME_OFFSET); /* abstract socket */
>   size_t rlen = strlcpy(addr.sun_path + NAME_OFFSET, socket_name,
> sizeof(addr.sun_path) - NAME_OFFSET);
> 
>   if (rlen >= sizeof(addr.sun_path) - NAME_OFFSET) {
>     // Truncation, report an error
>   }
> 
> Which is shorter, has runtime detection of the overflow and you don't have
> to bother about integer overflows. This is also quite general.

I'm honestly not trying to piss you off or get into a flame war, but if NAME_OFFSET is larger than |addr.sun_path|, you're memset'ing out-of-bounds. Once you handle this problem with an additional 'if', you can also do the extra check for the string length as well. This will put all checks into a single place before anything gets written. The original code does this already, although |assert| should not have been used.

I also find this code hard to parse by eye.
(In reply to Thomas Zimmermann [:tzimmermann] [:tdz] from comment #11)
> I will make this change if you insist. Just let me respond and sum up my
> points:
> 
>  * Your first argument is: testing is hard, so let's not try this approach.
> But his argument is also true for the case of killing the daemon. We won't
> know if that works in OOM situations until we try it. And there's again no
> good way of testing it.
> 
>  * Sending an error without allocating memory in the daemon('s user space)
> can be implemented; although it's not trivial.
> 
>  * An OOM in the daemon means that there wasn't enough physical memory to
> establish a mapping into the daemons address space. That doesn't mean that
> Gecko will see the OOM as well. If Gecko has enough memory pages allocated,
> it might be able to handle the failure.
> 
>  * If we kill the daemon, Gecko can try to restart it. But Gecko does not
> know that it should also try to free up memory. Without this information,
> restarting won't fix the problem. I even doubt that the daemon could be
> restarted. There's no physical memory left after all.
> 
>  * Linux' memory overcommitment and OOM killer is completely unpredictable
> to us; so speculating about it doesn't really help. It might work to our
> advantage, or not.

OK, considering that the bluetooth daemon is working the same I'd say we can leave the daemon as-is but it would beneficial if we added some actual tests covering this (possibly by mocking malloc() with an LD_PRELOAD stub we can control). In this case we've got a simple scenario on our hands were snaking back from an OOM to the top-level event-loop is rather easy but if we pile up more complex code we might just end up in the situation where crawling back from an OOM actually leaves us in an inconsistent state because of a snag in the (non tested) error handling code. That's my only worry but I wouldn't hold this up just because of it.

> I'm honestly not trying to piss you off or get into a flame war, but if
> NAME_OFFSET is larger than |addr.sun_path|, you're memset'ing out-of-bounds.
> Once you handle this problem with an additional 'if', you can also do the
> extra check for the string length as well. This will put all checks into a
> single place before anything gets written. The original code does this
> already, although |assert| should not have been used.
> 
> I also find this code hard to parse by eye.

As with my comment about the OOM I'm not religious about this, I just wanted to make sure we're writing robust code and if you want to use memset + checks that's fine by my. Let's just try to make sure that pathological string sizes don't cause overflows in the checks. Again I'm not going to hold this off just because of this specific bit.
Comment

https://bugzilla.mozilla.org/show_bug.cgi?id=1194721#c17

should have been posted here

> How about handling OOMs this way:
> 
>  * Failing during startup will already quit the daemon immediately. So we
> should be fine here.
> 
>  * If we fail to allocate memory while processing commands and
> notifications, let's first try to send out an error response or
> notification. If that already fails, let the daemon itself.
Comment on attachment 8649830 [details]
Github repository

Hi Gabriele,

sorry that this update took so long. It took me several days to go through all your review comments and then I had a problem with my computer that prevented me from building.

I tried to address all the comments you had and wrote documentation and rationales for all the code.
Attachment #8649830 - Flags: review- → review?(gsvelto)
Comment on attachment 8649830 [details]
Github repository

Excellent stuff, I just spotted a few typos here and there but everything else seems in good shape:

compiler.h:18 typo fro -> for
io.c:77-78 typo "You cannot directly send PDUs directly." -> "You cannot send PDUs directly."
io.c:445 typo othe -> other
main.c:167 typo mor -> more
pdu.h:64 & pdu.h:69 function arguments are indented two spaces too far
poll.c:270 typo hary -> very

As a side note, what's the plan about landing this and getting it out? I suppose we can land at any time but we can't push this to devices without FOTA IIRC. I've missed the last FOTA meeting but will be attending the next one so I hope we'll discuss this at that time.
Attachment #8649830 - Flags: review?(gsvelto) → review+
Oh, cool! Thanks a lot.

There's one more thing that I was wondering about: recent Androids support a so-called wake-up flag for events. It signals that the driver has picked up a wake lock and expects the event to be delivered _now_.

Shall we support this? It would mean to add a flag value to the protocol for event notifications. And the daemon and Gecko had to grab additional wake locks while such an event is in the queue.
(In reply to Thomas Zimmermann [:tzimmermann] [:tdz] from comment #16)
> There's one more thing that I was wondering about: recent Androids support a
> so-called wake-up flag for events. It signals that the driver has picked up
> a wake lock and expects the event to be delivered _now_.
> 
> Shall we support this? It would mean to add a flag value to the protocol for
> event notifications. And the daemon and Gecko had to grab additional wake
> locks while such an event is in the queue.

Do we surface this in any way or form yet? If we don't I'd say we should bring it up on dev-webapi to discuss the addition and then implement it. It would be nice to have IMHO.
The feature is documented at [1].

Starting with L, there are two variants of each sensor, one that grabs a wake lock and one that doesn't. I'm not sure how we can handle this in the DOM API, since there seem to be only callbacks and events.

Even if we don't support wake ups directly, I think it makes sense to at least implement the pre-L semantics.

Bringing this topic up on dev-webapi is a good idea. Let's also add the wake-up flag, so that the IPC protocol is complete.

[1] https://source.android.com/devices/sensors/suspend-mode.html
Comment on attachment 8649830 [details]
Github repository

Updated Github tree:

  - add support for immediate delivery of events

I added support for immediate delivery of events to the IPC protocol. The sensor-detected and event notifications have an additional field named 'delivery mode' with either

  0x00: best effort
  0x01: immediate delivery

The daemon currently doesn't handle them much differently, but in a future patch it's supposed to maintain a wakelock for events on immediate-delivery sensors, and so is Gecko.

Wakelock support was only introduced with L. [1] So on earlier versions, the protocol flag mirrors the behavior of Android: proximity and significant-motion sensors/events have the immediate-delivery flag set, all others are best-effort.

[1] https://source.android.com/devices/sensors/suspend-mode.html
Attached file Github pull request
Equivalent to previous Github tree, r=gsvelto
Attachment #8649830 - Attachment is obsolete: true
Attachment #8664281 - Flags: review+
Comment on attachment 8664281 [details] [review]
Github pull request

Updated Github pull request

  - moved sensors setup and listing into |register_poll|

Testing on different devices revealed that the only safe driver call to run on the poll thread is |sensors_poll_device_t::poll|. Especially |setDelay| seems to interfere with anything else on the poll thread.

I moved the code for disabling all sensors and for listing the sensors into |register_poll| on the I/O thread. This is the same semantics that the in-Gecko implementation uses. This change makes sensorsd work on the Nexus 4 and fixes some mysterious warnings on the Flame.
Depends on: 1208991
Quickly glanced the last PRs, since I'm still not sure if we can flash the Flame boot image directly with ./flash.sh just ping me if you want a modified boot.img with the script in place. That will be required for the FOTA update too.
Attachment #8667866 - Flags: review?(mwu)
Attachment #8667935 - Flags: review?(mwu)
Attachment #8667936 - Flags: review?(mwu)
Attached file irc.txt
Gabriele,

I just had a chat with Michael on IRC. AFAICT his main concerns were about porting overhead, IPC performance and about vendors treating the IPC protocol like stable interface and we might end up with an incompatible mess of different protocols. Full chat log is attached.

I fully agree that the IPC protocol is not a public interface and that HW vendors should not re-implement our system daemons. Let's block that if anyone ever asks about it. Android APIs is the public interface for drivers and HW.

Regarding performance, he'd like to see us using binder instead of Unix sockets. I'll look into binder support and report back when I have something working.
Flags: needinfo?(gsvelto)
I've read the exchange and while I hadn't thought about it he does have a point about vendors possibly trying to offer their "value add" by replacing the daemon. That being said there's not much we can do about that, we've seen modified versions of gecko in shipping devices so I doubt that creating a daemon will make the situation any worse. I also agree about pushing back any idea of making the interface stable; it's internal stuff and will remain in flux as we see fit. BTW it might be a good idea to put huge warning labels in the code & documentation on this topic, just to scare any re-implementer wannabe.

Anyway the upside of all this - and it was missing from the discussion - is that with the C/C++ library dependency moved out of gecko we'll be finally able to replace gecko while keeping the rest of the stuff in place. And it's also a lot easier than doing the same thing by providing binary compatibility for every possible version of Android in a single gecko (which would probably involve a significant degree of magic in the build system plus a ton of dlopen() calls). It's not going to happen tomorrow but it will happen.

Finally about binder, I did some background reading (http://www.angryredplanet.com/~hackbod/openbinder/docs/html/BinderIPCMechanism.html) and had a look at the C interface (bionic/libc/kernel/common/linux/binder.h and frameworks/native/cmds/servicemanager/binder.h) and it's pretty low-level stuff. There's not much information on how to use the ioctl()s directly, the best I could find is this:

http://rts.lab.asu.edu/web_438/project_final/Talk%208%20AndroidArc_Binder.pdf
http://www.angryredplanet.com/~hackbod/openbinder/docs/html/BinderIPCMechanism.html

It looks like valuable stuff but it also looks like quite a bit of additional work just to figure out how it works. There's also a C++ wrapper under frameworks/native/include/binder. It seems to provide a higher-level abstraction which might be nicer to use in our code but is also largely documentation-free.

Overall I think it would be great if we could leverage this, and possibly make our own simpler wrapper to serve our own use-cases, but I wouldn't hold up landing the daemon because of this. In my eyes it's more important to move the update work forward and work on using binder as an optimization for later.
Flags: needinfo?(gsvelto)
(In reply to Gabriele Svelto [:gsvelto] from comment #28)
> Anyway the upside of all this - and it was missing from the discussion - is
> that with the C/C++ library dependency moved out of gecko we'll be finally
> able to replace gecko while keeping the rest of the stuff in place. And it's
> also a lot easier than doing the same thing by providing binary
> compatibility for every possible version of Android in a single gecko (which
> would probably involve a significant degree of magic in the build system
> plus a ton of dlopen() calls). It's not going to happen tomorrow but it will
> happen.

I understand this is the true aim, and I don't think it will actually improve that situation. In most cases, the work is simply moved - the real work of porting still has to be done, and then we have to concern ourselves with making sure the IPC interface is compatible. There's also pieces like gfx where deep integration is *required*. Compatibility concerns there can't be solved with IPC.

At best, I believe we can use a single gecko for all devices sharing the same Android base version.
Hey!

Thanks for all the comments.

Another note on the topic of porting: When I said that porting gets easier, I didn't mean that it gets easier from a technical POV. But it appears to be easier to manage and distribute the workload among developers: quickly get it to build and then give it to the teams to port their system services. If everything is located in Gecko, you'd rather have to port everything at once, or end up with crazy 'if 0' and build flags until the full port is ready.

I agree that IPC is not a panacea for incompatible APIs, but it definitely helps with smaller differences and incremental changes. For example, Bluetoothd contains 'ugly' driver-specific workarounds, Sensorsd implements the same protocol on top of different API versions and abstracts a bit from the actual driver APIs.

> At best, I believe we can use a single gecko for all devices sharing the same Android base version.

True. Having one Gecko binary for each Android version is possible, having one for all versions is a lot harder. Things like graphics and media are the obstacles here, as IPC is probably too slow to be used.

We have this initiative on OTA updates where we try to solve the issues around long-term support and updates. Gecko compatibility with different Android versions is one of the major technical obstacles. We're aiming for a single Gecko binary per Android version.

Regarding Binder: I started to re-implement sensorsd on top of Binder. That might take a bit as I have no experience with Binder's APIs and good documentation (i.e., tutorials) seems non-existent. I currently use hal/ipc/KeyStore.cpp as blueprint.
To echo what Thomas already said the goal here is to make updates of gecko possible on a phone whose underlying Android system is not being updated, not to make them possible across Android versions. The biggest issue there is compatibility with libraries which are often not standardized across different vendors (of which geolocation and the RIL are the biggest offenders) and thus we can't rely on.

IPC is not a panacea to solve this issue but it isolates the version-specific and device-specific workarounds outside of gecko and it's a lot easier to try to handle multiple versions/implementations directly within gecko.

BTW I suddenly feel silly about never having invited you to the FxOS FOTA meetings; your expertise on gonk would be very valuable there.
Hi!

Sorry that I didn't report back earlier. Binder is mostly undocumented and I had to do some research from kernel to frameworks to understand how it works and how to use it.

I added a Binder service to Gecko that receives sensor events from the daemon. It's implemented using the C++ framework that is also used by Android's services. The code is in the attached patch [02].

The sensors daemon adds a timestamp to each event, from the moment the event has been returned by the HAL driver. Patch [01] adds code for measuring the time it takes for an event's notification to reach Gecko's main thread. This is the first time we can make use of it.

I measured for the current in-Gecko implementation, the socket implementation and the Binder service. I'll post the results one by one.
These are the results for the internal implementation. Each line show the results for a period of 5 seconds. The listed values are

 * the average time it takes for a notification to reach the main thread, 
 * the maximum time, 
 * and the number of event

during the 5-second period.

The numbers fluctuate depending on the overall workload, but I think it's fair to say that an average of 60 usec per event is possible. There's no I/O involved, so this is the base line.
These are the results for the socket-based I/O with the same output as before. Again the numbers fluctuate depending on the workload, but something around 120-200 usec on a quite device seem realistic. Roughly half of this probably is the base overhead that happens when the event moves from Gecko's reading thread to the main thread.
Here are the numbers for Binder. It's the same code base, just that the notification was sent and received via the Binder service.

The numbers are significantly higher than for the socket I/O. Even in the best cases, the average time is longer than 1 msec. I'm pretty much surprised about this because I expected Binder and socket code to be roughly on par.

All measurements were done on an aries-l with Gecko rev 274445:7883e81f3c30. Each time I flashed the phone and started the Browser app after the reboot. It starts constant polling for sensor events. I kept this running for ~10 minutes in each case to grab the logcat.

The daemon code for the socket measurements is available at [1] and the daemon's Binder code is available at [2]. I took care to minimize the differences and overhead except for the actual message transfer. If I missed something significant, let me know.

[1] https://github.com/tdz/platform_system_sensorsd/tree/bug-1196221-perf
[2] https://github.com/tdz/platform_system_sensorsd/tree/bug-1196221-binder-cxx
Hi Michael

Please don't forget to look at these patches. It's currently blocking any progress here. Thanks!
Flags: needinfo?(mwu)
Hmmm, binder is not being used idiomatically. The standard binder pattern is a bit more efficient and safer than what's being done with libpdu. libpdu has to interpret a list of types and copy it over to the buffer, and then the shim you have for binder adds an additional memcpy. This isn't so much about binder being fast as much as libpdu looking weak compared to other IPC designs. The problem that's being solved by libpdu has already been solved better by IPDL which provides type safety (no type format string) and speed (serialize/deserialization code is generated, no format string and switch loop). I also believe the sensor daemon belongs in the gecko repo, so using IPDL would not be an issue in that context.
Flags: needinfo?(mwu)
Hi

thanks for taking a look.

(In reply to Michael Wu [:mwu] from comment #38)
> Hmmm, binder is not being used idiomatically. The standard binder pattern is
> a bit more efficient and safer than what's being done with libpdu. libpdu
> has to interpret a list of types and copy it over to the buffer, and then
> the shim you have for binder adds an additional memcpy.

True, but none of this explains the difference in overhead (factor 5 to 10).

Type-lists in libpdu work the same for socket I/O, so it's not the limiting factor. The additional |memcpy| could go away in production code, but that's not the limiting factor either: there's so much memcpy'ing in the overall codebase that the system would be unusable.

It more and more seems to me that Binder's high-performance-low-overhead reputation is just an urban legend. I searched for other's performance results, I could not find any measurements to validate such claims.

> This isn't so much
> about binder being fast as much as libpdu looking weak compared to other IPC
> designs. The problem that's being solved by libpdu has already been solved
> better by IPDL which provides type safety (no type format string) and speed
> (serialize/deserialization code is generated, no format string and switch
> loop).

IPDL would have to support multiple protocol versions at the same time. I don't think it does. Gecko has to support a minimum number of protocol versions at the same time to give external devs a chance to catch up with development.

If strong type safety is a concern, Gecko's packing and unpacking templates could be re-used. Or otherwise, additional format characters could be stored in the transfered data. We had this idea for testing, but production code might benefit as well, if the overhead is low.

> I also believe the sensor daemon belongs in the gecko repo, so using
> IPDL would not be an issue in that context.

I tend towards an external program to make a clear distinction between Gecko and daemon, but either is fine for me.
Attachment #8667866 - Flags: review?(mwu)
Attachment #8667935 - Flags: review?(mwu)
Attachment #8667936 - Flags: review?(mwu)
Attachment #8671211 - Flags: review?(mwu)
Before I forget about this: last week, dhylands, gsvelto and me had a meeting about this bug.  We found the daemon easier to handle when being stored in an external repository.  Devs of other daemons (nfcd, bluetoothd) expressed similar opinions.  The only good reason for in-Gecko tree would be the use of existing Gecko IPC protocols (IPDL).  But we require a certain amount of stability of the IPC protocol and some support for multiple protocol versions; Gecko's implementation doesn't provide either.  The plan for this bug is to continue with the current code base and to have Dave give it another review before landing.
Attachment #8664281 - Flags: feedback?(dhylands)
So it looks like sensord acts as a socket client? (the client normally calls connect) and the client acts as a server? (the server normally calls bind/accept).

I'm confused.

I was expecting that sensord would be the server, and the client would connect to it? Or maybe I'm just missing something?
Hi

(In reply to Dave Hylands [:dhylands] from comment #41)
> So it looks like sensord acts as a socket client? (the client normally calls
> connect) and the client acts as a server? (the server normally calls
> bind/accept).
> 
> I'm confused.
> 
> I was expecting that sensord would be the server, and the client would
> connect to it? Or maybe I'm just missing something?

You are right, the daemon connects to Gecko. So technically, it is the client.

We start these services on demand. Gecko instructs the init system to start the service daemon when the module gets activated and shuts down the daemon when the module gets deactivated. It's the same for Bluetooth and NFC.

This has a number of benefits:

 * No race conditions between Gecko and daemons when starting up modules. The order of operation is always

  * start Gecko
  * open a listening socket in Gecko
  * start daemon from Gecko
  * wait for the daemon to connect

   If the daemon starts first and listens, Gecko can never know when the daemon is ready. Rilproxy uses this scheme and Gecko often has to try multiple times to open a connection, because rilproxy isn't ready yet.

 * It allows us to save resources by only running daemons that are in use.

 * Daemons that are not running don't offer an attack surface.

 * Reliability: Gecko can easily kill any stale daemons. The socket name is randomized on each restart to minimize collisions with other programs.
Comment on attachment 8664281 [details] [review]
Github pull request

Updated Github pull request:

  - improvement performance by sending without polling first if socket is not blocked
  - uses switch in command-line parser
  - renamed ALOG?_ERRNO_NO to ALOG?_ERRNO_NUM
  - added range checks
  - fixed out-of-bounds access to g_default_sensors_flags
  - fixed typos and comments
Attachment #8667866 - Flags: review?(dhylands)
Attachment #8667935 - Flags: review?(dhylands)
Attachment #8667936 - Flags: review?(dhylands)
Attachment #8671211 - Flags: review?(dhylands)
Attachment #8664281 - Flags: feedback?(dhylands) → feedback+
Comment on attachment 8671211 [details] [review]
Github pull request for b2g-manifest

You may need to get these mirrored on git.mozilla.org as well
Attachment #8671211 - Flags: review?(dhylands) → review+
Attachment #8667936 - Flags: review?(dhylands) → review+
Attachment #8667935 - Flags: review?(dhylands) → review+
Attachment #8667866 - Flags: review?(dhylands) → review+
Comment on attachment 8667936 [details] [review]
Github pull request for device-flame

Do we need to add this for aries as well?
(In reply to Dave Hylands [:dhylands] from comment #44)
> Comment on attachment 8671211 [details] [review]
> Github pull request for b2g-manifest
> 
> You may need to get these mirrored on git.mozilla.org as well

Already done in bug 1208991.
(In reply to Dave Hylands [:dhylands] from comment #45)
> Comment on attachment 8667936 [details] [review]
> Github pull request for device-flame
> 
> Do we need to add this for aries as well?

Only flame{-kk} uses init.target.rc. All other systems use init.b2g.rc.
Hi Thomas, I saw a build error in emulator ICS, https://treeherder.mozilla.org/logviewer.html#?job_id=3864651&repo=b2g-inbound.

02:41:59    ERROR -  make: *** No rule to make target `out/target/product/generic/obj/lib/libfdio.so', needed by `out/target/product/generic/obj/EXECUTABLES/sensorsd_intermediates/LINKED/sensorsd'.  Stop.
02:41:59     INFO -  make: *** Waiting for unfinished jobs....

Could you help to take a look? Thank you.
(In reply to Edgar Chen [:edgar][:echen] from comment #52)
> Hi Thomas, I saw a build error in emulator ICS,
> https://treeherder.mozilla.org/logviewer.html#?job_id=3864651&repo=b2g-
> inbound.
> 
> 02:41:59    ERROR -  make: *** No rule to make target
> `out/target/product/generic/obj/lib/libfdio.so', needed by
> `out/target/product/generic/obj/EXECUTABLES/sensorsd_intermediates/LINKED/
> sensorsd'.  Stop.
> 02:41:59     INFO -  make: *** Waiting for unfinished jobs....
> 
> Could you help to take a look? Thank you.

more information from my local environment,

target thumb C: sensorsd <= system/sensorsd/src/poll.c
system/sensorsd/src/poll.c:37:23: error: fdio/task.h: No such file or directory
system/sensorsd/src/poll.c:199: error: return type is an incomplete type
system/sensorsd/src/poll.c: In function 'send_ntf_pdu_cb':
system/sensorsd/src/poll.c:201: error: 'IO_OK' undeclared (first use in this function)
system/sensorsd/src/poll.c:201: error: (Each undeclared identifier is reported only once
system/sensorsd/src/poll.c:201: error: for each function it appears in.)
cc1: warnings being treated as errors
system/sensorsd/src/poll.c:201: error: 'return' with a value, in function returning void
system/sensorsd/src/poll.c: In function 'error_ntf':
system/sensorsd/src/poll.c:549: error: implicit declaration of function 'run_task'
system/sensorsd/src/poll.c: At top level:
system/sensorsd/src/poll.c:939: error: return type is an incomplete type
system/sensorsd/src/poll.c: In function 'detect_sensors_cb':
system/sensorsd/src/poll.c:941: error: 'IO_OK' undeclared (first use in this function)
system/sensorsd/src/poll.c:941: error: 'return' with a value, in function returning void
make: *** [out/target/product/generic/obj/EXECUTABLES/sensorsd_intermediates/poll.o] Error 1
make: *** Waiting for unfinished jobs....
Oops, this library is missing. I'll fix this ASAP. It surprises me that sensorsd gets build. It's not yet listed in the Android.mks.
Attachment #8714286 - Flags: review?(gsvelto)
Comment on attachment 8714286 [details] [review]
Github pull request for b2g-manifest (libfdio)

Here we go.
Attachment #8714286 - Flags: review?(gsvelto) → review+
Edgar, it should be fixed.
Flags: needinfo?(echen)
(In reply to Thomas Zimmermann [:tzimmermann] [:tdz] from comment #58)
> Edgar, it should be fixed.

I can build pass locally now. Thank you, Thomas.
Flags: needinfo?(echen)
Hi Gabriele,

the recent manifest fix is not picked up automatically by the test infrastructure, because of bug 1243436. This patch has a  manual update.
Attachment #8714708 - Flags: review?(gsvelto)
Comment on attachment 8714708 [details] [diff] [review]
[01] Bug 1196221: Add libfdio to sources.xml for ICS emulators

Nasty, ICS keeps causing us pain.
Attachment #8714708 - Flags: review?(gsvelto) → review+
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: