Closed Bug 1291768 Opened 3 years ago Closed 3 years ago

Firefox for Android Crashes when performance recording is started

Categories

(Firefox for Android :: General, defect)

48 Branch
defect
Not set

Tracking

()

RESOLVED FIXED
Firefox 51
Tracking Status
firefox48 --- wontfix
firefox49 --- fixed
fennec 49+ ---
firefox50 --- fixed
firefox51 --- fixed

People

(Reporter: tmcrodrigues, Assigned: glandium, NeedInfo)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36
Firefox for Android

Steps to reproduce:

- Open Firefox 48 on Mac OSX
- Open Firefox 48 on Android
- Setup remote debugging via USB
- Open any website on Firefox for Android
- Open performance profiler on inspector window
- Start recording performance


Actual results:

Firefox on Android just crashes


Expected results:

It doesn't crash and I can use the performance profiler.
Tiago: Can you please provide the crash report if one is generated? You can find it in about:crashes? Thanks.
Flags: needinfo?(tmcrodrigues)
Looks to be a dupe of Bug 1241536
Status: UNCONFIRMED → RESOLVED
Closed: 3 years ago
Flags: needinfo?(tmcrodrigues)
Resolution: --- → DUPLICATE
Duplicate of bug: 1241536
Yeah, I noticed that now on the crash report (which I hadn't seen before). The keywords I looked for didn't really lead me to that bug before :(
Bug 1241536 is something different (no crash, just doesn't work). I can reproduce this one too, though.
Status: RESOLVED → REOPENED
Ever confirmed: true
Resolution: DUPLICATE → ---
Component: Untriaged → General
Product: Firefox → Firefox for Android
Mike this seems linker related
Flags: needinfo?(mh+mozilla)
So, the address of the crash is 0xb4d10000, which, according to the module data in the raw dump is for libLLVM.so, but /proc/self/maps says libLLVM.so starts at 0xb4d34000, and that the memory at 0xb4d10000 is the placeholder between the two PT_LOADs of libcneconn.so. IOW, that doesn't match.

The likely problem is that /system/lib/libLLVM.so has some funky elf headers. Can one of the people that can reproduce the bug pull that file from their device, and attach it to this bug.
Flags: needinfo?(mh+mozilla)
Couldn't attach the file directly because it's too large for bugzilla attachments, but you can get it from my Dropbox here: https://www.dropbox.com/s/s0bsmv3rh13sybx/libLLVM.so?dl=0

This is from this device https://crash-stats.mozilla.com/report/index/a6a73728-f4de-43ed-a60b-3cab42160803
This is exactly what I suspected:

$ readelf -l ~/Downloads/libLLVM.so  | grep '\(Type\|LOAD\)'
  Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
  LOAD           0x000000 0x00024000 0x00024000 0xa10e48 0xa10e48 R E 0x1000
  LOAD           0xa114d0 0x00a364d0 0x00a364d0 0x49304 0x50c05 RW  0x1000
 
Most libraries will have both Offset and VirtAddr at 0x0000000 for the first LOAD. That's the assumption our linker does. The funny thing is that the address the system linker puts in the structures it gives to debuggers (which we use to find the libraries it has loaded) is for a VirtAddr of 0, which is why we're trying to read at 0xb4d10000 (which we got from there), and why /proc/self/maps places libLLVM.so at 0xb4d34000, which is 0x24000 further.

This type of situation is what the combination of mincore and Elf::Ehdr::validate are trying to avoid hitting, but here we're unlucky because the zone we're reading *is* mapped, *and* not readable.

It also doesn't help that the system linker doesn't put the full path to the library in its debugger-targetted structures (which is why our linker is trying to read the elf header from the memory in the first place).

A way around this is to try to write to a pipe from that address, which will not segfault, but instead return a EFAULT error.
Assignee: nobody → mh+mozilla
tracking-fennec: ? → 49+
Comment on attachment 8782175 [details]
Bug 1291768 - Avoid SIGSEGV trying to read ELF headers of libraries with a non-null base virtual address.

https://reviewboard.mozilla.org/r/72390/#review70310

r=me; I don't think it's absolutely necessary to get the error handling right here, but I'm sure you can appreciate how much extra work POSIX wants you to do.

::: mozglue/linker/ElfLoader.cpp:157
(Diff revision 1)
> +        do {
> +          ret = read(read_fd, raw_ehdr, sizeof(raw_ehdr));
> +        } while (ret == -1 && errno == EINTR);

The amount of error checking done here to be POSIX-ly correct is...stunning.  But I'm not sure if it's actually sufficient.  The POSIX page for `read` (http://pubs.opengroup.org/onlinepubs/9699919799/functions/read.html) has these rage-inducing paragraphs:

> Earlier versions of this standard allowed two very different behaviors with regard to the handling of interrupts. In order to minimize the resulting confusion, it was decided that POSIX.1-2008 should support only one of these behaviors. Historical practice on AT&T-derived systems was to have read() and write() return -1 and set errno to [EINTR] when interrupted after some, but not all, of the data requested had been transferred. However, the US Department of Commerce FIPS 151-1 and FIPS 151-2 require the historical BSD behavior, in which read() and write() return the number of bytes actually transferred before the interrupt. If -1 is returned when any data is transferred, it is difficult to recover from the error on a seekable device and impossible on a non-seekable device. Most new implementations support this behavior. The behavior required by POSIX.1-2008 is to return the number of bytes transferred.

> POSIX.1-2008 does not specify when an implementation that buffers read()s actually moves the data into the user-supplied buffer, so an implementation may choose to do this at the latest possible moment. Therefore, an interrupt arriving earlier may not cause read() to return a partial byte count, but rather to return -1 and set errno to [EINTR].

> Consideration was also given to combining the two previous options, and setting errno to [EINTR] while returning a short count. However, not only is there no existing practice that implements this, it is also contradictory to the idea that when errno is set, the function responsible shall return -1.

And the Linux man page seems to concur:

> On success, the number of bytes read is returned (zero indicates end of file), and the file position is advanced by  this number.  It is not an error if this number is smaller than the number of bytes requested; this may happen for example because fewer bytes are actually available right now (maybe because we  were  close  to  end-of-file, or because we are reading from a pipe, or from a terminal), or because read() was interrupted by a signal.  On error, -1 is returned, and errno is set appropriately.  In this case, it is left unspecified whether the file position (if any) changes.

AIUI, the same logic applies to write as well.  That is, you cannot assume that `read` or `write` does everything in one go, because it might have been interrupted in the middle (and still return success), so you have to loop in the non-error case as well.  I think to be fully correct here and above, you would need to track how much you've actually read/written, continuing when `nbytes != sizeof(raw_ehdr) || (ret == -1 && errno == EINTR)` or somesuch, depending on how you want to write the loop.
Attachment #8782175 - Flags: review?(nfroyd) → review+
Thankfully, that's not true for write:

http://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html

> Complete/partial/deferred: A write request:
> (...)
> ret = write(fildes, buf, nbyte);
>
> may return:
>
> (...)
> Partial
>     ret<nbyte
>
>    This shall never happen if nbyte<= {PIPE_BUF}. If it does happen (with nbyte> {PIPE_BUF}), this volume of POSIX.1-2008 does not guarantee atomicity, even if ret<= {PIPE_BUF}, because atomicity is guaranteed according to the amount requested, not the amount written.
(In reply to Mike Hommey [:glandium] from comment #12)
> Thankfully, that's not true for write:
> 
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html

Ah, very nice.  That's what I get for not reading both spec pages.
Comment on attachment 8782175 [details]
Bug 1291768 - Avoid SIGSEGV trying to read ELF headers of libraries with a non-null base virtual address.

Would you mind looking at the interdiff?
Attachment #8782175 - Flags: review+ → review?(nfroyd)
Comment on attachment 8782175 [details]
Bug 1291768 - Avoid SIGSEGV trying to read ELF headers of libraries with a non-null base virtual address.

Looks good, thank you for the comments.
Attachment #8782175 - Flags: review?(nfroyd) → review+
Pushed by mh@glandium.org:
https://hg.mozilla.org/integration/autoland/rev/b9eeff3c5d37
Avoid SIGSEGV trying to read ELF headers of libraries with a non-null base virtual address. r=froydnj
https://hg.mozilla.org/mozilla-central/rev/b9eeff3c5d37
Status: REOPENED → RESOLVED
Closed: 3 years ago3 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 51
Mike, I guess we want to uplift that to aurora and probably beta, could you fill the uplift request? Thanks
Flags: needinfo?(mh+mozilla)
James, do we want to uplift this?
Flags: needinfo?(mh+mozilla) → needinfo?(snorp)
I think we do if you don't believe it's too risky
Flags: needinfo?(snorp)
Flags: needinfo?(mh+mozilla)
Comment on attachment 8782175 [details]
Bug 1291768 - Avoid SIGSEGV trying to read ELF headers of libraries with a non-null base virtual address.

Approval Request Comment
[User impact if declined]: Crash when enabling performance recording in some cases on some devices. Technically, this has affected Fennec for as long as the profiler has supported symbolication on android (bug 783331 for the linker part that introduced this bug)
[Describe test coverage new/current, TreeHerder]: STR in comment 0.
[Risks and why]: The worst that can happen is performance recording crashing or freezing the browser on more devices than what was affected initially, if the patch somehow doesn't work as it's expected to (but I expect the chances of that happening are rather slim). The patch doesn't affect anything else than the performance recording feature/profiler, so there is no risk for normal operation of the browser.
[String/UUID change made/needed]: None
Flags: needinfo?(mh+mozilla)
Attachment #8782175 - Flags: approval-mozilla-beta?
Attachment #8782175 - Flags: approval-mozilla-aurora?
Blocks: 1299058
Hello Tiago, could you please verify this issue is fixed as expected on a latest Nightly build? Thanks!
Flags: needinfo?(tmcrodrigues)
Hi there. A coworker of mine has verified it and already managed to debug the issue we were after. Thanks a lot for the fix :)
Comment on attachment 8782175 [details]
Bug 1291768 - Avoid SIGSEGV trying to read ELF headers of libraries with a non-null base virtual address.

Verified on nightly, let's uplift to aurora and beta so the profiler will avoid some crashes.
Attachment #8782175 - Flags: approval-mozilla-beta?
Attachment #8782175 - Flags: approval-mozilla-beta+
Attachment #8782175 - Flags: approval-mozilla-aurora?
Attachment #8782175 - Flags: approval-mozilla-aurora+
You need to log in before you can comment on or make changes to this bug.