Closed Bug 907087 Opened 7 years ago Closed 7 years ago

[Messages] Application hangs attempting to open message

Categories

(Firefox OS Graveyard :: General, defect)

defect
Not set

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 908907

People

(Reporter: rhelmer, Unassigned)

References

Details

(Keywords: regression, Whiteboard: [from-geeksphone])

Phone type (Keon, Peak, other): Keon
OS version (Settings>Device information>More information>): 1.2.0.0-prerelease
Build Identifier (...>More information>): 20130820022245
Update channel (...>More information>): default

Steps to Reproduce:
1) open SMS ("Messages") app
2) select message

Expected Result:

SMS message is displayed

Actual Result:

App hangs until terminated

Is this issue sometimes or always reproducible:

Always
Unfortunately, we don't have one of these devices at Bocoup, so I can't confirm by reproduction.
Summary: SMS app hangs attempting to open message → [Messages] Application hangs attempting to open message
Same thing is happening here. On Keon. Using the official nightly.
blocking-b2g: --- → koi?
Keywords: regression
I tried sending a message, which worked, and now I am able to load messages again (!)
(In reply to Robert Helmer [:rhelmer] from comment #3)
> I tried sending a message, which worked, and now I am able to load messages
> again (!)

Killed the app w/ task manager and re-opened, it's hanging again as before.
BTW my Keon build is 20130818024607 (Nightly, OTA updates)
Is there any meaningful logcat line ?
Not sure if it is relevant, but I get these:

I/Gecko   (  585): ###################################### forms.js loaded
I/Gecko   (  585): ############################### browserElementPanning.js loaded
I/Gecko   (  585): ######################## BrowserElementChildPreload.js loaded
I/Gecko   (  109): [Parent 109] WARNING: waitpid failed pid:521 errno:10: file /home/geeksphone/FOS/keon/gecko/ipc/chromium/src/base/process_util_posix.cc, line 260
I/Gecko   (  109): [Parent 109] WARNING: waitpid failed pid:521 errno:10: file /home/geeksphone/FOS/keon/gecko/ipc/chromium/src/base/process_util_posix.cc, line 260
I/Gecko   (  109): [Parent 109] WARNING: Failed to deliver SIGKILL to 521!(3).: file /home/geeksphone/FOS/keon/gecko/ipc/chromium/src/chrome/common/process_watcher_posix_sigchld.cc, line 118

The WARNING lines appear when I tap on a thread to read the messages (and nothing is displayed).
And was 521 the PID for the Messages app ? Was the Messages app process still running ?
Ok, I see what is happening.

The warning appears the second time I try, after killing the Messages app. In that case 521 is the PID of the Messages app instance that I killed with the task manager.

109 is the PID of the main b2g process.
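For anyone comparing logs, here is a minimal sketch of pulling the parent PID, child PID and errno out of the waitpid warning. The regular expression and the function name `parse_waitpid_warning` are my own, written against the log excerpt in this bug only. On Linux, errno 10 is ECHILD, meaning the parent had no such child left to wait for.

```python
import errno
import re

# Pattern written against the "waitpid failed" warning quoted in this bug;
# other Gecko builds may format the line differently (assumption).
WAITPID_RE = re.compile(
    r"\[Parent (?P<parent>\d+)\] WARNING: waitpid failed "
    r"pid:(?P<pid>\d+) errno:(?P<errno>\d+)"
)


def parse_waitpid_warning(line):
    """Return (parent_pid, child_pid, errno_name), or None if the line does not match."""
    match = WAITPID_RE.search(line)
    if match is None:
        return None
    code = int(match.group("errno"))
    # errno.errorcode maps the numeric value to its symbolic name on this platform.
    name = errno.errorcode.get(code, "unknown")
    return int(match.group("parent")), int(match.group("pid")), name


sample = (
    "I/Gecko   (  109): [Parent 109] WARNING: waitpid failed pid:521 errno:10: "
    "file /home/geeksphone/FOS/keon/gecko/ipc/chromium/src/base/process_util_posix.cc, line 260"
)
print(parse_waitpid_warning(sample))
```

Lines that are not waitpid warnings (for example the "forms.js loaded" lines) simply return None.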
I updated to nightly-images-keon-2013-08-26.Gecko-f216c74.Gaia-f94d4a9.zip (without erasing user data)

It is still happening.
(In reply to Hubert Figuiere [:hub] from comment #10)
> I updated to nightly-images-keon-2013-08-26.Gecko-f216c74.Gaia-f94d4a9.zip
> (without erasing user data)
> 
> It is still happening.

I have been keeping up with OTA updates, and same here (same user data).

In fact it's a little worse, I was able to send messages (and that would fix it until it was restarted, see comment 3), but it also hangs when sending messages now :/
Well, bug 908757 doesn't allow me to get the OTA updates :-/
(In reply to Hubert Figuiere [:hub] from comment #12)
> Well, bug 908757 doesn't allow me to get the OTA updates :-/

Oh! Yeah that's happening to me too, I guess I have not been keeping up to date :( Thanks!
it's easily reproducible with the reference workload.

don't know what happens though, there is nothing interesting in the logs.
Can this be reproduced on a device other than Keon for 1.2?
Keywords: qawanted
Can't reproduce on Inari using my eng build done this AM with a light reference workload.
QA Contact: dkumar
In response to comment 15
Tested on both Buri and Unagi and was not able to reproduce this issue.

Environmental Variables
Build ID: 20130827040201
Gecko: http://hg.mozilla.org/mozilla-central/rev/e42dce3209da
Gaia: 599214a0f41eece076dc83cd85f5b27f8cfe67f2
Platform Version: 26.0a1 

I was able to see and open up the SMS message successfully.
Thanks for testing this. Could it have to do with our user data maybe?
I've tried with the same medium workload on an unagi and a one touch fire, and the bug hasn't occurred.

I'm thinking of a driver problem then, maybe we can try to turn off the hardware graphical acceleration on the Keon.
Keywords: qawanted
No longer blocks: b2g-central-dogfood
Clearing the blocking flag as no one can reproduce.
Please renominate if it is reproduced again. Thanks.
blocking-b2g: koi? → ---
Joe, I definitely reproduce on Keon.
blocking-b2g: --- → koi?
(In reply to Julien Wajsberg [:julienw] from comment #21)
> Joe, I definitely reproduce on Keon.

This needs to be reproducible on something other than Keon in order to block on this.
(In reply to Jason Smith [:jsmith] from comment #22)

> This needs to be reproducible on something other than Keon in order to block
> on this.

Isn't Keon an officially supported platform?
so, tried again on hamachi, and it doesn't reproduce there.
blocking-b2g: koi? → ---
To put this in perspective: this is the bug that forced me to stop dogfooding Firefox OS. It makes dogfooding very hard.
Hubert: yes, this is important to me too, this is just not blocking the release, if I understood correctly.
So let's put it that way: we are ready to release Firefox OS that won't work on one of the few unlocked devices in existence? This is ridiculous.
blocking-b2g: --- → koi?
Just to note, this also breaks the tiled layers backend on the Keon. As soon as dup() is called on an FD that has mapped shared memory on a process beyond the first one, it halts during the dup.

If it helps, this is very easy to reproduce using the tiled layers backend. Apply this patch: https://bugzilla.mozilla.org/attachment.cgi?id=792112

Then debug child processes, attach to the child process, break in ShareToProcess, then continue. If you step from this point, it gets stuck in some assembly inside dup().

This appears to be a regression, this worked fine for me a couple of weeks ago. Lost 3 days thinking it was a problem with my code :(
Just to clarify my last comment (comment #28), I think the real bug here is that dup'ing a file descriptor hangs.

I'm not sure how general this is. In my case, if more than one process tries to dup fds mapped to shared memory, every subsequent process hangs indefinitely inside the dup, right after the 'swi' instruction.
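To make the operation concrete, here is a minimal single-process sketch of "dup an fd that is mapped as shared memory". It is not the Gecko code path (ShareToProcess) and it will not reproduce the hang on its own, since that needs a child process running under the seccomp filter. The helper name `dup_mapped_fd` and the temporary backing file are assumptions for illustration.

```python
import mmap
import os
import tempfile


def dup_mapped_fd(size=4096):
    """Map a file as shared memory, dup() its descriptor, and check both views agree."""
    with tempfile.TemporaryFile() as backing:
        backing.truncate(size)
        fd = backing.fileno()
        # On Unix, mmap.mmap defaults to MAP_SHARED with read/write access.
        shared = mmap.mmap(fd, size)
        try:
            # This is the call that a seccomp policy without dup would trap.
            dup_fd = os.dup(fd)
            try:
                shared[:5] = b"hello"
                shared.flush()
                # Read back through the duplicated descriptor.
                return os.pread(dup_fd, 5, 0) == b"hello"
            finally:
                os.close(dup_fd)
        finally:
            shared.close()


print(dup_mapped_fd())
```

On a desktop Linux box this just returns; the point is only to show which syscall sits between the mapping and the hang.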
Michael, this sounds like something for you (see comment 28 and 29).
Component: Gaia::SMS → General
Flags: needinfo?(mwu)
This sounds like a problem I was having with seccomp, for which I filed bug 907006: the syscall filter specifies that the process should be killed[*], and it does enter Z state but the usual notifications don't happen, which manifests as a hang.  If the other devices mentioned don't have seccomp-bpf support in their kernels yet, then it makes sense that this wouldn't be reproducible there.  We suspect a kernel bug; note that the kernel support had to be backported to the old kernel versions we're currently using.

This can be verified by doing `adb shell cat /proc/N/status`, where N is the pid.  If the hung process has "State: Z" and "Seccomp: 2", then it's the same bug.  And if the child processes — live or otherwise — *don't* have "Seccomp: 2", then seccomp-bpf isn't being used.
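The check above can be scripted. This is a rough sketch that takes the text of `/proc/N/status` (for example captured from `adb shell cat`) and applies the two conditions; the function names and the sample text are made up for illustration, and only the "State" and "Seccomp" fields are taken from the description above.

```python
def parse_proc_status(text):
    """Turn the 'Key:<tab>value' lines of /proc/N/status into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def looks_like_seccomp_hang(text):
    """True if the process is a zombie (State: Z) with seccomp filter mode (Seccomp: 2)."""
    fields = parse_proc_status(text)
    return fields.get("State", "").startswith("Z") and fields.get("Seccomp") == "2"


# Hypothetical status output, trimmed to the relevant fields.
sample = "Name:\tplugin-container\nState:\tZ (zombie)\nPid:\t521\nSeccomp:\t2\n"
print(looks_like_seccomp_hang(sample))
```

If the "Seccomp" line is missing entirely, the function returns False, which matches the case where the kernel has no seccomp-bpf support.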

[*] There are a few known issues with Gecko's behavior and/or the seccomp whitelist; see bug 906996 (unlink) and bug 908907 (dup).
(In reply to Julien Wajsberg [:julienw] from comment #30)
> Michael, this sounds like something for you (see comment 28 and 29).

Sounds like a seccomp issue which I'm not familiar with.
Flags: needinfo?(mwu)
Jed, I can confirm that it's the case.

Is there anything we can do to help you here? This is really painful for me because the Keon is the only phone with moz-central that I have which otherwise works correctly on my network (not to mention the dogfooders on Keon).
Also confirmed that Buri doesn't use seccomp.
Assuming that the current situation is considered untenable, where b2g is confusingly broken on devices that are commonly used for development or dogfooding but aren't part of normal testing, options are:

(1) Disable seccomp entirely until it's fixed.  I think we don't want to do that, either.

(2) Whitelist any syscall that plugin-container is observed to use, even if it's not something we want to allow long-term (e.g., unlink), and file a bug to remove it.  Also, --enable-content-sandbox-reporter by default on b2g to work around bug 907006, so that any failures will be obvious.
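As a toy model of option (2), not the actual filter (real seccomp-bpf policies are BPF programs installed in the kernel), the decision could be pictured like this. The permanent entries and the function name `decide` are hypothetical; the two temporary entries and their bug numbers come from the footnote a few comments up.

```python
# Hypothetical permanent whitelist entries, for illustration only.
ALLOWED = {"read", "write", "mmap"}

# Calls allowed only for now, keyed to the bug where each was reported.
TEMPORARY = {"unlink": 906996, "dup": 908907}


def decide(syscall, reporter_enabled):
    """Return what happens to a syscall under this toy policy."""
    if syscall in ALLOWED or syscall in TEMPORARY:
        return "allow"
    # With the reporter, a signal handler gets to log the call;
    # without it, the process is killed immediately.
    return "report" if reporter_enabled else "kill"


for call in ("dup", "chmod"):
    print(call, decide(call, reporter_enabled=True))
```

Keeping the temporary entries in their own table makes it easy to list what still has to be removed once the corresponding bugs are fixed.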
Depends on: 906996
* Who could be a good assignee for (2)?
* What are the implications of the sandbox reporter? Is it a good idea to always enable it? Why is it not always enabled?

Thanks Jed !
I have done these things, and I'm filing/adjusting bugs for it, with bug 912791 as the meta-bug.

As for why we don't want the reporter on all the time: I expect it's to reduce the attack surface.  If the bad system call is in fact due to a compromise of the child process, it's safer (or, at least, no less safe) to kill the process immediately than to let it continue running in a signal handler.  (The underlying assumption would be that we'll have done sufficient testing on this by the time it's released that there will be no false positives in production.)
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 908907
blocking-b2g: koi? → ---