Closed Bug 1142384 Opened 5 years ago Closed 5 years ago

[MTBF] System crashed, wifi keep spinning

Categories

(Firefox OS Graveyard :: Stability, defect)

ARM
Gonk (Firefox OS)
defect
Not set

Tracking

(blocking-b2g:2.2+, firefox38 wontfix, firefox39 wontfix, firefox40 fixed, b2g-v2.2 fixed, b2g-master fixed)

RESOLVED FIXED
2.2 S11 (1may)

People

(Reporter: pyang, Assigned: mcmanus)

Details

(Keywords: crash, Whiteboard: [b2g-crash])

Attachments

(5 files)

Attached file dmp file
STR: Run mtbf-test for more than 12 hours
Reproduce rate: low

Wifi keeps spinning and can't be brought up.
Logcat keeps printing "[Parent][MessageChannel] Error: Channel error: cannot send/recv", which might be an IPC error.
Version info:
Build ID               20150310162504
Gaia Revision          5af6f8d5d6161dea02002634c6d0a570a122e5dd
Gaia Date              2015-03-10 19:17:12
Gecko Revision         https://hg.mozilla.org/releases/mozilla-b2g37_v2_2/rev/ec87adb8cf13
Gecko Version          37.0
Device Name            flame
Firmware(Release)      4.4.2
Firmware(Incremental)  eng.cltbld.20150310.200728
Firmware Date          Tue Mar 10 20:07:39 EDT 2015
Bootloader             L1TC100118D0
Attached file Symbol zip file
Vincent, can you comment on this issue?
Flags: needinfo?(vchang)
Attached file stack.txt
The stack generated from attachments 8576428 and 8576429.
Blocks: MTBF-B2G
It seems wpa_supplicant and the wifi driver work fine. I used the start wpa_supplicant command and wpa_cli to verify this manually.

After restarting b2g, I could turn wifi on and off from the Settings app and get the AP list from the wpa_cli scan command. However, the scan list still did not show up in the Settings app.

Not sure what happened here; we may need to add some debug logs.
Flags: needinfo?(vchang)
Vincent, could you provide a build or patch so that we can get more information? Thanks.
Henry, Since Vincent isn't in Taipei, can you please check this issue? Thanks.
Flags: needinfo?(hchang)
blocking-b2g: --- → 2.2?
(In reply to Ken Chang[:ken](OOO from 2/18 to 3/1) from comment #8)
> Henry, Since Vincent isn't in Taipei, can you please check this issue?
> Thanks.

No problem. I'll take a look!
Flags: needinfo?(hchang)
I actually don't see any connection between the crash and the wifi issue...
With gecko: ec87adb8cf13 and gaia: 5af6f8d5d6161de,

the wifi never shows scan results and keeps printing

"W/Settings( 1700): [JavaScript Error: "Error: wifiListStart mark not found" {file: "app://settings.gaiamobile.org/shared/js/usertiming.js" line: 130}]"

Paul,

Do you also see the same message?
Flags: needinfo?(pyang)
The device crashed and few logs were left, so I can't tell whether the above log appeared.
I will try to reproduce it in the next round.
Flags: needinfo?(pyang)
Assignee: nobody → hchang
blocking-b2g: 2.2? → 2.2+
Attached patch Bug1142384.diff
As Arthur suggested, move the window.performance.mark('wifiListStart') call
to avoid a race condition.
Crash Signature: [@ mozilla::RefPtr<mozilla::AudioInitTask>::~RefPtr() ]
Keywords: crash
Whiteboard: [b2g-crash]
The bug mentioned in comment 13 is going to be fixed and landed in Bug 1146208, and, as noted in comment 10, the crash isn't related to WiFi. So we would like to drop this bug and let other people jump in.
Assignee: hchang → nobody
Looks like a crash in audioTrack. Bobby, do we have a chance to look at this?
Flags: needinfo?(bchien)
ni? Alastor for audioTrack related
Flags: needinfo?(alwu)
Hi Steven, do you have any comments? The call stack looks strange; it doesn't seem possible to crash in audioInitTask.
Flags: needinfo?(bchien) → needinfo?(slee)
(In reply to Bobby Chien [:bchien] from comment #17)
> Hi Steven, could you have comments? I saw call stack is strange, it looks
> not possible to crash in audioInitTask.
Agree.

1. As the call stack shows, it crashed at "libxul.so!mozilla::RefPtr<mozilla::AudioInitTask>::~RefPtr() + 0xa", but AudioInitTask runs on the "CubeInit" thread [1].
2. From the call stack, the crashing thread should be the SocketTransportService thread.

So I think it should not be an audio-related problem.

[1] https://dxr.mozilla.org/mozilla-central/source/dom/media/AudioStream.h#422
Flags: needinfo?(slee)
cancel ni? per comment 18
Flags: needinfo?(alwu)
Jason, this issue appears very rarely. However, it looks like it crashed in the HTTP stack. Could you comment on this? Thanks.
Flags: needinfo?(jduell.mcbugs)
Doug, could you help find someone to take a look at this bug? Thanks.
Flags: needinfo?(dougt)
There are a lot of FennecAndroid crash reports pointing to the same crash signature "mozilla::RefPtr<mozilla::AudioInitTask>::~RefPtr()", and in each report the crash happens on a different kind of thread. Maybe this problem is not related to a specific thread.
The function name in the call stack might be wrong because |Release| and the smart pointer destructors are optimized into a single function instance. We can see that a lot of functions map to the same address. |AsyncLatencyLogger::Release()| and |mozilla::RefPtr<mozilla::AudioInitTask>::~RefPtr()| just happen to be the first entries for that group of functions in the symbol file.
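
To illustrate the effect, here is a minimal C++ sketch. It is not the actual symbolication code used for these reports. It assumes a hypothetical symbol table (SymbolTable) in which one code address carries several names, and a made-up address 0x1000; FirstNameAt, AllNamesAt and MakeExampleTable are illustrative names only. A symbolizer that reports only the first name recorded for an address can print a misleading frame.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical symbol table: one code address may carry several names
// when identical function bodies are folded into a single instance.
using SymbolTable = std::map<std::uint64_t, std::vector<std::string>>;

// Naive symbolizer: reports only the first name recorded for the address.
std::string FirstNameAt(const SymbolTable& table, std::uint64_t addr) {
  auto it = table.find(addr);
  if (it == table.end() || it->second.empty()) {
    return "<unknown>";
  }
  return it->second.front();
}

// Returns every candidate name for the address; any of them may be the
// function that was really executing.
std::vector<std::string> AllNamesAt(const SymbolTable& table,
                                    std::uint64_t addr) {
  auto it = table.find(addr);
  if (it == table.end()) {
    return {};
  }
  return it->second;
}

// Example table. The address 0x1000 is made up for illustration.
SymbolTable MakeExampleTable() {
  SymbolTable table;
  std::vector<std::string>& names = table[0x1000];
  names.push_back("mozilla::RefPtr<mozilla::AudioInitTask>::~RefPtr()");
  names.push_back("nsCOMPtr_base::~nsCOMPtr_base()");
  return table;
}
```

With this table, FirstNameAt reports the AudioInitTask destructor for address 0x1000 even when the frame really belongs to the other name, which is why listing all candidates and checking them against the source is more reliable.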

After investigating the source code, I think the following call stack is more plausible.

> 0  libxul.so!nsCOMPtr_base::~nsCOMPtr_base()
> 1  libxul.so!mozilla::net::EventTokenBucket::~EventTokenBucket() [nsCOMPtr.h : 344 + 0x7]
> 2  libxul.so!mozilla::net::EventTokenBucket::~EventTokenBucket() [EventTokenBucket.cpp:ec87adb8cf13 : 133 + 0x3]
> 3  libxul.so!mozilla::net::EventTokenBucket::Release()
> 4  libxul.so!mozilla::net::nsHttpConnectionMgr::OnMsgUpdateRequestTokenBucket(int, void*) [nsRefPtr.h : 47 + 0x5]
> 5  libxul.so!mozilla::net::nsHttpConnectionMgr::nsConnEvent::Run() [nsHttpConnectionMgr.h:ec87adb8cf13 : 631 + 0xb]
maybe garvan can take a look. bounce it back if you can't.
Flags: needinfo?(dougt) → needinfo?(gkeeley)
I don't know this code, and my plate is full ATM, so I'll have to bounce it.

One obvious thing to try would be to null-check param in OnMsgUpdateRequestTokenBucket()
https://dxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/nsHttpConnectionMgr.cpp#530

Might be worth pinging Patrick McManus, author of the code:
https://hg.mozilla.org/mozilla-central/diff/ecf37b2b9a96/netwerk/protocol/http/nsHttpConnectionMgr.cpp

Is there any useful/meaningful way to sanity check "param"? Is it possible that the EventTokenBucket had its refcount drop to zero before it gets to OnMsgUpdateRequestTokenBucket? I assume there is some async behaviour in this code; perhaps that introduces that possibility.

If there is going to be guessing happening, it would be great to find some way to increase the probability of this crash. Not knowing the tests involved, I don't know if they can be kicked into overdrive to trigger this bug faster.
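
A minimal sketch of the two ideas above, assuming nothing about the real nsHttpConnectionMgr code: the handler null-checks its parameter, and the posted event holds a strong reference so the object cannot be destroyed while the event is still pending. Bucket, Manager, MakeUpdateEvent and OnUpdate are hypothetical stand-ins, and std::shared_ptr stands in for Gecko's reference counting.

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Hypothetical stand-in for the object carried by the event.
struct Bucket {
  int rate = 0;
};

class Manager {
 public:
  // Builds the deferred update. Capturing the shared_ptr by value means the
  // event itself owns a reference until it has run.
  std::function<void()> MakeUpdateEvent(std::shared_ptr<Bucket> bucket) {
    return [this, bucket] { OnUpdate(bucket); };
  }

  // Returns the rate of the installed bucket, or -1 if none is installed.
  int Rate() const { return bucket_ ? bucket_->rate : -1; }

 private:
  void OnUpdate(const std::shared_ptr<Bucket>& bucket) {
    if (!bucket) {
      return;  // Null check: ignore an event that carries no object.
    }
    bucket_ = bucket;
  }

  std::shared_ptr<Bucket> bucket_;
};
```

In this sketch the caller may drop its own reference right after creating the event; the bucket stays alive until the event has run. Note that a null check alone would not catch a pointer to an already destroyed object, which is why the strong reference matters.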
Flags: needinfo?(gkeeley)
(In reply to Garvan Keeley [:garvank] from comment #25)
> I don't know this code, and plate is full ATM, so I'll have to bounce it.
> 
> Some obvious things to try would be to null check param in
> OnMsgUpdateRequestTokenBucket()
> https://dxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/
> nsHttpConnectionMgr.cpp#530
> 
> Might be worth pinging Patrick McManus, author of the code:
> https://hg.mozilla.org/mozilla-central/diff/ecf37b2b9a96/netwerk/protocol/
> http/nsHttpConnectionMgr.cpp
> 
> Is there any useful/meaningful way to sanity check "param"? Is it possible
> that the EventTokenBucket had its refcount drop to zero before it gets to
> OnMsgUpdateRequestTokenBucket. I assume there is some async behaviour in
> this code, perhaps that introduces that possibility.
> 
> If there is going to be guessing happening, it would great to find some way
> to increase to probability of this crash. Not knowing the tests involved, I
> don't know if they can kicked into overdrive to trigger this bug faster.

Paul, are we hitting this now? Or can you trigger a local run to see if we can catch the test results and get more info here?
Flags: needinfo?(pyang)
Haven't seen this issue for a long time. I can try to reproduce it in our next triggered run.
Flags: needinfo?(pyang)
see comment #25
Flags: needinfo?(mcmanus)
That member is only supposed to be assigned on the socket thread, and the stack trace looks fine. However, I did find one place where it is assigned on the main thread during a pref change, and that could be racing against the stack trace we see. The backtrace included here has only 2 seconds of uptime, so it makes sense that it is reading the startup prefs.

I'm not certain this is your issue, but it's worth giving it a try.
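
The actual change is in the attached patch. The following is only a sketch of the general idea, under stated assumptions: a member that belongs to one thread is never written directly from another thread; instead the write is posted to the owning thread's event queue. OwnerQueue, ConnectionManager, TokenBucket, UpdateBucket and SetBucketOnOwner are hypothetical names, and the queue here is drained manually rather than by a real socket thread.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical stand-in for the shared object.
struct TokenBucket {
  int rate = 0;
};

// Event queue owned by one thread. Post() may be called from any thread;
// Drain() is meant to be called only by the owning thread.
class OwnerQueue {
 public:
  void Post(std::function<void()> task) {
    std::lock_guard<std::mutex> lock(mutex_);
    tasks_.push(std::move(task));
  }

  void Drain() {
    for (;;) {
      std::function<void()> task;
      {
        std::lock_guard<std::mutex> lock(mutex_);
        if (tasks_.empty()) {
          return;
        }
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task();  // Run outside the lock so tasks may post further tasks.
    }
  }

 private:
  std::mutex mutex_;
  std::queue<std::function<void()>> tasks_;
};

class ConnectionManager {
 public:
  // The constructing thread is treated as the owner in this sketch.
  explicit ConnectionManager(OwnerQueue& queue)
      : queue_(queue), owner_(std::this_thread::get_id()) {}

  // Safe from any thread: the write itself is deferred to the owner.
  void UpdateBucket(std::shared_ptr<TokenBucket> bucket) {
    queue_.Post([this, bucket] { SetBucketOnOwner(bucket); });
  }

  // Returns the installed rate, or -1 if no bucket is installed yet.
  int Rate() const { return bucket_ ? bucket_->rate : -1; }

 private:
  void SetBucketOnOwner(std::shared_ptr<TokenBucket> bucket) {
    // The only place the member is assigned, and only on the owner thread.
    assert(std::this_thread::get_id() == owner_);
    bucket_ = std::move(bucket);
  }

  OwnerQueue& queue_;
  std::thread::id owner_;
  std::shared_ptr<TokenBucket> bucket_;
};
```

A pref-change handler on another thread would call UpdateBucket(); the new bucket only becomes visible once the owning thread drains its queue, so the assignment can no longer race with code running on that thread.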
Flags: needinfo?(mcmanus)
Attachment #8597632 - Flags: review?(hurley)
Assignee: nobody → mcmanus
Status: NEW → ASSIGNED
Attachment #8597632 - Flags: review?(hurley) → review+
https://hg.mozilla.org/mozilla-central/rev/36c4d774fa03
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Flags: needinfo?(jduell.mcbugs)
Please request b2g37 approval on this patch when you get a chance.
Flags: needinfo?(mcmanus)
Target Milestone: --- → 2.2 S11 (1may)
Comment on attachment 8597632
eventtokenbucket thread management

NOTE: Please see https://wiki.mozilla.org/Release_Management/B2G_Landing to better understand the B2G approval process and landings.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): long standing latent bug
User impact if declined: potential startup crashes. seen in qa mtbf test
Testing completed: regression only
Risk to taking this patch (and alternatives if risky): very low. it has had a month of platform coverage
String or UUID changes made by this patch: none
Flags: needinfo?(mcmanus)
Attachment #8597632 - Flags: approval-mozilla-b2g37?
Per comment 34 and comment 35, ni Josh so that he is aware of this last-minute request for v2.2.
Flags: needinfo?(jocheng)
Flags: needinfo?(jocheng)
Attachment #8597632 - Flags: approval-mozilla-b2g37? → approval-mozilla-b2g37+