Crash in AudioChannelService while stability testing

RESOLVED FIXED in FxOS-S1 (26Jun)

Status

defect
--
critical
RESOLVED FIXED
4 years ago
4 years ago

People

(Reporter: ggrisco, Assigned: baku)

Tracking

({crash})

unspecified
FxOS-S1 (26Jun)
ARM
Gonk (Firefox OS)
Dependency tree / graph

Firefox Tracking Flags

(blocking-b2g:2.2+, b2g-v2.2 fixed, b2g-master wontfix)

Details

(Whiteboard: [b2g-crash][caf-crash 625][caf priority: p1][CR 823419], crash signature)

Attachments

(14 attachments, 21 obsolete attachments)

6.34 KB, patch
Details | Diff | Splinter Review
133.29 KB, text/plain
Details
522.86 KB, text/plain
Details
3.46 KB, patch
Details | Diff | Splinter Review
424.27 KB, text/plain
Details
6.37 KB, patch
Details | Diff | Splinter Review
4.00 KB, text/plain
Details
4.48 MB, text/plain
Details
4.83 KB, text/plain
Details
140.58 KB, text/plain
Details
477.34 KB, text/plain
Details
138.48 KB, text/plain
Details
519.19 KB, text/plain
Details
16.50 KB, patch
Details | Diff | Splinter Review
(Reporter)

Description

4 years ago
Saw this signature a couple of times now on latest build.  This is coming during stability test, so I won't be able to say exactly what steps lead to this.  cafbot will update with minidump and logs.

[@ mozilla::dom::AudioChannelService::NotifyEnumerator | nsBaseHashtable, nsAutoPtrmozilla::dom::AudioChannelService::AudioChannelAgentData, mozilla::dom::AudioChannelService::AudioChannelAgentData>::s_EnumReadStub | PL_DHashTableEnumerate | nsBaseHashtable, nsAutoPtrmozilla::dom::AudioChannelService::AudioChannelAgentData, mozilla::dom::AudioChannelService::AudioChannelAgentData>::EnumerateRead ]
Posted file EXTRA file attachment - (obsolete) —
Posted file decoded minidump - (obsolete) —
Whiteboard: [CR 823419] → [caf priority: p1][CR 823419]
Whiteboard: [caf priority: p1][CR 823419] → [b2g-crash][caf-crash 625][caf priority: p1][CR 823419]
Keywords: crash

Comment 6

4 years ago
Hi! Steven,

Could your team help to take a look? Thanks

--
Keven
Flags: needinfo?(slee)

Comment 7

4 years ago
Hi Alastor,

Please check this problem. Thanks.
Flags: needinfo?(alwu)

Updated

4 years ago
Assignee: nobody → alwu
Flags: needinfo?(alwu)
(Reporter)

Comment 8

4 years ago
We saw this crash twice so far in the first few hours of testing on AU 150. So right now it is our highest priority crash to fix.
Posted patch LOG (obsolete) — Splinter Review
Hi, Greg,
Could you apply this patch and capture the new log?
Thanks!
Flags: needinfo?(ggrisco)
Posted patch Log (obsolete) — Splinter Review
Sorry, the previous patch is wrong patch.
Attachment #8603939 - Attachment is obsolete: true
(Reporter)

Comment 11

4 years ago
Ok, I applied the patch and it should get built into AU 154.  So if cafbot reports this on AU 154 or later then you'll get the logs.
Flags: needinfo?(ggrisco)

Updated

4 years ago
blocking-b2g: 2.2? → 2.2+
Hi, Greg,
It seems that the new patch doesn't be applied on your side, because I can't see the new log, only see the old one.
Could you help me to check that?
Thanks!
Flags: needinfo?(ggrisco)
(Reporter)

Comment 16

4 years ago
(In reply to Alastor Wu [:alwu] from comment #15)
> Hi, Greg,
> It seems that the new patch doesn't be applied on your side, because I can't
> see the new log, only see the old one.
> Could you help me to check that?
> Thanks!

Hi Alastor,

comment 12 is for AU 153.  The patch was built into AU 154 and later, so we have to wait for next cafbot mail which might not come for another day or so.
Flags: needinfo?(ggrisco)

Comment 20

4 years ago
(In reply to cafbot (PoC: ggrisco) from comment #19)
> Created attachment 8605238 [details]
> decoded minidump - AU_LINUX_GECKO_LF.BR.1.2.3.00.00.00.000.154

It crashed at AudioChannelService.cpp : 672 which is [1](the line number is different because of applying attachment 8604000 [details] [diff] [review]) and the address is 0x5a5a5a6e. I'm guessing that aAgent has been deleted. Alastor, is it possible that we put an agent that has been deleted in [2]?

[1] http://git.mozilla.org/?p=releases/gecko.git;a=blob;f=dom/audiochannel/AudioChannelService.cpp;h=21968630cbc5862d6fb3073e9b45c05595a401f7;hb=4f049c08e74aab39ee81e7c6e7fcee311a1868b0#l669
[2] http://git.mozilla.org/?p=releases/gecko.git;a=blob;f=dom/audiochannel/AudioChannelService.cpp;h=21968630cbc5862d6fb3073e9b45c05595a401f7;hb=4f049c08e74aab39ee81e7c6e7fcee311a1868b0#l143
Flags: needinfo?(slee)
As I known, I think the references of AudioChannelAgent should be removed before they died. 
In the dtor of the AudioChannelAgent, it would call StopPlaying() [1] to unregister itself from the AudioChannelService. [2]

However, I found that the agent may unregister fail when the xpcom is shutdown.
After the xpcom shutdown, we would set the mDisable to true [3], then the register() and unregister() can't be executed successfully. 

In my assumption, maybe we should check the mDisable to ensure whether we need to notify AudioChannelAgent.

[1] https://hg.mozilla.org/releases/mozilla-b2g37_v2_2/file/73847cb1a3d9/dom/audiochannel/AudioChannelAgent.cpp#l35
[2] https://hg.mozilla.org/releases/mozilla-b2g37_v2_2/file/73847cb1a3d9/dom/audiochannel/AudioChannelService.cpp#l222
[3] https://hg.mozilla.org/releases/mozilla-b2g37_v2_2/file/73847cb1a3d9/dom/audiochannel/AudioChannelService.cpp#l786
Posted patch Check mDisable & Add log (obsolete) — Splinter Review
Hi, Greg,
Could you help me remove the patch from the comment10, then apply this one?
This patch has the possible solution and adds some logs.
Thanks :)
Attachment #8604000 - Attachment is obsolete: true
Flags: needinfo?(ggrisco)
(Reporter)

Comment 23

4 years ago
(In reply to Alastor Wu [:alwu] from comment #22)

> Hi, Greg,
> Could you help me remove the patch from the comment10, then apply this one?
> This patch has the possible solution and adds some logs.
> Thanks :)

Hi Alastor, thanks for the patch.  I have applied it internally and it should be built into AU 157.
Flags: needinfo?(ggrisco)

Comment 27

4 years ago
(In reply to cafbot (PoC: ggrisco) from comment #26)
> Created attachment 8606142 [details]
> decoded minidump - AU_LINUX_GECKO_LF.BR.1.2.3.00.00.00.000.156

Hi Greg,

This dump has not applied attachment 8605637 [details] [diff] [review], right?
Flags: needinfo?(ggrisco)
(Reporter)

Comment 28

4 years ago
(In reply to StevenLee[:slee] from comment #27)
> (In reply to cafbot (PoC: ggrisco) from comment #26)
> > Created attachment 8606142 [details]
> > decoded minidump - AU_LINUX_GECKO_LF.BR.1.2.3.00.00.00.000.156
> 
> Hi Greg,
> 
> This dump has not applied attachment 8605637 [details] [diff] [review],
> right?

Right, testing is just beginning on AU 157, so we might have to wait until Monday to see if the crash reproduced.
Flags: needinfo?(ggrisco)
(Reporter)

Comment 32

4 years ago
Hi Alastor, as you can see this is still happening on AU 157 even with the patch applied.  Can you look at the logs from comment 30 and comment 31?
Flags: needinfo?(alwu)
Hi, Baku,
Sorry to bother you,
Could you give me some help about this issue? 

It seems that we used the deleted agent (0x5a5a5a5a) so that we got crash. However, I don't have any idea why this case would happen, because the agent should be unregistered from the hash table before it was destroyed.

In addition, I found that the agents number was 95 in the logcat (comment30). It was very strange.

Very appreciate!
Flags: needinfo?(alwu) → needinfo?(amarchesini)

Comment 34

4 years ago
After discussing with Alastor, we found that if the destructor of AudioChannelAgent is called during AudioChannelAgent::StartPlaying,[1], is executing, we may register a dangling pointer to AudioChannelService. We will try to see if it's possible in 2.2 branch.

[1] https://dxr.mozilla.org/mozilla-central/source/dom/audiochannel/AudioChannelAgent.cpp#133
Posted patch LogsSplinter Review
Hi, Greg,
Could you help me remove previous patched, then apply this one?
Thanks!
Attachment #8602889 - Attachment is obsolete: true
Attachment #8602890 - Attachment is obsolete: true
Attachment #8602915 - Attachment is obsolete: true
Attachment #8602916 - Attachment is obsolete: true
Attachment #8604499 - Attachment is obsolete: true
Attachment #8604500 - Attachment is obsolete: true
Attachment #8605237 - Attachment is obsolete: true
Attachment #8605238 - Attachment is obsolete: true
Attachment #8605637 - Attachment is obsolete: true
Attachment #8606141 - Attachment is obsolete: true
Attachment #8606142 - Attachment is obsolete: true

Updated

4 years ago
Flags: needinfo?(ggrisco)
(Assignee)

Comment 39

4 years ago
(In reply to StevenLee[:slee] from comment #34)
> After discussing with Alastor, we found that if the destructor of
> AudioChannelAgent is called during AudioChannelAgent::StartPlaying,[1], is
> executing, we may register a dangling pointer to AudioChannelService. We
> will try to see if it's possible in 2.2 branch.

How can it be possible? Can you tell me more?

I agree with Alastor about having if (mDisabled) return; in AudioChannelService::Notify(), but maybe without the MOZ_RELEASE_ASSERT.
Flags: needinfo?(amarchesini)

Comment 40

4 years ago
(In reply to Andrea Marchesini (:baku) from comment #39)
> How can it be possible? Can you tell me more?
Our assumption is 
1. AudioChannelAgent::~AudioChannelAgent and AudioChannelAgent::StartPlaying are ran in different thread
2. Thread-1 executes AudioChannelAgent::StartPlaying in [1]. At the same time, thread-2 has finished AudioChannelAgent::~AudioChannelAgent, and the memory of AudioChannelAgent will be set as 0x5a including this pointer. 
3. Thread-1 continues and we will input 0x5a5a5a5a into AudioChannelService::RegisterAudioChannelAgent.

We're still tracing whether AudioChannelAgent::~AudioChannelAgent and AudioChannelAgent::StartPlaying are ran on different threads.

[1] https://dxr.mozilla.org/mozilla-central/source/dom/audiochannel/AudioChannelAgent.cpp#133

Comment 41

4 years ago
Hi ggrisco,

Can you also tell us the first revision you found this problem? We have only few hints about why gecko has this problem. We may able to find some clues from the commit logs.
Thanks.
(Assignee)

Comment 42

4 years ago
> 1. AudioChannelAgent::~AudioChannelAgent and AudioChannelAgent::StartPlaying
> are ran in different thread

I _really_ hope AudioChannelAgent is main-thread only! And I would like to see MOZ_ASSERT(NS_IsMainThread()) everywhere to prove this. All the use onf AudioChannelAgents are main-thread only. Alastor, can you add such assertions in your patch?

Does the crash happens when we block Notify() to run when mDisabled is true?
Can you reproduce the crash locally?
Flags: needinfo?(slee)
(Reporter)

Comment 43

4 years ago
(In reply to StevenLee[:slee] from comment #41)
> Hi ggrisco,
> 
> Can you also tell us the first revision you found this problem? We have only
> few hints about why gecko has this problem. We may able to find some clues
> from the commit logs.
> Thanks.

We saw this first on AU 129 which had these gaia and gecko revisions:

gaia  = http://git.mozilla.org/?p=releases/gaia.git;a=commit;h=ea735c21bfb0d78333213ff0376fce1eac89ead6
gecko = http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=ac12d858a58574b2db482b5eed61750a3e210040

Although we could have been blocked from seeing this crash earlier by other high occurrence crashes, so just keep that in mind.  In fact, we saw this only once on AU 129 and then we didn't see it again until AU 150 when it crashed 11 times.  Since then it's been showing up pretty consistently.

Thanks,

-Greg
Flags: needinfo?(ggrisco)

Comment 44

4 years ago
(In reply to Andrea Marchesini (:baku) from comment #42)
> I _really_ hope AudioChannelAgent is main-thread only! And I would like to
Me too, so we are trying to find if there is any code that uses AudioChannelAgent on non-main-thread.

> see MOZ_ASSERT(NS_IsMainThread()) everywhere to prove this. All the use onf
> AudioChannelAgents are main-thread only. Alastor, can you add such
> assertions in your patch?
> 
> Does the crash happens when we block Notify() to run when mDisabled is true?
I'm not sure. But I think it could happen. If some agents have been deleted without unregistering, NotifyEnumerator will hit the problem.

> Can you reproduce the crash locally?
No, it only happens during vendor's stability test.
(Reporter)

Comment 45

4 years ago
We still have the patch from bug 1159434 applied (since AU 148).  Any chance that this is causing instability?  Anyway, I think I have to remove it to apply patch from comment 35.
Hi, Baku,
We made this assumption due to the crash content 0x5a5a5a5a. It seems that we use the deleted content as key of the hash table. It would happen when the memory releasing of the Agent executes as the same time with the Agent::StartPlaying(). 

The content of |this| could be 0x5a5a5a5a if the Agent is release at the same time.
> AudioChannelAgent.cpp @ StartPlaying()
> service->RegisterAudioChannelAgent(this, static_cast<AudioChannel>(mAudioChannelType), mWithVideo);

However, I can't find any possible code path to prove it.

---

The second assumption is that we might forget to unregister the Agent from the AudioChannelService, if the Agent is registered but the |mIsRegToService| is false. (Maybe directly be accessed from address by unknown reason?)

If the |mIsRegToService| is changed by unexpected reason, then this agent would be set to 0x5a5a5a5a.
> AudioChannelAgent.cpp @ ~AudioChannelAgent()
> if (mIsRegToService) {
>   StopPlaying();
> }

---

> Does the crash happens when we block Notify() to run when mDisabled is true?
Yes, the crash still happened even if we block the Notify() to run when mDisabled is true. The comment29 (AU157) have applied the patch of mDisable checking.
Posted patch Add assertionSplinter Review
Hi, Greg,
Could you help me to apply this patch? (No need to remove previous one)
Thanks!
Attachment #8607006 - Attachment is obsolete: true
Attachment #8607007 - Attachment is obsolete: true
Flags: needinfo?(ggrisco)
(In reply to Alastor Wu [:alwu] from comment #46)
> If the |mIsRegToService| is changed by unexpected reason, then this agent
> would be set to 0x5a5a5a5a.

Correct error, I means the content of the agent would be set to 0x5a5a5a5a after finishing the destructor. Therefore, anyone who knows the address of this agent would get the 0x5a5a5a5a.
(Assignee)

Comment 49

4 years ago
I suspect I found the problem:

1. AudioChannelService::Notify does a EnumerateRead. It contains 2 (or more) agents.
2. The first agent receives a notification and it does something so that the second agent unregister itself.
3. AudioChannelService receives the unregister
4. EnumerateRead calls the second agent notify()

EnumerateRead doesn't support changes in the hashtable. We should use Enumerate instead or an nsTObserverArray.
I'm going to submit a patch soon.
(Assignee)

Comment 50

4 years ago
Posted patch patch (obsolete) — Splinter Review
I don't know if this is enough but I would give a try.
Can you apply this patch and tell me if it fixes the issue? Thanks!
Hi, all,
We can reproduce this issue in OpenL (msm8909) now, but it doesn't 100%.

STR
1. Make a conference call in OpenL.
2. Start to record the video.
3. Receive a new call during the conference call.
4. Crash.

Our preliminary analysis is that the memory in hash table might be error modified, so that we got 0x5a5a5a5a. In PLDHashTable::Enumerate(), |entryAddr| seems to be assigned the wrong address. 

Now I am trying to print more useful logs to help us solve this issue.
Here is the log, but I have not trim it yet. 
This log would be little messy.
I would simplify it after the analysis.
Here is the process info during the crash.

> APPLICATION    SEC USER     PID   PPID  VSIZE  RSS     WCHAN    PC        NAME
> b2g              0 root      262   1     321776 92916 ffffffff b570c072 t /system/b2g/b2g
> (Nuwa)           0 root      560   262   69464  24072 ffffffff b6d12238 S /system/b2g/b2g
> Homescreen       2 u0_a3299  3299  560   87596  28160 ffffffff b6d12238 S /system/b2g/b2g
> Built-in Keyboa  2 u0_a4636  4636  262   91816  33316 ffffffff b4f22238 S /system/b2g/plugin-container
> Communications   2 u0_a4811  4811  560   78468  26592 ffffffff b6d12238 S /system/b2g/b2g
> Find My Device   2 u0_a7789  7789  262   83516  32784 ffffffff b4fe0238 S /system/b2g/plugin-container
> Camera           2 u0_a7821  7821  262   121680 41100 ffffffff b4f9e9ac S /system/b2g/plugin-container
> Clock            2 u0_a9007  9007  560   83564  27940 ffffffff b6d12238 S /system/b2g/b2g
> Usage            2 u0_a9550  9550  560   76524  25684 ffffffff b6d12238 S /system/b2g/b2g
> Messages         2 u0_a9625  9625  262   91512  39256 ffffffff b4f16238 S /system/b2g/plugin-container
> (Preallocated a  2 u0_a9650  9650  560   73208  17812 ffffffff b6d12238 S /system/b2g/b2g
The log on comment 52 is adding from this patch.
(Assignee)

Comment 55

4 years ago
> We can reproduce this issue in OpenL (msm8909) now, but it doesn't 100%

with my patch too?
Flags: needinfo?(alwu)

Comment 56

4 years ago
Posted file simplified_log
Hi baku,

This log is a simplified version of attachment 8608710 [details]. I'm guessing that the problem is resulted from we insert a new item during EnumerateRead.

The following call path triggers observer-service to send "audio-channel-process-changed". It causes AudioManager to create an AudioChannelAgent with telephony type and insert into the hashtable. But the first mAgents.EnumerateRead is not finished. I'm not sure why it causes the following gets the wrong data, 0x5a5a5a5a.

AudioChannelService::NotifyEnumerator AudioChannelAgent::NotifyAudioChannelStateChanged
AudioChannelService::GetStateInternal
AudioChannelService::SendAudioChannelChangedNotification
Flags: needinfo?(slee)
(Assignee)

Comment 57

4 years ago
slee, so probably my patch goes in the right direction. Let me know when you have results with my patch applied. Thanks!
Hi, Baku,
Since the reproducing rate is low and I want to capture the correct logs, the preliminary test was executed without your patch.
I will test your patch and update the information later.
Thanks!
Flags: needinfo?(alwu)
Hi, All,
The issue still exists even if I applied the patch in comment50.
Here is the new log.

---

Process info

> APPLICATION    SEC USER     PID   PPID  VSIZE  RSS     WCHAN    PC        NAME
> b2g              0 root      259   1     308060 87748 ffffffff b570c062 t /system/b2g/b2g
> (Nuwa)           0 root      564   259   69324  24144 ffffffff b6d7b238 S /system/b2g/b2g
> Homescreen       2 u0_a3392  3392  564   87584  30792 ffffffff b6d7b238 S /system/b2g/b2g
> Built-in Keyboa  2 u0_a4755  4755  259   91816  33152 ffffffff b4f69238 S /system/b2g/plugin-container
> Find My Device   2 u0_a4877  4877  564   74556  22536 ffffffff b6d7b238 S /system/b2g/b2g
> Camera           2 u0_a5693  5693  564   141636 65872 ffffffff b6d509ac S /system/b2g/b2g
> Communications   2 u0_a6117  6117  564   78396  26492 ffffffff b6d7b238 S /system/b2g/b2g
> (Preallocated a  2 u0_a6846  6846  564   73200  17660 ffffffff b6d7b238 S /system/b2g/b2g
Posted file Log : crash details
Hi, Baku,
I think that we found the root cause! 
The root cause might be the unexpected calling the PLDHashTable::changeTable() during the PLDHashTable::EnumerateRead().

--

When we enumerate the agents, the agent would call the AudioChannelService::GetState() to get its playable state. Then we would temporary leave the loop of the PLDHashTable::Enumerate(), for some reason, the user want to play a new AudioChannelAgent. 

> Once we start playing the AudioChannelAgent, we would add this agent to the hashtable. 

Here is the KEY POINT! If we satisfy the condition of the PLDHashTable::changeTable(), then we would clear the contents of the original hash table, moving them to the new memory address. 

BUT! The previous PLDHashTable::EnumerateRead() doesn't end, and it doesn't know about its memory space has been release. 

Therefore, once we continue the PLDHashTable::EnumerateRead(), we would get the 0x5a5a5a5a!

In short, the crash happens when,
(1) Modify the hash table during the PLDHashTable::EnumerateRead().
(2) Satisfy the condition of the PLDHashTable::changeTable() during step (1).
Hi, Baku,
After discuss with Steven, here is our propose to solve this issue.
Could you give us some suggestion?

---

Use a member variable to check whether we modify the hash table during the Enumerate(). If so, we can send the runnable to do modification, and execute it after finishing the Enumerate(), rather than direct modify the hash table during the Enumerate().

Like that,

AudioChannelService's member variable, only be true during the Enumerate().
> bool mRegLock;

Change mRegLock, before/after the Enumerate().
>  mRegLock = true;
>  mAgents.Enumerate(NotifyEnumerator, nullptr);
>  mRegLock = false;

Send runnable if the mRegLock is true.
> if (mRegLock) {
>   ModifyHashTableRunnable runnable = new ModifyHashTableRunnable(this);
>   NS_DispatchToCurrentThread(runnable);
> } else {
>   mAgents.Put(aAgent, data);
> }

How do you think :)?

---

Very appreciate your help!
(Assignee)

Comment 65

4 years ago
> How do you think :)?

What about if we use a nsObserverTArray instead an hashtable?
changing the array when iterating is supported by the nsObserverTArray.
If you want I can do it.
Flags: needinfo?(alwu)
(Assignee)

Comment 66

4 years ago
> In short, the crash happens when,
> (1) Modify the hash table during the PLDHashTable::EnumerateRead().
> (2) Satisfy the condition of the PLDHashTable::changeTable() during step (1).

But if we use ::Enumerate instead EnumerateRead, we should not have this problem. Right?
Hi, Baku,
If the nsObserverTArray is the better way, it would be good!
Since I'm not familiar with it, if you can do that, I will very appreciate :)

> But if we use ::Enumerate instead EnumerateRead, we should not have this problem. Right?
Sorry for the mistake, the issue would happen on both Enumerate() and EnumerateRead().
Thanks!
Flags: needinfo?(alwu)
(Assignee)

Comment 68

4 years ago
On monday you will have a patch to review. Thanks.
Assignee: alwu → amarchesini
(Reporter)

Comment 69

4 years ago
Just so you know, I applied the patch from comment 35 and it is built into AU 163+.  I can update the patch with new patch whenever it's available.
Flags: needinfo?(ggrisco)
(Reporter)

Comment 70

4 years ago
(In reply to Greg Grisco from comment #69)
> Just so you know, I applied the patch from comment 35 and it is built into
> AU 163+.  I can update the patch with new patch whenever it's available.

Looks like this patch itself is causing another crash which is much more reproducible, so we will remove it from our internal builds:

[@ mozilla::dom::AudioChannelService::WindowDestroyedEnumerator | nsBaseHashtable, nsAutoPtrmozilla::dom::AudioChannelService::AudioChannelAgentData, mozilla::dom::AudioChannelService::AudioChannelAgentData*>::s_EnumStub | PL_DHashTableEnumerate | mozilla::dom::AudioChannelService::Observe ]

Comment 71

4 years ago
(In reply to Andrea Marchesini (:baku) from comment #66)
> > In short, the crash happens when,
> > (1) Modify the hash table during the PLDHashTable::EnumerateRead().
> > (2) Satisfy the condition of the PLDHashTable::changeTable() during step (1).
> 
> But if we use ::Enumerate instead EnumerateRead, we should not have this
> problem. Right?

As https://dxr.mozilla.org/mozilla-central/source/xpcom/glue/pldhash.h#120, if we add/remove items into/from hashtable, the enumerator should return PL_DHASH_STOP. For EnumerateRead, it checks whether the callback function returns PL_DHASH_STOP, https://dxr.mozilla.org/mozilla-central/source/xpcom/glue/nsBaseHashtable.h#394. I found all enumerators in AudioChannelSerice always return PL_DHASH_NEXT. That could cause the problem in comment 60.
(Assignee)

Comment 72

4 years ago
Posted patch a.patch (obsolete) — Splinter Review
I hope this patch is enough.
Attachment #8608634 - Attachment is obsolete: true
Attachment #8610127 - Flags: review?(alwu)
Comment on attachment 8610127 [details] [diff] [review]
a.patch

Review of attachment 8610127 [details] [diff] [review]:
-----------------------------------------------------------------

Hi, Baku,
I tested this patch, and it seems that have some problems.
Therefore, I can't verify the issue can be fixed with it.

====================================================

[1] Undefined content
If the UnregisterType() was called after the mAgents.RemoveElementAt(), the content of the |data| would become undefined. 
We would send the meaningless |data->mChannel| into the function, and then we would get crash at GetInternalType().

Crash point, 
https://dxr.mozilla.org/mozilla-central/source/dom/audiochannel/AudioChannelService.cpp#975

====================================================

[2] Repeatedly deleting
After fixing the problem [1], I tested the STR (comment51) to verify whether the issue is fixed. 
Approximately 5 times, the crash would happen. It seems that we deleted the same position in the mAgents again.
We should do some check for this condition.

Crash enter point,
#8  mozilla::dom::AudioChannelService::UnregisterAudioChannelAgent
> mAgents.RemoveElementAt(pos);

All crash BT,
#0  nsCOMPtr_base::~nsCOMPtr_base (this=0xb0aabe80, __in_chrg=<optimized out>) at ../../../../dist/include/nsCOMPtr.h:296
#1  0xb580e744 in ~AudioChannelAgentData (this=0xb0aabe80, __in_chrg=<optimized out>) at /home/alastor/B2G_8909/gecko/dom/audiochannel/AudioChannelService.h:192
#2  ~nsAutoPtr (this=0xb09d54d4, __in_chrg=<optimized out>) at ../../dist/include/nsAutoPtr.h:74
#3  Destruct (aE=0xb09d54d4) at ../../dist/include/nsTArray.h:488
#4  DestructRange (aCount=aCount@entry=1, aStart=3, aStart@entry=1, this=0xb15fc2d0, this@entry=0xb09d54e8) at ../../dist/include/nsTArray.h:1701
#5  nsTArray_Impl<nsAutoPtr<mozilla::dom::AudioChannelService::AudioChannelAgentData>, nsTArrayInfallibleAllocator>::RemoveElementsAt (this=this@entry=0xb15fc2d0, aStart=aStart@entry=3, 
    aCount=aCount@entry=1) at ../../dist/include/nsTArray.h:1397
#6  0xb580e8e2 in RemoveElementAt (aIndex=3, this=0xb15fc2d0) at ../../dist/include/nsTArray.h:1403
#7  RemoveElementAt (aIndex=3, this=0xb15fc2cc) at ../../dist/include/nsTObserverArray.h:229
#8  mozilla::dom::AudioChannelService::UnregisterAudioChannelAgent (this=0xb15fc2c0, aAgent=0xab05bda0) at /home/alastor/B2G_8909/gecko/dom/audiochannel/AudioChannelService.cpp:233
#9  0xb580da7c in mozilla::dom::AudioChannelAgent::StopPlaying (this=0xab05bda0) at /home/alastor/B2G_8909/gecko/dom/audiochannel/AudioChannelAgent.cpp:139
#10 0xb57b6c84 in mozilla::dom::gonk::AudioManager::SetPhoneState (this=0xb08d8600, aState=0) at /home/alastor/B2G_8909/gecko/dom/system/gonk/AudioManager.cpp:630

Crash log,
05-26 14:24:26.866   259   259 I Gecko   : DD | ACS, RegisterAudioChannelAgent, agent = 0xab05bda0, type = 5
05-26 14:24:26.956   259   259 I Gecko   : DD | ACS, RegisterAudioChannelAgent, agent = 0xb064fc60, type = 4
05-26 14:24:27.486   259   259 I Gecko   : DD | ACS, UnregisterAudioChannelAgent, aGent =0xb064fc60, pData type = 4
05-26 14:24:30.116   259   259 I Gecko   : DD | ACS, UnregisterAudioChannelAgent, aGent =0xab05bda0, pData type = 5
05-26 14:24:30.256   259   259 I Gecko   : DD | ACS, UnregisterAudioChannelAgent, aGent =0xab05bda0, pData type = 5

::: dom/audiochannel/AudioChannelService.cpp
@@ +218,3 @@
>    }
>  
>    nsAutoPtr<AudioChannelAgentData> data;

Delete this line and #226.

@@ +222,5 @@
> +  while (iter.HasMore()) {
> +    AudioChannelAgentData* pData = iter.GetNext();
> +    if (pData->mAgent == aAgent) {
> +      uint32_t pos = mAgents.IndexOf(pData);
> +      data = pData;

We need to move the UnregisterType(...) to here.
When we leave the while-loop, the content of the |data| would become undefined.
It caused some crashes. [1]

@@ +223,5 @@
> +    AudioChannelAgentData* pData = iter.GetNext();
> +    if (pData->mAgent == aAgent) {
> +      uint32_t pos = mAgents.IndexOf(pData);
> +      data = pData;
> +      mAgents.RemoveElementAt(pos);

However, we would get the another crash here. [2]
I found that the content of mAgents was repeatedly deleted.
Maybe we need to check whether the |pos| have already been released before.
Attachment #8610127 - Flags: review-
(Assignee)

Comment 77

4 years ago
Posted patch patch (obsolete) — Splinter Review
Oh! Right. New patch.
Attachment #8610127 - Attachment is obsolete: true
Attachment #8610127 - Flags: review?(alwu)
Attachment #8610405 - Flags: review?(alwu)
Comment on attachment 8610405 [details] [diff] [review]
patch

Review of attachment 8610405 [details] [diff] [review]:
-----------------------------------------------------------------

It seems that the issue can be solved by this patch \o/
I can't reproduce it on my local environment now.
Thanks!
Attachment #8610405 - Flags: review+
Hi, Greg,
Could you try to apply the patch on comment77?
Thanks!
Flags: needinfo?(ggrisco)
Comment on attachment 8610405 [details] [diff] [review]
patch

Review of attachment 8610405 [details] [diff] [review]:
-----------------------------------------------------------------

Change the flag.
Attachment #8610405 - Flags: review?(alwu)
(Reporter)

Comment 81

4 years ago
(In reply to Alastor Wu [:alwu] from comment #79)
> Hi, Greg,
> Could you try to apply the patch on comment77?
> Thanks!

Hi Alastor, I am bringing in this patch now.  It should hopefully get built into AU 169 or 170.
Flags: needinfo?(ggrisco)
(Assignee)

Comment 82

4 years ago
If this patch fixes the problem I would like to have a review from somebody experts of hashtable and observerTArray and nsAutoPtr. I really don't want to have other regressions!
(Assignee)

Comment 83

4 years ago
Another important point: if this patch works completely, we should port it into the new AudioChannelService API.
Blocks: 1113086
(Assignee)

Comment 84

4 years ago
any news?
Flags: needinfo?(ggrisco)
(Reporter)

Comment 85

4 years ago
So far no crashes seen, but we only have 35 hours of test on the new build.  Prefer to wait one more day before declaring victory.
(Reporter)

Comment 86

4 years ago
Ok, we haven't seen this reproduce, so I think it's safe to say it's fixed now.  Thanks!
Flags: needinfo?(ggrisco)
(Assignee)

Comment 87

4 years ago
Comment on attachment 8610405 [details] [diff] [review]
patch

Ehsan, can you take a look at this patch? I used the same pattern also for the new AudioChannelService.
Attachment #8610405 - Flags: review?(ehsan)

Comment 88

4 years ago
Comment on attachment 8610405 [details] [diff] [review]
patch

Review of attachment 8610405 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/audiochannel/AudioChannelService.cpp
@@ +139,3 @@
>                                  true /* aElementHidden */,
>                                  AUDIO_CHANNEL_STATE_MUTED /* aState */,
>                                  aWithVideo);

Nit: do you mind fixing the intentation here while you're touching this code?

@@ +221,5 @@
> +  nsTObserverArray<nsAutoPtr<AudioChannelAgentData>>::ForwardIterator iter(mAgents);
> +  while (iter.HasMore()) {
> +    nsAutoPtr<AudioChannelAgentData>& pData = iter.GetNext();
> +    if (pData->mAgent == aAgent) {
> +      uint32_t pos = mAgents.IndexOf(pData);

This is wasteful.  Why not just compute pos as you go?

@@ +370,1 @@
>                                  aElementHidden, oldElementHidden);

Nit: please fix the indentation here as well.

@@ +971,3 @@
>  
> +    if (window == aWindow) {
> +      agents.AppendElement(data->mAgent);

Hmm, couldn't you call WindowVolumeChanged() here directly?  I don't understand why you need to first build this array and then iterate over it.
Attachment #8610405 - Flags: review?(ehsan) → review+
(Reporter)

Comment 89

4 years ago
Can we get a rebase of this patch on v2.2?  Now we have bug 1157121 in our tree so it doesn't apply.  I think we might also be dependent on bug 1153915 and 1147819 (which are not in v2.2)?  Would be great if you could check, because right now I have to remove the patch internally.  Thanks!
Flags: needinfo?(amarchesini)
(Assignee)

Comment 90

4 years ago
> @@ +221,5 @@
> > +  nsTObserverArray<nsAutoPtr<AudioChannelAgentData>>::ForwardIterator iter(mAgents);
> > +  while (iter.HasMore()) {
> > +    nsAutoPtr<AudioChannelAgentData>& pData = iter.GetNext();
> > +    if (pData->mAgent == aAgent) {
> > +      uint32_t pos = mAgents.IndexOf(pData);
> 
> This is wasteful.  Why not just compute pos as you go?

because the next line 'steals' the data from pData. After this, I cannot use indexOf with pData.
Flags: needinfo?(amarchesini)
(Assignee)

Comment 91

4 years ago
Posted patch patch v2.2 (obsolete) — Splinter Review
Patch rebased for v2.2
(Assignee)

Updated

4 years ago
Flags: needinfo?(ggrisco)
(Reporter)

Comment 92

4 years ago
Thanks for the rebase.  I applied the patch internally again, should be built into AU 175+.
Flags: needinfo?(ggrisco)
(Assignee)

Comment 93

4 years ago
Comment on attachment 8614605 [details] [diff] [review]
patch v2.2

[Approval Request Comment]
[Bug caused by] (feature/regressing bug #): AudioChannel policy
[User impact] if declined: b2g crashes
[Testing completed]: on device
[Risk to taking this patch] (and alternatives if risky): low.
[String changes made]: none.
Attachment #8614605 - Flags: approval-gaia-v2.2?
Hi Andrea,
Do we need this for m-c?
Thanks
Flags: needinfo?(amarchesini)
(Assignee)

Comment 95

4 years ago
No. we want to land bug 1113086. and it already has this fix.
Flags: needinfo?(amarchesini)

Updated

4 years ago
Attachment #8614605 - Flags: approval-gaia-v2.2? → approval-gaia-v2.2+
(In reply to Andrea Marchesini (:baku) from comment #95)
> No. we want to land bug 1113086. and it already has this fix.

Thanks Andrea!
https://hg.mozilla.org/releases/mozilla-b2g37_v2_2/rev/841acdbb6d90
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 2.2 S14 (12june)
(Assignee)

Comment 100

4 years ago
Posted patch patch v2.2 (obsolete) — Splinter Review
I changed 1 line:

-    nsRefPtr<AudioChannelAgent> mAgent;
+    // AudioChannelAgent will unregister itself in the DTOR.
+    AudioChannelAgent* mAgent;
Attachment #8614605 - Attachment is obsolete: true
Flags: needinfo?(amarchesini)
Attachment #8617424 - Flags: review?(alwu)
Hi, Baku,
Why the changing of the nsRefPtr to raw pointer can fix the crash?
Thanks!
(Assignee)

Comment 102

4 years ago
Without this patch AudioChannelService doesn't keep alive the agents. With this patch it does and that changes the behavior of the Agents in some mochitests. I tested it locally but here a try push:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=5f6455c76a82
Attachment #8617424 - Attachment is patch: true
Attachment #8610405 - Attachment is obsolete: true
Hi, Baku,
Sorry I still have some questions..
AudioChannelService has used the nsAutoPtr to keep agents alive, why the nsRefPtr in the AudioChannelAgentData would affect the life of agents?
In your previous patch, what is the condition that the agent would not be alive?
Thanks :)
(Assignee)

Comment 104

4 years ago
> AudioChannelService has used the nsAutoPtr to keep agents alive, why the
> nsRefPtr in the AudioChannelAgentData would affect the life of agents?

Not really. Where do we use nsAutoPtr to keep agents alive? AudioChannelService stores nsPtrHashKey<AudioChannelAgent> as keys in 
mAgents. But nsPtrHashKey keeps a raw pointer: ./xpcom/glue/nsHashKeys.h +388.

So we didn't keep agents alive in the service before this patch. The agents are kept alive by who uses them (WebAudio, HTMLMediaElement, etc) and they have to unregister themselves before going away.
With my patch I used nsRefPtr and that changes this behavior. and this is wrong :)
(In reply to Andrea Marchesini (:baku) from comment #102)
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=5f6455c76a82

Looks like that fixes the B2G emulator failures in comment 99, but not the crashes in comment 98.
Flags: needinfo?(amarchesini)
Hi, Baku,
In your patch on comment91, from my understanding, here is the connection graph.

>                       nsAutoPtr                          nsRefPtr                      nsCOMPtr
> AudioChannelService ♢---------> AudioChannelAgentData ♢---------> AudioChannelAgent <--------♢ Users 
> 

When we register agent, we create the AudioChannelAgentData which would keep the AudioChannelAgent. The AudioChannelService also keep AudioChannelAgentData alive. Therefore, we can say that the AudioChannelService keep the AudioChannelAgent alive indirectly, right?

If you remove the nsRefPtr, we don't have the method to keep AudioChannelAgent alive, don't we?
That is why I don't understand the meaning of removing the nsRefPtr.

---

> So we didn't keep agents alive in the service before this patch. The agents are kept alive by who uses them
> (WebAudio, HTMLMediaElement, etc) and they have to unregister themselves before going away.

Another point confused me is that "they have to unregister themselves before going away". Unregistering their agents before they die sounds normal, why we need to change it?

---

Sorry here are lots of question,
Very appreciate!!!
(Assignee)

Comment 107

4 years ago
> Another point confused me is that "they have to unregister themselves before
> going away". Unregistering their agents before they die sounds normal, why
> we need to change it?


Because if we keep them alive, the DTOR doesn't run. We have to wait until the window is destroy and then we force the unregistrations. This is not what the previous code did.
Flags: needinfo?(amarchesini)
(Assignee)

Comment 108

4 years ago
Posted patch patch v2.2 (obsolete) — Splinter Review
Ryan, this patch fixes that crash.
Attachment #8617424 - Attachment is obsolete: true
Attachment #8617424 - Flags: review?(alwu)
Attachment #8621561 - Flags: review?(alwu)

Updated

4 years ago
Attachment #8621561 - Attachment is patch: true
Comment on attachment 8621561 [details] [diff] [review]
patch v2.2

Review of attachment 8621561 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM, Thanks!
Attachment #8621561 - Flags: review?(alwu) → review+
(Assignee)

Comment 110

4 years ago
The patch is reviewed, can you land it on b2g v2.2? Thanks!
Flags: needinfo?(ryanvm)
Yes, no need to ni? it.
Flags: needinfo?(ryanvm)
(Assignee)

Comment 114

4 years ago
Posted patch patch v2.2Splinter Review
Attachment #8621561 - Attachment is obsolete: true
Flags: needinfo?(amarchesini)
Attachment #8623068 - Attachment is patch: true
As NGA Program Manager suggested, let's replace the NGA-Sx milestones with FxOS-Sx ones (more generic ones), once Bug 1174794 has already landed
Target Milestone: NGA S3 (26Jun) → FxOS-S1 (26Jun)
You need to log in before you can comment on or make changes to this bug.