Closed Bug 1287874 Opened 3 years ago Closed 3 years ago

Assert fail on ctx->timeout_ms

Categories

(Core :: WebRTC: Signaling, defect, P2)

x86_64
Windows 7
defect

Tracking

()

RESOLVED FIXED
mozilla50
Tracking Status
firefox50 --- fixed

People

(Reporter: q1, Unassigned)

References

()

Details

Attachments

(1 file)

While attempting to establish a Hello session between two instances of FF 47.0 x64 debug hosted on the same machine, I hit an assertion failure in media\mtransport\third_party\nICEr\src\stun\stun_client_ctx.c, line 419: 

419:     assert(ctx->timeout_ms);

I had the debugger attached, so I was able to investigate further. The proximate cause of the problem appears to be that |ctx->maximum_transmits_timeout_ms| == 0. This matters because the condition on line 401 was satisfied:

401:     if (ctx->request_ct >= ctx->maximum_transmits) {
402:       /* Reliable transport only get here once. Unreliable get here for
403:        * their final timeout. */
404:       ctx->timeout_ms += ctx->maximum_transmits_timeout_ms;
405:     }

so, if |ctx_timeout_ms| was 0 previously, it would remain 0 and thus trigger the assert.

The instance of FF that failed was displaying the page http://iris.not.iac.es/view/index.shtml . I opened Hello in that instance, copied the conversation link, and pasted it into a different instance of FF. After ~30 seconds (this is a debug build!), the assert appeared.

The stack trace is:

 	ucrtbase.dll!000007feea3e6ab0()	Unknown
>	xul.dll!nr_stun_client_send_request(nr_stun_client_ctx_ * ctx=0x000000000a6c400c) Line 419	C
 	xul.dll!nr_stun_client_start(nr_stun_client_ctx_ * ctx=0x000000000a6c400c, int mode=0x00000014, void(*)(void *, int, void *) finished_cb=0x000007fecd1fa660, void * cb_arg=0x00000000150b3cac) Line 143	C
 	xul.dll!nr_turn_stun_ctx_start(nr_turn_stun_ctx_ * ctx=0x00000000150b3cac) Line 212	C
 	xul.dll!nr_turn_client_allocate(nr_turn_client_ctx_ * ctx=0x0000000018d2a40c, void(*)(void *, int, void *) finished_cb=0x000007fecd1cd630, void * cb_arg=0x000000001880300c) Line 648	C
 	xul.dll!nr_ice_start_relay_turn_timer_cb(void * s=0x0000000000000000, int how=0x00000000, void * cb_arg=0x000000001880300c) Line 771	C
 	xul.dll!mozilla::nrappkitTimerCallback::Notify(nsITimer * timer=0x0000000019dcda60) Line 126	C++
 	xul.dll!nsTimerImpl::Fire() Line 528	C++
 	xul.dll!nsTimerEvent::Run() Line 290	C++
 	xul.dll!nsThread::ProcessNextEvent(bool aMayWait=true, bool * aResult=0x000000000462f122) Line 994	C++
 	xul.dll!NS_ProcessNextEvent(nsIThread * aThread=0x0000000003602f00, bool aMayWait=true) Line 297	C++
 	xul.dll!nsSocketTransportService::Run() Line 894	C++
 	xul.dll!nsThread::ProcessNextEvent(bool aMayWait=false, bool * aResult=0x000000000462f5f2) Line 994	C++
 	xul.dll!NS_ProcessNextEvent(nsIThread * aThread=0x0000000003602f00, bool aMayWait=false) Line 297	C++
 	xul.dll!mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate * aDelegate=0x0000000000967530) Line 326	C++
 	xul.dll!MessageLoop::RunInternal() Line 235	C++
 	xul.dll!MessageLoop::RunHandler() Line 228	C++
 	xul.dll!MessageLoop::Run() Line 202	C++
 	xul.dll!nsThread::ThreadFunc(void * aArg=0x0000000003602f00) Line 398	C++
 	nss3.dll!_PR_NativeRunThread(void * arg=0x0000000000923c00) Line 406	C
 	nss3.dll!pr_root(void * arg=0x0000000000923c00) Line 96	C
 	ucrtbase.dll!000007feea3e82dd()	Unknown
 	kernel32.dll!BaseThreadInitThunk()	Unknown
 	ntdll.dll!RtlUserThreadStart()	Unknown

I also have a dump file, which I will be happy to examine (but not upload) if you have specific questions.

I have not attempted to reproduce the problem.
OS: Unspecified → Windows 7
Hardware: Unspecified → x86_64
I reproduced the bug the first time I tried to do so.
Hmm, I just noticed that *both* instances of FF crash with the assert. Also, the crash occurs even when using brand-new profiles in both instances.
This is most likely an issue in the core webrtc code, so moving to that product/component for better visibility.
Component: Client → WebRTC
Product: Hello (Loop) → Core
Rank: 21
Component: WebRTC → WebRTC: Signaling
Priority: -- → P2
|maximum_transmits_timeout_ms| should never be 0, looking at the code. Does the dump file show that this is the case? If so, what kind of shape is the rest of |ctx| in?
Flags: needinfo?(q1)
|maximum_transmits_timeout_ms| could be 0 if "stun.client.final_retransmit_backoff" or "stun.client.retransmission_timeout" are set to 0. But that does not match the description that this happens even with fresh profiles.
Very strange is that the |ctx| just got created in the nr_turn_client_allocate() call 3 frames up here: http://searchfox.org/mozilla-central/source/media/mtransport/third_party/nICEr/src/stun/turn_client_ctx.c#639
(In reply to Byron Campen [:bwc] from comment #4)
> |maximum_transmits_timeout_ms| should never be 0, looking at the code. Does
> the dump file show that this is the case?

Yes.

> If so, what kind of shape is the rest of |ctx| in?

I'm not sure what it should look like, but it isn't overwritten with obvious trash. For example, most of the pointers leading out of the object point to valid memory or contain nullptr, the |label| field contains believable text, the |my_addr| and |peer_addr| fields contain believable addresses, etc. The |params| field, though, seems uninitialized; should it contain anything sensible?
Flags: needinfo?(q1)
(In reply to q1 from comment #6)
> (In reply to Byron Campen [:bwc] from comment #4)
> > |maximum_transmits_timeout_ms| should never be 0, looking at the code. Does
> > the dump file show that this is the case?
> 
> Yes.

To clarify, the dump file shows |maximum_transmits_timeout_ms| == 0.
(In reply to q1 from comment #6)
> (In reply to Byron Campen [:bwc] from comment #4)
> > |maximum_transmits_timeout_ms| should never be 0, looking at the code. Does
> > the dump file show that this is the case?
> 
> Yes.
> 
> > If so, what kind of shape is the rest of |ctx| in?
> 
> I'm not sure what it should look like, but it isn't overwritten with obvious
> trash. For example, most of the pointers leading out of the object point to
> valid memory or contain nullptr, the |label| field contains believable text,
> the |my_addr| and |peer_addr| fields contain believable addresses, etc. The
> |params| field, though, seems uninitialized; should it contain anything
> sensible?

   How strange. I would expect |params| to just be zeros. What about |rto_ms|, |retransmission_backoff_factor|, and |maximum_transmits|?
(In reply to Byron Campen [:bwc] from comment #8)
> (In reply to q1 from comment #6)
> > (In reply to Byron Campen [:bwc] from comment #4)
> > > |maximum_transmits_timeout_ms| should never be 0, looking at the code. Does
> > > the dump file show that this is the case?
> > 
> > Yes.
> > 
> > > If so, what kind of shape is the rest of |ctx| in?
> > 
> > I'm not sure what it should look like, but it isn't overwritten with obvious
> > trash. For example, most of the pointers leading out of the object point to
> > valid memory or contain nullptr, the |label| field contains believable text,
> > the |my_addr| and |peer_addr| fields contain believable addresses, etc. The
> > |params| field, though, seems uninitialized; should it contain anything
> > sensible?
> 
>    How strange. I would expect |params| to just be zeros.

The first two bytes of |params| are 0x10, 0x0e, and the rest are 0.

> What about |rto_ms|,

0x64

> |retransmission_backoff_factor|,

2.0

> and |maximum_transmits|?

1

Have you tried to reproduce this crash? It happens every time for me. I am using 47.0 x64 debug, built with no modifications, but not including Crash Reporter, which had build errors; I used the --disable-crashreporter switch to exclude it.
(In reply to Byron Campen [:bwc] from comment #10)
> This is bizarre. It's almost as if pow is malfunctioning here:
> https://dxr.mozilla.org/mozilla-release/source/media/mtransport/third_party/
> nICEr/src/stun/stun_client_ctx.c#102

Hmm. My build log says of that line "warning C4013: 'pow' undefined; assuming extern returning int", so perhaps it's linked to something unexpected.
(In reply to Byron Campen [:bwc] from comment #10)
> This is bizarre. It's almost as if pow is malfunctioning here:
> https://dxr.mozilla.org/mozilla-release/source/media/mtransport/third_party/
> nICEr/src/stun/stun_client_ctx.c#102

It *is* malfunctioning. The first argument, |ctx->retransmission_backoff_factor|, == 2.0, and the second argument, |ctx->maximum_transmits| == 7, so the result should be 128.0. However, the function instead returns 0 in register |rax|, which the generated code then multiplies by |ctx->rto_ms| using the |imul| instruction.

This implies that the compiler is indeed calling the wrong version of |pow| because of the missing declaration.
Neato.
Attachment #8775282 - Flags: review?(drno)
So the question is, why does this code ever work? Have I found this problem only because the debug build is so slow?
Could just be compiler version.
(In reply to Byron Campen [:bwc] from comment #16)
> Could just be compiler version.

FWIW, I used VS 2015 Pro update 3 RC to build. But I think the compiler did the "right" thing given that it had no function prototype for |pow|: it sent the function the arguments, and expected a 32-bit int back.
Attachment #8775282 - Flags: review?(drno) → review+
(In reply to q1 from comment #17)
> FWIW, I used VS 2015 Pro update 3 RC to build.

I guess we should check how Mozilla's current official Windows compiler is handling this.
So the current treeherder builds show some other C4013 warnings, for which I opened bug 1289984. But from the logs it also looks like treeherder uses VS2015 u2.
https://hg.mozilla.org/mozilla-central/rev/8e0019ff663e
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla50
I verified with VS 2015 U2 (the compiler currently in use on our build machines) in the debugger that pow() does the correct thing.
The strange thing though is that even after upgrading to VS 2015 U3 I don't see the C4013 compiler warning for the call to pow() (I do see the warning for the things fixed in bug 1289984). Is is possible that either some environment things, like special compiler flags or libs installed on the system, have caused this to only show up on the system of the bug reporter?
As we are not in control of the compiler version used on our build machines I think we should uplift this as high a possible to prevent any problems in case RelEng decides to start using VS 2015 U3.
(In reply to Nils Ohlmeier [:drno] from comment #23)
> I verified with VS 2015 U2 (the compiler currently in use on our build
> machines) in the debugger that pow() does the correct thing.
> The strange thing though is that even after upgrading to VS 2015 U3 I don't
> see the C4013 compiler warning for the call to pow() (I do see the warning
> for the things fixed in bug 1289984). Is is possible that either some
> environment things, like special compiler flags or libs installed on the
> system, have caused this to only show up on the system of the bug reporter?

I don't know of any special libraries or of unusual build flags other than --disable-crashreporter (which I needed because crashreporter had build errors. I did verify the source tarball's SHA512, so I shouldn't have broken source). I built using Mozillabuild 2.2.0. Did you build the x64 image? That's the one I built.
Finally got it. I get the warning for the pow() function only with 64bit build and enable-debug and disable-optimize. And even though I still can't repro the 0 value at least all the C4013 warnings are gone (except for one in sqlite).

Unfortunately I can't go back to U2 any more (VS keeps crashing after the downgrade).
(In reply to Nils Ohlmeier [:drno] from comment #26)
> Finally got it. I get the warning for the pow() function only with 64bit
> build and enable-debug and disable-optimize. And even though I still can't
> repro the 0 value

That is bizarre. Maybe it happens only with U3 RC, and not U3 release?
 
> Unfortunately I can't go back to U2 any more (VS keeps crashing after the
> downgrade).

Ugh.
But the fact that the warning only shows up on 64bit debug builds lets me sleep better. And I don't think we need to uplift it for that small build group.
You need to log in before you can comment on or make changes to this bug.