Closed
Bug 806407
Opened 12 years ago
Closed 12 years ago
createOffer failing silently on opt linux builds
Categories
(Core :: WebRTC, defect, P1)
Tracking
()
VERIFIED
FIXED
mozilla19
People
(Reporter: ted, Assigned: ehugg)
References
()
Details
(Whiteboard: [WebRTC], [blocking-webrtc+])
Attachments
(2 files)
3.98 KB,
patch
|
jesup
:
review+
|
Details | Diff | Splinter Review |
1.91 KB,
patch
|
Details | Diff | Splinter Review |
I'm testing with the demo in the URL, but this appears to happen in other demos as well. This demo works fine on a Mac nightly, but on a Linux or Windows nightly it stops at the createOffer step. I don't get any errors and the error callback is not called. I tried in a local Linux debug build and that worked fine as well.
Reporter | ||
Comment 1•12 years ago
|
||
Further update: built a local Linux opt build and it works, so this is either an issue with the toolchain in use or an issue that only manifests in PGO builds (which both Linux and Windows nightlies are).
Reporter | ||
Comment 2•12 years ago
|
||
Ugh. I tested with a Linux non-PGO tinderbox build and the demo works, which means this is likely to be a PGO-only problem.
Reporter | ||
Comment 3•12 years ago
|
||
Okay, but a Win32-non-PGO build reproduces the problem, so that's handy.
Reporter | ||
Comment 4•12 years ago
|
||
Here's a log including NSPR log output (ikran:5) from a Windows optimized debug build:
http://people.mozilla.org/~tmielczarek/peerconnection.log.gz
Comment 5•12 years ago
|
||
So, the interesting part from that log was repeated messages:
0[27cd60]: cpr cprSendMessage - Msg not sent: 1444
Ted looked and thinks the size for the thread ID in cprSendMessage is wrong (16 vs 32-bits), but that doesn't explain all the issues.
Updated•12 years ago
|
Priority: -- → P1
Whiteboard: [WebRTC], [blocking-webrtc+]
Assignee | ||
Comment 6•12 years ago
|
||
So I got this to happen consistently with Nightly but not my regular build. I edited the local video test like this:
-pc1.createOffer(step1, failed, {});
+pc1.createOffer(step1, failed);
And now it works on Nightly (Linux64). Still investigating.
Comment 7•12 years ago
|
||
Hmmm. Constraints? Or Constraints stepping on other data? (write-after free, uninitialized?)
Reporter | ||
Comment 8•12 years ago
|
||
Yeah, on Linux64 I could reproduce the error with a Nightly (which is PGO optimized), but not with a debug build or a normal opt build (non-PGO).
Assignee | ||
Comment 9•12 years ago
|
||
Been testing a bunch on Linux64 and Win64 and here are my results, in case it spurs any ideas on how to find this:
Tried two Nightly downloads 10/31 and 11/02 and they both will fail pretty consistently, perhaps 7 out of 10 tries.
If I take the third parameter out of the createOffer call they never fail. i.e.
-pc1.createOffer(step1, failed, {});
+pc1.createOffer(step1, failed);
I thought this would point me to a cause, but in peerconnection.js there is this:
if (!constraints) {
constraints = {};
}
Which makes the two calls equivalent when they hit the C++ layer. As far as I can tell the with/without the third parameter in JS sends the exact same params to the C++ layer.
I do not have a buildable failure case. I tried debug-opt, nodebug-opt and nodebug-opt-pgo. I have not seen a Nightly I built on either Linux64 or Win64 have this problem.
Please let me know if you have ideas or other test results.
Comment 10•12 years ago
|
||
IIRC, the logs of machines that build the nightlies have their complete mozconfigs and build environments, which might help you produce a build more like the ones demonstrating the problems.
Separate random idea: add some logging code (or even dtrace probes?) into the branch that is pref-controlled. You might be able to then extrace more useful info from the nightlies themselves.
Assignee | ||
Comment 11•12 years ago
|
||
After using Alder as my personal debugger, I've found this:
When it fails it's an error return from msgsnd() with errno set to EINVAL. From cpr_linux_ipc.c:
if (msgsnd(msgq->queueId, &mbuf,
sizeof(struct msgbuffer) - offsetof(struct msgbuffer, msgPtr),
IPC_NOWAIT) != -1) {
Checking the parameters I found that the queueId is correct and used successfully before this. The mbuf is on the stack and the last two params are constants. However, I found that msgsnd makes one more check that could return EINVAL - of mbuf.mtype. From kernel msg.c:650:
if (mtype < 1)
return -EINVAL;
I found that mtype is defined in cpr_linux_ipc.c:
struct msgbuffer {
int32_t mtype; /* Message type */
void *msgPtr; /* Ptr to msg */
void *usrPtr; /* Ptr to user data */
};
But in linux/msg.h like this:
struct msgbuf {
long mtype; /* type of message */
char mtext[1]; /* message text */
};
(It appears you are supposed to cast your own struct this way as long as you get the first param and sizes right.)
So, even though we always set mtype to be 1, on my Linux64 box msgsnd() is reading mtype to be more bytes (depending on pack they may be random or part of msgPtr). If those bytes happen to make mtype negative then msgsnd() call will fail.
I will put a simple patch up in a bit after I do some more testing and some investigation into why this doesn't fail more often.
Comment 12•12 years ago
|
||
Yeah! Good catch!
Reporter | ||
Comment 13•12 years ago
|
||
Wow, interesting! You'd probably have to look at the generated code to see what's happening differently in the PGO case. It's probably something with how the args are being passed.
Assignee | ||
Comment 14•12 years ago
|
||
Assignee | ||
Comment 15•12 years ago
|
||
Comment on attachment 680091 [details] [diff] [review]
Fix type in struct passed into msgsnd
Also changed some printfs in cpr_linux_ipc to output to NSPR instead.
Attachment #680091 -
Flags: review?(rjesup)
Assignee | ||
Comment 17•12 years ago
|
||
It appears that the OSX code has the same issue although we've never seen the problem there. I'll post a separate patch for it.
Comment 18•12 years ago
|
||
Comment on attachment 680091 [details] [diff] [review]
Fix type in struct passed into msgsnd
Review of attachment 680091 [details] [diff] [review]:
-----------------------------------------------------------------
::: media/webrtc/signaling/src/sipcc/cpr/linux/cpr_linux_ipc.h
@@ +19,5 @@
>
>
> /* Message buffer layout */
> struct msgbuffer {
> + long mtype; /* Message type */
Do we really want this to standardize to a non-specific size? I'm ok with in any case
Attachment #680091 -
Flags: review?(rjesup) → review+
Assignee | ||
Comment 19•12 years ago
|
||
(In reply to Randell Jesup [:jesup] from comment #18)
> Comment on attachment 680091 [details] [diff] [review]
> Fix type in struct passed into msgsnd
>
> Review of attachment 680091 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> ::: media/webrtc/signaling/src/sipcc/cpr/linux/cpr_linux_ipc.h
> @@ +19,5 @@
> >
> >
> > /* Message buffer layout */
> > struct msgbuffer {
> > + long mtype; /* Message type */
>
> Do we really want this to standardize to a non-specific size? I'm ok with
> in any case
But that's what caused the problem. This has to match the Linux API in linux/msg.h which defines it as long and someone set it to int32_t.
Assignee | ||
Comment 20•12 years ago
|
||
Assignee | ||
Comment 21•12 years ago
|
||
Comment on attachment 680214 [details] [diff] [review]
Patch 2 - Fix size of thread ID for Windows build
Patch 2 uploaded to facilitate testing. Changes the threadId type on Windows and removes unused code from OSX.
Assignee | ||
Updated•12 years ago
|
Assignee: nobody → ethanhugg
Assignee | ||
Comment 22•12 years ago
|
||
I'm having trouble duplicating the original problem on Windows. If anyone has a Windows installation that has this problem, if you could also try the latest build from the Alder tbpl - https://tbpl.mozilla.org/?tree=Alder That would help a lot. Thanks.
Comment 23•12 years ago
|
||
Windows opt build from Alder seems good
Comment 24•12 years ago
|
||
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla19
Comment 25•12 years ago
|
||
For those of us that don't know what opt builds are - can someone clarify how one person would get an opt build?
Comment 26•12 years ago
|
||
opt = optimized, as in what you see on tbpl
Comment 27•12 years ago
|
||
Okay. I'll mark verified per your comment in comment 23.
Status: RESOLVED → VERIFIED
Assignee | ||
Comment 28•12 years ago
|
||
I think I caused some confusion on this bug. The patch that was pushed was for Linux64. It was failing only on the opt-pgo builds made by the builders. This was because of the random chance of having something in that memory area that would make the integer less that 0.
The Windows patch above that comment23 refers to has not been pushed. I think I will create a new bug for it since it has a different solution and I am less confident in the fix.
To verify the Linux64 you'll need to take the output of the builder from M-C and see if it stalls on createOffer. The symptom was createOffer would never call back either the success or failure functions.
Assignee | ||
Comment 29•12 years ago
|
||
Created a new bug for the Windows issue - Bug 811333
Summary: createOffer failing silently on opt linux/windows builds → createOffer failing silently on opt linux builds
You need to log in
before you can comment on or make changes to this bug.
Description
•