Closed Bug 806407 Opened 8 years ago Closed 8 years ago

createOffer failing silently on opt linux builds

Categories

(Core :: WebRTC, defect, P1)

x86_64
Linux
defect

Tracking

()

VERIFIED FIXED
mozilla19

People

(Reporter: ted, Assigned: ehugg)

References

()

Details

(Whiteboard: [WebRTC], [blocking-webrtc+])

Attachments

(2 files)

I'm testing with the demo in the URL, but this appears to happen in other demos as well. This demo works fine on a Mac nightly, but on a Linux or Windows nightly it stops at the createOffer step. I don't get any errors and the error callback is not called. I tried in a local Linux debug build and that worked fine as well.
Further update: built a local Linux opt build and it works, so this is either an issue with the toolchain in use or an issue that only manifests in PGO builds (which both Linux and Windows nightlies are).
Ugh. I tested with a Linux non-PGO tinderbox build and the demo works, which means this is likely to be a PGO-only problem.
Okay, but a Win32-non-PGO build reproduces the problem, so that's handy.
Here's a log including NSPR log output (ikran:5) from a Windows optimized debug build:
http://people.mozilla.org/~tmielczarek/peerconnection.log.gz
So, the interesting part from that log was repeated messages:
0[27cd60]: cpr cprSendMessage - Msg not sent: 1444

Ted looked and thinks the size for the thread ID in cprSendMessage is wrong (16 vs 32-bits), but that doesn't explain all the issues.
Priority: -- → P1
Whiteboard: [WebRTC], [blocking-webrtc+]
So I got this to happen consistently with Nightly but not my regular build.  I edited the local video test like this:

-pc1.createOffer(step1, failed, {});
+pc1.createOffer(step1, failed);

And now it works on Nightly (Linux64).  Still investigating.
Hmmm.   Constraints?  Or Constraints stepping on other data?  (write-after free, uninitialized?)
Yeah, on Linux64 I could reproduce the error with a Nightly (which is PGO optimized), but not with a debug build or a normal opt build (non-PGO).
Been testing a bunch on Linux64 and Win64 and here are my results, in case it spurs any ideas on how to find this:

Tried two Nightly downloads 10/31 and 11/02 and they both will fail pretty consistently, perhaps 7 out of 10 tries.

If I take the third parameter out of the createOffer call they never fail.  i.e.

-pc1.createOffer(step1, failed, {});
+pc1.createOffer(step1, failed);

I thought this would point me to a cause, but in peerconnection.js there is this:

    if (!constraints) {
      constraints = {};
    }

Which makes the two calls equivalent when they hit the C++ layer.  As far as I can tell the with/without the third parameter in JS sends the exact same params to the C++ layer.

I do not have a buildable failure case.  I tried debug-opt, nodebug-opt and nodebug-opt-pgo.  I have not seen a Nightly I built on either Linux64 or Win64 have this problem.

Please let me know if you have ideas or other test results.
IIRC, the logs of machines that build the nightlies have their complete mozconfigs and build environments, which might help you produce a build more like the ones demonstrating the problems.

Separate random idea: add some logging code (or even dtrace probes?) into the branch that is pref-controlled.  You might be able to then extrace more useful info from the nightlies themselves.
After using Alder as my personal debugger, I've found this:

When it fails it's an error return from msgsnd() with errno set to EINVAL.  From cpr_linux_ipc.c:

    if (msgsnd(msgq->queueId, &mbuf,
             sizeof(struct msgbuffer) - offsetof(struct msgbuffer, msgPtr),
               IPC_NOWAIT) != -1) {

Checking the parameters I found that the queueId is correct and used successfully before this.  The mbuf is on the stack and the last two params are constants.  However, I found that msgsnd makes one more check  that could return EINVAL - of mbuf.mtype.  From kernel msg.c:650:

        if (mtype < 1)
                  return -EINVAL;

 I found that mtype is defined in cpr_linux_ipc.c:

struct msgbuffer {
    int32_t mtype;    /* Message type */
    void   *msgPtr;   /* Ptr to msg */
    void   *usrPtr;   /* Ptr to user data */
};

But in linux/msg.h like this:

struct msgbuf {
    long mtype;         /* type of message */
    char mtext[1];      /* message text */
};

(It appears you are supposed to cast your own struct this way as long as you get the first param and sizes right.)

So, even though we always set mtype to be 1, on my Linux64 box msgsnd() is reading mtype to be more bytes (depending on pack they may be random or part of msgPtr).  If those bytes happen to make mtype negative then msgsnd() call will fail.


I will put a simple patch up in a bit after I do some more testing and some investigation into why this doesn't fail more often.
Yeah!  Good catch!
Wow, interesting! You'd probably have to look at the generated code to see what's happening differently in the PGO case. It's probably something with how the args are being passed.
Comment on attachment 680091 [details] [diff] [review]
Fix type in struct passed into msgsnd


Also changed some printfs in cpr_linux_ipc to output to NSPR instead.
Attachment #680091 - Flags: review?(rjesup)
Duplicate of this bug: 740253
It appears that the OSX code has the same issue although we've never seen the problem there.  I'll post a separate patch for it.
Comment on attachment 680091 [details] [diff] [review]
Fix type in struct passed into msgsnd

Review of attachment 680091 [details] [diff] [review]:
-----------------------------------------------------------------

::: media/webrtc/signaling/src/sipcc/cpr/linux/cpr_linux_ipc.h
@@ +19,5 @@
>  
>  
>  /* Message buffer layout */
>  struct msgbuffer {
> +    long    mtype;    /* Message type */

Do we really want this to standardize to a non-specific size?  I'm ok with in any case
Attachment #680091 - Flags: review?(rjesup) → review+
(In reply to Randell Jesup [:jesup] from comment #18)
> Comment on attachment 680091 [details] [diff] [review]
> Fix type in struct passed into msgsnd
> 
> Review of attachment 680091 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: media/webrtc/signaling/src/sipcc/cpr/linux/cpr_linux_ipc.h
> @@ +19,5 @@
> >  
> >  
> >  /* Message buffer layout */
> >  struct msgbuffer {
> > +    long    mtype;    /* Message type */
> 
> Do we really want this to standardize to a non-specific size?  I'm ok with
> in any case

But that's what caused the problem.  This has to match the Linux API in linux/msg.h which defines it as long and someone set it to int32_t.
Comment on attachment 680214 [details] [diff] [review]
Patch 2 - Fix size of thread ID for Windows build


Patch 2 uploaded to facilitate testing.  Changes the threadId type on Windows and removes unused code from OSX.
Assignee: nobody → ethanhugg
I'm having trouble duplicating the original problem on Windows.  If anyone has a Windows installation that has this problem, if you could also try the latest build from the Alder tbpl - https://tbpl.mozilla.org/?tree=Alder  That would help a lot.  Thanks.
Windows opt build from Alder seems good
https://hg.mozilla.org/mozilla-central/rev/891f2dc392f0
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla19
For those of us that don't know what opt builds are - can someone clarify how one person would get an opt build?
opt = optimized, as in what you see on tbpl
Okay. I'll mark verified per your comment in comment 23.
Status: RESOLVED → VERIFIED
I think I caused some confusion on this bug.  The patch that was pushed was for Linux64.  It was failing only on the opt-pgo builds made by the builders.  This was because of the random chance of having something in that memory area that would make the integer less that 0.

The Windows patch above that comment23 refers to has not been pushed.  I think I will create a new bug for it since it has a different solution and I am less confident in the fix.

To verify the Linux64 you'll need to take the output of the builder from M-C and see if it stalls on createOffer.  The symptom was createOffer would never call back either the success or failure functions.
Created a new bug for the Windows issue - Bug 811333
Summary: createOffer failing silently on opt linux/windows builds → createOffer failing silently on opt linux builds
You need to log in before you can comment on or make changes to this bug.