Closed Bug 474937 Opened 11 years ago Closed 11 years ago

Playing Theora videos using video element causes crashes in Theora MMX/SSE routines [@ oc_state_frag_recon_mmx]

Categories

(Core :: Audio/Video, defect, critical)

x86
Windows XP
defect
Not set
critical

Tracking

()

VERIFIED FIXED

People

(Reporter: cajbir, Assigned: cajbir)

References

Details

(Keywords: crash, fixed1.9.1)

Crash Data

Attachments

(1 file, 1 obsolete file)

After bug 462082 landed there were reports of crashes on older hardware. The common link seems to be problems with SSE optimizations that that bug enabled on Windows. The hardware having problems doesn't seem to support SSE.

libtheora does cpu detection at runtime to see if the hardware supports SSE. The ability to do this on Windows is new in the recent update. Looks like there may be problems with it.

See bug 462082 comment 12 for a stack trace from windbg.

I think we should disable the SSE optimisations for now.
Flags: blocking1.9.1?
No longer depends on: 462082
Are you sure that is the only cause?  In bug 462082 comment 11, a Core 2 Duo system is listed as affected.  Also, the P3 supports SSE, but not SSE2.
Summary: Playing Theora videos using video element causes crashes on non-SSE enabled machines → Playing Theora videos using video element causes crashes in Theora MMX/SSE routines
I've changed the title to reflect that. Feel free to change it to whatever you feel is more appropriate.
Severity: normal → critical
Keywords: crash
Summary: Playing Theora videos using video element causes crashes in Theora MMX/SSE routines → Playing Theora videos using video element causes crashes in Theora MMX/SSE routines [@ oc_state_frag_recon_mmx]
This disables the new optimized MMX/SSE routines for Windows
Assignee: nobody → chris.double
Status: NEW → ASSIGNED
blocking1.9.1? This is needed on top of bug 462082 to fix crashes on windows. That bug is blocking1.9.1+.
(In reply to comment #3)
> This disables the new optimized MMX/SSE routines for Windows

Bummer.  Theora performance for me is pathetic (100% CPU, stuttering--easily < 10 frames/sec, crackling audio) on Windows.  Is there any possibility the optimizations will be fixed and reintroduced any time soon?
> Is there any possibility the optimizations will be fixed and reintroduced any time soon?

Yes, I'm disabling them until the cause is tracked down. There's some bugs that may interest you that are tracking ideas to improve performance for the things you are seeing (crackling audio, etc). See bug 474748, bug 474749 and bug 474540.
Flags: blocking1.9.1? → blocking1.9.1+
Ok.  Thank you.
Blocks: 475226
Chris, we need this fixed for beta3. Please make sure it lands.
Priority: -- → P1
Pushed to mozilla-central:
http://hg.mozilla.org/mozilla-central/rev/b62f99aa1fb6
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Priority: P1 → --
Resolution: --- → FIXED
Whiteboard: [baking for 1.9.1]
Er, I have a question about this patch ... it looks like it will cause dec/x86 to be compiled on Windows, which can't be right.
I have an old non-SSE AMD 1.4ghz Thunderbird and I am no longer crashing using the latest hourly build that includes the patch. 

Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.2a1pre) Gecko/20090125 Minefield/3.2a1pre Firefox/3.0.4 ID:20090125104555

changeset: http://hg.mozilla.org/mozilla-central/rev/abd5a41c0b70
Status: RESOLVED → VERIFIED
Pushed to mozilla-1.9.1:
http://hg.mozilla.org/releases/mozilla-1.9.1/rev/766a9eaa2989
Keywords: fixed1.9.1
Whiteboard: [baking for 1.9.1]
Jeff, I wonder if you could look into getting Theora's SSE2 detection working on Windows... We may need an old PC to do testing on, maybe there's one in the Toronto office?
It may not be the detection, it may be the MSVC version of the SSE/MMX code itself is buggy. It was a fairly recent addition by a contributor to libtheora.
Yeah, I can take a look. I don't know if we have an old PC in the toronto office, but I have a pre-SSE2 box at home, are there builds that show the problem?
(In reply to comment #15)
> but I have a pre-SSE2 box at home,

Great!

> are there builds that show the problem?

Changeset 53a4746eb1ba should have the problem.

Thanks!!!
If I had to make a guess, it looks like the mmx code is overwriting the return address on the stack. I'm seeing a crash while exiting oc_frag_recon_inter_mmx().


The problem doesn't look related to cpu detection.
The attached patch re-enables all of the mmx code except for the crashing oc_frag_recon_inter_mmx(). This seems to fix the problem for me and at least lets us get some performance improvement for the time being. 

I have a try server build here if other people want to test it out: https://build.mozilla.org/tryserver-builds/2009-01-26_21:20-jmuizelaar@mozilla.com-1233033586/
Attachment #358674 - Attachment is obsolete: true
I don't know if what I wrote above is correct. With the build above oc_frag_recon_inter_mmx is still called through oc_state_frag_recon_mmx and there are no problems. However, the code for oc_frag_recon_inter_mmx is different then the failing nightly.

The failing nightly appears to have additional code at the end that calls __security_check_cookie(). I'm not sure why this there as I didn't think we were building with -GS...

More investigation is needed...
Something weird is happening. I tried re-enabling the mmx code and the build from try server no longer crashes for me...
Hi folks. 

SSE detection is not the problem. 

The code in question is only using pentium 1 style mmx. Also the asm-code itself looks fine and is most likely not the problem, but I will double-check that.

At the moment my guess is that something is going wrong in the function prolog. That doesn't explain why it works on some machines and not on others, but it's a start.

If anyone of you have access to one of the machines that triggers the crash, could you please give it another try and - if it crashes - send me the object file please? I'd like to have a look at the generated code.
(In reply to comment #21) 
> At the moment my guess is that something is going wrong in the function prolog.
> That doesn't explain why it works on some machines and not on others, but it's
> a start.
> 
> If anyone of you have access to one of the machines that triggers the crash,
> could you please give it another try and - if it crashes - send me the object
> file please? I'd like to have a look at the generated code.

I've been able to reproduce the crash on all the window machines that I've tried with the build from here http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2009-01-24-03-mozilla-central/
Thanks a lot, Jeff. That was a great help to track it down.

I've been able to reproduce the crash as well. Now I know quite a bit more what's going on.

The function in question get's some special treatment from the compiler that other (non crashing) functions don't get. VC decides to add an epilog to the code. That by itself is not a problem, but unfortunately VC forgot to add the matching function prolog.

At the end a stack-frame gets removed that never existed, and this blows everything up. I guess I've triggered a code-generator bug here.


We can do two things to fix it:

1. Use the C version and take the performance hit. Wait until the next compiler revision comes out and then give it another try.

2. Take the code out of VC's hand and use an external assembly-file. The problem may be that VC does not come with an assembler and we may have to use third party tools or add some unmaintainable hack via __emit...


There are other strange things in the code as well. I'd like to point out the failed attempt to align the loop entry to 16 bytes. The compiler added a paranoia emms instruction as well.

Also the loads of the local variables from EBP-A0h upwards look strange, but that may have something to do with whole program optimization.



Here's the relevant part of the disassembly with some annotations.

oc_frag_recon_inter_mmx:
104239C5  pxor        mm0,mm0 
104239C8  mov         esi,dword ptr [ebp-9Ch] 
104239CE  mov         edi,dword ptr [ebp-0A8h] 
104239D4  mov         eax,dword ptr [ebp-94h] 
104239DA  mov         edx,dword ptr [ebp-0A0h] 
104239E0  mov         ebx,dword ptr [ebp-0ACh] 
104239E6  mov         ecx,4 

; next instruction is just a fancy NOP. It should 
; align to a 16 byte boundary but fails to do so.
; (take a look at the address!)

104239EB  lea         esp,[esp] 				
104239EF  movq        mm3,mmword ptr [esi] 
104239F2  movq        mm1,mmword ptr [edi] 
104239F5  movq        mm2,mmword ptr [edi+8] 
104239F9  movq        mm7,mmword ptr [esi+eax] 
104239FD  movq        mm4,mm3 
10423A00  movq        mm5,mmword ptr [edi+10h] 
10423A04  punpckhbw   mm4,mm0 
10423A07  movq        mm6,mmword ptr [edi+18h] 
10423A0B  punpcklbw   mm3,mm0 
10423A0E  paddsw      mm4,mm2 
10423A11  movq        mm2,mm7 
10423A14  paddsw      mm3,mm1 
10423A17  punpckhbw   mm2,mm0 
10423A1A  packuswb    mm3,mm4 
10423A1D  punpcklbw   mm7,mm0 
10423A20  movq        mmword ptr [edx],mm3 
10423A23  paddsw      mm2,mm6 
10423A26  add         edi,20h 
10423A29  paddsw      mm7,mm5 
10423A2C  sub         ecx,1 
10423A2F  packuswb    mm7,mm2 
10423A32  lea         esi,[esi+eax*2] 
10423A35  movq        mmword ptr [edx+ebx],mm7 
10423A39  lea         edx,[edx+ebx*2] 
10423A3C  jne         104239EF 

; Unasked function prolog from here on...

10423A3E 0F 77            emms             
10423A40 8B 4D FC         mov         ecx,dword ptr [ebp-4] 
10423A43 5F               pop         edi  
10423A44 33 CD            xor         ecx,ebp 
10423A46 5E               pop         esi  

; WTF is this? 
10423A47 E8 E4 6B EB FF   call        102DA630
 
; Root of evil is most likely the next instruction.
; EBP does not point to the stack here.
10423A4C 8B E5            mov         esp,ebp 

; Crash happends here (ESP is trashed)
10423A4E 5D               pop         ebp  				
10423A4F 8B E3            mov         esp,ebx 
10423A51 5B               pop         ebx  
10423A52 C3               ret              



Cheers,
  Nils
(In reply to comment #23)
> Thanks a lot, Jeff. That was a great help to track it down.
> 
> I've been able to reproduce the crash as well. Now I know quite a bit more
> what's going on.
> 
> The function in question get's some special treatment from the compiler that
> other (non crashing) functions don't get. VC decides to add an epilog to the
> code. That by itself is not a problem, but unfortunately VC forgot to add the
> matching function prolog.
> 
> At the end a stack-frame gets removed that never existed, and this blows
> everything up. I guess I've triggered a code-generator bug here.
> 
> 
> We can do two things to fix it:
> 
> 1. Use the C version and take the performance hit. Wait until the next compiler
> revision comes out and then give it another try.
> 
> 2. Take the code out of VC's hand and use an external assembly-file. The
> problem may be that VC does not come with an assembler and we may have to use
> third party tools or add some unmaintainable hack via __emit...
> 

What's interesting is that it doesn't seem broken all the time. For example, this build https://build.mozilla.org/tryserver-builds/2009-01-29_15:03-jmuizelaar@mozilla.com-1233270144/ should be using the mmx code but doesn't seem to have the same problem. If I recall it might not be doing the call to __security_check_cookie. I'm not sure why this call gets emitted as I don't think we're compiling with that option...


> ; WTF is this? 
> 10423A47 E8 E4 6B EB FF   call        102DA630

That's as far as I can tell a call to __security_check_cookie: http://msdn.microsoft.com/en-us/library/aa290051.aspx

> 
> ; Root of evil is most likely the next instruction.
> ; EBP does not point to the stack here.
> 10423A4C 8B E5            mov         esp,ebp
> What's interesting is that it doesn't seem broken all the time. 
> For example, this build
> https://build.mozilla.org/tryserver-builds/2009-01-29_15:03-
> jmuizelaar@mozilla.com-1233270144/
> should be using the mmx code but doesn't seem to have the same problem. 
> If I recall it might not be doing the call to __security_check_cookie. 
> I'm not sure why this call gets emitted as I don't think we're compiling
> with that option...

Pure luck if you ask me.. I've never had such a problem in my projects, but I've also never build a library as huge as xul.dll. Maybe Microsoft is using some randomized algorithm somewhere, or they stop optimizing based on available memory at certain points. That could explain why the same code compiles to different assembly each other day.

I think the best thing we can do at the moment is to remove the frag_recon functions from the codebase and reimplement them using MMX intrinsics. 

That way the functions aren't anything special for the compiler anymore. My hope is that this will make the problem go away. The generated code will become a good deal slower, but it should still be fast enough to be usable.

I'll ask the theora guys what they think about the issue.
note that -GS is the default in VC8+, and we don't change that, so we get the security_check_cookie stuff.
Oh, also, the official nightlies build with PGO, so that could definitely have an effect on codegen.
If you go with separate asm, I'd suggest just checking in the object files, rather than adding an assembler to the dependencies. As mentioned, this code doesn't change often.
That seems bad, since we'd need object files per OS or maybe even finer grained than that.
The source files (and the bug) are MSVC specific. I don't see how object files are much different. My point was that the extra build dependency on an external assembler is avoidable for everyone but the module maintainer, since the code is both upstream and rarely changing.
(In reply to comment #30)
> The source files (and the bug) are MSVC specific. I don't see how object files
> are much different. My point was that the extra build dependency on an external
> assembler is avoidable for everyone but the module maintainer, since the code
> is both upstream and rarely changing.

But the main reason for using an external assembler would be to have platform independence of the source files. We have the same platform independence problem in other parts of the source tree so there are other non-upstream places that would also benefit.

-Jeff
Sorry, I assumed some context that wasn't actually here ... what we've been considering is (I think) writing assembler in as/gas syntax, and on Windows bundling yasm into the MozillaBuild tools package. With some hackery to work around calling conventions, we could then potentially share x86 asm across OSes.
I guess whether that's a good idea for Ogg upstream is a separate and more difficult question.
Ah, sorry for the misunderstanding. Yes, if you have asm elsewhere in the tree, it makes sense to have a unified policy. For upstream, we want it to build out of the box with minimal dependencies, so we've resisted the external assembler so far. Also, inline asm is nice because it handles arguments for you, preprocessor is available, etc.

But as always, it's also up to whatever people prefer to maintain...
So, I'm confused.

The original crash report at https://bugzilla.mozilla.org/show_bug.cgi?id=462082#c12 shows the fault occurring in oc_state_frag_recon_mmx(), unfortunately without enough information to identify the exact build. Nils shows a clear error in the assembly in oc_frag_recon_inter_mmx() in one particular build.

The former function _already_ uses intrinsics, and contains no inline asm. So I do not think rewriting the code with intrinsics will fix anything. The latter function uses no intrinsics, so I don't think replacing the existing ones with inline asm will fix anything, either.

The current patch attempts to disable the latter function, and the latter function only, by not putting it in the function table. However, oc_state_frag_recon_mmx() calls oc_frag_recon_inter_mmx() directly, without using the function table (an optimization used because we know there's only one accelerated version of the function, and we know that MMX is available). So the patch actually doesn't disable this function at all, and if that patch fixes anything, it is just because you got lucky and didn't trip the compiler bug in the next build, not because you've actually solved the problem.

So you're left with the only safe option being to completely disable the asm on Windows. I think if you want to enable it, you need to determine a) whether or not this is a known bug (I find it highly unlikely that we are the first to expose it) b) what versions of the compiler are affected and c) whether there is a workaround based on _confirmed_ information about the bug from the upstream vendor. Simply applying a patch and saying "this build worked for me" is insufficient, since we do not know the real cause, and thus can't guarantee that the bug won't reappear in any arbitrary future build.

I'm going to put my foot down and refuse to support "hackery to work
around calling conventions": these are different not only for every platform, but also different "gcc-compatible" compilers on that platform, and even different command-line switches to that compiler. Mozilla may get to dictate these things on each platform. Xiph does not, and does not want to. I also don't personally have access to most of these to test them, even if I thought it was worth the extra maintenance hassle, which I do not. Every set of macros I've ever seen to mangle symbol names or handle different addressing modes has been deeply flawed in some way. That kind of garbage is why we have compilers in the first place.

But unfortunately, right now Windows is already a special case. There isn't really anyone in the core Xiph team that can maintain it. I do want to thank Nils for coming in to help take a look at this problem. He's the only reason you have Win32 asm (without using mingw32) at all. If Nils wants to do a yasm version, at this point I'm not going to stop him (disabling the asm if it's not installed seems no worse off than the current situation, given the compiler bug). But our intent would still be to use it for MSVC on Windows only.
(In reply to comment #35)
> So, I'm confused.
> 
> The original crash report at
> https://bugzilla.mozilla.org/show_bug.cgi?id=462082#c12 shows the fault
> occurring in oc_state_frag_recon_mmx(), unfortunately without enough
> information to identify the exact build.

Did you not notice the build ID just above the stack trace?
Forgive my ignorance of the Mozilla development process, but I couldn't figure out how to translate ID:20090122033712 into an actual build on https://build.mozilla.org/tryserver-builds/, or if the build even came from there in the first place.
Build IDs are just date stamps, so that's 2009-01-22 03:37:12. It's a nightly build, which means it's this build:
http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2009-01-22-03-mozilla-central/

The builds in tryserver-builds are just testing builds that people have fired off. If in fact this bug is only triggered in PGO builds, then we would not see it in tryserver builds, because they don't build with PGO, but the nightlies do.
Perhaps we could get mostly-cross-platform asm without being sensitive to calling conventions by writing C stubs with inline assembler that just moves parameters to registers and then do a direct call to a function defined in asm.

But then perhaps the expedient option now would be to check in MSVC object files, as Ralph suggested.
(In reply to comment #39)
> But then perhaps the expedient option now would be to check in MSVC object
> files, as Ralph suggested.

Yeah, worst case, this is fine; we had to do this for Firefox 2 with pixman, because vc6 miscompiled it (I had a pixman.lib checked in from vc7).
> Perhaps we could get mostly-cross-platform asm without being sensitive to
> calling conventions by writing C stubs with inline assembler that just moves
> parameters to registers and then do a direct call to a function defined in asm.

Unfortunately, the problem is larger than that. Another issue is that some platforms append _'s to global symbols and some do not, and figuring out which set of #define's to check to cover _exactly_ the platform and compiler combinations that do each is error prone without the platforms in question to test on, since everyone likes to #define everyone else's macros so they can all pretend to be "compatible". We had a number of issues with this when someone checked in code that used a (poorly tested) MANGLE() macro so they could reference these symbols directly from the asm instead of letting gcc do it. Yet another issue is position independent code (PIC), which computes all references to global symbols (e.g., constant tables) as offsets relative to the instruction pointer, and really changes the way you construct the memory reference. Directly referencing such symbols produces text relocations which are viewed as a security hazard in many Linux distributions and are completely disallowed be selinux. This caused another set of bug reports that were solved by letting gcc construct the references. Note that PIC is usually larger and slower than non-PIC code (due mostly to the extra function prologue and epilogue added to _every_ visible function to set up ebx to do the referencing, but also due to the fact that there's one less register available), so we don't want to force people to use it.

Addressing different calling conventions is actually the _easy_ part, as that really only has major differences on x86-32 vs. x86-64, so there's only two configurations to test, both of which I actually have access to.

BUT, I looked at the nightly build referenced above that exhibited the problem, and I have a theory that may allow us to avoid rewriting everything with intrinsics or using yasm or any other drastic measures.

That user reported the crash at

> 104177a8 5b              pop     ebx

which is part of a function epilog similar to Nils' disassembly:

> ; Root of evil is most likely the next instruction.
> ; EBP does not point to the stack here.
> 10423A4C 8B E5            mov         esp,ebp 

> Crash happends here (ESP is trashed)
> 10423A4E 5D               pop         ebp                  
> 10423A4F 8B E3            mov         esp,ebx 
> 10423A51 5B               pop         ebx  
> 10423A52 C3               ret      

Nils, are you quite sure that the crash happens at the pop ebp, and not the pop ebx? I don't see how it could possibly have gotten past __security_check_cookie without ebp being correct (that function also doesn't change ebp, and [ebp-4] was used just a few instructions before, so...).

But here's a theory that does explain the observed crashes. It appears that ebx is the register that should point to the stack but does not. I don't know that much about the x86-32 ABI in Windows, but on Linux ebx must be preserved across function calls (it's used for PIC), and the pop ebx suggests a similar rule for Windows. The frag_recon_* routines use ebx, and are getting inlined into oc_state_frag_recon_mmx() (hence the lack of a function prolog; they're never called from anywhere else), but they never save or restore ebx. I don't know what MSVC's policy is regarding registers clobbered by asm blocks. There's no way to tell it which ones are, in fact, clobbered, and the documentation I saw before suggested that it assumed all of them were clobbered, but the comments around the asm block in __security_check_cookie (and the generated code itself) suggest that instead it assumes _you_ will do all the preserving. So this may not be a compiler bug. It may in fact be ours.

What's happening (in the nightly build, at least) is that ebx is used to save esp when it first enters oc_state_frag_recon_mmx(), and is otherwise untouched, _except_ in the inlined oc_frag_recon_*() functions, which clobber it without preserving it. When it later is used to restore esp, things go boom. I suspect that if either a) security cookies are not added or b) those functions are not inlined, then the problem gets masked (which does not mean it is not a problem). Notably, the gcc versions of these functions do not use ebx. The easy fix is to push/pop ebx in the asm blocks in oc_frag_recon_*(). See the discussion at http://lists.xiph.org/pipermail/theora-dev/2007-December/003512.html for how to eliminate the need to use ebx at all.

It would be great if someone with Windows could actualy test this theory.
Since this bug was resolved by disabling the optimized code, I filed bug 498770 to get the optimized code reenabled.
Crash Signature: [@ oc_state_frag_recon_mmx]
You need to log in before you can comment on or make changes to this bug.