Closed Bug 224487 Opened 22 years ago Closed 22 years ago

FPU control word gets clobbered by PM - can't start unless delete compreg.dat first

Categories

(Core :: JavaScript Engine, defect)

x86
OS/2
defect
Not set
critical

Tracking

()

RESOLVED FIXED

People

(Reporter: mrmazda, Assigned: jhpedemonte)

References

Details

Attachments

(2 files, 1 obsolete file)

Mozilla/5.0 (OS/2; U; Warp 4.5; en-US; rv:1.6a) Gecko/20031101 To reproduce: 1-delete previous build 2-unzip new build 3-start mozilla 4-browse some pages 5-close mozilla 6-try to start mozilla Actual results: 11-02-2003 15:04:39 SYS3183 PID 0041 TID 0001 Slot 006c H:\MOZILLA\BIN\MOZILLA.EXE c000009a 1e2b108d EAX=02300f7f EBX=02430b80 ECX=0013f938 EDX=1e2b0d1e ESI=023c4560 EDI=0013f938 DS=0053 DSACC=f0f3 DSLIM=ffffffff ES=0053 ESACC=f0f3 ESLIM=ffffffff FS=150b FSACC=00f3 FSLIM=00000030 GS=0000 GSACC=**** GSLIM=******** CS:EIP=005b:1e2b108d CSACC=f0df CSLIM=ffffffff SS:ESP=0053:0013f900 SSACC=f0f3 SSLIM=ffffffff EBP=0013f918 FLG=00012202 MOZJS.DLL 0001:0003108d Expected results: Functional Mozilla Notes: Works fine by simply deleting registry.dat, if it exists, before starting. I tried with a fresh profile, but it also traps trying to run 'mozilla -createprofile "profilename profiledir"'.
I know what this is: we need to strengthen the code in javascript that sets the FPU control word. Working on a fix.
Status: NEW → ASSIGNED
To clarify, the workaround is to delete your components\compreg.dat before startup (not registry.dat). Updating summary.
Summary: Can't start unless delete registry.dat first → Can't start unless delete compreg.dat first
Attached patch js patch (obsolete) — Splinter Review
Brendan, this patch is basically an extension of the javascript patch in bug 215581. Although that patch was checked in, it was not entirely sufficient, and now the trunk crashes on startup after the initial launch with a floating point underflow exception. Again, the problem is that PM calls (system function calls) are changing the FPU control word from what was set by Mozilla at the start of a Javascript context. This patch (as compared to the one from bug 215581) adds a few more things: - Sets the FPU control word in js_NewNumberValue (in addition to JS_strtod and js_dtoa) - Restores the FPU control word to the previous saved value at the end of these functions. This was suggested by another developer, who stated we should play nice, even if PM isn't. - Does not optimize jsnum.c. Due to an optimizer bug in GCC, the _control87() call was getting reordered after the floating point calls, resulting in a crash. I am still investigating whether this whole mess is a compiler bug (this did not happen with Visual Age C++, only with GCC for OS/2). In the meantime, though, the nightlies are in bad shape, and I'd like to get this fixed.
Attachment #135213 - Flags: review?(brendan)
Comment on attachment 135213 [details] [diff] [review] js patch I still think it's evil that the fpu mode can't be set persistently. How do numerically intensive apps cope with any efficiency? Use ok, not rv, as the name of JSBool result variables. Also, set ok from return values to propagate errors, avoid if (!js_NewDoubleValue(...)) ok = JS_FALSE. E.g. instead of this: js_NewNumberValue(JSContext *cx, jsdouble d, jsval *rval) { jsint i; + JSBool rv = JS_TRUE; + + SET_FPU(); if (JSDOUBLE_IS_INT(d, i) && INT_FITS_IN_JSVAL(i)) { *rval = INT_TO_JSVAL(i); } else { if (!js_NewDoubleValue(cx, d, rval)) - return JS_FALSE; + rv = JS_FALSE; } - return JS_TRUE; + + RESTORE_FPU(); + return rv; } do this: js_NewNumberValue(JSContext *cx, jsdouble d, jsval *rval) { jsint i; + JSBool ok; + + SET_FPU(); if (JSDOUBLE_IS_INT(d, i) && INT_FITS_IN_JSVAL(i)) { *rval = INT_TO_JSVAL(i); + ok = JS_TRUE; } else { ok = js_NewDoubleValue(cx, d, rval); - return JS_FALSE; } - return JS_TRUE; + + RESTORE_FPU(); + return ok; } /be
brendan: here's what i would have hoped, except i believe that i'm wrong. any fpu intensive app should put its fpu task onto a thread which doesn't call ui functions, and hope that the os is responsible for preserving fpu state for thread context switches. the problem with this is of course that we (dbradley and I) think the os doesn't do that. if that's true, then no matter what you're likely screwed in the event *some* thread changes the state and doesn't manage to restore it before it loses cpu control :).
Component: XPCOM Registry → JavaScript Engine
QA Contact: PhilSchwartau
Summary: Can't start unless delete compreg.dat first → FPU control word gets clobbered by PM - can't start unless delete compreg.dat first
This is one of the things we were hoping to avoid with the previous patch, but apparently it is an issue. My fear is that this only reduces the chance. What happens when a thread context switch occurs within the set/restore area? I'm thinking for this to be solid you'd need to prevent any context switches between the two. I know this is possible under Windows, not sure about other operating systems. The other issue is performance. There was another solution and that was to make the function work in the presence of higher precision. Testing such changes is difficult and that's why we avoided that solution initially. But given the negatives of this solution, maybe that should be revisited.
Component: JavaScript Engine → XPCOM Registry
Any OS that doesn't properly virtualize machine state among threads, so that FPU mode is saved on context switch and restored on switch back to run that thread, is just broken. Don't use broken OSes. Fixing Component. /be
Component: XPCOM Registry → JavaScript Engine
A thought on avoiding de-optimizing the compilation of jsnum.c on OS2. Maybe you could use volatile appropriately (declare _control87 as volatile, or something). /be
Don't mind me, I remembered it wrong. It wasn't another process that was setting the control register and leaking it across, it was a COM side effect. I should have went back and reviewed the bug instead of relying on my memory. The question remains how big a perf impact is this? I guess given it's OS/2 only it's may not be a big deal. From what I saw the _control87 call isn't trivial. This scenario could possibly occur in other OS's.
*** Bug 225346 has been marked as a duplicate of this bug. ***
Attached patch js patch v1.1Splinter Review
Updated patch based on comments. Also, are the masks for _control87() in jsnum.h correct?
Attachment #135213 - Attachment is obsolete: true
Attachment #135213 - Flags: review?(brendan)
Attachment #135339 - Flags: review?(brendan)
Comment on attachment 135339 [details] [diff] [review] js patch v1.1 Use ok, not rc (or rv), for such JSBool variables (see the jsdtoa.c changes). I have no idea what args to _control87 are correct for OS2, you tell me ;-). Did you try volatile to avoid deoptimizing jsnum.o? Where is _control87 declared currently, and can you cause that declaration to be volatile, or otherwise force the issue? Fix the s/rc/ok/g nit and r=brendan@mozilla.org. /be
Attachment #135339 - Flags: review?(brendan) → review+
I tried everything I could to work around the optimizer bug, but got nowhere. I'll keep trying, and see if the optimizer can get fixed.
This was the patch that was checked in, with changes based on Brendan's comments.
I'd like to keep this bug open while I investigate whether this is actually a compiler bug, as I suspect. If it is and I can get the compiler fixed, then we'll just back this crap out. Otherwise, I'll close this bug out.
I hadn't gotten a nightly build in a while so I got this one: Mozilla/5.0 (OS/2; U; Warp 4; en-US; rv:1.6b) Gecko/20031205 I used this build for a short time & the problem does not appear in this build. I have carefully NOT said the bug is cured. I can't speak to that because my reading of this discussion is that it is pretty subtle & deep & I am not qualified to speak to any of the internal issues. All I can say is that I am no longer observing what I was observing. Unfortunately, due to Bug 225472, I have dropped back to the 1.6 Alpha.
Ok, it looks like this needs to be fixed in Mozilla and cannot be fixed in the compiler or system. Closing as fixed.
Status: ASSIGNED → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
Javier, can you say more about what you learned? Why is this necessary on OS/2 but not on similar OS/compiler platforms? Thanks. /be
Well, first I should mention that we (OS/2 Mozilla developers) are not the only ones that are seeing this, and our compiler (GCC) isn't the only one affected. The current speculation seems to point to the loading of some Presentation Manager DLLs is changing the FPU control word. At the moment, no one has found a possible workaround for this issue, other than what we are doing here.
This bug is PC, OS/2. If there are other platforms affected, don't we need other ifdefs? I know, gcc may not be relevant, but I was really asking whether the problem was in OS/2, and how the conversation went about fixing the PM.... /be
Sorry about that. Yes, this is OS/2 only. The conversation is more or less still on going, although no one is optimistic that it can be resolved by fixing the system (Presentation Manager). It looks like on OS/2, developers will have to be very careful when dealing with floating points, more so than on other platforms.
Flags: testcase-
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: