Closed Bug 500277 Opened 13 years ago Closed 13 years ago

Older processors (AMD K6 and via) appear to crash on startup

Categories

(Core :: JavaScript Engine, defect, P3)

x86
Windows XP

RESOLVED WORKSFORME

People

(Reporter: benjamin, Assigned: gal)

Details

(Keywords: relnote)

Attachments

(1 file)

Ted and I were looking at crashes with EXCEPTION_ILLEGAL_INSTRUCTION in beta/RC builds. There appears to be a set of crashes that happen very early in startup in JITted code on certain older processors with the following IDs:

AuthenticAMD family 5 model 8 stepping 12
-- and various other model/stepping values: 8/0 9/1 13/4
-- this is an AMD K6 processor, although I can't seem to find charts of which exact models these numbers represent

CentaurHauls family 6 model 7 stepping 3
-- also 7/2 and 8/9
-- this is a Via processor

There are also a handful of crashes in GenuineIntel family 5 model 4 stepping 3
-- which is a pentium or pentium-2, if Google is correct

All of these crashes have very low uptime and the page URL is normally the whatsnew page, which means the user won't even be able to go to about:config and disable JIT as a workaround. Nominating for blocking because these crashes appear in b99 builds, which means that users updated from some previous build where Firefox worked correctly, so this is likely a regression.

Graydon, did you have a way of using QEMU to disable processor features to emulate really old processors?

Even if this doesn't block I think we may want to relnote that using Firefox 3.5 with really old processors may bite users.
Flags: blocking1.9.1?
OS: Linux → Windows XP
Moh, can you tell us which processor model "GenuineIntel family 5 model 4 stepping 3" corresponds to and which instructions/features we might be tripping?
Assignee: general → gal
We use 2 advanced processor features: CMOVs and SSE2. We detect SSE2, and enable both if we have SSE2. We should check whether there are any processors that have SSE2, but not CMOVs. I will google around.
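For illustration only, here is a minimal sketch of decoding CMOV and SSE2 independently instead of deriving one from the other. It assumes the caller has already obtained the EDX word of CPUID leaf 1 (CMOV is bit 15, SSE2 is bit 26 in Intel's CPUID documentation). `CpuFeatures` and `decodeFeatureBits` are hypothetical names, not nanojit's, and the function deliberately does not execute CPUID itself.

```cpp
#include <cstdint>

// Hypothetical container for the two features discussed in this bug.
struct CpuFeatures {
    bool cmov;
    bool sse2;
};

// Decode the EDX feature word returned by CPUID leaf 1.
// Bit positions: CMOV = bit 15, SSE2 = bit 26.
// Each flag is read on its own, so SSE2 never implies CMOV or vice versa.
CpuFeatures decodeFeatureBits(std::uint32_t leaf1Edx) {
    CpuFeatures f;
    f.cmov = ((leaf1Edx >> 15) & 1u) != 0;
    f.sse2 = ((leaf1Edx >> 26) & 1u) != 0;
    return f;
}
```

With this shape, a word that has only the MMX bit (bit 23) set yields neither feature, so the code generator would have to take the fallback path for both.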
Pretty sure none of these processors support SSE2. However, I did find something about PAE which looks similar: http://www.virtualdub.org/blog/pivot/entry.php?id=30

Although I don't know how nanojit would be using physical address extensions.
CMOV was introduced in Pentium-Pro (P6) which was after Pentium (P5) and before Pentium-2. SSE2, on the other hand, was introduced way later in Pentium-4 which subsumes Pentium-2. So, CMOV is available on Intel processors that have SSE2.

I'll get back to you regarding the precise details of "GenuineIntel family 5 model 4 stepping 3"
The QEMU environment I'm running XP in can pretend to be any of the following: 

x86           qemu64
x86           phenom
x86         core2duo
x86           qemu32
x86          coreduo
x86              486
x86          pentium
x86         pentium2
x86         pentium3
x86           athlon
x86             n270

So "yes", to some extent. Not every possible model, but I think you'll hit the +CMOV boundary with pentium -> pentium2, +SSE with -> pentium3, and +SSE2 with -> athlon.
Trying to get qemu to build on my mac.
This is an Intel document describing CPUID:
http://developer.intel.ru/download/design/Xeon/applnots/24161826.pdf

According to Table 4 on page 20, Family 5 - Model 4 is a "Pentium processor with MMX™ technology". So, it has MMX but not CMOV nor SSE2.
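As a rough sketch of how the "family / model / stepping" triples in the crash reports relate to the CPUID signature: the EAX value of CPUID leaf 1 carries stepping in bits 3..0, model in bits 7..4 and family in bits 11..8. The names below are hypothetical, and the extended family/model fields in the upper bits are ignored, which is an assumption that only holds for old parts like the ones in this bug.

```cpp
#include <cstdint>

// Hypothetical holder for the three base signature fields.
struct CpuSignature {
    unsigned family;
    unsigned model;
    unsigned stepping;
};

// Split the EAX value of CPUID leaf 1 into its base fields.
// Extended family/model (bits 27..16) are intentionally not applied here.
CpuSignature decodeSignature(std::uint32_t leaf1Eax) {
    CpuSignature s;
    s.stepping = leaf1Eax & 0xFu;
    s.model = (leaf1Eax >> 4) & 0xFu;
    s.family = (leaf1Eax >> 8) & 0xFu;
    return s;
}
```

For example, a signature of 0x543 decodes to family 5, model 4, stepping 3, the Pentium MMX case mentioned above.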
How about trying something simple like patching the code to always assume no CMOV and SSE2 to test the fallback code and make sure it is just not always crashing?
#9: First thing I did. Works like a charm. This has to be the detection code, or we accidentally emit XMM or CMOV code even though the flag is off.
It might be good to add an assert on the availability of CMOV/SSE2 features at codegen time.

Also, here's a link to a more recent version of the doc in #8 for future use:
http://www.intel.com/Assets/PDF/appnote/241618.pdf
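A sketch of what such a codegen-time assert could look like, assuming a toy emitter rather than nanojit's real assembler. `Emitter`, `CpuFeatures` and `emitCmovne` are made-up names; the only encoding fact used is that register-to-register CMOVNE is `0F 45 /r`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical feature flags, filled in by whatever detection code is used.
struct CpuFeatures {
    bool cmov;
    bool sse2;
};

// Toy emitter (not nanojit's): refuses, in debug builds, to emit an
// instruction the detected CPU does not support.
class Emitter {
public:
    explicit Emitter(CpuFeatures features) : features_(features) {}

    // Emit CMOVNE dst, src for 32-bit registers numbered 0..7.
    void emitCmovne(unsigned dst, unsigned src) {
        assert(features_.cmov && "CMOV emitted although the CPU lacks it");
        assert(dst < 8 && src < 8);
        code_.push_back(0x0F);
        code_.push_back(0x45);
        // ModRM byte: mod = 11 (register form), reg = dst, rm = src.
        code_.push_back(static_cast<std::uint8_t>(0xC0 | (dst << 3) | src));
    }

    const std::vector<std::uint8_t>& code() const { return code_; }

private:
    CpuFeatures features_;
    std::vector<std::uint8_t> code_;
};
```

The point of the design is that a mismatch between detection and emission fails loudly in a debug build at the emit site, instead of surfacing later as an illegal instruction at some heap address.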
Tested on simulated-pentium-MMX (family 5, model 4, stepping 3), JIT appears to work, firefox starts up, runs ok.
Also, note that the availability of the SSE/SSE2 feature, as indicated by the CPUID instruction, is not sufficient for correct execution of SSE/SSE2 instructions. The operating system must also support saving and restoring SSE/SSE2 state on context switches to ensure consistent application behavior. This is described, for example, on pages 3-3 and 3-4 of:
http://download.intel.com/design/PentiumII/manuals/24512701.pdf

The suggested approach is to execute an SSE/SSE2 instruction and trap the exception if one occurs.

This could happen in some older versions of Windows (i.e., the processor has SSE/SSE2 support while the OS does not). 

I don't, however, believe this to be the case in this example. Windows XP is very recent.
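The "execute and trap" approach can be sketched as follows. This is a POSIX-only illustration using SIGILL with `sigsetjmp`/`siglongjmp`; on Windows the equivalent would be structured exception handling, which is not shown. `instructionRuns` is a hypothetical helper, the instruction to test is passed in as a plain function pointer, and the sketch is not thread-safe because it uses one global jump buffer.

```cpp
#include <setjmp.h>
#include <signal.h>

// Jump target used only while a probe is running (assumption: single thread).
static sigjmp_buf g_probeEnv;

// SIGILL handler: unwind back to the sigsetjmp call in instructionRuns.
static void onIllegalInstruction(int) {
    siglongjmp(g_probeEnv, 1);
}

// Run `probe` once and report whether it completed without raising SIGILL.
bool instructionRuns(void (*probe)()) {
    struct sigaction action = {};
    struct sigaction previous = {};
    action.sa_handler = onIllegalInstruction;
    sigemptyset(&action.sa_mask);
    if (sigaction(SIGILL, &action, &previous) != 0) {
        return false;  // could not install the handler; assume unsupported
    }

    volatile bool survived = false;
    // Second argument 1 saves the signal mask so it is restored on the jump.
    if (sigsetjmp(g_probeEnv, 1) == 0) {
        probe();
        survived = true;
    }

    sigaction(SIGILL, &previous, nullptr);  // restore the old handler
    return survived;
}
```

A real probe would execute a single SSE2 instruction; a probe that does nothing reports true, and one that raises SIGILL reports false.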
Is there any way to find out what the illegal instruction in question is?
(Safe mode will let them start up and change about:config, perhaps?)
Ed, do you guys have experience with these specific machines? Maybe we should borrow the SSE detection code Adobe uses. I think ours is custom currently.
(In reply to comment #14)
> Is there any way to find out what the illegal instruction in question is?

I don't think so, since it's some random heap address, and Breakpad doesn't record any heap data.
Yes, safe mode would work because we completely disable JIT in safe mode. Perhaps that's an acceptable workaround for 3.5. I'm going to see if I can find somebody with an actual K6 processor to test with.
Bob Tracy (CCed) commented in bug 497455 / bug 480822, and uses a K6.
(In reply to comment #16)
> Ed, do you guys have experience with these specific machines? Maybe we should
> borrow the SSE detection code Adobe uses. I think ours is custom currently.

Here's what we've got

http://hg.mozilla.org/tamarin-redux/file/ea1cfc4f4c4c/platform/win32/win32cpuid.cpp#l66

I can't vouch that it's as precise as what you're looking for, but feel free to take it for a spin.  I'm for putting this (or yours or a blend) right into nanojit, what do you think?
https://bugzilla.mozilla.org/show_bug.cgi?id=480822#c34 was a K6-III and these appear to all be the original K6.
I've gotten confirmation that K6-II (early Athlon) doesn't crash. Still looking for somebody with an original K6.
The Via C3 claimed to be 686-compatible but didn't support CMOV.  Has anybody got one to test this on?
Yeah, we should move this code into nanojit.

Looking at Adobe's code, and reading a bit about cpuid detection, I believe we are actually dying on the cpuid instruction itself. It is not supported by early K6 and some pentiums. We have to check whether cpuid is supported first, before using it to detect SSE2. I will upload a patch in a bit. Can we try to get someone with a K6 to test for us?
Attached patch: patch
Patch, untested on windows
Ok, we need testing:

- On pre CPUID machines (K6, early pentiums).
- Review & testing on Windows. I am not familiar with win32 assembly style (David, can you help?)

Note that I am not checking that the OS supports SSE2. Even Win98 Second Edition does. Win95 is ... well 14 years ago.
I don't think we're dying on CPUID: the crash occurs while running JITted code.

mzz on IRC has the via chip and we're trying to get information from that, stay tuned.
(In reply to comment #17)
> (In reply to comment #14)
> > Is there any way to find out what the illegal instruction in question is?
> I don't think so, since it's some random heap address, and Breakpad doesn't
> record any heap data.

On Windows, if you open the app binary and launch it from MS Visual Studio, when it crashes, you will be taken directly to the disassembly of the code causing the crash, most of the time precisely at the crash point. You don't even need to have the debug symbols which is the case for the JIT'ed code.
We don't have to die on the CPUID instruction. It's enough if the detection isn't reliable and tells us that SSE2 is present when it isn't.
Moh: if we were in contact with someone who could reproduce this, we could do a lot of things. :)
(In reply to comment #30)
> Moh: if we were in contact with someone who could reproduce this, we could do a
> lot of things. :)

don't you have access to the system that reported the crash?
We reproduced *something* on a Via C3 running Linux. This chip doesn't support PAE or the NX bit (which means that all readable pages are automatically executable). The kernel had PaX protection instead (see http://en.wikipedia.org/wiki/PaX).

The mprotect call failed because PaX doesn't allow a page to be writable and executable at the same time. If we remove the abort on mprotect failure (http://mxr.mozilla.org/mozilla-central/source/js/src/nanojit/Nativei386.cpp#271), the program crashes the first time we try to run JIT code (because PaX prevents execution). Disabling PaX allows the JS shell to run correctly.

Since the Via C3 is still in production I'm going to see if MoCo can buy one and get Windows up on it.
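To make the mprotect failure described above concrete, here is a hedged POSIX sketch of a code buffer that is never writable and executable at the same time and that reports, rather than aborts on, a refused mprotect. `CodeBuffer`, `publishCode` and `releaseCode` are hypothetical names, not nanojit's. Whether a hardened kernel permits the second step is up to its policy; the sketch only shows the caller getting a flag it can use to fall back to the interpreter.

```cpp
#include <cstddef>
#include <cstring>
#include <sys/mman.h>

// Hypothetical handle for a block of generated code.
struct CodeBuffer {
    void* base;        // nullptr when the mapping itself failed
    std::size_t size;
    bool executable;   // false when the kernel refused PROT_EXEC
};

// Copy `length` bytes of generated code into fresh memory, then trade the
// write permission for execute permission.
CodeBuffer publishCode(const unsigned char* bytes, std::size_t length) {
    CodeBuffer buf = {nullptr, length, false};
    void* mem = mmap(nullptr, length, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) {
        return buf;
    }
    std::memcpy(mem, bytes, length);
    buf.base = mem;
    // Report a refusal instead of aborting or jumping into the page anyway.
    buf.executable = (mprotect(mem, length, PROT_READ | PROT_EXEC) == 0);
    return buf;
}

// Unmap the buffer and reset the handle.
void releaseCode(CodeBuffer& buf) {
    if (buf.base != nullptr) {
        munmap(buf.base, buf.size);
    }
    buf.base = nullptr;
    buf.executable = false;
}
```

A caller would check `executable` before ever jumping into `base` and keep interpreting when it is false.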
Depends on: 500430
(In reply to comment #31)
> don't you have access to the system that reported the crash?

No, these are crash reports from random users.
(In reply to comment #33)
> (In reply to comment #31)
> > don't you have access to the system that reported the crash?
> 
> No, these are crash reports from random users.

Automated crash reports (in case Moh doesn't grok the "breakpad" references).

/be
Thanks, I got it. So, you have neither the system nor the workload, and there's no way to pass the contents of the crash address to Breakpad. I'll try to see if we still have one of those systems in our lab.
I have not found such a system yet, but browsing the code, I see:

LIns* LirWriter::ins_choose(LIns* cond, LIns* iftrue, LIns* iffalse)
{
		...

	if (true/*avmplus::AvmCore::use_cmov()*/)
	{
		return ins2((iftrue->isQuad() || iffalse->isQuad()) ? LIR_qcmov : LIR_cmov, cond, ins2(LIR_2, iftrue, iffalse));
	}

Isn't "if (true)" a problem? I.e., we use cmov unconditionally.
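To illustrate the concern, here is a sketch of gating the lowering on a detected feature flag, with a branch-free mask fallback for CPUs without CMOV. This is not the tracer's actual fallback; `ChooseLowering`, `pickChooseLowering` and `chooseWithoutCmov` are hypothetical names, and the fallback is modelled on plain 32-bit integers rather than LIR instructions.

```cpp
#include <cstdint>

// Hypothetical lowering strategies for a "choose" (select) operation.
enum class ChooseLowering { ConditionalMove, MaskArithmetic };

// Pick the lowering from the detected CPU feature instead of a hard-coded
// `true`.
ChooseLowering pickChooseLowering(bool cpuHasCmov) {
    return cpuHasCmov ? ChooseLowering::ConditionalMove
                      : ChooseLowering::MaskArithmetic;
}

// Value semantics of the fallback: select without CMOV and without a branch.
// mask is all ones when cond is true and zero otherwise.
std::int32_t chooseWithoutCmov(bool cond, std::int32_t iftrue,
                               std::int32_t iffalse) {
    std::uint32_t mask = 0u - static_cast<std::uint32_t>(cond);
    std::uint32_t t = static_cast<std::uint32_t>(iftrue);
    std::uint32_t f = static_cast<std::uint32_t>(iffalse);
    return static_cast<std::int32_t>((t & mask) | (f & ~mask));
}
```

The mask form costs a few extra ALU instructions but uses only operations that every x86 processor in the crash reports supports.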
Where did you find that code? This should be fixed on release branch and m-c and TM.
(In reply to comment #37)
> Where did you find that code? This should be fixed on release branch and m-c
> and TM.

This is from trunk taken on May 19. I have to refresh.
We found that avmshell crashes on a P3 without SSE2 with the JIT enabled. See bug 500466.
Suggested relnote:

Users with original Pentium or Athlon processors may experience repeated crashes on startup. To fix this, start Firefox in Safe Mode, navigate to "about:config" and change the value of "javascript.options.jit.content" to false.

Sound OK?
Flags: wanted1.9.1.x?
Flags: blocking1.9.1?
Flags: blocking1.9.1-
Since we only have one Pentium report, let's limit it to AMD K6 and VIA C3 processors:

"Users with AMD K6 or Athlon processors may..."
Dammit, I can't type.

"Users with AMD K6 or older Via C3 processors may..."
I did not find a Pentium MMX, but tried RC3 on two different Pentium III Xeon systems (family 6, model 7) running Windows XP and Windows Server 2003.  Firefox starts and the JIT works fine on both. Note that SSE2 is not supported on Pentium III.
So for what it's worth, I just tried rc3 on my Pentium III Linux system (Coppermine, family 6, model 8, stepping 3, no sse2 but has cpuid).  Startup seems to be ok, and the jit seems to work (based on some quick performance tests I ran).

If there's anything else I can test on that system, just tell me!
mrt did some browsing in a reclaimed AMD K6 with Windows NT and didn't experience any crashes... which is good, but mysterious. Anyway I'll watch these crash reports but it's probably not such a big deal.
I was looking at the crashes of JITed code in 3.5b99:
http://crash-stats.mozilla.com/report/list?product=Firefox&version=Firefox%3A3.5b99&platform=windows&query_search=signature&query_type=exact&query=&date=&range_value=1&range_unit=weeks&do_query=1&signature=js_MonitorLoopEdge%28JSContext%2A%2C%20unsigned%20int%26%29

It seems we also have several ACCESS_VIOLATIONs. If by mistake we jump to an address that is valid and in the range but not a correct jump target, we may end up with ILLEGAL_INSTRUCTIONs. This is just a guess.
(In reply to comment #47)
> mrt did some browsing in a reclaimed AMD K6 with Windows NT and didn't
> experience any crashes... which is good, but mysterious. Anyway I'll watch
> these crash reports but it's probably not such a big deal.

Windows XP, actually.  Would sticking SP3 on there affect anything? (thinking about parallels to PaX in comment 32 - SP3 changed some internals, security-wise, no?)
Priority: -- → P3
It's worth a try...
I crashed this morning, with planet.mozilla.org as one of four tabs. Fx was minimised at the time, and the PC was having SP1 installed (apparently a prerequisite for installing SP3, wtf?).  Unfortunately, I'd left it to install so I can't tell you any more.

http://crash-stats.mozilla.com/report/index/58c01ed0-c86a-4745-b5ac-780112090627

It crashed in XPCCallContext, not js_MonitorLoopEdge, and with an access violation, not an illegal instruction.

I wasn't able to duplicate this (browsing /., digg, gmail and planet.m.o).
(In reply to comment #51)

> It crashed in XPCCallContext, not js_MonitorLoopEdge, and with an access
> violation, not an illegal instruction.

That would appear to be a separate issue.  There seem to be many other reports of access violations in XPCCallContext and they seem to NOT be restricted to older CPUs.  I do not seem to be able to find a bug filed on this issue, however.
Testing with 3.5b99 (same buildid as the other crash reports), and not 3.5rc3, I crash on startup:

http://crash-stats.mozilla.com/report/index/a6b1c1db-b2d9-48bc-91c2-8f1e52090627
http://crash-stats.mozilla.com/report/index/1ca2e52b-ced3-48f9-be5a-4cedc2090627
http://crash-stats.mozilla.com/report/index/fd1ce36c-66c1-4ec3-a591-382412090627
http://crash-stats.mozilla.com/report/index/60ba1f64-62b6-41d4-b84f-36dd42090627
http://crash-stats.mozilla.com/report/index/d0c985f1-6209-4a76-96d6-529c72090627
http://crash-stats.mozilla.com/report/index/66fdaf97-0c92-4920-9cf0-b29272090627

(you probably don't need all those ;)

The first start was with planet.m.o and /. but later starts were just a "Restore session" tab and the 3.5b99 startup page; Fx quit before any content was displayed.

Safe mode works.
Setting javascript.options.jit.content to false works.
And, 3.5RC3 works.
The b99 crash is probably us not disabling conditional moves (cmovxx). We fixed that in rc3. The signature looks the same (in both cases you crash on trace somewhere).
(In reply to comment #51)
> http://crash-stats.mozilla.com/report/index/58c01ed0-c86a-4745-b5ac-780112090627
> 
> It crashed in XPCCallContext, not js_MonitorLoopEdge, and with an access
> violation, not an illegal instruction.

This is bug 500936.
I don't know if this is of any use, but libtheora has some CPU identification code that covers a number of different cases with various versions of chipsets:

http://hg.mozilla.org/mozilla-central/file/b4b7eb4407c3/media/libtheora/lib/cpu.c
Are there any more things I can try with this system, or can it go back in the garage?

(Contact me off bug if you want to give it a new home!)
This bug was filed with b99 builds, which don't have the CMOV fix. I suspect that they are just manifestations of that bug. Resolving WFM... beltzner I think you should feel free to remove the relnote as well.
Status: NEW → RESOLVED
Closed: 13 years ago
Flags: wanted1.9.1.x?
Resolution: --- → WORKSFORME
It's not just with old processors ..

I have a Dell Latitude 610 with an Intel DuoCore CPU (T7300 @ 2 GHz) and 3.5 crashes every time I launch it.  I reinstalled 3, which is stable.
Glenn, that's a different bug. Please check about:crashes and search/file a bug.