Closed Bug 51267 Opened 20 years ago Closed 20 years ago

Intermittent failure loading CSS from JARs

Categories

(Core :: Networking, defect, P1)

x86
Windows 2000
defect

Tracking

()

VERIFIED FIXED

People

(Reporter: jrgmorrison, Assigned: sspitzer)

References

Details

(Whiteboard: [nsbeta3+] fix in hand)

Attachments

(4 files)

For a while now, when starting mozilla on Windows (at least) there has been
this error message in the console for debug builds. 

CSSLoaderImpl::DidLoadStyle: Load of URL 'chrome://communicator/skin/menu.css' failed.  Error code: 16389

I haven't seen this error in optimized builds, but there have been a number of
bugs reported recently that basically boil down to a failure to load or
resolve style on widgets and menus.

There seems to be some failure in loading the style sheets from the JAR
files. Here is the stack at the point of the error message, although the 
actual error may be on a different thread (I don't know). 
[build pulled ~9pm 09/03 win2k].

CSSLoaderImpl::DidLoadStyle(nsIStreamLoader * 0x02729660, nsString * 0x00000000, SheetLoadData * 0x02729518, unsigned int 2147500037) line 918
SheetLoadData::OnStreamComplete(SheetLoadData * const 0x02729518, nsIStreamLoader * 0x02729660, nsISupports * 0x00000000, unsigned int 2147500037, unsigned int 0, const char * 0x100a45e8 gCommonEmptyBuffer) line 644
nsStreamLoader::OnStopRequest(nsStreamLoader * const 0x02729664, nsIChannel * 0x0272ef38, nsISupports * 0x00000000, unsigned int 2147500037, const unsigned short * 0x100a45e8 gCommonEmptyBuffer) line 121 + 78 bytes
nsJARChannel::OnStopRequest(nsJARChannel * const 0x0272ef3c, nsIChannel * 0x0272fe40, nsISupports * 0x00000000, unsigned int 2147500037, const unsigned short * 0x100a45e8 gCommonEmptyBuffer) line 925 + 53 bytes
nsOnStopRequestEvent::HandleEvent(nsOnStopRequestEvent * const 0x026f22e0) line 302
nsStreamListenerEvent::HandlePLEvent(PLEvent * 0x026f2358) line 97 + 12 bytes
PL_HandleEvent(PLEvent * 0x026f2358) line 589 + 10 bytes
PL_ProcessPendingEvents(PLEventQueue * 0x00b6fce0) line 526 + 9 bytes
_md_EventReceiverProc(HWND__ * 0x003d04c8, unsigned int 49339, unsigned int 0, long 11992288) line 1059 + 9 bytes
USER32! 77e13eb0()
USER32! 77e1401a()
USER32! 77e192da()
nsAppShellService::Run(nsAppShellService * const 0x00bebf68) line 379
main1(int 1, char * * 0x00957a98, nsISupports * 0x00000000) line 958 + 32 bytes
main(int 1, char * * 0x00957a98) line 1139 + 37 bytes
mainCRTStartup() line 338 + 17 bytes
KERNEL32! 77e87903()
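
The status value in the stack (2147500037) and the error code in the console message (16389) are the same failure seen two ways: 2147500037 is 0x80004005 in hex, and 16389 is 0x4005, its low 16 bits. A minimal sketch of that decoding follows. The helper names are hypothetical, and it is an assumption that the console message prints only the low 16 bits of the status.

```cpp
#include <cstdint>

// Illustrative helpers (hypothetical names, not the Mozilla macros).
// Assumed layout of the 32-bit status: the high bit flags failure and the
// low 16 bits carry the error code.
constexpr std::uint32_t kGenericFailure = 0x80004005u;  // 2147500037 in decimal

// True when the failure bit (bit 31) is set.
constexpr bool StatusFailed(std::uint32_t status) {
    return (status & 0x80000000u) != 0;
}

// Extracts the low 16 bits, i.e. the code part of the status.
constexpr std::uint32_t StatusCode(std::uint32_t status) {
    return status & 0xFFFFu;
}

static_assert(kGenericFailure == 2147500037u, "value seen in the stack trace");
static_assert(StatusCode(kGenericFailure) == 16389u, "value seen in the console");
```

So 16389 by itself only says "generic failure"; it does not identify which step of the load went wrong.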
If this is the clue to bugs like bug 51225 and bug 51164, severity should be raised.
Of course, after I went home, it occurred to me that perhaps the error message
was not bogus ...

http://lxr.mozilla.org/seamonkey/search?string=communicator%2Fskin%2Fmenu.css

shows the only places that this .css file is referenced, but when you track
down the actual file, 
http://lxr.mozilla.org/seamonkey/source/themes/modern/communicator/menu.css,
you find that it is a zero-length file (hence no style to load). 

Shunting this bug over to ben to see if that @import (and jar.mn line) should 
be removed (maybe, maybe not).
Assignee: warren → ben
updating summary from 'mozilla failing to load some stylesheets from the JAR 
files.' 
Summary: mozilla failing to load some stylesheets from the JAR files. → remove(?) @import & jar.mn of communicator/skin/menu.css (blue,modern skins)
That particular one is bogus, but the bug is much worse.  I intermittently have 
random stylesheet load failures, and I've seen this on practically every 
stylesheet we load (from the main global skin to others).

The menu.css message is distinct and seems to be inaccurate.  It should be the 
subject of a separate bug.
Assignee: ben → warren
Fixing summary.
Summary: remove(?) @import & jar.mn of communicator/skin/menu.css (blue,modern skins) → Intermittent failure loading CSS from JARs
Oops ... filed the specific menu.css issue as bug 51317. Although, looking at
this again, it is curious that this is reported as an error when loading a 
zero-length CSS file from a JAR in a debug build. It is not an error in an 
optimized build (on windows), or from the file system in a debug build (linux).
*** Bug 51334 has been marked as a duplicate of this bug. ***
*** Bug 51335 has been marked as a duplicate of this bug. ***
Nominating for nsbeta3, this has been seen on Windows 98 as well.
Keywords: nsbeta3
See bug 51511 - it may be related or not, but it looks like this is a probable 
explanation. I will try to see if any errors pop up in the console when this 
occurs.
This is the entire console output on optimized build when symptoms described in 
bug 51511 manifest themselves:

stdout directed to dynamic console
stderr directed to dynamic console
WEBSHELL+ = 1
 I am inside the initialize
 Hey : You are in QFA Startup
(QFA)Talkback loaded Ok.
WEBSHELL+ = 2
WEBSHELL+ = 3
Setting content window
*** Pulling out the charset
in SetSecurityButton
has multiple monitor apis is 1
WEBSHELL+ = 4
Document http://www.mozilla.org/ loaded successfully
Attached an image.  This was seen by hammerly in today's build.  after talking 
to hyatt, he said that this may also be a symptom of not reading in some CSS 
properly.  Notice that the place for the dropmarker for forward, back, and print 
buttons is not appearing, but an identical image to the button itself is 
instead.  This may be because menubutton.css is not being read properly.

This is crucial for using skins in mozilla.  The end user should never see the 
screenshot i just posted.  Affirming nsbeta3. 
*** Bug 51925 has been marked as a duplicate of this bug. ***
adding myself to the cc list.

I'm wondering if this is related to the intermittent failure to load the
am-server.js file.  (which has been seen on win32 and linux)

see bug #51546
I've seen this too. It's fairly frequent.
I get a failure on one or another chrome css files every couple of times that I 
launch - usually menu.css, but sometimes several at once, like:

CSSLoaderImpl::DidLoadStyle: Load of URL 'chrome://global/skin/splitter.css' failed.  Error code: 16389
CSSLoaderImpl::DidLoadStyle: Load of URL 'chrome://global/skin/tree.css' failed.  Error code: 16389
CSSLoaderImpl::DidLoadStyle: Load of URL 'chrome://global/skin/radio.css' failed.  Error code: 16389

on this launch. The browser looks like crap when this happens, and I think that 
this is really a pretty high priority bug.
Don't know how I'm going to get to this, but we need to fix it. nsbeta3+
Whiteboard: [nsbeta3+]
Should the priority of this bug be higher than a P3?
My $0.02 from a marketing perspective is that this is definitely a P1.  
Ok, this may be redundant, but yes, this should be a higher priority.   This
intermittent problem makes Mozilla seem, well, a bit unprofessional.   "Gee, it
works most of the time. . . . "
for a pretty reproducible case of this, see bug #51546
*** Bug 50652 has been marked as a duplicate of this bug. ***
Warren, is there anything I can do to help get this narrowed down or fixed? I 
have a few spare cycles and can reproduce it pretty regularly...
Mark: You should try to coordinate with Seth who's working on 51546.
No longer blocks: 51546
mark / warren: 

it turns out 51546 is not related to this.  but I'm more than willing to help 
debug this beast, too.
marking this P1, so it stays on the radar.  (everyone seems to agree with p1 on
this beast.)

I'm doing some debugging on this today to see if I can make some headway.
Priority: P3 → P1
I'm getting somewhere, I think.

it looks like we are processing multiple loads of the css file (example:
menu.css) in parallel.

we fail because the realsize (in the nsZipItem) is 0.

I'm trying to figure out why the nsZipItem gets corrupted.

I'll continue to debug.  (all this code is new to me...)
I just noticed that there are *two* menu.css files getting packaged in the jar,
one of length zero.

it's in the top level jar.mn in modern.  I'm going to take it out, repackage and
see if that makes the problem go away.

keep your fingers crossed!
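
A quick way to spot this kind of packaging mistake is to scan the archive's entry list for names that appear more than once. The sketch below is illustrative only: `Entry` and `FindDuplicateNames` are hypothetical names, and the entry list is assumed to have been read from the JAR already.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified view of one archive entry: its name and its
// uncompressed size.
struct Entry {
    std::string name;
    std::size_t realSize;
};

// Returns, in sorted order, the names that occur more than once.
std::vector<std::string> FindDuplicateNames(const std::vector<Entry>& entries) {
    std::map<std::string, int> counts;
    for (const Entry& e : entries) {
        ++counts[e.name];
    }
    std::vector<std::string> dups;
    for (const auto& kv : counts) {
        if (kv.second > 1) {
            dups.push_back(kv.first);
        }
    }
    return dups;
}
```

With two entries of the same name, which one a lookup returns depends on the reader, so a zero-length duplicate can shadow the real file.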
that seems to fix it for me.  I was able to reproduce this bug easily, and now I
can't.  I'm going to fix the jar.mn files and then remove the zero length
menu.css from the tree.

after I get a reviewer, of course.
What about the other files that fail sometimes? See my comment on 2000-09-12 
17:48. I'm really glad you fixed the menu.css one though, that was the most 
common.
It looks like john morrison figured out the problem 10 days ago.

since other people have reported other css failures (tree.css, radio.css, etc.)
I'll look into those problems before I mark this bug fixed. 
*** Bug 52685 has been marked as a duplicate of this bug. ***
adding nbhatla to the cc list.
mark:  I'm going to go try to reproduce the tree.css,radio.css,splitter.css
bugs, and look for similar jar problems (that caused the menu.css bug.)

ok, after some painful debugging, here's what I've found:

the random failures to load files from JARs are caused by "corruption" errors.

we fail to verify the crc32.

see approx line 1145 on nsZipArchive.cpp

nsZipArchive::InflateItem(const nsZipItem * 0x01a82b50, PRFileDesc * 0x00000000, char * 0x00d13020) line 1145
nsZipArchive::ReadInit(const char * 0x0270a500, nsZipRead * * 0x0270a564) line 410 + 18 bytes
nsJARInputStream::Init(nsJAR * 0x00d1c8c0, const char * 0x02709c20) line 110 + 29 bytes
nsJAR::GetInputStream(nsJAR * const 0x00d1c8c0, const char * 0x02709c20, nsIInputStream * * 0x0270ae2c) line 321 + 16 bytes
nsJARChannel::GetInputStream(nsJARChannel * const 0x02709ae8, nsIInputStream * * 0x0270ae2c) line 1035 + 37 bytes
nsFileTransport::Process() line 407 + 71 bytes
nsFileTransport::Run(nsFileTransport * const 0x0270add4) line 362
nsThreadPoolRunnable::Run(nsThreadPoolRunnable * const 0x026b2a40) line 689 + 12 bytes
nsThread::Main(void * 0x026b29f0) line 84 + 26 bytes
_PR_NativeRunThread(void * 0x026b2880) line 399 + 13 bytes
_threadstartex(void * 0x026b26d0) line 212 + 13 bytes
KERNEL32! 77f04ee8()

now I'm off to see why that is.
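
For readers unfamiliar with the check that is failing: after an entry is inflated, a CRC-32 of the output is compared with the value stored in the archive, and a mismatch is treated as corruption. A minimal sketch of that idea, not the nsZipArchive code (`Crc32` and `EntryIsIntact` are hypothetical names, and real code would normally call zlib's `crc32()` instead of this slow bitwise loop):

```cpp
#include <cstddef>
#include <cstdint>

// Bitwise CRC-32 with the reflected polynomial 0xEDB88320, the checksum
// variant used by the ZIP format. Short but slow; shown for clarity only.
std::uint32_t Crc32(const unsigned char* data, std::size_t len) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

// Hypothetical check: does the inflated data match the CRC stored for the entry?
bool EntryIsIntact(const unsigned char* inflated, std::size_t len,
                   std::uint32_t storedCrc) {
    return Crc32(inflated, len) == storedCrc;
}
```

A CRC mismatch on a file that is fine on disk points at the inflate step producing wrong bytes, not at the archive itself.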
adding sgehani to the cc list.

according to the blame logs, he added the crc checking code, so he might be able to
help figure this one out.
well, we think we've got it.

nsZipArchive is not thread safe, but when the error occurs, we see that we
have two threads calling into zlib to inflate an entry.

samir and I are going to try to put a lock to prevent this, and see if it
prevents the problem.

and kill jar loading performance, I'm sure.

samir's guess on why this doesn't show up on XPInstall is that XPInstall does
all extraction on one thread.
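
The shape of the proposed fix, sketched under assumptions: `SharedArchive` is a hypothetical stand-in for an archive reader that keeps per-archive scratch state, and `std::mutex` stands in for whatever lock the real patch uses. The point is only that every extraction takes the same lock before touching the shared state.

```cpp
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical archive reader whose scratch buffer is shared by all callers,
// so two concurrent extractions would overwrite each other's data.
class SharedArchive {
public:
    // Serializes extraction: one thread at a time may use the scratch buffer.
    std::string Extract(const std::string& name) {
        std::lock_guard<std::mutex> guard(mLock);
        mScratch.clear();  // shared state, not safe without the lock
        for (char c : name) {
            mScratch.push_back(c);
        }
        return "data:" + mScratch;
    }

private:
    std::mutex mLock;
    std::string mScratch;
};

// Runs `threads` workers that each extract their own entry `rounds` times and
// returns how many extractions produced the wrong result.
int CountBadExtractions(SharedArchive& archive, int threads, int rounds) {
    std::vector<int> bad(threads, 0);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&archive, &bad, t, rounds] {
            const std::string name = "entry" + std::to_string(t);
            for (int i = 0; i < rounds; ++i) {
                if (archive.Extract(name) != "data:" + name) {
                    ++bad[t];
                }
            }
        });
    }
    for (std::thread& th : pool) {
        th.join();
    }
    int total = 0;
    for (int b : bad) {
        total += b;
    }
    return total;
}
```

The cost is the one noted above: extractions from the same archive no longer overlap, so parallel loads queue up behind the lock.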
*** Bug 52300 has been marked as a duplicate of this bug. ***
fix in hand.  I'll attach the patch.  I'm not seeing any failures anymore, and
none of my failure breakpoints are getting hit.

the downside:  I've added an nsAutoLock to nsJAR which means performance goes down.

correctness, then performance.
whoops, a few extra debugging assertions in that patch.

I'll re-attach an official patch. (with comments, too.)
r=dveditz for what you've got, looking around to see if ::LoadEntry() and 
::GetInputStream() are really the only ones you need to lock.

nsJAR::Open() has a small unrelated problem. If you try to open twice, 
mZip.OpenArchiveWithFileDesc() will return an error, and the open NSPR file 
will never be closed. Does the open need to be locked? The chances of double 
simultaneous use are small, but really it should. Ditto the Close().
Should nsJAR::Extract() be locked as well?
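
One way to avoid the leak on a double open, sketched with hypothetical stand-ins (`Handle`, `OpenFile`, `CloseFile` and `Archive` are not the NSPR or nsJAR APIs, and the global counter exists only to make a leak visible): reject the second Open() before a new file handle is created.

```cpp
#include <string>

// Counts currently open handles so that a leak would be observable.
static int gOpenHandles = 0;

// Hypothetical file handle.
struct Handle {
    bool valid = false;
};

Handle OpenFile(const std::string&) {
    ++gOpenHandles;
    Handle h;
    h.valid = true;
    return h;
}

void CloseFile(Handle& h) {
    if (h.valid) {
        --gOpenHandles;
        h.valid = false;
    }
}

class Archive {
public:
    // Returns false if the archive is already open, without opening a second
    // handle that nothing would ever close.
    bool Open(const std::string& path) {
        if (mHandle.valid) {
            return false;
        }
        mHandle = OpenFile(path);
        return true;
    }

    // Safe to call more than once.
    void Close() { CloseFile(mHandle); }

    ~Archive() { Close(); }

private:
    Handle mHandle;
};
```

The alternative is to keep opening first and close the new handle on the error path; either way the handle must have exactly one owner.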

Since you lock nsJAR::LoadEntry() and it calls nsJAR::GetInputStream() which 
you also lock, don't you get an NSPR error trying to lock the same lock twice?
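
A common way around the double-lock problem, shown as a sketch and not as the actual patch: public methods take the lock once, and the shared work lives in a private helper that assumes the lock is already held. `Jar` and `GetInputStreamLocked` are hypothetical names, and `std::mutex` (which must not be locked twice by the same thread) stands in for the real lock.

```cpp
#include <mutex>
#include <string>

// Hypothetical class illustrating the pattern; not the nsJAR code.
class Jar {
public:
    // Public entry point: takes the lock, then delegates to the helper.
    std::string GetInputStream(const std::string& name) {
        std::lock_guard<std::mutex> guard(mLock);
        return GetInputStreamLocked(name);
    }

    // Also takes the lock once, and calls the unlocked helper, never the
    // public GetInputStream(), so the mutex is not acquired twice.
    std::string LoadEntry(const std::string& name) {
        std::lock_guard<std::mutex> guard(mLock);
        return "loaded:" + GetInputStreamLocked(name);
    }

private:
    // Caller must already hold mLock.
    std::string GetInputStreamLocked(const std::string& name) {
        return "stream(" + name + ")";
    }

    std::mutex mLock;
};
```

A recursive lock would also work, but the split keeps it obvious which code runs under the lock.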
taking this bug from warren.

I've gotten a review from warren, so I'll check this in when the tree opens.

Assignee: warren → sspitzer
accepting.  just waiting for the tree to open...
Status: NEW → ASSIGNED
Whiteboard: [nsbeta3+] → [nsbeta3+] fix in hand
dan, thanks for catching my mistake.

I've got a new and improved patch, I'll attach it now.
marking m18.
Target Milestone: --- → M18
didn't get a chance to check this in tonight, I'll check it in tomorrow.
*** Bug 52754 has been marked as a duplicate of this bug. ***
r=dveditz when you get to it.
fixed.  thanks to dveditz and samir for the help.
Status: ASSIGNED → RESOLVED
Closed: 20 years ago
Resolution: --- → FIXED
changing QA contact to jrgm for verification
QA Contact: tever → jrgm
After several days launching builds (debug and opt) I have seen neither the
error messages nor the failure to resolve style in chrome. So vrfy fixed.
Status: RESOLVED → VERIFIED
*** Bug 50519 has been marked as a duplicate of this bug. ***
Mass removing self from CC list.