Closed Bug 506926 Opened 15 years ago Closed 15 years ago

reftests crash during test run in CompareCanvas

Categories

(Core :: Graphics: Canvas2D, defect)

ARM
Windows Mobile 6 Professional
defect
Not set
normal

Tracking

()

RESOLVED FIXED
Tracking Status
status1.9.2 --- beta5-fixed

People

(Reporter: jmaher, Assigned: crowderbt)

References

Details

Attachments

(1 file, 13 obsolete files)

7.66 KB, patch
crowderbt
: review+
crowderbt
: superreview+
beltzner
: approval1.9.2+
Details | Diff | Splinter Review
While trying to run reftests on windows mobile builds for fennec, I get a crash in xul.dll which points me to this source:
 http://mxr.mozilla.org/mozilla-central/source/dom/base/nsDOMWindowUtils.cpp#639

This is while running the layout/reftets/image/reftest.list batch of reftests (3 total, first test is problematic).

Running reftests have worked in the past (3 weeks ago) on windows mobile, so something has regressed.  I am running these without the python script, but instead by launching them directory with cli arguments to fennec.exe.
What are the values/variables there?
I put a debug build on my device and this is what I see:
v = memcmp(img1->Data(), img2->Data(), size.width * size.height * 4);

img1->mRawPtr->mData = null
img2->mRawPtr->mData = null

Both img1 and img2 have identical mSize attributes(800x1000).

There is no call stack other than:
0x03f9ded8
xul.dll!nsDOMWindowUtils::CompareCanvases(aCanvas1 = 0x00000000, aCanvas2 = 0x0030d3fc, aMaxDifference = 0x00000000, retVal = 0x35bbeab8)
asmXPTC_InvokeByIndex
I am still seeing this on a build from 20090910, from my debugger:

>	xul.dll!nsDOMWindowUtils::CompareCanvases(nsIDOMHTMLCanvasElement* aCanvas1 = 0x01bd6a3c, nsIDOMHTMLCanvasElement* aCanvas2 = 0x01de7a4c, unsigned int* aMaxDifference = 0x2dbbbb38, unsigned int* retVal = 0x2dbbbb48) Line: 639, Byte Offsets: 0x284	C++
joel, can you reproduce this one test case without running any other ref test and without browsing any other pages?  The reason I ask, is that I am wondering if this is a Out-of-memory (OOM) probably caused by stuff that leads up to the test.
this is actually reproduced by doing this:

\tests\fennec\fennec.exe --environ:NO_EM_RESTART=1 -no-remote -profile \tests\reftest\temp1\ -reftest \tests\reftest\tests\layout\reftests\image\reftest.list

I can put whatever path or reftest.list in here and it seems to crash.  This is before a reftest even loads, and on the device the titlebar says reftest: 0/3 (0%), but the main display area has never drawn a fennec window, it is the same background as before running the test.
Assignee: nobody → crowder
Attached patch helps (obsolete) — Splinter Review
This was working for a few moments today, I was able to run reftests, but was crashing on shutdown.  Something I did broke that, though, and now I am getting a crash in pixman_walk_composite_region() where it is calling *compositeRect with what seems to be a bogus function ptr.  More later...
Oh, I think I have an idea what's wrong here.  Something is still trying to use the cached data, following the resize.
Attached patch better (obsolete) — Splinter Review
This one doesn't crash until shutdown, in the cycle-collector somewhere.  Still diagnosing.
Attachment #406579 - Attachment is obsolete: true
Incidentally, I was wrong in comment #7.  The issue in comment #6 was that we were building a pixman image for the DDraw surface which was sized at the source image size, not the size of the DDraw surface.
I'll expound on the jemalloc mayhem tomorrow, as well as do the obvious cleanup this patch needs.
Attachment #406697 - Attachment is obsolete: true
Comment on attachment 406821 [details] [diff] [review]
This one survives startup and shutdown

> #if defined(MOZ_MEMORY_WINCE) && !defined(MOZ_MEMORY_WINCE6)
>-#define	CHUNK_2POW_DEFAULT	21
>+#define	CHUNK_2POW_DEFAULT	20
> #else
> #define	CHUNK_2POW_DEFAULT	20
> #endif

For windows mobile we need to reserve 2Mb (or more) at a time in order to not be limited to the 32Mb of per-process address space.  If we take this change knocking it down to 1Mb we'll end up with the "checkerboard of death" all over again.
I don't see why/how VirtualAlloc()ing larger chunks somehow allows us to exceed the 32MB limit.
After consulting the bones, ritually sacrificing two goats, a small and very cute bunny, and one chicken, I have discovered this dark virtual-memory magic.  The mystical number 21 remains ascendant, though I still don't understand how its invocation improves our sorcery.
Attachment #406821 - Attachment is obsolete: true
Comment on attachment 406868 [details] [diff] [review]
the correct incantation for VirtualFree()

With this patch, I can run the image reftests in debug and release with no crashes.
Attachment #406868 - Flags: superreview?(pavlov)
Attachment #406868 - Flags: review?(bugmail)
Comment on attachment 406868 [details] [diff] [review]
the correct incantation for VirtualFree()

>Bug 506926 - unable to run reftests on Windows Mobile
>
>diff --git a/memory/jemalloc/jemalloc.c b/memory/jemalloc/jemalloc.c
>--- a/memory/jemalloc/jemalloc.c
>+++ b/memory/jemalloc/jemalloc.c
>@@ -2215,7 +2215,12 @@ pages_unmap(void *addr, size_t size)
> 		if (GetLastError() == ERROR_INVALID_PARAMETER) {
> 			MEMORY_BASIC_INFORMATION info;
> 			VirtualQuery(addr, &info, sizeof(info));
>-			if (VirtualFree(info.AllocationBase, 0, MEM_RELEASE))
>+
>+      // Had trouble!  Decommit unilaterally, then release.
>+      if (VirtualFree(info.AllocationBase, 
>+                      size + ((char *)info.BaseAddress -
>+                      (char *)info.AllocationBase), MEM_DECOMMIT) &&
>+          VirtualFree(info.AllocationBase, 0, MEM_RELEASE))
> 				return;
> 		}

two nits, white space.  Align the if with the VirtualQuery. Also I think changing "Decommit unilaterally" to  "Force decommit." would be more clear.

Also, I'd like to note that according to MSDN's documentation, decommitting before releasing is redundant since VirtualFree called with MEM_RELEASE is supposed to decommit the pages then release them.

Finally, since it wasn't clear to you why we use 2Mb chunks, we should probably have a comment about that. The relevant documentation is here: http://msdn.microsoft.com/en-us/library/aa908768.aspx Specifically this section from the remarks
> Before Windows Embedded CE 6.0 , if you call VirtualAlloc with dwSize >= 2 MB, flAllocationType  set to MEM_RESERVE, and flProtect set to PAGE_NOACCESS, it automatically reserves memory at the shared memory region. This preserves per-process virtual memory.

To translate a bit, prior to WinCe 6.0 (Windows Mobile uses WinCe 5.x), all calls to VirtualAlloc with a size <2Mb reserve memory in the process's private address space. That space is only 32Mb, so we're limited to a total of 32Mb of allocations with a chunk size less than 2Mb.

I want to run with this patch a  bit before reviewing the widget part of it.
(In reply to comment #15)
> two nits, white space.  Align the if with the VirtualQuery. Also I think
> changing "Decommit unilaterally" to  "Force decommit." would be more clear.

Mind showing me how you want the if () clause with both virtual queries aligned?

> Also, I'd like to note that according to MSDN's documentation, decommitting
> before releasing is redundant since VirtualFree called with MEM_RELEASE is
> supposed to decommit the pages then release them.

The problem is (I think) that because of the way we've aligned our memory with two allocations (one of which leaves a large chunk of pages set w/ protected-memory flags), one deallocation isn't enough to release and free all of it.  I tried a number of different techniques for releasing memory both inside and outside of jemalloc (in my own app, as you suggested) and this is the only one that seems to be consistently effective against our weird alloc->alloc-aligned code.
 

> Finally, since it wasn't clear to you why we use 2Mb chunks, we should
> probably have a comment about that.

Agreed.


> I want to run with this patch a  bit before reviewing the widget part of it.

Fair enough.
(In reply to comment #16)

> Mind showing me how you want the if () clause with both virtual queries
> aligned?
this is what I had in mind:

	if (VirtualFree(addr, 0, MEM_RELEASE) == 0) {
#if defined(MOZ_MEMORY_WINCE) && !defined(MOZ_MEMORY_WINCE6)
		if (GetLastError() == ERROR_INVALID_PARAMETER) {
			MEMORY_BASIC_INFORMATION info;
			VirtualQuery(addr, &info, sizeof(info));

			// Had trouble!  Force decommit, then release.
			if (VirtualFree(info.AllocationBase, 
					size + ((char *)info.BaseAddress -
						(char *)info.AllocationBase), 
					MEM_DECOMMIT) &&
			    VirtualFree(info.AllocationBase, 0, MEM_RELEASE))
				return;
		}
#endif


> 
> > Also, I'd like to note that according to MSDN's documentation, decommitting
> > before releasing is redundant since VirtualFree called with MEM_RELEASE is
> > supposed to decommit the pages then release them.
> 
> The problem is (I think) that because of the way we've aligned our memory with
> two allocations (one of which leaves a large chunk of pages set w/
> protected-memory flags), one deallocation isn't enough to release and free all
> of it.  I tried a number of different techniques for releasing memory both
> inside and outside of jemalloc (in my own app, as you suggested) and this is
> the only one that seems to be consistently effective against our weird
> alloc->alloc-aligned code.
> 

It wouldn't be the first time that the msdn documentation wasn't spot on, so I trust you.  Decommitting before releasing shouldn't break anything even if the APIs behaved exactly as described, so I think this change is pretty safe. Also, the expectation is that the unaligned allocation is the odd case since once we get an aligned allocation (2Mb + offset), the system allocator should be aligned to our liking, so it shouldn't really hurt us in terms of performance either.
Attached patch cleanup (obsolete) — Splinter Review
I cleaned up the EnsureSize routine as we discussed on IRC and am offering what I hope is a cleaner stab at the whitespacing.
Attachment #406868 - Attachment is obsolete: true
Attachment #407346 - Flags: superreview?
Attachment #407346 - Flags: review?(bugmail)
Attachment #406868 - Flags: superreview?(pavlov)
Attachment #406868 - Flags: review?(bugmail)
Attachment #407346 - Flags: superreview? → superreview?(pavlov)
Fixing tabs in the rest of this function
Attachment #407346 - Attachment is obsolete: true
Attachment #407358 - Flags: superreview?(pavlov)
Attachment #407358 - Flags: review?(bugmail)
Attachment #407346 - Flags: superreview?(pavlov)
Attachment #407346 - Flags: review?(bugmail)
Comment on attachment 407358 [details] [diff] [review]
whitespace fixing for the rest of this function

Argh, missed some.
Attachment #407358 - Attachment is obsolete: true
Attachment #407358 - Flags: superreview?(pavlov)
Attachment #407358 - Flags: review?(bugmail)
Attached patch one more swing (obsolete) — Splinter Review
Attachment #407359 - Flags: superreview?(pavlov)
Attachment #407359 - Flags: review?(bugmail)
Attached patch mayhem! (obsolete) — Splinter Review
Attachment #407359 - Attachment is obsolete: true
Attachment #407363 - Flags: superreview?(pavlov)
Attachment #407363 - Flags: review?(bugmail)
Attachment #407359 - Flags: superreview?(pavlov)
Attachment #407359 - Flags: review?(bugmail)
Attachment #407363 - Attachment is obsolete: true
Attachment #407364 - Flags: superreview?(pavlov)
Attachment #407364 - Flags: review?(bugmail)
Attachment #407363 - Flags: superreview?(pavlov)
Attachment #407363 - Flags: review?(bugmail)
Attachment #407364 - Attachment is obsolete: true
Attachment #407365 - Flags: superreview?(pavlov)
Attachment #407365 - Flags: review?(bugmail)
Attachment #407364 - Flags: superreview?(pavlov)
Attachment #407364 - Flags: review?(bugmail)
Attachment #407365 - Flags: review?(bugmail) → review+
Comment on attachment 407365 [details] [diff] [review]
With the 1 << 21 chunksize comment

this patch doesn't compile
Attachment #407365 - Flags: review+ → review-
Attached patch compiles (obsolete) — Splinter Review
Attachment #407365 - Attachment is obsolete: true
Attachment #407365 - Flags: superreview?(pavlov)
yeah, I'd fixed the compilation bug locally, but hadn't uploaded yet.  This is better (bugzilla interdiff works)
Attachment #407767 - Flags: superreview?(pavlov)
Attachment #407767 - Flags: review?(bugmail)
Attached patch tabs restored! (obsolete) — Splinter Review
Attachment #407767 - Attachment is obsolete: true
Attachment #407767 - Flags: superreview?(pavlov)
Attachment #407767 - Flags: review?(bugmail)
Attachment #407885 - Flags: review?(bugmail)
I can run the reftests without the changes to jemalloc in this patch. Perhaps I'm missing something, but if you can as well I'd prefer not to change what doesn't need to be changed.
Attachment #407885 - Flags: review?(vladimir)
Comment on attachment 407885 [details] [diff] [review]
tabs restored!

r=me without the changes to jemalloc
Attachment #407885 - Flags: review?(bugmail) → review+
Comment on attachment 407885 [details] [diff] [review]
tabs restored!

This looks mostly fine, just a few questions; no comment on the jemalloc stuff, as that's all black magic :-)

>diff --git a/widget/src/windows/nsWindow.h b/widget/src/windows/nsWindow.h
>--- a/widget/src/windows/nsWindow.h
>+++ b/widget/src/windows/nsWindow.h
>@@ -470,9 +470,6 @@ protected:
>   // Graphics
>   HDC                   mPaintDC; // only set during painting
> 
>-  static nsAutoPtr<PRUint8> sSharedSurfaceData;
>-  static gfxIntSize     sSharedSurfaceSize;
>-

Any reason to pull these out of here?  They were in here to avoid polluting the global space, EnsureSharedSurface() could become static in here to give it the right access.  (Or, the globals should just be static.)

>--- a/widget/src/windows/nsWindowGfx.cpp
>+++ b/widget/src/windows/nsWindowGfx.cpp

>+#define WORDSSIZE(x) ((x).width * (x).height)

This scares me a little; are we sure that width*height will never overflow?  (That is, it won't ever come from a user, right?)

>+PRBool
>+EnsureSharedSurfaceSize(gfxIntSize size)
>+{
>+  gfxIntSize screenSize;
>+  screenSize.height = GetSystemMetrics(SM_CYSCREEN);
>+  screenSize.width = GetSystemMetrics(SM_CXSCREEN);
>+
>+  if (WORDSSIZE(screenSize) > WORDSSIZE(size))
>+    size = screenSize;
>+
>+  if (!sSharedSurfaceData || (WORDSSIZE(size) > WORDSSIZE(sSharedSurfaceSize))) {
>+    sSharedSurfaceSize = size;
>+    sSharedSurfaceData = nsnull;
>+    sSharedSurfaceData = (PRUint8 *)malloc(WORDSSIZE(sSharedSurfaceSize) * 4);

No need to assign nsnull first.

Also, for surfaces that are bigger than the screen (which shouldn't happen), this means that we'll keep around a surface of that large size -- that seems bad.  Are we hitting that case?  I'd be more comfortable if there was at least a debug build warning in here that said if size > screensize to warn that that shouldn't happen, and we should figure out why it's happening.  (I don't think it ever makes sense to paint an area larger than the screen in this path.)
Attachment #407885 - Flags: review+ → review?
I'll make the Ensure function and the two globals all static.  Thanks for catching that.

Surface width and height should never overflow, the data always comes from the engine, not from a user (ie., we don't use these for canvas surfaces in content, afaik -- if that's wrong, please show me how).

Actually, I -do- want to assign nsnull first, otherwise (please check my order-of-evaluation and logic here), the data won't be released until after the malloc (that is, the assignment will happen after malloc(), and the original memory won't be released until after the new bits are allocated), which could yield greater fragmentation than otherwise, or even (in limited memory situations) a failure to allocate memory where we might otherwise have been able to, had we released the surface earlier.  Since these surfaces are very large allocations (the largest we ever make in my debugging), I think the extra work/code is worth this potential improvement.

I believe we only get surfaces larger than the screen while running reftests...  does this code need to be smarter about releasing large surfaces where possible?

I found myself wishing that the gfx code actually held a reference to these surfaces, would that be a better approach (for a follow-up bug)?
As discussed on IRC, I'll gut the jemalloc changes before landing.  They can come in a follow-up bug, if I'm able to reproduce the shutdown crash I was seeing before.  As it is, I cannot currently reproduce because of a separate nanojit bug while running my debug build.
Attached patch no jemalloc changes at all (obsolete) — Splinter Review
Carrying forward blassey's review, and this also includes the static changes discussed previously.
Attachment #407885 - Attachment is obsolete: true
Attachment #408901 - Flags: superreview?
Attachment #408901 - Flags: review+
Attachment #407885 - Flags: review?(vladimir)
Attachment #407885 - Flags: review?
Attachment #408901 - Flags: superreview? → superreview?(vladimir)
Attachment #408901 - Flags: superreview?(vladimir) → superreview+
Comment on attachment 408901 [details] [diff] [review]
no jemalloc changes at all

Looks fine, but please add the NS_WARNING if the requested size is bigger than the screen size.
Blocks: 525058
No longer blocks: 525058
Carrying reviews over.
Attachment #408901 - Attachment is obsolete: true
Attachment #408920 - Flags: superreview+
Attachment #408920 - Flags: review+
pushed http://hg.mozilla.org/mozilla-central/rev/1225e71411ad
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Does this need to land on 1.9.2 as well?
(In reply to comment #38)
> Does this need to land on 1.9.2 as well?
Yes, I think so.  We're running all the tests for winmo and wince from 1.9.2 since that is the branch they will be released from.
Attachment #408920 - Flags: approval1.9.2?
Comment on attachment 408920 [details] [diff] [review]
the thing to land

a192=beltzner
Attachment #408920 - Flags: approval1.9.2? → approval1.9.2+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: