Closed Bug 940087 Opened 11 years ago Closed 10 years ago

Shutdown crash in _cairo_hash_table_remove during Android 4.0 Debug mochitest-2

Categories

(Core :: Graphics, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla31

People

(Reporter: gbrown, Assigned: esawin)

References

Details

Attachments

(2 files)

On Android 4.0 Debug (running on Cedar only), mochitest-2 usually completes successfully except for a crash on shutdown.

From https://tbpl.mozilla.org/php/getParsedLog.php?id=30616114&tree=Cedar&full=1#error0:

17:14:36     INFO -  32866 INFO TEST-START | Shutdown
17:14:36     INFO -  32867 INFO Passed:  31746
17:14:36     INFO -  32868 INFO Failed:  0
17:14:36     INFO -  32869 INFO Todo:    489
17:14:36     INFO -  32870 INFO Slowest: 45367ms - /tests/content/html/content/test/forms/test_input_typing_sanitization.html
17:14:36     INFO -  32871 INFO SimpleTest FINISHED
17:14:36     INFO -  INFO | automation.py | Application ran for: 0:09:29.475872
17:14:36     INFO -  INFO | zombiecheck | Reading PID log: /tmp/tmpBJclAgpidlog
17:14:37     INFO -  mozcrash INFO | Downloading symbols from: http://ftp.mozilla.org/pub/mozilla.org/mobile/tinderbox-builds/cedar-android-debug/1384556438/fennec-28.0a1.en-US.android-arm.crashreporter-symbols.zip
17:14:47     INFO -  /data/anr/traces.txt not found
17:14:47  WARNING -  PROCESS-CRASH | Shutdown | application crashed [@ _cairo_hash_table_remove]
17:14:47     INFO -  Crash dump filename: /tmp/tmpMJCrH5/46711953-75d7-b9ea-2e734f1d-7baf1814.dmp
17:14:47     INFO -  Operating system: Android
17:14:47     INFO -                    0.0.0 Linux 3.2.0+ #2 SMP PREEMPT Thu Nov 29 08:06:57 EST 2012 armv7l pandaboard/pandaboard/pandaboard:4.0.4/IMM76I/5:eng/test-keys
17:14:47     INFO -  CPU: arm
17:14:47     INFO -       2 CPUs
17:14:47     INFO -  Crash reason:  SIGSEGV
17:14:47     INFO -  Crash address: 0x0
17:14:47     INFO -  Thread 13 (crashed)
17:14:47     INFO -   0  libxul.so!_cairo_hash_table_remove [cairo-hash.c:e8087e6cbf4d : 488 + 0x2]
17:14:47     INFO -       r4 = 0x635bd3c4    r5 = 0x0000002b    r6 = 0x65a0b060    r7 = 0x730810c0
17:14:47     INFO -       r8 = 0x63743408    r9 = 0x5be05e20   r10 = 0x635bd248    fp = 0x85bdd17b
17:14:47     INFO -       sp = 0x5d466918    lr = 0x629849bf    pc = 0x629849c4
17:14:47     INFO -      Found by: given as instruction pointer in context
17:14:47     INFO -   1  libxul.so!_cairo_scaled_font_map_destroy [cairo-scaled-font.c:e8087e6cbf4d : 417 + 0x7]
17:14:47     INFO -       r4 = 0x65c86000    r5 = 0x730810c0    r6 = 0x63743408    r7 = 0x635bebf8
17:14:47     INFO -       r8 = 0x635be9a4    r9 = 0x635be984   r10 = 0x626f9f5d    fp = 0x5d466c04
17:14:47     INFO -       sp = 0x5d466948    pc = 0x629955c5
17:14:47     INFO -      Found by: call frame info
17:14:47     INFO -   2  libxul.so!_moz_cairo_debug_reset_static_data [cairo-debug.c:e8087e6cbf4d : 64 + 0x3]
17:14:47     INFO -       r4 = 0x69916000    r5 = 0x69916018    r6 = 0x69916020    r7 = 0x5be592b4
17:14:47     INFO -       r8 = 0x5be595e8    r9 = 0x5d466a24   r10 = 0x626f9f5d    fp = 0x5d466c04
17:14:47     INFO -       sp = 0x5d466970    pc = 0x629813bb
17:14:47     INFO -      Found by: call frame info
17:14:47     INFO -   3  libxul.so!gfxPlatform::~gfxPlatform() [gfxPlatform.cpp:e8087e6cbf4d : 588 + 0x3]
17:14:47     INFO -       r4 = 0x69916000    r5 = 0x69916018    r6 = 0x69916020    r7 = 0x5be592b4
17:14:47     INFO -       r8 = 0x5be595e8    r9 = 0x5d466a24   r10 = 0x626f9f5d    fp = 0x5d466c04
17:14:47     INFO -       sp = 0x5d466978    pc = 0x6276b807
17:14:47     INFO -      Found by: call frame info
17:14:47     INFO -   4  libxul.so!gfxAndroidPlatform::~gfxAndroidPlatform() [gfxAndroidPlatform.cpp:e8087e6cbf4d : 127 + 0x9]
17:14:47     INFO -       r4 = 0x69916000    r5 = 0x69916074    r6 = 0x6398f3c0    r7 = 0x5be592b4
17:14:47     INFO -       r8 = 0x5be595e8    r9 = 0x5d466a24   r10 = 0x626f9f5d    fp = 0x5d466c04
17:14:47     INFO -       sp = 0x5d466988    pc = 0x62749771
17:14:47     INFO -      Found by: call frame info
17:14:47     INFO -   5  libxul.so!gfxAndroidPlatform::~gfxAndroidPlatform() [gfxAndroidPlatform.cpp:e8087e6cbf4d : 127 + 0x3]
17:14:47     INFO -       r4 = 0x69916000    r5 = 0x5be59550    r6 = 0x00000038    r7 = 0x5be592b4
17:14:47     INFO -       r8 = 0x5be595e8    r9 = 0x5d466a24   r10 = 0x626f9f5d    fp = 0x5d466c04
17:14:47     INFO -       sp = 0x5d466998    pc = 0x62749785
17:14:47     INFO -      Found by: call frame info
17:14:47     INFO -   6  libxul.so!gfxPlatform::Shutdown() [gfxPlatform.cpp:e8087e6cbf4d : 565 + 0x5]
17:14:47     INFO -       r4 = 0x69274c70    r5 = 0x5be59550    r6 = 0x00000038    r7 = 0x5be592b4
17:14:47     INFO -       r8 = 0x5be595e8    r9 = 0x5d466a24   r10 = 0x626f9f5d    fp = 0x5d466c04
17:14:47     INFO -       sp = 0x5d4669a0    pc = 0x6276beed
17:14:47     INFO -      Found by: call frame info
17:14:47     INFO -   7  libxul.so!nsComponentManagerImpl::KnownModule::~KnownModule() [nsComponentManager.h:e8087e6cbf4d : 224 + 0x1]
17:14:47     INFO -       r4 = 0x5beb3140    r5 = 0x5be59550    r6 = 0x00000038    r7 = 0x5be592b4
17:14:47     INFO -       r8 = 0x5be595e8    r9 = 0x5d466a24   r10 = 0x626f9f5d    fp = 0x5d466c04
17:14:47     INFO -       sp = 0x5d4669b8    pc = 0x6271497d
17:14:47     INFO -      Found by: call frame info
17:14:47     INFO -   8  libxul.so!nsAutoPtr<nsComponentManagerImpl::KnownModule>::~nsAutoPtr() [nsAutoPtr.h:e8087e6cbf4d : 77 + 0x5]
17:14:47     INFO -       r4 = 0x5beb3140    r5 = 0x5be59550    r6 = 0x00000038    r7 = 0x5be592b4
17:14:47     INFO -       r8 = 0x5be595e8    r9 = 0x5d466a24   r10 = 0x626f9f5d    fp = 0x5d466c04
17:14:47     INFO -       sp = 0x5d4669c0    pc = 0x627179cf
17:14:47     INFO -      Found by: call frame info
17:14:47     INFO -   9  libxul.so!nsTArray_Impl<nsAutoPtr<nsComponentManagerImpl::KnownModule>, nsTArrayInfallibleAllocator>::RemoveElementsAt(unsigned int, unsigned int) [nsTArray.h:e8087e6cbf4d : 534 + 0x3]
17:14:47     INFO -       r4 = 0x00000000    r5 = 0x5be59554    r6 = 0x00000038    r7 = 0x5be592b4
17:14:47     INFO -       r8 = 0x5be595e8    r9 = 0x5d466a24   r10 = 0x626f9f5d    fp = 0x5d466c04
17:14:47     INFO -       sp = 0x5d4669d0    pc = 0x62717a2f
17:14:47     INFO -      Found by: call frame info
17:14:47     INFO -  10  libxul.so!nsComponentManagerImpl::Shutdown() [nsComponentManager.cpp:e8087e6cbf4d : 808 + 0x7]
17:14:47     INFO -       r4 = 0x5be59200    r5 = 0x626ed107    r6 = 0x63743408    r7 = 0x63743408
17:14:47     INFO -       r8 = 0x63203f61    r9 = 0x5d466a24   r10 = 0x626f9f5d    fp = 0x5d466c04
17:14:47     INFO -       sp = 0x5d4669f0    pc = 0x62717b81
17:14:47     INFO -      Found by: call frame info
17:14:47     INFO -  11  libxul.so!mozilla::ShutdownXPCOM(nsIServiceManager*) [nsXPComInit.cpp:e8087e6cbf4d : 778 + 0x3]
17:14:47     INFO -       r4 = 0x5d466a1c    r5 = 0x00000000    r6 = 0x63383be8    r7 = 0x63743408
17:14:47     INFO -       r8 = 0x63203f61    r9 = 0x5d466a24   r10 = 0x626f9f5d    fp = 0x5d466c04
17:14:47     INFO -       sp = 0x5d466a08    pc = 0x626f10f5
17:14:47     INFO -      Found by: call frame info
17:14:47     INFO -  12  libxul.so!ScopedXPCOMStartup::~ScopedXPCOMStartup() [nsAppRunner.cpp:e8087e6cbf4d : 1132 + 0x5]
17:14:47     INFO -       r4 = 0x5d466a4c    r5 = 0x5be3b144    r6 = 0x63743408    r7 = 0x00000000
17:14:47     INFO -       r8 = 0x63743408    r9 = 0x00000000   r10 = 0x630a6a63    fp = 0x5d466c04
17:14:47     INFO -       sp = 0x5d466a48    pc = 0x61a11717
17:14:47     INFO -      Found by: call frame info
17:14:47     INFO -  13  libxul.so!XREMain::XRE_main(int, char**, nsXREAppData const*) [nsAppRunner.cpp:e8087e6cbf4d : 4069 + 0x5]
17:14:47     INFO -       r4 = 0x5d466a94    r5 = 0x00000000    r6 = 0x5be3b144    r7 = 0x00000000
17:14:47     INFO -       r8 = 0x63743408    r9 = 0x00000000   r10 = 0x630a6a63    fp = 0x5d466c04
17:14:47     INFO -       sp = 0x5d466a60    pc = 0x61a161d5
17:14:47     INFO -      Found by: call frame info
17:14:47     INFO -  14  libxul.so!XRE_main [nsAppRunner.cpp:e8087e6cbf4d : 4246 + 0x3]
17:14:47     INFO -       r4 = 0x00000000    r5 = 0x5bc3ec90    r6 = 0x5be35148    r7 = 0x0000000c
17:14:47     INFO -       r8 = 0x5d466ba8    r9 = 0x5bc3ec90   r10 = 0x630a6a63    fp = 0x5d466c04
17:14:47     INFO -       sp = 0x5d466a90    pc = 0x61a16355
17:14:47     INFO -      Found by: call frame info
17:14:47     INFO -  15  libxul.so!GeckoStart [nsAndroidStartup.cpp:e8087e6cbf4d : 73 + 0xf]
17:14:47     INFO -       r4 = 0x5d466ba4    r5 = 0x5be53180    r6 = 0x63743408    r7 = 0x00000000
17:14:47     INFO -       r8 = 0x5d466ba8    r9 = 0x5bc3ec90   r10 = 0x630a6a63    fp = 0x5d466c04
17:14:47     INFO -       sp = 0x5d466ba0    pc = 0x61a101dd
17:14:47     INFO -      Found by: call frame info
Assignee: nobody → snorp
Assignee: snorp → esawin
There is a similar crash in some crashtests, again, only on Android 4.0 Debug:

https://tbpl.mozilla.org/php/getParsedLog.php?id=33337466&tree=Cedar&full=1#error20
The crash is reproducible on optimal builds with enabled cairo_debug_reset_static_data() at http://dxr.mozilla.org/mozilla-central/source/gfx/thebes/gfxPlatform.cpp#599 , but it's non-deterministic:
https://tbpl.mozilla.org/?tree=Try&rev=1b3596cfebec
Cache entries may expire and be removed from the hash table before the reset, so we need to check whether the entry for a given key still exists before removing it. Besides solving the bug, this patch makes _cairo_hash_table_remove more safe and idiomatic for a hash table remove-by-key function.

Opt tests: https://tbpl.mozilla.org/?tree=Try&rev=608da6bc6c6b
Opt (with activated cairo reset) tests: https://tbpl.mozilla.org/?tree=Try&rev=7175c6adb6e6
Attachment #8367270 - Flags: review?(jmuizelaar)
Comment on attachment 8367270 [details] [diff] [review]
cairo-hash-table-remove.patch

Review of attachment 8367270 [details] [diff] [review]:
-----------------------------------------------------------------

I'd rather not change the semantics of this function without also doing it upstream. Can we just fix the caller instead?
Attachment #8367270 - Flags: review?(jmuizelaar) → review-
(In reply to Jeff Muizelaar [:jrmuizel] from comment #5)
> Comment on attachment 8367270 [details] [diff] [review]
> cairo-hash-table-remove.patch
> 
> Review of attachment 8367270 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I'd rather not change the semantics of this function without also doing it
> upstream. Can we just fix the caller instead?

We can move the check out of the function into the caller, but that would introduce overhead due to the linear probing in the hash table lookup and leave the remove function unsafe.

That leaves us with the options:
1. Move the check up into caller: additional lookup per entry overhead.
2. Move the check up into caller and add remove-by-entry function to cairo: no overhead, cairo extension.
3. Keep the check in the remove function: no overhead, safer remove function, changed semantics (silent ignore instead of crash).
4. Keep the check in the remove function and let it return whether it was successful: no overhead, safer remove function, changed semantics, changed interface, most idiomatic behavior for a remove-by-key function.

I think solution 4 is the "right way", with 2 being the workaround without overhead. But since it's just the reset function, option 1 is also acceptable.
(In reply to Eugen Sawin (:esawin) <esawin@mozilla.com> from comment #6)
> (In reply to Jeff Muizelaar [:jrmuizel] from comment #5)
> > Comment on attachment 8367270 [details] [diff] [review]
> > cairo-hash-table-remove.patch
> > 
> > Review of attachment 8367270 [details] [diff] [review]:
> > -----------------------------------------------------------------
> > 
> > I'd rather not change the semantics of this function without also doing it
> > upstream. Can we just fix the caller instead?
> 
> We can move the check out of the function into the caller, but that would
> introduce overhead due to the linear probing in the hash table lookup and
> leave the remove function unsafe.
> 
> That leaves us with the options:
> 1. Move the check up into caller: additional lookup per entry overhead.
> 2. Move the check up into caller and add remove-by-entry function to cairo:
> no overhead, cairo extension.
> 3. Keep the check in the remove function: no overhead, safer remove
> function, changed semantics (silent ignore instead of crash).
> 4. Keep the check in the remove function and let it return whether it was
> successful: no overhead, safer remove function, changed semantics, changed
> interface, most idiomatic behavior for a remove-by-key function.
> 
> I think solution 4 is the "right way", with 2 being the workaround without
> overhead. But since it's just the reset function, option 1 is also
> acceptable.

In this case the caller is a debug destruction helper so doing an extra lookup is not a problem.

However, one thing that's not clear to me is how fonts in the holdover table manage to not have entries in the hash table. How does that happen?
(In reply to Jeff Muizelaar [:jrmuizel] from comment #7)
> (In reply to Eugen Sawin (:esawin) <esawin@mozilla.com> from comment #6)
> > (In reply to Jeff Muizelaar [:jrmuizel] from comment #5)
> > > Comment on attachment 8367270 [details] [diff] [review]
> > > cairo-hash-table-remove.patch
> > > 
> > > Review of attachment 8367270 [details] [diff] [review]:
> > > -----------------------------------------------------------------
> > > 
> > > I'd rather not change the semantics of this function without also doing it
> > > upstream. Can we just fix the caller instead?
> > 
> > We can move the check out of the function into the caller, but that would
> > introduce overhead due to the linear probing in the hash table lookup and
> > leave the remove function unsafe.
> > 
> > That leaves us with the options:
> > 1. Move the check up into caller: additional lookup per entry overhead.
> > 2. Move the check up into caller and add remove-by-entry function to cairo:
> > no overhead, cairo extension.
> > 3. Keep the check in the remove function: no overhead, safer remove
> > function, changed semantics (silent ignore instead of crash).
> > 4. Keep the check in the remove function and let it return whether it was
> > successful: no overhead, safer remove function, changed semantics, changed
> > interface, most idiomatic behavior for a remove-by-key function.
> > 
> > I think solution 4 is the "right way", with 2 being the workaround without
> > overhead. But since it's just the reset function, option 1 is also
> > acceptable.
> 
> In this case the caller is a debug destruction helper so doing an extra
> lookup is not a problem.
> 
> However, one thing that's not clear to me is how fonts in the holdover table
> manage to not have entries in the hash table. How does that happen?

Further investigations have revealed that the holdover table becomes inconsistent during the execution of cairo_scaled_font_destroy: https://tbpl.mozilla.org/?tree=Try&rev=e38a109ef5fc.
Since the error is not deterministically reproducible, I assume this inconsistency is caused by improper mutex locking and some of the code comments do hint at such a possibility.

Regarding code comments, the comment at http://dxr.mozilla.org/mozilla-central/source/gfx/cairo/cairo/src/cairo-scaled-font.c#320 does not imply a strong coupling between the hash table entries and the holdovers. This would make the check as proposed by options 3 and 4 in my previous comment necessary. Another comment at http://dxr.mozilla.org/mozilla-central/source/gfx/cairo/cairo/src/cairo-hash.c#474 does propose that the proposed solution 4 would actually implement the remove function as it was intended.

Since the tests show that the errors happen within Cairo functions and not between calls, I don't see a way to avoid upstreaming and I would be inclined to go with option 4 and correct the semantics of the remove function.
This is the line, where we add a font to the holdovers, which is not necessarily contained in the hash table: http://dxr.mozilla.org/mozilla-central/source/gfx/cairo/cairo/src/cairo-scaled-font.c#1261

I think we can take the description at http://dxr.mozilla.org/mozilla-central/source/gfx/cairo/cairo/src/cairo-scaled-font.c#320 literally and consider the holdovers and the hash table as decoupled.
Flags: needinfo?(jmuizelaar)
(In reply to Eugen Sawin (:esawin) <esawin@mozilla.com> from comment #8)
> Further investigations have revealed that the holdover table becomes
> inconsistent during the execution of cairo_scaled_font_destroy:
> https://tbpl.mozilla.org/?tree=Try&rev=e38a109ef5fc.
> Since the error is not deterministically reproducible, I assume this
> inconsistency is caused by improper mutex locking and some of the code
> comments do hint at such a possibility.

I don't think there are any cases where we use cairo on multiple threads on Android so this shouldn't be related to mutex locking.

(In reply to Eugen Sawin (:esawin) <esawin@mozilla.com> from comment #9)
> This is the line, where we add a font to the holdovers, which is not
> necessarily contained in the hash table:
> http://dxr.mozilla.org/mozilla-central/source/gfx/cairo/cairo/src/cairo-
> scaled-font.c#1261

Under what conditions would the scaled_font not be in the hash table? From my reading, all live fonts should be in the hash table and so remain in the hash table when added to the holdover table in scaled_font_destroy.

> I think we can take the description at
> http://dxr.mozilla.org/mozilla-central/source/gfx/cairo/cairo/src/cairo-
> scaled-font.c#320 literally and consider the holdovers and the hash table as
> decoupled.

I'm not convinced of this.
Flags: needinfo?(jmuizelaar)
Eugen, what's the plan here?
Flags: needinfo?(esawin)
The tests showed that we try to delete fonts (and consequently put them into the holdovers), which are not contained in the hash table. The previously proposed patch fixed that by making the hash table remove function safe.

However, the real issue is that on Android, we call |cairo_debug_reset_static_data()| twice for debug builds. First time we call it at http://dxr.mozilla.org/mozilla-central/source/gfx/thebes/gfxAndroidPlatform.cpp#136 and the second time in its base class destructor at http://dxr.mozilla.org/mozilla-central/source/gfx/thebes/gfxPlatform.cpp#530.

According to code comments, calling |cairo_debug_reset_static_data()| before |SkGraphics::Term()| is unsafe, which adds to the issue. If calling |FT_Done_Library(gPlatformFTLibrary)| before |cairo_debug_reset_static_data()| is safe, we can simply remove the cairo resetting in |~gfxAndroidPlatform()|, otherwise we need to add a preprocessor switch in |~gfxPlatform()|.

Results show green for the first solution: https://tbpl.mozilla.org/?tree=Try&rev=61b9f49a980b.
I will create the proper patches for both solutions for review next.
Flags: needinfo?(esawin) → needinfo?(jmuizelaar)
This patch prevents double resetting of Cairo data on debug builds by removing resetting for optimal builds.
Attachment #8392198 - Flags: review?(jmuizelaar)
Flags: needinfo?(jmuizelaar)
Attachment #8392198 - Flags: review?(jmuizelaar) → review+
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/ee01dea0ae3b
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla31
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: