Intermittent hang in ShaderProgramOGL::CreateShader during Android 4.0 tests

RESOLVED FIXED in Firefox 34

Status

()

defect
RESOLVED FIXED
5 years ago
9 months ago

People

(Reporter: gbrown, Assigned: glandium)

Tracking

unspecified
mozilla36
ARM
Android
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox33 unaffected, firefox34+ fixed, firefox35+ fixed, firefox36+ fixed, firefox38 fixed, firefox-esr24 unaffected, firefox-esr31 wontfix, fennec34+)

Details

Attachments

(3 attachments, 5 obsolete attachments)

In bug 1054292 (and perhaps bug 1054456 and some others, which look similar), Android 4.0 tests frequently fail with a timeout on startup.

Bug 1054292 affects nearly all Android 4.0 test jobs and was the #1 test failure on this week's War on Orange report.

https://tbpl.mozilla.org/?tree=Try&rev=4c002148f80b adds logging to trace the hang. https://tbpl.mozilla.org/php/getParsedLog.php?id=46938795&tree=Try&full=1 reproduces the hang and shows:

05:53:04     INFO -  08-28 05:47:04.523 I/Gecko   ( 2100): GYB: CompositorOGL::Initialize
05:53:04     INFO -  08-28 05:47:04.523 I/Gecko   ( 2100): Attempting load of libEGL.so
05:53:04     INFO -  08-28 05:47:04.531 I/Gecko   ( 2100): Can't find symbol '_Z35eglQueryStringImplementationANDROIDPvi'.
05:53:04     INFO -  08-28 05:47:04.625 I/Gecko   ( 2100): GYB: Initialize calls GetShaderProgramFor
05:53:04     INFO -  08-28 05:47:04.625 I/Gecko   ( 2100): GYB GetShaderProgramFor 1
05:53:04     INFO -  08-28 05:47:04.625 I/Gecko   ( 2100): GYB: ShaderProgramOGL::Initialize
05:53:04     INFO -  08-28 05:47:04.625 I/Gecko   ( 2100): GYB: ShaderProgramOGL::Initialize 1
05:53:04     INFO -  08-28 05:47:04.625 I/Gecko   ( 2100): GYB: ShaderProgramOGL::CreateProgram
05:53:04     INFO -  08-28 05:47:04.625 I/Gecko   ( 2100): GYB: CreateShader 1

...then no more activity from Gecko for 5+ minutes.

That suggests the hang is in ShaderProgramOGL::CreateShader, during mGL->fCompileShader(sh) -- http://hg.mozilla.org/mozilla-central/annotate/2a15dc07ddaa/gfx/layers/opengl/OGLShaderProgram.cpp#l445.
OS: Linux → Android
Hardware: x86_64 → ARM
Great investigative work here - thank you! :-)
#1 (and likely #3, and likely a bunch more - see deps) orange needs an owner. Milan, can you help please?
:snorp, is this something that's up your alley?
Flags: needinfo?(milan) → needinfo?(snorp)
(In reply to Milan Sreckovic [:milan] from comment #3)
> :snorp, is this something that's up your alley?

Maybe, but I might have trouble finding cycles for it.

It looks like the panda GL driver might just be crapping out? Not sure what we'll be able to do about that. I wish we knew which shader was hanging, though.
Flags: needinfo?(snorp)
George, would you have cycles to look at this otherwise?
I have a Panda in my possession, so really I should just have a look.
Assignee: nobody → snorp
I was able to reproduce the hang (I think), after many tries running reftests. The stack I have in the compositor thread is:

#0  0x400c61f8 in __futex_syscall3 () from /Users/snorp/source/jimdb-arm/lib/0123456789ABCDEF/system/lib/libc.so
#1  0x400cadb4 in pthread_mutex_lock () from /Users/snorp/source/jimdb-arm/lib/0123456789ABCDEF/system/lib/libc.so
#2  0x400ccd5c in dlmalloc () from /Users/snorp/source/jimdb-arm/lib/0123456789ABCDEF/system/lib/libc.so
#3  0x400d000a in malloc () from /Users/snorp/source/jimdb-arm/lib/0123456789ABCDEF/system/lib/libc.so
#4  0x4014e274 in inflateInit2_ () from /Users/snorp/source/jimdb-arm/lib/0123456789ABCDEF/system/lib/libz.so
#5  0x5c11c194 in SeekableZStream::DecompressChunk (this=0x5c32e160, where=0x61c73000, chunk=1555, length=16384) at /Users/snorp/source/gecko-dev/mozglue/linker/SeekableZStream.cpp:84
#6  0x5c11b1de in MappableSeekableZStream::ensure (this=0x5c32e150, addr=<optimized out>) at /Users/snorp/source/gecko-dev/mozglue/linker/Mappable.cpp:531
#7  0x5c11a20a in SEGVHandler::handler (signum=11, info=0x6c8fdb40, context=0x6c8fdbc0) at /Users/snorp/source/gecko-dev/mozglue/linker/ElfLoader.cpp:1138
#8  0xffff0514 in ?? ()
#9  0xffff0514 in ?? ()

If accurate, this seems to be a problem in the on-demand library loading. Investigating a little more.
Ah, so we are attempting to acquire the malloc() mutex here within a signal handler. As we know from doing profiler work, this is bad, because the lock can already be held elsewhere. With all threads suspended during signal handler execution, we have a deadlock. I am not sure if it's possible to avoid a malloc() here, but that looks like the best way to solve this.
Flags: needinfo?(mh+mozilla)
I thought it might be possible to use the zalloc and zfree hooks in the z_stream to return stack allocated stuff, so I started playing with that. It looks like it needs at most two allocations in order to inflate one chunk. The first one is the internal inflate_state struct, which it uses to (surprise!) keep state. That seems to be a fixed size of 7116 bytes. Sometimes there is also a second allocation, which is related to the sliding window. I am not sure why I don't see that every time, but it always seems to be 32k.

Unfortunately, when I return stack allocated buffers for these two, I get a crash. I'm not sure why yet, but it seems like this should be doable.
Posted patch wip (obsolete) — Splinter Review
This doesn't work, but I don't really understand why not. It crashes after a dozen or so chunks have been inflated.
This version just preallocates the buffers on the heap during SeekableZStream construction instead of messing around with the stack. Seems to work locally, try run going here: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=77332663dffc
Attachment #8483919 - Attachment is obsolete: true
Attachment #8484357 - Flags: review?(mh+mozilla)
Assignee

Comment 12

5 years ago
Stepping back a bit:

(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #8)
> With all threads suspended during signal handler execution, we have a deadlock.

This actually shouldn't be the case. That is, other threads should continue running during signal handler execution (and I just checked it's the case, although I didn't try on an arm device, but I don't think this should matter).

With that eliminated, the only way to get in that decompression code is in case a segfault occurred 
in the address space of a library loaded by our linker. That only happens if running code from that library, or read/writing data from/to it.

Now, afaik, the signal handler for a segfault runs on the thread that segfaulted, so the only way we'd get in the decompression code is if code on that thread is calling code from e.g. libxul, or read/writing data from/to it.

And for a deadlock to happen, that code would need to be holding the system malloc lock at the same time. How can that happen?

So, we'd be in a situation where the signal is not dispatched to the thread that segfaulted?

I'd rather have a clear view of the situation before saying anything else.
Flags: needinfo?(mh+mozilla)
Assignee

Comment 13

5 years ago
Comment on attachment 8484357 [details] [diff] [review]
Avoid malloc in on-demand linker

Review of attachment 8484357 [details] [diff] [review]:
-----------------------------------------------------------------

Considering several threads could be running a decompression in parallel, it's not safe to use the same buffer for all of them.
Attachment #8484357 - Flags: review?(mh+mozilla) → review-
(In reply to Mike Hommey [:glandium] (out from Sep 6 to Sep 22) from comment #12)
> Stepping back a bit:
> 
> (In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #8)
> > With all threads suspended during signal handler execution, we have a deadlock.
> 
> This actually shouldn't be the case. That is, other threads should continue
> running during signal handler execution (and I just checked it's the case,
> although I didn't try on an arm device, but I don't think this should
> matter).

Indeed, this does appear to be the case, and docs back that up.

> 
> With that eliminated, the only way to get in that decompression code is in
> case a segfault occurred 
> in the address space of a library loaded by our linker. That only happens if
> running code from that library, or read/writing data from/to it.
> 
> Now, afaik, the signal handler for a segfault runs on the thread that
> segfaulted, so the only way we'd get in the decompression code is if code on
> that thread is calling code from e.g. libxul, or read/writing data from/to
> it.
> 
> And for a deadlock to happen, that code would need to be holding the system
> malloc lock at the same time. How can that happen?
> 
> So, we'd be in a situation where the signal is not dispatched to the thread
> that segfaulted?

Yeah, weirdness. I'm going to try to reproduce the hang again and get a stack trace for all threads.
I attached a file containing the stack traces of all threads during the hang. Here we can see there are actually several threads waiting on the malloc mutex. Thread 35 is the Compositor thread, and malloc is called from within libz. The stack trace goes south after that for some reason, but my guess is that we're in the signal handler. Threads 11 and 9 are Dalvik-owned, and are calling malloc from a dalvik/Android stuff. Thread 1 is the Android UI thread, and is calling malloc from a dalvik function.

I think it is generally known that using pthread mutexes from signal handlers is Bad(tm), but I'm having trouble finding the specific reason. I think it's mostly due to reentrancy issues (trying to lock a mutex from a signal handler in a thread that already has the mutex locked). I see that we set SA_NODEFER on the linker signal handler, which means reentrancy in the signal handler itself is allowed. There is no way anything in CustomElf could use something that CustomElf itself loaded, though, so I don't see how that could be an issue.
One interesting thing I noticed is that the address we segfaulted at is 0x6542884d. ElfLoader::GetHandleByPtr() think that's in libxul, but 'info sharedlibrary' disagrees. It claims libxul is from 0x63ef1140 to 0x65283914. Maybe the gdb info isn't accurate?
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #17)
> One interesting thing I noticed is that the address we segfaulted at is
> 0x6542884d. ElfLoader::GetHandleByPtr() think that's in libxul, but 'info
> sharedlibrary' disagrees. It claims libxul is from 0x63ef1140 to 0x65283914.
> Maybe the gdb info isn't accurate?

I suppose it's possible the current fault would end up just expanding that range.
Assignee

Comment 19

5 years ago
Which one is the crashing thread? #35?
(In reply to Mike Hommey [:glandium] (out from Sep 6 to Sep 22) from comment #19)
> Which one is the crashing thread? #35?

You mean which one is running the SEGV handler? That's 35, yes.
Assignee

Comment 21

5 years ago
And which one is the one that actually crashed?
(In reply to Mike Hommey [:glandium] (out from Sep 6 to Sep 22) from comment #21)
> And which one is the one that actually crashed?

Nothing crashed. The SEGV handler is just the fault from the on-demand linker. It's deadlocked on the malloc mutex.
Assignee

Comment 23

5 years ago
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #22)
> (In reply to Mike Hommey [:glandium] (out from Sep 6 to Sep 22) from comment
> #21)
> > And which one is the one that actually crashed?
> 
> Nothing crashed. The SEGV handler is just the fault from the on-demand
> linker. It's deadlocked on the malloc mutex.

It can't be locked on the malloc mutex *and* have triggered the on-demand linker. So the question remains: which thread is it that caused the segfault?
(In reply to Mike Hommey [:glandium] (out from Sep 6 to Sep 22) from comment #23)
> (In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #22)
> > (In reply to Mike Hommey [:glandium] (out from Sep 6 to Sep 22) from comment
> > #21)
> > > And which one is the one that actually crashed?
> > 
> > Nothing crashed. The SEGV handler is just the fault from the on-demand
> > linker. It's deadlocked on the malloc mutex.
> 
> It can't be locked on the malloc mutex *and* have triggered the on-demand
> linker. So the question remains: which thread is it that caused the segfault?

I see what you're saying. We lose the stack before the segfault handler, so I have no idea where we were. I agree that it's unlikely we are inside malloc() there.
(In reply to Mike Hommey [:glandium] from comment #13)
> Comment on attachment 8484357 [details] [diff] [review]
> Avoid malloc in on-demand linker
> 
> Review of attachment 8484357 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Considering several threads could be running a decompression in parallel,
> it's not safe to use the same buffer for all of them.

So, I think this is actually alright. We have a set of preallocated buffers for each zstream, and you cannot decompress a chunk from one stream in two threads (MappableSeekableZStream::ensure() has a mutex).

However, your point about why we hang on the malloc mutex in the first place is still bugging me. Can't explain that one. Whatever the reason, avoiding the malloc lock will fix it. What do you think?
Flags: needinfo?(mh+mozilla)
This version is the same, but increases the size for the state allocation. Apparently it's larger on some platforms.

As explained earlier, I think this patch should be safe because you cannot have parallell DecompressChunk calls in a single SeekableZStream.
Attachment #8484357 - Attachment is obsolete: true
Attachment #8511075 - Flags: review?(mh+mozilla)
releng really needs a solution for these failures, so we should just disable the on-demand linker if another fix can't be found.
Attachment #8512003 - Flags: review?(mwu)
Attachment #8512003 - Attachment is obsolete: true
Attachment #8512003 - Flags: review?(mwu)
Attachment #8512004 - Flags: review?(mwu)
Assignee

Comment 29

5 years ago
Comment on attachment 8512004 [details] [diff] [review]
Disable the on-demand linker by default

Review of attachment 8512004 [details] [diff] [review]:
-----------------------------------------------------------------

You can't wait, can you?
Attachment #8512004 - Flags: review?(mwu) → review-
(In reply to Mike Hommey [:glandium] from comment #29)
> You can't wait, can you?

2 months seems pretty patient to me.
Assignee

Comment 31

5 years ago
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #30)
> (In reply to Mike Hommey [:glandium] from comment #29)
> > You can't wait, can you?
> 
> 2 months seems pretty patient to me.

The other patch, that tries to fix the problem, is from friday.
Flags: needinfo?(mh+mozilla)
Assignee

Comment 32

5 years ago
If you _really_ want to disable on-demand decompression, set MOZ_LINKER_ONDEMAND=0 from java. But disabling it would make any attempt at fixing the actual problem irrelevant since nothing will tell if it was, indeed, fixed.
Assignee

Comment 33

5 years ago
Comment on attachment 8511075 [details] [diff] [review]
Pre-allocate zlib buffers for on-demand linker decompression

See upcoming patch which addresses comments I'd have about this one. Figured this would make things landed faster than going through another cycle. Also, since I had a PoC patch for bug 828845 that also hooked the zstream allocator, I went the two birds one stone way and took parts of that patch.
Attachment #8511075 - Attachment is obsolete: true
Attachment #8511075 - Flags: review?(mh+mozilla)
Assignee

Comment 34

5 years ago
Original patch from James Willcox <snorp@snorp.net>
Attachment #8512479 - Flags: review?(nfroyd)
Assignee

Updated

5 years ago
Assignee: snorp → mh+mozilla
Status: NEW → ASSIGNED
Assignee

Comment 35

5 years ago
Note, I'm still not comfortable with the fact that there are threads SEGV'ing in malloc().
(In reply to Mike Hommey [:glandium] from comment #31)
> (In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #30)
> > (In reply to Mike Hommey [:glandium] from comment #29)
> > > You can't wait, can you?
> > 
> > 2 months seems pretty patient to me.
> 
> The other patch, that tries to fix the problem, is from friday.

It's the same patch, I just increased the state buffer since that seeems to vary across platforms.
(In reply to Mike Hommey [:glandium] from comment #29)
> Comment on attachment 8512004 [details] [diff] [review]
> Disable the on-demand linker by default
> 
> Review of attachment 8512004 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> You can't wait, can you?

Sorry if you're offended, but releng really needs this fixed. Thanks for finishing up the patch.
(In reply to Mike Hommey [:glandium] from comment #29)
> Comment on attachment 8512004 [details] [diff] [review]
> Disable the on-demand linker by default
> 
> Review of attachment 8512004 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> You can't wait, can you?

Two months without a review is really unacceptable. Complaining when the review gets switched is baffling.

(In reply to Mike Hommey [:glandium] from comment #32)
> If you _really_ want to disable on-demand decompression, set
> MOZ_LINKER_ONDEMAND=0 from java. But disabling it would make any attempt at
> fixing the actual problem irrelevant since nothing will tell if it was,
> indeed, fixed.

That doesn't really make sense. Disabling it while its causing intermittent orange removes the pain for RelEng. Re-enabling it when a fix is available would give you a good idea if the fix worked.
Attachment #8512479 - Flags: review?(nfroyd) → review+
https://hg.mozilla.org/integration/fx-team/rev/44c6d392bbc8

(In reply to Mike Hommey [:glandium] from comment #35)
> Note, I'm still not comfortable with the fact that there are threads
> SEGV'ing in malloc().

Sounds like fodder for a follow-up bug?
Whiteboard: [fixed-in-fx-team]
[Tracking Requested - why for this release]: regression
tracking-fennec: --- → ?
https://hg.mozilla.org/mozilla-central/rev/44c6d392bbc8

Early results on fx-team are looking promising. We'll see how things look over the next day or so now that this is merging around.
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Whiteboard: [fixed-in-fx-team]
Target Milestone: --- → mozilla36
Assignee

Comment 42

5 years ago
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #36)
> > The other patch, that tries to fix the problem, is from friday.
> 
> It's the same patch, I just increased the state buffer since that seeems to
> vary across platforms.

But there was no r? flag until friday.

> > You can't wait, can you?
> 
> Sorry if you're offended, but releng really needs this fixed. Thanks for
> finishing up the patch.

(In reply to Brad Lassey [:blassey] (use needinfo?) from comment #38)
> Two months without a review is really unacceptable. Complaining when the
> review gets switched is baffling.

I'm sorry, but there is absolutely no context in this bug indicating this was so much of pain, and that it was pressing to have it fixed. Whatever email exchange led to this haste is unknown to me. All I saw was snorp re-asking for review on friday for a possible fix, and then, because somehow that the review didn't happen over the week end was not quick enough, attached another patch to entirely disable on-demand decompression.
Seems to be holding up pretty well so far. Can we please get Aurora and Beta approval requests?
Assignee

Comment 44

5 years ago
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #43)
> Seems to be holding up pretty well so far. Can we please get Aurora and Beta
> approval requests?

I'd rather see what's up with bug 1091118 first, since it's related to the code added here.
Assignee

Updated

5 years ago
Depends on: 1091118
Assignee

Updated

5 years ago
Attachment #8512004 - Attachment is obsolete: true
Assignee

Comment 45

5 years ago
Comment on attachment 8512479 [details] [diff] [review]
Pre-allocate zlib inflate buffers in faulty.lib

Approval Request Comment
[Feature/regressing bug #]: bug 848764
[User impact if declined]: Unstable android testing. This probably also affects Firefox on user devices.
[Describe test coverage new/current, TBPL]: Reduced android test instability.
[Risks and why]: Not very high. If the patch had ill effects, like making the wrong chunk unpacked, or corrupted data, it would create more instability. That said, it's the kind of instability that is very hard to measure without the law of large numbers. On paper, it's safe. The good thing is that it has a very low risk of conflict, so can be backed out easily in case of disaster.
[String/UUID change made/needed]: None
Attachment #8512479 - Flags: approval-mozilla-beta?
Attachment #8512479 - Flags: approval-mozilla-aurora?
Comment on attachment 8512479 [details] [diff] [review]
Pre-allocate zlib inflate buffers in faulty.lib

Beta+
Aurora+

ESR31 is marked as unaffected but bug 848764 was fixed in Firefox 26. Is this correct?
Flags: needinfo?(mh+mozilla)
Attachment #8512479 - Flags: approval-mozilla-beta?
Attachment #8512479 - Flags: approval-mozilla-beta+
Attachment #8512479 - Flags: approval-mozilla-aurora?
Attachment #8512479 - Flags: approval-mozilla-aurora+
Depends on: 1091664
tracking-fennec: ? → 34+
Assignee

Comment 48

5 years ago
(In reply to Lawrence Mandel [:lmandel] (use needinfo) from comment #46)
> Comment on attachment 8512479 [details] [diff] [review]
> Pre-allocate zlib inflate buffers in faulty.lib
> 
> Beta+
> Aurora+
> 
> ESR31 is marked as unaffected but bug 848764 was fixed in Firefox 26. Is
> this correct?

Technically, it's affected as much as other branches, but maybe that doesn't show up as test instability.
Flags: needinfo?(mh+mozilla)
Assignee

Comment 49

5 years ago
(or, the root cause of this bug, which we actually only worked around, did appear on esr31, I don't know)
If there isn't a good case to take this fix on ESR31, let's not bother. Our plan is only to support Android on this branch until mid Jan.
Duplicate of this bug: 1056634
Duplicate of this bug: 1059114
Duplicate of this bug: 1059115
Duplicate of this bug: 1059205
Duplicate of this bug: 1059227
Duplicate of this bug: 1059228
Duplicate of this bug: 1058453
Duplicate of this bug: 1055882
Bug 1113416 is believed to have fixed the underlying cause for this. Since the patch here seems to have triggered a startup regression (?), maybe we should back it out and uplift bug 1113416. Mike what do you think?
Flags: needinfo?(mh+mozilla)
Assignee

Comment 60

5 years ago
Let's see how things pan out in bug 1091664 first. That said, bug 1113416 is probably good to uplift, independently of whether we back out this one or not.
Flags: needinfo?(mh+mozilla)
Assignee

Comment 61

4 years ago
The landed fix actually didn't change the allocation pattern at all.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee

Comment 62

4 years ago
This applies on top of what already landed.

Note that the patch that landed increased the buffer sizes, so there *are* a few more allocated bytes, which also means more poisoned bytes when deallocating, which /might/ explain bug 1113416.
Attachment #8561926 - Flags: review?(nfroyd)
Comment on attachment 8561926 [details] [diff] [review]
Really share the same zlib inflate buffers for SeekableZStream chunks

Review of attachment 8561926 [details] [diff] [review]:
-----------------------------------------------------------------

I trawled a bit through the linked bugs, and I don't understand why we're doing this.  Before this patch, we had preallocated buffers for everything, which as I understand it, was necessary to avoid calling malloc() at inopportune times.  Now with this patch, we have really preallocated buffers for decompression, but we malloc() for everything else?  Why are we not keeping the preallocated buffers for everything else?

::: mozglue/linker/Zip.h
@@ +31,5 @@
>  public:
> +  /* Forward declaration */
> +  class StaticAllocator;
> +
> +  zxx_stream(StaticAllocator *allocator_=nullptr)

Nit: maybe |explicit| this?
Attachment #8561926 - Flags: review?(nfroyd) → review+
https://hg.mozilla.org/mozilla-central/rev/aac7de995646
Status: REOPENED → RESOLVED
Closed: 5 years ago4 years ago
Resolution: --- → FIXED
OK, so I guess we were allocating those buffers once per chunk instead of once per stream. Oops. This seems to have fixed the startup regression, so we should uplift to at least 36.
I'll let Mike request the uplift, he can probably speak better about the risk it entails.
Flags: needinfo?(mh+mozilla)
Assignee

Comment 68

4 years ago
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #66)
> OK, so I guess we were allocating those buffers once per chunk instead of
> once per stream. Oops. This seems to have fixed the startup regression, so
> we should uplift to at least 36.

Technically, the original patch didn't change much, it just made half of those allocations that were already happening slightly larger. So in that sense, it may have been the cause of the regression. The new patch does actually remove those allocations, so, more than fixing the regression, it actually does what was meant to be done. If the goal is to make the regression go away, I'd rather backout the original patch from older branches, and let the new patch bake some more. James, what do you think?
Flags: needinfo?(mh+mozilla) → needinfo?(snorp)
(In reply to Mike Hommey [:glandium] from comment #68)
> (In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #66)
> > OK, so I guess we were allocating those buffers once per chunk instead of
> > once per stream. Oops. This seems to have fixed the startup regression, so
> > we should uplift to at least 36.
> 
> Technically, the original patch didn't change much, it just made half of
> those allocations that were already happening slightly larger. So in that
> sense, it may have been the cause of the regression. The new patch does
> actually remove those allocations, so, more than fixing the regression, it
> actually does what was meant to be done. If the goal is to make the
> regression go away, I'd rather backout the original patch from older
> branches, and let the new patch bake some more. James, what do you think?

It looks like we've been shipping the regression since 34, so I don't think I agree with backing it out now unless we also push something (the new patch) to fix the original malloc hang. If you want to just let this ride the trains, I'm alright with that.
Flags: needinfo?(snorp)
Assignee

Comment 70

4 years ago
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #69)
> It looks like we've been shipping the regression since 34, so I don't think
> I agree with backing it out now unless we also push something (the new
> patch) to fix the original malloc hang. If you want to just let this ride
> the trains, I'm alright with that.

The thing is, I don't see how that actually fixed the original malloc hang, since the amount of malloc was the same.
(In reply to Mike Hommey [:glandium] from comment #70)
> (In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #69)
> > It looks like we've been shipping the regression since 34, so I don't think
> > I agree with backing it out now unless we also push something (the new
> > patch) to fix the original malloc hang. If you want to just let this ride
> > the trains, I'm alright with that.
> 
> The thing is, I don't see how that actually fixed the original malloc hang,
> since the amount of malloc was the same.

The amount was (roughly) the same, but it did it before the signal handler -- not inside of it.
Assignee

Comment 72

4 years ago
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #71)
> The amount was (roughly) the same, but it did it before the signal handler
> -- not inside of it.

AFAICS, the mallocs were still inside the signal handler.
You need to log in before you can comment on or make changes to this bug.