Closed Bug 1428072 Opened 4 years ago Closed 4 years ago

Crash in a debug build when opening the devtools, when the profiler is active

Categories

(Core :: Gecko Profiler, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla60
Tracking Status
firefox-esr52 --- unaffected
firefox58 --- wontfix
firefox59 --- wontfix
firefox60 --- fixed

People

(Reporter: julienw, Assigned: julienw)

References

Details

(Keywords: regression)

Attachments

(3 files)

STR:
1. Have a custom debug build of firefox.
2. Run it, go to https://perf-html.io and install the profiler addon.
3. Start the profiler.
4. Open the devtools (I use: ctrl - shift - k)

Crashes on this assertion:

Assertion failure: self->debugToken == DEBUG_TOKEN, at /home/julien/travail/git/mozilla-central/caps/nsJSPrincipals.h:36

This doesn't happen on a normal nightly. I haven't tried on a custom non-debug build yet.
I have tried to reproduce this on macOS with a debug build from a recent mozilla-central revision:
mozregression --build-type debug --repo mozilla-central --launch 81362f7306fe

But I didn't hit the assertion. Does this way of reproducing it work on Linux?
Indeed I don't reproduce directly with this command, but I reproduce if I add ",Worker" to the list of threads to profile.

I _think_ I reproduced without this on my own debug build, but I'm not sure.
(In reply to Julien Wajsberg [:julienw] from comment #2)
> Indeed I don't reproduce directly with this command, but I reproduce if I
> add ",Worker" to the list of threads to profile.

Great, then it should be possible to find a regression window.
Some code expects debugToken to equal to DEBUG_TOKEN: [1].
But as we can see in [2] [3] the token can also be kJSPrincipalsDebugToken.
So I believe the assertions in [1] needs to check for this value too.

[1] https://searchfox.org/mozilla-central/rev/f42618c99dcb522fb674221acfbc68c2d92e7936/caps/nsJSPrincipals.h#34-43
[2] https://searchfox.org/mozilla-central/rev/f42618c99dcb522fb674221acfbc68c2d92e7936/caps/nsJSPrincipals.cpp#93
[3] https://searchfox.org/mozilla-central/rev/f42618c99dcb522fb674221acfbc68c2d92e7936/dom/workers/Principal.cpp#38
I made something like this: https://github.com/julienw/mozilla-central/commit/1a13aadb147dd0179537c9ee0a0ed87e5d7c6d4a
Not sure this is the right approach, and also I haven't tried it yet. All I know is it builds and runs in non-debug, which is already something ;)
This doesn't build with--enable-debug but I'm on it.
Assignee: nobody → felash
Comment on attachment 8941098 [details]
Bug 1428072 - 1/3 - Change the implementation of isChromeJSScript so that it doesn't crash for workers

https://reviewboard.mozilla.org/r/211368/#review217158

Adding some comments for reviewers ;)

::: caps/nsJSPrincipals.h:45
(Diff revision 1)
> -    MOZ_ASSERT_IF(self, self->debugToken == DEBUG_TOKEN);
> +    MOZ_ASSERT_IF(self,
> +                  self->debugToken == DEBUG_TOKEN
> +                  || self->debugToken == mozilla::dom::workers::WorkerPrincipal::DEBUG_TOKEN);

I don't really like this but I don't know how to do otherwise. Ideally we should have a list of existing debug tokens in a common place, but we also don't want this line to be costly obviously. Happy to hear suggestions.

::: dom/workers/Principal.h:13
(Diff revision 1)
>  #define mozilla_dom_workers_principal_h__
>  
> -#include "Workers.h"
> +#include "jsapi.h"
> +#include "mozilla/Assertions.h"
>  
> -BEGIN_WORKERS_NAMESPACE
> +namespace mozilla { namespace dom { namespace workers {

I don't want to import `Workers.h` for just this, that's why I replaced the macro with the replaced value.

This is not nice. Is it better to move the namespace macro from `Workers.h` to a separate header so that we can import it?

::: dom/workers/Principal.h:22
(Diff revision 1)
> +  // Random unique constant to facilitate JSPrincipal debugging
> +  static const uint32_t DEBUG_TOKEN = 0x7e2df9d2;

I decided to move the token as a static field to `WorkerPrincipal`, and name it `DEBUG_TOKEN` to make it more consistent with the other DEBUG_TOKEN. But admittedly I could have moved just the constant here. Tell me what you think.

::: dom/workers/Principal.cpp:30
(Diff revision 1)
>    int32_t prevRefcount = sPrincipal.refcount++;
>    if (prevRefcount > 0) {
>      --sPrincipal.refcount;
>    } else {
>  #ifdef DEBUG
> -    sPrincipal.debugToken = kJSPrincipalsDebugToken;
> +    sPrincipal.debugToken = WorkerPrincipal::DEBUG_TOKEN;

I wondered if we should use `setDebugToken` instead of assigning it directly. In this case we could likely remove the #ifdef as they're already in `setDebugToken` (see [1]). Although I don't know if the macro DEBUG is the same as JS_DEBUG.

[1] https://searchfox.org/mozilla-central/rev/cf149b7b63ff97023e28723167725e38cf5df757/js/public/Principals.h#35-39


It also made this code slightly more difficult to find.

I decided to not change this, but happy to use setDebugToken if you think it's a good idea.

::: dom/workers/moz.build:29
(Diff revision 1)
>      'WorkerRunnable.h',
>      'WorkerScope.h',
>  ]
>  
>  EXPORTS.mozilla.dom.workers += [
> +    'Principal.h',

It needs to be exported here so that we can import from JSPrincipals.
Comment on attachment 8941098 [details]
Bug 1428072 - 1/3 - Change the implementation of isChromeJSScript so that it doesn't crash for workers

Hey Markus, what do you think of the approach? Is it too complex to fix this?

Also do you have an idea for a test?

Lastly, who could be a good reviewer for this?
Attachment #8941098 - Flags: feedback?(mstange)
(In reply to Julien Wajsberg [:julienw] from comment #4)
> Some code expects debugToken to equal to DEBUG_TOKEN: [1].

Do you have a stack for this crash? Maybe [1] isn't supposed to be called for workers?

I don't know any of this code. I'd recommend bz as a reviewer, but you'll need a better commit message first :)

> But as we can see in [2] [3] the token can also be kJSPrincipalsDebugToken.
> So I believe the assertions in [1] needs to check for this value too.
> 
> [1] https://searchfox.org/mozilla-central/rev/f42618c99dcb522fb674221acfbc68c2d92e7936/caps/nsJSPrincipals.h#34-43
> [2] https://searchfox.org/mozilla-central/rev/f42618c99dcb522fb674221acfbc68c2d92e7936/caps/nsJSPrincipals.cpp#93
> [3] https://searchfox.org/mozilla-central/rev/f42618c99dcb522fb674221acfbc68c2d92e7936/dom/workers/Principal.cpp#38
The thing that has kJSPrincipalsDebugToken is NOT an nsJSPrincipals.  nsJSPrincipals is mainthread-only; calling nsJSPrincipals::get on a non-main thread is generally not OK (and is what the asserts are meant to prevent).

What's the stack to the assertion when it fires?  What is done with the return value of get() in those cases?
Flags: needinfo?(felash)
Here is the stack:

0x00007fffe0fc19d6 in nsJSPrincipals::get (principals=0x7fffed8f4ad0 <mozilla::dom::workers::GetWorkerPrincipal()::sPrincipal>) at /home/julien/travail/git/mozilla-central/caps/nsJSPrincipals.h:36
36	    MOZ_ASSERT_IF(self, self->debugToken == DEBUG_TOKEN);
(gdb) bt
#0  0x00007fffe0fc19d6 in nsJSPrincipals::get(JSPrincipals*) (principals=0x7fffed8f4ad0 <mozilla::dom::workers::GetWorkerPrincipal()::sPrincipal>)
    at /home/julien/travail/git/mozilla-central/caps/nsJSPrincipals.h:36
#1  0x00007fffe61c3864 in IsChromeJSScript(JSScript*) (aScript=0x7fffc8371f68) at /home/julien/travail/git/mozilla-central/tools/profiler/core/ProfileBuffer.cpp:148
#2  0x00007fffe61c3ab4 in ProfileBufferCollector::CollectPseudoEntry(js::ProfileEntry const&) (this=0x7fff97dfe800, aEntry=...)
    at /home/julien/travail/git/mozilla-central/tools/profiler/core/ProfileBuffer.cpp:202
#3  0x00007fffe61c896a in MergeStacks(uint32_t, bool, ThreadInfo const&, Registers const&, NativeStack const&, ProfilerStackCollector&) (aFeatures=646, aIsSynchronous=false, aThreadInfo=..., aRegs=..., aNativeStack=..., aCollector=...) at /home/julien/travail/git/mozilla-central/tools/profiler/core/platform.cpp:899
#4  0x00007fffe61c92dd in DoSharedSample(PSLockRef, bool, ThreadInfo&, mozilla::TimeStamp const&, Registers const&, ProfileBuffer::LastSample*, ProfileBuffer&) (aLock=..., aIsSynchronous=false, aThreadInfo=..., aNow=..., aRegs=..., aLS=0x7fffd7fbe6b4, aBuffer=...) at /home/julien/travail/git/mozilla-central/tools/profiler/core/platform.cpp:1294
#5  0x00007fffe61c944d in DoPeriodicSample(PSLockRef, ThreadInfo&, mozilla::TimeStamp const&, Registers const&, int64_t, int64_t) (aLock=..., aThreadInfo=..., aNow=..., aRegs=..., aRSSMemory=0, aUSSMemory=0)
    at /home/julien/travail/git/mozilla-central/tools/profiler/core/platform.cpp:1330
#6  0x00007fffe61ca7ae in SamplerThread::<lambda(const Registers&)>::operator()(const Registers &) const (__closure=0x7fff97dfee50, aRegs=...)
    at /home/julien/travail/git/mozilla-central/tools/profiler/core/platform.cpp:1987
#7  0x00007fffe61d4e78 in Sampler::SuspendAndSampleAndResumeThread<SamplerThread::Run()::<lambda(const Registers&)> >(PSLockRef, const ThreadInfo &, const SamplerThread::<lambda(const Registers&)> &) (this=0x7fffa3e08b30, aLock=..., aThreadInfo=..., aProcessRegs=...) at /home/julien/travail/git/mozilla-central/tools/profiler/core/platform-linux-android.cpp:356
#8  0x00007fffe61caaa1 in SamplerThread::Run() (this=0x7fffa3e08b30) at /home/julien/travail/git/mozilla-central/tools/profiler/core/platform.cpp:1988
#9  0x00007fffe61caf9d in ThreadEntry(void*) (aArg=0x7fffa3e08b30) at /home/julien/travail/git/mozilla-central/tools/profiler/core/platform-linux-android.cpp:396
#10 0x00007ffff7bc3494 in start_thread (arg=0x7fff97dff700) at pthread_create.c:333
#11 0x00007ffff6e64aff in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97


We come from [1].

I thought it was OK because in [2] we were using the WorkerPrincipal's debug token, although I now see we're implementing a method for JSPrincipals.

Is the solution that we should dispatch different calls depending on the Principal type ?

[1] https://searchfox.org/mozilla-central/rev/03877052c151a8f062eea177f684a2743cd7b1d5/tools/profiler/core/ProfileBuffer.cpp#148
[2] https://searchfox.org/mozilla-central/rev/03877052c151a8f062eea177f684a2743cd7b1d5/caps/nsJSPrincipals.cpp#93


Note we have the exact same code in [3], do you think we may have the same issue?

[3] https://searchfox.org/mozilla-central/rev/03877052c151a8f062eea177f684a2743cd7b1d5/toolkit/components/backgroundhangmonitor/ThreadStackHelper.cpp#218
Flags: needinfo?(felash)
This instance of the IsChromeJSScript function was introduced in bug 1380286.

(In reply to Julien Wajsberg [:julienw] from comment #13)
> I thought it was OK because in [2] we were using the WorkerPrincipal's debug
> token, although I now see we're implementing a method for JSPrincipals.
> 
> Is the solution that we should dispatch different calls depending on the
> Principal type ?

I don't know. Deferring to bz again. bz, how should IsChromeJSScript be implemented for worker threads? Can nsScriptSecurityManager even be used outside the main thread?

IsChromeJSScript is called by ProfileBufferCollector::CollectPseudoEntry, which is called by MergeStacks, which has the sampled thread's ThreadInfo. On the ThreadInfo we could store a JSContext type ("main thread" or "worker") and use that to make CollectPseudoEntry behave differently for worker threads.

> Note we have the exact same code in [3], do you think we may have the same
> issue?

Probably, but I'm not sure if that code is ever used to sample worker threads.
Blocks: 1380286
Flags: needinfo?(bzbarsky)
> how should IsChromeJSScript be implemented for worker threads?

What definition of "chrome" are you aiming for?  Chrome worker, or system principal?  They're sort of orthogonal...

In any case, worker JSScripts don't have a useful concept of "principal" attached to them.  Principals are mainthread-only objects in Gecko.

> Can nsScriptSecurityManager even be used outside the main thread?

Not really.  For example, it doesn't have threadsafe refcounting.  It's just luck that this code is not refcounting it.

It's also just luck that all nsScriptSecurityManager::IsSystemPrincipal does is a pointer compare on the passed-in pointer t a singleton system principal pointer.  So while the nsJSPrincipals::get in IsChromeJSScript is completely bogus, and casts an object to a class that it's not, we're not actually calling any virtual methods on it, so it ends up OK in an opt build.  If that code did "return nsJSPrincipals::get(principals)->GetIsSystemPrincipal();", with the virtual function call that entails, you would have noticed quite immediately in opt builds that there is a problem here...

Anyway, you could test js::IsSystemCompartment on the script's compartment.  That will be equivalent to checking the principal for being system principal on mainthread, and will always return false on workers.

Alternately, you could detect worker threads and use GetCurrentThreadWorkerPrivate()->UsesSystemPrincipal() if you really want to properly detect "system principal" workers.  I _think_ that just calling GetCurrentThreadWorkerPrivate() (after ensuring you're not NS_IsMainThread()) and assuming "not a worker thread" if you get null should work as worker thread detection...
Flags: needinfo?(bzbarsky)
(In reply to Boris Zbarsky [:bz] (no decent commit message means r-) from comment #15)
> > how should IsChromeJSScript be implemented for worker threads?
> 
> What definition of "chrome" are you aiming for?

"Scripts which are under our control, whose function names and script URLs can be captured in telemetry without privacy implications".
Hmm.  Can webextensions run system code?
I don't know. Nika, I think you know a bit more about the rationale here, maybe you want to comment.

(In reply to Boris Zbarsky [:bz] (no decent commit message means r-) from comment #15)
> Anyway, you could test js::IsSystemCompartment on the script's compartment. 
> That will be equivalent to checking the principal for being system principal
> on mainthread, and will always return false on workers.

This sounds like the reasonable thing to do. We're not sending stacks from worker threads to telemetry anyway.

> Alternately, you could detect worker threads and use
> GetCurrentThreadWorkerPrivate()->UsesSystemPrincipal() if you really want to
> properly detect "system principal" workers.  I _think_ that just calling
> GetCurrentThreadWorkerPrivate() (after ensuring you're not
> NS_IsMainThread()) and assuming "not a worker thread" if you get null should
> work as worker thread detection...

This code runs on the sampler thread, not on the worker thread, so we'd have to call GetCurrentThreadWorkerPrivate()->UsesSystemPrincipal() in the worker thread when it registers itself with the profiler. But the other solution seems simpler and is good enough.

Thanks for the explanations and the pointers, Boris!


And we really need to add a mochitest that runs the profiler on a DOM worker thread...
Flags: needinfo?(nika)
(In reply to Markus Stange [:mstange] from comment #18)
> I don't know. Nika, I think you know a bit more about the rationale here,
> maybe you want to comment.

I think you have the correct answer here. The sort of thing we're specifically interested in is the source of the code. If the code is coming bundled with the browser, then we want to collect stack information for it, while if the code is provided by a webextension or content, we don't want to collect that information when privacy is enabled, as it could violate the user's privacy.

> This sounds like the reasonable thing to do. We're not sending stacks from
> worker threads to telemetry anyway.

I'm fine with this. The big place we're using this privacy flag is in BHR, and we don't collect BHR for worker threads, so...

> And we really need to add a mochitest that runs the profiler on a DOM worker
> thread...

Agreed.

(In reply to Boris Zbarsky [:bz] (no decent commit message means r-) from comment #17)
> Hmm.  Can webextensions run system code?

So,

I don't _think_ that webextensions can run system code. However, historically we've captured stacks for all code with the system principal even when we supported running system addons. So there is some precedent for not hiding the add-on stack frames.

Basically I think it'd be better to not capture webext stack frames, but if we do capture them, it probably isn't the end of the world.
Flags: needinfo?(nika)
Attachment #8941098 - Flags: feedback?(mstange)
Julien, any progress here?
Flags: needinfo?(felash)
Hey Liz,

I have a patch that fixes the issue, following the previous comments, but:

1. I'd like take the time to work on a mochitest, and for this it would be good to have the patch for bug 1428076 first.
2. I found another crash in another location, but it looks intermittent and I'd like to qualify it before deciding if it should be in a separate bug.

I think this is fine if it's wontfix in 59 (or any future version), but maybe others might think differently.

Please tell me if you need anything more from me.
Flags: needinfo?(felash)
I got a mochitest permafailing in trunk: https://treeherder.mozilla.org/#/jobs?repo=try&revision=3427d657a529d20beb9ad187c58110d024be95eb&selectedJob=159534977

Now let's see if this is fixed with this patch :)
Ah that's sad, the mochitest also triggers bug 1355566 so we can't land it :(
Comment on attachment 8947072 [details]
Bug 1428072 - 2/3 - Reorganize mochitest files for the profiler

https://reviewboard.mozilla.org/r/216868/#review222680


Static analysis found 1 defect in this patch.
 - 1 defect found by mozlint

You can run this analysis locally with:
 - `./mach lint check path/to/file` (Python/Javascript/wpt)


If you see a problem in this automated review, please report it here: http://bit.ly/2y9N9Vx


::: tools/profiler/tests/chrome/test_profile_worker_bug_1428076.html:52
(Diff revision 1)
>    return new Promise(resolve => {
>      setTimeout(resolve, 50);
>    });
>  }
>  
> -SimpleTest.waitForExplicitFinish();
> +runTest(settings, workload);

Error: 'runTest' is not defined. [eslint: no-undef]
I should have fixed this in the last push ^
Comment on attachment 8947072 [details]
Bug 1428072 - 2/3 - Reorganize mochitest files for the profiler

https://reviewboard.mozilla.org/r/216868/#review222684


Static analysis found 1 defect in this patch.
 - 1 defect found by mozlint

You can run this analysis locally with:
 - `./mach lint check path/to/file` (Python/Javascript/wpt)


If you see a problem in this automated review, please report it here: http://bit.ly/2y9N9Vx


::: tools/profiler/tests/chrome/test_profile_worker_bug_1428076.html:52
(Diff revision 2)
>    return new Promise(resolve => {
>      setTimeout(resolve, 50);
>    });
>  }
>  
> -SimpleTest.waitForExplicitFinish();
> +runTest(settings, workload);

Error: 'runTest' is not defined. [eslint: no-undef]
Ah no, I'm wrong, it's fixed in the 3rd patch. I'll move the fix to the second.
Comment on attachment 8947072 [details]
Bug 1428072 - 2/3 - Reorganize mochitest files for the profiler

https://reviewboard.mozilla.org/r/216868/#review222698

::: tools/profiler/tests/chrome/profiler_test_utils.js:48
(Diff revision 3)
> +    await startProfiler(settings);
> +    await workload();
> +    await getProfile();
> +    await stopProfiler();
> +    await end();

This is not perfect because we can't access the profile here, to be able to analyze it. Maybe we'll want to pass a "analyze" callback too (and change the parameters to an option parameter instead of a list of parameters).
Attachment #8941098 - Flags: review?(mstange) → review?(bzbarsky)
Comment on attachment 8947072 [details]
Bug 1428072 - 2/3 - Reorganize mochitest files for the profiler

https://reviewboard.mozilla.org/r/216868/#review223684
Attachment #8947072 - Flags: review?(mstange) → review+
Comment on attachment 8941098 [details]
Bug 1428072 - 1/3 - Change the implementation of isChromeJSScript so that it doesn't crash for workers

https://reviewboard.mozilla.org/r/211368/#review223682

r=me, though it might be good to have some JS peer OK the jsfriendapi change.

::: tools/profiler/core/ProfileBuffer.cpp:143
(Diff revision 5)
>  static bool
>  IsChromeJSScript(JSScript* aScript)
>  {
>    // WARNING: this function runs within the profiler's "critical section".
> -
> -  nsIScriptSecurityManager* const secman =
> +  auto compartment = js::GetScriptCompartment(aScript);
> +  NS_ENSURE_TRUE(compartment, false);

I think you can safely assume that a JSScript always has a compartment...
Attachment #8941098 - Flags: review?(bzbarsky) → review+
Comment on attachment 8947073 [details]
Bug 1428072 - 3/3 - Add a mochitest to test profiling better

https://reviewboard.mozilla.org/r/216870/#review223688
Attachment #8947073 - Flags: review?(mstange) → review+
Attachment #8941098 - Flags: review?(sphink)
Comment on attachment 8941098 [details]
Bug 1428072 - 1/3 - Change the implementation of isChromeJSScript so that it doesn't crash for workers

https://reviewboard.mozilla.org/r/211368/#review223772

I'm generally not crazy about further exposing compartments, but this does seem like the right way to do this. The alternative would be IsSystemScript(), but given that we already use compartments for this sort of thing, I think the patch is correct.
Attachment #8941098 - Flags: review?(sphink) → review+
Thanks all, I slightly updated the commit logs to explain that this exposes a script's compartment.
Pushed by jwajsberg@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b5a327bb02d4
1/3 - Change the implementation of isChromeJSScript so that it doesn't crash for workers r=bz,sfink
https://hg.mozilla.org/integration/autoland/rev/9acb3d3eb5bc
2/3 - Reorganize mochitest files for the profiler r=mstange
https://hg.mozilla.org/integration/autoland/rev/b39c21c2a7a4
3/3 - Add a mochitest to test profiling better r=mstange
Ah I didn't know we detect leaks in opt builds too. This is good, but does that mean we can't run the test in CI because of bug 1355566 ?

What do you think Markus?
Flags: needinfo?(felash) → needinfo?(mstange)
I looked at the log and I think this is a different issue than bug 1355566. I've pushed an experimental patch to try, let's see how it goes: https://treeherder.mozilla.org/#/jobs?repo=try&revision=4f4109076be22abe16ae1b206491cd8dbb427414
Flags: needinfo?(mstange)
Looking good! I'm going to attempt a reland, now that bug 1434968 has landed.
Oh no, I can't reland it using autoland because mozreview won't let me reopen the review request.
I've also created a try push that has the new test enabled in debug builds: https://treeherder.mozilla.org/#/jobs?repo=try&revision=b531207fb0ed557f42ab405fc57d3ee51fe03818
Looks like you can enable the test everywhere now.
Ah perfect, thanks! Indeed I also suspected another leak somewhere, because even when we stopped the gecko profiler before shutting down firefox we still had leaks.
Here is a try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=a7b854bb8cde
We find leaks on Windows but this seems to be a known bug (bug 1401883). I'll try landing once more then.
Pushed by jwajsberg@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/26289a580d8d
1/3 - Change the implementation of isChromeJSScript so that it doesn't crash for workers r=bz,sfink
https://hg.mozilla.org/integration/autoland/rev/dd5c228ace0d
2/3 - Reorganize mochitest files for the profiler r=mstange
https://hg.mozilla.org/integration/autoland/rev/e6d5ac399a43
3/3 - Add a mochitest to test profiling better r=mstange
(In reply to Julien Wajsberg [:julienw] from comment #61)
> Here is a try:
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=a7b854bb8cde
> We find leaks on Windows but this seems to be a known bug (bug 1401883).

Hmm, this is a bit concerning, though. I've retriggered the Win10 x64 c3 job a few more times, and we have a failure rate of 5/9. That's a lot higher than the 7/735 rate in bug 1401883.
This seems like something that can just ride the trains.
Flags: in-testsuite+
You need to log in before you can comment on or make changes to this bug.