Closed Bug 1623943 Opened 5 months ago Closed 1 month ago

Advance toolkit.shutdown.fastShutdownStage to 3 on Nightly

Categories

(Core :: XPCOM, enhancement)

enhancement
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla80
Tracking Status
firefox80 --- fixed

People

(Reporter: dthayer, Assigned: dthayer)

References

(Blocks 1 open bug)

Details

(Whiteboard: [ci-costs-2020:todo])

Attachments

(3 files, 2 obsolete files)

Just getting this bug in to make it clear that it's the next step after bug 1623668, provided the telemetry from it turns up clean.

This will currently do nothing, since we end up calling WaitOnWriteThread
from other calls into the StartupCache before we get here. However, since
we're going to be skipping the guard in xpcom-shutdown, it's prudent to
add this here now so we don't inadvertently start missing writing it in
the future.

Assignee: nobody → dothayer
Status: NEW → ASSIGNED

Of several million lateWrites entries on Nightly telemetry, there
are 5 that are non-empty. We don't have symbols for any of their
xul entries, so they appear to be non-standard builds. With those
numbers I feel comfortable advancing this pref forward.

Depends on D69865

Whiteboard: [ci-costs-2020:todo]
Attachment #9138625 - Attachment is obsolete: true
Depends on: 1638062
Depends on: 1638166
Attachment #9138624 - Attachment is obsolete: true
Depends on: 1644260
See Also: → 1644265

Hey Dana (it's me again!)

There are a few stacks that I wanted to get your take on which we're still seeing in late write telemetry. Could you let me know if my understanding here is correct?

0.06% of shutdowns (205 of 337,000) see this stack:

[
        "(xul.pdb) mozilla::IOInterposeObserver::Observation::Report()",
        "(xul.pdb) `anonymous namespace'::InterposedNtWriteFile(void*, void*, void (*)(void*, _IO_STATUS_BLOCK*, unsigned long), void*, _IO_STATUS_BLOCK*, void*, unsigned long, _LARGE_INTEGER*, unsigned long*)",
        "(kernelbase.pdb) WriteFile",
        "(nss3.pdb) winWrite(sqlite3_file*, void const*, int, long long)",
        "(nss3.pdb) pagerAddPageToRollbackJournal(PgHdr*)",
        "(nss3.pdb) pager_write(PgHdr*)",
        "(nss3.pdb) sqlite3BtreeDelete(BtCursor*, unsigned char)",
        "(nss3.pdb) sqlite3VdbeExec(Vdbe*)",
        "(nss3.pdb) sqlite3_step(sqlite3_stmt*)",
        "(softokn3.pdb) sdb_destroyAnyObject(SDBStr*, char const*, unsigned long, char const*)",
        "(softokn3.pdb) sftkdb_DestroyObject(SFTKDBHandleStr*, unsigned long, unsigned long)",
        "(softokn3.pdb) sftk_DeleteObject(SFTKSessionStr*, SFTKObjectStr*)",
        "(softokn3.pdb) NSC_DestroyObject(unsigned long, unsigned long)",
        "(nss3.pdb) nssToken_DeleteStoredObject(nssCryptokiInstanceStr*)",
        "(nss3.pdb) nssPKIObject_DeleteStoredObject(nssPKIObjectStr*, NSSCallbackStr*, int)",
        "(nss3.pdb) SEC_DeletePermCertificate(CERTCertificateStr*)",
        "(xul.pdb) IntermediatePreloadingHealerCallback(nsITimer*, void*)",
        "(xul.pdb) nsTimerImpl::Fire(int)",
        "(xul.pdb) nsTimerEvent::Run()",
        "(xul.pdb) mozilla::TaskQueue::Runner::Run()",
        "(xul.pdb) nsThreadPool::Run()",
        "(xul.pdb) nsThread::ProcessNextEvent(bool, bool*)",
        "(xul.pdb) NS_ProcessNextEvent(nsIThread*, bool)",
        "(xul.pdb) mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*)",
        "(xul.pdb) MessageLoop::RunHandler()",
        "(xul.pdb) MessageLoop::Run()",
        "(xul.pdb) static nsThread::ThreadFunc(void*)",
        "(nss3.pdb) _PR_NativeRunThread(void*)",
        "(nss3.pdb) pr_root(void*)",
        "(ucrtbase.pdb) thread_start<unsigned int (__cdecl*)(void *),1>",
        "(kernel32.pdb) BaseThreadInitThunk",
        "(ntdll.pdb) RtlUserThreadStart"
]

It looks to me like this is both skippable and interruptible? All of the stacks we observe at least are in sqlite3_step, which ought to be safe to interrupt, but it's not obvious to me whether IntermediatePreloadingHealerCallback could leave things in a bad state more broadly if it doesn't get to finish all of the writes it wants to make.

The next (representing 0.007% of shutdowns) is this:

[
        "(xul.pdb) mozilla::IOInterposeObserver::Observation::Report()",
        "(xul.pdb) `anonymous namespace'::InterposedNtWriteFile(void*, void*, void (*)(void*, _IO_STATUS_BLOCK*, unsigned long), void*, _IO_STATUS_BLOCK*, void*, unsigned long, _LARGE_INTEGER*, unsigned long*)",
        "(kernelbase.pdb) WriteFile",
        "(xul.pdb) std::io::Write::write_all<std::fs::File>(std::fs::File*, slice<u8>*)",
        "(xul.pdb) cert_storage::{{impl}}::run<(),closure-0>(cert_storage::SecurityStateTask<(), closure-0>*)",
        "(xul.pdb) moz_task::{{impl}}::allocate::Run(xpcom::interfaces::idl::nsIRunnable*)",
        "(xul.pdb) mozilla::TaskQueue::Runner::Run()",
        "(xul.pdb) nsThreadPool::Run()",
        "(xul.pdb) nsThread::ProcessNextEvent(bool, bool*)",
        "(xul.pdb) NS_ProcessNextEvent(nsIThread*, bool)",
        "(xul.pdb) mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*)",
        "(xul.pdb) MessageLoop::RunHandler()",
        "(xul.pdb) MessageLoop::Run()",
        "(xul.pdb) static nsThread::ThreadFunc(void*)",
        "(nss3.pdb) _PR_NativeRunThread(void*)",
        "(nss3.pdb) pr_root(void*)",
        "(ucrtbase.pdb) thread_start<unsigned int (__cdecl*)(void *),1>",
        "(kernel32.pdb) BaseThreadInitThunk",
        "(ntdll.pdb) RtlUserThreadStart"
]

A cursory glance at this makes me think this is something we want to ensure we run if we've queued it. Is this correct?

The last stack is this (representing 0.002% of shutdowns):

[
        "(xul.pdb) mozilla::IOInterposeObserver::Observation::Report()",
        "(xul.pdb) `anonymous namespace'::InterposedNtFlushBuffersFile(void*, _IO_STATUS_BLOCK*)",
        "(kernelbase.pdb) FlushFileBuffers",
        "(kernel32.pdb) FlushFileBuffersImplementation",
        "(nss3.pdb) winSync(sqlite3_file*, int)",
        "(nss3.pdb) sqlite3PagerSync(Pager*, char const*)",
        "(nss3.pdb) sqlite3PagerCommitPhaseOne(Pager*, char const*, int)",
        "(nss3.pdb) sqlite3BtreeCommitPhaseOne(Btree*, char const*)",
        "(nss3.pdb) vdbeCommit(sqlite3*, Vdbe*)",
        "(nss3.pdb) sqlite3VdbeHalt(Vdbe*)",
        "(nss3.pdb) sqlite3VdbeExec(Vdbe*)",
        "(nss3.pdb) sqlite3_step(sqlite3_stmt*)",
        "(softokn3.pdb) sdb_complete(SDBStr*, char const*)",
        "(softokn3.pdb) sftkdb_SetAttributeValue(SFTKDBHandleStr*, SFTKObjectStr*, CK_ATTRIBUTE const*, unsigned long)",
        "(softokn3.pdb) sftk_forceAttribute(SFTKObjectStr*, unsigned long, void const*, unsigned int)",
        "(softokn3.pdb) NSC_SetAttributeValue(unsigned long, unsigned long, CK_ATTRIBUTE*, unsigned long)",
        "(nss3.pdb) nssCKObject_SetAttributes(unsigned long, CK_ATTRIBUTE*, unsigned long, nssSessionStr*, NSSSlotStr*)",
        "(nss3.pdb) nssToken_ImportCertificate(NSSTokenStr*, nssSessionStr*, <unnamed-tag>, NSSItemStr*, char const*, NSSItemStr*, NSSItemStr*, NSSItemStr*, NSSItemStr*, char*, int)",
        "(nss3.pdb) PK11_ImportCert(PK11SlotInfoStr*, CERTCertificateStr*, unsigned long, char const*, int)",
        "(xul.pdb) mozilla::detail::RunnableFunction<`lambda at /builds/worker/checkouts/gecko/security/certverifier/NSSCertDBTrustDomain.cpp:1841:9'>::Run()",
        "(xul.pdb) nsThread::ProcessNextEvent(bool, bool*)",
        "(xul.pdb) NS_ProcessNextEvent(nsIThread*, bool)",
        "(xul.pdb) mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*)",
        "(xul.pdb) MessageLoop::RunHandler()",
        "(xul.pdb) MessageLoop::Run()",
        "(xul.pdb) static nsThread::ThreadFunc(void*)",
        "(nss3.pdb) _PR_NativeRunThread(void*)",
        "(nss3.pdb) pr_root(void*)",
        "(ucrtbase.pdb) thread_start<unsigned int (__stdcall*)(void *)>",
        "(kernel32.pdb) BaseThreadInitThunk",
        "(ntdll.pdb) __RtlUserThreadStart",
        "(ntdll.pdb) _RtlUserThreadStart"
]

I would think importing a cert isn't critical if we miss it? Is it critical that we flush any of these that have been started?

Flags: needinfo?(dkeeler)

(In reply to Doug Thayer [:dthayer] from comment #3)

Hey Dana (it's me again!)

Hi Doug!

There are a few stacks that I wanted to get your take on which we're still seeing in late write telemetry. Could you let me know if my understanding here is correct?

0.06% of shutdowns (205 of 337,000) see this stack:

        "(xul.pdb) IntermediatePreloadingHealerCallback(nsITimer*, void*)",

It looks to me like this is both skippable and interruptible? All of the stacks we observe at least are in sqlite3_step, which ought to be safe to interrupt, but it's not obvious to me whether IntermediatePreloadingHealerCallback could leave things in a bad state more broadly if it doesn't get to finish all of the writes it wants to make.

Yes, this should be safe to skip/interrupt. It's a periodic healer task that removes certificates from the NSS certdb that shouldn't be necessary any longer because they're stored elsewhere now (cert_storage/intermediate preloading). Is there a graceful way for that task to coordinate with shutdown, though?

The next (representing 0.007% of shutdowns) is this:

        "(xul.pdb) std::io::Write::write_all<std::fs::File>(std::fs::File*, slice<u8>*)",
        "(xul.pdb) cert_storage::{{impl}}::run<(),closure-0>(cert_storage::SecurityStateTask<(), closure-0>*)",

A cursory glance at this makes me think this is something we want to ensure we run if we've queued it. Is this correct?

It would be best to not interrupt this - it's saving crlite state (so, certificate revocation data).

The last stack is this (representing 0.002% of shutdowns):

        "(nss3.pdb) PK11_ImportCert(PK11SlotInfoStr*, CERTCertificateStr*, unsigned long, char const*, int)",
        "(xul.pdb) mozilla::detail::RunnableFunction<`lambda at /builds/worker/checkouts/gecko/security/certverifier/NSSCertDBTrustDomain.cpp:1841:9'>::Run()",

I would think importing a cert isn't critical if we miss it? Is it critical that we flush any of these that have been started?

This is the task that caches intermediate certificates from verified TLS handshakes that aren't in intermediate preloading. Again this should be safe to interrupt (and, again, if the implementation could coordinate better with the shutdown mechanism, I would be in favor of that).

Flags: needinfo?(dkeeler)

(In reply to Dana Keeler (she/her) (use needinfo) (:keeler for reviews) from comment #4)

Is there a graceful way for that task to coordinate with shutdown, though?

What would you most like to see? If it should be safe to skip or interrupt things, then I think it's important that we don't unnecessarily block shutdown with them. I concede that it's a small portion of shutdowns which would actually be blocked by this, but there will be more and more of these as we advance fast shutdown forward.

And in general as an organization I would like for us to seek safety by validating that writes are safely interruptible rather than by ensuring that we block shutdown waiting for them, because a graceful shutdown is only one way that the browser can close, and on Android for instance it's very likely that our processes will just be killed by the low memory killer.

Flags: needinfo?(dkeeler)

Hey Molly, (unrelated to the above comments) I'm seeing this stack in late write telemetry (representing about 0.01% of shutdowns):

        "(xul.pdb) mozilla::IOInterposeObserver::Observation::Report()",
        "(xul.pdb) `anonymous namespace'::InterposedNtWriteFile(void*, void*, void (*)(void*, _IO_STATUS_BLOCK*, unsigned long), void*, _IO_STATUS_BLOCK*, void*, unsigned long, _LARGE_INTEGER*, unsigned long*)",
        "(kernelbase.pdb) WriteFile",
        "(nss3.pdb) PR_MD_WRITE(PRFileDesc*, void const*, int)",
        "(nss3.pdb) FileWrite(PRFileDesc*, void const*, int)",
        "(xul.pdb) nsFileStreamBase::Write(char const*, unsigned int, unsigned int*)",
        "(xul.pdb) nsAtomicFileOutputStream::Write(char const*, unsigned int, unsigned int*)",
        "(xul.pdb) XPTC__InvokebyIndex",
        "(xul.pdb) static XPCWrappedNative::CallMethod(XPCCallContext&, XPCWrappedNative::CallMode)",
        "(xul.pdb) XPC_WN_CallMethod(JSContext*, unsigned int, JS::Value*)",
        "(xul.pdb) js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct, js::CallReason)",
        "(xul.pdb) Interpret(JSContext*, js::RunState&)",
        "(xul.pdb) js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct, js::CallReason)",
        "(xul.pdb) JS_CallFunctionValue(JSContext*, JS::Handle<JSObject *>, JS::Handle<JS::Value>, JS::HandleValueArray const&, JS::MutableHandle<JS::Value>)",
        "(xul.pdb) nsXPCWrappedJS::CallMethod(unsigned short, nsXPTMethodInfo const*, nsXPTCMiniVariant*)",
        "(xul.pdb) PrepareAndDispatch(nsXPTCStubBase*, unsigned int, unsigned long long*, unsigned long long*, double*)",
        "(xul.pdb) SharedStub",
        "(xul.pdb) nsUpdateProcessor::UpdateDone()",
        "(xul.pdb) mozilla::detail::RunnableMethodImpl<(anonymous namespace)::Logger *,void ((anonymous namespace)::Logger::*)(),0,mozilla::RunnableKind::Standard>::Run()",
        "(xul.pdb) nsThread::ProcessNextEvent(bool, bool*)",
        "(xul.pdb) NS_ProcessNextEvent(nsIThread*, bool)",
        "(xul.pdb) nsThread::Shutdown()",
        "(xul.pdb) nsThreadManager::Shutdown()",
        "(xul.pdb) mozilla::ShutdownXPCOM(nsIServiceManager*)",
        "(xul.pdb) ScopedXPCOMStartup::~ScopedXPCOMStartup()",
        "(xul.pdb) XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&)",
        "(xul.pdb) XRE_main(int, char**, mozilla::BootstrapConfig const&)",
        "(firefox.pdb) NS_internal_main(int, char**, char**)",
        "(firefox.pdb) wmain(int, wchar_t**)",
        "(firefox.pdb) __scrt_common_main_seh()",
        "(kernel32.pdb) BaseThreadInitThunk",
        "(ntdll.pdb) RtlUserThreadStart"

For a bit more context, we're working on moving "fast shutdown" (shutdown via _exit(0)) forward to the xpcom-shutdown notification. So essentially we're planning on skipping everything that runs during and after that notification. As part of that we're collecting stacks of all writes which occur during and after that point, in an attempt to validate that it is safe to go forward with skipping it. The above stack is one such collected stack.

In any case, it looks like the above stack is collected from writeStatusFile inside refreshUpdateStatus in UpdateService.jsm. I assume that we will hit this if an update starts applying and the user initiates a shutdown at a similar time, and the update happens to finish applying after or during the xpcom-shutdown notification. It looks like the updater is set up to be fine with this situation? I think the update will simply not be applied, and will be applied on the next restart. I don't think we run any risk of interrupting this process, as it occurs on the main thread - we simply run the risk of skipping it. Am I correct in thinking the updater is fine with this being skipped in the rare case where the timings line up to skip it? Or do we need to add something to ensure that the update finishes applying and this method runs to completion before we go ahead with an _exit(0)?

Flags: needinfo?(mhowell)

If this is coming from refreshUpdateStatus then I think we're looking at this call; there's one other writeStatusFile call in there but it's an unusual failure case, and the likelihood of that case lining up with this timing scenario surely must be too remote.

If that's correct, then the consequence of skipping this write will be that during the next startup, Firefox will start the updater the same as it would have if the write had succeeded, and then inside the updater this check will fail when we meant for it to pass. The consequence of that in turn should be that we get into this branch, which will mean that completing the staged update will not use the maintenance service and will instead present the user with a UAC prompt. Again, this is during startup, so a UAC prompt appearing there when it typically doesn't is not great; we sometimes get complaints when those happen because they feel a bit sketchy. But if it's that rare, I'm included to think it's acceptable.

Flags: needinfo?(mhowell)

(In reply to Doug Thayer [:dthayer] from comment #5)

(In reply to Dana Keeler (she/her) (use needinfo) (:keeler for reviews) from comment #4)

Is there a graceful way for that task to coordinate with shutdown, though?

What would you most like to see? If it should be safe to skip or interrupt things, then I think it's important that we don't unnecessarily block shutdown with them. I concede that it's a small portion of shutdowns which would actually be blocked by this, but there will be more and more of these as we advance fast shutdown forward.

For example, the healer can delete up to 20 certificates at a time. If that code is in the middle of deleting a handful of certificates and it gets notified (or it checks) that shutdown is imminent, it can bail from the loop. I guess my question is "is there a way to know if we're shutting down soon without necessarily blocking shutdown?"

And in general as an organization I would like for us to seek safety by validating that writes are safely interruptible rather than by ensuring that we block shutdown waiting for them, because a graceful shutdown is only one way that the browser can close, and on Android for instance it's very likely that our processes will just be killed by the low memory killer.

Good point - makes sense.

Flags: needinfo?(dkeeler)
Depends on: 1648142

(In reply to Dana Keeler (she/her) (use needinfo) (:keeler for reviews) from comment #8)

(In reply to Doug Thayer [:dthayer] from comment #5)

(In reply to Dana Keeler (she/her) (use needinfo) (:keeler for reviews) from comment #4)
For example, the healer can delete up to 20 certificates at a time. If that code is in the middle of deleting a handful of certificates and it gets notified (or it checks) that shutdown is imminent, it can bail from the loop. I guess my question is "is there a way to know if we're shutting down soon without necessarily blocking shutdown?"

Yeah! I can put a patch in for that!

Nathan, at this point, as far as unaccounted-for[1] writes go, we are seeing 170 late writes per million shutdowns. Of those, all of them are from JS using nsAtomicFileOutputStream or nsSafeFileOutputStream. Given the extremely low frequency, and the fact that these streams are intended to be safe for these purposes anyway, I feel comfortable advancing fast shutdown to xpcom-shutdown after the upcoming merge. What are your opinions?

[1]: unaccounted-for here means writes which I cannot with high confidence say are okay based on an understanding of what the code is trying to do. Here are some examples of accounted-for writes:

        "(xul.pdb) mozilla::IOInterposeObserver::Observation::Report()",
        "(xul.pdb) `anonymous namespace'::InterposedNtWriteFile(void*, void*, void (*)(void*, _IO_STATUS_BLOCK*, unsigned long), void*, _IO_STATUS_BLOCK*, void*, unsigned long, _LARGE_INTEGER*, unsigned long*)",
        "(kernelbase.pdb) WriteFile",
        "(TmUmEvt64.pdb) undefined",
        "(TmUmEvt64.pdb) undefined",
        "(TmUmEvt64.pdb) undefined",
        "(TmUmEvt64.pdb) undefined",
        "(TmUmEvt64.pdb) undefined",
        "(tmmon64.pdb) undefined",
        "(tmmon64.pdb) undefined",
        "(tmmon64.pdb) undefined",
        "(tmmon64.pdb) undefined",
        "(kernelbase.pdb) VirtualProtectEx",
        "(kernelbase.pdb) VirtualProtect",
        "(xul.pdb) js::jit::ReprotectRegion(void*, unsigned long long, js::jit::ProtectionSetting, js::jit::MustFlushICache)",
        "(xul.pdb) static js::jit::ExecutableAllocator::poisonCode(JSRuntime*, mozilla::Vector<js::jit::JitPoisonRange,0,js::SystemAllocPolicy>&)",
        "(xul.pdb) JSFreeOp::~JSFreeOp()",
        "(xul.pdb) js::gc::GCRuntime::performSweepActions(js::SliceBudget&)",
        ...

I don't know precisely why the OS is doing a WriteFile inside VirtualProtectEx, but I assume that if it is for some reason actually writing to disk (which it probably isn't), it's not a problem.

        "(xul.pdb) mozilla::IOInterposeObserver::Observation::Report()",
        "(xul.pdb) `anonymous namespace'::InterposedNtWriteFile(void*, void*, void (*)(void*, _IO_STATUS_BLOCK*, unsigned long), void*, _IO_STATUS_BLOCK*, void*, unsigned long, _LARGE_INTEGER*, unsigned long*)",
        "(kernelbase.pdb) WriteFile",
        "(TmUmEvt64.pdb) undefined",
        "(TmUmEvt64.pdb) undefined",
        "(TmUmEvt64.pdb) undefined",
        "(TmUmEvt64.pdb) undefined",
        "(TmUmEvt64.pdb) undefined",
        "(tmmon64.pdb) undefined",
        "(tmmon64.pdb) undefined",
        "(tmmon64.pdb) undefined",
        "(tmmon64.pdb) undefined",
        "(ntdll.pdb) LdrpUnmapModule",
        "(ntdll.pdb) LdrpDereferenceModule",
        "(ntdll.pdb) LdrUnloadDll",
        "(tmmon64.pdb) undefined",
        "(kernelbase.pdb) FreeLibrary",
        "(xul.pdb) mozilla::dom::WinWebAuthnManager::~WinWebAuthnManager()",
        "(xul.pdb) mozilla::ClearOnShutdown_Internal::PointerClearer<mozilla::StaticAutoPtr<mozilla::dom::WinWebAuthnManager> >::Shutdown()",
        "(xul.pdb) mozilla::KillClearOnShutdown(mozilla::ShutdownPhase)",
        ...

Same justification as above - I doubt that we care about the writes from inside LdrUnloadDll.

        "(xul.pdb) mozilla::IOInterposeObserver::Observation::Report()",
        "(xul.pdb) `anonymous namespace'::InterposedNtWriteFile(void*, void*, void (*)(void*, _IO_STATUS_BLOCK*, unsigned long), void*, _IO_STATUS_BLOCK*, void*, unsigned long, _LARGE_INTEGER*, unsigned long*)",
        "(kernelbase.pdb) WriteFile",
        "(nss3.pdb) PR_MD_WRITE(PRFileDesc*, void const*, int)",
        "(nss3.pdb) FileWrite(PRFileDesc*, void const*, int)",
        "(xul.pdb) mozilla::scache::StartupCache::WriteToDisk()",
        "(xul.pdb) mozilla::detail::RunnableFunction<`lambda at /builds/worker/checkouts/gecko/startupcache/StartupCache.cpp:749:53'>::Run()",
        "(xul.pdb) nsThreadPool::Run()",
        "(xul.pdb) nsThread::ProcessNextEvent(bool, bool*)",
        "(xul.pdb) NS_ProcessNextEvent(nsIThread*, bool)",
        "(xul.pdb) mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*)",
        "(xul.pdb) MessageLoop::RunHandler()",
        "(xul.pdb) MessageLoop::Run()",
        "(xul.pdb) static nsThread::ThreadFunc(void*)",
        "(nss3.pdb) _PR_NativeRunThread(void*)",
        "(nss3.pdb) pr_root(void*)",
        "(ucrtbase.pdb) thread_start<unsigned int (__cdecl*)(void *),1>",
        "(kernel32.pdb) BaseThreadInitThunk",
        "(ntdll.pdb) RtlUserThreadStart"

We don't want to interrupt a write to the StartupCache, but that's not what would be happening here: this is a side-effect of the fact that we flush writes to the startup cache when we enable late write checking as well as when we actually do the fast shutdown. So when we advance fast shutdown, we will still be flushing immediately before exit, and thus won't interrupt anything. Skipping a StartupCache write is just fine (especially because of the low frequency.)

Flags: needinfo?(nfroyd)

See https://bugzilla.mozilla.org/show_bug.cgi?id=1623943#c10 for more info,
but in short, this should be ready - there are a very small number of late
writes coming in via telemetry. I have gone through them, and the vast
majority of them are clearly nonissues. Of those remaining, which is on the
order of one hundredth of one percent of shutdowns, they are all going
through atomic file streams.

(In reply to Doug Thayer [:dthayer] from comment #10)

Nathan, at this point, as far as unaccounted-for[1] writes go, we are seeing 170 late writes per million shutdowns. Of those, all of them are from JS using nsAtomicFileOutputStream or nsSafeFileOutputStream. Given the extremely low frequency, and the fact that these streams are intended to be safe for these purposes anyway, I feel comfortable advancing fast shutdown to xpcom-shutdown after the upcoming merge. What are your opinions?

I'm not clear on what you mean by "these streams are intended to be safe for these purposes anyway". Do you mean that such streams are used with the intent to avoid data corruption (via atomic rename or similar)? I agree with that statement, but I'm not sure that it means we can throw away those writes. Presumably they are saving some kind of user data, so my inclination would be to try and save them. Or do you mean something different?

Flags: needinfo?(nfroyd)

(In reply to Nathan Froyd [:froydnj] from comment #14)

(In reply to Doug Thayer [:dthayer] from comment #10)

Nathan, at this point, as far as unaccounted-for[1] writes go, we are seeing 170 late writes per million shutdowns. Of those, all of them are from JS using nsAtomicFileOutputStream or nsSafeFileOutputStream. Given the extremely low frequency, and the fact that these streams are intended to be safe for these purposes anyway, I feel comfortable advancing fast shutdown to xpcom-shutdown after the upcoming merge. What are your opinions?

I'm not clear on what you mean by "these streams are intended to be safe for these purposes anyway". Do you mean that such streams are used with the intent to avoid data corruption (via atomic rename or similar)? I agree with that statement, but I'm not sure that it means we can throw away those writes. Presumably they are saving some kind of user data, so my inclination would be to try and save them. Or do you mean something different?

I mean what you say, but my stance would be that the volume is so low, that this is marginal next to the existing risks associated with power failure, crashes, or explicitly exiting the process via task manager when hanging, especially considering that what we would be talking about is data loss, and not data corruption.

Now, to dive a little further, all of these stacks are annoying because they go through JS. I think it would be prudent as a next step to get JS stack frames as part of late write reporting. However, I don't think that should block this, because we can enumerate all locations which should produce the nsSafeFileOutputStream and nsAtomicFileOutputStream writes from JS: https://searchfox.org/mozilla-central/search?q=openAtomicFileOutputStream%7CopenSafeFileOutputStream&path=&case=false&regexp=true

In order, with what I'm seeing right now:

  • Profile migrator - migrating profiles from other browsers can take annoyingly large amounts of time. It's plausible to me that these late writes are due to users exiting because that migration is taking excessively long, and they want to interrupt it.
  • Devtools memory.js - writing a heap snapshot to disk. I hardly think this is critical if a user is trying to exit while it's going on.
  • Devtools performace io.js - writing a profile to disk. Same as above
  • Devtools adb-binary.js unpacking adb files to somewhere on disk. Not entirely sure what this is going, but it's still devtools and I doubt it's critical
  • Devtools stylesheeteditor.jsm - writing style editor contents to file. Again if the user is trying to close during this I'm not worried about losing it.
  • A bunch of tools to generate stuff outside of normal use of the browser
  • Bookmark exporter - same logic as above
  • About webRTC savepage handler - similar to above
  • Update service - covered by Molly's comment in this bug above.
Flags: needinfo?(nfroyd)

(In reply to Doug Thayer [:dthayer] from comment #15)

(In reply to Nathan Froyd [:froydnj] from comment #14)

(In reply to Doug Thayer [:dthayer] from comment #10)

Nathan, at this point, as far as unaccounted-for[1] writes go, we are seeing 170 late writes per million shutdowns. Of those, all of them are from JS using nsAtomicFileOutputStream or nsSafeFileOutputStream. Given the extremely low frequency, and the fact that these streams are intended to be safe for these purposes anyway, I feel comfortable advancing fast shutdown to xpcom-shutdown after the upcoming merge. What are your opinions?

I'm not clear on what you mean by "these streams are intended to be safe for these purposes anyway". Do you mean that such streams are used with the intent to avoid data corruption (via atomic rename or similar)? I agree with that statement, but I'm not sure that it means we can throw away those writes. Presumably they are saving some kind of user data, so my inclination would be to try and save them. Or do you mean something different?

I mean what you say, but my stance would be that the volume is so low, that this is marginal next to the existing risks associated with power failure, crashes, or explicitly exiting the process via task manager when hanging, especially considering that what we would be talking about is data loss, and not data corruption.

The specific thing I was thinking of was session restore, but I see that uses AsyncShutdown blockers, so maybe that's not relevant here.

Now, to dive a little further, all of these stacks are annoying because they go through JS. I think it would be prudent as a next step to get JS stack frames as part of late write reporting. However, I don't think that should block this, because we can enumerate all locations which should produce the nsSafeFileOutputStream and nsAtomicFileOutputStream writes from JS: https://searchfox.org/mozilla-central/search?q=openAtomicFileOutputStream%7CopenSafeFileOutputStream&path=&case=false&regexp=true

I am mildly impressed that all accesses from JS do seem to go through the FileUtils.jsm functions.

In order, with what I'm seeing right now:

  • Profile migrator - migrating profiles from other browsers can take annoyingly large amounts of time. It's plausible to me that these late writes are due to users exiting because that migration is taking excessively long, and they want to interrupt it.

Is that "annoyingly large amounts of time" because the profiles are rather large, or because the conversion process is inefficient?

Just to be clear here, we're going to interrupt a profile migrating because it's taking too long so we can shutdown. Thus when we start up again, the profile migration is not going to be done, and the user's going to be disgruntled?

  • Devtools memory.js - writing a heap snapshot to disk. I hardly think this is critical if a user is trying to exit while it's going on.
  • Devtools performace io.js - writing a profile to disk. Same as above
  • Devtools adb-binary.js unpacking adb files to somewhere on disk. Not entirely sure what this is going, but it's still devtools and I doubt it's critical
  • Devtools stylesheeteditor.jsm - writing style editor contents to file. Again if the user is trying to close during this I'm not worried about losing it.
  • A bunch of tools to generate stuff outside of normal use of the browser
  • About webRTC savepage handler - similar to above
  • Update service - covered by Molly's comment in this bug above.

These do all seem relatively safe to drop on the floor.

  • Bookmark exporter - same logic as above

This seems to fall under the same concern as the profile migration interruption.

I guess 0.017% probably falls beneath almost any reasonable threshold for moving forward with these changes, but I'm curious if we can't somehow mitigate the profile migration and bookmarks export.

Flags: needinfo?(nfroyd)

(In reply to Nathan Froyd [:froydnj] from comment #16)

Is that "annoyingly large amounts of time" because the profiles are rather large, or because the conversion process is inefficient?

Because the conversion process is inefficent. I actually worked on that a bit to get it better, and it is certainly better than it was, but then it was dropped as it was deemed unimportant. I'm not sure about that decision but that's where we're at.

EDIT: to be clear, it's really a mix of large profiles and being less efficient than we could be. "Regular people profiles" seem to not be terrible, but they are not fast enough to do quietly in the background, as we would be janking the browser without it being clear to the user that we're doing something big. The project to migrate profiles quietly in the background was dropped, so the effort to optimize it such that we could do it in the background was thus dropped.

I guess 0.017% probably falls beneath almost any reasonable threshold for moving forward with these changes, but I'm curious if we can't somehow mitigate the profile migration and bookmarks export.

On taking a closer look, first, we can exclude the profile migrator case. This particular profile migrator is not doing this async, so it will simply block the main thread. That's obviously not great, but it has nothing to do with this bug.

Regarding the bookmarks, I'm honestly just worried about making things worse by blocking on it, because we currently don't wait for it anyway. The write can occur after xpcom-shutdown which will trip our sensors, but it could easily be skipped. Fast shutdown will slightly increase the odds that we will skip it, because we will just be giving it less time, but it's not a qualitative change of behavior. If we instead forcibly block until the task is done, we may frustrate users if the writes are too large. Also, the writes all appear to be on the main thread anyway - we just interrupt every now and then to asynchronously request favicon info - this would mean that really slow writes to this file likely jank the whole browser, which may be what the user trying to get away from anyway.

Flags: needinfo?(nfroyd)

(In reply to Doug Thayer [:dthayer] from comment #17)

Regarding the bookmarks, I'm honestly just worried about making things worse by blocking on it, because we currently don't wait for it anyway. The write can occur after xpcom-shutdown which will trip our sensors, but it could easily be skipped. Fast shutdown will slightly increase the odds that we will skip it, because we will just be giving it less time, but it's not a qualitative change of behavior. If we instead forcibly block until the task is done, we may frustrate users if the writes are too large. Also, the writes all appear to be on the main thread anyway - we just interrupt every now and then to asynchronously request favicon info - this would mean that really slow writes to this file likely jank the whole browser, which may be what the user trying to get away from anyway.

I guess the real right thing to do would be to rewrite the bookmarks export to use AsyncShutdown blockers? I kind of shudder at what file is so large that it's going to still be doing writes into xpcom-shutdown, but....

Thanks for the thorough responses to these questions. I think we can proceed.

Flags: needinfo?(nfroyd)
Pushed by dothayer@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/e5b7a0f84f39
Advance fastShutdownStage to 3 on Nightly r=froydnj
https://hg.mozilla.org/integration/autoland/rev/dfbfe6b93b45
Early exit from healer task if app is shutting down r=keeler
https://hg.mozilla.org/integration/autoland/rev/a75ab65ddf86
Exit from IdleSaveIntermediateCerts if shutting down r=keeler
Status: ASSIGNED → RESOLVED
Closed: 1 month ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla80
You need to log in before you can comment on or make changes to this bug.