Open Bug 1363586 Opened 8 years ago Updated 2 years ago

[meta] Creating the XPCOM system info service is super expensive

Categories

(Core :: XPCOM, enhancement)

enhancement

Tracking

()

Performance Impact medium

People

(Reporter: ehsan.akhgari, Unassigned)

References

(Depends on 3 open bugs, Blocks 1 open bug)

Details

(4 keywords, Whiteboard: [fxperfsize:S])

Attachments

(1 obsolete file)

See bug 1363421 comment 0 https://perfht.ml/2qNRVEc where this is taking 21ms. Nathan, can we please make this faster?
Flags: needinfo?(nfroyd)
(In reply to :Ehsan Akhgari (super long backlog, slow to respond) from comment #0) > See bug 1363421 comment 0 https://perfht.ml/2qNRVEc where this is taking > 21ms. > > Nathan, can we please make this faster? I guess we could add a function (GetOSVersionInfo?) to just retrieve the information nsHttpHandler needs. That would avoid spinning up the entire service and querying things that we don't need immediately (geolocation?!). Or we could make nsSystemInfo smarter and lazily init all the (expensive) fields. That would also help, since it looks like GetCountryCode/GetGeoInfo are really the expensive things; initializing the system info service is pretty fast, all things considered.
Flags: needinfo?(nfroyd)
maybe dragana would do the getosversioninfo-only method...
Flags: needinfo?(dd.mozilla)
I prefer GetOSVersionInfo only method. If that is not fast enough, I could try to do this check on socketThread, it is possible.
Flags: needinfo?(dd.mozilla)
In nsHttpHandler::InitUserAgentComponents we already have some code querying OS version directly from platform dependent API without going through system-info. https://searchfox.org/mozilla-central/rev/d66b9f27d5630a90b2fce4d70d4e9050f43df9b4/netwerk/protocol/http/nsHttpHandler.cpp#964 By tracing the platform dependent API and nsSystemInfo implementation, we should be able to use PR_GetSystemInfo for Firefox desktop. https://searchfox.org/mozilla-central/rev/d66b9f27d5630a90b2fce4d70d4e9050f43df9b4/nsprpub/pr/src/misc/prsystem.c#85
Attached patch bug1363586.patch (obsolete) — Splinter Review
People may try this patch to see if avoid system-info really helps the start-up performance. Try result looks good in general: https://treeherder.mozilla.org/#/jobs?repo=try&revision=89fd9f0eebd4bee14ca4bde7f89fa20e085c2d07
Comment on attachment 8867119 [details] [diff] [review] bug1363586.patch Filed bug 1365566 for this possible Necko-specific workaround.
Attachment #8867119 - Attachment is obsolete: true
Priority: -- → P3
Here it takes more than 2s in a startup profile: https://perfht.ml/2ueHnRp on main thread IO for something we don't need. The caller is AppConstants.isPlatformAndVersionAtLeast and we spend our time in nsSystemInfo::GetProfileHDDInfo which only https://searchfox.org/mozilla-central/rev/de5c4376b89c3603341d226a65acea12f8851ec5/toolkit/components/telemetry/TelemetryEnvironment.jsm#1641 uses. We should lazily initialize most of Services.sysinfo, or completely avoid using it before painting the browser UI.
Keywords: perf
Whiteboard: [qf][fxperf]
Whiteboard: [qf][fxperf] → [qf:p1][qf:f64][fxperf]
Whiteboard: [qf:p1][qf:f64][fxperf] → [qf:p1][qf:f64][fxperf:p2]
Whiteboard: [qf:p1][qf:f64][fxperf:p2] → [qf:p1:f64][fxperf:p2]
Otherwise I'd lower the priority of this one, but looks like the fix for bug 1365566 was non-Android only.
Assignee: nobody → dd.mozilla
Status: NEW → ASSIGNED
Does anyone know if AndroidBridge is better? Or any suggestion what should I use?
Whiteboard: [qf:p1:f64][fxperf:p2] → [qf:p2:responsiveness][fxperf:p2]

I guess this should be reprofiled on Android.

Blocks: 1541191

This has gotten worse on Windows, even on my fast SSD machine we now spend several ms inside GetWindowsSecurityCenterInfo which does a bunch of disk IO, presumably trying to work out what AV solution the machine uses. Punting this back into fxperf triage.

Whiteboard: [qf:p2:responsiveness][fxperf:p2] → [qf:p2:responsiveness][fxperf]

(In reply to :Gijs (he/him) from comment #12)

This has gotten worse on Windows, even on my fast SSD machine we now spend several ms inside GetWindowsSecurityCenterInfo which does a bunch of disk IO, presumably trying to work out what AV solution the machine uses. Punting this back into fxperf triage.

GetWindowsSecurityCenterInfo usage was added in bug 1418131.

Dragana, are you still working on this, or would it be OK if I took it?

Flags: needinfo?(dd.mozilla)
Whiteboard: [qf:p2:responsiveness][fxperf] → [qf:p2:responsiveness][fxperf:p2][fxperfsize:S]

(In reply to :Gijs (he/him) from comment #14)

Dragana, are you still working on this, or would it be OK if I took it?

no. you can take it.

Flags: needinfo?(dd.mozilla)
Assignee: dd.mozilla → gijskruitbosch+bugs

Some initial considerations just to have them written down:

  • some things are only needed by telemetry, and might be better off living in telemetry code instead of here.
  • on the other hand, we've had cases in the past where info is only available off telemetry that other code then ends up wanting to use, and that then gets messy (cf. bug 1498213)
  • some things are completely unused and can just be removed
  • some things are only used from other C++ code which could just call the relevant NSPR code instead
  • while initializing-on-first-fetch is relatively easy to do (by hooking some 'lazy' init code into all the get* calls), it would keep some of the related IO on the main thread which seems undesirable
  • we fetch environment bits (like firewall/antivirus/antispyware etc.) for telemetry on every startup, but this is comparatively expensive, and rarely changes so could probably be cached in some way and updated lazily, OMT.

Raw list of everything I could find inside system info service itself (note that it's a writable property bag so other things can put items into the system info service without the implementation living there, but that seems less interesting for what we're doing here), and where it's used, grouped by usage:

"hasWindowsTouchInterface" #unused
"pagesize" #unused (except tests)
"pageshift" #unused (except tests)
"memmapalign" #unused (except tests)

"hasSeccompBPF" about:support only
"hasSeccompTSync" about:support only
"hasUserNamespaces" about:support only
"hasPrivilegedUserNamespaces" about:support only
"canSandboxContent" about:support only
"canSandboxMedia" about:support only

"registeredAntiSpyware" Telemetry and about:support
"registeredAntiVirus" Telemetry and about:support
"registeredFirewall" Telemetry and about:support

"isWowARM64" Telemetry only
"isWow64" Telemetry only

"installYear" Telemetry only
"binHDDModel" Telemetry only
"binHDDRevision" Telemetry only
"binHDDType" Telemetry only
"winHDDModel" Telemetry only
"winHDDRevision" Telemetry only
"winHDDType" Telemetry only

"appleModelId" Telemetry only

"hasAVX" Telemetry only
"hasAVX2" Telemetry only
"hasAES" Telemetry only
"hasEDSP" Telemetry only
"virtualmemsize" Telemetry only
"cpuvendor" Telemetry only
"cpufamily" Telemetry only
"cpustepping" Telemetry only
"cpucachel2" Telemetry only
"cpucachel3" Telemetry only

"hasMMX" Telemetry and update utils
"hasSSE" Telemetry and update utils
"hasSSE2" Telemetry and update utils
"hasSSE3" Telemetry and update utils
"hasSSSE3" Telemetry and update utils
"hasSSE4A" Telemetry and update utils
"hasSSE4_1" Telemetry and update utils
"hasSSE4_2" Telemetry and update utils
"hasARMv6" Telemetry and update utils
"hasARMv7" Telemetry and update utils
"hasNEON" Telemetry and update utils

"cpucount" Telemetry and profiler metadata, IPC prioritization on Linux + Android (where it's just PR_GetNumberOfProcessors() anyway...)
"cpucores" Telemetry and profiler metadata

"countryCode" search service which uses it for telemetry (but, afaict, nothing else). Note ICU also has code for this...

"memsize" Used from Places (bookmarks/history/etc.) code
"name" Used in lots of places, also available as Services.appinfo.OS
"arch" Used in some places, similar info available on Services.appinfo.XPCOMABI
"version" Used lots

"umask" Used by OS.File/OS.Constants and maybe other stuff

"secondaryLibrary" Blocklist/AMO/Updates

Android only (not looked into these, by and large):

"device"
"manufacturer"
"release_version"
"tablet"
"kernel_version"
"hardware"

The main performance problem at the moment is that the service causes mainthread IO and is initialized on the profile-do-change notification early in startup.

The bits that cause main thread IO relate to the antivirus/firewall/antispyware data gathering and the data gathering about the disk type for the user profile. Both of those are used in the telemetry environment object which is used to annotate crashreports.

In order to fix the performance issue, this data gathering needs to happen async / off-main-thread, and later in startup. We could either leave the data blank before then or use a cache stored e.g. in prefs or XULStore or similar.

The SSD/HDD data seems like it got added in bug 1533861 which doesn't look like crash data would care about it. Ditto for the antivirus/antispyware stuff from bug 1418131.

Some of the other data that causes IO in some circumstances (e.g. CPU identification that touches the registry) seems like it'd be fine to cache.

Either way, however, that would mean that early startup crashes would not have this data. I don't believe that would be critical, but it's not my call - I filed bug 1552858 to get some feedback from data science about this.

Blocks: 1533861, 1418131
Depends on: 1552858

(In reply to :Gijs (he/him) from comment #17)

The SSD/HDD data seems like it got added in bug 1533861 which doesn't look like crash data would care about it. Ditto for the antivirus/antispyware stuff from bug 1418131.

I object to the notion that AV information is irrelevant to crash report data.

EDIT: As a compromise, I propose that we file a bug to move collection of AV information for the purposes of crash reports into minidump-analyzer. That way we can move the remaining use case of collecting AV info for telemetry off the main thread.

(In reply to Aaron Klotz [:aklotz] from comment #18)

(In reply to :Gijs (he/him) from comment #17)

The SSD/HDD data seems like it got added in bug 1533861 which doesn't look like crash data would care about it. Ditto for the antivirus/antispyware stuff from bug 1418131.

I object to the notion that AV information is irrelevant to crash report data.

Well, my understanding was that, where crashes are reported, we'd get this information through the 'loaded DLLs' info anyway, rather than through the telemetry environment json dump that's part of the crash report. Is that not the case?

The other thing I noticed was that the information has no versioning, and so presumably we could cache the data across restarts and update it (lazily, OMT) once the UI is up and running, potentially with a fallback that'd cause us to force getting the data earlier if the previous startup crashed. Would that not work?

EDIT: As a compromise, I propose that we file a bug to move collection of AV information for the purposes of crash reports into minidump-analyzer. That way we can move the remaining use case of collecting AV info for telemetry off the main thread.

That's a good idea - though I can't see off-hand how this data fits into what the minidump-analyzer is doing, probably because I know very little about our crashreporting infrastructure. Is it "just" a question of adding extra fields to the JSON data as written in UpdateExtraDataFile ? And, would this be preferable over what I just suggested?

Flags: needinfo?(aklotz)

(In reply to :Gijs (he/him) from comment #19)

Well, my understanding was that, where crashes are reported, we'd get this information through the 'loaded DLLs' info anyway, rather than through the telemetry environment json dump that's part of the crash report. Is that not the case?

Injected DLLs are not the only vector from which AVs can cause problems. They may be causing issues even though they don't appear in "loaded DLLs." The only way to correlate that would be via the AV info we're collecting.

The other thing I noticed was that the information has no versioning, and so presumably we could cache the data across restarts and update it (lazily, OMT) once the UI is up and running, potentially with a fallback that'd cause us to force getting the data earlier if the previous startup crashed. Would that not work?

My concern would be if the AV prevents the browser from ever starting correctly in the first place, thus preventing construction of the cache and/or initial info. It happens, and I have no intention of opening up new blind spots in our AV data collection.

EDIT: As a compromise, I propose that we file a bug to move collection of AV information for the purposes of crash reports into minidump-analyzer. That way we can move the remaining use case of collecting AV info for telemetry off the main thread.

That's a good idea - though I can't see off-hand how this data fits into what the minidump-analyzer is doing, probably because I know very little about our crashreporting infrastructure.

I'll give you an example of where we're already doing this: It is really expensive for us to pull cert information out of every DLL loaded into the process when generating the crash dump. I moved that code to minidump-analyzer so that Firefox can get itself back up and running as quickly as possible, leaving the analyzer to perform those annotations in the background.

Is it "just" a question of adding extra fields to the JSON data as written in UpdateExtraDataFile?

It should be, though to prevent needing to change anything on the Soccorro side, we should ensure that we're using the same format and location in the JSON file as the existing AV annotations that we are already using on release.

And, would this be preferable over what I just suggested?

I think that, based on my previous answers, the answer is yes, minidump-analyzer is the preferred route to go here.

Flags: needinfo?(aklotz)

(In reply to Aaron Klotz [:aklotz] from comment #20)

It should be, though to prevent needing to change anything on the Soccorro side, we should ensure that we're using the same format and location in the JSON file as the existing AV annotations that we are already using on release.

Which are what? You parse it out of the JSON blob that gets sent for TelemetryEnvironment? Because sticking with that would be tricky - we'd need to (from the minidump analyzer) read the current telemetry environment out of the annotations, then append the antivirus/spyware/firewall info, then write it out again... Is that really the only/best option? Wouldn't it be nicer to have it in separate fields anyway?

Flags: needinfo?(aklotz)

(To be clear, from a casual look at the current minidump analyzer code, it looked like all the things it adds get appended to the extant file without reading its contents, but perhaps I'm misreading the code, as I'm not familiar with it.)

Just a note, I believe almost nobody has ever used this information yet for analyzing crashes. Indeed, I think the Telemetry JSON information is almost never used as it's not easily searchable on Socorro.
I see the AV info being useful though, and moving it to be a crash annotation is going to make it actually useful for people analyzing crash data.

(In reply to :Gijs (he/him) from comment #22)

(To be clear, from a casual look at the current minidump analyzer code, it looked like all the things it adds get appended to the extant file without reading its contents, but perhaps I'm misreading the code, as I'm not familiar with it.)

Yes, that's what's happening right now. However I'm in the process of switching the .extra file format to JSON instead of the current INI-like format (bug 1420363). This will happen in the coming weeks and at that point the minidump-analyzer will not blindly append its annotation any more.

(In reply to :Gijs (he/him) from comment #21)

Which are what? You parse it out of the JSON blob that gets sent for TelemetryEnvironment? Because sticking with that would be tricky - we'd need to (from the minidump analyzer) read the current telemetry environment out of the annotations, then append the antivirus/spyware/firewall info, then write it out again... Is that really the only/best option? Wouldn't it be nicer to have it in separate fields anyway?

(In reply to Marco Castelluccio [:marco] from comment #23)

I see the AV info being useful though, and moving it to be a crash annotation is going to make it actually useful for people analyzing crash data.

Yes, the more that I page back in about the current state of things, the more I agree that it makes sense to move the AV stuff over to regular crash annotations. We don't need any changes on the backend to do that.

(In reply to Gabriele Svelto [:gsvelto] from comment #24)

(In reply to :Gijs (he/him) from comment #22)

(To be clear, from a casual look at the current minidump analyzer code, it looked like all the things it adds get appended to the extant file without reading its contents, but perhaps I'm misreading the code, as I'm not familiar with it.)

Yes, that's what's happening right now. However I'm in the process of switching the .extra file format to JSON instead of the current INI-like format (bug 1420363). This will happen in the coming weeks and at that point the minidump-analyzer will not blindly append its annotation any more.

This is also good news, since the AV API actually returns arrays of registered products, so if we were to wait for bug 1420363, recording the AV info as JSON-style annotations would be the best way forward.

Flags: needinfo?(aklotz)

OK, I'm going to split things out.

Assignee: gijskruitbosch+bugs → nobody
Status: ASSIGNED → NEW
Priority: P3 → --
Summary: Creating the XPCOM system info service is super expensive → [meta] Creating the XPCOM system info service is super expensive
Keywords: meta
Depends on: 1553536
Depends on: 1553538
Depends on: 1553540
Depends on: 1553542
Depends on: 1553546
Depends on: 1553558
Performance Impact: --- → P2
Whiteboard: [qf:p2:responsiveness][fxperf:p2][fxperfsize:S] → [fxperf:p2][fxperfsize:S]
Keywords: perf:startup
Whiteboard: [fxperf:p2][fxperfsize:S] → [fxperfsize:S]
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: