Bug 810406 (Open): opened 12 years ago, updated 2 years ago

Browser hang while loading Telemetry evolution

Categories

(Core :: JavaScript Engine, defect)

Platform: x86 macOS


People

(Reporter: lmandel, Unassigned)

Details

(Whiteboard: [Snappy])

I see 85% of the time in mmap under MemoryReporter_Resident::GetAmount, so it doesn't seem like a JS engine issue.  Perhaps njn or jlebar know more about this code?
Whoa, in TelemetryPing.js:

    case "cycle-collector-begin":
      let now = new Date();
      if (!gLastMemoryPoll
          || (TELEMETRY_INTERVAL <= now - gLastMemoryPoll)) {
        gLastMemoryPoll = now;
        this.gatherMemory();
      }
      break;

We get memory info on every CC, and apparently getting RSS in particular is very slow on Mac.  That looks quite bad, in general.  A CC is already very slow, and we're just making it slower here.

Note also that bug 678085 completely invalidates all comparisons of memory telemetry before/after that change.

See also bug 792888, which I believe I was seeing on my Linux box, but I'm not sure.  Probably not the same thing.

But the real question is: Are we actually calling GetResident() no more than once every 60s?  If so, this is /just one/ call to GetResident which is hanging for many seconds.  Which is seriously bad!
GetResident() for Mac is here:

http://mxr.mozilla.org/mozilla-central/source/xpcom/base/nsMemoryReporterManager.cpp#271

The profile says that all the samples are in the __mmap call(s) within GetResident().  GetTaskBasicInfo() calls task_info() but doesn't obviously call __mmap.  
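
For reference, the task_info() side of this is roughly the following. This is only a sketch of the approach; the struct and constant names come from <mach/task_info.h>, not from the Gecko source:

    #include <stdint.h>
    #include <mach/mach.h>
    #include <mach/task_info.h>

    // Sketch: read this process's resident set size (RSS) via task_info().
    // This is the cheap half of getting RSS; the expensive part on Mac is
    // whatever runs before it to make the number accurate.
    static bool GetResidentBytes(int64_t* aOut)
    {
      task_basic_info_data_t info;
      mach_msg_type_number_t count = TASK_BASIC_INFO_COUNT;
      if (task_info(mach_task_self(), TASK_BASIC_INFO,
                    reinterpret_cast<task_info_t>(&info), &count) != KERN_SUCCESS) {
        return false;
      }
      *aOut = static_cast<int64_t>(info.resident_size);  // bytes
      return true;
    }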

jemalloc_purge_freed_pages is only blamed for 4 samples but I suspect the stacks are slightly off and it's actually the problem.  Note especially this comment:

// Purging these pages shouldn't take more than 10ms or so, but we want to
// keep an eye on it since GetResident() is called on each Telemetry ping.

Prescience!
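
(For anyone following along: my understanding is that on Mac, pages freed with madvise(MADV_FREE) keep counting against RSS until the kernel reclaims them, so the purge forces them out by remapping anonymous memory over the freed runs, which would explain why it shows up under __mmap and why it can be slow when the heap is fragmented. A very rough sketch of that idea follows; the address and length are placeholders, not jemalloc's real bookkeeping.)

    #include <stddef.h>
    #include <sys/mman.h>

    // Sketch: force the kernel to drop MADV_FREE'd pages from RSS by mapping
    // fresh anonymous memory over the freed run.  One mmap per freed run is
    // why purging a fragmented heap can take a while.
    static void PurgeFreedRun(void* aAddr, size_t aLength)
    {
      // MAP_FIXED replaces the existing mapping in place; the new pages are
      // untouched, so they no longer count as resident.
      void* p = mmap(aAddr, aLength, PROT_READ | PROT_WRITE,
                     MAP_FIXED | MAP_PRIVATE | MAP_ANON, -1, 0);
      (void)p;  // real code would check for MAP_FAILED
    }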

jlebar, it seems the gods really don't want us to get accurate RSS data on Mac :(
The real question is why we are even bothering to listen for cycle collection at all; we call gatherMemory() every time we send a ping:

http://mxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/TelemetryPing.js#556

so the gatherMemory() at (some) cycle collections doesn't look that useful.
(In reply to Nathan Froyd (:froydnj) from comment #4)
> The real question is why we are even bothering to listen for cycle
> collection at all; we call gatherMemory() every time we send a ping:
> 
> http://mxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/
> TelemetryPing.js#556
> 
> so the gatherMemory() at (some) cycle collections doesn't look that useful.

We only send a ping a few times a day; we want to collect memory stats more often.

> jemalloc_purge_freed_pages is only blamed for 4 samples but I suspect the stacks are 
> slightly off and it's actually the problem.

That's an easy hypothesis to test.  Lawrence, is this reproducible for you if you restart your browser?  Can we send you a tryserver build?

Otherwise, should we add telemetry for how long the purge_freed_pages call takes?
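
If we go that route, something like the following could work. This is only a sketch: PURGE_FREED_PAGES_MS is a hypothetical histogram name, and the extern declaration stands in for whatever header actually exposes the purge function.

    #include <stdint.h>
    #include "mozilla/Telemetry.h"
    #include "mozilla/TimeStamp.h"

    extern "C" void jemalloc_purge_freed_pages();  // stand-in for the real declaration

    // Sketch: time the purge and report it so we can see how long it takes
    // in the wild.  PURGE_FREED_PAGES_MS is a hypothetical histogram ID.
    static void PurgeFreedPagesAndReport()
    {
      mozilla::TimeStamp start = mozilla::TimeStamp::Now();
      jemalloc_purge_freed_pages();
      uint32_t ms = static_cast<uint32_t>(
          (mozilla::TimeStamp::Now() - start).ToMilliseconds());
      mozilla::Telemetry::Accumulate(mozilla::Telemetry::PURGE_FREED_PAGES_MS, ms);
    }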
To be clear, since the conversation here seems to be about the Telemetry client (in the product): the slowness I experienced was when accessing the Telemetry evolution dashboard. If this slowness coincidentally turns out to be due to the Telemetry client, that's fine. I just wanted to be clear.

I just tried to reproduce the issue (having restarted my browser) and am not able to do so. I'll try again later after my browser has been running for a little while.
> If this slowness is coincidentally due to the Telemetry client that's fine.

That's what the profile suggests: the slowness is due to GC, which is apparently slow because it collects telemetry info, which in turn does something slow; we're not quite sure what yet.
(In reply to Justin Lebar [:jlebar] from comment #2)
> Whoa, in TelemetryPing.js:
> 
>     case "cycle-collector-begin":
>       let now = new Date();
>       if (!gLastMemoryPoll
>           || (TELEMETRY_INTERVAL <= now - gLastMemoryPoll)) {
>         gLastMemoryPoll = now;
>         this.gatherMemory();
>       }
>       break;
> 
> We get memory info on every CC, and apparently getting RSS in particular is
> very slow on Mac.  That looks quite bad, in general.  A CC is already very
> slow, and we're just making it slower here.

To be clear, we only gather memory stats on a CC if at least TELEMETRY_INTERVAL has elapsed since the last poll.

> 
> Note also that bug 678085 completely invalidates all comparisons of memory
> telemetry before/after that change.

Yes, but that change was made early on in telemetry's history.

> But the real question is: Are we actually calling GetResident() no more than
> once every 60s?  If so, this is /just one/ call to GetResident which is
> hanging for many seconds.  Which is seriously bad!

That does sound bad. Someone should fix that. I decided to do memory polling based on CC notifications to avoid setting up a new timer. I checked with njn when I did that, and he said it seemed fine.
If this really is a Telemetry problem, should this bug be moved to Toolkit::Telemetry? If the probe should be changed/fixed by someone on the JS team, to whom can this bug be assigned?
Assignee: general → nobody
Severity: normal → S3