Closed Bug 1030487 Opened 10 years ago Closed 10 years ago

Don't report main-thread I/O to Telemetry unless it lasts more than 50ms

Categories

(Core :: General, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla34

People

(Reporter: vladan, Assigned: rafael.kourdis, Mentored)

References

Details

(Whiteboard: [good first bug][lang=c++])

Attachments

(1 file, 5 obsolete files)

BACKGROUND:

Firefox currently has a single-process architecture where most webpage, UI, and extension JavaScript is execute on a single thread (the "main thread"). Therefore, any long-running operation on the main thread makes Firefox unresponsive. I/O operations in particular have unbounded execution times, so it is very undesirable to do I/O on the main thread.

We want to collect information about all the main thread I/O currently happening in Firefox, so we intercept any I/O operations done on the main thread and report it to the "Telemetry" performance reporting system.

You can see the current results here: http://telemetry.mozilla.org/mainthreadio/

THE PROBLEM:

We currently report *ALL* main-thread I/O to Telemetry, but we really only want to report main-thread I/O that actually causes performance problems. For example, we don't want to inform Telemetry that a file was read on the main thread, if that file was first read on a background thread -- the file will have already been cached in the disk cache so the main thread read is not a performance problem.

Therefore we should only record occurences of main-thread I/O if they took more than 50ms to execute.

CODE:

You can add the check for duration here: http://hg.mozilla.org/mozilla-central/file/3961ab6093b4/xpcom/build/MainThreadIOLogger.cpp#l201
Does 50ms come from anywhere in particular?  Would mapping it to 60fps response time of 16ms make sense?
It's arbitrary. It's based on an estimate of how long a browser response to a user click could be delayed without the user noticing. We use the same threshold for slowSQL (http://telemetry.mozilla.org/slowsql)
And yes 16ms would make sense too
Hey Guys!

I find this a nice first bug for a beginner like me :) 
Can I work on it? I already have something ready, I can upload it tomorrow.

Thanks
Attached patch Patch Version 1 (obsolete) — Splinter Review
Hello,

Since there haven't been any replies, I've decided to upload the patch anyway.

I do believe the patch is correct, however, as this is the first time I've worked with
Mozilla code, I'm not entirely sure.

I really look forward to your feedback!

Thanks.
Attachment #8466741 - Flags: review?(vdjeric)
Mentor: vdjeric
Whiteboard: [good first bug][lang=c++][mentor=vladan] → [good first bug][lang=c++]
Assignee: nobody → rafael.kourdis
Status: NEW → ASSIGNED
Hi Rafel - just saw your comments in #developers - if it helps the "PTO" in the display name of Vladan Djeric (the person from whom you've requested review) means "personal time off" and is a Mozilla-ism (or USA-ism, not sure which) for "on holiday" - and so means Vladan is on holiday 4-5th August. This use of "PTO" catches people out all the time, so it's not just you :-)
Thank you for the patch!
Oh, wouldn't have guessed that, for sure :)
Thanks for your reply, looking forward for the review!
Comment on attachment 8466741 [details] [diff] [review]
Patch Version 1

Review of attachment 8466741 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, just a few small comments. Fix it up and I'll do a final review :)
See here for our full coding style guide: https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/Coding_Style#General_C.2FC.2B.2B_Practices

::: xpcom/build/MainThreadIOLogger.cpp
@@ +199,5 @@
>      return;
>    }
> +
> +  if(TimeStamp::Now() - aObservation.Start() < TimeDuration::FromMilliseconds(50))
> +  {

- Is aObservation.mEnd set at this point? If it is, "aObservation.End() - aObservation.Start()" might be faster and more accurate than "TimeStamp::Now() - aObservation.Start()"
- Define the 50 millisecond threshold as a static constant. For example: http://hg.mozilla.org/mozilla-central/file/de4289ba35d6/toolkit/components/telemetry/Telemetry.cpp#l2489
- Add a space between the "if" and the "("
- the curly brace "{" should be on the same line as the "if"

@@ +201,5 @@
> +
> +  if(TimeStamp::Now() - aObservation.Start() < TimeDuration::FromMilliseconds(50))
> +  {
> +    //Don't report I/O operation to Telemetry if there haven't
> +    //passed 50 ms. 

// Only report slow main-thread I/O to Telemetry (at least 50 ms)
Attachment #8466741 - Flags: review?(vdjeric)
Thanks a lot for the review :) I made the changes, and will upload the patch in a while.

However, I am not 100% sure if mEnd is set when MainThreadIOLoggerImpl::Observe() is called.

Here is what I've found so far :

MainThreadIOLoggerImpl::Observe() is called by CallObservers() in IOInterposer.cpp.
It (as I understand it) created a vector of pointers to observer objects and at line 157 ((*i)->Observe(aObservation);) observes them.

Now, PerThreadData::CallObservers() is only called by IOInterposer::Report() at IOInterposer.cpp. (line 534), and Report() is called by 3 functions.

First of all, the destructor of IOThreadAutoTimer (~IOThreadAutoTimer(), line 102, TelemetryVFS.cpp).
The destructor creates a new Observation, and feeds an End value (Timestamp::Now()), so, no problem here.

Secondly, Report() is called at IOInterposer.cpp, line 437, from inside the Observation class, and IS assigned an end time before Report().

Report() is also called at IOInterposer.cpp, line 590, where a NextStageObservation object is constructed and then gets reported. 

I can see from the constructor of NextStageObservation, that the aShouldReport flag of Observation is set to false, and Report() is not going to be called at IOInterposeObserver::Observation::Report() if the flag isn't set.

Please correct me if I'm wrong.

All these areas suggest that mEnd is set, when Observe() is called, however, if it's possible, I'd like a confirmation from you guys (I'm not experienced at all :) ).

Thanks for your time!
Attached patch Patch Version 2 (obsolete) — Splinter Review
Attachment #8469191 - Flags: review?(vdjeric)
Attachment #8466741 - Attachment is obsolete: true
Comment on attachment 8469191 [details] [diff] [review]
Patch Version 2

Review of attachment 8469191 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, just a few tiny changes. If you're ok with my suggestions, I can make the changes myself and land your patch in the repository

Did you build and test this patch by the way?

::: xpcom/build/MainThreadIOLogger.cpp
@@ +187,5 @@
>    PR_Close(fd);
>  }
>  
> +const double TelemetryReportThreshold = 50;
> +// Milliseconds

Move the comment above the declaration, and use TimeDuration units here to avoid repeating the conversion in MainThreadIOLoggerImpl::Observe

// Threshold for reporting slow main-thread I/O (50 milliseconds)
const TimeDuration kTelemetryReportThreshold = TimeDuration::FromMilliseconds(50);

@@ +201,5 @@
>      // The writer thread isn't running. Don't enqueue any more data.
>      return;
>    }
> +
> +  if (aObservation.End() - aObservation.Start() < TimeDuration::FromMilliseconds(TelemetryReportThreshold)){

this line is too long, but it turns out there's a Duration() method, so

if (aObservation.Duration() < kTelemetryReportThreshold) {
  ..
}
Attachment #8469191 - Flags: review?(vdjeric) → review+
Aaron, will this patch prevent the profiler or automated tests from seeing main-thread I/O taking less than 50ms?
Flags: needinfo?(aklotz)
(In reply to Vladan Djeric (:vladan) from comment #13)
> Aaron, will this patch prevent the profiler or automated tests from seeing
> main-thread I/O taking less than 50ms?

We should be making this change to TelemetryIOInterposeObserver instead of MainThreadIOLogger.

http://mxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/Telemetry.cpp#414

I think I mistakenly pointed out the wrong location when we were discussing the requirements for this bug. The good news is that the additional code should translate easily to the other class. Sorry about that. :-(
Flags: needinfo?(aklotz)
Hello,

Thanks for the valuable feedback!
No problem, I'll make the changes and will resubmit the patch.
(It might take a while, building firefox takes at least 1 hour on my computer.)

Vladan, yes, I build it successfully and run normally, however, I wasn't sure about when the function kicks in, and any printf's that I used, didn't show up...
(Could you help me with that?)

Thanks :)
Attached patch Bug1030487v3.patch (obsolete) — Splinter Review
Submitting fixed review based on Vladan's and Aaron's feedback.
Attachment #8469191 - Attachment is obsolete: true
Attachment #8470149 - Flags: review?(vdjeric)
Keep in mind, although it does compile and run normally, I haven't tested it (See comment 15).

Thanks and sorry for being late :)
- You don't need to rebuild the entire Firefox source when you make small changes like these in *.cpp files.
You can just type "mach build xpcom/build/" or "mach build toolkit/components/telemetry" from the top of your source tree. This should only take ~5 minutes at maximum.
- printfs should work as expected.. you could also set breakpoints. How are you launching Firefox after you build it?
Comment on attachment 8470149 [details] [diff] [review]
Bug1030487v3.patch

Looks good, please test it and then we can land it
Attachment #8470149 - Flags: review?(vdjeric) → review+
(Answering comment 18) :
I use ./mach run to launch FF.
I knew about not building the whole app, but wasn't sure about when it can be done...

------------------------
I'm now rebuilding everything after hg pull -u... I just had a problem where FF just hung (Can't see the output, as it's the terminal has a history limit).

I'm pretty sure it's not the patch (removing it didn't fix the problem).
HOWEVER, please wait for the build to finish before pushing the code... I'll leave a comment with the results.

Thanks
Heads up, actually, there is something wrong...
It's the patch that caused the crashes....

More info :
The program crashes with a signal 11.
Output---------------------------
...
UNKNOWN [/home/rafael/firefox-src/mozilla-central/obj-x86_64-unknown-linux-gnu/dist/bin/libxul.so +0x050F60D1]
UNKNOWN [/home/rafael/firefox-src/mozilla-central/obj-x86_64-unknown-linux-gnu/dist/bin/libxul.so +0x04CD44BF]
UNKNOWN 0x7f42d3b59943
Sleeping for 300 seconds.
Type 'gdb /home/rafael/firefox-src/mozilla-central/obj-x86_64-unknown-linux-gnu/dist/bin/firefox 3832' to attach your debugger to this thread.
---------------------------------

GDB OUTPUT----------------------
(process:3886): GLib-CRITICAL **: g_slice_set_config: assertion 'sys_page_size == 0' failed
nsStringStats
 => mAllocCount:             18
 => mReallocCount:           12
 => mFreeCount:              17  --  LEAKED 1 !!!
 => mShareCount:             16
 => mAdoptCount:              1
 => mAdoptFreeCount:          1
 => Process ID: 3886, Thread ID: 140737353918336
---------------------------------

I will post any updates here.
Yes, printf's do, in fact work (the function wasn't called previously, that's why I didn't see any output).

I think I've found the problem.

Printing aOb.DUration().ToMilliseconds prints a really large number (140734225164320) and causes a crash.
I believe mEnd isn't set (comment 10 was about MainThreadIOLoggerImpl::Observe()).

Any ideas?
Yep, mEnd is the problem...
(Ignore the large number, it was a humiliating mistake from my side, I used %lld to print a double :P)

Assertion failure: !IsNull() (Cannot compute with a null value), at ../../dist/include/mozilla/TimeStamp.h:304

From TimeStamp.h:

TimeDuration operator-(const TimeStamp& aOther) const
  {
    MOZ_ASSERT(!IsNull(), "Cannot compute with a null value");
    MOZ_ASSERT(!aOther.IsNull(), "Cannot compute with aOther null value");

Duration() does mEnd - mStart, so mEnd is the problem here...

Could we possibly check if mEnd is empty before the check?

It's kinda late here... I'll see your feedback tomorrow.
Also, sorry for all these replies, I should have used IRC.

Thanks
Hello, 

Seems like checking for aOb.End().IsNull(), does work.

Will test and upload this in a while.

Thanks
Attached patch Bug1030487v4.patch (obsolete) — Splinter Review
Attachment #8470149 - Attachment is obsolete: true
Attachment #8470437 - Flags: review?(vdjeric)
The patch looks good, but let's clear up this assert failure first before landing it.

Aaron, how can Observation::End() return null at this point in the code?
Flags: needinfo?(aklotz)
(In reply to Vladan Djeric (:vladan) from comment #26)
> The patch looks good, but let's clear up this assert failure first before
> landing it.
> 
> Aaron, how can Observation::End() return null at this point in the code?

We would know more definitively if provided a full call stack for the assertion.

I think the only place that this can happen is in NextStageObservation. I think that instead of checking for null mEnd, we should add a line after http://mxr.mozilla.org/mozilla-central/source/xpcom/build/IOInterposer.cpp#379 to set mEnd = mStart. This observation is supposed to be instantaneous, so I think that makes more sense anyway.
Flags: needinfo?(aklotz)
Rafael, can you update your patch with the change Aaron suggested and we'll get it landed?
Flags: needinfo?(rafael.kourdis)
Attached patch Bug1030487v5.patch (obsolete) — Splinter Review
I'm very sorry I didn't respond earlier. I was facing a problem with my computer and had to reinstall my OS.

Aaron, yes, now that I'm thinking about it I certainly should have provided a stack trace. I'm sorry about that, too :)

Thanks.
Attachment #8470437 - Attachment is obsolete: true
Attachment #8470437 - Flags: review?(vdjeric)
Attachment #8472853 - Flags: review?(vdjeric)
Flags: needinfo?(rafael.kourdis)
Comment on attachment 8472853 [details] [diff] [review]
Bug1030487v5.patch

Review of attachment 8472853 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/Telemetry.cpp
@@ +419,5 @@
>    // We only report main-thread I/O
>    if (!IsMainThread()) {
>      return;
>    }
> +  

remove the trailing whitespace

@@ +422,5 @@
>    }
> +  
> +  if (aOb.Duration() < kTelemetryReportThreshold) {
> +    return;
> +  }

put this if-block after the check for OpNextStage below.. otherwise we won't handle the startup/normal/shutdown stage transition
Attachment #8472853 - Flags: review?(vdjeric) → review+
I've fixed the issues as per your suggestions.

I know I might be wasting your time, but, if it's possible, could you please firmly explain why OpNextStage is an Operation type?
In what way is it similar to an I/O operation?

Thanks
Attachment #8472853 - Attachment is obsolete: true
Attachment #8473360 - Flags: review?(vdjeric)
(In reply to Rafael Kourdis from comment #31)
> I know I might be wasting your time, but, if it's possible, could you please
> firmly explain why OpNextStage is an Operation type?
> In what way is it similar to an I/O operation?
> 
> Thanks

I needed a way for IOInterposer and its observers to be notified when Firefox transitions between stages. Since we already have that code in place for I/O operations, I elected to create OpNextStage as a meta-operation to do that signalling.
Also, since Vladan already gave the previous version of your patch a review+, you don't need to request a review again. Just set the review flag to + on your patch and note in the comments that you have addressed his comments and are carrying forward the review that was granted in the previous version.
Comment on attachment 8473360 [details] [diff] [review]
Bug1030487v5-fixed.patch

Marking patch as r+ (carry forward from vladan)
Attachment #8473360 - Flags: review?(vdjeric) → review+
https://hg.mozilla.org/mozilla-central/rev/24f9ec66094d
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla34
Thanks a lot guys for the valuable help you provided, and for pushing the code to the tree.
Sorry for not replying earlier. My PC's PSU got fried (you can tell I'm a lucky man :) and I didn't think of using my phone.

So, thanks again for everything. This community is so strong only because of each and every one of you :)

Rafael.
You need to log in before you can comment on or make changes to this bug.