Using MOZ HTTP log adds a lot of overhead

RESOLVED FIXED in Firefox 65

Status

()

defect
P2
normal
RESOLVED FIXED
Last year
6 months ago

People

(Reporter: peter, Assigned: kershaw)

Tracking

61 Branch
mozilla65
Points:
---

Firefox Tracking Flags

(firefox65 fixed)

Details

(Whiteboard: [necko-triaged][perf-tools])

Attachments

(3 attachments)

User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36

Steps to reproduce:

When we turn on the https://developer.mozilla.org/en-US/docs/Mozilla/Debugging/HTTP_logging HTTP log on Linux it affect Firefox and makes metrics that we measure a lot higher. I got two examples:

1. We run Browsertime on AWS c4.large where we test a couple of URLs of Wikipedia. We are using latest nightly of Firefox and tried turning on/off the HTTP log. I've tried it multiple times and it's always the same pattern (checkout the attached screenshot).
Turning off the MOZ for the https://en.wikipedia.org/wiki/Barack_Obama page makes our first visual render become 1.5s faster. The same on other URLs with a lot of requests. Pages with lower amount of requests is affected, but not as much.

We run with the following FF preferences: https://github.com/sitespeedio/browsertime/blob/master/lib/firefox/webdriver/firefoxPreferences.js

And moz log configured with: timestamp,nsHttp:5,cache2:5,nsSocketTransport:5,nsHostResolver:5

2. We also use WebPageTest to measure the performance of our pages. The MOZ log is turned on there too and when we compare Firefox performance with Chrome we can see that Firefox is much slower but I think one of the main reasons for that is that the MOZ log is turned on and adds overhead. 

You can checkout one example where we compare the runs (we run on the exact same setup except that one is using latest stable Chrome and the other stable Firefox):
http://wpt.wmftest.org/video/compare.php?tests=180409_1M_4X,180409_SV_4V

First render happens at 1.1 s for Chrome and 4.0 for Firefox. I cannot turn of MOZ log though in that tool to verify that it is really the log that adds the overhead.





Expected results:

I emailed with Harald Kirschner who said I should file this bug and make sure that past (perf team eng manager) is cc:d (how do I do that).
It's expected to have an overhead.  Also depends on if you use the "sync" module which flushes log writes after every line written to output file(s).  Logging's purpose is to provide data for (known) bug analyzes and is not intended to run while doing any performance testing.

If there is anything we can improve, let's do it.  But there will IMO always be at least some overhead.
cc'ing (and ni?ing) :past as mentioned in the original report.

Not sure there would be a whole lot for necko to do here - as Honza mentioned, enabling logging adds overhead, period. There may be an item here or there we could stop logging, or log slightly smarter, but most of our logging is pretty useful (I've honestly often wanted even more things to be logged).

I have a sneaking suspicion this will end up either WONTFIX or in xpcom to look at the logging code in general.
Flags: needinfo?(past)
One option is to fine tune the use of log levels to move extremely verbose portions to LogLevel::Verbose, etc.

For example all nsSocketTransport, nsHostResolver, & cache2 logging are done at the Debug (4) level.

The default for nsHttp is also Debug with the exception of Http2Compression.cpp, Http2Session.cpp, Http2Stream.cpp, nsHttpConnection.cpp, nsHttpConnectionInfo.cpp, & nsHttpConnectionMgr.cpp defaulting to Verbose. Info is also used a fair amount [1].

Obviously that doesn't help in the reporter's case of "timestamp,nsHttp:5,cache2:5,nsSocketTransport:5,nsHostResolver:5" as those are all verbose.

There are other options we can look into where the log data is emitted in a higher perf format that pushes off post processing to outside of Fx (similar to raw mach logs), but that hasn't been fully explored yet.

Peter, what's your use case for enabling logging during perf testing?

[1] https://searchfox.org/mozilla-central/search?q=symbol:M_9af0390facd94579&redirect=false
Flags: needinfo?(peter)
WebPageTest.org has the HTTP log enabled and use it to create a HAR file for Firefox.
Flags: needinfo?(peter)
You can check it out here how it's used here: https://github.com/WPO-Foundation/wptagent/blob/master/internal/support/firefox_log_parser.py

One problem IMHO is that since FF 54 the HTTP log has been the only way to automate to get HAR file for Firefox (since the HAR Export trigger was broken with 55). We will use https://github.com/devtools-html/har-export-trigger when it is ready (FF 61?) for our internal testing but we will also continue using WebPageTest.
Thanks for all the right questions and answers.

I think we could do an optimization using the level, as mentioned by Eric.  All top-level state stuff could go to just level 1 or 2, so that we'd lower the overhead while still providing what's needed.  There is also some code that is log-enabled dependent as it does complex loops over pending or active connection and transaction lists.  That could produce some overhead too and when enabled only for level 4+ we may have a win.

Since I always wanted to have better layering of our net logging, this is a good opportunity to do it now!
Assignee: nobody → honzab.moz
Status: UNCONFIRMED → NEW
Ever confirmed: true
Priority: -- → P2
Thanks for the ping, it seems that we've got a plan figured out now.
Flags: needinfo?(past)
Whiteboard: [necko-triaged]
Jason, when Kershaw is bask, can you please assign to him as we agreed?  Thanks.

Note that one of possible proposals was to have a separate module for logs that are interesting for webpage test.

Comment 5 references the py code that does the log file parsing.  It's the ultimate source of what we have to expose in this separate module or move to top-level (1) in the nsHttp module.
Assignee: honzab.moz → nobody
Since this has been idle for a while I ni? Jason, assuming Honza meant to ping him.
Flags: needinfo?(jduell.mcbugs)
Whiteboard: [necko-triaged] → [necko-triaged][perf-tools]
(In reply to Honza Bambas (:mayhemer) from comment #8)
> Jason, when Kershaw is bask, can you please assign to him as we agreed? 
> Thanks.
> 
> Note that one of possible proposals was to have a separate module for logs
> that are interesting for webpage test.
> 
> Comment 5 references the py code that does the log file parsing.  It's the
> ultimate source of what we have to expose in this separate module or move to
> top-level (1) in the nsHttp module.

Assign to myself.
Assignee: nobody → kershaw
Flags: needinfo?(jduell.mcbugs)
(In reply to Honza Bambas (:mayhemer) from comment #6)
> Thanks for all the right questions and answers.
> 
> I think we could do an optimization using the level, as mentioned by Eric. 
> All top-level state stuff could go to just level 1 or 2, so that we'd lower
> the overhead while still providing what's needed.  There is also some code
> that is log-enabled dependent as it does complex loops over pending or
> active connection and transaction lists.  That could produce some overhead
> too and when enabled only for level 4+ we may have a win.
> 
> Since I always wanted to have better layering of our net logging, this is a
> good opportunity to do it now!

Honza, could you please explain more about what's top-level state?
The log level 1 and 2 [1] actually refer to errors and warnings. I think really make error and warning logs go to level 1 and 2 and top-level logs go to level 3. What do you think?


[1] https://searchfox.org/mozilla-central/rev/c9272ef398954288525e37196eada1e5a93d93bf/netwerk/protocol/http/HttpLog.h#45-46
Flags: needinfo?(honzab.moz)
See https://github.com/WPO-Foundation/wptagent/blob/master/internal/support/firefox_log_parser.py for logs WPO is using.  Move those to level 1.
Flags: needinfo?(honzab.moz)
WPT parses HTTP log and create a HAR file. To reduce the overhead of logging, this patch moves some logs that are used by WPT parser to level 1.
Hi Peter,

I've created a build which produces logs that firefox_log_parser.py needs with lower log level. Could you download it from [1] and give it a test? From comment #0, I assume you run firefox on mac. If not, please let me know.

Please configure MOZ_LOG to: timestamp,nsHttp:1,nsSocketTransport:1,nsHostResolver:1


Thanks.

[1] https://queue.taskcluster.net/v1/task/eVaNCxeUQNOsax5zFZdyGQ/runs/0/artifacts/public/build/target.dmg
Flags: needinfo?(peter)
Sorry, I have very limited time until Christmas. To be frank for me it doesn't matter if the log adds overhead for Firefox, as long the metrics I get out of Firefox is stable. But for you if you are using WebPageTest to compare metrics between Firefox and Chrome, I think it's really valuable so that when you compare, you comparison is fare.
(In reply to Peter Hedenskog from comment #15)
> Sorry, I have very limited time until Christmas. To be frank for me it
> doesn't matter if the log adds overhead for Firefox, as long the metrics I
> get out of Firefox is stable. But for you if you are using WebPageTest to
> compare metrics between Firefox and Chrome, I think it's really valuable so
> that when you compare, you comparison is fare.

Thanks. I think I would still land this change.
If you want, you can use "timestamp,nsHttp:1,nsSocketTransport:1,nsHostResolver:1" to make the log in the future.
Flags: needinfo?(peter)
Pushed by kjang@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/e7720f22eab1
Move some logs to level 1 r=mayhemer
https://hg.mozilla.org/mozilla-central/rev/e7720f22eab1
Status: NEW → RESOLVED
Closed: 8 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla65
Cross-referencing here, for those interested in the WebPageTest status, here; I've put up a PR over in https://github.com/WPO-Foundation/wptagent/pull/219, which will/should immediately help us in Firefox Nightly-runs' overheads.

Also on the WebPageTest side, Patrick Meenan landed a fix to the base Ubuntu image for Firefox on Linux, in EC2 (https://github.com/WPO-Foundation/webpagetest/commit/7a4d01158c58a01a4d613beb036d4a856abce16c), and now we have (starting tomorrow), accurate, up-to-date Firefox Nightly timings.

Thanks for filing this, Peter, and thanks for the whole team, and for fixing it, Kershaw!
I just pushed the change selectively for Nightly builds and it should roll out over the next hour. I haven't run extensive testing to be sure, but in the one quick test I ran, it looks like I lost the connection setup information for OCSP checks. Not a huge deal but wondering if that goes down a section of code where the log levels weren't updated.

A few other thoughts:

- I don't necessarily need full sync semantics, I just need the buffers written within a few seconds of the activity. Without sync it looks like there is no time-based flushing so a chunk of data is kept in RAM until the browser is actually closed.  If it can flush but not necessarily fsync() that would meet my needs perfectly.

- On Linux I may be able to use eatmydata to disable the fsync's at the OS level but still get the buffers flushing quickly: http://manpages.ubuntu.com/manpages/trusty/man1/eatmydata.1.html
Thanks Patrick.  Kershaw, do you think you could help with this?  I have no clear idea how to provide flushing (fflush) on a time based manner.  We could add it to the 'test' function [1] and simply keep a timestamp (std::time) of the last write and fflush the file when std::time() - epoch > something.  Not sure how std::time can be expensive.


[1] https://searchfox.org/mozilla-central/rev/55895c49f55073d82d977cb74ec1d3a71ae4b25f/xpcom/base/Logging.h#193
Flags: needinfo?(kershaw)
Another optimization I can do on my side is to only use sync when a test has more than one step. Otherwise I can close the browser at the end of the test before reading out the log.  I'll implement that across all platforms and eatmydata on Linux (for the multi-step case) and see where that gets us.
Looks like at least some events are missing from the logs. I'm not seeing the granular transfer timings in the WPT waterfalls.

Nightly (long, solid download bars): https://www.webpagetest.org/result/181121_QE_5ff4dd3c0ad4504dafaa36711c194803/1/details/#waterfall_view_step1

Stable (individual chunk timings): https://www.webpagetest.org/result/181121_WB_5d8c71b8e5ff89e051be10f7318d004b/2/details/#waterfall_view_step1

I'll see about getting the other optimizations in place but it looks like I'll probably need to switch back to log level 5.
(In reply to Honza Bambas (:mayhemer) from comment #21)
> Thanks Patrick.  Kershaw, do you think you could help with this?  I have no
> clear idea how to provide flushing (fflush) on a time based manner.  We
> could add it to the 'test' function [1] and simply keep a timestamp
> (std::time) of the last write and fflush the file when std::time() - epoch >
> something.  Not sure how std::time can be expensive.
> 
> 
> [1]
> https://searchfox.org/mozilla-central/rev/
> 55895c49f55073d82d977cb74ec1d3a71ae4b25f/xpcom/base/Logging.h#193

I've filed bug 1509090 for time-based flushing.
Flags: needinfo?(kershaw)
(In reply to Patrick Meenan from comment #23)
> Looks like at least some events are missing from the logs. I'm not seeing
> the granular transfer timings in the WPT waterfalls.
> 

Could you show me what events you are looking for in your parser code?
Thanks.
Flags: needinfo?(patmeenan)
There are quite a lot of them: https://github.com/WPO-Foundation/wptagent/blob/master/internal/support/firefox_log_parser.py#L136

FWIW, I just removed sync from the logging, added eatmydata and switched log level back to 5: https://github.com/WPO-Foundation/wptagent/commit/61c647ea88c0d97268a0c6a41900e3d051f7f238

I can switch the log level back again as needed but if I'm consuming most of the debug events then it won't make much of a difference. Hopefully the move away from sync logging gets most of the benefits.
Flags: needinfo?(patmeenan)
(In reply to Kershaw Chang [:kershaw] from comment #25)
> (In reply to Patrick Meenan from comment #23)
> > Looks like at least some events are missing from the logs. I'm not seeing
> > the granular transfer timings in the WPT waterfalls.
> > 
> 
> Could you show me what events you are looking for in your parser code?
> Thanks.

Note that during the review process I went through all the lines you expect in the parser code.  All of them have been updated in our code base.

(In reply to Stephen Donner [:stephend] from comment #28)

Created attachment 9035800 [details]
Screenshot comparing two WebPageTest runs with log levels 3 vs./and 5

Looking at those two runs: 1) log level 3 (Nightly) and (2) log level 5 (release), respectively), we can see the following missed requests, in the Firefox Nightly run with log level 3:

  1. oscp.digicert.com - /

  2. oscp.digicert.com - /

  3. aus5.mozilla.org - update.xml

  4. oscp.digicert.com - /

  5. oscp.digicert.com - /

  6. oscp.digicert.com - /

  7. oscp.digicert.com - /

  8. oscp.digicert.com - /

  9. aus5.mozilla.org - /

  10. push.services.mozilla.com

(the latter two appear to not be properly disabled via https://github.com/WPO-Foundation/wptagent/blob/30310759c4120bf5054d4f7541671916e8e0d7db/internal/support/Firefox/profile/prefs.js?)

Patrick, any corrections/comments?

Flags: needinfo?(patmeenan)

(In reply to Stephen Donner [:stephend] from comment #29)

Patrick, any corrections/comments?

From the screenshot it looks like it is also missing the connection events (DNS, socket connect, TLS negotiation). I wouldn't necessarily be surprised if it is also missing the HTTP/2 stream information.

It's possible that the events are missing because the log strings changed slightly by going to log level 3 and it's failing to parse the events generally and just falling back to the dev tools events. It's also possible the events just aren't there.

Flags: needinfo?(patmeenan)
You need to log in before you can comment on or make changes to this bug.