Closed Bug 1361102 Opened 7 years ago Closed 7 years ago

nsUpdateService.js does main thread IO writes for the update xml files during the first startup after an update, and whenever an update is downloaded

Categories

(Toolkit :: Application Update, defect, P2)

defect

Tracking

()

RESOLVED FIXED
mozilla57
Performance Impact high
Tracking Status
firefox57 --- fixed

People

(Reporter: florian, Assigned: robert.strong.bugs)

References

(Blocks 3 open bugs)

Details

(Whiteboard: [bhr])

Attachments

(4 files, 4 obsolete files)

FileUtils_closeSafeFileOutputStream is called 5 times by update related code on this profile: https://perfht.ml/2p1wH1d

The first one is before the first window is shown. The next 4 are after the window is shown, but cause UI jank, and they seem to be writing the same file over and over again.
This comes from _writeUpdatesToXMLFile(): <https://searchfox.org/mozilla-central/rev/ae8c2e2354db652950fe0ec16983360c21857f2a/toolkit/mozapps/update/nsUpdateService.js#2740>.  The reason we're flushing here is that the output file stream used here is a safe output file stream.  As far as I can tell, that goes back to bug 299302 where this code was introduced first: <https://searchfox.org/mozilla-central/rev/f59f02c76f3a0e997a84a5711b8531bbec91173c/toolkit/mozapps/update/src/nsUpdateService.js.in#1868>

Robert, is an fflush() really necessary here?  If no, we can simply make the output file stream here an atomic output filestream (added in bug 928321) which doesn't flush.  That should fix this bug...
Flags: needinfo?(robert.strong.bugs)
That is likely but I'll look at it again in a few days after I finish some uplifts I have to get done.
Flags: needinfo?(robert.strong.bugs)
Using a deferred task to write it once instead of 4 times would also be helpful.
If flushing is necessary we should consider doing so from a background thread to avoid pausing the UI thread.

I saw reports of this in the BHR reports of hangs of the main thread for over 8 seconds, for example:

https://people-mozilla.org/~mlayzell/bhr/20170429/763.html#1
https://people-mozilla.org/~mlayzell/bhr/20170429/753.html#0
https://people-mozilla.org/~mlayzell/bhr/20170429/967.html#0
(In reply to :Ehsan Akhgari (super long backlog, slow to respond) from comment #1)
> This comes from _writeUpdatesToXMLFile():
> <https://searchfox.org/mozilla-central/rev/
> ae8c2e2354db652950fe0ec16983360c21857f2a/toolkit/mozapps/update/
> nsUpdateService.js#2740>.  The reason we're flushing here is that the output
> file stream used here is a safe output file stream.  As far as I can tell,
> that goes back to bug 299302 where this code was introduced first:
> <https://searchfox.org/mozilla-central/rev/
> f59f02c76f3a0e997a84a5711b8531bbec91173c/toolkit/mozapps/update/src/
> nsUpdateService.js.in#1868>
Not related to bug 299302 which was an add-ons manager bug.
Component: Application Update → Startup and Profile System
Ehsan, what is the typical file output stream that is used for things like this? I much prefer using what is used by the majority of code for app update when ever possible.
Flags: needinfo?(ehsan)
I have a couple of other improvements to how app update writes to files I've been planning but they haven't made it to the top of my queue due to higher priority work. Fixing this bug will definitely be an improvement for approximately 2 startups every 6 weeks on the release channel and I'll fold the other improvements in with this.
(In reply to Robert Strong [:rstrong] (use needinfo to contact me) from comment #6)
> Ehsan, what is the typical file output stream that is used for things like
> this? I much prefer using what is used by the majority of code for app
> update when ever possible.

An atomic file output stream: <https://searchfox.org/mozilla-central/rev/ae8c2e2354db652950fe0ec16983360c21857f2a/toolkit/modules/FileUtils.jsm#112>.  This gives you a file output stream that you can write to and when you close it, we atomically replace it with the destination file.  On Windows this is typically done by writing the new file to a temporary file and then atomically rename the temporary file to the destination file to ensure that the written data doesn't get stored partially.  The safe output file stream that the current code uses does the exact same thing plus the additional fflush.

To make things even better and avoid the main-thread I/O itself also, you can create an input string stream and use nsIDOMSerializer to serialize the XML DOM into the string and then use NetUtil.asyncCopy() to copy the input string stream into the output stream.  The finish callback which would be responsible to close the output stream which will close the file handle will still do I/O on the main thread but that will be much better than the current situation...  :-)

I'm wondering if the OS.File API gives us better primitives to avoid all main-thread I/O while working with XPCOM streams?  David should be able to help us figure out if we can use that to our benefit.

(BTW thanks for the note about how frequently we hit this code on the release channel, knowing that is helpful in putting the severity of the bug into perspective.)
Flags: needinfo?(ehsan) → needinfo?(dteller)
OS.File doesn't work with XPCOM streams, unfortunately. The main primitive for atomic writes is `OS.File.writeAtomic`, which takes an entire buffer (String or ArrayBuffer) and writes it to disk using the usual pseudo-atomic properties.

As a side-note, recall that even a full flush does not guarantee that the file is correcty written to disk. For Session Restore, we had to come up with alternative solutions. I wrote a blog post about this a long time ago: https://dutherenverseauborddelatable.wordpress.com/2014/06/26/firefox-the-browser-that-has-your-backup/
Flags: needinfo?(dteller)
Note that if you have a string, OS.File.writeAtomic will generally be faster than NetUtil.asyncCopy.
Whiteboard: [qf] → [qf][bhr]
Component: Startup and Profile System → Application Update
[qf:p3] because of the frequency of the occurrence in comment 7.
Whiteboard: [qf][bhr] → [qf:p3][bhr]
(In reply to :Ehsan Akhgari (super long backlog, slow to respond) from comment #11)
> [qf:p3] because of the frequency of the occurrence in comment 7.

"2 startups every 6 weeks" in comment 7 is one way to phrase it, but we could also say that it affects 100% of the startups right after updating to 57. Is this the first time experience we want for 57?

It will also occur a lot on Nightly where we do much of our profiling.
Prioritizing this based on people profiling is probably not the right way to prioritize either since it isn't prioritizing off of what our release users experience.

If we had all the time in the world to fix all of the bugs then everything should be a P1 whether it affected 1 startup after updating or all startups. Even then an all startups bug should have a higher priority than this bug.

Having said that I do plan on addressing this bug and the other things I mentioned well before 57 so the priority isn't that important to me.
Here is another startup profile from the quantum reference hardware https://perfht.ml/2rgSN14 where we spend more than 3s on the main thread writing this file.
Priority: -- → P2
I keep seeing this in startup profiles, here is another one seen today: https://perfht.ml/2sQ0fQJ

This happens whenever we are downloading an update, and whenever we are installing an update. I've been told during the all hands that we have lots of users who use Firefox only once or twice a month. These users will experience this jank in each Firefox session.

Given this, I don't think qf:p3 is the right priority here.
Summary: The first startup after an update writes update related files to disk several times using main thread I/O → nsUpdateService.js does main thread IO during the first startup after an update, and whenever an update is downloaded
Whiteboard: [qf:p3][bhr] → [qf][bhr]
Note: we are trying to finish up bug 1343671 which will move the download outside of Firefox (also doesn't require Firefox to be running) to a scheduled task on Windows. This will benefit this bug as well as get those people that seldom run Firefox (easily less than 15% though I don't have an exact number) updated quicker. There are also plans to do something similar on Mac.
Whiteboard: [qf][bhr] → [qf:p1][bhr]
Robert, are you still planning on taking this bug for 57? Bug 1343671 got pushed to 58, correct? If you're low on time I can look into the improvements you all have talked about here.
Flags: needinfo?(robert.strong.bugs)
Doug, I planned on this bug for 57 and Bug 1343671 got pushed to 58. I should be done with bug 1385865 soon and this bug would be next in line since bug 1385865 changes the code that this bug touches quite alot. If you could take this after I land bug 1385865 that would be helpful.
Flags: needinfo?(robert.strong.bugs)
(In reply to Robert Strong [:rstrong] (use needinfo to contact me) from comment #19)
> Doug, I planned on this bug for 57 and Bug 1343671 got pushed to 58. I
> should be done with bug 1385865 soon and this bug would be next in line
> since bug 1385865 changes the code that this bug touches quite alot. If you
> could take this after I land bug 1385865 that would be helpful.
Copied the wrong bug number. I meant that "bug 1367910 changes the code that this bug touches quite alot", etc.
Works for me!
Assignee: nobody → dothayer
Status: NEW → ASSIGNED
Depends on: 1367910
Robert, I'm confused by the status of this bug. It seems in comment 19 and 20 you want this to wait until bug 1367910 lands, but that bug doesn't have a patch that looks close to landing (no pending review request, and a patch last touched 3 months ago). What's the next step here to get this unblocked?
Flags: needinfo?(robert.strong.bugs)
Florian, I have a patch almost ready for review but other work has repeatedly taken priority. I'll ask Doug to go ahead with this and I'll adapt the patch for bug 1367910 after this lands. To expedite this I am going to ask Doug to request review from you for this bug.

Doug, I'll adapt the patch for bug 1367910 to what lands for this bug. Please request review from Florian. Thanks!
Flags: needinfo?(robert.strong.bugs) → needinfo?(dothayer)
(In reply to Florian Quèze [:florian] [:flo] from comment #3)
> Using a deferred task to write it once instead of 4 times would also be
> helpful.

Or maybe http://searchfox.org/mozilla-central/source/toolkit/mozapps/extensions/DeferredSave.jsm already does most of what we need.
Robert, you mentioned in IRC that you were taking this. Should I unassign myself?
Flags: needinfo?(dothayer) → needinfo?(robert.strong.bugs)
Done
Assignee: dothayer → robert.strong.bugs
Flags: needinfo?(robert.strong.bugs)
Attached patch patch in progress - client code (obsolete) — Splinter Review
I suspect that I will need to change when UpdateService shutsdown so not quite ready for review. The modified tests are in good shape and I'll attach that patch after I finish the aforementioned.
Attachment #8904446 - Flags: feedback?(dothayer)
Attached patch patch 1 - client code (obsolete) — Splinter Review
Attachment #8904446 - Attachment is obsolete: true
Attachment #8904446 - Flags: feedback?(dothayer)
Attachment #8904712 - Flags: review?(dothayer)
A couple of the tests that tested multiple things had to be split out to make the end state of the xml files deterministic so they wouldn't periodically fail when run sequentially.
Attachment #8904740 - Flags: review?(dothayer)
Attachment #8904741 - Flags: review?(dothayer)
Test run is looking good and none of the tests failed the sequential run and had to be retried so far
Comment on attachment 8904712 [details] [diff] [review]
patch 1 - client code

Review of attachment 8904712 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/mozapps/update/nsUpdateService.js
@@ +2530,5 @@
> +  // If there is an active-update.xml.tmp file try to use that for the active
> +  // update. If it contains a valid update remove the active-update.xml file
> +  // and rename the active-update.xml.tmp to active-update.xml.
> +  let activeUpdates = [];
> +  let tmpFile = getUpdateFile([FILE_ACTIVE_UPDATE_XML + ".tmp"]);

From a quick glance, this part of the patch looks like it's doing a lot of main thread IO. How is it related to the rest of the patch, and is it all required to be done synchronously?

@@ +2538,5 @@
> +    if (activeUpdates.length > 0) {
> +      let file = getUpdateFile([FILE_ACTIVE_UPDATE_XML]);
> +      try {
> +        if (file.exists()) {
> +          file.remove(false);

It's faster to call .remove() and let it throw an exception when the file doesn't exist, than to call .exists() before.
Comment on attachment 8904712 [details] [diff] [review]
patch 1 - client code

Review of attachment 8904712 [details] [diff] [review]:
-----------------------------------------------------------------

On the whole it looks good, but echoing Florian's comment, it would be nice if we could avoid increasing the sync IO in UpdateManager(). It should generally just be an extra file.exists() call, which doesn't seem too bad to me, but I'd like to hear your take on whether it's necessary for this to be synchronous.

::: toolkit/mozapps/update/nsUpdateService.js
@@ +2531,5 @@
> +  // update. If it contains a valid update remove the active-update.xml file
> +  // and rename the active-update.xml.tmp to active-update.xml.
> +  let activeUpdates = [];
> +  let tmpFile = getUpdateFile([FILE_ACTIVE_UPDATE_XML + ".tmp"]);
> +  if (tmpFile.exists()) {

_loadXMLFileIntoArray already does this check, so I think this is redundant.

@@ +2538,5 @@
> +    if (activeUpdates.length > 0) {
> +      let file = getUpdateFile([FILE_ACTIVE_UPDATE_XML]);
> +      try {
> +        if (file.exists()) {
> +          file.remove(false);

Do we need this? moveTo should replace the file if it already exists (http://searchfox.org/mozilla-central/rev/4d8e389498a08668cce9ebf6232cc96be178c3e4/xpcom/io/nsIFile.idl#155)

@@ +2541,5 @@
> +        if (file.exists()) {
> +          file.remove(false);
> +        }
> +        tmpFile.moveTo(tmpFile.parent, FILE_ACTIVE_UPDATE_XML);
> +        LOG("UpdateManager: found active-update.xml.tmp file");

Nit: Should this say something different than the LOG a few lines above?
Attachment #8904712 - Flags: review?(dothayer) → review-
Comment on attachment 8904741 [details] [diff] [review]
patch 3 - test code

Review of attachment 8904741 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good!
Attachment #8904741 - Flags: review?(dothayer) → review+
Attachment #8904740 - Flags: review?(dothayer) → review+
Florian, is there any data on the percentage of clients that fail completing the write operation when using writeAtomic and tmpPath that don't complete the rename of the tmp file? If that isn't available is there any data on the percentage of clients that fail completing the write operation when using writeAtomic in general?
Flags: needinfo?(florian)
(In reply to Robert Strong [:rstrong] (use needinfo to contact me) from comment #36)
> Florian, is there any data on the percentage of clients that fail completing
> the write operation when using writeAtomic and tmpPath that don't complete
> the rename of the tmp file? If that isn't available is there any data on the
> percentage of clients that fail completing the write operation when using
> writeAtomic in general?

These are questions for Yoric.
Flags: needinfo?(florian) → needinfo?(dteller)
I am changing the summary to limit the scope of this bug and make it closer to the original bug report. I filed bug 1397450 to cover the additional items that have been added since it is unlikely that all main thread IO will be eliminated in nsUpdateService.js during the 57 cycle.
See Also: → 1397450
Summary: nsUpdateService.js does main thread IO during the first startup after an update, and whenever an update is downloaded → nsUpdateService.js does main thread IO writes for the update xml files during the first startup after an update, and whenever an update is downloaded
(In reply to Robert Strong [:rstrong] (use needinfo to contact me) from comment #36)
> Florian, is there any data on the percentage of clients that fail completing
> the write operation when using writeAtomic and tmpPath that don't complete
> the rename of the tmp file? If that isn't available is there any data on the
> percentage of clients that fail completing the write operation when using
> writeAtomic in general?

No, we do not have such data.
Flags: needinfo?(dteller)
Let me add one piece of data: most of the time, if there is a ".tmp", it's corrupted data.

The only case in which a ".tmp" can contain all the data, uncorrupted, is if:

1. the process *and the system* have written all the data to disk; AND
2. the process or the system crash before actually renaming the ".tmp" to its final name.

Most OSes finish treating the renaming *before* finishing writing all the data to disk, which means that this situation pretty much never happens.

In other words, the ".tmp" is typically useless.
Thanks for the info David!

Doug, I think there is already enough telemetry in pingStateAndStatusCodes, specifically PATCH_UNKNOWN, to handle the case where the active update is missing for whatever reason.
Attachment #8904712 - Attachment is obsolete: true
Attachment #8905790 - Flags: review?(dothayer)
Attached patch patch 4 - test code followup (obsolete) — Splinter Review
This mainly removes redundant checks and performs file checks for a couple of other tests.
Attachment #8905792 - Flags: review?(dothayer)
Comment on attachment 8905790 [details] [diff] [review]
patch 1 - client code rev 2

Review of attachment 8905790 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good!
Attachment #8905790 - Flags: review?(dothayer) → review+
Attachment #8905806 - Flags: review?(dothayer) → review+
I combined patch 3 and 4.
Attachment #8905806 - Attachment is obsolete: true
Attachment #8906038 - Flags: review+
Pushed by rstrong@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/a18f510ef12e
update the application update xml files asynchronously for writes and deletions. r=dothayer
https://hg.mozilla.org/integration/mozilla-inbound/rev/814b441b2bad
split out tests so they are more deterministic when updating the application update xml files asynchronously. r=dothayer
https://hg.mozilla.org/integration/mozilla-inbound/rev/b04b2abfc170
change tests to account for updating the application update xml files asynchronously. r=dothayer
Performance Impact: --- → P1
Whiteboard: [qf:p1][bhr] → [bhr]
You need to log in before you can comment on or make changes to this bug.