Last Comment Bug 717225 - Figure out why folderTree.json is getting corrupted, and improve stability during writing by transfer to nsISafeOutputStream
: Figure out why folderTree.json is getting corrupted, and improve stability du...
Status: RESOLVED FIXED
: dataloss
Product: Thunderbird
Classification: Client Software
Component: General (show other bugs)
: Trunk
: All All
: -- normal with 1 vote (vote)
: Thunderbird 13.0
Assigned To: :Irving Reid (No longer working on Firefox)
:
Mentors:
Depends on: 707329
Blocks: 729212
  Show dependency treegraph
 
Reported: 2012-01-11 06:09 PST by Mark Banner (:standard8)
Modified: 2012-04-04 16:03 PDT (History)
8 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
fixed


Attachments
Process Monitor log for write of folderTree.json upon termination of Tb 9.01 on Win (2.75 KB, text/plain)
2012-01-12 01:21 PST, WADA
no flags Details
Use the safe file output stream to save foldertree.json (1.56 KB, patch)
2012-01-24 20:09 PST, :Irving Reid (No longer working on Firefox)
no flags Details | Diff | Review
Use safe file output stream for folderTree.json, directoryTree.json, session.json (3.53 KB, patch)
2012-01-30 07:08 PST, :Irving Reid (No longer working on Firefox)
standard8: review+
standard8: approval‑comm‑aurora+
Details | Diff | Review

Description Mark Banner (:standard8) 2012-01-11 06:09:54 PST
As we saw in bug 707329, for some reason folderTree.json is getting corrupted, probably on system crashes.

In theory we should be writing out nice data to the file, so this shouldn't happen. However, we should ensure that we have the right methods for reducing the possibility of corruption, and possibly consider backups.
Comment 1 WADA 2012-01-12 01:21:25 PST
Created attachment 587969 [details]
Process Monitor log for write of folderTree.json upon termination of Tb 9.01 on Win

(In reply to Mark Banner (:standard8) from comment #0)
> for some reason folderTree.json is getting corrupted, probably on system crashes.

If system crash or power failure is relevant, it's very simple.
File write of folderTree.json is following, so, null file, partial file, no update, can occur.
             <= If external app opens file, update is lost due to open error
   Open Write with Replace(write from offset=0)
             <= If system crash/power failure occurs, partial file is created 
   Write first data block
           | <= If system crash/power failure occurs, partial file is created 
   Write last  data block
   Close
If power failure of HDD occurs while OS is writing sectors to HDD physically, result is unpredictable.

Above is absolutely same as Bug 650831 for virtualFolders.dat, and above is reason why SafeFileWriting was implemented.
However, SafeFileWriting can do nothing for "system crash/power failure while OS is processing file rename request from Tb", even though SafeFileWriting actually reduced probability of file loss very very much.
Comment 2 Mark Banner (:standard8) 2012-01-12 01:53:34 PST
(In reply to WADA from comment #1)
> If system crash or power failure is relevant, it's very simple.

If we held the file open all the time Thunderbird is open, I'd agree with you. However, we don't.

> File write of folderTree.json is following, so, null file, partial file, no
> update, can occur.
>              <= If external app opens file, update is lost due to open error

That's a possibility, but we've got no evidence to back that up, or even why an external app would want that file.

> If power failure of HDD occurs while OS is writing sectors to HDD
> physically, result is unpredictable.

That's true, but I'd argue that someone would have to be shutting down TB within a few seconds prior to a system crash, which I find unlikely to be consistently happening enough to get us the number of reports we've had.

> However, SafeFileWriting can do nothing for "system crash/power failure
> while OS is processing file rename request from Tb", even though
> SafeFileWriting actually reduced probability of file loss very very much.

I think switching to a safe file output would be something to do, especially as this is something we're writing on shutdown. I'm just not totally convinced that it'll clear up the instances of this happening that much.
Comment 3 WADA 2012-01-12 21:19:37 PST
I can't imagine difference in file writing during termination among bookmarks.html, prefs.js, virtualFolders.dat(bug 650831), fileTree.json(this bug), session.json(bug 584005). File writing of all of them is writing of new version to a file.
Problem occurred on bookmarks.html/prefs.js before SafeFileWriting can occur on any of virtualFolders.dat, fileTree.json, session.json too, and even after SafeFileWriting is used for such files, loss/corruption currently occurs on prefs.js will occur sooner or later, even though frequency/probability of loss/corruption is perhaps different.

And, when partial file of prefs.js, user usually is not aware of partial data if account definition part is not lost. When partial file of virtualFolders.dat, user usually is not aware of partial data, because user usually knows about loss of virtual folder far after restart of Tb. User is aware of it only when all or many virtual folders are lost upon restart.
In contrast to it, when partial fileTree.json/session.json, even when last several bytes loss, it causes blank folder pane or blank thread/message pane because json error occurs.
i.e.
  Partial data == merely loss of some lines when prefs.js/virtualFolders.dat
  Partial data == absolutely same as loss of file when fileTree.json/session.json 
This is perhaps a reason why bugs are opened frequently for fileTree.json/session.json.

IIRC, major trigger of SafeFileWriting was very frequent loss/corruption of bookamarks.html(bookmarks.html is far larger than prefs.js, so possibility of corruption is higher than prefs.js).
Please look into history of prefs.js loss or corruption.
  Bug 477934 for still remaining prefs.js corruption even after SafeFileWriting.
  Dependency tree for meta bug 193638(Blocks: of bug 477934) with "Show Resolved.
  Dup'ed bugs to bug 193638(many bugs were duped even though meta bug.)
Bug reports at B.M.O for Tb's prefs.js loss/corruption, which frequently produced loss of account definitions, was drastically reduced by current/simple SafeFileWriting (write to prefs-1.js, rename prefs-1.js to prefs.js with deleting old prefs.js).
However, although not so many, loss of prefs.js is still reported.

I can't think Tb's/Fx's termination logic is too poor.
I can't imagine other major causes of prefs.js loss than (a) system crash(including crash/freeze of Tb or similar ones of Tb)/power failure, (b) issue like double invocation of termination process.
  For (b). See bug 422447, please.
  Problem like bug 212316, bug 480466 were foud in that bug.
  If this kind of problem happens, second invocation kills currently runnning
  termination process, and if killed task is file I/O, file is lost or corrupted.
  Such problem may still remain, although many cases are already resolved.
Other minor cases are;
  (c) Network file (power of Ethernet Hab can drop any time.)
                   (user can pull of LAN cable any time.    )
  (d) Lack of sufficient flush request.
 
(In reply to Mark Banner (:standard8) from comment #2)
> > If power failure of HDD occurs while OS is writing sectors to HDD
> > physically, result is unpredictable.
> That's true, but I'd argue that someone would have to be shutting down TB
> within a few seconds prior to a system crash, which I find unlikely to be
> consistently happening enough to get us the number of reports we've had.

In prefs.js loss/corruption cases, some loss/corruption of prefs.js without system crash/power failure were also reported. bug 422447 which I already pointed is an example.

> > SafeFileWriting actually reduced probability of file loss very very much.
> I think switching to a safe file output would be something to do,
> especially as this is something we're writing on shutdown. 
> I'm just not totally convinced that it'll clear up the instances of this happening that much.

If SafeFileWriting, xxxx.ext is not deleted until writing to xxxx-1.ext normally completes. So, any error while writing to xxxx-1.ext won't produce loss/corruption of xxxx.ext, if Tb correctly processes error return code of write & close request.
And, partial xxxx.ext file won't be produced as far as Tb processes return code correctly. So, I believe it'll reduce problem on fileTree.json/session.json very much, as did on prefs.js.

> > File write of folderTree.json is following, so, null file, partial file, no
> > update, can occur.
> >              <= If external app opens file, update is lost due to open error
> That's a possibility, but we've got no evidence to back that up, or even why
> an external app would want that file.

I think this case is not applicable to fileTree.json, because old/good version remains and is usable if fileTree.json was opened by external apps.
However, above itself is never merely a possibility.
It's based on evidence that problem of bug 494706 actually occurred in user's environment, and it was not single user only. Problem of bug 494706 actually occurred in environment of at least one auto-backup software user.
And, you can see following bugs in dependency tree for meta bug 193638 8with Show Resolved).
> bug 496825 : Norton Antivirus software triggering NS_ERROR_FILE_ACCESS_DENIED exceptions when using
> bug 431065 : Repeated use of safe file output streams conflicts with file monitoring tools
> bug 438316 : Safe File Outputstreams conflict with file monitoring tools  
Please note that auto-backup software, virus scanner of file of antivirus software etc. can try to open any file which Tb uses at any time. 

> > If system crash or power failure is relevant, it's very simple.
> If we held the file open all the time Thunderbird is open, I'd agree with
> you. However, we don't.

System crash, power failure can occur at any time.
While tb is writing multiple blocks of file data(multiple buffered data), Tb opens file, isn't it? Are you saying that time Tb is opening file for file writing is ZERO?
For problem like double invocation of termination process.
Because double invokation of termination process occurs during Tb is doing termination work, it can easily interfere file writing for termination, and required time for file writing is not ZERO.
Comment 4 Mark Banner (:standard8) 2012-01-13 03:08:54 PST
Wada, we don't need to debate this in detail. I'm just saying the reasons for this happening aren't obvious, nor why it appears to have suddenly got worse why its been around for at least half a year.

I will state that some of the other files you mentioned e.g. prefs.js do not have the read only on startup, write only on shutdown model the folderTree.json has. They get written to much more frequently and therefore the risk is much higher.

I'm happy for safe output streams to be implemented, I'm not debating that.
Comment 5 WADA 2012-01-15 16:53:16 PST
(In reply to Mark Banner (:standard8) from comment #4)
> why it appears to have suddenly got worse why its been around for at least half a year.

It's perhaps simple too.
  When folderTree.json is corrupted and "blank folder pane" is shown,
    Tb 3 era : folder pane was re-initialized by restart of Tb.
               So, users didn't shout "Tb showed blank folder pane!!!".
    Tb 9     : folder pane is never re-initialized,
               unless corrupted folderTree.json is manually deleted. 
               So, many users needed forum help, Q/A etc.
               Problem was changed to critical one.

(A) Following is behaviour of Tb3.02apre final build.
> Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1b3pre) Gecko/20090108 Shredder/3.0b2pre

(1) Corrupt fileTree.json : Remove last "}", or add "  x" after last "}"
(2) Restart Tb => blank folder pane, and next exception
> Error: uncaught exception: [Exception... "Component returned failure code:
> 0x80004005 (NS_ERROR_FAILURE) [nsIJSON.decode]"  nsresult: "0x80004005 (NS_ERROR_FAILURE)"
> location: "JS frame :: chrome://messenger/content/folderPane.js :: ftv_load :: line 109"  data: no]
(2) With keep Error Console open, close Tb window => next exception
> Error: uncaught exception: [Exception... "Component returned failure code:
> 0x80004005 (NS_ERROR_FAILURE) [nsIControllers.removeController]"  nsresult: "0x80004005 (NS_ERROR_FAILURE)"
> location: "JS frame :: chrome://messenger/content/mail3PaneWindowCommands.js :: UnloadCommandUpdateHandlers :: line 792"  data: no]
(3) Close Error Console, and terminat Tb.
(4) content of fileTree.json is initialized.
    {}
(5) Restart Tb => folder pane is re-initialized.

(B) Behavior of Tb 9.0.1 when folderTree.json is manually corrupted.
(1) Following exceptionn occurs, and blank folder pane is shown.
(1-1) when last "}" is removed from folderTree.json
> Error: JSON.parse: end of data while reading object contents
> Source File: chrome://messenger/content/folderPane.js Line: 196
(1-2) when last "}" of folderTree.json is changed to "} x"
> Error: JSON.parse: unexpected non-whitespace character after JSON data
> Source File: chrome://messenger/content/folderPane.js Line: 196
(2) Terminate Tb => folderTree.json is kept as-is.
(3) Restart Tb => same exception occurs again.

(C) After fix of bug 707329(Tb12.0a1 2012/01/14 build)
(1) when folderTree.json is corrupted(" x" is added. changedd to { ... } x),
    following is shown in Error Console by Try/Catch for "JSON.parse".
> Timestamp: 2012/01/16 08:57:01
> Error: Failed to read file: folderTree.json reason: SyntaxError: JSON.parse:
> unexpected non-whitespace character after JSON data
> Source File: chrome://messenger/content/folderPane.js Line: 200
(2) Null json data of "{}" is used, then folder pane is re-initialized.
(3) Terminate Tb => {} is written in folderTree.json.
    (if folder is selected at folder pane before termination,)
    (folder path is added to ... part in { ... } as expected.)
Comment 6 WADA 2012-01-15 17:04:05 PST
> I will state that some of the other files you mentioned e.g. prefs.js do not
> have the read only on startup, write only on shutdown model the folderTree.json has.

I could see it by Process Monitor. But, not so big problem but a problem was found in "write only on shutdown".

As I intentionlly opened 3 Tb windows via "Open" of account/folder context menu, I could observe following.
(1) Upon each Tb window open, folderTree.json was read opened, data was read, and file is closed.
(2) When Tb is trminated by File/Exit, 3 sets of "Open Write with OverwriteIf, Write data, Close" was observed. Written data length was different.

The "read only on startup, write only on shutdown model" you call looks "per Tb window", not "per Thunderbird instance".

3 sets of "Open Write with OverwriteIf, consecutive Write data, Close" are perhaps serialized and execued sequentially, and it may usually not produce partial folderTree.json. However, These file writings are executed in termination process of Tb, and these 3 sets are invoked at same time.
This may increase probability of partial fileTree.json due to error while concecutive "Write data".

I think "corruption of fileTree.json" is usually "partial fileTree.json" because file writing is very simple "Open Write with OverwriteIf, consecutive Write data, Close". However, data in attached fileTree.json to bug 707329 were binary garbage. 
They are "fileTree.json after system crash or power failure", so anything can happen on corrupted file data. But "multiple file write of fileTree.json at same time during termination of Tb" may be relevant to such broken fileTree.json.
It may be a reason why fileTree.json is corrupted in user's environment, even when crash/power failure doesn't occur, many times than we think.
Comment 7 WADA 2012-01-15 21:07:42 PST
(In reply to Mark Banner (:standard8) from comment #4)
> I'm happy for safe output streams to be implemented, I'm not debating that.

As SQLite is already included in package and is already used, utilizing of BLOB(Binary Large Object) may be a simple solution, because ...json files are not-so-huge flat files and partial file data is merely a garbage for json. FileWriting/FileReading can simply be replaced by Insert/Delete/Replace/Select of SQL DB, although VACUUM is always hard-to-resolve issue in SQL DB.
Comment 8 :Irving Reid (No longer working on Firefox) 2012-01-24 20:09:04 PST
Created attachment 591359 [details] [diff] [review]
Use the safe file output stream to save foldertree.json

Haven't reproduced the problem, but the folder tree does save & restore correctly with this patch and it should be safer.
Comment 9 WADA 2012-01-24 21:39:50 PST
(In reply to Irving Reid (:irving) from comment #8)
> Use the safe file output stream to save foldertree.json

Changing from file-output-stream to safe-file-output-stream looks simple.
Is there any plan to use safe-file-output-stream for other json files?
> directoryTree.json
> search.json
> session.json
Comment 10 WADA 2012-01-25 08:01:38 PST
Question to developer on phenomenon of commenent #6.
Why should folderTree.json be saved multiple times when multiple Tb Window exist, even though only one snap shot of single Tb Window is finally saved to single folderTree.json file? (folderTree.json is replaced by each save for a Tb Window)
Comment 11 Mark Banner (:standard8) 2012-01-25 10:37:09 PST
(In reply to WADA from comment #9)
> Changing from file-output-stream to safe-file-output-stream looks simple.
> Is there any plan to use safe-file-output-stream for other json files?
> > directoryTree.json

Yes, we should probably do that as well (the implementation is in abTrees.js)

> > search.json

That's a core file so a separate bug.

> > session.json

Not sure we need to do this, but I guess it wouldn't hurt.
Comment 12 WADA 2012-01-25 21:47:45 PST
(In reply to Irving Reid (:irving) from comment #8)
> Use the safe file output stream to save foldertree.json

Because this bug is to find reason why problem reports on "blank folder pane" suddenly increased even though probaility of fileTree.json corruption is same since initial(answer to it is comment #5), and because issue of "one fileTree.json writing per a Tb window" is also found in this bug, I think patch proposal to this bug is confusing and this bug is not appropriate for it.
Irving Reid, could you open separate bug for it, please.
Comment 13 Mark Banner (:standard8) 2012-01-26 01:39:46 PST
Wada, I told Irving to put that patch on this bug. Having thought about it and discussed with people that are seeing this bug, I'm not sure we're realistically going to resolve why this is happening.

Therefore I think the safe output stream is the best thing we can do at the moment. If we still see instances after we release a fix with the safe output stream, then we can file a new bug and think about it some more.
Comment 14 :Irving Reid (No longer working on Firefox) 2012-01-28 05:19:53 PST
I'll add directoryTree.json and session.json to this patch.


search.json is managed by Mozilla platform code, and does *not* use safe-file-output-stream (http://mxr.mozilla.org/comm-central/source/mozilla/toolkit/components/search/nsSearchService.js#2555); should we file a bug on that?

training.dat and traits.dat use raw C++ standard library I/O (http://mxr.mozilla.org/comm-central/source/mailnews/extensions/bayesian-spam-filter/src/nsBayesianFilter.cpp#2506); shall we modify them to use safe-file-output-stream?

virtualfolders.dat is covered by Bug 650831

All other files in the root of the profile directory are either database format (mork, sqlite), or already using safe-file-output-stream as far as I can tell.
Comment 15 :Irving Reid (No longer working on Firefox) 2012-01-30 07:08:48 PST
Created attachment 592703 [details] [diff] [review]
Use safe file output stream for folderTree.json, directoryTree.json, session.json

Added safe output stream for more Thunderbird profile files
Comment 16 Mark Banner (:standard8) 2012-02-20 12:51:45 PST
Comment on attachment 592703 [details] [diff] [review]
Use safe file output stream for folderTree.json, directoryTree.json, session.json

This seems fine, lets get it out for some testing.

[Triage Comment]
Taking for aurora as it is a low-risk fix, but it'd be good to monitor for effects before release.
Comment 17 Wayne Mery (:wsmwk, NI for questions) 2012-02-20 16:45:24 PST
(In reply to Irving Reid (:irving) from comment #14)
>
> training.dat and traits.dat use raw C++ standard library I/O
> (http://mxr.mozilla.org/comm-central/source/mailnews/extensions/bayesian-
> spam-filter/src/nsBayesianFilter.cpp#2506); shall we modify them to use
> safe-file-output-stream?
Comment 18 Mark Banner (:standard8) 2012-02-21 04:34:33 PST
Checked in: http://hg.mozilla.org/comm-central/rev/c3184f1b3534
Comment 19 Mark Banner (:standard8) 2012-02-21 07:49:08 PST
Checked in:

http://hg.mozilla.org/releases/comm-aurora/rev/be844aef8229

Note You need to log in before you can comment on or make changes to this bug.