Consider introducing a short read option to "chaos mode"
Categories
(Core :: XPCOM, enhancement)
Tracking
()
People
(Reporter: nika, Unassigned)
References
(Depends on 1 open bug)
Details
In bug 1170668 it was discovered that some code in our codebase does not properly handle short reads from files. This likely doesn't come up often in our use in tests due to them generally being read from local disks, but can come up in situations like reading large files or over a network share.
It my be worthwhile to add a chaos mode option which sometimes artificially makes reads/writes using the file input streams only attempt to read a fraction of the number of characters originally requested to catch codepaths which expect short reads not to happen, potentially letting us catch and fix hard-to-find errors in input stream consumers.
Alternatively, we could look into performing a loop within the file streams to always perform full-length reads or writes when possible, meaning that callers which don't expect short reads/writes will implicitly work. This may fix more issues, but would likely come at a performance cost for well-behaved callers due to unnecessary extra read/write system calls.
Reporter | ||
Comment 1•2 years ago
|
||
From looking into this more, we probably don't want to do a short write case, as PR_Write
actually has special logic internally already to re-try writes until the entire provided buffer has been written (https://searchfox.org/mozilla-central/rev/4f2984be127d2e7c788cf1848d63dca63022beec/nsprpub/pr/src/pthreads/ptio.c#892-898). This means that nsFileOutputStream::Write
should never short-write under the hood and we don't need to worry about it for that case.
This doesn't apply to the read side, which NSPR allows to short-read (https://searchfox.org/mozilla-central/rev/4f2984be127d2e7c788cf1848d63dca63022beec/nsprpub/pr/src/pthreads/ptio.c#796-800), so something like this could still be valuable there.
Other streams we use internally may short write in some cases, but it shouldn't be an issue specifically for nsFileOutputStream
.
Reporter | ||
Comment 2•2 years ago
|
||
It appears that we already even have a chaos mode flag which claims to do similar things to what we'd want to do here (IOAmounts
: https://searchfox.org/mozilla-central/rev/4f2984be127d2e7c788cf1848d63dca63022beec/mfbt/ChaosMode.h#26-27), however right now it only applies to nsHttpConnection
(https://searchfox.org/mozilla-central/rev/4f2984be127d2e7c788cf1848d63dca63022beec/netwerk/protocol/http/nsHttpConnection.cpp#1645-1649). We could potentially expand that feature to also apply to file reads, though this may cause us issues as it would mean that the feature is immediately enabled, and may start to cause many intermittent test failures (especially under test-verify in CI).
Hi,
This is a great idea.
Thunderbird suffers from short read, obviously because it needs to store local mail articles on a remote server.
Bug 1170668 is only the tip of iceberg I have found (and not fixed since it is M-C code).
I have been trying to fix short read issues in C-C TB code for some time now.
Bug 1170606
I tried to find the issue by inserting a dummy open/read/read64, etc. call using LD_PRELOAD feature of linux and let the read system calls (there are some variants) to imitate short read.
This promptly helped me in finding problematic codes in C-C TB, but the problem was that there are SO MANY M-C code that suffer from this problem, too.
So I had to filter the problematic places so that I could focus only on the spots in C-C TB tree first.
But I assure you there are many spots in M-C tree that suffer from the issue.
JSON file read routine was one such glaring issue I found.
As you noted, I was afraid of Write routine, too, initially, but it is taken care of correctly by the underlying routine.
I will vote for implementing a test framework for short read in M-C and C-C.
This mechanism will help us in identifying places where short read can cause an issue and fixing them will help people suffer from short read issues sporadically.: storing profile on a remote file server is the most prominent problematic use case (and in TB's case, it is likely that mail store would be also on the remote server in that case).
TIA.
PS: I had the code in bug Bug 1170646 almost made it into the source tree, but a subtle bug in the code was found by a switch to a new Mac SDK, and unfortunately, my PC got fried in a very mysterious manner then (backplane of my external disk enclosure became faulty), it took me more than a month to figure out the exact cause and since I could not pay attention to the subtle bug in the code immediately, it got backed out.
That patch with a proper fix should go in IMHO.
PPS: In any case, with short read test routine, people would be amazed that right after TB starts, way before something is shown on the screen, the execution hits places where short read is not properly handled at all.
Description
•