[meta] Unit tests should cover error handling due to I/O failures.
Categories
(Thunderbird :: Testing Infrastructure, task)
Tracking
(Not tracked)
People
(Reporter: benc, Unassigned)
References
Details
(Keywords: meta)
We don't really have any unit test coverage of error handling for I/O errors (eg disk full, network error etc...), and we really should.
It's a tricky problem, and I'm not quite sure what the best approach is.
But ideally I'd like to see some API that xpcshell-tests (and c++ gtests and rust tests...) can use to induce specific errors at specific points, eg "The next read operation should fail", or "the next disk write should fail with a disk-full error"... etc etc...
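As a rough illustration of what such an API might feel like from a test's point of view, here is a minimal Python sketch. It is not Thunderbird code and not an existing API: `fail_next_write` and `save` are hypothetical names, and patching `os.write` with `unittest.mock` only stands in for whatever real interception mechanism (OS-level or otherwise) would be used.

```python
import errno
import os
from contextlib import contextmanager
from unittest import mock


@contextmanager
def fail_next_write(err=errno.ENOSPC):
    """Hypothetical helper: the next os.write() fails once, later calls work."""
    real_write = os.write
    state = {"armed": True}

    def fake_write(fd, data):
        if state["armed"]:
            state["armed"] = False
            raise OSError(err, os.strerror(err))
        return real_write(fd, data)

    # Replace os.write only for the duration of the "with" block.
    with mock.patch("os.write", fake_write):
        yield


def save(fd, data):
    """Hypothetical code under test: reports the failure instead of crashing."""
    try:
        os.write(fd, data)
        return "ok"
    except OSError as e:
        return "disk full" if e.errno == errno.ENOSPC else "error"
```

A test would wrap the operation in `with fail_next_write():` and then assert that the code under test reported "disk full" rather than losing data silently.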
There have been efforts to create controllable "stub" versions of various classes (eg nsIFile, in Bug 784870), but I think this is always going to be problematic and piecemeal. The surface area is too large and it's too clunky and intrusive.
My gut feeling is that to be properly comprehensive, it'll have to be done by talking to the OS level - eg custom filesystem/network drivers with a mechanism for inducing failures, or being able to monkey-patch OS I/O calls on the fly.
Which sounds like an arse to integrate with CI and might only be available on some platforms, but I think increased test coverage - even if only on one platform - would be a huge win. Most code is platform-agnostic anyway, even error-handling code.
First step would be to collect and evaluate some alternative approaches and existing solutions - I'm sure other projects must be doing this kind of testing, so I'd be really surprised if there wasn't some good stuff out there already we could learn from.
Comment 1•1 year ago
A DTrace-like feature might be one way to go, although it is pretty comprehensive and needs to be implemented at the OS level. (I don't know if it is available under Linux, although the wiki seems to suggest OpenDTrace is available for Linux.)
I used DTrace under Sun's Solaris and FreeBSD to gather statistics for performance analysis, and it is very good.
You can write a DTrace script to simulate I/O errors. As a matter of fact, I used OpenSolaris for several years just to use DTrace. Its malloc/free behavior was also different from the GNU library's, which was useful for catching memory allocation bugs.
For network error simulation, I created a wrapper for the read/write routines using the dynamic loading feature of ld.so on Linux, so that the substituted read always returns fewer octets than requested. This helped me get the patch for Bug 1170606 right. (Basically, the original read/write routines were wrapped in my own read/write routines, which simulate short reads. I could have created similar write routines, but I realized Mozilla's I/O code already takes care of short writes by repeating the write until all requested octets are written or an error occurs. Thus I needed to fix TB only for short reads, and only for users on non-Windows platforms. This is because Windows takes care of short reads under the hood, so to speak: the Windows API for read handles short reads for CIFS/Samba mounts. I am not so sure about NFS mounts, but those are rare for Windows users.)
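The short-read idea can be illustrated without ld.so at all. The following is a minimal Python sketch, not the actual wrapper: `ShortReader` and `read_fully` are hypothetical names, and the 3-octet limit is an arbitrary assumption. It shows a substituted read that always returns fewer octets than requested, and the kind of retry loop a caller needs in order to cope with it.

```python
import io


class ShortReader:
    """Wraps a binary stream so every read returns at most `limit` octets."""

    def __init__(self, inner, limit=3):
        self.inner = inner
        self.limit = limit

    def read(self, n):
        # Deliberately hand back less than was asked for (a "short read").
        return self.inner.read(min(n, self.limit))


def read_fully(stream, n):
    """Keep reading until n octets have arrived or the stream is exhausted."""
    buf = bytearray()
    while len(buf) < n:
        chunk = stream.read(n - len(buf))
        if not chunk:  # end of stream: return what we have
            break
        buf += chunk
    return bytes(buf)
```

Code that calls `read` once and assumes it got everything breaks under `ShortReader`, while code written like `read_fully` does not, which is the distinction the wrapper was meant to expose.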
Early this year, I also ended up creating a wrapper (a series of macros) to test mork's behavior in the presence of the I/O errors I reported earlier this year. (Thankfully, mork handles short reads correctly. That was a relief.)
The macros let me talk to a local test harness to simulate a network I/O error at a specific function entry (I had to insert a macro at the entry of each function to do this), so that the network I/O error could be made to happen at the N-th call of a particular function (N can be specified by the test harness). This helped me figure out whether a network error is properly handled during message download and during message read/write against a remote file server where the user's profile is stored.
(This approach covered network file system error conditions in a more comprehensible manner than my unplugging the network cable randomly, and I learned of a few places where the error message could be better.)
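The "fail at the N-th call of a named function" idea can be sketched as follows. This is a Python illustration only, not the actual C++ macros: `arm`, `fault_point` and `write_message` are hypothetical names, the decorator stands in for a macro placed at function entry, and an in-process dictionary stands in for the external test harness.

```python
import errno
import functools

# function name -> [calls remaining until the fault fires, errno to raise]
_armed = {}


def arm(name, nth, err=errno.EIO):
    """Ask for the nth future call of the named function to fail."""
    _armed[name] = [nth, err]


def fault_point(func):
    """Decorator standing in for a macro inserted at function entry."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        entry = _armed.get(func.__name__)
        if entry is not None:
            entry[0] -= 1
            if entry[0] == 0:
                # Fire once, then disarm so later calls succeed again.
                del _armed[func.__name__]
                raise OSError(entry[1], "injected fault in " + func.__name__)
        return func(*args, **kwargs)
    return wrapper


@fault_point
def write_message(store, msg):
    """Hypothetical I/O function; here it only appends to a list."""
    store.append(msg)
    return len(store)
```

After `arm("write_message", 3)`, the first two calls succeed, the third raises `OSError` with `EIO`, and later calls succeed again, so the same failure can be reproduced on every run.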
I wanted to see if it is practical to cause an I/O error at a particular line in the source program instead of at function-entry granularity, but short of embedding many macro calls in the source files and cluttering them, I think it is close to impossible. Something like DTrace, with a means to specify the code position within a source file (by looking at the stack trace of a system call when the DTrace script is invoked), would be preferable for simulating I/O errors in general (open, close, read, write, lstat, etc.). The beauty of DTrace is that you don't have to recompile your program to do some basic stuff. That is very important. For the mork I/O error wrapper, I had to rebuild C-C TB whenever I needed to add a function as a target of network I/O errors, and there are plenty of functions in C-C TB. (Obviously not all of them are involved in I/O operations, and the overhead of a macro call should not be imposed on functions unrelated to I/O.)
My macro framework was very crude and slow. (I did not want to invest in creating a kernel module or anything for this particular test, and thus used an external shell script to record the state of the program, i.e., C-C TB, and triggered a network error by disabling the network adaptor that talks to a remote NFS server where the TB profile, including messages, is stored.)
A DTrace-like feature, or eBPF (mentioned later), that is implemented in the kernel and has an API to control it from user space is preferable for speed.
That said, my crude macro approach for figuring out mork I/O error issue worked very well.
I found that the opendtrace GitHub repository does not mention Linux at all: https://github.com/opendtrace
BTW, DTrace is part of macOS, so it is available on the Mac platform IIUC. It is also available in Oracle Linux. For many of the developers this is not helpful at all.
Linux seems to be going the way of eBPF for gathering statistics, collecting kernel information, and changing the behavior of system calls (for a single program or in general), such as simulating an error (returning an I/O error code).
- BPF talk: https://www.youtube.com/watch?v=JRFNIKUROPE
- eBPF resources: https://ebpf.foundation/ebpf-resources/
It seems that we need to depend on a tool that is supported by big players so that we don't need to support it on our own.
Comment 2•1 year ago
Since I mainly use Linux for producing patches, for me a framework usable on Linux is the way to go.
(I do run FreeBSD on my TrueNAS box, but that is not a main development machine.)
Someone can chime in with suggestions for a Windows test coverage or error simulation framework.
I found the following link on YouTube helpful for learning eBPF. (But I have not really used eBPF much yet. For a long time I thought eBPF was only for packet filtering.)
https://www.youtube.com/watch?v=s1mobd8t_u0&list=PLDg_GiBbAx-l4D4oKbscJhPFKv2oqPcD_
For modifying the return value of a syscall using eBPF, I found the following Stack Overflow question.
It may not be as easy as one wishes it to be, but we have to start from somewhere.
https://stackoverflow.com/questions/43003805/can-ebpf-modify-the-return-value-or-parameters-of-a-syscall
Comment 3•1 year ago
Fault injection would be the term to describe what is needed.
https://en.wikipedia.org/wiki/Fault_injection
However, the tools available are of an ad hoc nature.
For example, fuzzing is very good at inducing errors.
But after a certain error found by fuzzing is fixed, we need a particular set of errors that happen at the right moment to see if the fix is good.
Fuzzing tools have a mechanism to do that.
On the other hand, my intention in checking I/O errors during POP3 mail downloading (which eventually found a mork error and led to a fix for it) was not that ad hoc. My debug effort was meant to create a network error while the network download of a POP3 message goes on AND the read/write takes place against a remote file server where the user profile, including messages, is stored.
So we DO need a semantic framework to narrow the occurrence of errors to selected components AND to a selected period of operation (expressed in terms meaningful to C-C TB).
The tools mentioned in the Wikipedia page may have different degrees of sophistication in terms of narrowing down the components of the software where errors occur and when they occur.
We need some input from users of each of the tools mentioned there. (There probably are blog pages from the users.)
One reason I needed to handcraft an error injection framework for an I/O error during POP3 download when a user profile is stored on a remote server is that I knew there had been issues in error handling over the last dozen years, but the test coverage I could achieve was very random and ad hoc, in the sense that the I/O error injection was done manually by virtually unplugging the network cable. (Actually, I disabled the network interface.)
- Since it was manual, repeatability was bad.
- Manual intervention is not desirable for wide coverage. I needed automation.
That is why I resorted to embedding a macro call in each function where I wanted a network I/O error against the remote server to occur.
The simulated error occurs only in functions where I want them to occur.
I could instruct the test framework to cause an error at the N-th execution of the function (instead of randomly).
So repeatability was good.
There was no need for manual intervention. Once I set the N and function name to indicate when and where a simulated network error should occur, the error occurs at the desired place and I could check whether proper error recovery occurs.
In any case, some type of coding effort on the application side to cope with the error injection framework will be necessary.
The limitation of fault injection for testing error recovery is that it is good mainly for checking non-GUI operation.
Trying to check whether C-C TB behaves sanely, with a proper error dialog and such, is another matter.
Handling error dialogs in X Window (to see whether the proper error dialog is shown or not) has proved very difficult.
I used a GUI-automation tool called SikuliX: http://sikulix.com/
It was a bit clunky, and since I learned it while developing the error injection framework, I would now code the SikuliX test script differently (and probably use a second screen instead of trying to use a Xephyr virtual desktop on my real desktop).
There are font size issues that made it difficult for me to use on my native screen, so I used a magnification factor to make the font bigger. Unfortunately, SikuliX's matching function, which detects whether a GUI component is present on the screen, does not handle a difference in magnification factor. Thus the script I created after I enabled magnification cannot be used on a default setup, etc. If I want to run it in the default setting on a second screen where a new login is done, I need to re-create the SikuliX script, incorporating screen images captured at the default magnification.