Open Bug 1020229 Opened 11 years ago Updated 6 months ago

valgrind/memgrind uninitialized memory issues for C-C thunderbird

Categories

(Thunderbird :: General, defect)

x86_64
All
defect

Tracking

(Not tracked)

People

(Reporter: ishikawa, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

I ran valgrind/memcheck to C-C thunderbird last week. Basically I ran |make mozmill| test suite of TB, and TB was run under memcheck. After combing through possibly new memory leaks, I noticed that a few new uninitialized memory accesses (not noticed before) crept in in the last few months. I am attaching an edited excerpt from the log. The names of the functions shown in memcheck's suggested suppression template have been demangled by c++filt for easy reference. Also, the template is preceded by the stack trace that leads up to where the memory (uninitialized) was created (this is simply a copy of the stack trace shown by memcheck.) SUMMARY: There were 9 pairs of stack trace and memcheck's suppression template. Looking at where the uninitialized memory location was created, I recognize three patterns. (1), (3) --- related to graphics library usage. gfxContext::PopGroup() (mozalloc.h:201) (Not sure if this TB -specific or a bug in the graphics library.) (2), (4), (6) (7), (8), (9) --- related to mail DB stored in a file. nsMsgDBService::OpenMailDBFromFile(nsIFile*, nsIMsgFolder*, bool, bool, nsIMsgDatabase**) (mozalloc.h:201) (5) MOZCRASH This is related to MOZCRASH: not really an uninitialized issue. But it suggests that mozilla::storage::Service::Observe(nsISupports*, char const*, char16_t const*) has a serious issue (null reference?). Maybe it does not have an Observer (==null) when the reference is made? I will investigate the OpenMailDBFromFile() issue, first. Looking at the memcheck's suggested suppression template, which reported the uninitialized memory accesses in JavaScript interpreter functions, I think an uninitialized variable or two in C++ code were passed to JavaScript side (java interpreter) via XPCOM remote procedure call. This happened before a few times. TIA (The attachment has the preamble which is the main text of this post.)
Blocks: 803816
(this wouldn't be OS specific)
OS: Linux → All
Severity: normal → S3

obsolete?

Flags: needinfo?(benc)

Likely the specific issues are obsolete, but it'd definitely be nice to do a new run over with valgrind/memgrind and see what else pops up.

Flags: needinfo?(benc)

(In reply to Ben Campbell from comment #3)

Likely the specific issues are obsolete, but it'd definitely be nice to do a new run over with valgrind/memgrind and see what else pops up.

As of now, this is the situation of valgrind + C-C TB.

  • Last year, I could no longer interactively run C-C TB under valgrind due to Bug 1880148
    I gave up on running C-C TB under valgrind and switched to
    ASAN build to see if I can find issues. Bug 1929721

  • However, ASAN build is good at finding out buffer overrun, etc., but it cannot find the real uninitialized issue very well.

  • About a week ago, I was pleasantly surprised to find that I can, for now, run C-C TB + valgrind combination for an interactive session (!).
    Details are in https://bugzilla.mozilla.org/show_bug.cgi?id=1880148#c14

So since then, I have been running GCC-compiled C-C TB under valgrind.
Basically, I can run xpcshell test with C-C TB + valgrind combination.
I found a suspicious issue. Bug 1952749 (I am not ruling out valgrind thread model may break some fundamental assumption about TB thread model.)

But mochitest is hopeless. There are still many timeout values that need tweaking to take care of the slowdown caused by valgrind.
This has been the case since we switched from mozmill to mochitest framework. Bug 1636945
At least now, C-C + TB combination is working and I can try experimenting the lengthening of timeout values. (It took me better part of several months to make timeout values of mozmill to make C-C TB run |make mozmill|. Ever since we have switched to mochitest, C-C TB + valgrind combination did not run smoothly very much, that is why I have not been able to extend the timeout values successfully.
So far, I found that

  • some tests that usually timeout in normal time times even cannot be run successfully at all and probably simply better skipped.
    I simply decided to skip all the tests under browser subdirectory.
  • I realized that there are many tests that simply fail due to a daemon that is started at the beginning of executing a new test fails due to the failure of binding to a TCP port.
    I figureed this. When a test time out occurs, |mach| cannot seem to kill / terminate the daemons that were spawned under valgrind because they could not respond in a timely manner and |mach| seems to assume that the daemons died when timeout occurs and did not issue explicit kill command. The persistent daemon(s) interfere with the next test because at the beginning of next test, daemons are again spawned but they fail to start because they cannot bind to TPC port which is still occupied by the earlier still alive daemon. Failing to start the daemon finishes the test as failure and thus, basically, once a timeout occurs, no successive tests are executed meaningfully.

At least, I have figured it out. So I am either skipping the helpless test and/or lengthening the timeout value to make the first failling test work.
I realized timeout value for waiting for the test to produce any output at all had to be lengthened as well. Ouch. valgrind does slow down some actions very much.
But this is a great progress so far. At least, now I can figure out WHAT timeout value needs to be lengthened. Previous error mode of C-C + TB was very difficult to analyze and helpless. For example, Bug 1732848

However, at this pace, I don't think I can extend the timeout values by the end of 2025. The execution of |mach mochitest| when C-C TB is run under valgrind is very slow.

BTW, I am not using the built-in valgrind feature any more. Bug 1631197
The quoting mechanism |mach| that passes options to |valgrind| was once broken, then fixed (can't recall which bug number it was), and broken again and I cannot pass the hairy option of valgrind via |mach|.
I looked at the code and it was really convoluted and no wonder the fix is not easy.
So, I am using the wrapper program. But this poses a problem of NOT passing the information that the test is executed under valgrind, and I cannot use the information to increase the timeout value programmatically. Oh well. Once I figured out which timeout values need lengthening, I can probably come back to the quote mechanism again.

I am running a few tests at a time to check for timeout, etc. Still very slow progress.

In the past, there have been issues to try to run C-C TB under valgrind. E.g. Bug 1629433 Bug 1880148 Bug 1732848 Bug 1534109 Bug 1346534
Simply running C-C TB + valgrind sometimes become an adventure game, almost.

While somehow C-C TB + valgrind combination works, I wish I could verify the existence of the previously reported issues such as
Bug 809866
Bug 803816
Bug 1725959
Bug 1162079
Bug 1311202
Bug 965424
Bug 1735940
Bug 1020229 (this one)

ASAN build has its own problems by the way. Some issues make it crash prematurely and I could not list all the issues ASAN could find until
I locally replace the event-related library with an upstream version whoesale. Yeah, ASAN detected an issue INSIDE the even library in C-C + M-C tree and TB could not proceed further if the ASAN test mode is to crash it then and there.
In hind sight, maybe I could run it simply dump error info and continue, but once such an error that is detectable by ASAN build, I doubt the operation of C-C TB is trusted after that point. We need to remove the earler error to proceed to find the later issues.
In order to do that, I had to replace the event library, which is a major surgery. Bug 1909876 (ASAN detected heap-use-after-free and as such, I marked this as security-related.)

Several years ago, I read somewhere that FF tests under valgrind is executed monthly, and some people tested FF under valgrind while they surf the web.
It is hard to believe now with all the problems I have faced.
Nobody promised us a rose garden, but why is C-C TB so difficult to run under valgrind in contrast to FF (if what I read was real) ?
Hmm...

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: