Closed Bug 549224 Opened 15 years ago Closed 2 years ago

[META] valgrind: mochitests-plain: fix all Memcheck-detectable memory errors (x64-Linux)

Categories

(Core :: General, defect)

x86_64
Linux
defect

Tracking

()

RESOLVED FIXED

People

(Reporter: jseward, Assigned: jseward)

Details

(Keywords: meta, valgrind)

Attachments

(2 files)

It would be nice to ship 1.9.3 with zero memory errors as detectable by run time detection tools. As a starting point, this metabug covers errors detectable by Valgrind's Memcheck tool, for a complete run of mochitests-plain, on x86_64-Linux.

Using the techniques described at
https://developer.mozilla.org/en/Debugging_Mozilla_with_Valgrind#Tips_for_improving_performance_and_accuracy_of_Valgrind%27s.c2.a0Memcheck_tool
it is possible to complete a mochitest-plain run of Fx in less than 7 CPU hours on a fast machine with 3GB memory. This metabug tracks the individual bugs harvested from, or observable from, such runs.

Why x86_64-Linux and not a more common target? Because Valgrind can run Fx on that platform relatively fast (30 x slowdown), as per comments in abovementioned URL. Most of the bugs found are in cross-platform code, so this exercise is of value to all platforms. Future work may expand the set of platforms and tools for which this exercise is viable.

--- SETUP -----------------------------------------------

You must use the suppressions and mozconfig files attached to this bug.

Platform: Ubuntu 9.10 x86_64, but any recent 64-bit Linux would do.

Recent Valgrind trunk as per http://www.valgrind.org/downloads/repository.html. Don't use old versions, they are slower and less stable than trunk.

Make sure your /usr/include/valgrind/{valgrind,memcheck}.h are either installed from the Valgrind build, or are symlinks to it. The Fx build will need them for code-discard notifications.

I built Fx with vanilla FSF gcc-4.3.4 with "-g -O2". For unknown reasons (possible Valgrind bug) a build with gcc-4.4.x at -O2 segfaults when run on Valgrind, so don't use that. Note that gcc-4.4.1 is the default compiler on 9.10. Make sure you get a 64-bit build, not a 32-bit one.

Use the mozconfig file attached to this bug. The most critical things are to disable JEMalloc and to build at a high optimisation level.

Build; check the build works.
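The prerequisites above can be checked mechanically before starting a long build. What follows is only a minimal sketch, assuming a POSIX shell; the function names are hypothetical and not part of the Fx build or of Valgrind. It only tests the three conditions named in the setup notes (compiler series, presence of the Valgrind headers, 64-bit binary) and does not verify that the headers actually come from the Valgrind trunk build.

```shell
# Preflight sketch for the setup notes. All function names are hypothetical.

# Succeeds unless the given gcc version string is in the 4.4.x series,
# which the notes report as producing a build that segfaults on Valgrind
# at -O2. Other versions are simply not flagged, not vouched for.
gcc_version_ok() {
    case "$1" in
        4.4|4.4.*) return 1 ;;
        *)         return 0 ;;
    esac
}

# Succeeds if valgrind.h and memcheck.h both exist in the given include
# directory (default: /usr/include/valgrind). Symlinks are followed.
valgrind_headers_present() {
    hdr_dir="${1:-/usr/include/valgrind}"
    [ -f "$hdr_dir/valgrind.h" ] && [ -f "$hdr_dir/memcheck.h" ]
}

# Succeeds if the given text (as printed by file(1) for a binary)
# describes a 64-bit x86-64 ELF object.
is_64bit_elf() {
    case "$1" in
        *"ELF 64-bit"*"x86-64"*) return 0 ;;
        *)                       return 1 ;;
    esac
}
```

Typical use would be along the lines of `gcc_version_ok "$(gcc -dumpversion)" || echo "avoid gcc 4.4.x"`, and `is_64bit_elf "$(file path/to/binary)"` with whatever binary path the build produced.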
Make sure you have a local DBUS:

  killall dbus-daemon
  eval `dbus-launch` \
     && export DBUS_SESSION_BUS_ADDRESS && export DBUS_SESSION_BUS_PID

--- RUN -------------------------------------------------

Start up a VNC server -- I use 960 x 720 x 16-bit depth.

Run the entire suite using the following command. You will need to set DISPLAY correctly, and make --suppressions point to the attached .supp file.

  (DISPLAY=:1.0 make -C ff-opt mochitest-plain EXTRA_TEST_ARGS='--close-when-done --debugger=vTRUNK --debugger-args="--tool=memcheck --error-limit=no --stats=yes --vex-guest-chase-cond=yes --suppressions=/home/sewardj/MOZ/mochitest-mc.supp --trace-children=yes --child-silent-after-fork=yes '--trace-children-skip=/usr/bin/hg,/bin/rm,*/bin/certutil,*/bin/pk12util,*/bin/ssltunnel'"') 2>&1 | tee spew5-memcheck-jit-enabled

Progress is very non-linear. Most of the time goes in tests numbered 50000-65000 approximately. Once past the low 60ks it finishes quite rapidly (160k tests in total).

Peruse results.

To rerun a specific test, for debugging purposes, add TEST_PATH to the above command line, eg:

  ... make -C ff-opt mochitest-plain \
     TEST_PATH=dom/tests/mochitest/whatwg/test_bug500328.html \
     EXTRA_TEST_ARGS=...

Mochitests prints the name of a test only after it's done. So if memcheck emits a bunch of complaints, look for the name immediately after those complaints, not before.
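The rule of thumb above (the test name follows the complaints) can be applied to a saved log with a small filter. This is a rough sketch, not a tool that ships with mochitest or Valgrind: the function name is hypothetical, and the default marker TEST-END is an assumption about how the harness labels a finished test, so pass a different pattern as the second argument if your log differs. It recognises only a subset of Memcheck report headers, relying on the fact that Memcheck prefixes its output lines with ==PID==.

```shell
# attribute_memcheck_errors LOGFILE [MARKER_REGEX]
# Hypothetical helper: for each line matching MARKER_REGEX (assumed to be
# the harness's "test finished" line; default TEST-END is an assumption),
# print how many Memcheck error reports appeared since the previous marker.
# Tests with no preceding reports are not printed.
attribute_memcheck_errors() {
    awk -v marker="${2:-TEST-END}" '
        # First line of a Memcheck error report (subset of report kinds).
        /^==[0-9]+== (Invalid (read|write)|Conditional jump|Use of uninitialised|Syscall param)/ {
            pending++
            next
        }
        # Test-completion line: charge the pending reports to this test.
        $0 ~ marker {
            if (pending > 0)
                print pending " error(s) before: " $0
            pending = 0
        }
    ' "$1"
}
```

For the run shown earlier this would be invoked as `attribute_memcheck_errors spew5-memcheck-jit-enabled`. With --trace-children=yes, output from several processes can interleave, so treat the attribution as a hint and confirm it by rerunning the named test with TEST_PATH.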
Assignee: nobody → jseward
Keywords: valgrind
Depends on: 549236
Keywords: meta
Depends on: 549501
Depends on: 549779
Some observations:

As of now (following filing of bug 549779) I'd guess I have found about 2/3 of the flaws detectable like this. All but one of them are to do with use of undefined values, most of which were created by stack allocations. They are all in XP code.

It surprised me that there were not more invalid-address errors (reading/writing in a bad place). This might be because invalid address errors are easier to track down, or it might be because uninitialised value errors are regarded as less dangerous (a fallacy! they can just as easily lead to bizarre behaviour and crashing).

Some of these flaws have, I suspect, been around a long time. Some of them involve lengthy and obscure control-flow paths which makes them pretty hard to track down. Bug 549236 is an example of both.

I have a temporary rollup patch which "fixes" most of the problems so that further checking can go on without being swamped by noise from the so-far-discovered problems, until such time as proper fixes for them are made. Ping me if you want a copy.
Depends on: 550211
Severity: normal → S3

Closing inactive metabugs

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED