Closed Bug 549224 Opened 14 years ago Closed 1 year ago

[META] valgrind: mochitests-plain: fix all Memcheck-detectable memory errors (x64-Linux)

Categories

(Core :: General, defect)

x86_64
Linux
defect

RESOLVED FIXED

People

(Reporter: jseward, Assigned: jseward)

References

Details

(Keywords: meta, valgrind)

Attachments

(2 files)

It would be nice to ship 1.9.3 with zero memory errors detectable
by run-time detection tools.  As a starting point, this metabug covers
errors detectable by Valgrind's Memcheck tool, for a complete run of
mochitest-plain, on x86_64-Linux.  Using the techniques described at

https://developer.mozilla.org/en/Debugging_Mozilla_with_Valgrind#Tips_for_improving_performance_and_accuracy_of_Valgrind%27s.c2.a0Memcheck_tool

it is possible to complete a mochitest-plain run of Fx in less than 7
CPU hours on a fast machine with 3GB memory.  This metabug tracks the
individual bugs harvested from, or observable from, such runs.

Why x86_64-Linux and not a more common target?  Because Valgrind can
run Fx on that platform relatively fast (roughly a 30x slowdown), as
per the comments at the URL above.  Most of the bugs found are in
cross-platform code, so this exercise is of value to all platforms.
Future work may expand the set of platforms and tools for which this
exercise is viable.


--- SETUP -----------------------------------------------

You must use the suppressions and mozconfig files attached to this bug.

Platform: Ubuntu 9.10 x86_64, but any recent 64-bit Linux would do.

Use a recent Valgrind trunk, as per
http://www.valgrind.org/downloads/repository.html.
Don't use old versions; they are slower and less stable than trunk.

Make sure your /usr/include/valgrind/{valgrind,memcheck}.h are either
installed from the Valgrind build, or are symlinks to it.  The Fx
build will need them for code-discard notifications.

I built Fx with vanilla FSF gcc-4.3.4 with "-g -O2".  For unknown
reasons (possibly a Valgrind bug) a build with gcc-4.4.x at -O2
segfaults when run on Valgrind, so don't use that.  Note that
gcc-4.4.1 is the default compiler on Ubuntu 9.10.  Make sure you get
a 64-bit build, not a 32-bit one.
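
One way to check the 64-bit requirement is to read the ELF class byte of the built binary directly.  The sketch below relies only on the ELF header layout (magic bytes 7f 45 4c 46, then the EI_CLASS byte at offset 4, where 1 means 32-bit and 2 means 64-bit).  The function name elf_class is made up for illustration.

```shell
# Hypothetical helper: print "64-bit" or "32-bit" for an ELF file.
elf_class() {
    # The first four bytes must be the ELF magic: 0x7f 'E' 'L' 'F'.
    magic=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    if [ "$magic" != "7f454c46" ]; then
        echo "not-elf"
        return 1
    fi
    # Byte at offset 4 is EI_CLASS: 1 = 32-bit, 2 = 64-bit.
    case $(od -An -tu1 -j4 -N1 "$1" | tr -d ' \n') in
        1) echo "32-bit" ;;
        2) echo "64-bit" ;;
        *) echo "unknown"; return 1 ;;
    esac
}
```

For example, "elf_class ff-opt/dist/bin/firefox-bin" should print 64-bit for a suitable build.  The path is an assumption about where the object directory puts the binary; adjust it to your tree.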

Use the mozconfig file attached to this bug.  The most critical things
are to disable JEMalloc and to build at a high optimisation level.

Build; check the build works.  Make sure you have a local DBUS:

  killall dbus-daemon
  eval `dbus-launch` \
       && export DBUS_SESSION_BUS_ADDRESS && export DBUS_SESSION_BUS_PID


--- RUN -------------------------------------------------

Start up a VNC server -- I use 960 x 720 x 16-bit depth.  Run the
entire suite using the following command.  You will need to set
DISPLAY correctly, and make --suppressions point to the attached .supp
file.


(DISPLAY=:1.0 make -C ff-opt mochitest-plain EXTRA_TEST_ARGS='--close-when-done --debugger=vTRUNK --debugger-args="--tool=memcheck --error-limit=no --stats=yes --vex-guest-chase-cond=yes --suppressions=/home/sewardj/MOZ/mochitest-mc.supp --trace-children=yes --child-silent-after-fork=yes '--trace-children-skip=/usr/bin/hg,/bin/rm,*/bin/certutil,*/bin/pk12util,*/bin/ssltunnel'"') 2>&1 | tee spew5-memcheck-jit-enabled

Progress is very non-linear.  Most of the time is spent in tests
numbered approximately 50000-65000.  Once past the low 60ks it
finishes quite rapidly (160k tests in total).

Peruse the results.  To rerun a specific test, for debugging purposes,
add TEST_PATH to the above command line, e.g.:

  ... make -C ff-opt mochitest-plain \
      TEST_PATH=dom/tests/mochitest/whatwg/test_bug500328.html \
      EXTRA_TEST_ARGS=... 

Mochitest prints the name of a test only after it has finished.  So
if Memcheck emits a bunch of complaints, look for the test name
immediately after those complaints, not before.
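
The "look after, not before" rule can be applied mechanically to a saved log.  The following is a rough sketch, not a tool I actually use: it counts Valgrind message lines (those starting with ==PID== followed by non-blank text) and charges them to the next line that names a test.  The function name attribute_errors and, in particular, the default pattern for recognising a test line are assumptions; pass your own pattern as the second argument if your log looks different.

```shell
# Hypothetical helper: attribute_errors LOGFILE [TEST_LINE_REGEX]
# Prints "<count><TAB><test line>" for each test line that is preceded
# by one or more Valgrind message lines.
attribute_errors() {
    log="$1"
    test_re="${2:-}"
    # ASSUMPTION: lines naming a finished test contain one of these tags.
    [ -n "$test_re" ] || test_re='TEST-(PASS|KNOWN-FAIL|UNEXPECTED-FAIL)'
    awk -v test_re="$test_re" '
        # Valgrind message line: "==PID== " then a non-space character.
        # Indented stack frames and bare "==PID==" separators are skipped.
        /^==[0-9]+== [^ ]/ { pending++; next }
        # First test line after some Valgrind output gets the blame.
        $0 ~ test_re {
            if (pending > 0) print pending "\t" $0
            pending = 0
        }
    ' "$log"
}
```

For the run above, "attribute_errors spew5-memcheck-jit-enabled" gives a first cut at which tests to rerun with TEST_PATH.  The counts include any Valgrind message, not only errors, so treat them as a hint.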
Assignee: nobody → jseward
Keywords: valgrind
Depends on: 549236
Keywords: meta
Depends on: 549501
Depends on: 549779
Some observations:

As of now (following the filing of bug 549779) I'd guess I have found
about 2/3 of the flaws detectable like this.  All but one of them
involve the use of undefined values, most of which were created
by stack allocations.  They are all in cross-platform (XP) code.

It surprised me that there were not more invalid-address errors
(reading/writing in a bad place).  This might be because
invalid-address errors are easier to track down, or it might be
because uninitialised-value errors are regarded as less dangerous
(a fallacy!  They can just as easily lead to bizarre behaviour
and crashing).

Some of these flaws have, I suspect, been around a long time.
Some of them involve lengthy and obscure control-flow paths,
which makes them pretty hard to track down.  Bug 549236 is an
example of both.

I have a temporary rollup patch which "fixes" most of the problems
so that further checking can go on without being swamped by noise
from the so-far-discovered problems, until such time as proper fixes
for them are made.  Ping me if you want a copy.
Depends on: 550211
Severity: normal → S3

Closing inactive metabugs

Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED