Open Bug 1629433 Opened 5 years ago Updated 2 years ago

Running thunderbird binary under valgrind while trying to run mochitest requires way TOO MANY threads (> 1500) !?

Categories

(Thunderbird :: Testing Infrastructure, defect)

x86_64
Linux
defect

Tracking

(Not tracked)

People

(Reporter: ishikawa, Unassigned)

References

Details

Attachments

(11 files)

Note: "thread" here refers to the mult-threading on CPU. Not the threading of topics of mail messages.

--

I have tried to run the thunderbird binary under valgrind while executing |mach mochitest|.

valgrind blows up, stating that there are simply too many threads being created.
I raised the maximum number of threads supported by valgrind via its --max-threads=NNN parameter to 1500, and valgrind still blew up.

So there is something very wrong with the valgrind + thunderbird combination when |mach mochitest| is executed.

Background:

I am not sure how a valgrind run is supported for the mochitest of TB, so
I created a binary that executes, under valgrind, the original thunderbird binary (kept as thunderbird-bin) under the MOZ_OBJ/dist/bin directory.
This new binary, which invokes thunderbird-bin under valgrind, is copied into MOZ_OBJ/dist/bin/thunderbird.

This arrangement has worked fine with xpcshell-tests and the |make mozmill| test suite for some years.

I tested the xpcshell-tests under valgrind in this manner as recently as Feb 22 of this year. It worked fine.

Ever since |make mozmill| was retired in December 2019, I have been trying to run |mach mochitest| with thunderbird under valgrind to check memory-related issues in TB.
Unfortunately, I have not been successful so far because of this too-many-threads issue.

Regarding |mach mochitest|:

Without valgrind, using "no target", i.e., specifying no target after the |mochitest| command, seems to run all the subtests of the mochitest suite intended for TB, after |cd|-ing into the C-C ./comm subdirectory of the M-C ./mozilla directory.

However, with the same target, if I run TB under valgrind, the valgrind run blows up saying there are
too many tasks (default 500, and even when I increased it to 1500), and this happened before even a SINGLE test was run.

So I suspect one of the following causes:

  1. The default mochitest setup really does consume THAT MANY tasks. (Does anyone know whether this is also the case with the valgrind runs of FF that are done every few weeks or every month?) In that case, adding

"--max-threads=NNN " /* April, 2020. */

to the valgrind arguments seems to be the way to go. But still, even NNN=1500 did not solve the issue at all (!)

  2. TB's mochitest invoked inside valgrind does something funny to create so many tasks, or

  3. A possible valgrind issue:

I notice that valgrind printed the following warning during a local run, which is a known valgrind issue:

0:21.22 GECKO(275735) --275739-- WARNING: unhandled amd64-linux syscall: 315

https://github.com/svn2github/valgrind-master-mirror/blob/master/docs/internals/3_12_BUGSTATUS.txt

(My valgrind is fairly recent: valgrind --version
valgrind-3.16.0.GIT )

syscall 315 is |sched_getattr|. So it IS possible that there are issues related to task scheduling and creation.
But if so, there can and should be a problem with FF runs under valgrind as well, yet I could not find such a bug in bugzilla. I am a bit puzzled.

I am uploading an excerpt from my local log in the next comment.

See the excerpt from my local log.

I think valgrind blows up even before running any test.
This means the blowup due to too many threads happens during the
initialization step (possibly the webserver thingy, from what I see in
the log?)

Observation:

You can see the following lines when my wrapper program to run the
original thunderbird binary under valgrind is invoked:

 0:01.95 GECKO(325209) run-valgrind (masquerading as thunderbird binary)
 0:01.95 GECKO(325209) final command line is:
 0:01.95 GECKO(325209) valgrind --trace-children=yes --fair-sched=yes --smc-check=all-non-file --gen-suppressions=all -v --trace-signals=yes --vex-iropt-register-updates=allregs-at-mem-access --child-silent-after-fork=yes --trace-children-skip=/usr/bin/hg,/bin/rm,*/bin/certutil,*/bin/pk12util,*/bin/ssltunnel,*/bin/uname,*/bin/which,*/bin/ps,*/bin/grep,*/bin/java --max-threads=500 --num-transtab-sectors=24 --tool=memcheck --freelist-vol=500000000 --redzone-size=128 --px-default=allregs-at-mem-access --px-file-backed=unwindregs-at-mem-access --malloc-fill=0xA5 --free-fill=0xC3 --num-callers=50 --suppressions=/home/ishikawa/Dropbox/myown.sup --show-mismatched-frees=no --show-possibly-lost=no --read-inline-info=yes  /KERNEL-SRC/moz-obj-dir/objdir-tb3/dist/bin/thunderbird-bin -marionette -foreground -profile /COMM-CENTRAL/TMP-DIR/tmp3BUfge.mozrunner

You can also see the following lines when the thread number limit is
reached in valgrind:

 4:43.92 GECKO(325209) Use --max-threads=INT to specify a larger number of threads
 4:43.92 GECKO(325209) and rerun valgrind
 4:43.92 GECKO(325209) valgrind: the 'impossible' happened:
 4:43.92 GECKO(325209)	  Max number of threads is too low
 4:43.92 GECKO(325209) host stacktrace:
 4:43.92 GECKO(325209) ==325213==    at 0x5804669A: show_sched_status_wrk (m_libcassert.c:406)
 4:43.92 GECKO(325209) ==325213==    by 0x580467B7: report_and_quit (m_libcassert.c:477)

The log is from a run with the limit at the default (i.e., 500), and thus
the last thread dumped is as follows:

 4:46.29 GECKO(325209) Thread 499: status = VgTs_WaitSys syscall 202 (lwpid 325815)
 4:46.29 GECKO(325209) ==325213==    at 0x4870C3B: __lll_lock_wait (lowlevellock.c:52)
 4:46.29 GECKO(325209) ==325213==    by 0x4869892: pthread_mutex_lock (pthread_mutex_lock.c:80)

The issue still persists even when I set --max-threads=1500. (The
excerpt from that log would be too large, and it does not add much
useful information.)

Any hint to resolve the issue is appreciated.

My idea is to check

    • whether the caller of |sched_getattr| tries to create an extra thread
      when |sched_getattr| fails, and
    • whether the "FF under valgrind" test in the mozilla
      compilation/testing farm uses a very large --max-threads=NNN option to
      run FF under valgrind (a quick grep, sketched below, might answer this).
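The grep I have in mind is only a guess at where such a setting would live (build/valgrind holds the valgrind mach command; testing/mozharness holds the CI job configs), and I have not verified that any --max-threads value is actually set there:

cd /NREF-COMM-CENTRAL/mozilla
# Search the in-tree valgrind harness and CI configs for a --max-threads setting.
grep -rn -e '--max-threads' build/valgrind testing/mozharness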

TIA

In the log, you can see that

  • there are a few patterns of memory issues that are not handled by my suppression file:

-- The statx() signature is, I think, very new. There is a bugzilla report regarding statx():

https://bugzilla.mozilla.org/show_bug.cgi?id=1606007
Valgrind builds fail with rustc 1.40: Syscall param statx(file_name) points to unaddressable byte(s) at ...

I am not sure whether the signature observed in my local log and the one in bug 1606007 are the same.

-- There are some memory issues related to Python memory allocators, too.

  • TB calls a few child programs, and these show up in the log.

Hi,

I have created a patch to support |sched_getattr()| in valgrind.
But when I then ran thunderbird under valgrind during mochitest, I noticed I need to support |sched_setattr()| as well.

I posted the work-in-progress patch to support |sched_getattr()| and |sched_setattr()| to valgrind bugzilla.
https://bugs.kde.org/show_bug.cgi?id=369029

Even then, with the version of valgrind that supports these system calls, I still see valgrind blow up
(> 500 threads !?).
Looking at the many threads that seem to be waiting on a futex, I am afraid there may be
an issue with the support of futex-related calls under x86_64.

But then, I wonder how the valgrind test of FF is done in the mozilla compilation/testing farm, and whether it is free from these issues.
I don't think I can find any similar bugzilla report for FF.

Any tips will be appreciated.

Running TB under valgrind during mochitest is a sure way to discover many memory-related issues.

TIA

"futex-related calls under x86_64.": futex_wait_cancellable, to be exact.

With valgrind patched to support sched_{set|get}attr, I see it blow up again with the creation of more than 500 (default) threads.

I notice there are many threads that wait for futex, esp. futex_wait_cancellable.

Surprise.
I upped the max-threads limit to 5,000.

Now the test seems to proceed (!).
But then I got hit by the bug in
bug 1630345
./mach bootstrap fails with python-pip dependency issue: python-pip : Depends: python-pip-whl (= 18.1-5) but 20.0.2-4 is to be installed

When an assertion is hit and the stack backtrace is decoded, that is done by the external fix-stacks tool, and
I need to install this new tool by running |mach bootstrap|. But that failed due to some package dependencies in the newer Ubuntu/Debian distributions.

In any case, I would think that requiring more than 1500 (and < 5000) threads for merely testing a mail client is a waste of CPU and memory resources.
We are NOT testing a very busy web server at all.
Most of the testing is done by creating one event at a time to trigger a GUI-like operation of TB, if I am not mistaken.

Does anyone know how to reduce the number of threads to a mere few hundred or so?
Just running through the creation of this many threads takes valgrind many minutes...

(Again, the valgrind was patched with the crude and yet-to-be-polished patch posted in https://bugs.kde.org/show_bug.cgi?id=369029 )

TIA

Not sure how you're running it, but could ./mach valgrind-test be useful?
https://searchfox.org/mozilla-central/source/build/valgrind/mach_commands.py

No, I don't think valgrind-test is useful.
It seems to expect firefox as the test target.

Please see the following.

Your current directory is
++ pwd
/NREF-COMM-CENTRAL/mozilla/comm
++ /NREF-COMM-CENTRAL/mozilla/mach --log-no-times --log-no-times valgrind-test
kind=None
environ.get('TERM', 'unknown')=dumb
environ.get('TERM', 'linux')=dumb
It looks like you tried to run a mach command from an invalid context. The valgrind-test
command failed to meet the following conditions: 

  is_firefox - Must have a Firefox build.

Run |mach help| to show a list of all commands available to the current context.

I ran TB under valgrind during |make mozmill| test execution for some years.
The trick to do that is basically this (a minimal sketch follows the list):

  • save the original thunderbird binary somewhere. (Traditionally it is thunderbird-bin under the original directory.)
  • create a wrapper that invokes the original binary under valgrind with suitable options.
  • then install the wrapper as "thunderbird".
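The following is a minimal shell-script equivalent of that wrapper, just as a sketch (my actual wrapper is a small compiled program, and the full valgrind option list I use is in the PS at the end of this comment):

#!/bin/sh
# Installed as MOZ_OBJ/dist/bin/thunderbird; the real binary stays next to it
# as thunderbird-bin. Arguments from the test harness are passed through.
exec valgrind \
  --trace-children=yes --fair-sched=yes --smc-check=all-non-file \
  --tool=memcheck --max-threads=500 \
  --suppressions=/home/ishikawa/Dropbox/myown.sup \
  "$(dirname "$0")/thunderbird-bin" "$@"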

After this, |make mozmill| invoked TB under valgrind during its tests.
I can also run TB under valgrind during xpcshell-tests with this setup.
I have found a few memory issues with this setup since last December, for example.

It seems that the newcomer to the TB test scene, mochitest, which replaced |make mozmill| in December, is not quite valgrind-friendly as of now.

This is a serious test issue IMHO.

TIA
PS: BTW, the wrapper I create invokes TB with this list of arguments, which includes the ones set up by ./mach valgrind-test for FF.

- Valgrind options:
 --trace-children=yes
 --fair-sched=yes
 --smc-check=all-non-file
 --gen-suppressions=all
 -v
 --trace-signals=yes
 --vex-iropt-register-updates=allregs-at-mem-access
 --child-silent-after-fork=yes
 --trace-children-skip=/usr/bin/hg,/bin/rm,*/bin/certutil,*/bin/pk12util,*/bin/ssltunnel,*/bin/uname,*/bin/which,*/bin/ps,*/bin/grep,*/bin/java
 --max-threads=500  <=== THIS NEEDS TO BE INCREASED to 5000 for successful progress...
 --num-transtab-sectors=24
 --tool=memcheck
 --freelist-vol=500000000
 --redzone-size=128
 --px-default=allregs-at-mem-access
 --px-file-backed=unwindregs-at-mem-access
 --malloc-fill=0xA5
 --free-fill=0xC3
 --num-callers=50
 --suppressions=/home/ishikawa/Dropbox/myown.sup
 --show-mismatched-frees=no
 --show-possibly-lost=no
 --read-inline-info=yes
 

PPS: In any case, as of now, I cannot proceed very well due to the missing fix-stacks tool.
The stack dump decoder has been improved, but to obtain it I have to use the Python package installer (pip). Unfortunately, for Debian and presumably the newer Ubuntu distributions, there is a package dependency that fails when we run |mach bootstrap| today,
as in comment 6.
I can probably ignore MOZ_ASSERT failures, but that would not be a very useful valgrind run. :-(

PPPS: Oh, WAIT A SECOND. Does the error message mean I have to create an FF binary to run TB under valgrind automatically using valgrind-test ??? It is a bother, because it means that whenever I update the M-C/C-C tree combo, I have to re-build the FF binary as well.
That takes twice the build time, I suspect. That would be UNACCEPTABLE for many patch submitters like myself.
(Or maybe I can simply install a pre-built FF binary at a known place?)
But this is an unlikely scenario. There is no way to tell mach that FF is under this MOZ_OBJ directory and TB is under THAT other MOZ_OBJ directory, IIRC.

See Also: → 1636945

Well, the strange behavior of the mochitest of C-C TB requiring more than 1500 threads, and the failure to run valgrind successfully, still puzzles me.

A "futex wait storm", so to speak, is observed during the periods of apparently no externally observable activity during the C-C TB mochitest.
That is, when the test does not seem to proceed (no debug dump lines in the TTY console window where I invoked mochitest, and no visible
change on the screen of TB), I traced the system calls the thunderbird process invokes and found a repetition of futex waits, recv-message
calls that fail, etc. I am sure this is due to the timeout of some operation. But I have a nagging feeling that the futex waiting by 1500+ threads may be adding to the superficial no-op behavior.

Now, in the mozilla dev-platform mailing list, there was a mention of the multi-process mode of operation of mozilla firefox vs. the single-process(?) operation.

https://www.mail-archive.com/dev-platform@lists.mozilla.org/msg29140.html
Proposal: remove support for running desktop Firefox in single-process mode (e10s disabled) anywhere but in tests

The essence of the discussion is as follows.

  • Firefox developers may want to ditch single-process operation in favor of multi-process operation.
  • Supporting two different modes of operation on tryserver machines is difficult, according to the discussion.
  • However, multi-process operation may make debugging very difficult.

What caught my eye was the mention that TB does not support multi-process operation. Since the strange requirement of 1500+
threads and the observation of the futex wait storm puzzled me, this "multi-process" issue rang a bell.

In one of the posts,
Gijs Kruitbosch <gijskruitbosch@gmail.com> said:

I can't speak for Thunderbird's plans, but either way these plans shouldn't affect them and is restricted to desktop Firefox; the pref still works there: https://searchfox.org/mozilla-central/rev/4bb2401ecbfce89af06fb2b4d0ea3557682bd8ff/toolkit/xre/nsAppRunner.cpp#5020-5024 , and they set it: https://searchfox.org/comm-central/rev/e62a0af3cba6e1c65b2d4be02d3fefce88cf3f8f/mail/app/profile/all-thunderbird.js#654

The latter refers to the following lines.

// No e10s in Thunderbird for now.
pref("browser.tabs.remote.autostart", false); <=== This

So, according to what was said, we have to set the preference |browser.tabs.remote.autostart| to |false| to make sure TB works in a sane manner.

Now, from my experience with the now-defunct older mozmill-based test suite of TB (invoked by |make mozmill|), C-C TB tests create preference files (user.js and prefs.js) during execution and use them during the tests (and modify them from time to time; this makes
analysis a bit difficult).

I think the files were created under the /tmp directory with the |make mozmill| test suite, but the current mochitest of C-C TB now seems to create
the file in a different temporary directory.

I set the following environment variables before the execution of mochitest, and mochitest seems to honor the setting for the temporary
user.js creation.
export TMP=/COMM-CENTRAL/TMP-DIR
export TMPDIR=/COMM-CENTRAL/TMP-DIR

In the latest local mochitest run that I started, the temporary user.js seems to reside under
a dynamically created directory of the following form: /COMM-CENTRAL/TMP-DIR/tmpuC7FB1.mozrunner/user.js
(tmpuC7FB1.mozrunner is a temporary directory that is dynamically created/destroyed)

I noticed that this temporary directory is destroyed and a new user.js is created under another temporary directory, for example:
/COMM-CENTRAL/TMP-DIR/tmpDxIPaU.mozrunner/user.js

It seems these temporary directories get created afresh when a new set of tests starts. So many user.js files are created during a C-C TB mochitest run.

The above directory where user.js is created was found by searching the log file for places where user.js is explicitly mentioned.
In a previous test, there were a few lines that looked like the following:

113:37.57 GECKO(1073015)   file:///COMM-CENTRAL/TMP-DIR/tmpODI60Z.mozrunner/user.js

Such lines were reported in the LEAKED URL section of the mochitest report (!)

So I searched for the latest creation under that directory and found user.js under a temporary directory created there. The important thing
here is that these user.js files contain a strange, conflicting setting (more on this later).

BTW, I found the following |user.js| files under the MOZ_OBJ directory while searching for user.js (see the Appendix). Among them,

/NEW-SSD/moz-obj-dir/objdir-tb3/temp/user.js

seems to be the main user profile, if I am not mistaken.

I checked the discovered user.js and found a few disturbing things.

(1) The dynamically generated user.js files created during C-C TB
mochitest execution contain a conflicting setting:

pref("browser.tabs.remote.autostart", {true|false});

The value is either true or false; it is not consistent. In the mochitest of C-C TB, there may be a few very browser-like tests. In those cases,
|browser.tabs.remote.autostart| can be true? However, if we ALWAYS need to set the value to false for the C-C TB mochitest,

    >> // No e10s in Thunderbird for now.
    >> pref("browser.tabs.remote.autostart", false); <=== This

we may be in big trouble.

I observe that there are multiple lines setting "browser.tabs.remote.autostart" in a single user.js, with different
values. Of course, the last one sets the value actually used. An inquiring mind wants to know how the setting is composed (presumably by concatenating some
settings prepackaged in user.js files in the test file directories? I have noticed that there are user.js files in the test directories; see (3)
below). A clear description of how the run-time user.js is created for each test during a C-C TB mochitest run would be desirable.
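A quick way to see what a generated profile actually ends up with is to list every autostart line in file order (the last one wins). The tmp*.mozrunner pattern is the one observed above; the actual directory name changes from one test set to the next:

# List the conflicting settings, in file order, in the current temporary profile.
grep -n 'browser\.tabs\.remote\.autostart' /COMM-CENTRAL/TMP-DIR/tmp*.mozrunner/user.js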

In any case, I DID find cases where sub-tests (albeit web-browser-oriented?) run with a user.js containing
|pref("browser.tabs.remote.autostart", true);|, which may not sit well with the C-C thunderbird binary.

I have a nagging feeling that this bad setting may be one of the causes of the failure to run valgrind, because we need more than 1500
threads to run the C-C TB mochitest (!)

(2) The user.js file under MOZ_OBJ/temp, /NEW-SSD/moz-obj-dir/objdir-tb3/temp/user.js, does NOT set |pref("browser.tabs.remote.autostart", false);|

I wonder whether this user.js in the MOZ_OBJ directory is used during the C-C mochitest. If it is used, I think we are in big trouble, since it does not set the value to false.

(3) There are a few |user.js| files with conflicting settings in the test directories. Some set the pref to false, but some set it to true.
I hope they are used appropriately. Since the C-C mochitest works more or less, I believe they are used correctly, but clear
documentation/explanation of how each is used, and of how each test creates user.js dynamically from the prepackaged user.js files for the TB
mochitest, would be desirable.

Appendix:

Trying to locate files named |user.js| under MOZ_OBJ and finding out the setting of "browser.tabs.remote.autostart".

I changed my current working directory to the MOZ_OBJ directory, /NEW-SSD/moz-obj-dir/objdir-tb3, so that the listing of file paths is shorter.

ishikawa@ip030:/NEW-SSD/moz-obj-dir/objdir-tb3$ ls /NEW-SSD/moz-obj-dir/
./  ../  objdir-tb3/
ishikawa@ip030:/NEW-SSD/moz-obj-dir/objdir-tb3$ cd /NEW-SSD/moz-obj-dir/objdir-tb3/
ishikawa@ip030:/NEW-SSD/moz-obj-dir/objdir-tb3$ pwd
/NEW-SSD/moz-obj-dir/objdir-tb3
ishikawa@ip030:/NEW-SSD/moz-obj-dir/objdir-tb3$ find . -name "user*.js" -print
./dist/bin/chrome/devtools/modules/devtools/client/inspector/rules/models/user-properties.js
./dist/bin/modules/matrix/matrix_sdk/models/user.js
./temp/user.js     <===== This is used for testing(?)
./_tests/web-platform/prefs/unittest-features/user.js
./_tests/web-platform/prefs/web-platform/user.js
./_tests/web-platform/prefs/perf/user.js
./_tests/web-platform/prefs/unittest-required/user.js
./_tests/web-platform/prefs/mochitest/user.js
./_tests/web-platform/prefs/base/user.js
./_tests/web-platform/prefs/geckoview-junit/user.js
./_tests/web-platform/prefs/reftest/user.js
./_tests/web-platform/prefs/common/user.js
./_tests/web-platform/prefs/profileserver/user.js
./_tests/mozbase/mozprofile/tests/files/dummy-profile/user.js
./_tests/reftest/profile_data/unittest-features/user.js
./_tests/reftest/profile_data/web-platform/user.js
./_tests/reftest/profile_data/perf/user.js
./_tests/reftest/profile_data/unittest-required/user.js
./_tests/reftest/profile_data/mochitest/user.js
./_tests/reftest/profile_data/base/user.js
./_tests/reftest/profile_data/geckoview-junit/user.js
./_tests/reftest/profile_data/reftest/user.js
./_tests/reftest/profile_data/common/user.js
./_tests/reftest/profile_data/profileserver/user.js
./_tests/testing/mochitest/profile_data/unittest-features/user.js
./_tests/testing/mochitest/profile_data/web-platform/user.js
./_tests/testing/mochitest/profile_data/perf/user.js
./_tests/testing/mochitest/profile_data/unittest-required/user.js
./_tests/testing/mochitest/profile_data/mochitest/user.js
./_tests/testing/mochitest/profile_data/base/user.js
./_tests/testing/mochitest/profile_data/geckoview-junit/user.js
./_tests/testing/mochitest/profile_data/reftest/user.js
./_tests/testing/mochitest/profile_data/common/user.js
./_tests/testing/mochitest/profile_data/profileserver/user.js
ishikawa@ip030:/NEW-SSD/moz-obj-dir/objdir-tb3$ find . -name "pref*.js" -print
./dist/bin/chrome/toolkit/content/global/preferencesBindings.js
./dist/bin/chrome/marionette/content/prefs.js
./dist/bin/chrome/messenger/content/messenger/preferences/preferencesTab.js
./dist/bin/chrome/messenger/content/messenger/preferences/preferences.js
./dist/bin/chrome/messenger/content/messenger/addressbook/pref-editdirectories.js
./dist/bin/chrome/messenger/content/messenger/addressbook/pref-directory-add.js
./dist/bin/chrome/devtools/modules/devtools/client/fronts/preference.js
./dist/bin/chrome/devtools/modules/devtools/client/performance/test/helpers/prefs.js
./dist/bin/chrome/devtools/modules/devtools/client/debugger/src/utils/prefs.js
./dist/bin/chrome/devtools/modules/devtools/client/netmonitor/src/middleware/prefs.js
./dist/bin/chrome/devtools/modules/devtools/client/netmonitor/src/utils/prefs.js
./dist/bin/chrome/devtools/modules/devtools/client/shared/prefs.js
./dist/bin/chrome/devtools/modules/devtools/client/webconsole/selectors/prefs.js
./dist/bin/chrome/devtools/modules/devtools/client/webconsole/utils/prefs.js
./dist/bin/chrome/devtools/modules/devtools/client/webconsole/reducers/prefs.js
./dist/bin/chrome/devtools/modules/devtools/client/performance-new/preference-management.js
./dist/bin/chrome/devtools/modules/devtools/server/actors/preference.js
./dist/bin/chrome/devtools/modules/devtools/shared/specs/preference.js
./dist/bin/defaults/autoconfig/prefcalls.js
./temp/prefs.js <--- This is an empty file.
./_tests/mozbase/mozprofile/tests/files/dummy-profile/prefs.js
./_tests/mozbase/mozprofile/tests/files/prefs_with_interpolation.js
./_tests/mozbase/mozprofile/tests/files/prefs_with_comments.js
ishikawa@ip030:/NEW-SSD/moz-obj-dir/objdir-tb3$ find . -name "pref*.js" -print | xargs egrep "autostart"
     comment: So prefs.js does not have any relevant settings. What
     interests me is the following.
ishikawa@ip030:/NEW-SSD/moz-obj-dir/objdir-tb3$ find . -name "user*.js" -print | xargs egrep "(e10|autostart)"
./_tests/web-platform/prefs/perf/user.js:user_pref("browser.tabs.remote.autostart", true);
./_tests/web-platform/prefs/unittest-required/user.js:user_pref("accessibility.typeaheadfind.autostart", false);
./_tests/web-platform/prefs/unittest-required/user.js:// Don't use auto-enabled e10s
./_tests/web-platform/prefs/unittest-required/user.js:user_pref("browser.tabs.remote.autostart", false);
./_tests/web-platform/prefs/geckoview-junit/user.js:// Always run in e10s
./_tests/web-platform/prefs/geckoview-junit/user.js:user_pref("browser.tabs.remote.autostart", true);
./_tests/reftest/profile_data/perf/user.js:user_pref("browser.tabs.remote.autostart", true);
./_tests/reftest/profile_data/unittest-required/user.js:user_pref("accessibility.typeaheadfind.autostart", false);
./_tests/reftest/profile_data/unittest-required/user.js:// Don't use auto-enabled e10s
./_tests/reftest/profile_data/unittest-required/user.js:user_pref("browser.tabs.remote.autostart", false);
./_tests/reftest/profile_data/geckoview-junit/user.js:// Always run in e10s
./_tests/reftest/profile_data/geckoview-junit/user.js:user_pref("browser.tabs.remote.autostart", true);
./_tests/testing/mochitest/profile_data/perf/user.js:user_pref("browser.tabs.remote.autostart", true);
./_tests/testing/mochitest/profile_data/unittest-required/user.js:user_pref("accessibility.typeaheadfind.autostart", false);
./_tests/testing/mochitest/profile_data/unittest-required/user.js:// Don't use auto-enabled e10s
./_tests/testing/mochitest/profile_data/unittest-required/user.js:user_pref("browser.tabs.remote.autostart", false);
./_tests/testing/mochitest/profile_data/geckoview-junit/user.js:// Always run in e10s
./_tests/testing/mochitest/profile_data/geckoview-junit/user.js:user_pref("browser.tabs.remote.autostart", true);
ishikawa@ip030:/NEW-SSD/moz-obj-dir/objdir-tb3$ 

The inconsistency needs investigation.

This is a copy of one of the dynamically generated user.js files from a mochitest run of C-C TB.
I think this is for a web-oriented test. But I am not sure whether the TB binary can run in multi-process mode.
Can the web engine run in multi-process mode while the rest of the C-C TB binary cannot?

Well, Thunderbird runs neither web-platform nor reftest, so I have no idea where you see a problem. It's possible something is copied into the obj-dir, but if it's not run, it's not run.

I will check WHICH test uses the user_pref that enables e10s mode of operation.
It is really strange.

We turn off e10s in mochitest where necessary, that is, where the test expects to load and display a message. Here's all of the test manifests where that happens. We have to set the preference in the test manifest to override the test default settings, which in turn override the setting in all-thunderbird.js. That's probably what you're seeing in the user.js files, though I've never looked into how it actually works.

(In reply to Geoff Lankow (:darktrojan) from comment #14)

We turn off e10s in mochitest where necessary, that is, where the test expects to load and display a message. Here's all of the test manifests where that happens. We have to set the preference in the test manifest to override the test default settings, which in turn override the setting in all-thunderbird.js. That's probably what you're seeing in the user.js files, though I've never looked into how it actually works.

Sorry, I got side-tracked with a strange build failure involving sccache.
(Bug 1649037, rustc cargo compilation failure in mozilla/third_party/rust/dns-parser;
disabling sccache made the build succeed again.)

Finally, I was able to rebuild C-C TB locally yesterday.
I have yet to capture the details of the tests and the e10s mode.
I just noticed that there are telltale signs of some programs/routines (not sure which) that seem to use the e10s operation mode.
Here is the output of

grep -i mode: /FF-NEW/log1242-mochitest.txt 
 0:41.39 INFO Mode:    non-e10s
	Mode: e10s
 3:26.64 INFO Mode:    non-e10s
13:10.53 INFO Mode:    non-e10s
14:37.39 INFO Mode:    non-e10s
15:59.21 INFO Mode:    non-e10s

... 45 such lines in total ...

154:54.23 INFO Mode:    non-e10s
157:15.18 INFO Mode:    non-e10s
158:17.36 INFO Mode:    non-e10s
	Mode: e10s

Maybe the number 46 (45 + 1, the first line) is twice the number of files listed in " Here's all of the test manifests where that happens. "
But then what about the other |Mode: e10s| lines?

The last one is in this block:

Browser Chrome Test Summary
	Passed: 8315
	Failed: 18
	Todo: 22
	Mode: e10s

It seems to print the total summary? (not sure).

The first one is a mystery: output of

egrep "(SUITE|TEST_START|TEST_END|Mode:)" log1242-mochitest.txt 
 0:02.68 SUITE_START: mochitest-browser - running 1 tests
 0:27.77 TEST_START: comm/mail/components/enterprisepolicies/tests/browser/hardware_acceleration/browser_policy_hardware_acceleration.js
 0:27.81 TEST_END: Test OK. Subtests passed 1/1. Unexpected 0
 0:41.39 INFO Mode:    non-e10s
	Mode: e10s
 0:45.92 SUITE_END

And the "Mode: e10s" appears as

TEST-INFO | checking window state
Browser Chrome Test Summary
	Passed: 1
	Failed: 0
	Todo: 0
	Mode: e10s
*** End BrowserChrome Test Results ***
 0:45.92 INFO Buffered messages finished
 0:45.92 SUITE_END

Why the test prints Mode: e10s is a bit of a mystery, but I think the user.js I saw that enables e10s must come from this operation (maybe not from the individual test but from the mochitest framework itself?).

I still don't know whether the e10s mode of operation is one of the causes of the valgrind failure.
But the fact that "Mode: e10s" is printed this close to the start of the mochitest run (the first SUITE, whatever that is) makes me a bit more suspicious.

I am curious: does mochitest need a working firefox?
I noticed that if I have a firefox running before invoking mochitest, I see a log line warning that a browser is already running.
I wonder whether the "Mode: e10s" line comes from a browser engine inside the TB binary used as a browser replacement in the mochitest framework.

mochitests do not need any external Firefox. What's run is run through Thunderbird code.

The warning I mentioned is as follows:

WARNING Found 'firefox' running before starting test browser! 

I wonder if "mode: e10s" line comes from a browser engine inside TB binary used as a browser replacement in mochitest framework.

So maybe the thunderbird binary, as driven by the test code or the mochitest framework code, prints at the end

Mode: e10s

The user.js files scattered in the tree seem to be appended one by one, overriding the common settings, to run each test.

I need to investigate more.

I have no idea why FF seems to have successful valgrind runs while TB does not.
There has to be a different setting somewhere.

Blocks: 1364543

Great discovery.
After so many months (actually a few years) of not being able to run thunderbird under valgrind as an ordinary user (under Debian GNU/Linux),
I could today.
To be exact, the versions are as follows.

ishikawa@ip030:/home/ishikawa$ uname -a
Linux ip030 5.10.0-4-amd64 #1 SMP Debian 5.10.19-1 (2021-03-02) x86_64 GNU/Linux
ishikawa@ip030:/home/ishikawa$ valgrind --version
valgrind-3.16.1
ishikawa@ip030:/home/ishikawa$ 

I had even suspected a bug in VirtualBox, where my Linux instance runs.
The VirtualBox version is 6.1.18.

For the last few years, up until several months ago, the valgrind + TB combination produced a mysterious segmentation fault of valgrind itself.
A small binary would run without an issue, but the large TB binary didn't. :-(

So being able to run the valgrind and TB combination as an ordinary user opens up a much easier avenue of debugging.
(I didn't want to do messy things as superuser under Linux.)

I can possibly run valgrind's built-in debugger support (vgdb), place a breakpoint on the clone() system call that is used to create a new thread, and
see where all those 1500+ threads are created. Possibly.
I have yet to figure out how to automate the "breakpoint reached. Create a stack dump. Continue" loop for more than 1500 thread creations.

If anyone has an idea on this, please let me know.
It seems there is built-in support for scripting (Python, I believe), but I am not sure.
I have not programmed gdb command files for quite some time now.
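One way to automate the loop, whether gdb is attached through vgdb or run directly, is a plain gdb command file, roughly like the sketch below (untested here; clone-bp.gdb and clone-backtraces.log are just names I made up):

cat > clone-bp.gdb <<'EOF'
set pagination off
set logging file clone-backtraces.log
set logging on
break clone
commands
  silent
  bt 15
  continue
end
EOF
# Then, inside the gdb session:  (gdb) source clone-bp.gdb

The |commands ... continue ... end| block attached to the breakpoint is what turns each hit into an automatic stack dump.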

I believe I have to reduce this # of threads to a reasonable level.

A mochitest run under valgrind cannot even run a single test, due to a timeout early in the test sequence.
I think one reason for this timeout is the group of many threads vying for the futex, which uses up so much elapsed time on valgrind's simulated CPU.

59:16.41 GECKO(32022) ==32026== Warning: set address range perms: large range [0x184f054000, 0x1c4f054000) (noaccess)
59:56.57 GECKO(32022) ==32026== Warning: set address range perms: large range [0x184f054000, 0x1c4f054000) (noaccess)
60:12.21 INFO Buffered messages finished
60:12.26 INFO TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 2960 seconds with no output
60:12.28 ERROR Force-terminating active process(es).
60:12.30 INFO Determining child pids from psutil...
60:12.55 INFO [32023]
60:12.59 INFO ==> process 32026 launched child process 32093
60:12.59 INFO Found child pids: {32093, 32023}
60:12.63 INFO Failed to get child procs
60:12.63 INFO Killing process: 32093
60:12.63 INFO Not taking screenshot here: see the one that was previously logged
60:12.65 INFO Can't trigger Breakpad, just killing process
60:12.65 INFO Error: Failed to kill process 32093: psutil.NoSuchProcess no process found with pid 32093
60:12.65 INFO Killing process: 32023
60:12.65 INFO Not taking screenshot here: see the one that was previously logged
60:12.65 INFO Can't trigger Breakpad, just killing process
60:42.79 INFO psutil found pid 32023 dead
60:42.79 INFO failed to kill pid 32026 after 30s
60:42.79 INFO Killing process: 32022
60:42.79 INFO Not taking screenshot here: see the one that was previously logged
60:42.79 INFO Can't trigger Breakpad, just killing process
60:42.86 INFO psutil found pid 32022 dead
60:42.86 INFO psutil found pid 32022 dead

Come to think of it, I don't even need valgrind for this; I can run gdb as the debugger, invoke the test from mach, and see where the 1500+ threads are created, etc.
In any case, I need the automated breakpoint-reached-dump-stack-and-continue loop for 1500+ thread creations to figure out where the problem lies.

I could, of course, try to see how many threads are created and trace that under valgrind (valgrind has the vgdb feature to invoke gdb and attach it to the process), but that would take a long time.
The valgrind slowdown is not something to be taken lightly.

But come to think of it, I want to trace where the threads are created during a mochitest run of TB.

This means I do not need to run mochitest with TB running under valgrind;
I can run mochitest with TB running under gdb, which is basically full machine speed. Much better.

I wonder how to do that, though.
I vaguely recall that |mach| can invoke a debugger when it runs a test, but I am not quite sure whether that is available for TB.
Tough. Many features of |mach| have not been tested for TB and may need tweaking of the scripts.

I will report what I can find out.

I modified how mochitest is invoked, as follows.
Below, the script do-make-tb.sh sets up various environment variables that fit my local build environment and invokes |mach| with the arguments passed to the script.

I basically added |--debugger gdb| to the invocation of mochitest.

MOCHITESTDEBUGGER="--debugger gdb"
# mach mochitest mail/components/extensions/test/browser
time /NREF-COMM-CENTRAL/mozilla/do-make-tb.sh   --log-no-times mochitest ${MOCHITESTDEBUGGER} $*

With this, the gdb prompt appeared after a while (this delay is something I am not quite sure about; I wonder what mochitest is doing here).
I then set a breakpoint at |clone()|, since I suspect that is where I should investigate to figure out where the threads are created.
So I followed the command sequence in
https://blog.mozilla.org/nnethercote/2015/02/13/using-gdb-to-get-stack-traces-at-particular-program-points/
I did not type |set logging on| since I am recording the whole session with the |script| command.
Maybe I should use |set logging on|: it may be faster, since output goes to a file with buffering and all that, while
interacting directly with the visual console may incur overhead.

Anyway, right now the test is running under gdb with lots of stack traces.
I hope I can figure out where the extra threads are created. (One thing I am afraid of is that they are created from JavaScript, and I won't know from which JavaScript file the thread creation is requested.) But I will see.

My tentative guess was that there is a routine that expects to create a thread.
BUT, given the heavy workload on the try server farm, there is often resource starvation, which makes the thread creation fail on the first try, although
it succeeds after a timeout period or something. Because valgrind slows down the execution so much, the response from clone() is passed back to the caller only after a long delay, this retry kicks in, and the result is so many threads.
That was my guess previously.

As I look at the busy output on the console, I am not sure that was the case.
TB mochitest may require this many threads by design. But why more than 1500 (!?)

I will report what I find. Stay tuned.

This is from about the dozenth or so run of the test. TB has not even shown its main window on the screen yet, due to the delay caused by the
busy stacktrace dumping.
The following is the output from the tentative script log (it is still accumulating output):
grep "hit Breakpoint 1" gdb-09-12-TB-d.txt | sort | uniq -c | sort -n -r -k 1 | head -20
It shows the 20 most frequent hits.

Wait. Do I have to qualify "clone" with a proper prefix? Maybe I am simply stopping at every function named clone in its own scope, and not just at the real clone system call used to create a thread. (See the note after the listing below.)

If I don't have to qualify the system call |clone| with a proper prefix, then
I suspect the first one is very likely a JavaScript engine invocation.
What is the second one, |NonHeapSizeOfExcludingThis|?
Also, I see lots of SharedPropMap::clone hits with different |length| arguments. What are these?
All I can say is that, in contrast to the old |make mozmill| test suite, the
mochitest framework seems to rely on many JavaScript threads.

But then again, I may be counting extra ones due to the name conflict.

   1941 Thread 1 "GeckoMain" hit Breakpoint 1, js::FunctionScope::clone (cx=0x555555a49930, scope=scope@entry=..., fun=fun@entry=..., enclosing=enclosing@entry=...) at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/js/src/vm/Scope.cpp:1102
   1703 Thread 1 "GeckoMain" hit Breakpoint 1, mozilla::VolatileBuffer::NonHeapSizeOfExcludingThis (this=0x5555558179c8) at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/memory/volatile/VolatileBufferFallback.cpp:65
   1487 Thread 1 "GeckoMain" hit Breakpoint 1, js::Scope::clone (cx=0x555555a49930, scope=scope@entry=..., enclosing=enclosing@entry=...) at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/js/src/vm/Scope.cpp:568
   1214 Thread 1 "GeckoMain" hit Breakpoint 1, js::SharedPropMap::clone (cx=0x555555a49930, map=..., length=1) at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/js/src/vm/PropMap.cpp:85
    505 Thread 1 "GeckoMain" hit Breakpoint 1, js::SharedPropMap::clone (cx=0x555555a49930, map=..., length=2) at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/js/src/vm/PropMap.cpp:85
    272 Thread 1 "GeckoMain" hit Breakpoint 1, js::SharedPropMap::clone (cx=0x555555a49930, map=..., length=3) at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/js/src/vm/PropMap.cpp:85
    154 Thread 1 "GeckoMain" hit Breakpoint 1, alloc::boxed::{{impl}}::clone () at library/alloc/src/boxed.rs:1083
    107 Thread 1 "GeckoMain" hit Breakpoint 1, style::stylesheets::UrlExtraData::ptr (self=0x7ffff53ea2e8 <geckoservo::glue::DUMMY_URL_DATA>) at servo/components/style/stylesheets/mod.rs:177
    106 Thread 1 "GeckoMain" hit Breakpoint 1, clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:52
    104 Thread 1 "GeckoMain" hit Breakpoint 1, js::SharedPropMap::clone (cx=0x555555a49930, map=..., length=4) at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/js/src/vm/PropMap.cpp:85
     89 Thread 1 "GeckoMain" hit Breakpoint 1, mozilla::VolatileBuffer::NonHeapSizeOfExcludingThis (this=0x555556f1eeb0) at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/memory/volatile/VolatileBufferFallback.cpp:65
     79 Thread 1 "GeckoMain" hit Breakpoint 1, style::stylesheets::UrlExtraData::ptr (self=0x7fffffffa4b0) at servo/components/style/stylesheets/mod.rs:177
     53 Thread 73 "DOM Worker" hit Breakpoint 1, js::SharedPropMap::clone (cx=0x7fff340dc090, map=..., length=1) at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/js/src/vm/PropMap.cpp:85
     49 Thread 1 "GeckoMain" hit Breakpoint 1, js::GlobalScope::clone (cx=0x555555a49930, scope=scope@entry=...) at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/js/src/vm/Scope.cpp:1311
     33 Thread 1 "GeckoMain" hit Breakpoint 1, mozilla::VolatileBuffer::NonHeapSizeOfExcludingThis (this=0x555556cdf750) at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/memory/volatile/VolatileBufferFallback.cpp:65
     32 Thread 64 "DOM Worker" hit Breakpoint 1, js::SharedPropMap::clone (cx=0x7fff2c0a0ec0, map=..., length=1) at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/js/src/vm/PropMap.cpp:85
     31 Thread 1 "GeckoMain" hit Breakpoint 1, mozilla::VolatileBuffer::NonHeapSizeOfExcludingThis (this=0x555556a5e9c0) at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/memory/volatile/VolatileBufferFallback.cpp:65
     26 Thread 1 "GeckoMain" hit Breakpoint 1, js::SharedPropMap::clone (cx=0x555555a49930, map=..., length=6) at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/js/src/vm/PropMap.cpp:85
     20 Thread 1 "GeckoMain" hit Breakpoint 1, core::sync::atomic::atomic_load<usize> (dst=<optimized out>, order=core::sync::atomic::Ordering::Relaxed) at /rustc/53cb7b09b00cbea8754ffb78e7e3cb521cb8af4b/library/core/src/sync/atomic.rs:2354
     18 Thread 1 "GeckoMain" hit Breakpoint 1, js::SharedPropMap::clone (cx=0x555555a49930, map=..., length=7) at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/js/src/vm/PropMap.cpp:85

I stopped the tracing by typing control-C in gdb, and then typed |del 1| and |cont|.
TB's main window shows up now, and the test is proceeding at full speed.

BTW, the stacktrace does suggest NonHeapSizeOfExcludingThis is related to thread creation for event handling, but I am not sure.

It seems that I will need a lot of analysis to eventually figure out why mochitest under valgrind does not proceed very well, even though I set the timeout to 30 minutes in some key places. Nothing visible is shown on the screen when I run TB under valgrind.
The excessive thread creation is a starting point, but I still have no idea where this will lead.

Thread 1 "GeckoMain" hit Breakpoint 1, mozilla::VolatileBuffer::NonHeapSizeOfExcludingThis (this=0x5555558179c8) at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/memory/volatile/VolatileBufferFallback.cpp:65
65	size_t VolatileBuffer::NonHeapSizeOfExcludingThis() const { return 0; }
#0  mozilla::VolatileBuffer::NonHeapSizeOfExcludingThis() const (this=0x5555558179c8) at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/memory/volatile/VolatileBufferFallback.cpp:65
#1  0x00007fffeb05ddb6 in nsThread::ProcessNextEvent(bool, bool*) (this=0x5555556bc0c0, aMayWait=<optimized out>, aResult=0x7fffffffb147) at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/xpcom/threads/nsThread.cpp:1176
#2  0x00007fffeb03ea7a in NS_ProcessNextEvent(nsIThread*, bool) (aThread=<optimized out>, aThread@entry=0x5555556bc0c0, aMayWait=aMayWait@entry=false) at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/xpcom/threads/nsThreadUtils.cpp:466
#3  0x00007fffeb755e7a in mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) (this=0x5555557026c0, aDelegate=0x5555556bf7e0) at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/ipc/glue/MessagePump.cpp:85
#4  0x00007fffeb6fa179 in MessageLoop::RunHandler() (this=<optimized out>) at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/ipc/chromium/src/base/message_loop.cc:324
#5  MessageLoop::Run() (this=<optimized out>) at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/ipc/chromium/src/base/message_loop.cc:306
#6  0x00007fffee1bb649 in nsBaseAppShell::Run() (this=0x5555558179c0) at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/widget/nsBaseAppShell.cpp:137
#7  0x00007fffef738e7a in nsAppStartup::Run() (this=0x555555813b20) at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/toolkit/components/startup/nsAppStartup.cpp:284
#8  0x00007fffef83957e in XREMain::XRE_mainRun() (this=<optimized out>) at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/toolkit/xre/nsAppRunner.cpp:5291
#9  0x00007fffef83aa1d in XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&) (this=0x7fffffffb510, argc=<optimized out>, argv=<optimized out>, aConfig=<optimized out>) at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/toolkit/xre/nsAppRunner.cpp:5476
#10 0x00007fffef83b339 in XRE_main(int, char**, mozilla::BootstrapConfig const&) (argc=5, argv=0x7fffffffc808, aConfig=...) at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/toolkit/xre/nsAppRunner.cpp:5535
#11 0x000055555556aa34 in do_main(int, char**, char**) (argc=<optimized out>, argv=<optimized out>, envp=0x7fffffffc838) at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mail/app/nsMailApp.cpp:229
#12 0x0000555555569ca3 in main(int, char**, char**) (argc=5, argv=0x7fffffffc808, envp=0x7fffffffc838) at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mail/app/nsMailApp.cpp:368

After trial and error, I noticed this line in the log. (I had not seen this before? I see. Now I am running valgrind using the |mach
mochitest --valgrind ...| feature. Before, I created a wrapper program that invokes |thunderbird-bin| under valgrind, renamed
this wrapper to |thunderbird| in the dist/bin directory, and ran mochitest. I think my wrapper used slightly different options, and I did not see
the message due to timing differences?):

socket.timeout: Timed out waiting for connection on 127.0.0.1:2828!

I suppose the thunderbird binary is listening on port 2828.

This timeout error message is printed from mozilla/testing/marionette/client/marionette_driver/marionette.py.
I changed the line to print out the timeout used, and found out it was 900:

socket.timeout: Timed out waiting for connection on 127.0.0.1:2828! timeout was 900

This may be too short for TB under valgrind today. So I upped it to 1800 and am continuing with testing.

diff --git a/testing/marionette/client/marionette_driver/marionette.py b/testing/marionette/client/marionette_driver/marionette.py
--- a/testing/marionette/client/marionette_driver/marionette.py
+++ b/testing/marionette/client/marionette_driver/marionette.py
@@ -559,16 +559,19 @@ class Marionette(object):
         :param timeout: Optional timeout in seconds for the server to be ready.
         :param check_process_status: Optional, if `True` the process will be
             continuously checked if it has exited, and the connection
             attempt will be aborted.
         """
         if timeout is None:
             timeout = self.startup_timeout
 
+        # valgrind
+        timeout = 1800
+
         runner = None
         if self.instance is not None:
             runner = self.instance.runner
 
         poll_interval = 0.1
         starttime = datetime.datetime.now()
         timeout_time = starttime + datetime.timedelta(seconds=timeout)
 
@@ -591,18 +594,18 @@ class Marionette(object):
             time.sleep(poll_interval)
 
         if not connected:
             # There might have been a startup crash of the application
             if runner is not None and self.check_for_crash() > 0:
                 raise IOError("Process crashed (Exit code: {})".format(runner.wait(0)))
 
             raise socket.timeout(
-                "Timed out waiting for connection on {0}:{1}!".format(
-                    self.host, self.port
+                "Timed out waiting for connection on {0}:{1}! timeout was {2}".format(
+                    self.host, self.port, timeout
                 )
             )
 
     @do_process_check
     def _send_message(self, name, params=None, key=None):
         """Send a blocking message to the server.
 
         Marionette provides an asynchronous, non-blocking interface and
ishikawa@ip030:/NREF-COMM-CENTRAL/mozilla$ 

Interesting.

With the change to the timeout in marionette.py in the previous comment (comment 26) and slightly different parameters passed to valgrind, the first few tests in the mochitest suite seem to proceed to a point
where the calendar test is shown on the screen, and they proceed very slowly.
(There are still errors, obviously due to timeouts, but that can be handled with longer timeouts where it matters, I hope.)

But I am not holding my breath. Many tests failed due to timeouts (see the excerpt from the log at the end). I have not analyzed much yet. And based on my previous experience, I think the full test suite runs for something like 2 days. Yes, DAYS (not hours).

I have no idea what else has changed in M-C/C-C to make this possible.
But this is a great start.

Here are the results from the first several tests; the whole mochitest run is still in progress.
Output from

egrep "(Subtests passed|Running on|/mach)" t9.list 

(<- t9.list contains a partial log from the valgrind run).

time /NREF-COMM-CENTRAL/mozilla/mach   --log-no-times mochitest --valgrind valgrind --valgrind-args='--trace-children=yes --max-threads=5000 --max-stackframe=16000000 --num-transtab-sectors=24 --tool=memcheck --freelist-vol=500000000 --redzone-size=128 --px-default=allregs-at-mem-access --px-file-backed=unwindregs-at-mem-access --malloc-fill=0xA5 --free-fill=0xC3 --num-callers=50 --suppressions=/home/ishikawa/Dropbox/myown.sup --show-mismatched-frees=no --show-possibly-lost=no --read-inline-info=yes --fair-sched=yes --smc-check=all-non-file --gen-suppressions=all -v --trace-signals=yes --vex-iropt-register-updates=allregs-at-mem-access --child-silent-after-fork=yes --suppressions=/home/ishikawa/Dropbox/myown.sup --trace-children-skip=/usr/bin/lsb_release,/usr/bin/hg,/bin/rm,*/bin/certutil,*/bin/pk12util,*/bin/ssltunnel,*/bin/uname,*/bin/which,*/bin/ps,*/bin/grep,*/bin/java,*/fix-stacks,*/firefox/firefox,*/bin/firefox-esr,*/bin/python,*/bin/python2,*/bin/python3,*/bin/python2.7,*/bin/bash,*/bin/nodejs,*/bin/node,*/bin/xpcshell,python3' $*
 0:01.78 INFO ishikawa	447031	446969 75 09:22 pts/4	 00:00:21 python3 /NREF-COMM-CENTRAL/mozilla/mach --log-no-times mochitest --valgrind valgrind --valgrind-args=--trace-children=yes --max-threads=5000 --max-stackframe=16000000 --num-transtab-sectors=24 --tool=memcheck --freelist-vol=500000000 --redzone-size=128 --px-default=allregs-at-mem-access --px-file-backed=unwindregs-at-mem-access --malloc-fill=0xA5 --free-fill=0xC3 --num-callers=50 --suppressions=/home/ishikawa/Dropbox/myown.sup --show-mismatched-frees=no --show-possibly-lost=no --read-inline-info=yes --fair-sched=yes --smc-check=all-non-file --gen-suppressions=all -v --trace-signals=yes --vex-iropt-register-updates=allregs-at-mem-access --child-silent-after-fork=yes --suppressions=/home/ishikawa/Dropbox/myown.sup --trace-children-skip=/usr/bin/lsb_release,/usr/bin/hg,/bin/rm,*/bin/certutil,*/bin/pk12util,*/bin/ssltunnel,*/bin/uname,*/bin/which,*/bin/ps,*/bin/grep,*/bin/java,*/fix-stacks,*/firefox/firefox,*/bin/firefox-esr,*/bin/python,*/bin/python2,*/bin/python3,*/bin/python2.7,*/bin/bash,*/bin/nodejs,*/bin/node,*/bin/xpcshell,python3
 0:01.83 INFO ishikawa	447031	446969 75 09:22 pts/4	 00:00:21 python3 /NREF-COMM-CENTRAL/mozilla/mach --log-no-times mochitest --valgrind valgrind --valgrind-args=--trace-children=yes --max-threads=5000 --max-stackframe=16000000 --num-transtab-sectors=24 --tool=memcheck --freelist-vol=500000000 --redzone-size=128 --px-default=allregs-at-mem-access --px-file-backed=unwindregs-at-mem-access --malloc-fill=0xA5 --free-fill=0xC3 --num-callers=50 --suppressions=/home/ishikawa/Dropbox/myown.sup --show-mismatched-frees=no --show-possibly-lost=no --read-inline-info=yes --fair-sched=yes --smc-check=all-non-file --gen-suppressions=all -v --trace-signals=yes --vex-iropt-register-updates=allregs-at-mem-access --child-silent-after-fork=yes --suppressions=/home/ishikawa/Dropbox/myown.sup --trace-children-skip=/usr/bin/lsb_release,/usr/bin/hg,/bin/rm,*/bin/certutil,*/bin/pk12util,*/bin/ssltunnel,*/bin/uname,*/bin/which,*/bin/ps,*/bin/grep,*/bin/java,*/fix-stacks,*/firefox/firefox,*/bin/firefox-esr,*/bin/python,*/bin/python2,*/bin/python3,*/bin/python2.7,*/bin/bash,*/bin/nodejs,*/bin/node,*/bin/xpcshell,python3
 0:02.43 INFO runtests.py | Running on Valgrind.  Using timeout of 2960 seconds.
20:33.55 INFO ishikawa	447031	446969	3 09:22 pts/4	 00:00:48 python3 /NREF-COMM-CENTRAL/mozilla/mach --log-no-times mochitest --valgrind valgrind --valgrind-args=--trace-children=yes --max-threads=5000 --max-stackframe=16000000 --num-transtab-sectors=24 --tool=memcheck --freelist-vol=500000000 --redzone-size=128 --px-default=allregs-at-mem-access --px-file-backed=unwindregs-at-mem-access --malloc-fill=0xA5 --free-fill=0xC3 --num-callers=50 --suppressions=/home/ishikawa/Dropbox/myown.sup --show-mismatched-frees=no --show-possibly-lost=no --read-inline-info=yes --fair-sched=yes --smc-check=all-non-file --gen-suppressions=all -v --trace-signals=yes --vex-iropt-register-updates=allregs-at-mem-access --child-silent-after-fork=yes --suppressions=/home/ishikawa/Dropbox/myown.sup --trace-children-skip=/usr/bin/lsb_release,/usr/bin/hg,/bin/rm,*/bin/certutil,*/bin/pk12util,*/bin/ssltunnel,*/bin/uname,*/bin/which,*/bin/ps,*/bin/grep,*/bin/java,*/fix-stacks,*/firefox/firefox,*/bin/firefox-esr,*/bin/python,*/bin/python2,*/bin/python3,*/bin/python2.7,*/bin/bash,*/bin/nodejs,*/bin/node,*/bin/xpcshell,python3
20:33.60 INFO ishikawa	447031	446969	3 09:22 pts/4	 00:00:48 python3 /NREF-COMM-CENTRAL/mozilla/mach --log-no-times mochitest --valgrind valgrind --valgrind-args=--trace-children=yes --max-threads=5000 --max-stackframe=16000000 --num-transtab-sectors=24 --tool=memcheck --freelist-vol=500000000 --redzone-size=128 --px-default=allregs-at-mem-access --px-file-backed=unwindregs-at-mem-access --malloc-fill=0xA5 --free-fill=0xC3 --num-callers=50 --suppressions=/home/ishikawa/Dropbox/myown.sup --show-mismatched-frees=no --show-possibly-lost=no --read-inline-info=yes --fair-sched=yes --smc-check=all-non-file --gen-suppressions=all -v --trace-signals=yes --vex-iropt-register-updates=allregs-at-mem-access --child-silent-after-fork=yes --suppressions=/home/ishikawa/Dropbox/myown.sup --trace-children-skip=/usr/bin/lsb_release,/usr/bin/hg,/bin/rm,*/bin/certutil,*/bin/pk12util,*/bin/ssltunnel,*/bin/uname,*/bin/which,*/bin/ps,*/bin/grep,*/bin/java,*/fix-stacks,*/firefox/firefox,*/bin/firefox-esr,*/bin/python,*/bin/python2,*/bin/python3,*/bin/python2.7,*/bin/bash,*/bin/nodejs,*/bin/node,*/bin/xpcshell,python3
20:34.24 INFO runtests.py | Running on Valgrind.  Using timeout of 2960 seconds.
34:11.90 TEST_END: Test OK. Subtests passed 13/16. Unexpected 3
39:40.61 TEST_END: Test OK. Subtests passed 31/35. Unexpected 1
41:05.25 INFO runtests.py | Running on Valgrind.  Using timeout of 2960 seconds.
57:59.68 TEST_END: Test OK. Subtests passed 6/8. Unexpected 2
61:36.10 INFO runtests.py | Running on Valgrind.  Using timeout of 2960 seconds.
77:41.40 TEST_END: Test OK. Subtests passed 4/6. Unexpected 2
80:54.94 TEST_END: Test OK. Subtests passed 3/6. Unexpected 3
82:07.06 INFO runtests.py | Running on Valgrind.  Using timeout of 2960 seconds.
95:26.38 TEST_END: Test OK. Subtests passed 1/2. Unexpected 1
102:38.05 INFO runtests.py | Running on Valgrind.  Using timeout of 2960 seconds.
117:17.64 TEST_END: Test OK. Subtests passed 1/2. Unexpected 1
121:19.88 TEST_END: Test OK. Subtests passed 37/39. Unexpected 2
123:09.01 INFO runtests.py | Running on Valgrind.  Using timeout of 2960 seconds.
140:31.91 TEST_END: Test OK. Subtests passed 31/33. Unexpected 2
142:55.68 TEST_END: Test OK. Subtests passed 21/24. Unexpected 3
143:39.95 INFO runtests.py | Running on Valgrind.  Using timeout of 2960 seconds.
158:58.01 TEST_END: Test OK. Subtests passed 3/4. Unexpected 1
161:33.68 TEST_END: Test OK. Subtests passed 3/4. Unexpected 1
164:11.30 INFO runtests.py | Running on Valgrind.  Using timeout of 2960 seconds.
180:17.64 TEST_END: Test OK. Subtests passed 3/4. Unexpected 1
182:25.29 TEST_END: Test OK. Subtests passed 3/4. Unexpected 1
184:42.15 INFO runtests.py | Running on Valgrind.  Using timeout of 2960 seconds.
199:52.37 TEST_END: Test OK. Subtests passed 3/5. Unexpected 2
203:39.01 TEST_END: Test OK. Subtests passed 3/6. Unexpected 3
205:13.07 INFO runtests.py | Running on Valgrind.  Using timeout of 2960 seconds.

I forget exactly where this 2960 is set. Maybe it is the default |mach| uses when it invokes valgrind(?).
Or it is possible that I already applied a local patch to increase this number.
But as in the previous comment (comment 26), I had not increased the timeout in marionette.py until today.
(Come to think of it, I had only looked at scripts under the mozilla/comm directory and missed this timeout value in marionette.py, which is mozilla/testing/marionette/client/marionette_driver/marionette.py.)
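One way to find where the 2960-second figure comes from is to search the harness for the "Running on Valgrind" message that appears in the log above (path relative to the mozilla/ source directory); I have not chased it down yet:

grep -rn "Running on Valgrind" testing/mochitest/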

The mochitest run is still slow, but I have changed the timeout in marionette.py from 1800 to 2400 and will see whether the test failures diminish.
Cautiously, this is great news: it means more memory-related issues can be found in TB by running it under valgrind during mochitest. (TB under valgrind has already been running successfully for the xpcshell tests since early this year.)
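
For the record, the change itself is conceptually trivial; a minimal sketch of the idea follows (the constant names and the environment-variable switch are my own placeholders, not the actual marionette.py code):

    import os

    BASE_STARTUP_TIMEOUT = 1800      # the value I found in marionette.py
    RAISED_STARTUP_TIMEOUT = 2400    # the value I am trying now

    def startup_timeout():
        # Hypothetical switch: my local patch simply edits the constant, but
        # keying it off an environment variable would avoid hard-coding it.
        if os.environ.get("MOZ_RUN_UNDER_VALGRIND"):
            return RAISED_STARTUP_TIMEOUT
        return BASE_STARTUP_TIMEOUT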

I will summarize what needs to be done to run TB successfully under valgrind (especially the timeout value changes) once I figure out where other timeout values may need increasing.

I am using valgrind 3.17.0, updated about a week ago on my Debian GNU/Linux PC.
This may be one of the reasons for the success.

Those who may not have run TB or FF under valgrind locally may be amused to learn that (re)drawing is so slow under valgrind that you can literally see how the visual elements are drawn on the screen element by element.

This is actually a great way to spot inefficient or unnecessary redrawing done by a program. (We can also run TB on a PC with a slow CPU of less than 1 GHz to find such inefficient code.)

The test continues. I think I can report how well mochitest is covered when it is run under valgrind with preliminary timeout settings on Thursday or Friday.

Regarding strange error lines of the form,

41:03.60 GECKO(471407) --471563-- WARNING: Serious error when reading debug info
641:03.60 GECKO(471407) --471563-- When reading debug info from /memfd:mozilla-ipc (deleted):
641:03.60 GECKO(471407) --471563-- failed to stat64/stat this file,

Attached is the listing of /proc/471563/map_files.
I will also post the maps file, but note that between the two captures the memory mapping might have changed.
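
In case it is useful, the listing can be captured with a few lines of Python like the following (pid 471563 is the valgrind'd child from the log above; reading another process's map_files may require extra privileges on some kernels):

    import os

    pid = 471563                                  # the valgrind'd child process
    d = f"/proc/{pid}/map_files"
    for entry in sorted(os.listdir(d)):
        # Each entry is named <start>-<end> and is a magic symlink to the mapped file.
        print(entry, "->", os.readlink(os.path.join(d, entry)))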

I think mozilla-ipc, as the name suggests, is a memory area used by the inter-process communication mechanism; such areas are created as the need arises and then released. I have no idea why valgrind wants to read debug info from the (deleted) area.

For a few years, valgrind could not run the large TB binary under Debian's stock kernel.
I think a kernel configuration issue played a role: the huge TB binary could not extend its stack when a stack overflow occurred, and valgrind itself segfaulted. Very strange.
At least in the memory map attached next, we can see that the stack really is near the end of the address space and there is no memory segment near it that would block its growth.

/proc/<pid>/maps of the previous process.
Note that the mapping may have changed between the captures.

I have modified a few timeout values in the |mozilla/testing/mochitest/runtests.py| script (there are several of them).
However, even with the longer timeouts, some tests ran but most of the tests timed out.

Attached is the output of the following command, where log1235-mochitest*.txt refers to the log captured while running mochitest under valgrind (|./mach mochitest --valgrind valgrind ...|).

egrep "(TEST_START|TEST_END|Subtests passed|timeout|Running on|/mach|timed out|application timed out after 2960 seconds|Running manifest)" log1325-mochitest*.txt

There were many timeout errors. When these happened, no test results were recorded.

20:02.55 INFO TEST-UNEXPECTED-TIMEOUT | Last test finished | application timed out after 2960 seconds with no output
40:33.70 INFO TEST-UNEXPECTED-TIMEOUT | comm/calendar/test/browser/browser_calendarList.js | application timed out after 2960 seconds with no output
61:04.71 INFO TEST-UNEXPECTED-TIMEOUT | Last test finished | application timed out after 2960 seconds with no output
81:35.59 INFO TEST-UNEXPECTED-TIMEOUT | comm/calendar/test/browser/eventDialog/browser_alarmDialog.js | application timed out after 2960 seconds with no output
102:06.91 INFO TEST-UNEXPECTED-TIMEOUT | comm/calendar/test/browser/invitations/browser_testIdentityPrompt.js | application timed out after 2960 seconds with no output
122:37.86 INFO TEST-UNEXPECTED-TIMEOUT | comm/calendar/test/browser/preferences/browser_alarmDefaultValue.js | application timed out after 2960 seconds with no output
143:08.73 INFO TEST-UNEXPECTED-TIMEOUT | comm/calendar/test/browser/providers/browser_caldavCalendar_cached.js | application timed out after 2960 seconds with no output
163:39.74 INFO TEST-UNEXPECTED-TIMEOUT | comm/calendar/test/browser/recurrence/browser_annual.js | application timed out after 2960 seconds with no output
184:10.59 INFO TEST-UNEXPECTED-TIMEOUT | comm/calendar/test/browser/recurrence/browser_biweekly.js | application timed out after 2960 seconds with no output
204:41.55 INFO TEST-UNEXPECTED-TIMEOUT | comm/calendar/test/browser/views/browser_multiweekView.js | application timed out after 2960 seconds with no output
225:12.44 INFO TEST-UNEXPECTED-TIMEOUT | comm/mail/base/test/browser/browser_formPickers.js | application timed out after 2960 seconds with no output
245:43.30 INFO TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 2960 seconds with no output
266:14.15 INFO TEST-UNEXPECTED-TIMEOUT | comm/mail/base/test/webextensions/browser_extension_install_experiment.js | application timed out after 2960 seconds with no output
286:45.12 INFO TEST-UNEXPECTED-TIMEOUT | comm/mail/components/addrbook/test/browser/browser_ldap_search.js | application timed out after 2960 seconds with no output
307:15.96 INFO TEST-UNEXPECTED-TIMEOUT | comm/mail/components/addrbook/test/browser/browser_cardDAV_init.js | application timed out after 2960 seconds with no output
328:04.03 INFO TEST-UNEXPECTED-TIMEOUT | comm/mail/components/addrbook/test/browser/new/browser_ldap_search.js | application timed out after 2960 seconds with no output
348:35.34 INFO TEST-UNEXPECTED-TIMEOUT | comm/mail/components/addrbook/test/browser/new/browser_cardDAV_init.js | application timed out after 2960 seconds with no output
369:06.17 INFO TEST-UNEXPECTED-TIMEOUT | comm/mail/components/cloudfile/test/browser/browser_repeat_upload.js | application timed out after 2960 seconds with no output
389:37.05 INFO TEST-UNEXPECTED-TIMEOUT | comm/mail/components/enterprisepolicies/tests/browser/browser_policy_app_auto_update.js | application timed out after 2960 seconds with no output
410:07.92 INFO TEST-UNEXPECTED-TIMEOUT | comm/mail/components/enterprisepolicies/tests/browser/disable_app_update/browser_policy_disable_app_update.js | application timed out after 2960 seconds with no output
430:38.77 INFO TEST-UNEXPECTED-TIMEOUT | comm/mail/components/enterprisepolicies/tests/browser/disable_developer_tools/browser_policy_disable_developer_tools.js | application timed out after 2960 seconds with no output
451:09.63 INFO TEST-UNEXPECTED-TIMEOUT | comm/mail/components/extensions/test/browser/browser_ext_addressBooksUI.js | application timed out after 2960 seconds with no output
471:40.51 INFO TEST-UNEXPECTED-TIMEOUT | comm/mail/components/im/test/browser/browser_browserRequest.js | application timed out after 2960 seconds with no output
492:20.80 INFO TEST-UNEXPECTED-TIMEOUT | comm/mail/components/preferences/test/browser/browser_chat.js | application timed out after 2960 seconds with no output
512:51.69 INFO TEST-UNEXPECTED-TIMEOUT | comm/mail/test/browser/addrbook/browser_addressBookPanes.js | application timed out after 2960 seconds with no output
533:38.84 INFO TEST-UNEXPECTED-TIMEOUT | comm/mail/test/browser/attachment/browser_attachment.js | application timed out after 2960 seconds with no output
554:09.72 INFO TEST-UNEXPECTED-TIMEOUT | comm/mail/test/browser/cloudfile/browser_attachmentItem.js | application timed out after 2960 seconds with no output
574:40.74 INFO TEST-UNEXPECTED-TIMEOUT | comm/mail/test/browser/composition/browser_addressWidgets.js | application timed out after 2960 seconds with no output
595:11.66 INFO TEST-UNEXPECTED-TIMEOUT | comm/mail/test/browser/content-tabs/browser_aboutSupport.js | application timed out after 2960 seconds with no output
615:42.60 INFO TEST-UNEXPECTED-TIMEOUT | comm/mail/test/browser/cookies/browser_cookies.js | application timed out after 2960 seconds with no output
636:13.54 INFO TEST-UNEXPECTED-TIMEOUT | comm/mail/test/browser/downloads/browser_aboutDownloads.js | application timed out after 2960 seconds with no output
656:44.58 INFO TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 2960 seconds with no output
677:15.53 INFO TEST-UNEXPECTED-TIMEOUT | comm/mail/test/browser/folder-pane/browser_displayMessageWithFolderModes.js | application timed out after 2960 seconds with no output
697:46.49 INFO TEST-UNEXPECTED-TIMEOUT | comm/mail/test/browser/folder-tree-modes/browser_customFolderTreeMode.js | application timed out after 2960 seconds with no output
718:18.19 INFO TEST-UNEXPECTED-TIMEOUT | comm/mail/test/browser/folder-widget/browser_messageFilters.js | application timed out after 2960 seconds with no output
738:49.09 INFO TEST-UNEXPECTED-TIMEOUT | comm/mail/test/browser/global-search-bar/browser_clickResultItem.js | application timed out after 2960 seconds with no output
759:20.02 INFO TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 2960 seconds with no output
779:50.98 INFO TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 2960 seconds with no output
800:38.20 INFO TEST-UNEXPECTED-TIMEOUT | comm/mail/test/browser/keyboard/browser_spacehit.js | application timed out after 2960 seconds with no output
821:09.25 INFO TEST-UNEXPECTED-TIMEOUT | comm/mail/test/browser/message-header/browser_messageHeader.js | application timed out after 2960 seconds with no output
841:40.12 INFO TEST-UNEXPECTED-TIMEOUT | comm/mail/test/browser/message-reader/browser_bug594646.js | application timed out after 2960 seconds with no output
862:11.00 INFO TEST-UNEXPECTED-TIMEOUT | comm/mail/test/browser/message-window/browser_autohideMenubar.js | application timed out after 2960 seconds with no output
882:55.69 INFO TEST-UNEXPECTED-TIMEOUT | comm/mail/test/browser/multiple-identities/browser_displayNames.js | application timed out after 2960 seconds with no output
903:26.66 INFO TEST-UNEXPECTED-TIMEOUT | comm/mail/test/browser/newmailaccount/browser_newmailaccount.js | application timed out after 2960 seconds with no output
923:57.55 INFO TEST-UNEXPECTED-TIMEOUT | comm/mail/test/browser/notification/browser_notification.js | application timed out after 2960 seconds with no output
944:28.49 INFO TEST-UNEXPECTED-TIMEOUT | comm/mail/test/browser/openpgp/browser_editDraftTemplate.js | application timed out after 2960 seconds with no output
964:59.41 INFO TEST-UNEXPECTED-TIMEOUT | Last test finished | application timed out after 2960 seconds with no output
985:30.28 INFO TEST-UNEXPECTED-TIMEOUT | comm/mail/test/browser/quick-filter-bar/browser_displayIssues.js | application timed out after 2960 seconds with no output
1006:01.17 INFO TEST-UNEXPECTED-TIMEOUT | comm/mail/test/browser/search-window/browser_multipleSearchWindows.js | application timed out after 2960 seconds with no output
1026:32.03 INFO TEST-UNEXPECTED-TIMEOUT | comm/mail/test/browser/session-store/browser_sessionStore.js | application timed out after 2960 seconds with no output
1047:02.96 INFO TEST-UNEXPECTED-TIMEOUT | comm/mail/test/browser/smime/browser_multipartAlternative.js | application timed out after 2960 seconds with no output
1067:33.90 INFO TEST-UNEXPECTED-TIMEOUT | Last test finished | application timed out after 2960 seconds with no output
1088:05.55 INFO TEST-UNEXPECTED-TIMEOUT | comm/mail/test/browser/subscribe/browser_newsFilter.js | application timed out after 2960 seconds with no output
1108:36.43 INFO TEST-UNEXPECTED-TIMEOUT | comm/mail/test/browser/tabmail/browser_closing.js | application timed out after 2960 seconds with no output
1129:07.35 INFO TEST-UNEXPECTED-TIMEOUT | comm/mail/test/browser/utils/browser_extensionSupport.js | application timed out after 2960 seconds with no output
1149:39.00 INFO TEST-UNEXPECTED-TIMEOUT | Last test finished | application timed out after 2960 seconds with no output
1170:09.85 INFO TEST-UNEXPECTED-TIMEOUT | comm/mailnews/extensions/newsblog/test/browser/browser_feedDisplay.js | application timed out after 2960 seconds with no output

On the other hand, there are tests that completed and reported their subtests:

31:52.09 TEST_START: comm/calendar/test/browser/browser_basicFunctionality.js
34:36.61 TEST_END: Test OK. Subtests passed 18/18. Unexpected 0
34:37.11 TEST_START: comm/calendar/test/browser/browser_calDAV_oAuth.js
38:24.66 TEST_END: Test OK. Subtests passed 31/34. Unexpected 0
38:24.73 TEST_START: comm/calendar/test/browser/browser_calendarList.js
52:27.07 TEST_START: comm/calendar/test/browser/contextMenu/browser_edit.js
59:39.14 TEST_END: Test OK. Subtests passed 10/10. Unexpected 0
72:50.77 TEST_START: comm/calendar/test/browser/eventDialog/browser_alarmDialog.js
94:30.41 TEST_START: comm/calendar/test/browser/invitations/browser_icsAttachment.js
96:59.41 TEST_END: Test OK. Subtests passed 4/5. Unexpected 0
96:59.59 TEST_START: comm/calendar/test/browser/invitations/browser_imipBarEmail.js
101:00.41 TEST_END: Test OK. Subtests passed 19/20. Unexpected 0
175:29.28 TEST_START: comm/calendar/test/browser/recurrence/browser_annual.js
183:14.42 TEST_END: Test OK. Subtests passed 6/8. Unexpected 2
195:55.65 TEST_START: comm/calendar/test/browser/views/browser_dayView.js
200:48.23 TEST_END: Test OK. Subtests passed 3/4. Unexpected 1
200:48.95 TEST_START: comm/calendar/test/browser/views/browser_monthView.js
203:30.97 TEST_END: Test OK. Subtests passed 3/4. Unexpected 1
387:27.51 TEST_START: comm/mail/components/enterprisepolicies/tests/browser/browser_policies_setAndLockPref_API.js
388:22.02 TEST_END: Test OK. Subtests passed 35/35. Unexpected 0
963:14.37 TEST_START: comm/mail/test/browser/override-main-menu-collapse/browser_overrideMainmenuCollapse.js
963:16.03 TEST_END: Test OK. Subtests passed 1/1. Unexpected 0
1065:35.53 TEST_START: comm/mail/test/browser/startup-firstrun/browser_menubarCollapsed.js
1065:37.16 TEST_END: Test OK. Subtests passed 1/1. Unexpected 0
1147:34.45 TEST_START: comm/mailnews/addrbook/test/browser/browser_cardDAV_oAuth.js
1149:10.40 TEST_END: Test OK. Subtests passed 11/12. Unexpected 0

It is hard to tell what makes some tests succeed and others fail.
I am trying to work out what is causing the difference.

It makes me wonder how Firefox is executed under valgrind on the Mozilla try farm.

It is possible that TB's startup time is inherently longer due to the message store DB setup time, etc.

There is also a strange timeout issue. Possibly a miscalculation somewhere in runtests.py.

This execution was done against a single file target: comm/mail/test/browser/composition/browser_addressWidgets.js

You see 2960 seconds in the attached log.
This is a value which I multiplied by 8.
https://searchfox.org/mozilla-central/source/testing/mochitest/runtests.py#3146

(I also increased a few timeout values as I have explained.)

However, as you can see, the test is reported to have exceeded this timeout value.

20:02.65 INFO TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 2960 seconds with no output

However, I noticed this is not quite true:
the whole run took only about 21 minutes, i.e., roughly 1260 seconds.
The following line was printed by the |time| command with which I invoked the |mach| command.

real	20m39.273s      <==== much shorter than 2960 seconds.

So I think there is some mishandling of timeout values in runtests.py and/or the process(es) it invokes.
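
My working mental model of the harness side is an output-inactivity watchdog roughly like the following (a simplified sketch of mine, not the literal runtests.py/processhandler.py code). Note that what it reports is the configured inactivity timeout, not the wall-clock duration of the run, and if the last-output timestamp is not refreshed on some code path, it fires spuriously:

    import threading, time

    class OutputWatchdog:
        """Call timeout_callback if no output is seen for `timeout` seconds."""

        def __init__(self, timeout, timeout_callback):
            self.timeout = timeout                    # e.g. 2960 under valgrind
            self.timeout_callback = timeout_callback
            self.last_output = time.monotonic()
            self._stop = threading.Event()

        def saw_output(self):
            # Must be called for EVERY line read from the child; if any code
            # path forgets to call it, the watchdog fires spuriously.
            self.last_output = time.monotonic()

        def run(self):
            while not self._stop.wait(1.0):
                if time.monotonic() - self.last_output > self.timeout:
                    # The message reports self.timeout (the 2960 s figure),
                    # not how long the whole run actually took.
                    self.timeout_callback()
                    return

        def stop(self):
            self._stop.set()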

Unless this strange timeout issue is resolved, I won't be able to proceed much, though. Hmm...

See Also: → 1732848

The attached file was produced by running the following command against the log file while it is still being captured.
The local job has been running for about 13 hours now.

egrep -i "(entering|info test-unexpected|test_start|test_end|passed)" log1331-mochitest-memory.txt 

I think there are problems both in the test harness that runs TB under valgrind during mochitest (case A) and in the tests themselves, which probably assume timing dependencies that are invalidated by the skewed execution time under valgrind (case B).

I am saying this based on the observations below.

Case-A: test harness problem.

Please take a look at this run.

720:25.34 TEST_START: comm/mail/test/browser/folder-widget/browser_messageFilters.js
720:31.13 INFO Entering test bound setupModule
720:36.53 INFO Entering test bound test_message_filter_shows_newsgroup_server
722:13.06 INFO Entering test bound test_customize_toolbar_doesnt_double_get_mail_menu
725:13.47 INFO Entering test bound test_address_books_appear_in_message_filter_dropdown
726:14.56 INFO Entering test bound test_can_cancel_quit_on_filter_changes
726:52.18 INFO Entering test bound test_can_quit_on_filter_changes
726:57.61 TEST_END: Test OK. Subtests passed 14/14. Unexpected 0
731:38.66 INFO TEST-UNEXPECTED-TIMEOUT | Last test finished | application timed out after 2630 seconds with no output

The test seems to have run to completion and the test results are summarised.
And yet the test harness reported an unexpected timeout with no output from the application?!
After some debugging of the Python script, it seems to me there may be a problem with the test harness trying to read past the end of the stream (produced by the child process) in

    def _read(self, stdout_reader, stderr_reader, queue):

in processhandler.py.
Maybe the child process had already exited, but its exit status was not passed back to the caller reliably?
(Maybe only half the time?)
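
To illustrate the kind of race I have in mind, here is a minimal stand-alone sketch (my own mock-up, not the actual processhandler.py logic): the reader blocks on a queue with a timeout, the child's output has already been fully consumed, and because nobody signals the child's exit, the timeout path is taken even though everything actually finished.

    import queue, threading

    def _read(q, timeout, timeout_callback):
        """Drain lines produced by the child; call timeout_callback on silence."""
        while True:
            try:
                line = q.get(timeout=timeout)
            except queue.Empty:
                # Reached both when the child is genuinely hung AND when the
                # child has already exited but its exit was never signalled here.
                timeout_callback()
                return
            if line is None:          # sentinel meaning "child exited cleanly"
                return
            print(line)

    q = queue.Queue()
    q.put("TEST_END: Test OK. Subtests passed 14/14. Unexpected 0")
    # Bug: the producer never puts the None sentinel after the child exits ...
    t = threading.Thread(target=_read, args=(q, 2.0, lambda: print("TEST-UNEXPECTED-TIMEOUT")))
    t.start()
    t.join()

This would match what I see in the log: the TEST_END summary appears, and the unexpected timeout is still reported afterwards.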

Case-B: problems in the tests themselves.
Test cases with a log excerpt like the following are, I think, intrinsically problematic.

676:04.51 TEST_START: comm/mail/test/browser/folder-pane/browser_displayMessageWithFolderModes.js
676:09.66 INFO Entering test bound setupModule
677:14.89 INFO Entering test bound test_display_message_with_folder_not_present_in_current_folder_mode
687:16.29 INFO TEST-UNEXPECTED-TIMEOUT | comm/mail/test/browser/folder-pane/browser_displayMessageWithFolderModes.js | application timed out after 2630 seconds with no output

I cannot really say much about this particular test.

But I do watch the screen during the tests to see whether anything goes wrong, and I have noticed that many tests do not pause properly after GUI elements are drawn. I suspect there are timing races that cause events to be sent to GUI elements that do not yet exist on the screen, because the delays/waits are not long enough for TB operating under valgrind (see the polling-wait sketch below).
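
What such tests would generally need instead of a fixed delay is a polling wait whose ceiling scales with the slowdown, roughly like the following (a generic sketch in Python for illustration only; the real browser-chrome tests are JavaScript and have their own wait helpers, and compose_window_is_open is a hypothetical condition):

    import time

    def wait_for(condition, timeout=30.0, interval=0.1, valgrind_factor=8):
        """Poll `condition` until it is true, scaling the ceiling for valgrind."""
        deadline = time.monotonic() + timeout * valgrind_factor
        while time.monotonic() < deadline:
            if condition():
                return True
            time.sleep(interval)
        raise TimeoutError("condition not met even with the scaled timeout")

    # e.g. wait_for(lambda: compose_window_is_open(), timeout=10)   # hypothetical helper
    # instead of a fixed sleep followed by sending events blindly.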

Anyway, in 20+ hours or so I should be able to check more of the test results and home in on the tests that do not report a test summary at all.

It helped to figure out which timeouts are caused by what.
processhandler.py calls the timeout_callback routine explicitly when a stream from a child application times out.
It also helped to dump the Python stack every 20 seconds or so by installing a signal handler and sending the runtests.py process (invoked by |mach|) SIGUSR1 periodically (see the sketch at the end of this comment).
By doing this, I could see that it is the reading of the child process's stream that can trigger the timeout processing at the end. (But I still cannot figure out why/when a timeout occurs; it seems a bit random, which is why I suspect unreliable processing under valgrind.) runtests.py also contains the following comment, which gives me the impression that |wait| is not entirely trustworthy:

            # wait until app is finished
            # XXX copy functionality from
            # https://github.com/mozilla/mozbase/blob/master/mozrunner/mozrunner/runner.py#L61
            # until bug 913970 is fixed regarding mozrunner `wait` not returning status
            # see https://bugzilla.mozilla.org/show_bug.cgi?id=913970

This is what I found out so far.
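
For reference, the stack-dump trick is just the standard faulthandler facility; what I wired in is roughly this (a minimal sketch):

    import faulthandler, signal, sys

    # Dump the Python stacks of all threads to stderr whenever SIGUSR1 arrives.
    # chain=False means the signal is fully handled here.
    faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True, chain=False)

Then I send SIGUSR1 to the runtests.py process every 20 seconds or so (e.g. |kill -USR1 <pid>| from a small loop) to get a rolling picture of where the harness is blocked.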

Oh, I forgot to mention.

Thunderbird tries to find out what graphics capabilities the host system has.
Under Linux, where I test my local build, this is done by probing via X11 or known driver subsystems.
However, in the past the graphics drivers had quirks and were prone to crashing when probed.
So TB (and FF, for that matter) FORKS a child process and does the probe there; the child may crash, but that does not affect the main application.
The probe summary is sent back to the parent application via a pipe created at fork time.

This is a rather expensive operation under valgrind.

I noticed a rather lengthy pause when TB is invoked under valgrind during mochitest.

After I figured out why TB under valgrind seemed to get stuck without producing any output (the pause happened during this external probe done by the child process forked from the main application),
I changed the TB code:

  • In a normal run, TB prints out the probe summary sent back via the pipe, formatted as a
    hexadecimal initializer for a char array.
  • Under valgrind, TB does not fork and skips the real probe altogether; instead, the data
    probed during a normal run (without valgrind) is made available in a character array, and TB uses that for subsequent operation.
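
Conceptually the shortcut is just this (sketched in Python for readability; the real change is in TB's C++ startup code, and the names, the detection method and the canned blob below are all placeholders of mine):

    import os

    # Probe output captured once from a normal (non-valgrind) run on this machine.
    # It must be re-captured whenever the host's graphics configuration changes.
    CACHED_PROBE_DATA = bytes.fromhex("00010203")   # placeholder, not real data

    def running_under_valgrind():
        # Placeholder detection: my wrapper script could export such a variable.
        return os.environ.get("MOZ_RUN_UNDER_VALGRIND") == "1"

    def graphics_probe():
        if running_under_valgrind():
            # Skip the fork + probe entirely; under valgrind it is very expensive
            # and TB appears to hang while the child is being instrumented.
            return CACHED_PROBE_DATA
        return fork_and_probe_in_child()            # the normal crash-isolated path

    def fork_and_probe_in_child():
        raise NotImplementedError("stands in for the real fork/pipe probe")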

On a local test machine this has turned out to be rather useful. One just has to redo the capture whenever the host's hardware configuration changes.

With the above mod, TB under valgrind proceeds rather smoothly during mochitest, and I can focus on the real timeout issues (both in the test harness itself and in the tests that seem to assume a certain ordering of events without proper interlocking or sufficient waiting under valgrind).

If, instead of forking and using pipes, the Mozilla code simply ran an external program that saved the probe data in a file under, say, /tmp,
we could easily exclude that external program from valgrind's tracing (--trace-children-skip) and save CPU workload.
However, the code is not written that way today, so I took a shortcut.

Attached image image.png

I wonder why the people testing FF under valgrind do not hit these issues; I suspect the test harness has already been ironed out with respect to the issues found for FF under valgrind (since it seems to be run periodically on the try server).

I am not even sure whether the fundamental assumptions runtests.py makes are correct for TB testing.
Under the current test run we have the following processes.
Excerpt from ps agxlww:

0  1000 1046429 1046370  20   0 3770228 284332 -    Sl+  pts/5     19:38 python3 /NREF-COMM-CENTRAL/mozilla/mach --log-no-times mochitest --timeout 2600 --valgrind valgrind --valgrind-args=--trace-children=yes --max-threads=5000 --max-stackframe=16000000 --num-transtab-sectors=24 --tool=memcheck --freelist-vol=500000000 --redzone-size=128 --px-default=allregs-at-mem-access --px-file-backed=unwindregs-at-mem-access --malloc-fill=0xA5 --free-fill=0xC3 --num-callers=50 --suppressions=/home/ishikawa/Dropbox/myown.sup --show-mismatched-frees=no --show-possibly-lost=no  --fair-sched=yes --smc-check=all-non-file --gen-suppressions=all --vex-iropt-register-updates=allregs-at-mem-access --child-silent-after-fork=yes --suppressions=/home/ishikawa/Dropbox/myown.sup --trace-children-skip=/usr/bin/lsb_release,/usr/bin/hg,/bin/rm,*/bin/certutil,*/bin/pk12util,*/bin/ssltunnel,*/bin/uname,*/bin/which,*/bin/ps,*/bin/grep,*/bin/java,*/fix-stacks,*/firefox/firefox,*/bin/firefox-esr,*/bin/python,*/bin/python2,*/bin/python3,*/bin/python2.7,*/bin/bash,*/bin/nodejs,*/bin/node,*/bin/xpcshell,python3,*/bin/thunderbird-bin
0  1000 1049644 1046429  20   0 356176 349856 -     S+   pts/5      0:09 /home/ishikawa/.mozbuild/fix-stacks/fix-stacks
1     0 1057902       2  20   0      0     0 -      I    ?          0:00 [kworker/0:1-events]
1     0 1061110       2  20   0      0     0 -      I    ?          0:04 [kworker/2:2-events]
1     0 1066989       2  20   0      0     0 -      I    ?          0:00 [kworker/4:2]
1     0 1069757       2  20   0      0     0 -      I    ?          0:00 [kworker/5:2]
1     0 1070572       2  20   0      0     0 -      I    ?          0:00 [kworker/6:0-events]
1     0 1083795       2  20   0      0     0 -      I    ?          0:02 [kworker/u14:2-events_unbound]
4   121 1085677    2120  20   0  40468  6612 -      S    ?          0:00 pickup -l -t unix -u -c
1     0 1088297       2  20   0      0     0 -      I    ?          0:00 [kworker/3:1-events]
4  1000 1088884 1046429  20   0 3163092 85576 -     Sl   pts/5      0:00 /NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/xpcshell -g /NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin -f /NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/components/httpd.js -e const _PROFILE_PATH = '/COMM-CENTRAL/TMP-DIR/tmp7zkdhvzk.mozrunner'; const _SERVER_PORT = '8888'; const _SERVER_ADDR = '127.0.0.1'; const _TEST_PREFIX = undefined; const _DISPLAY_RESULTS = false; -f /NEW-SSD/moz-obj-dir/objdir-tb3/_tests/testing/mochitest/server.js
0  1000 1088887 1046429  20   0  28752 18508 -      S    pts/5      0:00 /usr/bin/python3 /NEW-SSD/moz-obj-dir/objdir-tb3/_tests/testing/mochitest/pywebsocket_wrapper.py -H 127.0.0.1 -p 9988 -w /NEW-SSD/moz-obj-dir/objdir-tb3/_tests/testing/mochitest -l /NEW-SSD/moz-obj-dir/objdir-tb3/_tests/testing/mochitest/websock.log --log-level=debug --allow-handlers-outside-root-dir
0  1000 1088892 1046429  20   0 150980 13784 -      Sl   pts/5      0:00 /NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/ssltunnel /COMM-CENTRAL/TMP-DIR/ssltunnel_4ftcg3o.cfg
4  1000 1088912 1046429  20   0 8136284 5644400 -   Rl   pts/5     18:12 valgrind --fair-sched=yes --smc-check=all-non-file --vex-iropt-register-updates=allregs-at-mem-access --trace-children=yes --child-silent-after-fork=yes --trace-children-skip=/usr/bin/hg,/bin/rm,*/bin/certutil,*/bin/pk12util,*/bin/ssltunnel,*/bin/uname,*/bin/which,*/bin/ps,*/bin/grep,*/bin/java,*/bin/lsb_release --partial-loads-ok=yes --leak-check=summary --show-possibly-lost=no --show-mismatched-frees=no --trace-children=yes --max-threads=5000 --max-stackframe=16000000 --num-transtab-sectors=24 --tool=memcheck --freelist-vol=500000000 --redzone-size=128 --px-default=allregs-at-mem-access --px-file-backed=unwindregs-at-mem-access --malloc-fill=0xA5 --free-fill=0xC3 --num-callers=50 --suppressions=/home/ishikawa/Dropbox/myown.sup --show-mismatched-frees=no --show-possibly-lost=no --fair-sched=yes --smc-check=all-non-file --gen-suppressions=all --vex-iropt-register-updates=allregs-at-mem-access --child-silent-after-fork=yes --suppressions=/home/ishikawa/Dropbox/myown.sup --trace-children-skip=/usr/bin/lsb_release,/usr/bin/hg,/bin/rm,*/bin/certutil,*/bin/pk12util,*/bin/ssltunnel,*/bin/uname,*/bin/which,*/bin/ps,*/bin/grep,*/bin/java,*/fix-stacks,*/firefox/firefox,*/bin/firefox-esr,*/bin/python,*/bin/python2,*/bin/python3,*/bin/python2.7,*/bin/bash,*/bin/nodejs,*/bin/node,*/bin/xpcshell,python3,*/bin/thunderbird-bin --suppressions=/NEW-SSD/NREF-COMM-CENTRAL/mozilla/build/valgrind/x86_64-pc-linux-gnu.sup --suppressions=/NEW-SSD/NREF-COMM-CENTRAL/mozilla/build/valgrind/cross-architecture.sup /NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/thunderbird-bin -marionette -foreground -profile /COMM-CENTRAL/TMP-DIR/tmp7zkdhvzk.mozrunner
0  1000 1089320 1088912  20   0 3886856 112424 -    Sl   pts/5      0:00 /NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/thunderbird-bin -contentproc -childID 1 -isForBrowser -prefsLen 1 -prefMapSize 248128 -jsInit 338144 -parentBuildID 20210928222550 -appdir /NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin 1088912 tab
1     0 1089402       2  20   0      0     0 -      I    ?          0:00 [kworker/u14:0-events_unbound]
1     0 1089764       2  20   0      0     0 -      I    ?          0:00 [kworker/2:0-ata_sff]
0  1000 1089927 1088912  20   0 3887760 114996 -    Sl   pts/5      0:00 /NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/thunderbird-bin -contentproc -childID 2 -isForBrowser -prefsLen 5331 -prefMapSize 248128 -jsInit 338144 -parentBuildID 20210928222550 -appdir /NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin 1088912 tab
1     0 1090048       2  20   0      0     0 -      I    ?          0:00 [kworker/u14:3-events_unbound]

I have |mach|, which invokes mochitest;
|xpcshell| running as an httpd server listening on port 8888;
a |python3| process running the pywebsocket_wrapper.py code;
an |ssltunnel| process;
and |valgrind| running thunderbird-bin (I modified the script to run thunderbird-bin rather than thunderbird) as a marionette client (or server)? (.../thunderbird-bin -marionette ...)

Then I have TWO (2) thunderbird content processes (for browser-like graphics display, I think):
thunderbird-bin -contentproc -childID 1 -isForBrowser
thunderbird-bin -contentproc -childID 2 -isForBrowser

Please note that there are two such thunderbird processes running (three thunderbird-bin processes if you also count the one started with -marionette).
One of these browser processes, I think, is the browser hidden behind the main window.

Big question:
I am very curious whether, when FF is run under valgrind for its own mochitest, there are this many FF processes.
I suspect that the FF instance invoked as the marionette client (or server) may also serve as the graphics display.
That is, we may not have this many processes during an FF mochitest run.

If my guess above is correct, some of the timeout-handling issues, where a child process returns no output at all under TB, may be related to a mismatch in the process configuration under mochitest.
But I could be wrong.

It would be nice if a detailed explanation of the overall configuration of processes executed during a mochitest run were available somewhere.
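
In the meantime, a quick way to capture the process tree for a running test is psutil (a third-party Python module; the pid is that of the "python3 .../mach ... mochitest" process in the ps excerpt above):

    import psutil                                    # third-party module

    mach = psutil.Process(1046429)                   # pid of the |mach| process above
    for child in mach.children(recursive=True):
        print(child.pid, child.name(), " ".join(child.cmdline()[:4]))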

Attachment #9243641 - Attachment filename: image.png → Windows shown during TB mochitest testing.

Of course, the |_read| timeout, probably caused by the child's exit status not propagating well, might be worked around by simply increasing some timeout values, as in
https://bugzilla.mozilla.org/show_bug.cgi?id=915866
and many other places.

But if FF under valgrind runs mochitest successfully (I assume this is the case), and I still observe issues like case A in comment 34, I am inclined to think there must be a TB-specific problem lurking in the overall assumptions of mochitest's runtests.py.

See Also: → 1675598

I still have to verify this, but it may be that the thread creation code is written in such a way that it keeps creating threads until a certain number of thread creations succeed WITHIN a given timeout.
If a thread does not get created within the timeout, the attempt is initially counted as a failure.
But the thread may eventually get created after the timeout anyway.
valgrind may slow thread creation down so much that a whole batch of initial attempts fails to complete within the timeout yet still gets created eventually.
This could explain the excessive number of threads; a sketch of this retry pattern is at the end of this comment.
See bug 1675598.

Now I have to find out where.
From bug 1675598, the initial suspect may be the creation of the HTTP parser thread(s).
There may be others.
The issue appeared with the transition from mozmill to mochitest.
So the HTML parser is probably a good suspect, IMHO.
Now I have to figure out where the repeated creation attempts take place.
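
The pattern I am suspecting looks roughly like this (a deliberately simplified sketch of mine, not actual Gecko code): the creator gives up after a short timeout and retries, but the "failed" thread still comes into existence later, so under valgrind the retries pile up.

    import threading, time

    started = []

    def slow_start(delay):
        # Simulates the extra delay valgrind adds before a new thread really runs.
        time.sleep(delay)
        started.append(threading.current_thread().name)
        time.sleep(60)                      # the thread then stays alive for a while

    def create_worker(timeout=0.5, retries=5, slowdown=2.0):
        for attempt in range(retries):
            t = threading.Thread(target=slow_start, args=(slowdown,),
                                 name=f"worker-{attempt}", daemon=True)
            t.start()
            t.join(timeout)                 # wait a little for it to come up
            if started:                     # did any worker report in yet?
                return t
            # Treated as a failure and retried -- but the earlier threads are
            # still on their way up, so every retry leaves one more live thread.
        raise RuntimeError("could not start a worker in time")

    create_worker()
    print("live threads:", [th.name for th in threading.enumerate()])

With a 2-second "valgrind slowdown" and a 0.5-second creation timeout, this ends up with several worker threads alive instead of one, which is the kind of multiplication that could blow past valgrind's --max-threads limit.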
