Closed Bug 1364543 Opened 8 years ago Closed 2 years ago

Possible use-after-free with C-C TB (a new twist unlike bug 1298103)

Categories

(MailNews Core :: Backend, defect)

x86_64
Linux
defect
Not set
major

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: ishikawa, Unassigned)

References

(Depends on 1 open bug)

Details

(Keywords: regression, regressionwindow-wanted, Whiteboard: [regression:TB??])

Attachments

(1 file)

I see valgrind still reports the errors in Bug 1298103 Use after free in getenv during |make mozmill| test suite run But something has changed in the code. I see that the allocated buffer area is released during a GC sweep (?) and then the pointer that points to the 8th byte of the released area is referenced later. I don't know why, though, the access to the area is happening often in graphics related code. Given that the buffer in question seems to have something to do with keeping property/attribute of something during runtime, maybe the "something" is related to graphics object. I am attaching a single case for someone's inspection. There are many such examples recorded during |make mozmill| run. Strange thing is that the bad pointer is often referenced by getenv() or environment-related functions as in the case of Bug 1298103. At the time of my report of Bug 1298103, the pattern reported here did not exist. The execution time skew (slowing down) of running C-C TB under valgrind might cause the periodic GC-related activity to behave erratically. For example, a periodic mark/sweep may run from periodic timer event and will terminate well before the next firing of timer. But due to the excessive slow down, the GC mark/sweep may be still in progress when the next invocation of mark/sweep is triggered by next timer, etc. Or maybe there is a simple coding bug. I believe that M-C FF is checked under valgrind maybe once a month (?) If the bug does not manifest in the M-C FF case, there could be something C-C TB is doing uniquely, but the processing in the stacktrace of allocation and freeing seems to be very normal. I wonder why it is not noticed during FF testing under valgrind. TIA TIA
I thought I added the valgrind error message list, but I could not find it on bugzilla. So here it goes again.
Component: Untriaged → Backend
Product: Thunderbird → MailNews Core
Blocks: 1353702
Whiteboard: [regression:TB??]
Looking at the code in comm-central/mozilla/js/src/jsgc.cpp I have a feeling there must be some time-race in this incremental GC. Otherwise, I can't explain why the code leaves a pointer to a freed block, and that the bogus pointer is referenced later and caused valgrind to report it. Of course, valgrind skews the simulated execution time so much against the wall clock so it may uncover hidden assumption about the ordering, etc., but this suggests to me there is room for inserting more assertions in the code to check for possible violated assumptions. Sorry I don't have the time to look into this particular code right now. TIA
The code sequence in question may ring a bell of those who are familiar with the code in NativeObject::setLastProperty which sometimes requires to grow the slots by calling js::NativeObject::growSlots, and then which is later accessed by CC's js::gc::ArenaLists::background Finalize. cf. the following is from the attachment. Freeing: the pointer points at an area inside an already free'e block. Free was called in the following function call stack. Note there are GC-related function stackframes. Address 0x443e8c38 is 8 bytes inside a block of size 64 free'd at 0x4C2CD3A: free (vg_replace_malloc.c:530) by 0x14B196E2: bool FinalizeTypedArenas<JSObject>(js::FreeOp*, js::gc::Arena**, js::gc::SortedArenaList&, js::gc::AllocKind, js::SliceBudget&, js::gc::ArenaLists::KeepArenasEnum) (Utility.h:257) by 0x14B1A2DD: FinalizeArenas(js::FreeOp*, js::gc::Arena**, js::gc::SortedArenaList&, js::gc::AllocKind, js::SliceBudget&, js::gc::ArenaLists::KeepArenasEnum) (jsgc.cpp:613) by 0x14B1BDAC: js::gc::ArenaLists::backgroundFinalize(js::FreeOp*, js::gc::Arena*, js::gc::Arena**) (jsgc.cpp:2827) and then which was malloc'ed earlier, ALLOCATION: The area was originally allocated by the following function call. Block was alloc'd at at 0x4C2BB8D: malloc (vg_replace_malloc.c:299) by 0x1500EC20: js::Nursery::allocateBuffer(JSObject*, unsigned long) (Utility.h:229) by 0x14CFE279: js::HeapSlot* js::AllocateObjectBuffer<js::HeapSlot>(JSContext*, JSObject*, unsigned int) (Nursery-inl.h:56) by 0x14CFE50C: js::NativeObject::growSlots(JSContext*, unsigned int, unsigned int) (NativeObject.cpp:400) by 0x14B992F8: js::NativeObject::setLastProperty(JSContext*, js::Shape*) (NativeObject-inl.h:468) by 0x14DBAFE6: js::NativeObject::getChildProperty(JSContext*, JS::Handle<js::NativeObject*>, JS::Handle<js::Shape*>, JS::MutableHandle<js::StackShape>) (Shape.cpp:349) by 0x14DBB429: js::NativeObject::addPropertyInternal(JSContext*, JS::Handle<js::NativeObject*>, JS::Handle<jsid>, bool (*)(JSContext*, JS::Handle<JSObject*>,

Chiaki, does this issue still exist?

Flags: needinfo?(ishikawa)

(In reply to Wayne Mery (:wsmwk) from comment #4)

Chiaki, does this issue still exist?

I will check this again in a few days.

|make mozmill| is gone and I had to modify all my local test scripts accordingly.
PLUS, the clumsy handling of corona virus infection by government here in Japan has really diverted my attention to this literally life and death issue for the last several weeks.
Finally, I have begun telecommuting most of the week since last week,
and thus have more time (less commute time) to use for TB debuggin/development.

I will report my finding hopefully later this week.

Flags: needinfo?(ishikawa)
Flags: needinfo?(ishikawa)

I have not been able to run |mach mochitest|, the replacement of |make mozmill| test suite under varlgrind due to a strange issue.
Bug 1629433: Running thunderbird binary under valgrind while trying to run mochitest requires way TOO MANY threads (> 1500) !?
https://bugzilla.mozilla.org/show_bug.cgi?id=1629433

Once that is fixed, and I can run TB under valgrind during |mach mochitest| I can say for sure if the problem still persists or not.

I am still stuck.:
bug 1630345
./mach bootstrap fails with python-pip dependency issue: python-pip : Depends: python-pip-whl (= 18.1-5) but 20.0.2-4 is to be installed
bug 1629433
Running thunderbird binary under valgrind while trying to run mochitest requires way TOO MANY threads (> 1500) !?
(Well, actually, I can increase the # of threads to 5,000 and the test proceeded. But once a stack trace is printed, and external stack address decoder is invoked, and that external stack address decoder needs to be installed by |./mach bootstrap| which unfortunately fails.
.

(In reply to ISHIKAWA, Chiaki from comment #7)

I am still stuck.:
bug 1630345

With a priority of P3, that bug doesn't have a great chance of being fixed in coming months, unless something moves the priority needle.

See Also: → 1630345

I got stuck with another bug: bug 1633092
TB build failure: GLSL optimizer output causes a compiler error (GCC-9) error: comparison of integer expressions of different signedness: ‘long int’ and ‘size_t’ {aka ‘long unsigned int’} [-Werror=sign-compare]

Hmm...

Flags: needinfo?(ishikawa)
Flags: needinfo?(ishikawa)

(In reply to Wayne Mery (:wsmwk) from comment #4)

Chiaki, does this issue still exist?

Wayne, I don't see it any more. (I have figured out to work around the build bugs for now.)

Bug 1578951 might have fixed the underlying cause.

So we can close it for now.

Flags: needinfo?(ishikawa)

-> WFM then

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WORKSFORME

Sorry, I was wrong.

The mochitest framework is not quite valgrind-friendly.

I realize that I may not see the bug BECAUSE the test times out prematurely due to the global mochitest timeout value, which I need to enlarge somehow.
Bug 1636945 - make timeout of mochitest much longer for running mochitest under valgrind successfully.

So until I can cover the whole of mochitest under valgrind, this bug needs to be reopened.

Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Depends on: 1629433

(In reply to ISHIKAWA, Chiaki from comment #12)

...
So until I can cover the whole of mochitest under valgrind, this bug needs to be reopened.

How is that going? What, if anything, from the Thunderbird team would help your work moved forward?

Flags: needinfo?(ishikawa)

The problem was reported by mozmill.
But we now have switched to mochitest framework.

From my old logs, I found that the errors were reported by
two tests.
mail/test/mozmill/content-tabs/test-about-support.js
mail/test/mozmill/content-tabs/test-plugin-crashing.js

But they are no longer in mochitest.

I wonder if someone can recreate the tests for mochitest framework, or if the equivalent of ech already exists under a different name.

Running mochitest under valgrind takes more than 24 hours including the failed timeout tests.
And thus the past test did NOT cover all the tests due to timeout (setting proper timeout value for each test is rather difficult. I have been trying to see if I can set reasonable timeout, but it is on the back burner.)

Oh yes, there will be a very useful feature if we can set
timeout for each mochitest test as in xpcshell.ini for xpcshell-test.
Actually, something better than xpcshell.ini.: xpcshell.ini is for general processing (not under valgrind).
We specifically want to state that valgrind timeout for each test.
I also find xpcshell.ini gets edited quite often and my effort to put in proper xpcshell timout locally for valgrind is
really difficult to sync with the latest changes.
Thus I am not so sure if I am setting proper timeout value for valgrind execution of xpcshell-test now.
(But I have more other bugs/paches on hand right now.)

So if somone can re-create the equivlalent tests above in mochitest framework, I can try
running them individually.


I searched for PR_SetEnv that appears in the valgrind stack in my local logs.
The test name that caused valgrind is inserted.

log842-mozmill.txt.gz
TEST-START | /NREF-COMM-CENTRAL/comm-central/ | test_display_about_supportmail/test/mozmill/content-tabs/test-about-support.js
...

==28905== by 0x40B5DB8: PR_SetEnv (prenv.c:110)
fun:PR_SetEnv

log973-mozmill-memcheck.txt
TEST-START | /NREF-COMM-CENTRAL/comm-central/mail/test/mozmill/content-tabs/test-plugin-crashing.js | setupTest
...
==25564== by 0x5B2A2D1: __add_to_environ (setenv.c:143)
==25564== by 0x40B2998: PR_SetEnv (prenv.c:110)
fun:PR_SetEnv

log974-mozmill-memcheck-opt.txt
TEST-START | /NREF-COMM-CENTRAL/comm-central/mail/test/mozmill/content-tabs/test-plugin-crashing.js | setupTest

==4154== by 0x5B2A2D1: __add_to_environ (setenv.c:143)
==4154== by 0x40B2998: PR_SetEnv (prenv.c:110)
fun:PR_SetEnv

This one is just FYI:
log637-mozmill-memcheck.txt.gz
--11108-- used_suppression: 9 PR_SetEnv requires its argument to be leaked, but does not appear on stacks. (See bug 793534 and see bug 793549.) /home/ishikawa/TB-NEW/TB-3HG/new-src/mozilla/build/valgrind/cross-architecture.sup:8 suppressed: 160 bytes in 9 blocks

TIA

Flags: needinfo?(ishikawa)

Tests were all converted over, changing the naming convetion
mail/test/mozmill/content-tabs/test-about-support.js -> browser_aboutSupport.js
mail/test/mozmill/content-tabs/test-plugin-crashing.js (deleted since plugins do not work at all anymore?)

(In reply to Magnus Melin [:mkmelin] from comment #15)

Tests were all converted over, changing the naming convetion
mail/test/mozmill/content-tabs/test-about-support.js -> browser_aboutSupport.js
mail/test/mozmill/content-tabs/test-plugin-crashing.js (deleted since plugins do not work at all anymore?)

Thank you, Magnus!

I will check running browser_aboutSupport.js test under valgrind and see how far I can debug.

Short story:

It seems that this particular problem is no longer reported by valgrind. Good.

Long story:

What is not so good is this.
I could run valgrind + TB on August 15.
Something has changed since then.
I think the binary tool chain (probably rust compiler?) has changed.

Suddenly today when I tried running the particular test in the mochitest test suite, during the execution of mochitest with "valgrind + TB" combination terminated because valgrind could not read certain dwarf2 data in libxul.a.
It did not happen on August 15 with then compiled binaries.

There have been problems of reading dwaref2 data by valgrind.
https://bugzilla.mozilla.org/buglist.cgi?quicksearch=dwarf&list_id=16216759
One of them is reported by me (!) : Bug 1534109
But luckily, for the last several months, valgrind did not crash.

Well, for the last several months I have been compiling C-C TB without -gsplit-dwarf, but I recompiled the whole binary
WITH -gpslit-dwarf, and the fatal crash of valgrind disappeared.
(There are still cases where valgrind could not read some debug info and reported as such.
But it does not get killed due to fatal error any more.)

In the log, only the previously reported valgrind warnings are visible now.

So, this is problem is no longer visible, but I have a bitter taste in my mouth.
I wonder if valgrind is working perfectly these days.
We need more people testing TB under valgrind.

My previous comment. : Maybe I should submit a bugzilla as Request for Enhancment.

Oh yes, there will be a very useful feature if we can set
timeout for each mochitest test as in xpcshell.ini for xpcshell-test.
Actually, something better than xpcshell.ini.: xpcshell.ini is for general processing (not under valgrind).
We specifically want to state that valgrind timeout for each test.
I also find xpcshell.ini gets edited quite often and my effort to put in proper xpcshell timout locally for valgrind is
really difficult to sync with the latest changes.
Thus I am not so sure if I am setting proper timeout value for valgrind execution of xpcshell-test now.
(But I have more other bugs/paches on hand right now.)
Status: REOPENED → RESOLVED
Closed: 5 years ago2 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: