Possible use-after-free with C-C TB (a new twist unlike bug 1298103)
Categories
(MailNews Core :: Backend, defect)
Tracking
(Not tracked)
People
(Reporter: ishikawa, Unassigned)
References
(Depends on 1 open bug)
Details
(Keywords: regression, regressionwindow-wanted, Whiteboard: [regression:TB??])
Attachments
(1 file)
7.00 KB,
text/plain
|
Details |
I see valgrind still reports the errors in Bug 1298103 Use after free in getenv during |make mozmill| test suite run But something has changed in the code. I see that the allocated buffer area is released during a GC sweep (?) and then the pointer that points to the 8th byte of the released area is referenced later. I don't know why, though, the access to the area is happening often in graphics related code. Given that the buffer in question seems to have something to do with keeping property/attribute of something during runtime, maybe the "something" is related to graphics object. I am attaching a single case for someone's inspection. There are many such examples recorded during |make mozmill| run. Strange thing is that the bad pointer is often referenced by getenv() or environment-related functions as in the case of Bug 1298103. At the time of my report of Bug 1298103, the pattern reported here did not exist. The execution time skew (slowing down) of running C-C TB under valgrind might cause the periodic GC-related activity to behave erratically. For example, a periodic mark/sweep may run from periodic timer event and will terminate well before the next firing of timer. But due to the excessive slow down, the GC mark/sweep may be still in progress when the next invocation of mark/sweep is triggered by next timer, etc. Or maybe there is a simple coding bug. I believe that M-C FF is checked under valgrind maybe once a month (?) If the bug does not manifest in the M-C FF case, there could be something C-C TB is doing uniquely, but the processing in the stacktrace of allocation and freeing seems to be very normal. I wonder why it is not noticed during FF testing under valgrind. TIA TIA
Reporter | ||
Comment 1•7 years ago
|
||
I thought I added the valgrind error message list, but I could not find it on bugzilla. So here it goes again.
Updated•7 years ago
|
Updated•7 years ago
|
Reporter | ||
Comment 2•7 years ago
|
||
Looking at the code in comm-central/mozilla/js/src/jsgc.cpp I have a feeling there must be some time-race in this incremental GC. Otherwise, I can't explain why the code leaves a pointer to a freed block, and that the bogus pointer is referenced later and caused valgrind to report it. Of course, valgrind skews the simulated execution time so much against the wall clock so it may uncover hidden assumption about the ordering, etc., but this suggests to me there is room for inserting more assertions in the code to check for possible violated assumptions. Sorry I don't have the time to look into this particular code right now. TIA
Reporter | ||
Comment 3•7 years ago
|
||
The code sequence in question may ring a bell of those who are familiar with the code in NativeObject::setLastProperty which sometimes requires to grow the slots by calling js::NativeObject::growSlots, and then which is later accessed by CC's js::gc::ArenaLists::background Finalize. cf. the following is from the attachment. Freeing: the pointer points at an area inside an already free'e block. Free was called in the following function call stack. Note there are GC-related function stackframes. Address 0x443e8c38 is 8 bytes inside a block of size 64 free'd at 0x4C2CD3A: free (vg_replace_malloc.c:530) by 0x14B196E2: bool FinalizeTypedArenas<JSObject>(js::FreeOp*, js::gc::Arena**, js::gc::SortedArenaList&, js::gc::AllocKind, js::SliceBudget&, js::gc::ArenaLists::KeepArenasEnum) (Utility.h:257) by 0x14B1A2DD: FinalizeArenas(js::FreeOp*, js::gc::Arena**, js::gc::SortedArenaList&, js::gc::AllocKind, js::SliceBudget&, js::gc::ArenaLists::KeepArenasEnum) (jsgc.cpp:613) by 0x14B1BDAC: js::gc::ArenaLists::backgroundFinalize(js::FreeOp*, js::gc::Arena*, js::gc::Arena**) (jsgc.cpp:2827) and then which was malloc'ed earlier, ALLOCATION: The area was originally allocated by the following function call. Block was alloc'd at at 0x4C2BB8D: malloc (vg_replace_malloc.c:299) by 0x1500EC20: js::Nursery::allocateBuffer(JSObject*, unsigned long) (Utility.h:229) by 0x14CFE279: js::HeapSlot* js::AllocateObjectBuffer<js::HeapSlot>(JSContext*, JSObject*, unsigned int) (Nursery-inl.h:56) by 0x14CFE50C: js::NativeObject::growSlots(JSContext*, unsigned int, unsigned int) (NativeObject.cpp:400) by 0x14B992F8: js::NativeObject::setLastProperty(JSContext*, js::Shape*) (NativeObject-inl.h:468) by 0x14DBAFE6: js::NativeObject::getChildProperty(JSContext*, JS::Handle<js::NativeObject*>, JS::Handle<js::Shape*>, JS::MutableHandle<js::StackShape>) (Shape.cpp:349) by 0x14DBB429: js::NativeObject::addPropertyInternal(JSContext*, JS::Handle<js::NativeObject*>, JS::Handle<jsid>, bool (*)(JSContext*, JS::Handle<JSObject*>,
Reporter | ||
Comment 5•4 years ago
|
||
(In reply to Wayne Mery (:wsmwk) from comment #4)
Chiaki, does this issue still exist?
I will check this again in a few days.
|make mozmill| is gone and I had to modify all my local test scripts accordingly.
PLUS, the clumsy handling of corona virus infection by government here in Japan has really diverted my attention to this literally life and death issue for the last several weeks.
Finally, I have begun telecommuting most of the week since last week,
and thus have more time (less commute time) to use for TB debuggin/development.
I will report my finding hopefully later this week.
Reporter | ||
Updated•4 years ago
|
Reporter | ||
Comment 6•4 years ago
|
||
I have not been able to run |mach mochitest|, the replacement of |make mozmill| test suite under varlgrind due to a strange issue.
Bug 1629433: Running thunderbird binary under valgrind while trying to run mochitest requires way TOO MANY threads (> 1500) !?
https://bugzilla.mozilla.org/show_bug.cgi?id=1629433
Once that is fixed, and I can run TB under valgrind during |mach mochitest| I can say for sure if the problem still persists or not.
Reporter | ||
Comment 7•4 years ago
|
||
I am still stuck.:
bug 1630345
./mach bootstrap fails with python-pip dependency issue: python-pip : Depends: python-pip-whl (= 18.1-5) but 20.0.2-4 is to be installed
bug 1629433
Running thunderbird binary under valgrind while trying to run mochitest requires way TOO MANY threads (> 1500) !?
(Well, actually, I can increase the # of threads to 5,000 and the test proceeded. But once a stack trace is printed, and external stack address decoder is invoked, and that external stack address decoder needs to be installed by |./mach bootstrap| which unfortunately fails.
.
Comment 8•4 years ago
|
||
(In reply to ISHIKAWA, Chiaki from comment #7)
I am still stuck.:
bug 1630345
With a priority of P3, that bug doesn't have a great chance of being fixed in coming months, unless something moves the priority needle.
Reporter | ||
Comment 9•4 years ago
|
||
I got stuck with another bug: bug 1633092
TB build failure: GLSL optimizer output causes a compiler error (GCC-9) error: comparison of integer expressions of different signedness: ‘long int’ and ‘size_t’ {aka ‘long unsigned int’} [-Werror=sign-compare]
Hmm...
Reporter | ||
Updated•4 years ago
|
Reporter | ||
Comment 10•4 years ago
|
||
(In reply to Wayne Mery (:wsmwk) from comment #4)
Chiaki, does this issue still exist?
Wayne, I don't see it any more. (I have figured out to work around the build bugs for now.)
Bug 1578951 might have fixed the underlying cause.
So we can close it for now.
Comment 11•4 years ago
|
||
-> WFM then
Reporter | ||
Comment 12•4 years ago
|
||
Sorry, I was wrong.
The mochitest framework is not quite valgrind-friendly.
I realize that I may not see the bug BECAUSE the test times out prematurely due to the global mochitest timeout value, which I need to enlarge somehow.
Bug 1636945 - make timeout of mochitest much longer for running mochitest under valgrind successfully.
So until I can cover the whole of mochitest under valgrind, this bug needs to be reopened.
Comment 13•2 years ago
|
||
(In reply to ISHIKAWA, Chiaki from comment #12)
...
So until I can cover the whole of mochitest under valgrind, this bug needs to be reopened.
How is that going? What, if anything, from the Thunderbird team would help your work moved forward?
Reporter | ||
Comment 14•2 years ago
|
||
The problem was reported by mozmill.
But we now have switched to mochitest framework.
From my old logs, I found that the errors were reported by
two tests.
mail/test/mozmill/content-tabs/test-about-support.js
mail/test/mozmill/content-tabs/test-plugin-crashing.js
But they are no longer in mochitest.
I wonder if someone can recreate the tests for mochitest framework, or if the equivalent of ech already exists under a different name.
Running mochitest under valgrind takes more than 24 hours including the failed timeout tests.
And thus the past test did NOT cover all the tests due to timeout (setting proper timeout value for each test is rather difficult. I have been trying to see if I can set reasonable timeout, but it is on the back burner.)
Oh yes, there will be a very useful feature if we can set
timeout for each mochitest test as in xpcshell.ini for xpcshell-test.
Actually, something better than xpcshell.ini.: xpcshell.ini is for general processing (not under valgrind).
We specifically want to state that valgrind timeout for each test.
I also find xpcshell.ini gets edited quite often and my effort to put in proper xpcshell timout locally for valgrind is
really difficult to sync with the latest changes.
Thus I am not so sure if I am setting proper timeout value for valgrind execution of xpcshell-test now.
(But I have more other bugs/paches on hand right now.)
So if somone can re-create the equivlalent tests above in mochitest framework, I can try
running them individually.
I searched for PR_SetEnv that appears in the valgrind stack in my local logs.
The test name that caused valgrind is inserted.
log842-mozmill.txt.gz
TEST-START | /NREF-COMM-CENTRAL/comm-central/ | test_display_about_supportmail/test/mozmill/content-tabs/test-about-support.js
...
==28905== by 0x40B5DB8: PR_SetEnv (prenv.c:110)
fun:PR_SetEnv
log973-mozmill-memcheck.txt
TEST-START | /NREF-COMM-CENTRAL/comm-central/mail/test/mozmill/content-tabs/test-plugin-crashing.js | setupTest
...
==25564== by 0x5B2A2D1: __add_to_environ (setenv.c:143)
==25564== by 0x40B2998: PR_SetEnv (prenv.c:110)
fun:PR_SetEnv
log974-mozmill-memcheck-opt.txt
TEST-START | /NREF-COMM-CENTRAL/comm-central/mail/test/mozmill/content-tabs/test-plugin-crashing.js | setupTest
==4154== by 0x5B2A2D1: __add_to_environ (setenv.c:143)
==4154== by 0x40B2998: PR_SetEnv (prenv.c:110)
fun:PR_SetEnv
This one is just FYI:
log637-mozmill-memcheck.txt.gz
--11108-- used_suppression: 9 PR_SetEnv requires its argument to be leaked, but does not appear on stacks. (See bug 793534 and see bug 793549.) /home/ishikawa/TB-NEW/TB-3HG/new-src/mozilla/build/valgrind/cross-architecture.sup:8 suppressed: 160 bytes in 9 blocks
TIA
Comment 15•2 years ago
|
||
Tests were all converted over, changing the naming convetion
mail/test/mozmill/content-tabs/test-about-support.js -> browser_aboutSupport.js
mail/test/mozmill/content-tabs/test-plugin-crashing.js (deleted since plugins do not work at all anymore?)
Reporter | ||
Comment 16•2 years ago
|
||
(In reply to Magnus Melin [:mkmelin] from comment #15)
Tests were all converted over, changing the naming convetion
mail/test/mozmill/content-tabs/test-about-support.js -> browser_aboutSupport.js
mail/test/mozmill/content-tabs/test-plugin-crashing.js (deleted since plugins do not work at all anymore?)
Thank you, Magnus!
I will check running browser_aboutSupport.js test under valgrind and see how far I can debug.
Reporter | ||
Comment 17•2 years ago
|
||
Short story:
It seems that this particular problem is no longer reported by valgrind. Good.
Long story:
What is not so good is this.
I could run valgrind + TB on August 15.
Something has changed since then.
I think the binary tool chain (probably rust compiler?) has changed.
Suddenly today when I tried running the particular test in the mochitest test suite, during the execution of mochitest with "valgrind + TB" combination terminated because valgrind could not read certain dwarf2 data in libxul.a.
It did not happen on August 15 with then compiled binaries.
There have been problems of reading dwaref2 data by valgrind.
https://bugzilla.mozilla.org/buglist.cgi?quicksearch=dwarf&list_id=16216759
One of them is reported by me (!) : Bug 1534109
But luckily, for the last several months, valgrind did not crash.
Well, for the last several months I have been compiling C-C TB without -gsplit-dwarf, but I recompiled the whole binary
WITH -gpslit-dwarf, and the fatal crash of valgrind disappeared.
(There are still cases where valgrind could not read some debug info and reported as such.
But it does not get killed due to fatal error any more.)
In the log, only the previously reported valgrind warnings are visible now.
So, this is problem is no longer visible, but I have a bitter taste in my mouth.
I wonder if valgrind is working perfectly these days.
We need more people testing TB under valgrind.
My previous comment. : Maybe I should submit a bugzilla as Request for Enhancment.
Oh yes, there will be a very useful feature if we can set
timeout for each mochitest test as in xpcshell.ini for xpcshell-test.
Actually, something better than xpcshell.ini.: xpcshell.ini is for general processing (not under valgrind).
We specifically want to state that valgrind timeout for each test.
I also find xpcshell.ini gets edited quite often and my effort to put in proper xpcshell timout locally for valgrind is
really difficult to sync with the latest changes.
Thus I am not so sure if I am setting proper timeout value for valgrind execution of xpcshell-test now.
(But I have more other bugs/paches on hand right now.)
Description
•