Closed Bug 1591177 Opened 5 years ago Closed 5 years ago

Conditional jump or move depends on uninitialised value(s) js::jit::FinishOffThreadBuilder(JSRuntime*, js::jit::IonBuilder*, js::AutoLockHelperThreadState const&) (Ion.cpp:457)

Categories

(Core :: JavaScript Engine: JIT, defect, P2)

x86_64
Linux
defect

Tracking


RESOLVED INVALID

People

(Reporter: ishikawa, Unassigned)

Details

Attachments

(2 files)

This is with M-C and C-C updated a couple of days ago, but the same issue was visible last month with the then-current source tree.
(I used gcc-8 then, but now use gcc-9; the problem is reported by valgrind.)

Found by valgrind while I was testing a locally built FULL DEBUG version of TB with the |make mozmill| test suite.

I have excerpted only the first few lines of one of a couple of dozen similar warnings printed by valgrind.
==7335== Conditional jump or move depends on uninitialised value(s)
==7335== at 0xE572F3F: js::jit::FinishOffThreadBuilder(JSRuntime*, js::jit::IonBuilder*, js::AutoLockHelperThreadState const&) (Ion.cpp:457)
==7335== by 0xDEEA9AE: js::CancelOffThreadIonCompile(mozilla::Variant<JSScript*, JS::Realm*, JS::Zone*, js::ZonesInState, JSRuntime*, js::CompilationsUsingNursery> const&) (HelperThreads.cpp:361)
==7335== by 0xE0E4D72: js::TypeZone::addPendingRecompile(JSContext*, JSScript*) (HelperThreads.h:569)
==7335== by 0xE0D3F5F: js::ConstraintTypeSet::addType(js::AutoSweepBase const&, JSContext*, js::TypeSet::Type) (TypeInference.cpp:765)
==7335== by 0xE0E73DA: js::jit::JitScript::MonitorBytecodeTypeSlow(JSContext*, JSScript*, unsigned char*, js::StackTypeSet*, js::TypeSet::Type) (TypeInference.cpp:3435)
==7335== by 0xE42A865: js::jit::TypeMonitorResult(JSContext*, js::jit::ICMonitoredFallbackStub*, js::jit::BaselineFrame*, JS::Handle<JSScript*>, unsigned char*, JS::Handle<JS::Value>) (TypeInference-inl.h:625)
==7335== by 0xE42CEEF: js::jit::DoCallFallback(JSContext*, js::jit::BaselineFrame*, js::jit::ICCall_Fallback*, unsigned int, JS::Value*, JS::MutableHandle<JS::Value>) (BaselineIC.cpp:3236)
==7335== by 0x3C106999CE46: ???
==7335== by 0x27934F5F: ???
==7335== by 0x3C1069994A23: ???
==7335== by 0xE5CA6F3: EnterJit(JSContext*, js::RunState&, unsigned char*) (Jit.cpp:109)
==7335== by 0xDD9F8FC: Interpret(JSContext*, js::RunState&) (Interpreter.cpp:3160)
==7335== by 0xDDA319D: js::RunScript(JSContext*, js::RunState&) (Interpreter.cpp:424)
==7335== by 0xDDA3891: js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct, js::CallReason) (Interpreter.cpp:592)
==7335== by 0xDDA3E6D: InternalCall(JSContext*, js::AnyInvokeArgs const&, js::CallReason) (Interpreter.cpp:620)
...

The top three lines are the same for all of these errors/warnings.

I understand from the comments in
https://bugzilla.mozilla.org/show_bug.cgi?id=1578953
that it could be a false positive due to the nature of generated Rust code, but for the sake of completeness of bug reports related to valgrind, I am submitting this bugzilla entry.

The following is the bugzilla entry that mentions the valgrind and rust issue.

https://bugzilla.mozilla.org/show_bug.cgi?id=1340231
Ensure valgrind works well with rust code in Firefox

Can you give the m-c revision? Ion.cpp:457 is an empty line on m-c tip (well, Searchfox).

Flags: needinfo?(ishikawa)

(In reply to Jan de Mooij [:jandem] from comment #2)

Can you give the m-c revision? Ion.cpp:457 is an empty line on m-c tip (well, Searchfox).

I was puzzled by that, too.

  454	  // Clean up if compilation did not succeed.
  455	  if (script->isIonCompilingOffThread()) {
  456	    script->jitScript()->clearIsIonCompilingOffThread(script);
* 457	
  458	    AbortReasonOr<Ok> status = builder->getOffThreadStatus();
  459	    if (status.isErr() && status.unwrapErr() == AbortReason::Disable) {
  460	      script->disableIon();
  461	    }

I have been updating the M-C and C-C trees every two or three days.
Since the quoted log was taken on Oct 21, I can only say that it was taken with the then-current revision (maybe a couple of days old) that I had fetched previously.
However, I have checked and found that the same warning at the same line 457 existed in the local log of Sept 25.

So the problem has been persisting for a while, and the reported line number has not changed.

And since both reports point at line 457, I think there is something fishy about this particular line number.
(I understand the line number can be a bit fuzzy because of code motion due to even simple optimization.
I suspect it could be either 456 or 458. Most likely 456?)

All I can say is that the problem is present in a relatively new head that was possibly only a couple of days old on Oct 21.

TIA

Flags: needinfo?(ishikawa)
Priority: -- → P2

Looking at the stack containing js::CancelOffThreadIonCompile, my blind guess would be to look at the initialization of the value returned by builder->getOffThreadStatus().

And maybe it is due to the way the value is copied on line 458 and tested on line 459, which can be one of the known issues of Valgrind when faced with a clever compiler.
It used to be the case with some comparisons produced by LLVM; I am not sure about gcc.

(In reply to Nicolas B. Pierron [:nbp] from comment #4)

Looking at the stack containing js::CancelOffThreadIonCompile, my blind guess would be to look at the initialization of the value returned by builder->getOffThreadStatus().

And maybe it is due to the way the value is copied on line 458 and tested on line 459, which can be one of the known issues of Valgrind when faced with a clever compiler.
It used to be the case with some comparisons produced by LLVM; I am not sure about gcc.

Thank you for your analysis.
I will keep my eyes open on this issue.

There seem to be more clear-cut uninitialized-value issues uncovered by running C-C TB under valgrind during |mach xpcshell-tests|.
Once they are taken care of, I will come back to this.
(It is a bit above my head to figure out how a clever compiler optimization can fool valgrind.
I suspect valgrind's built-in on-the-fly compiler, or whatever similar mechanism it uses to emulate CPU behavior, needs some improvement.)

Thank you again.

The following post regarding a false positive of valgrind may be relevant.

https://bugs.kde.org/show_bug.cgi?id=413642
gcc 9 with -O2 results in false positive: Uninitialised value was created by a stack allocation

(But I think I turned the optimization down to -Og to see if it helps. Maybe -Og is still too eager? I will come back to the issue of the optimization level once I have taken care of the other, easier uninitialized-access issues.)

With gcc-9 and the latest valgrind, I still see the issue.

I now believe the issue is real, and not caused by some clever optimization, because the
memory area is intentionally marked as "undefined" from the viewpoint of valgrind.

I enabled the option --track-origins=yes for valgrind, and obtained the attached trace.

Now it is clear that the memory reference in question is within the area claimed by BumpChunk, and the line that causes the memory reference is as follows.

Where the reference takes place:

IonCompileTask.cpp
  170	
  171	  // Clean up if compilation did not succeed.
  172	  if (script->isIonCompilingOffThread()) {
  173	    script->jitScript()->clearIsIonCompilingOffThread(script);
  174	
  175	    AbortReasonOr<Ok> status = task->mirGen().getOffThreadStatus();
* 176	    if (status.isErr() && status.unwrapErr() == AbortReason::Disable) {       <====
  177	      script->disableIon();
  178	    }
  179	  }
  180	

The attached trace shows that the uninitialized memory is created by setBump():

22:54.15 pid:1649024 ==1649024==  Uninitialised value was created by a client request
22:54.15 pid:1649024 ==1649024==    at 0xB5D39A7: js::detail::BumpChunk::setBump(unsigned char*) (LifoAlloc.h:324)

Now looking at the code in LifoAlloc.h, I realize that the code EXPLICITLY
marks the memory area as undefined, in order to catch uninitialized memory accesses when memory is newly
added/bumped.

So I suspect there IS an incorrect assumption in the code.
I think we happen to have working code today, but if the memory allocation routine changes behavior, we may see real failures.

  298	
  299	  // Cast |this| into a uint8_t* pointer.
  300	  //
  301	  // Warning: Are you sure you do not want to use begin() instead?
  302	  const uint8_t* base() const { return reinterpret_cast<const uint8_t*>(this); }
  303	  uint8_t* base() { return reinterpret_cast<uint8_t*>(this); }
  304	
  305	  // Update the bump pointer to any value contained in this chunk, which is
  306	  // above the private fields of this chunk.
  307	  //
  308	  // The memory is poisoned and flagged as no-access when the bump pointer is
  309	  // moving backward, such as when memory is released, or when a Mark is used
  310	  // to unwind previous allocations.
  311	  //
  312	  // The memory is flagged as undefined when the bump pointer is moving
  313	  // forward.
  314	  void setBump(uint8_t* newBump) {
  315	    assertInvariants();
  316	    MOZ_ASSERT(begin() <= newBump);
  317	    MOZ_ASSERT(newBump <= capacity_);
  318	#if defined(LIFO_HAVE_MEM_CHECKS)
  319	    // Poison/Unpoison memory that we just free'd/allocated.
  320	    if (bump_ > newBump) {
  321	      LIFO_MAKE_MEM_NOACCESS(newBump, bump_ - newBump);
  322	    } else if (newBump > bump_) {
  323	      MOZ_ASSERT(newBump - RedZoneSize >= bump_);
* 324	      LIFO_MAKE_MEM_UNDEFINED(bump_, newBump - RedZoneSize - bump_);  <===
  325	      // The area [newBump - RedZoneSize .. newBump[ is already flagged as
  326	      // no-access either with the previous if-branch or with the
  327	      // BumpChunk constructor. No need to mark it twice.
  328	    }
  329	#endif
  330	    bump_ = newBump;
  331	  }
  332	
 

I think this setBump() extends the allocated memory area when (newBump > bump_) holds.
If the newly bumped area were filled with quasi-random values, we would see real failures. I am not sure what |RedZoneSize| is. Maybe it is a no-man's-land, so to speak, to make sure that memory allocations (and possible memory overruns) do not interfere with each other.
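
For illustration, here is a minimal sketch of how a bump allocator can use the Valgrind client-request macros from valgrind/memcheck.h to poison freed memory and mark freshly bumped memory as undefined, which is the same mechanism setBump() above uses through the LIFO_MAKE_MEM_* wrappers. This is not the actual LifoAlloc code; the class and the RedZone value are made up for the example. When not running under Valgrind, the client requests are effectively no-ops.

  #include <valgrind/memcheck.h>
  #include <cstdint>
  #include <cstdlib>

  class BumpArena {
    uint8_t* base_;
    uint8_t* bump_;
    uint8_t* capacity_;
    static const size_t RedZone = 16;  // no-man's-land between allocations

   public:
    explicit BumpArena(size_t size)
        : base_(static_cast<uint8_t*>(malloc(size))),
          bump_(base_),
          capacity_(base_ + size) {
      // Nothing has been handed out yet: the whole chunk is off limits.
      VALGRIND_MAKE_MEM_NOACCESS(base_, size);
    }

    void* allocate(size_t bytes) {
      uint8_t* result = bump_;
      uint8_t* newBump = bump_ + bytes + RedZone;
      if (newBump > capacity_) {
        return nullptr;
      }
      // The caller gets |bytes| of *undefined* memory; reading it before
      // writing it is exactly what Memcheck then reports.  The trailing
      // red zone stays no-access so overruns are caught as well.
      VALGRIND_MAKE_MEM_UNDEFINED(result, bytes);
      bump_ = newBump;
      return result;
    }

    void releaseAll() {
      // Moving the bump pointer backward: poison everything again.
      VALGRIND_MAKE_MEM_NOACCESS(base_, bump_ - base_);
      bump_ = base_;
    }

    ~BumpArena() { free(base_); }
  };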

(In reply to ISHIKAWA, Chiaki from comment #8)

I think this setBump() extends the allocated memory area when (newBump > bump_) holds.

Yes, and to shrink it as well, and flag everything as no_access.

If the newly bumped area were filled with quasi-random values, we would see real failures. I am not sure what |RedZoneSize| is.

RedZoneSize is the size reserved between two allocations to provide a meaningful red zone for ASan and Valgrind. Thus any write/read over these red zones should cause the code to segfault or report an issue.
https://searchfox.org/mozilla-central/source/js/src/ds/LifoAlloc.h#267-268

Maybe it is a no-man's-land, so to speak, to make sure that memory allocations (and possible memory overruns) do not interfere with each other.

This is already done in Debug builds of SpiderMonkey:
https://searchfox.org/mozilla-central/source/js/src/ds/LifoAlloc.h#245,254

The error seen with LLVM was that AbortReasonOr<Ok> was implemented with a compact version of the Result type.
https://searchfox.org/mozilla-central/source/js/src/ds/LifoAlloc.h#267-268
https://searchfox.org/mozilla-central/source/mfbt/Result.h#324-327
https://searchfox.org/mozilla-central/source/mfbt/Result.h#198-200
https://searchfox.org/mozilla-central/source/mfbt/Result.h#176-180

This compact version caused the compiler to implement the 0-size Ok type with 1 byte to prevent pointer aliasing, which is then read as part of a multi-byte comparison, and reported by valgrind as a comparison against undefined bytes.
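
For illustration only, here is a simplified sketch of such a compact layout (hypothetical names and enum values; this is not the actual mfbt/Result.h or AbortReason code). The point is that isErr() and unwrapErr() both reduce to one-byte loads from adjacent fields, and the error byte is only ever written on the error path:

  #include <cstdint>

  enum class AbortReason : uint8_t { NoAbort, Alloc, Error, Disable };

  struct Ok {};  // zero-size success type

  class CompactResult {
    AbortReason err_;  // written only on the error path
    bool isOk_;        // the tag byte right after it

   public:
    CompactResult() : isOk_(true) {}  // note: err_ is left untouched
    explicit CompactResult(AbortReason r) : err_(r), isOk_(false) {}

    bool isErr() const { return !isOk_; }
    AbortReason unwrapErr() const { return err_; }
  };

  bool shouldDisable(const CompactResult& status) {
    // Both operands are side-effect-free byte reads, so an optimizing
    // compiler may load and compare err_ before isOk_; on the success
    // path err_ was never written, which is what Memcheck flags as a
    // conditional jump depending on uninitialised data.
    return status.isErr() && status.unwrapErr() == AbortReason::Disable;
  }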

Can you dump the assembly code of this condition, where the problem is reported:
status.isErr() && status.unwrapErr() == AbortReason::Disable

(In reply to Nicolas B. Pierron [:nbp] from comment #9)

(In reply to ISHIKAWA, Chiaki from comment #8)

I think this setBump() extends the allocated memory area when (newBump > bump_) holds.

Yes, and to shrink it as well, and flag everything as no_access.

If the newly bumped area were filled with quasi-random values, we would see real failures. I am not sure what |RedZoneSize| is.

RedZoneSize is the size reserved between two allocations to provide a meaningful red zone for ASan and Valgrind. Thus any write/read over these red zones should cause the code to segfault or report an issue.
https://searchfox.org/mozilla-central/source/js/src/ds/LifoAlloc.h#267-268

Maybe it is a no-man's-land, so to speak, to make sure that memory allocations (and possible memory overruns) do not interfere with each other.

This is already done in Debug builds of SpiderMonkey:
https://searchfox.org/mozilla-central/source/js/src/ds/LifoAlloc.h#245,254

The error seen with LLVM was that AbortReasonOr<Ok> was implemented with a compact version of the Result type.
https://searchfox.org/mozilla-central/source/js/src/ds/LifoAlloc.h#267-268
https://searchfox.org/mozilla-central/source/mfbt/Result.h#324-327
https://searchfox.org/mozilla-central/source/mfbt/Result.h#198-200
https://searchfox.org/mozilla-central/source/mfbt/Result.h#176-180

This compact version caused the compiler to implement the 0-size Ok type with 1 byte to prevent pointer aliasing, which is then read as part of a multi-byte comparison, and reported by valgrind as a comparison against undefined bytes.

Thank you for the detailed explanation. I now see a lot goes on under the hood, so to speak.

Can you dump the assembly code of this condition, where the problem is reported:
status.isErr() && status.unwrapErr() == AbortReason::Disable

Will do.

Version of GCC used:
gcc (Debian 9.3.0-10) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

The relevant options to gcc were -O2 -g.
(Actually, many more options were used.)

I issued disassemble/s and disassemble (without /s) in gdb after setting a breakpoint here:

 where
#0  js::jit::FinishOffThreadTask(JSRuntime*, js::jit::IonCompileTask*, js::AutoLockHelperThreadState const&)
    (runtime=0x555555bbda00, task=task@entry=0x555555e95688, locked=...)
    at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/js/src/jit/IonCompileTask.cpp:150
#1  0x00007ffff07bf657 in js::jit::LinkIonScript(JSContext*, JS::Handle<JSScript*>)
    (cx=0x555555993020, calleeScript=...) at /usr/include/c++/9/bits/atomic_base.h:413
#2  0x00007ffff0805a8c in js::jit::Compile(JSContext*, JS::HandleScript, js::jit::BaselineFrame*, uint32_t, jsbytecode*, bool)
    (cx=0x555555993020, script=..., osrFrame=0x7fffffff7558, osrFrameSize=<optimized out>, osrPc=<optimized out>, forceRecompile=<optimized out>) at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/js/src/jit/Ion.cpp:1899
#3  0x00007ffff08061bc in BaselineCanEnterAtEntry
    (frameSize=72, frame=0x7fffffff7558, script=..., cx=0x555555993020)
    at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/js/src/jit/Ion.cpp:2051
#4  IonCompileScriptForBaseline(JSContext*, js::jit::BaselineFrame*, uint32_t, jsbytecode*)
    (cx=<optimized out>, frame=0x7fffffff7558, frameSize=72, pc=<optimized out>)
    at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/js/src/jit/Ion.cpp:2176
#5  0x00000013c98acae5 in  ()
#6  0x00007fffffff7570 in  ()

Probably, the relevant part is as follows.


/NEW-SSD/NREF-COMM-CENTRAL/mozilla/js/src/jit/IonCompileTask.cpp:
175	    AbortReasonOr<Ok> status = task->mirGen().getOffThreadStatus();
   0x00007ffff081383a <+378>:	mov    0x20(%r12),%rax

/NEW-SSD/moz-obj-dir/objdir-tb3/dist/include/mozilla/Result.h:
377	  bool isErr() const { return !mImpl.isOk(); }
   0x00007ffff081383f <+383>:	cmpb   $0x3,0x31(%rax)
   0x00007ffff0813843 <+387>:	jne    0x7ffff08137af <js::jit::FinishOffThreadTask(JSRuntime*, js::jit::IonCompileTask*, js::AutoLockHelperThreadState const&)+239>
   0x00007ffff0813849 <+393>:	cmpb   $0x1,0x32(%rax)
   0x00007ffff081384d <+397>:	je     0x7ffff08137af <js::jit::FinishOffThreadTask(JSRuntime*, js::jit::IonCompileTask*, js::AutoLockHelperThreadState const&)+239>

/NEW-SSD/NREF-COMM-CENTRAL/mozilla/js/src/vm/SharedStencil.h:
113	  void setFlag(EnumType flag, bool b) {
   0x00007ffff0813853 <+403>:	orl    $0x20000,0x34(%r13)

It is a bit strange that I don't see |status.unwrapErr()|; maybe it is a simple wrapper that retrieves a value?

Hope this helps.

It looks like GCC is indeed reordering the two sides of the &&:

if (status.isErr() && status.unwrapErr() == AbortReason::Disable) {

Note that AbortReason::Disable is 3, and the ok bool is stored in the next byte and tested later.

It looks like GCC is indeed reordering the two sides of the &&:

This is a standard transformation (I have looked at it quite a bit). The upcoming Valgrind 3.16 does not report false positives on these. It is not yet released, but you can pull the latest trunk, with those fixes included, via git clone git://sourceware.org/git/valgrind.git.

(In reply to Jan de Mooij [:jandem] from comment #12)

It looks like GCC is indeed reordering the two sides of the &&:

if (status.isErr() && status.unwrapErr() == AbortReason::Disable) {

Note that AbortReason::Disable is 3, and the ok bool is stored in the next byte and tested later.

Dear Jan,

Wow. Given that there is no side effect from accessing the memory location (well, on embedded computers memory locations are sometimes mapped to I/O device ports; in that case we either declare the location volatile or use special access functions to ensure a proper delay between accesses to that I/O port address), and that the body of the if-statement is not executed unless BOTH conditions hold, I guess the reordering is OK.

But I still find it a bit surprising, because I am so used to the notion that "A && B" means A is evaluated and B is evaluated only if A is true.
I thought maybe some cache-line issue might explain the reordering, but comparing the machine instructions that follow, I don't believe that is the reason.

Now, does this mean the following?
When there is an error, unwrapErr() accesses the memory location where the error is stored.
When there is NO error, the location read by unwrapErr() has not been written since setBump() marked it undefined.

Voila, the code generated by GCC-9 causes a reference to an undefined memory location (!)

Is my understanding correct?

If so, I suspect I should submit a bug report to GCC bugzilla.

Or should I?
I changed my mind after reading Julian's post.

(In reply to Julian Seward [:jseward] from comment #13)

It looks like GCC is indeed reordering the two sides of the &&:

This is a standard transformation (I have looked at it quite a bit). The upcoming Valgrind 3.16 does not report false positives on these. It is not yet released, but you can pull the latest trunk, with those fixes included, via git clone git://sourceware.org/git/valgrind.git.

Thank you, Julian.

The last time I pulled, a few months ago, I could not get it to compile on Debian GNU/Linux (I am using the testing repo).
That was due to MPI header macro issues which affected other packages, too.
But a few weeks ago, when I tried to add support for sched_getattr() and sched_setattr(), thinking that a strange issue I saw with valgrind was due to missing support for these syscalls, valgrind did build under Debian GNU/Linux.
So I am pulling from git again and compiling.

BUT, do you think this transformation is OK? Can you check my understanding of the scenario which I wrote to Jan, above?

Suppose I have code like this:

typedef struct {
  char has_an_error;
  char error_code;
} result_code;

And suppose my code allocates a |result_code| with result_code *resultp = malloc(sizeof(result_code)), sets resultp->has_an_error = 0 at the time of allocation, and sets resultp->error_code ONLY WHEN it sets has_an_error to a non-zero value to signify there IS an error.

Then is gcc-9 allowed to rewrite the following code with the said transformation?

if (resultp->has_an_error && resultp->error_code == CERTAIN_VALUE)

into

if (resultp->error_code == CERTAIN_VALUE && resultp->has_an_error)

Oh well, come to think of it, valgrind needs to deal with such idiosyncrasies of a compiler, since memory definedness is outside the scope of a C or C++ compiler.

Looking at the final code sequence, with the two expressions reduced to mere memory accesses, reordering them does not seem to buy anything. Maybe at an early stage of optimization isErr() is a function call whereas
unwrapErr() is a mere wrapper around a simple memory access, so evaluating unwrapErr() first was a win from the viewpoint of optimization; later, after transformations such as inlining, isErr() also becomes a simple memory access.
Then it makes sense.

Interesting.

TIA

As I understand it, the C/C++ expression A && B can be transformed to B && A if the compiler can prove that A is false whenever B is undefined, and that B is not going to trap or otherwise change the state of the computation in any visible way.
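
As a hedged illustration of that condition (hypothetical types, not taken from the SpiderMonkey sources): the reordering is only available when the right-hand side is a plain, non-trapping read, as in the first function below, and not when it dereferences a pointer that the left-hand side is guarding, as in the second.

  #include <cstdint>

  struct Status {
    uint8_t code;  // written only on the error path
    bool ok;       // tag byte
  };

  // May be reordered: both operands are side-effect-free loads from the
  // same live object, so reading |code| early cannot trap, and if |ok| is
  // true the overall result is false no matter what |code| contains.
  bool disableA(const Status& s) {
    return !s.ok && s.code == 3;
  }

  // Must NOT be reordered: evaluating the right-hand side first would
  // dereference |p| even when it is null, which can trap.
  bool disableB(const Status* p) {
    return p != nullptr && p->code == 3;
  }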

(In reply to ISHIKAWA, Chiaki from comment #14)

Looking at the final code sequence, with the two expressions reduced to mere memory accesses, reordering them does not seem to buy anything. Maybe at an early stage of optimization isErr() is a function call whereas
unwrapErr() is a mere wrapper around a simple memory access, so evaluating unwrapErr() first was a win from the viewpoint of optimization; later, after transformations such as inlining, isErr() also becomes a simple memory access.
Then it makes sense.

Maybe GCC assumes the condition on the right is more likely to be false than the one on the left. Then reordering makes sense because in most cases only one branch is executed instead of two.

(In reply to Julian Seward [:jseward] from comment #15)

As I understand it, the C/C++ expression A && B can be transformed to B && A if the compiler can prove that A is false whenever B is undefined, and that B is not going to trap or otherwise change the state of the computation in any visible way.

I see. So the GCC-9 compiler deduced exactly that (!).
(In reply to Jan de Mooij [:jandem] from comment #16)

(In reply to ISHIKAWA, Chiaki from comment #14)

Looking at the final code sequence, with the two expressions reduced to mere memory accesses, reordering them does not seem to buy anything. Maybe at an early stage of optimization isErr() is a function call whereas
unwrapErr() is a mere wrapper around a simple memory access, so evaluating unwrapErr() first was a win from the viewpoint of optimization; later, after transformations such as inlining, isErr() also becomes a simple memory access.
Then it makes sense.

Maybe GCC assumes the condition on the right is more likely to be false than the one on the left. Then reordering makes sense because in most cases only one branch is executed instead of two.

I see. So maybe this was the result of optimization.

Hmm... Compiler optimization technology has advanced much more than I expected.
I won't file an issue in gcc's bug tracker. :-)

I believe there are many opportunities for optimization, since in Mozilla's build framework many source files are concatenated and compiled together (the original intention was to save the processing time of myriads of complex header files).

Thank you again.

I will check if the locally compiled new valgrind still reports the error.
The version reported is
$ valgrind --version
valgrind-3.16.0.GIT

So I believe this is the latest valgrind from git.
(I updated the source.)

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → INVALID

(In reply to ISHIKAWA, Chiaki from comment #18)

I will check if the locally compiled new valgrind still reports the error.
The version reported is
$ valgrind --version
valgrind-3.16.0.GIT

So I believe this is the latest valgrind from git.
(I updated the source.)

Strange: the locally built valgrind (the version shown above) still prints the same memory access error. I will simply ignore it by creating an appropriate suppression for now.

Thank you again for all the tips.

