Closed Bug 1245477 Opened 4 years ago Closed 3 years ago

Crashes in lul::DerefTUW

Categories

(Core :: Gecko Profiler, defect)

x86_64
Linux
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla54
Tracking Status
firefox52 --- fixed
firefox-esr52 --- fixed
firefox53 --- fixed
firefox54 --- fixed

People

(Reporter: nical, Assigned: jseward)

References

(Blocks 1 open bug)

Details

(Keywords: crash)

Crash Data

Attachments

(1 file, 1 obsolete file)

https://crash-stats.mozilla.com/report/list?product=Firefox&signature=lul%3A%3ADerefTUW#tab-reports

I've had several crashes in the last few days with the same signature on nightly (Linux). The crash happens in unwinding related code, which sounds like it might be related to the gecko profiler addon, the latter being installed, indeed.

I don't have clear steps to reproduce. Once I was watching stuff on air mozilla, a few times I was going through my emails (fastmail.com), so no clear correlation there, unless its a tab in the background (fastmail being always open as an app tab).
Mozilla/5.0 (X11; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0
Build ID: 20160211030242

Unfortunately, I couldn't install the gecko profiler addon on the latest Nightly on my Ubuntu 14.04 64 bit machine, I'm getting the message that the add-on appears to be corrupt. Tried to install it from: https://github.com/bgirard/Gecko-Profiler-Addon

Moving this to Gecko Profiler for further investigation.
Crash Signature: [@lul::DerefTUW]
Component: Untriaged → Gecko Profiler
Keywords: crash
OS: Unspecified → Linux
Hardware: Unspecified → x86_64
Got this too.
(and disabled Gecko profiler for now)
(In reply to Simona B [:simonab ] from comment #1)
> Unfortunately, I couldn't install the gecko profiler addon on the latest
> Nightly [...] I'm getting the message that the add-on appears to be corrupt

For the record, the place to get the modern profiler addon is now: https://new.cleopatra.io/
(That has a link to the addon download, which is hosted in mstange's github-space.)

(I just hit this crash with the current version of the addon installed & latest Nightly, too -- so this still happening.)
I tried the profiler on nightly, but I can't get it to crash, after
quite some minutes of surfing and 7 tabs open.  Does anyone have a
repeatable way to make it crash?
I just got this too. Was looking at a Google map. I'm guessing that it's mostly a matter of what code happens to be running when you sample the stack, not so much what you're doing, but I don't know. Sadly, the stack above doNativeBacktrace looks totally corrupt.

All of the crash addresses are 0xfffff something. It appears to be dereferencing 

  *(uintptr_t)(aStackImg->mContents + aAddr.Value() - aStackImg->mStartAvma))

Is it possible for aStackImg->mContents to be NULL? That would match the crash addrs.
Just hit this on Nightly.
(In reply to Steve Fink [:sfink] [:s:] from comment #5)
> I just got this too. Was looking at a Google map. I'm guessing that it's
> mostly a matter of what code happens to be running when you sample the
> stack, not so much what you're doing, but I don't know. Sadly, the stack
> above doNativeBacktrace looks totally corrupt.
> 
> All of the crash addresses are 0xfffff something. It appears to be
> dereferencing 
> 
>   *(uintptr_t)(aStackImg->mContents + aAddr.Value() - aStackImg->mStartAvma))
> 
> Is it possible for aStackImg->mContents to be NULL? That would match the
> crash addrs.

StackImage::mContents is an array:

https://dxr.mozilla.org/mozilla-central/source/tools/profiler/lul/LulMain.h#173

so mContents being NULL seems unlikely.  IIUC, that expression is trying to reference something inside of mContents, and it's being bounds-checked prior to the dereference:

https://hg.mozilla.org/mozilla-central/annotate/9c06e744b1be/tools/profiler/lul/LulMain.cpp#l1014

And the crashing address from https://crash-stats.mozilla.com/report/index/ae5f5753-0778-4057-9d52-d0f4d2170131 is 0xffffffffffff79a7, which is just a bit over the N_STACK_BYTES size of mContents.  So it looks like the bounds checks here are not quite right?
Flags: needinfo?(jseward)
Heh. Thanks for reading the actual code. It raises more mysteries, though.

  if (aAddr.Value() < aStackImg->mStartAvma) {
    return TaggedUWord();
  }
  return TaggedUWord(*(uintptr_t*)(aStackImg->mContents + aAddr.Value() - aStackImg->mStartAvma));

From the first, we get Value >= mStartAvma, so Value - mStartAvma is positive. How the heck are we getting a negative address with mContents + positive, then? Value and mStartAvma are both uintptr_t, so I don't see a weird overflow issue. I guess Value and/or mStartAvma are bogus, so that mContents + Value > 2**64 and mStartAvma is smallish?

Anyway, I think I'd better leave this for jseward.
Nathan and I looked at the bounds checks in DerefTUW in some detail.
It seems to us that the checks are correct -- no off-by-ones -- but
they are insufficient.  In particular they fail to guard against the
case where |aAddr| itself wraps around.  For example, consider aAddr
being 0xFFFF...FFFF (right at the end of the address space).  It is
not (quite) beyond the realms of possibility that the unwinder picks
up such a value from the stack (stack end marker?) or register, and
tries to dereference it.

Then what we have is

  // lower limit check
  if 0xFFFF....FFFF  <unsigned  (aAddr.Value() < aStackImg->mStartAvma
     reject

So that doesn't kick in, and

  // upper limit check
  if 0xFFFF....FFFF + sizeof(uintptr_t) >unsigned ...
     reject

we are testing 7 > ..., which is very unlikely, so that doesn't
kick in either.  So we've accepted an obviously-invalid address.

Unfortunately I still can't reproduce this.  I can get an immediate
segfault though by placing this at the top of the function

  aAddr = TaggedUWord(~ (uintptr_t)1);

which validates the above theory a bit.
Flags: needinfo?(jseward)
Attached patch bug1245477-1.cset (obsolete) — Splinter Review
Guards against |aAddr| wraparounds.

How shall we proceed here, given that I can't check that it
fixes the reported problem?
Comment on attachment 8832508 [details] [diff] [review]
bug1245477-1.cset

Review of attachment 8832508 [details] [diff] [review]:
-----------------------------------------------------------------

Could you use a CheckedInt<uintptr_t> here?
(In reply to Julian Seward [:jseward] from comment #10)
> How shall we proceed here, given that I can't check that it
> fixes the reported problem?

I'd say land it, because we agree that the problem can theoretically happen; watch crash stats, and if the crash still occurs, file a new bug to do a new investigation.
Now using CheckedInt<uintptr_t> as suggested in comment #11.
Attachment #8832508 - Attachment is obsolete: true
Attachment #8832889 - Flags: review?(nfroyd)
Comment on attachment 8832889 [details] [diff] [review]
bug1245477-2.cset

Review of attachment 8832889 [details] [diff] [review]:
-----------------------------------------------------------------

FWIW, examining the register values in a crash report and correlating them with the assembly indicates that receiving 0xfff...ff is exactly what's happening.

::: tools/profiler/lul/LulMain.cpp
@@ +1029,5 @@
> +  // avoid overflow.  In particular if |aAddr| is 0xFFF...FFF or the
> +  // 3/7 values below that, then we will get overflow.  See bug #1245477.
> +  CheckedInt<uintptr_t> highest_requested_plus_one
> +    = CheckedInt<uintptr_t>(aAddr.Value())
> +      + CheckedInt<uintptr_t>(sizeof(uintptr_t));

FWIW, you ought to be able to say:

... = CheckedInt<uintptr_t>(aAddr.Value()) + sizeof(uintptr_t);

and the expected coercions will happen.

A local:

typedef CheckedInt<uintptr_t> CheckedWord;

and using that might make this a little less verbose.
Attachment #8832889 - Flags: review?(nfroyd) → review+
https://hg.mozilla.org/mozilla-central/rev/4f8613cea566
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
Please request Aurora/Beta approval on this when you're comfortable doing so.
Assignee: nobody → jseward
Flags: needinfo?(jseward)
Comment on attachment 8832889 [details] [diff] [review]
bug1245477-2.cset

Approval Request Comment

[Feature/Bug causing the regression]:
Range checks in lul::DerefTUW are inadequate, causing invalid
acceptance of addresses that wrap around the end of the address space.

[User impact if declined]:
For users of the Gecko profiler on Linux-{x86_64,i686} -- occasional
crashes when profiling.  For all other scenarios -- none.

[Is this code covered by automated tests?]:
No

[Has the fix been verified in Nightly?]:
Yes

[Needs manual test from QE? If yes, steps to reproduce]: 
I never managed to reproduce it with the profiler.  The fix
was generated by inspecting the code and crash dump carefully.

[List of other uplifts needed for the feature/fix]:
None

[Is the change risky?]:
No

[Why is the change risky/not risky?]:
Very tiny fix, Linux only, just adds extra validity checks for this function.

[String changes made/needed]:
None
Flags: needinfo?(jseward)
Attachment #8832889 - Flags: approval-mozilla-beta?
Attachment #8832889 - Flags: approval-mozilla-aurora?
Comment on attachment 8832889 [details] [diff] [review]
bug1245477-2.cset

Fix a crash. Aurora53+.
Attachment #8832889 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment on attachment 8832889 [details] [diff] [review]
bug1245477-2.cset

address range check fix in profiler, beta52+
Attachment #8832889 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
You need to log in before you can comment on or make changes to this bug.