Closed Bug 790649 Opened 7 years ago Closed 7 years ago

Worker thread crash @ffi_call

Categories

(Toolkit :: OS.File, defect, blocker)

x86
Windows 7
defect
Not set
blocker

Tracking

()

VERIFIED FIXED
mozilla18
Tracking Status
firefox16 --- verified
firefox17 --- verified
firefox18 --- verified
firefox-esr10 --- unaffected

People

(Reporter: mats, Assigned: Yoric)

References

(Blocks 1 open bug, )

Details

(Keywords: crash, regression, sec-other, Whiteboard: [advisory-tracking-])

Attachments

(4 files, 4 obsolete files)

Attached file stack
STR
1. load URL

(mozilla-inbound, rev 2bbc276ffd2a, debug build)

ACTUAL RESULTS

First-chance exception at 0x7736f9d2 in firefox.exe: 0xC0000008: An invalid handle was specified.

ffi_call
js::ctypes::CDataFinalizer::CallFinalizer
js::ctypes::CDataFinalizer::Methods::Dispose
js::CallJSNative
js::InvokeKernel
js::Interpret
...

see attached stack for details
Maybe, marking dependent for now.  I'll re-test when that bug is fixed.
Depends on: 790633
Blocks: 765416
Sounds like it's in my code. I'll investigate.
QA Contact: dteller
Attached patch Experimental fix (obsolete) — Splinter Review
Attaching an experimental fix. Testing as soon as Windows build is complete.
Assignee: nobody → dteller
Attachment #661791 - Flags: feedback?
I have not been able to reproduce the problem yet, and I suspect that it depends on how many images are present in the thumbnails cache. I will keep trying, but in the meantime, I have attached an experimental fix, if anybody wants to test it.
(In reply to David Rajchenbach Teller [:Yoric] from comment #5)
> I have not been able to reproduce the problem yet, and I suspect that it
> depends on how many images are present in the thumbnails cache. I will keep
> trying, but in the meantime, I have attached an experimental fix, if anybody
> wants to test it.

I'll test to see if I can still reproduce on the latest mc. Will post back today. If I can reproduce it easily I'll apply your patch and see if it fixes it.

One note, I was poking through crash stats looking for crashes that fit the signature I was seeing but didn't find any. So unless I'm looking for the wrong stack token this doesn't seem wide spread.
Fwiw, I can still reproduce with the attached patch.

Re: crash stats - I'm pretty sure I haven't seen this crash stack before
for this particular test.  There's about 200 or so reports matching
"ffi_call" in the past 4 weeks:
https://crash-stats.mozilla.com/query/query?product=Firefox&version=ALL%3AALL&range_value=4&range_unit=weeks&query_search=signature&query_type=contains&query=ffi_call&reason=&build_id=&process_type=any&hang_type=any&do_query=1

"_PR_SocketSetSocketOption | ffi_call_win32" seems similar to this
in the sense that it has "dom::workers" on the stack.  (the top stack
frame is bogus)
bp-ae38b16c-6324-457e-bb95-7ba3d2120905

"@0x0 | ffi_call_win32 " have reports for recent Nightlies, eg.
bp-65e93cba-9308-47ca-bb51-c7a632120914
these are on the main thread, with "js::CrossCompartmentWrapper::call"
on the stack.
I tend to believe that the PR_SocketSetSocketOption crash is unrelated.
Attached patch Experimental fix, v2 (obsolete) — Splinter Review
Found a second error in the code. Mats, could you try it?
Attachment #661791 - Attachment is obsolete: true
Attachment #661791 - Flags: feedback?
v2 crashed with the same stack
Attached patch Experimental fix, v3 (obsolete) — Splinter Review
Attaching fix v3.
Attachment #662353 - Attachment is obsolete: true
Attachment #662464 - Flags: feedback?(matspal)
v3 looks good so far; the crash used to occur within a couple of minutes,
but now it's still alive after an hour.  Note that I have commented out
the "compartment mismatch" assertion though, since that's easily triggered
by this test (I have a JS stack for that assertion if it helps).

I'll leave the test running overnight...
This effectively stops me from debugging on Windows, since anytime I attach a debugger I just get Invalid Handle errors over and over again.
Severity: critical → blocker
Comment on attachment 662464 [details] [diff] [review]
Experimental fix, v3

fwiw, still no crash with this.  khuey, you want to review?  ;-)
Attachment #662464 - Flags: review?(khuey)
Attachment #662464 - Flags: feedback?(matspal)
Attachment #662464 - Flags: feedback+
Comment on attachment 662464 [details] [diff] [review]
Experimental fix, v3

Review of attachment 662464 [details] [diff] [review]:
-----------------------------------------------------------------

Not really.  I'm not familiar with ctypes or the Windows APIs here.
Attachment #662464 - Flags: review?(khuey)
Assignee: dteller → nobody
Component: DOM: Workers → OS.File
Product: Core → Toolkit
QA Contact: dteller
Attachment #662464 - Flags: review?(nfroyd)
Comment on attachment 662464 [details] [diff] [review]
Experimental fix, v3

Review of attachment 662464 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/osfile/osfile_win_back.jsm
@@ +89,5 @@
>               // == would always return |false|.
>               return INVALID_HANDLE;
>             }
>           return ctypes.CDataFinalizer(maybe, _CloseHandle);
> +       };

Nit^2: I think the 'return' line above this is actually the one that's indented wrong.

@@ +106,5 @@
> +             // == would always return |false|.
> +             return INVALID_HANDLE;
> +           }
> +         return ctypes.CDataFinalizer(maybe, _FindClose);
> +       };

Please change this into something like:

function possibly_invalid_HANDLE_import_from_C(finalizer) {
  function import_from_C(maybe)
    if (Types.int.cast(maybe).value == INVALID_HANDLE) {
      // Explanatory comment
    }
    return ctypes.CDataFinalizer(maybe, finalizer);
  }
  return import_from_C;
}

and use that function in the appropriate places.  I don't have a lot of experience with the Win32 APIs, but hopefully the refactoring will avoid issues in the future.
Attachment #662464 - Flags: review?(nfroyd) → review+
(In reply to Nathan Froyd (:froydnj) from comment #16)
> @@ +106,5 @@
> > +             // == would always return |false|.
> > +             return INVALID_HANDLE;
> > +           }
> > +         return ctypes.CDataFinalizer(maybe, _FindClose);
> > +       };
> 
> Please change this into something like:
> 
> function possibly_invalid_HANDLE_import_from_C(finalizer) {
>   function import_from_C(maybe)
>     if (Types.int.cast(maybe).value == INVALID_HANDLE) {
>       // Explanatory comment
>     }
>     return ctypes.CDataFinalizer(maybe, finalizer);
>   }
>   return import_from_C;
> }
> 
> and use that function in the appropriate places.  I don't have a lot of
> experience with the Win32 APIs, but hopefully the refactoring will avoid
> issues in the future.

This is actually slightly more complicated, due to initialization order of variables, so cleanup and factorization of code will probably involve something a little more OO. Due to the blocker status of this bug, do you mind if I land the patch as is and cleanup in a followup?
(In reply to David Rajchenbach Teller [:Yoric] from comment #17)
> This is actually slightly more complicated, due to initialization order of
> variables, so cleanup and factorization of code will probably involve
> something a little more OO. Due to the blocker status of this bug, do you
> mind if I land the patch as is and cleanup in a followup?

I will take your word for it. :)  Landing this as-is is fine.  Please do file a followup after this bug has been made public.
Attached patch Temporary fix (obsolete) — Splinter Review
Attaching the tested fix. Filed a followup refactoring as bug 792668.
Assignee: nobody → dteller
Attachment #662464 - Attachment is obsolete: true
Attachment #662810 - Flags: review+
I finally have a scenario that could explain why the testsuite passed but FF itself crashes for something that should not be a fatal error: according to http://msdn.microsoft.com/en-us/library/windows/desktop/ms724211%28v=vs.85%29.aspx 

« If the application is running under a debugger, the function will throw an exception if it receives either a handle value that is not valid or a pseudo-handle value. This can happen if you close a handle twice, or if you call CloseHandle on a handle returned by the FindFirstFile function instead of calling the FindClose function. »

In other words, it is possible that the crashes happen only in presence of a debugger.
If confirmed, this is not a security bug after all.
Comment on attachment 662810 [details] [diff] [review]
Temporary fix

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 707681 (original error), 754671 (code is correct but triggers the dormant error).
User impact if declined: Attempting to debug Firefox under Windows fails because of crashes.
Testing completed (on m-c, etc.): With patch applied, we cannot reproduce the bug. Companion patch that makes the error visible to the testsuite will land soon on m-c.
Risk to taking this patch (and alternatives if risky): Very low. No API change. It fixes an error that was otherwise silent.
String or UUID changes made by this patch: None.
Attachment #662810 - Flags: approval-mozilla-beta?
Attachment #662810 - Flags: approval-mozilla-aurora?
(In reply to Nathan Froyd (:froydnj) from comment #16)
> Nit^2: I think the 'return' line above this is actually the one that's
> indented wrong.

Looks like this nit was not addressed?
Here is the same fix, with the whitespace nit^2 addressed. Anyway, the followup bug contains a patch that generally looks nicer.
Attachment #662810 - Attachment is obsolete: true
Attachment #662810 - Flags: approval-mozilla-beta?
Attachment #662810 - Flags: approval-mozilla-aurora?
Attachment #663039 - Flags: review+
Comment on attachment 663039 [details] [diff] [review]
Temporary fix, nit addressed

(Updated nitpick as per my understanding of Gavin's request)

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 707681 (original error), 754671 (code is correct but triggers the dormant error).
User impact if declined: Attempting to debug Firefox under Windows fails because of crashes.
Testing completed (on m-c, etc.): With patch applied, we cannot reproduce the bug. Companion patch that makes the error visible to the testsuite will land soon on m-c.
Risk to taking this patch (and alternatives if risky): Very low. No API change. It fixes an error that was otherwise silent.
String or UUID changes made by this patch: None.
Attachment #663039 - Flags: approval-mozilla-beta?
Attachment #663039 - Flags: approval-mozilla-aurora?
https://hg.mozilla.org/mozilla-central/rev/a5eaeae97650
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Flags: in-testsuite?
Resolution: --- → FIXED
Target Milestone: --- → mozilla18
Comment on attachment 663039 [details] [diff] [review]
Temporary fix, nit addressed

[Triage Comment]
Approving for Aurora/Beta based upon the risk evaluation and the benefit to debugging.
Attachment #663039 - Flags: approval-mozilla-beta?
Attachment #663039 - Flags: approval-mozilla-beta+
Attachment #663039 - Flags: approval-mozilla-aurora?
Attachment #663039 - Flags: approval-mozilla-aurora+
https://hg.mozilla.org/releases/mozilla-aurora/rev/c7280b298af5

This doesn't apply cleanly to beta. Please post a new patch for it.
Post-mortem for this bug, for future reference:

1. The design Microsoft APIs, as often, is "interesting": a |HANDLE| must be closed with |CloseHandle|. However, there are exceptions, and a |HANDLE| obtained through |FindFirstFile| must be closed with |FindClose|.
2. In the Windows back-end of OS.File, I made the double error of closing a find |HANDLE| with |CloseHandle| and not checking the result for possible errors. Consequently, the test suite did not notice the error.
3. The corresponding code was triggered by Tim's work on thumbnails, and also passed Tim's test suite.
4. At some point, people started running that code in the Microsoft Visual Debugger, to catch unrelated bugs. "Interestingly", when the Visual Debugger is launched, the otherwise silent error of attempting to close a find |HANDLE| with |CloseHandle| becomes a C++-style "First-chance exception", i.e. a crash.
5. Hilarity ensues.
looks like the original code here is from FF12, marking esr10 unaffected.
Based on this being exposed by bug 754671, I assume that Firefox 15 is unaffected. Can someone confirm?
Blocks: 754671
Marking for not-tracking based on talking to Dveditz.
Whiteboard: [advisory-tracking-]
Blocks: 707681
In trying to regress this bug, it continues to crash. I'm seeing this on Windows 7, with the following builds:

2012-11-13, 17.0b6
2012-11-15, Aurora
2012-11-15, Nightly

The test URL in this bug seems to cycle through many tests. I don't know if we're guaranteed to be looking at the same tests when the crash occurs.

Can someone look at the the logs to determine if this is the same issue? Or if we should be filing new bug(s) instead? Thank you.

In the meantime, this fix can't be verified.
Are you sure it's the same bug? I don't see any sign in your crash logs that point to ffi_call.
(In reply to Matt Wobensmith from comment #36)
> Can someone look at the the logs to determine if this is the same issue? Or
> if we should be filing new bug(s) instead? Thank you.

I am nearly certain that this is a different issue.
Today I used the exact same builds on Win7 as in comment 36, and no crashes. 

I think the problem is that our baseline test (above) is a URL that contains a randomized suite of tests. Every time it's run, it is different. 

Based on this, I will mark this bug verified and chalk the other crashes up to something I cannot reproduce for now. If they resurface in the future, I'll file a new bug.

Verified original crash on 2012-9-12, nightly

Confirmed fixed:
2012-10-24, 16.02
2012-11-13, 17.0b6
2012-11-15, Aurora
2012-11-15, Nightly
Group: core-security
Blocks: crossfuzz
Keywords: verifyme
You need to log in before you can comment on or make changes to this bug.