Closed Bug 510627 Opened 15 years ago Closed 15 years ago

Windows CE hanging on some SSL sites

Categories

(Core :: Networking, defect)

ARM
Windows CE
defect
Not set
critical

RESOLVED FIXED
Tracking Status
status1.9.2 --- beta1-fixed

People

(Reporter: Dolske, Assigned: Dolske)

References

Details

(Keywords: verified1.9.2, Whiteboard: [nv])

Attachments

(2 files)

We've found a number of bugs where the Tegra device locks up hard (i.e., mouse pointer frozen, kernel debugger can't connect to it) after visiting an SSL site. Not every SSL site does this, however.
Whiteboard: [nv]
I've previously sent some info out via email regarding this, appending here for posterity.

I took a spin through the existing hang bugs and verified that the workaround mentioned at the end of these emails seems to make all the sites work fine. It's most likely a common root cause, so I've duped them all to this bug.
Severity: normal → critical
Priority: P1 → --
Kalle noticed that the kernel's KITL thread has a priority level of 131, and our thread was setting a priority level of 116 (0 is the highest priority). This ends up being the reason that the kernel debugger wasn't working...

If I change the CeSetThreadPriority() call to use 132 instead of 116 (so it's 1 priority level lower than the KITL thread), I can still reproduce the hang, but now I can use the kernel debugger to poke at the device. Yay, progress!

[I'm not having any luck getting symbols or breakpoints working, though. This is partially a result of almost always breaking into some point in the kernel (for which I don't have symbols), but symbols for my Mozilla build also seem to be missing. The debugger seems to be very finicky, so I'll try again with a full debug build, maybe that will help.]

Looking at the system process/thread info is revealing, though. The high-priority (132) thread normally has a very low CPU user-time total (i.e., it doesn't actually do much), and the main Firefox thread (where we actually do almost everything) accumulates CPU user-time as it's used. But in the hang condition, the high-priority thread accumulates CPU time rapidly while all the other threads' counters stay frozen.

So, this very much looks like one thread is spinning on the CPU and starving everything else. There seems to be a potential loop between nsSSLIOLayerPoll() and nsSSLThread::requestPoll() -- the last line of requestPoll() is actually calling nsSSLIOLayerPoll() again, so it's possible the code is relying on something else being scheduled in between.
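
To illustrate the starvation described above, here is a minimal sketch of a strict-priority scheduler in portable C++. It is a toy model, not Windows CE or Mozilla code: `SimThread`, `simulate`, and `runScenario` are hypothetical names, the tick counts are made up, and the value 251 for an ordinary application thread is an assumption. The only figures taken from this bug are the priorities 116, 131, and 132 and the rule that 0 is the highest priority.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model of a thread, for illustration only.
struct SimThread {
    std::string name;
    int priority;        // Windows CE convention: 0 is the highest priority
    bool alwaysRunnable; // true models a busy-polling thread that never blocks
    int wantsTicks;      // amount of work a non-spinning thread wants to do
    int ticksRun;        // CPU ticks actually received
};

// Strict-priority scheduling: on every tick, the runnable thread with the
// numerically lowest priority value runs. Threads below it run only when
// nothing above them is runnable.
void simulate(std::vector<SimThread>& threads, int ticks) {
    assert(ticks >= 0);
    for (int t = 0; t < ticks; ++t) {
        SimThread* next = nullptr;
        for (auto& th : threads) {
            bool runnable = th.alwaysRunnable || th.ticksRun < th.wantsTicks;
            if (runnable && (next == nullptr || th.priority < next->priority)) {
                next = &th;
            }
        }
        if (next == nullptr) break; // nothing left to run
        ++next->ticksRun;
    }
}

// Runs the scenario from this bug with the SSL thread at the given priority.
std::vector<SimThread> runScenario(int sslPriority, int ticks) {
    std::vector<SimThread> threads = {
        {"SSL thread (busy-polling)", sslPriority, true, 0, 0},
        {"KITL (kernel debugger)", 131, false, 10, 0},
        {"main Firefox thread", 251, false, 500, 0}, // 251 is an assumed value
    };
    simulate(threads, ticks);
    return threads;
}
```

In this model, `runScenario(116, 1000)` gives every tick to the spinning thread, so neither the KITL thread nor the main thread ever runs. `runScenario(132, 1000)` lets the KITL thread finish its 10 ticks, which matches the debugger becoming usable, while the main thread is still starved.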
Kaie, are you familiar with this code? Lots of history in this bug, but the last paragraph of comment 7 seems to be the key. It certainly looks suspicious, but perhaps I'm missing something.
Blocks: 499852
Assignee: nobody → dolske
Attachment #394972 - Flags: superreview?(cbiesinger)
Attachment #394972 - Flags: review?(vladimir)
Component: General → Networking
QA Contact: general → networking
Comment on attachment 394972 [details] [diff] [review]
Patch v.1 (remove CeSetThreadPriority)

Get rid of the #include <windows.h> at the top of the file as well; it got added for this, iirc.
Attachment #394972 - Flags: review?(vladimir) → review+
Attachment #394972 - Flags: superreview?(cbiesinger) → superreview+
The SSL thread is used to decouple Mozilla's single-threaded network engine from the SSL read/write calls (which may sometimes block on a UI prompt).

The "request..." functions are called by the main/network thread to decide which I/O requests are OK to send to the decoupled SSL thread.

A pollable event is used by the SSL thread to wake up the network thread when some previously requested I/O is ready to be fetched.

Important question:
  On your platform, did the SSL thread succeed in creating a "pollable event"?

In other words, is
  nsSSLIOLayerHelpers::mSharedPollableEvent
null or non-null ?

On platforms where we fail to get a pollable event we need to live with a busy loop while waiting for I/O to complete. But we use sleep calls to reduce that effect. See want_sleep_and_wakeup_on_any_socket_activity.
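
As a rough illustration of the fallback described above (polling with sleeps in between, instead of blocking on a pollable event), here is a minimal sketch in portable C++. It is not the PSM/NSS code: `pollWithSleep` and `PollResult` are hypothetical names, and the atomic flag merely stands in for "previously requested I/O is ready".

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical result type: whether the flag was seen set, and how many
// polls it took.
struct PollResult {
    bool ready;
    int polls;
};

// Checks `ready` up to `maxPolls` times, sleeping for `nap` after each
// unsuccessful check. The sleep is what yields the CPU to other threads;
// without it, the loop is a pure spin.
PollResult pollWithSleep(const std::atomic<bool>& ready,
                         std::chrono::milliseconds nap,
                         int maxPolls) {
    assert(maxPolls > 0);
    for (int polls = 1; polls <= maxPolls; ++polls) {
        if (ready.load(std::memory_order_acquire)) {
            return {true, polls};
        }
        std::this_thread::sleep_for(nap);
    }
    return {false, maxPolls};
}
```

With a loop like this, the sleep gives lower-priority threads a chance to run even when the polling thread has a high priority. A loop that polls without sleeping gives them none under strict-priority scheduling.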


You said:
> the last line of requestPoll() is actually
> calling nsSSLIOLayerPoll() again

Yes, but only in some limited scenarios.
You should see that in most scenarios the poll call is not reached, because requestPoll will "return early".

Question 2:
Can you test? In your busy loop, does requestPoll call poll (often)?
Pushed Patch v.1: http://hg.mozilla.org/mozilla-central/rev/30a01dc450d7

(leaving open for the moment)
Using Mozilla/5.0 (Windows; U; WindowsCE 6.0; en-US; rv:1.9.2a2pre) Gecko/20090827 Namoroka/3.6a2pre as well as yesterday's build, I visited a variety of SSL sites and encountered no issues.
I think we can just call this fixed. Comment 11 indicates the busy-poll can be normal, so with the high priority we were assigning that thread, nothing else would be able to run (including the network stack, most likely). Good enough for me.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Keywords: verified1.9.2