Closed Bug 510627 Opened 15 years ago Closed 15 years ago

Windows CE hanging on some SSL sites

Status:

RESOLVED FIXED

Tracking Flags:

Flag          Tracking   Status
status1.9.2   ---        beta1-fixed

People

(Reporter: Dolske, Assigned: Dolske)

References

Details

(Keywords: verified1.9.2, Whiteboard: [nv])

Attachments

(2 files)

Debugging info so far (from email), 15 years ago, Justin Dolske [:Dolske], 4.91 KB, text/plain
Patch v.1 (remove CeSetThreadPriority), 15 years ago, Justin Dolske [:Dolske], 787 bytes, patch (vlad: review+, Biesinger: superreview+)

Justin Dolske [:Dolske]

Assignee

Description

15 years ago

We've found a number of bugs where the Tegra device locks up hard (i.e., mouse pointer frozen, kernel debugger can't connect to it) after visiting an SSL site. Not every SSL site does this, however.

Justin Dolske [:Dolske]

Assignee

Updated

15 years ago

Whiteboard: [nv]

Justin Dolske [:Dolske]

Assignee

Comment 5

15 years ago

Attached file: Debugging info so far (from email)

I've previously sent some info out via email regarding this; appending it here for posterity. I took a spin through the existing hang bugs and verified that the workaround mentioned at the end of this email seems to make all the sites work fine. It's most likely a common root cause, so I've duped them all to this bug.

Justin Dolske [:Dolske]

Assignee

Updated

15 years ago

Severity: normal → critical

Priority: P1 → --

Justin Dolske [:Dolske]

Assignee

Comment 7

15 years ago

Kalle noticed that the kernel's KITL thread has a priority level of 131, while our thread was setting a priority level of 116 (0 is the highest priority). That turns out to be why the kernel debugger wasn't working. If I change the CeSetThreadPriority() call to use 132 instead of 116 (so it's one priority level lower than the KITL thread), I can still reproduce the hang, but now I can use the kernel debugger to poke at the device. Yay -- progress!

[I'm not having any luck getting symbols or breakpoints working, though. This is partly because I almost always break in at some point in the kernel (for which I don't have symbols), but symbols for my Mozilla build also seem to be missing. The debugger seems to be very finicky, so I'll try again with a full debug build; maybe that will help.]

Looking at the system process/thread info is revealing, though. The high-priority (132) thread normally has a very low CPU user-time total (i.e., it doesn't actually do much), and the main Firefox thread (where we actually do almost everything) accumulates CPU user-time as it's used. But under the hang conditions, the high-priority thread accumulates CPU time rapidly while all the other threads' counters stay frozen. So this very much looks like one thread spinning on the CPU and starving everything else.

There seems to be a potential loop between nsSSLIOLayerPoll() and nsSSLThread::requestPoll(): the last line of requestPoll() actually calls nsSSLIOLayerPoll() again, so it's possible the code is relying on something else being scheduled in between.
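The starvation pattern described above can be illustrated with a small strict-priority scheduler simulation. This is only a sketch, not Windows CE or Mozilla code: the `SimThread` type, the `simulate()` and `run_scenario()` helpers, the tick model, the KITL thread's sleep pattern, and the main thread's priority of 251 are all assumptions made for illustration. Only the 116, 131 and 132 priority levels come from this bug.

```cpp
#include <string>
#include <utility>
#include <vector>

// One simulated thread. Lower `priority` values win (0 is the highest).
struct SimThread {
    std::string name;
    int priority;
    int sleep_after_run;  // ticks spent blocked after each tick of work; 0 = never blocks
    int blocked_for;      // remaining blocked ticks
    long ticks_run;       // accumulated "CPU time"

    SimThread(std::string n, int prio, int sleep_ticks)
        : name(std::move(n)), priority(prio), sleep_after_run(sleep_ticks),
          blocked_for(0), ticks_run(0) {}
};

// Strict-priority scheduling: on every tick, run the runnable thread with the
// lowest priority number. A thread that never blocks is always runnable.
void simulate(std::vector<SimThread>& threads, int total_ticks) {
    for (int tick = 0; tick < total_ticks; ++tick) {
        SimThread* next = nullptr;
        for (auto& t : threads) {
            if (t.blocked_for > 0) continue;  // blocked threads cannot be picked
            if (!next || t.priority < next->priority) next = &t;
        }
        // Time passes for every thread that is currently blocked.
        for (auto& t : threads) {
            if (t.blocked_for > 0) --t.blocked_for;
        }
        if (next) {
            ++next->ticks_run;
            next->blocked_for = next->sleep_after_run;
        }
    }
}

// Hypothetical setup: an SSL thread with the given priority and sleep pattern,
// a KITL thread at 131 that works one tick in five, and a main thread at an
// assumed default priority of 251 that always has work to do.
// Returns the tick counts in the order {ssl, kitl, main}.
std::vector<long> run_scenario(int ssl_priority, int ssl_sleep, int total_ticks) {
    std::vector<SimThread> threads;
    threads.emplace_back("ssl", ssl_priority, ssl_sleep);
    threads.emplace_back("kitl", 131, 4);
    threads.emplace_back("main", 251, 0);
    simulate(threads, total_ticks);
    return {threads[0].ticks_run, threads[1].ticks_run, threads[2].ticks_run};
}
```

In this model, `run_scenario(116, 0, 1000)` gives every tick to the spinning SSL thread, so neither the KITL thread nor the main thread ever runs. With `run_scenario(132, 0, 1000)` the KITL thread gets CPU time again while the main thread is still starved, which matches the observation that the debugger became usable but the hang remained.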

Justin Dolske [:Dolske]

Assignee

Comment 8

15 years ago

Kaie, are you familiar with this code? Lots of history in this bug, but the last paragraph of comment 7 seems to be the key. It certainly looks suspicious, but perhaps I'm missing something.

Justin Dolske [:Dolske]

Assignee

Updated

15 years ago

Blocks: 499852

Justin Dolske [:Dolske]

Assignee

Comment 9

15 years ago

Attached patch: Patch v.1 (remove CeSetThreadPriority)

Assignee: nobody → dolske

Attachment #394972 - Flags: superreview?(cbiesinger)

Attachment #394972 - Flags: review?(vladimir)

Justin Dolske [:Dolske]

Assignee

Updated

15 years ago

Component: General → Networking

QA Contact: general → networking

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Comment 10

15 years ago

Comment on attachment 394972: Patch v.1 (remove CeSetThreadPriority)

Get rid of the #include <windows.h> at the top of the file as well; it got added for this, iirc.

Attachment #394972 - Flags: review?(vladimir) → review+

Christian :Biesinger (don't email me, ping me on IRC)

Updated

15 years ago

Attachment #394972 - Flags: superreview?(cbiesinger) → superreview+

Kai Engert [:KaiE:]

Comment 11

15 years ago

The SSL thread is used to decouple Mozilla's single-threaded network engine from the SSL read/write calls (which may sometimes block on a UI prompt).

The "request..." functions are called by the main/network thread to decide which I/O requests are OK to send to the decoupled SSL thread.

A pollable event is used by the SSL thread to wake up the network thread when some previously requested I/O is ready to be fetched.

Important question: On your platform, did the SSL thread succeed in creating a "pollable event"? In other words, is nsSSLIOLayerHelpers::mSharedPollableEvent null or non-null?

On platforms where we fail to get a pollable event, we have to live with a busy loop while waiting for I/O to complete. But we use sleep calls to reduce that effect. See want_sleep_and_wakeup_on_any_socket_activity.

You said:
> the last line of requestPoll() is actually
> calling nsSSLIOLayerPoll() again

Yes, but only in some limited scenarios. You should see that in most scenarios the poll call is not reached, because requestPoll() returns early.

Question 2: Can you test? In your busy loop, does requestPoll call poll (often)?
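The fallback described above (a busy loop softened by sleep calls when no pollable event is available) might look roughly like the following. This is a generic sketch and not the actual nsSSLThread code: `poll_with_sleep`, `ready`, `sleep_fn` and `max_polls` are hypothetical names, and the sleep is injected as a callback so the behaviour can be checked without real timing.

```cpp
#include <functional>

// Poll `ready` until it returns true, calling `sleep_fn` between attempts so
// that other threads get a chance to run. Returns the number of polls made,
// or -1 if `max_polls` attempts were not enough.
int poll_with_sleep(const std::function<bool()>& ready,
                    const std::function<void()>& sleep_fn,
                    int max_polls) {
    for (int polls = 1; polls <= max_polls; ++polls) {
        if (ready()) {
            return polls;  // the requested I/O is ready to be fetched
        }
        sleep_fn();  // give up the CPU instead of spinning
    }
    return -1;
}
```

Under a strict-priority scheduler, a sleep between polls only helps lower-priority threads while the polling thread is actually blocked in it; the moment the sleeper becomes runnable again it preempts them.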

Justin Dolske [:Dolske]

Assignee

Comment 12

15 years ago

Pushed Patch v.1: http://hg.mozilla.org/mozilla-central/rev/30a01dc450d7 (leaving open for the moment)

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Updated

15 years ago

Flags: blocking1.9.2+

Justin Dolske [:Dolske]

Assignee

Comment 13

15 years ago

Pushed to 192: http://hg.mozilla.org/releases/mozilla-1.9.2/rev/3b20798e3368

Keywords: fixed1.9.2

Marcia Knous [:marcia]

Comment 14

15 years ago

Using Mozilla/5.0 (Windows; U; WindowsCE 6.0; en-US; rv:1.9.2a2pre) Gecko/20090827 Namoroka/3.6a2pre as well as yesterday's build, I visited a bunch of various SSL sites and encountered no issues.

Justin Dolske [:Dolske]

Assignee

Comment 15

15 years ago

I think we can just call this fixed. Comment 11 indicates the busy-poll can be normal, so with the high priority we were assigning to that thread, nothing else would be able to run (most likely including the network stack). Good enough for me.

Status: NEW → RESOLVED

Closed: 15 years ago

Resolution: --- → FIXED

Samuel Sidler (old account; do not CC)

Updated

15 years ago

status1.9.2: --- → beta1-fixed

Keywords: fixed1.9.2

Marcia Knous [:marcia]

Updated

15 years ago

Keywords: verified1.9.2
