Closed Bug 579870 Opened 14 years ago Closed 14 years ago

crash [@ normalize_character ] looks like a startup crash

Categories

(MailNews Core :: Database, defect)

1.9.2 Branch
x86
macOS
defect
Not set
critical

Tracking

(blocking-thunderbird3.1 .2+, thunderbird3.1 .2-fixed, thunderbird3.0 unaffected)

RESOLVED FIXED
Thunderbird 3.3a1
Tracking Status
blocking-thunderbird3.1 --- .2+
thunderbird3.1 --- .2-fixed
thunderbird3.0 --- unaffected

People

(Reporter: Usul, Assigned: m_kato)

References

()

Details

(Keywords: crash, regression, Whiteboard: [sg:low (persistent DoS)])

Crash Data

Attachments

(1 file)

0  	thunderbird-bin  	normalize_character  	 mailnews/extensions/fts3/src/Normalize.c:1960
1 	thunderbird-bin 	isDelim 	mailnews/extensions/fts3/src/fts3_porter.c:849
2 	thunderbird-bin 	porterNext 	mailnews/extensions/fts3/src/fts3_porter.c:1021
3 	libsqlite3.dylib 	fts3PendingTermsAdd 	db/sqlite3/src/sqlite3.c:104031
4 	libsqlite3.dylib 	fts3UpdateMethod 	db/sqlite3/src/sqlite3.c:104109
5 	libsqlite3.dylib 	sqlite3Step 	db/sqlite3/src/sqlite3.c:57829
6 	libsqlite3.dylib 	sqlite3_step 	db/sqlite3/src/sqlite3.c:50662
7 	thunderbird-bin 	mozilla::storage::AsyncExecuteStatements::executeStatement 	storage-backport/src/mozStorageAsyncStatementExecution.cpp:345
8 	thunderbird-bin 	mozilla::storage::AsyncExecuteStatements::executeAndProcessStatement 	storage-backport/src/mozStorageAsyncStatementExecution.cpp:292
9 	thunderbird-bin 	mozilla::storage::AsyncExecuteStatements::bindExecuteAndProcessStatement 	storage-backport/src/mozStorageAsyncStatementExecution.cpp:274
10 	thunderbird-bin 	mozilla::storage::AsyncExecuteStatements::Run 	storage-backport/src/mozStorageAsyncStatementExecution.cpp:597
11 	libxpcom_core.dylib 	nsThread::ProcessNextEvent 	xpcom/threads/nsThread.cpp:527
12 	libxpcom_core.dylib 	NS_ProcessNextEvent_P 	nsThreadUtils.cpp:250
13 	libxpcom_core.dylib 	nsThread::ThreadFunc 	xpcom/threads/nsThread.cpp:254
14 	libnspr4.dylib 	_pt_root 	nsprpub/pr/src/pthreads/ptthread.c:228
15 	libSystem.B.dylib 	_pthread_start 	
16 	libSystem.B.dylib 	thread_start
Signature	normalize_character
UUID	102f276b-6067-48a2-b4bb-017c62100712
Time 	2010-07-12 06:08:43.481904
Uptime	18
Last Crash	4355 seconds (1.2 hours) before submission
Install Age	25 seconds since version was first installed.
Product	Thunderbird
Version	3.1.1
Build ID	20100701181850
Branch	1.9.2
OS	Mac OS X
OS Version	10.5.8 9L31a
CPU	x86
CPU Info	GenuineIntel family 6 model 23 stepping 10
Crash Reason	EXC_BAD_ACCESS / KERN_PROTECTION_FAILURE
Crash Address	0x1
User Comments	
Processor Notes	
EMCheckCompatibility	False
Blocks: 525537
Makoto San can you look into it ?
Assignee: nobody → m_kato
Attached patch fixSplinter Review
Attachment #458606 - Flags: review?(bugmail)
Requesting this for 3.1.2 ...
regression per crash-stats.

#2 Mac crash for 3.1 http://bit.ly/b8uBKX ... although it doesn't appear to be severe in comparison to #1 crash. but does it hit some intl users hard?  is patch needed before upgrading  people from 3.0
Keywords: regression
Comment on attachment 458606 [details] [diff] [review]
fix

Thanks for the quick investigation and fix!
Attachment #458606 - Flags: review?(bugmail) → review+
From a prioritization perspective, if we consider malicious attacks, this is a very very bad bug.  It is not a security problem, as such, because it is not exploitable.  It is almost certain to result in a crash because we read off the end of an array and then dereference what we find there.  The array we are reading off the end of is not dynamically allocated, and whatever is stored in the next word will always be stored there.  (Pretty easy to figure that out but it's not important as it appears to just end up being 0x1.)

The danger character is:
0x10000 LINEAR B SYLLABLE B008 A

You can see it in here:
http://www.unicode.org/charts/PDF/U10000.pdf

(We obviously do not want to paste it into this bug because then it will kill your thunderbird.)

I doubt the character comes up much on its own intentionally, but would not be surprised for mis-encoded messages to result in its appearance.

I would probably vote for making sure we spin a release with this fix before upgrading people from 2.0, as they are even more likely to have old messages with serious encoding prolems.
Group: mozilla-confidential
Status: NEW → ASSIGNED
Attachment #458606 - Flags: approval-thunderbird3.1.2?
Group: mozilla-confidential → core-security
Note: this is a DoS bug.
If this truly is frequently a startup or near startup crash, then we should fix it before upgrading users to 3.1.x or the users who see this will get stranded...

Is this something that could be avoided by marking a message as potentially bad before trying to index it, so that if we crash, next time we startup, we don't try to index it immediately?
(In reply to comment #9)
> Is this something that could be avoided by marking a message as potentially bad
> before trying to index it, so that if we crash, next time we startup, we don't
> try to index it immediately?

I had ruled out this particular option as too expensive in the steady state with the assumption we would quickly fix crashers.  Specifically, we would need to close out a SQLite transaction or commit the mork database before we index every single message.  Although we would not technically need to wait for an fsync to happen, I think in both cases the data sources would effectively trigger fsyncs, slowing gloda down by at least an order of magnitude.

Obviously, that assumption sort of glosses over the whole "gloda likes to try and index things it has not indexed at startup".

It definitely is not too expensive to have gloda, during its initial indexing sweep, put a file on disk that says "yo yo yo, I'm doing the initial indexing sweep".  If we start up and that file is still there, we could defer performing the indexing sweep (or perhaps any event driven indexing) until we complete a "is there a new version of thunderbird available?" check and verify that it has run to completion.  Which is, if there is an update, we wait for it to be installed.  If there is not an update, then maybe we say, "what the hay" and go index some more.  Or maybe we go into a paranoid mode of indexing where we do pay the fsync piper.  (Or we pay it on a per-folder basis initially and then next time we just skip that folder or something.)

The reality of the situation is that this is probably the last/only crasher bug that the tokenizer will have until it grows some more features.  Which means this would really be protection against libmime badness, which may not be a bad thing to protect against.

We are limited in what we can do without adding more strings in terms of explaining to the user why gloda has decided to stop indexing certain things.
Whiteboard: [sg:low (persistent DoS)]
Yes, I was mainly thinking about the various libmime crashes we've encountered over the years.

I'm sure there are lots of relatively cheap ways to figure out that we didn't shutdown successfully. I really like the idea of gloda going into paranoid mode of indexing when that is detected, since that gives us some breathing room until we can get an update out, and helps user retention.
Okay, so after some investigation of the crash reporting mechanism, it appears like we have two good ways to figure out when things have gone wrong in general:

a) We see crash reports in "Crash Reports/{pending,submitted}" and their time-stamp is within some time window.  (A few hours?)

b) We create a file when we start a gloda indexing sweep and delete it when we complete the indexing sweep.  (We may also want to cover event-driven indexing...)  This should detect both infinite loops as well as varieties of crashes.

We also have a middling-cost mechanism for figuring out what specifically betrayed us:

c) Open a file using an unbuffered writer, write the folder and possibly message key of the messages we are indexing as we do that.  This should not trigger an fsync but ensure that the OS has the bytes should Thunderbird die.  If Thunderbird manages to crash the OS, everyone is out of luck.  Since this will result in actual I/O traffic, we do not want to be logging a ton of data.  This will result in accumulated disk usage, but we can keep track of that and just truncate the file periodically when the limit is exceeded.


Our available mitigation strategies are:

1) Defer gloda indexing for some time duration...
1a) of all messages (sweep and event-driven).
1b) sweep-style; so new (event-driven) messages still get indexed.
1c) of messages outside of the inbox.
1d) of messages in folders we think have betrayed us.

2) Mark messages as deadly.  This requires that we were writing the message key (with folder) as we indexed messages.


My candidate strategy is:

- Always create a file that indicates we are performing an indexing sweep when we are performing an indexing sweep.  Write the folder url's unbuffered as we do this.  (We could write the id's to be more concise, but the url's are more useful for debugging.)

- When we initialize gloda, do the crash check and the indexing sweep check.  If either indicates something suspicious happened, crank up our paranoia file output so that every time we index a message (sweep or event-driven) we write the message key (and the folder is in there, but not redundantly).  Also, set a preference that tells us to be paranoid and for how long.  (Specifically, two pereferences; one that says to be paranoid and one that gives a timestamp for when we should stop being paranoid.)  We will also check this and possibly clear the paranoia status.

- If our indexing log contains message information, mark the message that killed us as deadly.

- If our indexing log contains only folder information, mark the folder that was being indexed at our death as suspicious with an in-memory (non-persisted) flag.

- If we believe a recent crash/indexing failure occurred, then defer all gloda indexing for 15 minutes.  Once the timer elapses, allow gloda to perform an indexing sweep, but staying out of any suspicious folders.  After another 15 minutes, clear the suspicious bits and trigger another indexing sweep.


The net results are:

- You can always use Thunderbird for at least 15 minutes before it explodes.  An active DoS via malicious e-mails would be 15 minutes.  A single folder full of infinitely many deadly messages would be 30 minutes.  Two folders full of infinitely many deadly messages would be 15 minutes because we will never accumulate enough knowledge to stay out of both of them.

- If someone is killing Thunderbird because of gloda (for legit or confused reasons), this will cause gloda to get out of their way for a while.

- After the first explosion, every new explosion at least lets us mark deadly messages as deadly.

- If the mechanism goes haywire and always activates itself, the side-effect is a limited slowdown to indexing, some approximately fixed disk usage bloat, and delayed gloda responsiveness.


The main downside is that my proposal is fairly complex.  I think the various bits can be made sufficiently orthogonal that it won't be a nightmare, at least.

Feedback appreciated.
So if we do an out-of-sync respin to enable us to fully push out the major update to 3.1.1, are we taking just the crash fix or the gloda robustness fix as well?

I would suggest taking just the crash fix, if so, please can we move the discussion on gloda robustness to another bug? - which we probably want to do anyway.
(In reply to comment #13)
 
> I would suggest taking just the crash fix, if so, please can we move the
> discussion on gloda robustness to another bug? - which we probably want to do
> anyway.

And it would allow more input than from the people cced on this bug and people with bugzilla's security bugs.

(In reply to comment #13)
> So if we do an out-of-sync respin to enable us to fully push out the major
> update to 3.1.1, are we taking just the crash fix or the gloda robustness fix
> as well?

we should just get that one.
The candidate strategy looks reasonable to me. Implied, I assume, is that we ignore deadly messages going forward, at least until the app version changes (perhaps the deadly marking can include the current app version).

Does gloda stream more than one message at a time? If so, that complicates the task of figuring out which message is bad, as does the fact that the storage statements execute on a different thread. Or do we make sure to only index one message at a time, soup to nuts, when in paranoid mode?
(In reply to comment #15)
> The candidate strategy looks reasonable to me. Implied, I assume, is that we
> ignore deadly messages going forward, at least until the app version changes
> (perhaps the deadly marking can include the current app version).

Right.  Since we have 31 usable sentinel gloda id values (the first valid one is 32 and 0 means "does not have a gloda id"), we would just change what stands for deadly in the next rev.

> Does gloda stream more than one message at a time? If so, that complicates the
> task of figuring out which message is bad, as does the fact that the storage
> statements execute on a different thread. Or do we make sure to only index one
> message at a time, soup to nuts, when in paranoid mode?

Gloda does not stream more than one message.  The message indexing process undergoes a de facto synchronization with the async database thread when it issues a SELECT to try and see if the message is already indexed (and conversation lookup) using message-id's.  As long as we write to the descriptor after we get that back and before we stream the message, we should be fine.
(In reply to comment #13)
> So if we do an out-of-sync respin to enable us to fully push out the major
> update to 3.1.1, are we taking just the crash fix or the gloda robustness fix
> as well?

Realistically, I think just the crash fix given the current timeline.
 
The flipside is that it doesn't make much sense to pursue a gloda paranoia strategy if we're not going to ship it imminently.  Specifically, implementing a pure JS mime parser that meets gloda's needs is in the same ballpark of effort but would eliminate crash potential entirely, provide performance gains (at least when JITing works if not always because of the XPConnect call reduction), and be useful to the raindrop team.

(Use of the phrase "meets gloda's needs" is significant in reducing the level of effort.  We just want the headers, structure, and body bits as blobs.  We do not need to resolve references from HTML subparts to other subparts or otherwise look inside the content blobs.  We already post-process the output of libmime to convert HTML to text, and that is the only time we look in the HTML at all.)
blocking-thunderbird3.1: --- → .2+
Attachment #458606 - Flags: approval-thunderbird3.1.2? → approval-thunderbird3.1.2+
Checked in on trunk and comm-1.9.2, and relbranch for 3.1.2:

http://hg.mozilla.org/comm-central/rev/25c94f70ecc4
http://hg.mozilla.org/releases/comm-1.9.2/rev/32877904325c
http://hg.mozilla.org/releases/comm-1.9.2/rev/822aba7e0789

(apologies for getting patch attribution wrong on the comm-1.9.2 repository).
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Target Milestone: --- → Thunderbird 3.2a1
Crash Signature: [@ normalize_character ]
Group: core-security
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: