Closed Bug 1204381 Opened 9 years ago Closed 8 years ago

frequent hangs on Mac - looping reading cache file _CACHE_

Categories

(Thunderbird :: General, defect)

43 Branch
x86_64
macOS
defect
Not set
critical

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: calum.mackay, Unassigned)

References

Details

(Keywords: hang, regression, Whiteboard: [regression:TB42])

I've been having some "hang" issues with Earlybird on MacOS (10.9.5), and have just reproduced it with Daily 20150911.

Note this is not a hibernation/sleep issue :)

I tried Daily 20150911, with a fresh profile - it hangs about 50% of the time on startup. e.g when loading the "account sign up" box.

Using dtruss I can see it seems to be mostly sitting in a loop of read() syscalls, about 2,000 per second. e.g:

read(0x29, "\0", 0x2B0) = 0 0
read(0x29, "\0", 0x2B0) = 0 0
read(0x29, "\0", 0x2B0) = 0 0

The file descriptor in question is always one of the _CACHE_ files:

COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
thunderbi 1341 calum 41u REG 1,4 24912 23533586 /Users/calum/Library/Caches/Thunderbird/Profiles/ht2u32xx.cdm.daily/Cache/_CACHE_003_

It's not always the same file, e.g. it could be 001.

This happens with both Earlybird and Daily recent builds. I've been seeing hangs on Earlybird for a few weeks, but only yesterday got around to tracing it, so can't say for sure it's the same issue.
Here's an example from today, Earlybird 20150913

EB fails to display a message when clicked on - blank body. cpu goes to 100%.

run dtruss; ctrl-c it after 10 seconds:

$ sudo dtruss -p $(pgrep thunderbird) -c

read(0x25, "\0", 0x2D)		 = 0 0
read(0x25, "\0", 0x2D)		 = 0 0
read(0x25, "\0", 0x2D)		 = 0 0
read(0x25, "\0", 0x2D)		 = 0 0
read(0x25, "\0", 0x2D)		 = 0 0
read(0x0, "\0", 0x0)		 = 0 0
^C
dtrace: 154128 dynamic variable drops with non-empty dirty list

CALL                                        COUNT
gettimeofday                                    6
psynch_cvbroad                                 15
psynch_cvwait                                  30
read                                        17416


Find out what fd 0x25 is; (0x25 == 37 decimal)

$ lsof -p $(pgrep thunderbird) | grep 37u
thunderbi 1606 calum   37u     REG                1,4     25555 23587136 /Users/calum/Library/Caches/Thunderbird/Profiles/ag10ufyl.default/Cache/_CACHE_002_



This is happening a few times an hour at the moment.
A regression range would help.  If you are not asserting it doesn't happen with daily 20150910, can you find the most recent daily that does not fail?
Severity: major → critical
Flags: needinfo?(calum.mackay)
Keywords: hang
Summary: frequent hangs - looping reading cache file _CACHE_ → frequent startup hangs - looping reading cache file _CACHE_
(In reply to Wayne Mery (:wsmwk, use Needinfo for questions) from comment #2)
> A regression range would help.  If you are not asserting it doesn't happen
> with daily 20150910, can you find the most recent daily that does not fail?

I can't find an archive of Mac Daily builds; would you point me at it please? I can only find the Mac Daily builds for "latest". The archives seems to be only Linux & win32/64.

Failing that, should I use Earlybird builds...? They do seem to be available...
removed "startup" from Summary: this happens all the time, not just at startup.
Flags: needinfo?(calum.mackay)
Summary: frequent startup hangs - looping reading cache file _CACHE_ → frequent hangs - looping reading cache file _CACHE_
Flags: needinfo?(calum.mackay)
Flags: needinfo?(calum.mackay) → needinfo?(vseerror)
I've been able to narrow it down, using Earlybird Aurora builds.

For each build, I created a fresh profile, and started EB, and selected "Set as default" on the default client box.

On repeated subsequent startups, for this build & profile:


- 0811 - BAD

http://ftp.mozilla.org/pub/mozilla.org/thunderbird/nightly/2015/08/2015-08-11-00-40-04-comm-aurora/

hangs about 50% of the time, whilst displaying the "Would you like a new email address" box, without ever getting to listing any provider.

Using dtruss/lsof, the result is always looping on read() of a _CACHE_ file (see earlier comments).


- 0810 - GOOD

http://ftp.mozilla.org/pub/mozilla.org/thunderbird/nightly/2015/08/2015-08-10-00-40-05-comm-aurora/

I'm unable to get this build to hang, no matter how many times I start it.
Flags: needinfo?(vseerror)
Interestingly, the first BAD build, 0811, was the switch-over (for EB Aurora) from 41.0a2 (0810) to 42.0a2 (0811).
Also, confirmed BAD is: Daily 0801

http://ftp.mozilla.org/pub/mozilla.org/thunderbird/nightly/2015/08/2015-08-01-03-02-12-comm-central/

which is also 42.

This is the oldest Mac Daily build I could find.
The July Daily builds are a bit sketchy, but I was able to confirm:


- 0722 - BAD

http://ftp.mozilla.org/pub/mozilla.org/thunderbird/nightly/2015/07/2015-07-22-03-02-08-comm-central/


- 0713 - GOOD

http://ftp.mozilla.org/pub/mozilla.org/thunderbird/nightly/2015/07/2015-07-13-03-02-07-comm-central/


I could not find any Daily builds in between 0713 & 0722.
Excellent work - *daily* builds is exactly what we need tested, not earlybird.  

It hurts that we are missing nightly builds, because we are now stuck with a wide regression range [1]. Fortunately Firefox is using cachev2, so in recent months there is almost no code change to old cache code.

I've skimmed SOME of changelogs of http://hg.mozilla.org/mozilla-central/file/fba4b0cd3823/netwerk/cache (mostly the .cpp) and some of the change are: 
* bug 1170646 (ishikawa)
* bug 354493
* bug 1048271
* bug 1182996


[1] 
http://hg.mozilla.org/comm-central/pushloghtml?startdate=2015-07-13+03%3A00&enddate=2015-07-22+05%3A00
http://hg.mozilla.org/mozilla-central/pushloghtml?startdate=2015-07-13+03%3A00&enddate=2015-07-22+05%3A00
Flags: needinfo?(ishikawa)
Keywords: regression
And I'm not finding anything any other worthwhile bug reports filed since July 13 that involve likely cache
http://mzl.la/1OaREQM
http://mzl.la/1OaS9dr
great, thanks much indeed, Wayne.
So, in https://hg.mozilla.org/mozilla-central/rev/f600f0cd7bb3

if the read() syscall returns 0, I presume that will be passed up to busy_beaver_PR_Read(), and that new function looks like it will loop on n==0?

And the dtruss shows that read() is returning 0.
Mac users have also reported hanging in addons manager.
Blocks: 1170646
Flags: needinfo?(ishikawa)
Summary: frequent hangs - looping reading cache file _CACHE_ → frequent hangs on Mac - looping reading cache file _CACHE_
See Also: → mail-cache2
thanks very much indeed, Wayne, that looks promising.

I'll test that backout as soon as I find a Daily build for macos.

I can't find any Daily builds more recent than August 12th, at least at:

http://ftp.mozilla.org/pub/mozilla.org/thunderbird/nightly/2015/

am I looking in the wrong place, please?
Flags: needinfo?(vseerror)
Mac nightlies are indeed broken. Hopefully will be better when you get the automated bugmail that bug 1195442 is fixed.
Depends on: 1195442
Flags: needinfo?(vseerror)
thanks again Wayne, much appreciated.
(In reply to Calum Mackay from comment #15)
> thanks very much indeed, Wayne, that looks promising.
> 
> I'll test that backout as soon as I find a Daily build for macos.

Calum, is it still an issue?
Flags: needinfo?(calum.mackay)
Apols, I should have updated this sooner.

No, the issue went away as expected, when I next got a build.

thanks again.
Flags: needinfo?(calum.mackay)
Unfortunate that we won't know what fixed or caused it, but not your fault by any means, and it's one less thing to worry about.  Thanks for the help and the update!
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WORKSFORME
Whiteboard: [regression:TB42]
You need to log in before you can comment on or make changes to this bug.