Closed Bug 208843 Opened 21 years ago Closed 19 years ago

large downloads fail (partial file downloaded)

Categories

(Core Graveyard :: Networking: FTP, defect)

defect
Not set
critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bugzilla, Assigned: sfraser_bugs)

References

()

Details

(4 keywords)

Attachments

(2 files)

simon pointed me to this bug. he can reproduce it on mozilla and camino (trunk
based), and i was able to reproduce it using today's commercial branch
(2003.06.09.05-1.4) on OS X 10.2.6.

large downloads from external sites fail because only part of the file is
downloaded. this differs from bug 198076 because i don't have a problem with
Stuffit Expander hanging.

0. this test makes use of an Apple Developer account --iirc, it's free to
register. adding 'qawanted' if there's an easier test for this.

1. sign on to Apple's Developer Connection at http://connect.apple.com

2. go to the Download Software section, and click on Developer Tools, which
contains big files.

3. attempt to download the OS X Developer Tools (mac binary disk image), which
is about 301MB.

results: the download progress seems unusually fast. ultimately, i get a Disk
Copy Progress error, where the .dmg file fails to mount "due to error 95 (no
mountable file systems)." also, back on the browser side i get an alert, "426
Data connection: Broken pipe."
can this be reproduce on windows or linux?
Isn't it caused by bug 69938 ?
so far cannot repro on win2k and linux rh8.0, at least using the above test
case. i used Winzip as the opening app on win2k and just saved the whole (not
partial) file on linux.

i'm trying to see if this was a regression, but it looks like this was a problem
(looking at OS X builds) since at lease 31 March. :-/
this might've regressed btwn the 2003.01.22.07 (works, but not sure) and
2003.01.23.07 (broken) OS X mach-o comm builds.

currently saying "not sure" since the dmg file, while 301.2MB in size on both my
desktop and as described in the Apple page, doesn't seem to open up.
I couldn't reproduce based on the steps provided. At the beginning of the
download, I specified the "save to disk" option. Tested under 2003-06-05-05
branch build under 10.2.6.
for OS X, i didn't select "save to disk", i took the default (OS-associated
helper app) of opening with Stuffit.
Chunk of socket transport log (the only log available in opt builds) where
download fails:

5327760[4f3a80]: nsSocketTransport::SendStatus [this=7c7dca0 status=804b0006]
5327760[4f3a80]: nsSocketInputStream::Read [this=7c7dd14 count=2728]
5327760[4f3a80]:   calling PR_Read [count=2728]
5327760[4f3a80]:   PR_Read returned [n=1360]
5327760[4f3a80]: nsSocketTransport::SendStatus [this=7c7dca0 status=804b0006]
5327760[4f3a80]: nsSocketInputStream::Read [this=7c7dd14 count=1368]
5327760[4f3a80]:   calling PR_Read [count=1368]
5327760[4f3a80]:   PR_Read returned [n=-1]
5327760[4f3a80]: nsSocketInputStream::Read [this=7c7dd14 count=2736]
5327760[4f3a80]:   calling PR_Read [count=2736]
5327760[4f3a80]:   PR_Read returned [n=-1]
5327760[4f3a80]: nsSocketInputStream::AsyncWait [this=7c7dd14]
5327760[4f3a80]:   active [1] { handler=7c7dca0 condition=0 pollflags=5 }
5327760[4f3a80]:   active [0] { handler=7c806f0 condition=0 pollflags=5 }
5327760[4f3a80]:   idle [2] { handler=6e6d230 condition=0 pollflags=0 }
5327760[4f3a80]:   idle [1] { handler=6a13b30 condition=0 pollflags=0 }
5327760[4f3a80]:   idle [0] { handler=6db7920 condition=0 pollflags=0 }
5327760[4f3a80]:   calling PR_Poll [active=2 idle=3]
5327760[4f3a80]: nsSocketTransport::OnSocketReady [this=7c7dca0 outFlags=1]
5327760[4f3a80]: nsSocketInputStream::OnSocketReady [this=7c7dd14 cond=0]
5327760[4f3a80]: nsSocketInputStream::CloseEx [this=7c7dd14 reason=80470002]
5327760[4f3a80]: nsSocketTransport::OnMsgInputClosed [this=7c7dca0 reason=80470002]
5327760[4f3a80]: nsSocketTransportService::DetachSocket [handler=7c7dca0]
5327760[4f3a80]: nsSocketTransport::OnSocketDetached [this=7c7dca0 cond=80470002]
5327760[4f3a80]: nsSocketTransport::RecoverFromError [this=7c7dca0 state=4
cond=80470002]
5327760[4f3a80]: nsSocketInputStream::OnSocketReady [this=7c7dd14 cond=80470002]
5327760[4f3a80]: nsSocketOutputStream::OnSocketReady [this=7c7dd2c cond=80470002]
5327760[4f3a80]: nsSocketTransport: calling PR_Close [this=7c7dca0]

The error, 80470002, is NS_BASE_STREAM_CLOSED.
can you get a FTP log (not nspr log, but some other packet trace)?
btw, i downloaded the file successfully on windows using 1.4rc1.  Final file
size:  315,922,432 (size on disk 315,924,480).  
  nsSocketInputStream::CloseEx [this=7c7dd14 reason=80470002]

is equivalent to 

  nsSocketInputStream::Close [this=7c7dd14]

NS_BASE_STREAM_CLOSED doesn't indicate a real failure so to speak.
coudl it be that you ran out of tmp space.  simon says no, what about others?
I don't think temp space was the issue; consecutive downloads would fail in
different places.
is this still a problem with recent builds?
I still see this problem.  Actually, using the same test as mentioned in the
description:  I tried to download the Xcode 2.1 disk image many times (at least
5), and each time the download would finish (with no error), but the file would
be much smaller than expected.  After these repeated failures, I was able to
download it using Safari on the very first attempt.

Here is a snippet from my "nsHTTP" log:
======================================
50778112[1324c50]: nsHttpTransaction::ProcessData [this=2224a0b0 count=1360]
50778112[1324c50]: nsHttpTransaction::HandleContent [this=2224a0b0 count=1360]
50778112[1324c50]: nsHttpTransaction::HandleContent [this=2224a0b0 count=1360
read=1360 mContentRead=102016712 mContentLength=788343619]
50778112[1324c50]: nsHttpTransaction::OnSocketStatus [this=2224a0b0
status=804b0006 progress=102018419]
50778112[1324c50]: nsHttpTransaction::ProcessData [this=2224a0b0 count=1376]
50778112[1324c50]: nsHttpTransaction::HandleContent [this=2224a0b0 count=1376]
50778112[1324c50]: nsHttpTransaction::HandleContent [this=2224a0b0 count=1376
read=1376 mContentRead=102018088 mContentLength=788343619]
-1610551400[1306b30]: nsHttpChannel::OnDataAvailable [this=222ce790
request=22222ee0 offset=102015352 count=2736]
-1610551400[1306b30]: sending status notification [this=222ce790 status=804b0006
progress=102018088/788343619]
-1610551400[1306b30]: nsHttpChannel::OnStopRequest [this=222ce790
request=22222ee0 status=0]
-1610551400[1306b30]: nsHttpChannel::FinalizeCacheEntry [this=222ce790]
-1610551400[1306b30]:   calling OnStopRequest
50778112[1324c50]: nsHttpConnection::OnSocketReadable [this=1dfa9990]
50778112[1324c50]: nsHttpConnection::CloseTransaction[this=1dfa9990
trans=2224a0b0 reason=80470002]
50778112[1324c50]: nsHttpTransaction::Close [this=2224a0b0 reason=0]
50778112[1324c50]: nsHttpConnectionMgr::ReclaimConnection [conn=1dfa9990]
50778112[1324c50]: nsHttpTransaction::DeleteSelfOnConsumerThread [this=2224a0b0]
50778112[1324c50]: proxying delete to consumer thread...
======================================

As you can see from that, I also get the 80470002 error (as mentioned in comment
#7).  But Darin says that might be nothing.  

So, what's the best way to debug this?
Some other observations: I cannot reproduce this on a Mac Mini 1.42Ghz, whereas
it would always fail for me on a dual-processor G5 1.8Ghz (both using DP2
release).  I could get it to fail consistently on a debug build on the
dual-processor G5, but as soon as I enabled NSPR logging for "nsPipe:5", or set
a conditional breakpoint at a certain spot in the debugger, then the download
would finish as expected.  NSPR logging of "nsHTTP:5" and/or "nsStreamPump:5"
still showed the failure; only "nsPipe:5" resulted in sucess.  Timing issue, maybe?
Sounds like a nasty race condition in NSPR.
Saw this today downloading a 600Mb file on a fast network, on a dual G5, with Camino 1.0b1.
I was able to reproduce this in about 1 out of 8 tries downloading a 300Mb file over my local network (about about 10MB/sec), using an optimized build. In a debug build, it's much less frequent.
I also see this problem. Saw it twice in a row downloading a 6GB file with Camino 1.0b1.
By debugging, I tracked this problem to a premature nsPipeInputStream::Close() call, coming out of nsPipeInputStream::Release() when mReaderRefCnt dropped to zero. However, the lifetimes of nsPipeInputStream/nsPipeOutputStream are tied to their owning nsPipe, so it seemed that the refCnt should never drop to 0 while the pipe was alive. And, since nsPipe has threadsafe addref/release, I think we should use threadsafe increment/decrement for the reader/writer counts too.
Assignee: dougt → sfraser_bugs
Status: NEW → ASSIGNED
Attachment #202768 - Flags: superreview?
Attachment #202768 - Flags: review?(dougt)
Attachment #202768 - Flags: superreview? → superreview?(darin)
Comment on attachment 202768 [details] [diff] [review]
Patch: use atomic increment/decrement for reader/writer counts

nice detective work.
Attachment #202768 - Flags: review?(dougt) → review+
Comment on attachment 202768 [details] [diff] [review]
Patch: use atomic increment/decrement for reader/writer counts

> NS_IMETHODIMP_(nsrefcnt)
> nsPipeOutputStream::Release()
> {
>-    if (--mWriterRefCnt == 0)
>+    PR_AtomicDecrement((PRInt32 *)&mWriterRefCnt);
>+    if (mWriterRefCnt == 0)
>         Close();
>     return mPipe->Release();
> }

You can't read mWriterRefCnt naked.  Instead,
use the return value of PR_AtomicDecrement as
you do in nsPipeInputStream::Release.

Is nsrefcnt always four bytes?  Otherwise the
(PRInt32 *) cast is not safe.
Attachment #202768 - Flags: superreview?(darin) → review-
Attached patch Revised patch — — Splinter Review
Fix nsPipeOutputStream::Release() to use return value from PR_AtomicDecrement.
Wan-Teh: nsrefcnt is defined like this:
#if defined(XP_WIN) && PR_BYTES_PER_LONG == 4
typedef unsigned long nsrefcnt;
#else
typedef PRUint32 nsrefcnt;
#endif

so will always be 4 bytes.
Attachment #202777 - Flags: superreview?(darin)
Attachment #202777 - Flags: review?(dougt)
Attachment #202777 - Flags: review?(wtchang)
Attachment #202777 - Flags: review?(wtchang) → review+
I would strongly recommend putting this on the 1.8 branch.
Flags: blocking1.8.1?
Comment on attachment 202777 [details] [diff] [review]
Revised patch 

>+    nsrefcnt count = PR_AtomicDecrement((PRInt32 *)&mReaderRefCnt);
>+    if (count == 0)
>         Close();

nit: PR_AtomicDecrement returns PRInt32

maybe just do:

      if (PR_AtomicDecrement((PRInt32 *)&mReaderRefCnt) == 0)
          Close();


Good find!  This code probably assumed that a stream would only be
used on a single thread.  It forgot about the fact that the stream
might be created on a different thread :-(

sr=darin
Attachment #202777 - Flags: superreview?(darin) → superreview+
Checked in on trunk:
Checking in nsPipe3.cpp;
/cvsroot/mozilla/xpcom/io/nsPipe3.cpp,v  <--  nsPipe3.cpp
new revision: 1.14; previous revision: 1.13

Seeking approval for branch.
Status: ASSIGNED → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
Comment on attachment 202777 [details] [diff] [review]
Revised patch 

I have some concerns about the AddRef and Release
methods of nsPipeInputStream and nsPipeOutputStream.

These methods increment or decrement two reference
counts.  For example, nsPipeInputStream::AddRef
increments mReaderRefCnt and the reference count
of mPipe.  This patch makes all the increments and
decrements atomic, but this may not be enough if
the two reference counts must be incremented or
decremented as a single transaction.  I don't have
time to study the code to see whether this is
necessary.

A common example of the need for a transaction is
to transfer money between bank accounts.  It is
not sufficient that the "from" account is debited
atomically and the "to" account is credited atomically.
You also need the debit and credit to be performed
"at the same time", so that the sum of the balances
in the two bank accounts stays the same.

Back to this patch.  If we need to maintain a
certain invariant on the two reference counts,
then it is not sufficient that they each be
incremented or decremented atomically; we need
to use a higher-level lock to ensure that the
two reference counts stay consistent with each
other.

Another note: it may be a good idea to do
the increments and decrements in opposite
orders.  That is, we may want to do AddRef
in this order:

 NS_IMETHODIMP_(nsrefcnt)
 nsPipeInputStream::AddRef(void)
 {
     nsrefcnt rv = mPipe->AddRef();
     PR_AtomicIncrement((PRInt32*)&mReaderRefCnt);
     return rv;
 }
Which Solaris (sparc or x86) build of Mozilla can I get with this fix ? I have had a similar problem - see bug 248414 . I never had time to provide the trace, but the problem was easily reproducible with the URL I gave, and I could try that again. If this fixes it, I would like to mark the other bug as a dupe of this one.
Attachment #202777 - Attachment description: Revised patch → Revised patch
Attachment #202777 - Flags: review?(dougt) → review+
Attachment #202777 - Flags: approval1.8rc2?
Excellent catch.  I was completely unable to download an 800MB file over a fast connection with a dual-CPU machine until I updated to get this patch.
*** Bug 248414 has been marked as a duplicate of this bug. ***
OS: MacOS X → All
Hardware: Macintosh → All
Flags: blocking1.8.1? → blocking1.8.0.1?
Attachment #202777 - Flags: approval1.8rc2? → approval1.8.0.1?
Darin - any thoughts on Comment 27?
WTC: I don't think we need to worry about building a transaction around the two AddRef calls.  The second AddRef, which affects the destruction of the object itself, can happen indepedently since the caller of the AddRef method must itself have an owning reference to the object.  So, the result of calling AddRef is always to change the reference count of the object to a value of 2 or more (except in the very special case of initial object construction).
I meant to add that I think we should take this fix for FF 1.5.0.1.  It has zero risk and big-time reward on dual-core systems.
i landed this on a minibranch for camino1.0b2 since we are getting no approval traction
Comment on attachment 202777 [details] [diff] [review]
Revised patch 

a=dveditz
Attachment #202777 - Flags: approval1.8.1+
Attachment #202777 - Flags: approval1.8.0.1?
Attachment #202777 - Flags: approval1.8.0.1+
Flags: blocking1.8.1+
Flags: blocking1.8.0.1?
Flags: blocking1.8.0.1+
Checked in to MOZILLA_1_8_BRANCH and MOZILLA_1_8_0_BRANCH.
verified mac os x, dual cpu, 1.5.0.1/1.5 from 20060110
This fix doesn't appear to have gone into any official release of Seamonkey, ie. Mozilla 1.7.x . Mozilla 1.8 is still not released, so it can't be used for production. Please backport this to MOZILLA_1_7_BRANCH so that the next official Seamonkey patch release can actually perform downloads on SMP workstations. Thanks.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
FIXED means fixed on the trunk. Flags are used to indicate branch status. You could request blocking1.7.13/14 if you think it should block the next 1.7.x release.
Status: REOPENED → RESOLVED
Closed: 19 years ago19 years ago
Resolution: --- → FIXED
Julien, I verified that this fix is in SeaMonkey 1.0.  (The
Mozilla Application Suite has been renamed SeaMonkey, so
SeaMonkey 1.0 is what would have been Mozilla 1.8 or 1.8.0.1.)
Wan-Teh,

I was unaware of that change. When I go to http://www.mozilla.org/products and click on mozilla suite, I get prompted to download mozilla 1.7.12 . The 1.7 is also what's being updated in solaris right now and I'm not sure if they will switch to 1.8. So I would like to see this fix on the 1.7 branch too.
Flags: blocking1.7.13?
Comment on attachment 202777 [details] [diff] [review]
Revised patch 

comment 31 applies to older supported releases, too.
Attachment #202777 - Flags: approval1.7.13+
Attachment #202777 - Flags: approval-aviary1.0.8+
Fix checked into old supported branches
Flags: blocking1.7.13?
Flags: blocking1.7.13+
Flags: blocking-aviary1.0.8+
verified on the 1.0.1 branch using Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.7.13) Gecko/20060302 Firefox/1.0.8. I followed the original reporter's steps and downloaded the same file from the Apple Developer connection and no issues were encountered. Adding keyword. Thanks to Simon for the good debugging of the problem.
verified on the 1.7 branch using Mozilla 1.7.13 Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.7.13) Gecko/20060309 following same steps in Comment 44.
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: