Closed Bug 840109 Opened 11 years ago Closed 10 years ago

Pandas running b2g stop sending network traffic

Categories

(Testing Graveyard :: SUTAgent, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: mcote, Unassigned, Mentored)

References

Details

Attachments

(4 files, 1 obsolete file)

When I try to pull or cat a file that is not trivially small (a few kB or more), the transfer takes far longer than it probably should.  It also feels rather bursty.  I've verified this with my home panda and with one in production.

When running the agent on a Linux box or VMs, the transfers are reasonably fast.
Pushing files seems to work fine.  I am pretty sure this has to do with the way Negatus is polling for socket write events.
I ran Negatus on a Galaxy SII just to make sure it wasn't something to do with the ARM build.  Worked fine, so it is almost certainly something specific to panda boards.
Tested a Linaro image on my panda board here, and Negatus worked fine.

Thomas, wondering if I could get some help from you.  At this point I'm pretty sure this has something to do with the b2g kernel.

The easiest way to replicate this problem is

1. Put a ~100 kB text file on the panda as /data/local/mytext.txt.  You can use adb or dm to do this (dm is part of mozdevice in mozbase which talks to a sutagent; e.g. "dm --host <panda IP> -m sut push mytext.txt /data/local/mytext.txt").

2. Connect to port 20701.  You should see the $> prompt.

3. Enter "cat /data/local/mytext.txt"

Expected results: the file is immediately and smoothly streamed to the connection.

Actual: there is a very long pause, then the file is streamed in large bursts.  Total transfer time is a few minutes.  If you interrupt the connection, you can't reconnect to the port for about the same duration.

This is almost certainly going to be a problem for unit tests.
Flags: needinfo?(tzimmermann)
Hi,

Port 20701 is used by sutagent, right?

When opening the socket within sutagent, do you set the option SO_REUSEADDR [1]? Normally the kernel's network stack respects the connection timeouts of the underlying protocols (probably TCP in your case). This means that you cannot re-use a network address and port until the connection gets correctly closed (by closing the socket on both end points), or the connection timeout has been reached. If you interrupt an open connection, there is no way of closing the sockets; so you have to wait until the timeout has been reached. Setting SO_REUSEADDR for a newly opened socket will allow you to reuse the network address and port before the old connection's timeout has been reached.

Could you try this first? This should make interrupting a connection work. If there are still problems when SO_REUSEADDR is set, I'll take a closer look at the kernel side. (I'm a bit busy this week, so I'd like to rule out the simple causes before digging through kernel code.)

BTW, have you ever tried this on a phone?

Best regards
Thomas

[1] http://www.beej.us/guide/bgnet/output/html/multipage/setsockoptman.html
Flags: needinfo?(tzimmermann)
Oh, I just saw that it works on a Galaxy SII.
Yup, port 20701 is the SUT agent.  In this case, it is actually the new C++ agent, Negatus (https://github.com/mozilla/Negatus), since we can't run the Android-Java agent on b2g.

Yes, the listening socket is set to reuse addresses.  The fact that I can't reuse the port for a little while is only a small part of the main problem, though.  It's the length and burstiness of the transfer that is the problem.

More details:

I'm using a select-based approach (using NSPR's PR_Poll).  The write events for the socket come in as expected for the few few kB then suddenly pause for a while.  If I try to write anyway, it appears that the buffer fills up, as PR_Send() will return an amount less than the block size that I tried to write (e.g. 224 vs 1024) and then will return an error after that.  So it appears that the send buffer fills up, and it takes a long time for the data to be sent out and the buffer to empty.

As I say, I have tried this not only on a real phone but also a Linaro android image on the panda board itself, and everything worked fine.  So I'm fairly certain this isn't a problem with the agent's code.
I could also write a really simple server in C if you wanted a minimal test case.
Ok, I see. The network chip is connected to the USB chip and we had problems with the latter. I'll look into this issue asap. I'm currently busy with a tef+ bug, which is supposed to be fixed by the end of the week. If you could provide me with a minimal test case, that would be great.

Meanwhile, another thing you could do is to try setting the TCP_NODELAY option for your server-side socket. Normally the network stack tries to queue up outbound data to maximize packet size. With TCP_NODELAY it'll send data asap.
Assignee: mcote → tzimmermann
Yeah I tried TCP_NODELAY as well.  Same result.

I wrote a simple blocking TCP server in C which replicates the problem.  I uploaded the source and an android build to http://people.mozilla.com/~mcote/writetest.tar.gz.  There's a little README in there explaining how to use it; in short, you can connect to it and issue a command in the form "<number of blocks> <block size>" and it will write that many blocks (of ASCII characters) of the given size to the socket.  For example, entering "2 512" (without quotes) will cause the server to write out two blocks of 512 characters each.  This allowed me to play with block sizes and see where the problem starts.  The server spits out lots of info as it's reading and writing.

Seems that it works more-or-less fine for really small block sizes.  It sent 10000 blocks of 8 characters without any trouble.  Problems start as the block size gets larger; the trouble really becomes apparent with blocks of 512 bytes.  Trying to send even one or two 512-byte blocks results in the server thinking it has written them, but nothing coming through on the client side for some time.  The server also starts slowing down when reading from the socket (e.g. try sending "3 512", "1 16", "1 16" in quick succession--for me, it read the first two lines immediately but took a long time to read the third).  If you give it enough blocks or a big enough block it appears to freeze entirely, or at least pauses for a really long time.  In those cases, I have to stop the server (which still responds to signals just fine) and then do "netcfg eth0 down; netcfg eth0 dhcp".  Trying to restart the server without resetting the interface causes it to appear to start but not notice any connection attempts (i.e. it stays blocked on accept()).

Btw the program works as expected, with no delays, on Linux.

I hope this helps; let me know if I'm doing anything crazy, but I'm fairly confident that this exposes a kernel problem. :)
Hi Marc,

Thanks for providing the test program, which allowed me to reproduce the problem here. Normally the data transmission gets stuck for a few seconds, but when I move the mouse, which is connected to the Panda's USB port, the data transmission immediately continues.

There in an interesting thread in the TI forum at

>  http://e2e.ti.com/support/omap/f/849/t/196732.aspx

regarding the problem. It seems that this bug is known and maybe related to USB and/or power management, but no patch is available.

You mentioned that some systems are not affected. Could you provide me with the kernel versions of these systems ('uname -a') and attach these systems' /proc/config.gz? Maybe it is possible to workaround the problem with the right kernel config.

Thanks!
And in wireshark, I see a lot of incorrect TCP ACKs coming from the panda board.
This is the same panda board running a linaro image.  I will attach /proc/config.gz.

root@android:/ # uname -a
Linux localhost 3.2.0+ #1 SMP PREEMPT Wed May 23 15:35:51 UTC 2012 armv7l GNU/Linux
I'm pretty sure that the kernel suspends before all data has been sent and the NIC driver fails to keep the system running. When I press the GPIO button on the board, the system wakes up and transmits the data. But acquiring a wake lock from within writetest doesn't help.
Wow, I just disabled kernel suspend altogether, but the problem persists. So it's not directly related related to suspends. Transmission first works, but once b2g disables its screen output, the transmission problems begin. When I press the GPIO button on the board, b2g 'wakes up' and transmission continues flawlessly until b2g clears the screen again.

@Josh: Josh, I think b2g might modify some network settings when it starts sleeping, which in turn causes the networking problems. I cc'd you because I think you might know the answer or know some engineer in the networking team who can help here. Thanks!
Flags: needinfo?(joshmoz)
Honza, Jason - ideas here?
Flags: needinfo?(joshmoz) → needinfo?(honzab.moz)
(In reply to Thomas Zimmermann [:tzimmermann] from comment #15)
> Wow, I just disabled kernel suspend altogether, but the problem persists. So
> it's not directly related related to suspends. Transmission first works, but
> once b2g disables its screen output, the transmission problems begin. When I
> press the GPIO button on the board, b2g 'wakes up' and transmission
> continues flawlessly until b2g clears the screen again.

How is the screen being disabled?  It's some kind of a screen or power saver?

> 
> @Josh: Josh, I think b2g might modify some network settings when it starts
> sleeping, which in turn causes the networking problems. 

I don't quite follow this line.  What settings and when, what? :)

So, is sleeping involved or not?  You just said suspend has been disabled, so I a bit don't understand.

CC'ing Michal, a linux guru.
Component: SUTAgent → XPCShell Harness
Flags: needinfo?(honzab.moz)
Component: XPCShell Harness → SUTAgent
hmm, so B2G screen going to sleep causes linux kernel to stop sending/recving TCP... Does sound like we might be changing some kind of kernel setting (network or power management) when we shut down the screen.  I don't know those APIs well. Perhaps :mcmanus does.
> 
> How is the screen being disabled?  It's some kind of a screen or power saver?

We release the wake locks after a timeout expired. This normally suspends the phone and the screen gets turned off.

> > 
> > @Josh: Josh, I think b2g might modify some network settings when it starts
> > sleeping, which in turn causes the networking problems. 
> 
> I don't quite follow this line.  What settings and when, what? :)
> 
> So, is sleeping involved or not?  You just said suspend has been disabled,
> so I a bit don't understand.

Well, suspend has been disabled in the kernel configuration. When b2g's suspend timeouts expires, releasing the wake locks shouldn't have any effect on the system. I can see that the screen remains turned on.

But once the suspend timer expires, the transmission of network packets mostly stops. So I thought that b2g modifies some network setting when starting to sleep, maybe for saving battery power.
I spent some time on debugging this problem. The NIC is based on USB. AFAICS the network packets get encoded for USB and enqueued into the USB controller's list of requests. They are later delivered to the NIC hardware by another function. Below is a stack trace for the function that finally frees the requests after they have been written to the NIC. The function runs from within an interrupt handler. I'd guess that the USB controller maybe disables its interupts when we start sleeping; and thus stops writing requests to the NIC. In this case we can probably make the USB controller keep sending packets by keeping its interupts enabled. I have to investigate this further, though.

>>>>>

[  117.209014] Backtrace: 
[  117.212005] [<c004e980>] (dump_backtrace+0x0/0x10c) from [<c0539630>] (dump_stack+0x18/0x1c)
[  117.221313]  r7:eb572230 r6:eb572248 r5:ebb6f380 r4:ec417460
[  117.228302] [<c0539618>] (dump_stack+0x0/0x1c) from [<c02b9880>] (tx_complete+0x110/0x170)
[  117.237426] [<c02b9770>] (tx_complete+0x0/0x170) from [<c02d32c4>] (usb_hcd_giveback_urb+0x58/0xc8)
[  117.247406]  r7:ec5960f4 r6:ebb6f380 r5:00000000 r4:ebb6f380
[  117.254333] [<c02d326c>] (usb_hcd_giveback_urb+0x0/0xc8) from [<c02e2118>] (ehci_urb_done+0xe4/0x100)
[  117.264465]  r5:00000000 r4:ec596000
[  117.268707] [<c02e2034>] (ehci_urb_done+0x0/0x100) from [<c02e2564>] (qh_completions+0x430/0x540)
[  117.278503]  r9:eb424740 r8:ebb6f380 r7:ffdfe240 r6:c06f82a0 r5:c0700e50
[  117.286254] r4:ffdfd600
[  117.289520] [<c02e2134>] (qh_completions+0x0/0x540) from [<c02e3f50>] (ehci_work+0x124/0xb48)
[  117.298950] [<c02e3e2c>] (ehci_work+0x0/0xb48) from [<c02e8e48>] (ehci_irq+0x268/0x2e8)
[  117.307647] [<c02e8be0>] (ehci_irq+0x0/0x2e8) from [<c02d2890>] (usb_hcd_irq+0x38/0x8c)
[  117.316558] [<c02d2858>] (usb_hcd_irq+0x0/0x8c) from [<c00d202c>] (handle_irq_event_percpu+0x54/0x188)
[  117.326843]  r7:00000000 r6:00000000 r5:c06ec56c r4:ebd9a880
[  117.333740] [<c00d1fd8>] (handle_irq_event_percpu+0x0/0x188) from [<c00d21a4>] (handle_irq_event+0x44/0x64)
[  117.344451] [<c00d2160>] (handle_irq_event+0x0/0x64) from [<c00d4968>] (handle_fasteoi_irq+0xa0/0x10c)
[  117.354705]  r7:0000006d r6:e4880000 r5:c06ec56c r4:c06ec520
[  117.361755] [<c00d48c8>] (handle_fasteoi_irq+0x0/0x10c) from [<c00d1b94>] (generic_handle_irq+0x3c/0x48)
[  117.372009]  r5:c0042e80 r4:c06f8568
[  117.376495] [<c00d1b58>] (generic_handle_irq+0x0/0x48) from [<c0044054>] (asm_do_IRQ+0x54/0xb4)
[  117.385986] [<c0044000>] (asm_do_IRQ+0x0/0xb4) from [<c004a4c8>] (__irq_svc+0x48/0xe0)
[  117.394622] Exception stack(0xe4881a70 to 0xe4881ab8)
[  117.400329] 1a60:                                     ec417548 20000113 00000000 00000000
[  117.409210] 1a80: 20000113 ec417000 ebb6f380 ec41753c ec417548 00000000 20000113 e4881acc
[  117.418243] 1aa0: e4881ab8 e4881ab8 c053cefc c053cf00 60000113 ffffffff
[  117.425598]  r9:e4880000 r8:00000202 r7:00000203 r6:0000006d r5:fa240100
[  117.433471] r4:ffffffff
[  117.436828] [<c053cee8>] (_raw_spin_unlock_irqrestore+0x0/0x44) from [<c02b95ec>] (usbnet_start_xmit+0x1e0/0x364)
[  117.448089]  r5:ec417000 r4:00000001
[  117.452667] [<c02b940c>] (usbnet_start_xmit+0x0/0x364) from [<c03b1aec>] (dev_hard_start_xmit+0x32c/0x5a4)
[  117.463287] [<c03b17c0>] (dev_hard_start_xmit+0x0/0x5a4) from [<c03c91c4>] (sch_direct_xmit+0xb8/0x1e4)
[  117.473449] [<c03c910c>] (sch_direct_xmit+0x0/0x1e4) from [<c03b1ed0>] (dev_queue_xmit+0x16c/0x57c)
[  117.483459] [<c03b1d64>] (dev_queue_xmit+0x0/0x57c) from [<c0404200>] (ip_finish_output+0x144/0x3d4)
[  117.493560] [<c04040bc>] (ip_finish_output+0x0/0x3d4) from [<c04045b0>] (ip_output+0x120/0x140)
[  117.503173] [<c0404490>] (ip_output+0x0/0x140) from [<c0403500>] (ip_local_out+0x30/0x34)
[  117.512207]  r9:00003890 r8:ebbc8c68 r7:00000000 r6:eb88e8e4 r5:eb572230
[  117.519989] r4:eb572230
[  117.523284] [<c04034d0>] (ip_local_out+0x0/0x34) from [<c0403938>] (ip_queue_xmit+0x11c/0x3ac)
[  117.532775]  r5:eb572230 r4:ebbc8a40
[  117.537139] [<c040381c>] (ip_queue_xmit+0x0/0x3ac) from [<c0418d34>] (tcp_transmit_skb+0x388/0x8c4)
[  117.547119] [<c04189ac>] (tcp_transmit_skb+0x0/0x8c4) from [<c041b6c0>] (tcp_write_xmit+0x1f8/0xa5c)
[  117.557220] [<c041b4c8>] (tcp_write_xmit+0x0/0xa5c) from [<c041bfa8>] (__tcp_push_pending_frames+0x30/0x8c)
[  117.567932] [<c041bf78>] (__tcp_push_pending_frames+0x0/0x8c) from [<c040dd00>] (tcp_sendmsg+0x8ec/0xce0)
[  117.578460]  r4:000005a8
[  117.581512] [<c040d414>] (tcp_sendmsg+0x0/0xce0) from [<c042ebc0>] (inet_sendmsg+0x9c/0xc0)
[  117.590759] [<c042eb24>] (inet_sendmsg+0x0/0xc0) from [<c039dc28>] (sock_sendmsg+0xa4/0xb8)
[  117.599822]  r7:00000000 r6:e4881f54 r5:00000001 r4:ec3b2400
[  117.606903] [<c039db84>] (sock_sendmsg+0x0/0xb8) from [<c039f0f8>] (sys_sendto+0xb8/0xdc)
[  117.615966]  r7:00000000 r6:ec3b2400 r5:00008d56 r4:00000001
[  117.622833] [<c039f040>] (sys_sendto+0x0/0xdc) from [<c004aa00>] (ret_fast_syscall+0x0/0x30)
FYI, I found that the interupts for the USB requests are generated correctly, but the USB controller fails to process the outstanding requests. The latest active request simply never get forwarded to the NIC until the board resumes.

To see if the problem has been fixed in a later kernel version, I ported the drivers for the NIC and the USB controller from the 3.9-rc1 mainline kernel to the panda board's kernel. This didn't help however.
Correcting the bug summary, since this has nothing to do with the SUTAgent.  Not sure what Product/Component it should be under, though; feel free to change. :)
Summary: [Negatus] pull and cat are very, very slow on pandas → Pandas running b2g stop sending network traffic
I did some further research on this problem. It also happens when I connect over Wifi, which is not attached to USB. And there exists a wifi wakelock in b2g, but it's unrelated and not really used on the pandas.

As a workaround, you could set the screen timeout to 'never', which prevents the problem from happening.
> I did some further research on this problem. It also happens when I connect
> over Wifi, which is not attached to USB.

I was _not_ able to reproduce the problem with Wifi.
Attached file PCAP trace of failed packets (obsolete) —
This is a packet trace of a failed transmission. It can be opened with Wireshark.

The IP address ending with .24 is my notebook, .33 is the pandaboard. The first two packets are the telnet handshake. The third packet is a request for 10 strings of 300 bytes length each. The following packets are data transmissions from the pandaboard to the notebook, and the ACKs from the notebook. The pandaboard then retransmits the first data packet, which indicates that is did not correctly receive or process the ACK for this packet.

I finally resolved this problem be generating some workload on the pandaboard.
Updated for clarity
Attachment #729532 - Attachment is obsolete: true
dmesg output for trace #2

The lines 'usbnet_skb_return:270' signal received packets, the lines 'usbnet_start_xmit:1111' signal sent packets. The numbers at the end correlate them with packets in the trace. Interestingly my notebook's ACKs are received and processed by the pandaboard. At some point, my notebook stops ACK-ing packets, even though they have received. The pandaboard sent some retransmits, but these are not received by my notebook.
Whiteboard: [mentor=:tdz]
Assignee: tzimmermann → nobody
Whiteboard: [mentor=:tdz] → [mentor=:tdz][good first bug]
Not a good first bug.
Whiteboard: [mentor=:tdz][good first bug]
Right, but you can keep me as mentor.
Whiteboard: [mentor=:tdz]
Mentor: tzimmermann
Whiteboard: [mentor=:tdz]
We aren't running pandas in b2g automation any longer.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → WONTFIX
Product: Testing → Testing Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: