840109 - Pandas running b2g stop sending network traffic

Reporter

Description

•

11 years ago

When I try to pull or cat a file that is not trivially small (a few kB or more), the transfer takes far longer than it probably should.  It also feels rather bursty.  I've verified this with my home panda and with one in production.

When running the agent on a Linux box or VMs, the transfers are reasonably fast.

Mark Côté [:mcote]

Reporter

Comment 1

•

11 years ago

Pushing files seems to work fine.  I am pretty sure this has to do with the way Negatus is polling for socket write events.

Mark Côté [:mcote]

Reporter

Comment 2

•

11 years ago

I ran Negatus on a Galaxy SII just to make sure it wasn't something to do with the ARM build.  Worked fine, so it is almost certainly something specific to panda boards.

Mark Côté [:mcote]

Reporter

Comment 3

•

11 years ago

Tested a Linaro image on my panda board here, and Negatus worked fine.

Thomas, wondering if I could get some help from you.  At this point I'm pretty sure this has something to do with the b2g kernel.

The easiest way to replicate this problem is

1. Put a ~100 kB text file on the panda as /data/local/mytext.txt.  You can use adb or dm to do this (dm is part of mozdevice in mozbase which talks to a sutagent; e.g. "dm --host <panda IP> -m sut push mytext.txt /data/local/mytext.txt").

2. Connect to port 20701.  You should see the $> prompt.

3. Enter "cat /data/local/mytext.txt"

Expected results: the file is immediately and smoothly streamed to the connection.

Actual: there is a very long pause, then the file is streamed in large bursts.  Total transfer time is a few minutes.  If you interrupt the connection, you can't reconnect to the port for about the same duration.

This is almost certainly going to be a problem for unit tests.

Flags: needinfo?(tzimmermann)

Thomas Zimmermann [:tzimmermann] [:tdz]

Comment 4

•

11 years ago

Hi,

Port 20701 is used by sutagent, right?

When opening the socket within sutagent, do you set the option SO_REUSEADDR [1]? Normally the kernel's network stack respects the connection timeouts of the underlying protocols (probably TCP in your case). This means that you cannot re-use a network address and port until the connection gets correctly closed (by closing the socket on both end points), or the connection timeout has been reached. If you interrupt an open connection, there is no way of closing the sockets; so you have to wait until the timeout has been reached. Setting SO_REUSEADDR for a newly opened socket will allow you to reuse the network address and port before the old connection's timeout has been reached.

Could you try this first? This should make interrupting a connection work. If there are still problems when SO_REUSEADDR is set, I'll take a closer look at the kernel side. (I'm a bit busy this week, so I'd like to rule out the simple causes before digging through kernel code.)

BTW, have you ever tried this on a phone?

Best regards
Thomas

[1] http://www.beej.us/guide/bgnet/output/html/multipage/setsockoptman.html

Flags: needinfo?(tzimmermann)

Thomas Zimmermann [:tzimmermann] [:tdz]

Comment 5

•

11 years ago

Oh, I just saw that it works on a Galaxy SII.

Mark Côté [:mcote]

Reporter

Comment 6

•

11 years ago

Yup, port 20701 is the SUT agent.  In this case, it is actually the new C++ agent, Negatus (https://github.com/mozilla/Negatus), since we can't run the Android-Java agent on b2g.

Yes, the listening socket is set to reuse addresses.  The fact that I can't reuse the port for a little while is only a small part of the main problem, though.  It's the length and burstiness of the transfer that is the problem.

More details:

I'm using a select-based approach (using NSPR's PR_Poll).  The write events for the socket come in as expected for the few few kB then suddenly pause for a while.  If I try to write anyway, it appears that the buffer fills up, as PR_Send() will return an amount less than the block size that I tried to write (e.g. 224 vs 1024) and then will return an error after that.  So it appears that the send buffer fills up, and it takes a long time for the data to be sent out and the buffer to empty.

As I say, I have tried this not only on a real phone but also a Linaro android image on the panda board itself, and everything worked fine.  So I'm fairly certain this isn't a problem with the agent's code.

Mark Côté [:mcote]

Reporter

Comment 7

•

11 years ago

I could also write a really simple server in C if you wanted a minimal test case.

Thomas Zimmermann [:tzimmermann] [:tdz]

Comment 8

•

11 years ago

Ok, I see. The network chip is connected to the USB chip and we had problems with the latter. I'll look into this issue asap. I'm currently busy with a tef+ bug, which is supposed to be fixed by the end of the week. If you could provide me with a minimal test case, that would be great.

Meanwhile, another thing you could do is to try setting the TCP_NODELAY option for your server-side socket. Normally the network stack tries to queue up outbound data to maximize packet size. With TCP_NODELAY it'll send data asap.

Assignee: mcote → tzimmermann

Mark Côté [:mcote]

Reporter

Comment 9

•

11 years ago

Yeah I tried TCP_NODELAY as well. Same result.

I wrote a simple blocking TCP server in C which replicates the problem. I uploaded the source and an android build to http://people.mozilla.com/~mcote/writetest.tar.gz. There's a little README in there explaining how to use it; in short, you can connect to it and issue a command in the form "<number of blocks> <block size>" and it will write that many blocks (of ASCII characters) of the given size to the socket. For example, entering "2 512" (without quotes) will cause the server to write out two blocks of 512 characters each. This allowed me to play with block sizes and see where the problem starts. The server spits out lots of info as it's reading and writing.

Seems that it works more-or-less fine for really small block sizes. It sent 10000 blocks of 8 characters without any trouble. Problems start as the block size gets larger; the trouble really becomes apparent with blocks of 512 bytes. Trying to send even one or two 512-byte blocks results in the server thinking it has written them, but nothing coming through on the client side for some time. The server also starts slowing down when reading from the socket (e.g. try sending "3 512", "1 16", "1 16" in quick succession--for me, it read the first two lines immediately but took a long time to read the third). If you give it enough blocks or a big enough block it appears to freeze entirely, or at least pauses for a really long time. In those cases, I have to stop the server (which still responds to signals just fine) and then do "netcfg eth0 down; netcfg eth0 dhcp". Trying to restart the server without resetting the interface causes it to appear to start but not notice any connection attempts (i.e. it stays blocked on accept()).

Btw the program works as expected, with no delays, on Linux.

I hope this helps; let me know if I'm doing anything crazy, but I'm fairly confident that this exposes a kernel problem. :)

Thomas Zimmermann [:tzimmermann] [:tdz]

Comment 10

•

11 years ago

Hi Marc,

Thanks for providing the test program, which allowed me to reproduce the problem here. Normally the data transmission gets stuck for a few seconds, but when I move the mouse, which is connected to the Panda's USB port, the data transmission immediately continues.

There in an interesting thread in the TI forum at

>  http://e2e.ti.com/support/omap/f/849/t/196732.aspx

regarding the problem. It seems that this bug is known and maybe related to USB and/or power management, but no patch is available.

You mentioned that some systems are not affected. Could you provide me with the kernel versions of these systems ('uname -a') and attach these systems' /proc/config.gz? Maybe it is possible to workaround the problem with the right kernel config.

Thanks!

Thomas Zimmermann [:tzimmermann] [:tdz]

Updated

•

11 years ago

Comment 11

•

11 years ago

And in wireshark, I see a lot of incorrect TCP ACKs coming from the panda board.

Mark Côté [:mcote]

Reporter

Comment 12

•

11 years ago

This is the same panda board running a linaro image.  I will attach /proc/config.gz.

root@android:/ # uname -a
Linux localhost 3.2.0+ #1 SMP PREEMPT Wed May 23 15:35:51 UTC 2012 armv7l GNU/Linux

Mark Côté [:mcote]

Reporter

Comment 13

•

11 years ago

Attached file /proc/config.gz from linaro image — Details

Thomas Zimmermann [:tzimmermann] [:tdz]

Comment 14

•

11 years ago

I'm pretty sure that the kernel suspends before all data has been sent and the NIC driver fails to keep the system running. When I press the GPIO button on the board, the system wakes up and transmits the data. But acquiring a wake lock from within writetest doesn't help.

Thomas Zimmermann [:tzimmermann] [:tdz]

Comment 15

•

11 years ago

Wow, I just disabled kernel suspend altogether, but the problem persists. So it's not directly related related to suspends. Transmission first works, but once b2g disables its screen output, the transmission problems begin. When I press the GPIO button on the board, b2g 'wakes up' and transmission continues flawlessly until b2g clears the screen again.

@Josh: Josh, I think b2g might modify some network settings when it starts sleeping, which in turn causes the networking problems. I cc'd you because I think you might know the answer or know some engineer in the networking team who can help here. Thanks!

Flags: needinfo?(joshmoz)

Josh Aas

Comment 16

•

11 years ago

Honza, Jason - ideas here?

Flags: needinfo?(joshmoz) → needinfo?(honzab.moz)

Honza Bambas (:mayhemer)

Comment 17

•

11 years ago

(In reply to Thomas Zimmermann [:tzimmermann] from comment #15)
> Wow, I just disabled kernel suspend altogether, but the problem persists. So
> it's not directly related related to suspends. Transmission first works, but
> once b2g disables its screen output, the transmission problems begin. When I
> press the GPIO button on the board, b2g 'wakes up' and transmission
> continues flawlessly until b2g clears the screen again.

How is the screen being disabled?  It's some kind of a screen or power saver?

> 
> @Josh: Josh, I think b2g might modify some network settings when it starts
> sleeping, which in turn causes the networking problems. 

I don't quite follow this line.  What settings and when, what? :)

So, is sleeping involved or not?  You just said suspend has been disabled, so I a bit don't understand.

CC'ing Michal, a linux guru.

Component: SUTAgent → XPCShell Harness

Flags: needinfo?(honzab.moz)

(not currently active) Ted Mielczarek

Updated

•

11 years ago

Component: XPCShell Harness → SUTAgent

Jason Duell

Comment 18

•

11 years ago

hmm, so B2G screen going to sleep causes linux kernel to stop sending/recving TCP... Does sound like we might be changing some kind of kernel setting (network or power management) when we shut down the screen.  I don't know those APIs well. Perhaps :mcmanus does.

Thomas Zimmermann [:tzimmermann] [:tdz]

Comment 19

•

11 years ago

> 
> How is the screen being disabled?  It's some kind of a screen or power saver?

We release the wake locks after a timeout expired. This normally suspends the phone and the screen gets turned off.

> > 
> > @Josh: Josh, I think b2g might modify some network settings when it starts
> > sleeping, which in turn causes the networking problems. 
> 
> I don't quite follow this line.  What settings and when, what? :)
> 
> So, is sleeping involved or not?  You just said suspend has been disabled,
> so I a bit don't understand.

Well, suspend has been disabled in the kernel configuration. When b2g's suspend timeouts expires, releasing the wake locks shouldn't have any effect on the system. I can see that the screen remains turned on.

But once the suspend timer expires, the transmission of network packets mostly stops. So I thought that b2g modifies some network setting when starting to sleep, maybe for saving battery power.

Thomas Zimmermann [:tzimmermann] [:tdz]

Comment 20

•

11 years ago

I spent some time on debugging this problem. The NIC is based on USB. AFAICS the network packets get encoded for USB and enqueued into the USB controller's list of requests. They are later delivered to the NIC hardware by another function. Below is a stack trace for the function that finally frees the requests after they have been written to the NIC. The function runs from within an interrupt handler. I'd guess that the USB controller maybe disables its interupts when we start sleeping; and thus stops writing requests to the NIC. In this case we can probably make the USB controller keep sending packets by keeping its interupts enabled. I have to investigate this further, though.

>>>>>

[  117.209014] Backtrace: 
[  117.212005] [<c004e980>] (dump_backtrace+0x0/0x10c) from [<c0539630>] (dump_stack+0x18/0x1c)
[  117.221313]  r7:eb572230 r6:eb572248 r5:ebb6f380 r4:ec417460
[  117.228302] [<c0539618>] (dump_stack+0x0/0x1c) from [<c02b9880>] (tx_complete+0x110/0x170)
[  117.237426] [<c02b9770>] (tx_complete+0x0/0x170) from [<c02d32c4>] (usb_hcd_giveback_urb+0x58/0xc8)
[  117.247406]  r7:ec5960f4 r6:ebb6f380 r5:00000000 r4:ebb6f380
[  117.254333] [<c02d326c>] (usb_hcd_giveback_urb+0x0/0xc8) from [<c02e2118>] (ehci_urb_done+0xe4/0x100)
[  117.264465]  r5:00000000 r4:ec596000
[  117.268707] [<c02e2034>] (ehci_urb_done+0x0/0x100) from [<c02e2564>] (qh_completions+0x430/0x540)
[  117.278503]  r9:eb424740 r8:ebb6f380 r7:ffdfe240 r6:c06f82a0 r5:c0700e50
[  117.286254] r4:ffdfd600
[  117.289520] [<c02e2134>] (qh_completions+0x0/0x540) from [<c02e3f50>] (ehci_work+0x124/0xb48)
[  117.298950] [<c02e3e2c>] (ehci_work+0x0/0xb48) from [<c02e8e48>] (ehci_irq+0x268/0x2e8)
[  117.307647] [<c02e8be0>] (ehci_irq+0x0/0x2e8) from [<c02d2890>] (usb_hcd_irq+0x38/0x8c)
[  117.316558] [<c02d2858>] (usb_hcd_irq+0x0/0x8c) from [<c00d202c>] (handle_irq_event_percpu+0x54/0x188)
[  117.326843]  r7:00000000 r6:00000000 r5:c06ec56c r4:ebd9a880
[  117.333740] [<c00d1fd8>] (handle_irq_event_percpu+0x0/0x188) from [<c00d21a4>] (handle_irq_event+0x44/0x64)
[  117.344451] [<c00d2160>] (handle_irq_event+0x0/0x64) from [<c00d4968>] (handle_fasteoi_irq+0xa0/0x10c)
[  117.354705]  r7:0000006d r6:e4880000 r5:c06ec56c r4:c06ec520
[  117.361755] [<c00d48c8>] (handle_fasteoi_irq+0x0/0x10c) from [<c00d1b94>] (generic_handle_irq+0x3c/0x48)
[  117.372009]  r5:c0042e80 r4:c06f8568
[  117.376495] [<c00d1b58>] (generic_handle_irq+0x0/0x48) from [<c0044054>] (asm_do_IRQ+0x54/0xb4)
[  117.385986] [<c0044000>] (asm_do_IRQ+0x0/0xb4) from [<c004a4c8>] (__irq_svc+0x48/0xe0)
[  117.394622] Exception stack(0xe4881a70 to 0xe4881ab8)
[  117.400329] 1a60:                                     ec417548 20000113 00000000 00000000
[  117.409210] 1a80: 20000113 ec417000 ebb6f380 ec41753c ec417548 00000000 20000113 e4881acc
[  117.418243] 1aa0: e4881ab8 e4881ab8 c053cefc c053cf00 60000113 ffffffff
[  117.425598]  r9:e4880000 r8:00000202 r7:00000203 r6:0000006d r5:fa240100
[  117.433471] r4:ffffffff
[  117.436828] [<c053cee8>] (_raw_spin_unlock_irqrestore+0x0/0x44) from [<c02b95ec>] (usbnet_start_xmit+0x1e0/0x364)
[  117.448089]  r5:ec417000 r4:00000001
[  117.452667] [<c02b940c>] (usbnet_start_xmit+0x0/0x364) from [<c03b1aec>] (dev_hard_start_xmit+0x32c/0x5a4)
[  117.463287] [<c03b17c0>] (dev_hard_start_xmit+0x0/0x5a4) from [<c03c91c4>] (sch_direct_xmit+0xb8/0x1e4)
[  117.473449] [<c03c910c>] (sch_direct_xmit+0x0/0x1e4) from [<c03b1ed0>] (dev_queue_xmit+0x16c/0x57c)
[  117.483459] [<c03b1d64>] (dev_queue_xmit+0x0/0x57c) from [<c0404200>] (ip_finish_output+0x144/0x3d4)
[  117.493560] [<c04040bc>] (ip_finish_output+0x0/0x3d4) from [<c04045b0>] (ip_output+0x120/0x140)
[  117.503173] [<c0404490>] (ip_output+0x0/0x140) from [<c0403500>] (ip_local_out+0x30/0x34)
[  117.512207]  r9:00003890 r8:ebbc8c68 r7:00000000 r6:eb88e8e4 r5:eb572230
[  117.519989] r4:eb572230
[  117.523284] [<c04034d0>] (ip_local_out+0x0/0x34) from [<c0403938>] (ip_queue_xmit+0x11c/0x3ac)
[  117.532775]  r5:eb572230 r4:ebbc8a40
[  117.537139] [<c040381c>] (ip_queue_xmit+0x0/0x3ac) from [<c0418d34>] (tcp_transmit_skb+0x388/0x8c4)
[  117.547119] [<c04189ac>] (tcp_transmit_skb+0x0/0x8c4) from [<c041b6c0>] (tcp_write_xmit+0x1f8/0xa5c)
[  117.557220] [<c041b4c8>] (tcp_write_xmit+0x0/0xa5c) from [<c041bfa8>] (__tcp_push_pending_frames+0x30/0x8c)
[  117.567932] [<c041bf78>] (__tcp_push_pending_frames+0x0/0x8c) from [<c040dd00>] (tcp_sendmsg+0x8ec/0xce0)
[  117.578460]  r4:000005a8
[  117.581512] [<c040d414>] (tcp_sendmsg+0x0/0xce0) from [<c042ebc0>] (inet_sendmsg+0x9c/0xc0)
[  117.590759] [<c042eb24>] (inet_sendmsg+0x0/0xc0) from [<c039dc28>] (sock_sendmsg+0xa4/0xb8)
[  117.599822]  r7:00000000 r6:e4881f54 r5:00000001 r4:ec3b2400
[  117.606903] [<c039db84>] (sock_sendmsg+0x0/0xb8) from [<c039f0f8>] (sys_sendto+0xb8/0xdc)
[  117.615966]  r7:00000000 r6:ec3b2400 r5:00008d56 r4:00000001
[  117.622833] [<c039f040>] (sys_sendto+0x0/0xdc) from [<c004aa00>] (ret_fast_syscall+0x0/0x30)

Thomas Zimmermann [:tzimmermann] [:tdz]

Comment 21

•

11 years ago

FYI, I found that the interupts for the USB requests are generated correctly, but the USB controller fails to process the outstanding requests. The latest active request simply never get forwarded to the NIC until the board resumes.

To see if the problem has been fixed in a later kernel version, I ported the drivers for the NIC and the USB controller from the 3.9-rc1 mainline kernel to the panda board's kernel. This didn't help however.

Mark Côté [:mcote]

Reporter

Comment 22

•

11 years ago

Correcting the bug summary, since this has nothing to do with the SUTAgent.  Not sure what Product/Component it should be under, though; feel free to change. :)

Summary: [Negatus] pull and cat are very, very slow on pandas → Pandas running b2g stop sending network traffic

Thomas Zimmermann [:tzimmermann] [:tdz]

Comment 23

•

11 years ago

I did some further research on this problem. It also happens when I connect over Wifi, which is not attached to USB. And there exists a wifi wakelock in b2g, but it's unrelated and not really used on the pandas.

As a workaround, you could set the screen timeout to 'never', which prevents the problem from happening.

Thomas Zimmermann [:tzimmermann] [:tdz]

Comment 24

•

11 years ago

> I did some further research on this problem. It also happens when I connect
> over Wifi, which is not attached to USB.

I was _not_ able to reproduce the problem with Wifi.

Thomas Zimmermann [:tzimmermann] [:tdz]

Comment 25

•

11 years ago

Attached file PCAP trace of failed packets (obsolete) — Details

This is a packet trace of a failed transmission. It can be opened with Wireshark.

The IP address ending with .24 is my notebook, .33 is the pandaboard. The first two packets are the telnet handshake. The third packet is a request for 10 strings of 300 bytes length each. The following packets are data transmissions from the pandaboard to the notebook, and the ACKs from the notebook. The pandaboard then retransmits the first data packet, which indicates that is did not correctly receive or process the ACK for this packet.

I finally resolved this problem be generating some workload on the pandaboard.

Thomas Zimmermann [:tzimmermann] [:tdz]

Comment 26

•

11 years ago

Attached file PCAP trace of failed packets — Details

Updated for clarity

Attachment #729532 - Attachment is obsolete: true

Thomas Zimmermann [:tzimmermann] [:tdz]

Comment 27

•

11 years ago

Attached file PCAP trace of failed packets #2 — Details

Another PCAP trace

Thomas Zimmermann [:tzimmermann] [:tdz]

Comment 28

•

11 years ago

Attached file Dmesg log of failed packets #2 — Details

dmesg output for trace #2

The lines 'usbnet_skb_return:270' signal received packets, the lines 'usbnet_start_xmit:1111' signal sent packets. The numbers at the end correlate them with packets in the trace. Interestingly my notebook's ACKs are received and processed by the pandaboard. At some point, my notebook stops ACK-ing packets, even though they have received. The pandaboard sent some retransmits, but these are not received by my notebook.

Thomas Zimmermann [:tzimmermann] [:tdz]

Updated

•

10 years ago

Whiteboard: [mentor=:tdz]

Thomas Zimmermann [:tzimmermann] [:tdz]

Updated

•

10 years ago

Assignee: tzimmermann → nobody

David Burns :automatedtester

Updated

•

10 years ago

Whiteboard: [mentor=:tdz] → [mentor=:tdz][good first bug]

William Lachance (:wlach)

Comment 29

•

10 years ago

Not a good first bug.

Whiteboard: [mentor=:tdz][good first bug]

Thomas Zimmermann [:tzimmermann] [:tdz]

Comment 30

•

10 years ago

Right, but you can keep me as mentor.

Whiteboard: [mentor=:tdz]

Nobody; OK to take it and work on it

Assignee

Updated

•

10 years ago

Mentor: tzimmermann

Whiteboard: [mentor=:tdz]

Jonathan Griffin (:jgriffin)

Comment 31

•

10 years ago

We aren't running pandas in b2g automation any longer.

Status: ASSIGNED → RESOLVED

Closed: 10 years ago

Resolution: --- → WONTFIX

BMO Automation

Updated

•

6 years ago

Product: Testing → Testing Graveyard

/proc/config.gz from linaro image 11 years ago Mark Côté [:mcote] 17.26 KB, application/x-gzip		Details
PCAP trace of failed packets 11 years ago Thomas Zimmermann [:tzimmermann] [:tdz] 1.79 MB, application/vnd.tcpdump.pcap		Details
PCAP trace of failed packets 11 years ago Thomas Zimmermann [:tzimmermann] [:tdz] 15.87 KB, application/vnd.tcpdump.pcap		Details
PCAP trace of failed packets #2 11 years ago Thomas Zimmermann [:tzimmermann] [:tdz] 17.55 KB, application/vnd.tcpdump.pcap		Details
Dmesg log of failed packets #2 11 years ago Thomas Zimmermann [:tzimmermann] [:tdz] 42.79 KB, text/plain		Details