lwIP Wiki

In contrast to tuning for low code size, many users want to tune lwIP for maximum throughput. This page gives an overview of what influences the performance of an ethernet device using lwIP.

Architecture design

  • Favour big-endian systems over little-endian systems if you have the choice: network byte order is big-endian, so byte-order conversion can be omitted
  • One bottleneck of the system is the ethernet MAC driver (called the "netif driver" in lwIP):
    • Use interrupts and DMA if possible
    • Make sure it is as fast as it can be
    • Often, drivers can be written to prefer either TX or RX. If one direction is more important to your application than the other, make sure that direction is preferred in high-load situations!
    • If the hardware allows, make sure the driver supports scatter-gather. This allows the driver to DMA a packet consisting of multiple pbufs (e.g. one pbuf for the protocol headers and another pbuf for the application data, which can then be sent zero-copy).
  • The other big bottleneck is (TCP- and UDP-) checksum calculation (creating checksums when transmitting data, checking checksums when receiving data):
    • If the hardware allows it, leave checksum-generation and -checking to the hardware (see also configuration options CHECKSUM_CHECK_* and CHECKSUM_GEN_*)
    • If you do not have hardware support, make sure you have a really optimized software routine to calculate the checksums. This routine is probably the most critical path regarding throughput in the whole stack, so knowing the architecture well and writing a highly optimized assembler-routine is recommended!
  • Define MEMCPY as a fast alternative to the default memcpy, which on many targets results in (slow!) byte-by-byte copying; a good replacement copies in units of the architecture's maximum word size
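To make the checksum point concrete, here is a minimal portable sketch of the standard Internet checksum (RFC 1071), which is what the TCP/UDP/IP checksums compute. The function name is illustrative, not lwIP's actual routine; a real high-throughput port would unroll the loop, read 32 bits at a time, or drop to assembler (lwIP ports can typically override the stack's routine via the LWIP_CHKSUM hook, but check your version's inet_chksum.h):

```c
#include <stddef.h>
#include <stdint.h>

/* Straightforward RFC 1071 Internet checksum: sum 16-bit big-endian
 * words, fold the carries back in, return the one's complement.
 * Name is illustrative only. */
static uint16_t inet_chksum_sketch(const void *data, size_t len)
{
    const uint8_t *p = (const uint8_t *)data;
    uint32_t sum = 0;

    while (len > 1) {
        sum += ((uint32_t)p[0] << 8) | p[1];  /* one 16-bit word */
        p += 2;
        len -= 2;
    }
    if (len > 0) {                  /* odd trailing byte, zero-padded */
        sum += (uint32_t)p[0] << 8;
    }
    while (sum >> 16) {             /* fold carry bits back into the sum */
        sum = (sum & 0xFFFF) + (sum >> 16);
    }
    return (uint16_t)~sum;
}
```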
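A word-wise MEMCPY replacement as mentioned in the last bullet could be sketched like this (the helper name is made up; a real port would usually plug in its platform's hand-optimized routine instead):

```c
#include <stddef.h>
#include <stdint.h>

/* Copies whole machine words when source and destination share the same
 * alignment; falls back to byte copies otherwise and for the tail. */
static void *lwip_fast_memcpy(void *dst, const void *src, size_t len)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    const uintptr_t mask = sizeof(uintptr_t) - 1;

    if (((uintptr_t)d & mask) == ((uintptr_t)s & mask)) {
        /* Advance byte-by-byte up to a word boundary */
        while (len > 0 && ((uintptr_t)d & mask)) {
            *d++ = *s++;
            len--;
        }
        /* Bulk copy one machine word at a time */
        while (len >= sizeof(uintptr_t)) {
            *(uintptr_t *)d = *(const uintptr_t *)s;
            d += sizeof(uintptr_t);
            s += sizeof(uintptr_t);
            len -= sizeof(uintptr_t);
        }
    }
    /* Tail bytes, or full fallback when alignments differ */
    while (len-- > 0) {
        *d++ = *s++;
    }
    return dst;
}

/* In lwipopts.h, route the stack's internal copies through it: */
#define MEMCPY(dst, src, len) lwip_fast_memcpy((dst), (src), (len))
```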

Configuration options influencing throughput

Options are only listed here if they must be changed from their default values in opt.h. Check your lwipopts.h for settings that deviate from the defaults unnecessarily.

  • First of all, turn on statistics in a test run (define LWIP_STATS and the *_STATS option for each protocol) and check that none of the statistics counters reports an error (member '.err' != 0)
  • Generally, set the MEMP_NUM_* defines as high as your memory allows to prevent running out of pools in high-load situations.
  • Turn off debugging options (don't define LWIP_DEBUG)
  • As mentioned in the previous paragraph, set the CHECKSUM_CHECK_* and CHECKSUM_GEN_* defines to 0 if checksums are generated and/or checked by your hardware
  • If your memory allows it, set MEM_USE_POOLS to 1 and define LWIP_MALLOC_MEMPOOL entries in lwippools.h. This may waste memory, but pools are way faster than a heap!
  • On 32-bit platforms, set ETH_PAD_SIZE to 2 to make sure data and headers are 32-bit aligned.
    • You may even turn off structure packing for better performance, but this is not thoroughly tested yet, so make sure you test it!
  • When using a version later than 1.3.2, make sure LWIP_CHECKSUM_ON_COPY is set to 1. This lets the stack calculate the checksum on-the-fly when copying data using memcpy. (This has no effect when the hardware generates/checks checksums.)
  • Set LWIP_RAW to 0 if you don't need raw pcbs (speeds up input processing).
  • For TCP optimizations, see Tuning TCP
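Collected in one place, a lwipopts.h excerpt applying the options above might look like the following. The values are illustrative; in particular, MEMP_USE_CUSTOM_POOLS is believed to be required alongside MEM_USE_POOLS so that lwippools.h is pulled in, but verify every option against your lwIP version's opt.h:

```c
/* lwipopts.h excerpt -- illustrative throughput-oriented settings */

/* Hardware checksum offload: skip software checksums entirely */
#define CHECKSUM_GEN_IP       0
#define CHECKSUM_GEN_UDP      0
#define CHECKSUM_GEN_TCP      0
#define CHECKSUM_CHECK_IP     0
#define CHECKSUM_CHECK_UDP    0
#define CHECKSUM_CHECK_TCP    0

/* Align the IP header on 32-bit platforms (ethernet header is 14 bytes) */
#define ETH_PAD_SIZE          2

/* Allocate heap requests from pools (pools defined in lwippools.h) */
#define MEM_USE_POOLS         1
#define MEMP_USE_CUSTOM_POOLS 1

/* Fold checksum calculation into the data copy (versions after 1.3.2) */
#define LWIP_CHECKSUM_ON_COPY 1

/* No raw pcbs needed: speeds up input processing */
#define LWIP_RAW              0

/* No debug output */
#undef LWIP_DEBUG
```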

Application design

If you want maximum throughput, you will want to use the raw API for your application since it provides much better throughput than the sequential APIs (netconn-/socket API).

Designing your application or protocol, you first have to choose between UDP and TCP:

  • UDP
    • pro: has less overhead and you can choose the message size yourself
    • contra: does not provide reliable delivery (the protocol cannot tell you whether the remote side has received a message)
  • TCP
    • pro: provides reliable delivery (you are informed when the remote side has successfully received the data)
    • contra: has more overhead and (to a certain degree) chooses the message size automatically

Having chosen the transport protocol, you have to decide how your application passes data over the network:

  • UDP:
    • Make sure you do not pass data in smaller chunks than the maximum packet size of your network allows (e.g. pass 1472 bytes of data to udp_send on standard ethernet: a 1500-byte MTU minus 20 bytes of IP header and 8 bytes of UDP header). This maximizes the ratio of payload to header bytes per packet and minimizes the inter-packet gaps in the network.
    • Pass the data in a PBUF_REF or PBUF_ROM if it does not change until the packet has been sent to prevent an extra memcpy. ATTENTION: keep in mind that for DMA-enabled MACs, the packet may not have been sent when udp_send returns!
  • TCP:
    • Although TCP can combine data from multiple calls to tcp_write into one packet, this may decrease performance since the packet is split into multiple pbufs (scatter-gather for DMA-enabled MACs)
    • Pass the data without the flag TCP_WRITE_FLAG_COPY if it does not change until the packet has been acknowledged to prevent an extra memcpy. ATTENTION: keep in mind that with TCP, this may take up to several seconds after tcp_write/tcp_output returns, as the packet is stored for retransmission until an ACK has been received from the remote host!
    • If you write small chunks, turn off the Nagle algorithm (see Wikipedia for more info!) so the stack sends data right away instead of waiting for more data to form bigger packets (see tcp_nagle_disable()).
    • Try to avoid sending small chunks of data and then waiting for an ACK: delayed ACK on the remote host can ruin performance (often, only every 2nd packet is ACKed)
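The zero-copy UDP case above can be sketched against the raw API as follows. This is a hedged illustration, not code from the lwIP tree: the function name, pcb handling, and payload buffer are made up, and the exact pbuf handling should be checked against your lwIP version:

```c
#include "lwip/udp.h"
#include "lwip/pbuf.h"

/* 1472 = 1500-byte ethernet MTU - 20 (IP header) - 8 (UDP header) */
#define UDP_PAYLOAD_MAX 1472

/* Must stay valid and unmodified until the MAC has actually sent it */
static const u8_t payload[UDP_PAYLOAD_MAX] = { /* application data */ };

void send_zero_copy(struct udp_pcb *pcb)
{
  /* PBUF_REF: the pbuf only references our buffer, so no memcpy of the
   * payload happens. With a DMA-enabled MAC, the data may still be in
   * flight after udp_send() returns! */
  struct pbuf *p = pbuf_alloc(PBUF_TRANSPORT, UDP_PAYLOAD_MAX, PBUF_REF);
  if (p != NULL) {
    p->payload = (void *)payload;   /* PBUF_REF leaves payload unset */
    udp_send(pcb, p);               /* headers go into a chained pbuf */
    pbuf_free(p);                   /* drop our reference */
  }
}
```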
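The TCP points (no-copy writes plus a disabled Nagle algorithm) combine into a sketch like this. Again an illustration under assumptions, not lwIP code: the function name and buffer are invented, and the buffer must outlive the transfer as the bullets above warn:

```c
#include "lwip/tcp.h"

/* Must stay valid and unmodified until ACKed by the remote host --
 * possibly seconds later if retransmissions occur */
static const char response[] = "static response data";

err_t send_no_copy(struct tcp_pcb *pcb)
{
  err_t err;

  /* Small writes: disable Nagle so segments leave immediately instead
   * of being held back to coalesce into bigger packets */
  tcp_nagle_disable(pcb);

  /* Flags = 0, i.e. no TCP_WRITE_FLAG_COPY: the stack only keeps a
   * reference to 'response', avoiding an extra memcpy */
  err = tcp_write(pcb, response, sizeof(response) - 1, 0);
  if (err == ERR_OK) {
    err = tcp_output(pcb);  /* push the segment out now */
  }
  return err;
}
```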