What Are the Advantages of NAPI over IRQ Coalescing?

What are the advantages of NAPI over IRQ coalescing?

I view NAPI as a form of interrupt coalescing, and I think your question may stem from a misunderstanding about NAPI. First of all, interrupts are involved with NAPI. Also, NAPI's polling is not actually "in vain". Remember, for NAPI, the idea is that high-throughput traffic is bursty. NAPI only "starts" after a "packet received" interrupt happens.

Here's a quick overview of how NAPI is supposed to be used:

The hardware raises the "packet received" interrupt, which the kernel dispatches to a network device driver using NAPI. The driver then disables interrupts related to receiving packets and, using NAPI, tells the Linux networking subsystem to poll the device. The poll function is implemented by the device driver, is passed to the networking subsystem, and contains the device driver's packet handler. After enough packets are received or a timeout is reached, packet-receiving interrupts are re-enabled, and everything starts over again.

So NAPI is basically just a centralized API in the Linux networking subsystem for supporting interrupt coalescing to reduce receive livelock situations. NAPI gives device driver developers a clean framework for interrupt coalescing. NAPI does not run all the time, but only happens when traffic is actually received, making it essentially an interrupt coalescing scheme... At least in my book.

Note: This was all in the context of a network device driver using NAPI, but in fact NAPI can be used for any kind of interrupt. This is one of the benefits of NAPI, as well.

If there are any errors in my understanding, please feel free to point them out!

Regarding NAPI implementation in the Linux kernel

Now, for a NAPI-enabled Ethernet driver, initially whenever packets arrive at the interface, the CPU is notified and the appropriate Ethernet driver code (the interrupt handler) is executed. Inside the interrupt handler code we check whether the interrupt type is "packet received".

What is meant by disabling further interrupts?

Normally a driver would clear the condition causing the interrupt. The NAPI driver, however, may also disable the receive interrupt when the ISR is done.

The assumption is that the arrival of one Ethernet frame may be the start of a burst or flood of frames. So instead of exiting interrupt mode and likely immediately reentering interrupt mode, why not test (i.e. poll) if more frames have already arrived?

Does it mean packets are still captured by the device?

Yes.

Each arriving frame is stored by the Ethernet controller in a frame buffer.

and kept in device memory

It's not typically "device memory".

It is typically a set of buffers (e.g. ring buffer) allocated in main memory assigned to the Ethernet controller.

but the CPU is not notified about the availability of these packets?

Since the receive interrupt has been disabled, the NAPI driver is not notified of this event.

But since the driver is busy processing the previous frame, the interrupt request could not be serviced immediately anyway.

Also, what is meant by the CPU is "pooling" the device?

Presumably you are actually asking about "polling"?

Polling simply means that the program (i.e. the driver) interrogates (i.e. reads and tests) status bit(s) for the condition it is waiting for.

If the condition is met, then it will process the event in a manner similar to an interrupt for that event.

If the condition is not met, then it may loop (in the generic case). But the NAPI driver, when the poll indicates that no more frames have arrived, will assume that the packet burst or flood is over, and will resume interrupt mode.

Is it like the CPU, after every few seconds, will run the snull_poll() method, copy whatever number of packets are in device memory to the DMA buffer, and push them to the upper layer?

The NAPI driver would not delay or suspend itself for a "few seconds" before polling.

The assumption is that Ethernet frames could be flooding the port, so the poll would be performed as soon as processing on the current frame is complete.

A possible bug in a NAPI driver is called "rotting packet".

When the driver transitions from the poll mode back to interrupt mode, a frame could arrive during this transition and be undetected by the driver.

Not until another frame arrives (and generates an interrupt) would the previous frame be "found" and processed by the NAPI driver.

BTW

You consistently write statements or questions similar to "the CPU does ..." or "notified to CPU".

The CPU is always (when not sleeping or powered off) executing machine instructions.

You should be concerned with which logical entity (i.e. which program or source-code module) those instructions belong to.

You're asking software questions, so the fact that an interrupt causes a known, certain sequence by the CPU is a given and need not be mentioned.

ADDENDUM

I am just trying to understand drivers/net/ethernet/smsc/smsc911x.c in the Linux source code.

The SMSC LAN911x Ethernet chips are more sophisticated than what I'm used to and have been describing above. Besides the MAC, these chips also have an integrated PHY, and have TX and RX FIFOs instead of using buffer ring or lists in main memory.

As per your suggestion, I have started reading the SMSC LAN9118 datasheet and am trying to map it to the smsc911x_irqhandler function, where the interrupt status (INT_STS) and interrupt enable (INT_EN) registers are read, but I don't get how the condition

if (likely(intsts & inten & INT_STS_RSFL_))

is checked here at line 1627.

INT_STS is defined in the header file as

#define INT_STS                         0x58

and the table in Section 5.3, System Control and Status Registers, in the datasheet lists the register at (relative) address 0x58 as

58h INT_STS Interrupt Status 

So the smsc911x device driver uses the exact same register name as the HW datasheet.

The ISR reads this 32-bit register at that offset using:

u32 intsts = smsc911x_reg_read(pdata, INT_STS);

So the 32 bits of the interrupt status (in variable intsts) are bitwise ANDed with the 32 bits of the interrupt enable mask (in variable inten).

This produces the interrupt status bits that the driver is actually interested in. It is also good defensive programming in case the HW sets status bits anyway for interrupt conditions that have not been enabled (in the INT_EN register).

Then that if statement does another bitwise AND to extract the one bit (INT_STS_RSFL_) that is being checked.

5.3.3 INT_STS—Interrupt Status Register

RX Status FIFO Level Interrupt (RSFL).
Generated when the RX Status FIFO reaches the programmed level

The likely() operator is for compiler optimization to utilize branch prediction capabilities in the CPU. The driver's author is directing the compiler to optimize the code for a true result of the enclosed logic expression (e.g. the ANDing of three integers, which would indicate an interrupt condition that needs servicing).

Also, on receiving a packet on the interface, which bit is set in which register?

My take on reading the LAN9118 datasheet is that there really is no interrupt specifically for the receipt of a frame.

Instead the host can be notified when the RX FIFO exceeds a threshold.

5.3.6 FIFO_INT—FIFO Level Interrupts

RX Status Level.
The value in this field sets the level, in number of DWORDs, at which the RX Status FIFO Level interrupt (RSFL) will be generated.
When the RX Status FIFO used space is greater than this value an RX Status FIFO Level interrupt (RSFL) will be generated.

The smsc911x driver apparently uses this threshold at its default value of zero.

Each entry in the RX Status FIFO occupies a DWORD. The default value of this threshold is 0x00 (i.e. interrupt on "first" frame). If this threshold is more than zero, then there is the possibility of "rotting packets".

Who does NAPI scheduling?

Scheduling in the NAPI sense just means marking it as needing to run. In other words, you simply make the function call saying "schedule me to run in the NAPI softirq". This causes your driver's poll function to be added to a list of "need-to-be-polled" devices, and it also causes the NAPI softirq to be activated "on the way out."

So it typically works this way. Your driver configures your device to tell it to interrupt at some point in the future when some packets (ideally more than one so as to amortize the overhead) are ready to be processed. In the meantime, the kernel schedules ordinary user-space processes...

When your device interrupts:

  • the interrupt causes a transition to kernel-mode if not already in kernel-mode
  • The Linux interrupt handling code finds your driver's interrupt handler routine and invokes it.
  • Your interrupt handler calls napi_schedule (placing your poll function on a list and triggering the softirq).
  • Your interrupt handler returns.
  • Just before returning to user-mode (or whatever the CPU was doing prior to the interrupt), the interrupt handling code sees that the network softirq needs to run and activates it.
  • The softirq calls your poll function to process incoming packets until you have no more packets or until the NAPI "budget" is exhausted. (In the latter case, the softirq is later re-invoked by the ksoftirqd thread [I think].)
  • Your driver would then only re-enable interrupts on your device if it had completed processing all the ready-to-process packets.

In my experience, work-queues are typically used only for certain long-duration operations that need to be done in a schedulable context (i.e. a real task context that can block [sleep] while waiting for something to complete).

For example, in the intel ixgbe driver, if the driver detects that the NIC needs to be reset, a work-queue entry is triggered to re-initialize the NIC. Some of the operations require relatively long sleeps and you don't want to tie up the processor in a softirq context if you can let some other task run. There may be other reasons as well -- for example, you need to allocate large amounts of memory that may require the allocation call to be made in task context.

What is the difference between Interrupt coalescing and the Nagle algorithm?

Interrupt coalescing concerns the network driver: the idea is to avoid invoking the interrupt handler anew every time a network packet shows up. Instead, after receiving a packet, the NIC waits until M packets are received or until N microseconds have passed before generating an interrupt. Then the driver can process many packets at once. (Otherwise, with modern gigabit and 10-gigabit adapters, the processor would need to field hundreds of thousands or millions of interrupts per second, which can prevent the system from being able to accomplish much else.) As your link points out, there is (or at least may be) a cost of additional latency since the OS doesn't start processing a received packet at the earliest possible instant.

Nagle's algorithm is focused on reducing the number of packets sent by coalescing payload data from multiple packets into one. The classic example is a telnet session. Without Nagle, every time you press a key, the system has to create an entire new packet (min 64 bytes on Ethernet) to send one byte.

So the intent of interrupt coalescing is to support greater bandwidth utilization, while the intent of Nagle's algorithm is actually to produce lower bandwidth (by sending fewer packets).

NAPI interrupt disabling and handling shared interrupt line

I wrote a comprehensive guide to understanding, tuning, and optimizing the Linux network stack which explains everything about network drivers, NAPI, and more, so check it out.

As far as your questions:

  1. Device IRQs are supposed to be disabled by the driver's IRQ handler after NAPI is enabled. Yes, there is a time gap, but it should be quite small. That is part of the tradeoff decision you must make: do you care more about throughput or latency? Depending on which, you can optimize your network stack appropriately. In any case, most NICs allow the user to increase (or decrease) the size of the ring buffer that tracks incoming network data. So, a pause is fine because packets will just be queued for processing later.

  2. It depends on the driver, but in general most drivers will enable NAPI poll mode in the IRQ handler, as soon as it is fired (usually) with a call to napi_schedule. You can find a walkthrough of how NAPI is enabled for the Intel igb driver here. Note that IRQ handlers are not necessarily fired for every single packet. You can adjust the rate at which IRQ handlers fire on most cards by using a feature called interrupt coalescing. Some NICs may not support this option.

  3. The IRQ handlers for other devices will be executed when the IRQ is fired because IRQ handlers have very high priority on the CPU. The NAPI poll loop (which runs in a SoftIRQ) will run on whichever CPU the device IRQ was handled on. Thus, if you have multiple NICs and multiple CPUs, you can tune the IRQ affinity of the IRQs for each NIC to prevent starving a particular NIC.

  4. As for the example you asked about in the comments:

say NIC 1 and NIC 2 share an IRQ line; let's assume NIC 1 is under low load and NIC 2 under high load, and NIC 1 receives an interrupt. The driver of NIC 1 would disable the interrupt until its softirq is handled; call that time gap t1. So for time t1, NIC 2's interrupts are disabled too, right?

This depends on the driver, but in the normal case, NIC 1 only disables interrupts while the IRQ handler is being executed. The call to napi_schedule tells the softirq code that it should start running if it hasn't started yet. The softirq code runs asynchronously, so no, NIC 1 does not wait for the softirq to be handled.

Now, as far as shared IRQs go: again it depends on the device and the driver. The driver should be written in such a way that it can handle shared IRQs. If the driver disables an IRQ that is being shared, all devices sharing that IRQ will not receive interrupts. This would be bad. One way that some devices solve this is by allowing a driver to read/write to a specific register causing that specific device to stop generating interrupts. This is a preferred solution as it does not block other devices generating the same IRQ.

When IRQs are disabled for NAPI, what is meant is that the driver asks the NIC hardware to stop sending IRQs. Thus, other IRQs on the same line (for other devices) will still continue to be processed. Here's an example of how the Intel igb driver turns off IRQs for that device by writing to registers.


