Tun, tap, virtio, packet and uml vector all use struct virtio_net_hdr
to communicate packet metadata to userspace.
For skbuffs with vlan, the first two return the packet as it may have
existed on the wire, inserting the VLAN tag in the user buffer. Then
virtio_net_hdr.csum_start needs to be adjusted by VLAN_HLEN bytes.
Commit f09e2249c4 ("macvtap: restore vlan header on user read")
added this feature to macvtap. Commit 3ce9b20f19 ("macvtap: Fix
csum_start when VLAN tags are present") then fixed up csum_start.
Virtio, packet and uml do not insert the vlan header in the user
buffer.
When introducing virtio_net_hdr_from_skb to deduplicate filling in
the virtio_net_hdr, the variant from macvtap which adds VLAN_HLEN was
applied uniformly, breaking csum offset for packets with vlan on
virtio and packet.
Make insertion of VLAN_HLEN optional. Convert the callers to pass it
when needed.
Fixes: e858fae2b0 ("virtio_net: use common code for virtio_net_hdr and skb GSO conversion")
Fixes: 1276f24eee ("packet: use common code for virtio_net_hdr and skb GSO conversion")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-06-05
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Add a new BPF hook for sendmsg similar to existing hooks for bind and
connect: "This allows to override source IP (including the case when it's
set via cmsg(3)) and destination IP:port for unconnected UDP (slow path).
TCP and connected UDP (fast path) are not affected. This makes UDP support
complete, that is, connected UDP is handled by connect hooks, unconnected
by sendmsg ones.", from Andrey.
2) Rework of the AF_XDP API to allow extending it in future for type writer
model if necessary. In this mode a memory window is passed to hardware
and multiple frames might be filled into that window instead of just one
that is the case in the current fixed frame-size model. With the new
changes made this can be supported without having to add a new descriptor
format. Also, core bits for the zero-copy support for AF_XDP have been
merged as agreed upon, where i40e bits will be routed via Jeff later on.
Various improvements to documentation and sample programs included as
well, all from Björn and Magnus.
3) Given BPF's flexibility, a new program type has been added to implement
infrared decoders. Quote: "The kernel IR decoders support the most
widely used IR protocols, but there are many protocols which are not
supported. [...] There is a 'long tail' of unsupported IR protocols,
for which lircd is need to decode the IR. IR encoding is done in such
a way that some simple circuit can decode it; therefore, BPF is ideal.
[...] user-space can define a decoder in BPF, attach it to the rc
device through the lirc chardev.", from Sean.
4) Several improvements and fixes to BPF core, among others, dumping map
and prog IDs into fdinfo which is a straight forward way to correlate
BPF objects used by applications, removing an indirect call and therefore
retpoline in all map lookup/update/delete calls by invoking the callback
directly for 64 bit archs, adding a new bpf_skb_cgroup_id() BPF helper
for tc BPF programs to have an efficient way of looking up cgroup v2 id
for policy or other use cases. Fixes to make sure we zero tunnel/xfrm
state that hasn't been filled, to allow context access wrt pt_regs in
32 bit archs for tracing, and last but not least various test cases
for fixes that landed in bpf earlier, from Daniel.
5) Get rid of the ndo_xdp_flush API and extend the ndo_xdp_xmit with
a XDP_XMIT_FLUSH flag instead which allows to avoid one indirect
call as flushing is now merged directly into ndo_xdp_xmit(), from Jesper.
6) Add a new bpf_get_current_cgroup_id() helper that can be used in
tracing to retrieve the cgroup id from the current process in order
to allow for e.g. aggregation of container-level events, from Yonghong.
7) Two follow-up fixes for BTF to reject invalid input values and
related to that also two test cases for BPF kselftests, from Martin.
8) Various API improvements to the bpf_fib_lookup() helper, that is,
dropping MPLS bits which are not fully hashed out yet, rejecting
invalid helper flags, returning error for unsupported address
families as well as renaming flowlabel to flowinfo, from David.
9) Various fixes and improvements to sockmap BPF kselftests in particular
in proper error detection and data verification, from Prashant.
10) Two arm32 BPF JIT improvements. One is to fix imm range check with
regards to whether immediate fits into 24 bits, and a naming cleanup
to get functions related to rsh handling consistent to those handling
lsh, from Wang.
11) Two compile warning fixes in BPF, one for BTF and a false positive
to silent gcc in stack_map_get_build_id_offset(), from Arnd.
12) Add missing seg6.h header into tools include infrastructure in order
to fix compilation of BPF kselftests, from Mathieu.
13) Several formatting cleanups in the BPF UAPI helper description that
also fix an error during rst2man compilation, from Quentin.
14) Hide an unused variable in sk_msg_convert_ctx_access() when IPv6 is
not built into the kernel, from Yue.
15) Remove a useless double assignment in dev_map_enqueue(), from Colin.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the ndo_xdp_flush call implementation virtnet_xdp_flush
as no callers of ndo_xdp_flush are left.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Use the common free functions while return successfully.
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When passed the XDP_XMIT_FLUSH flag virtnet_xdp_xmit now performs the
same virtqueue_kick as virtnet_xdp_flush.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch only change the API and reject any use of flags. This is an
intermediate step that allows us to implement the flush flag operation
later, for each individual driver in a separate patch.
The plan is to implement flush operation via XDP_XMIT_FLUSH flag
and then remove XDP_XMIT_FLAGS_NONE when done.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Fix to return a negative error code from the failover create fail error
handling case instead of 0, as done elsewhere in this function.
Fixes: ba5e4426e8 ("virtio_net: Extend virtio to use VF datapath when available")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch enables virtio_net to switch over to a VF datapath when STANDBY
feature is enabled and a VF netdev is present with the same MAC address.
It allows live migration of a VM with a direct attached VF without the need
to setup a bond/team between a VF and virtio net device in the guest.
It uses the API that is exported by the net_failover driver to create and
and destroy a master failover netdev. When STANDBY feature is enabled, an
additional netdev(failover netdev) is created that acts as a master device
and tracks the state of the 2 lower netdevs. The original virtio_net netdev
is marked as 'standby' netdev and a passthru device with the same MAC is
registered as 'primary' netdev.
The hypervisor needs to unplug the VF device from the guest on the source
host and reset the MAC filter of the VF to initiate failover of datapath
to virtio before starting the migration. After the migration is completed,
the destination hypervisor sets the MAC filter on the VF and plugs it back
to the guest to switch over to VF datapath.
This patch is based on the discussion initiated by Jesse on this thread.
https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This feature bit can be used by hypervisor to indicate virtio_net device to
act as a standby for another device with the same MAC address.
VIRTIO_NET_F_STANDBY is defined as bit 62 as it is a device feature bit.
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch change the API for ndo_xdp_xmit to support bulking
xdp_frames.
When kernel is compiled with CONFIG_RETPOLINE, XDP sees a huge slowdown.
Most of the slowdown is caused by DMA API indirect function calls, but
also the net_device->ndo_xdp_xmit() call.
Benchmarked patch with CONFIG_RETPOLINE, using xdp_redirect_map with
single flow/core test (CPU E5-1650 v4 @ 3.60GHz), showed
performance improved:
for driver ixgbe: 6,042,682 pps -> 6,853,768 pps = +811,086 pps
for driver i40e : 6,187,169 pps -> 6,724,519 pps = +537,350 pps
With frames avail as a bulk inside the driver ndo_xdp_xmit call,
further optimizations are possible, like bulk DMA-mapping for TX.
Testing without CONFIG_RETPOLINE show the same performance for
physical NIC drivers.
The virtual NIC driver tun sees a huge performance boost, as it can
avoid doing per frame producer locking, but instead amortize the
locking cost over the bulk.
V2: Fix compile errors reported by kbuild test robot <lkp@intel.com>
V4: Isolated ndo, driver changes and callers.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
We need to drop refcnt to xdp_page if we see a gso packet. Otherwise
it will be leaked. Fixing this by moving the check of gso packet above
the linearizing logic. While at it, remove useless comment as well.
Cc: John Fastabend <john.fastabend@gmail.com>
Fixes: 72979a6c35 ("virtio_net: xdp, add slowpath case for non contiguous buffers")
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If we successfully linearize the packet, num_buf will be set to zero
which may confuse error handling path which assumes num_buf is at
least 1 and this can lead the code tries to pop the descriptor of next
buffer. Fixing this by checking num_buf against 1 before decreasing.
Fixes: 4941d472bf ("virtio-net: do not reset during XDP set")
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We should not go for the error path after successfully transmitting a
XDP buffer after linearizing. Since the error path may try to pop and
drop next packet and increase the drop counters. Fixing this by simply
drop the refcnt of original page and go for xmit path.
Fixes: 72979a6c35 ("virtio_net: xdp, add slowpath case for non contiguous buffers")
Cc: John Fastabend <john.fastabend@gmail.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After a linearized packet was redirected by XDP, we should not go for
the err path which will try to pop buffers for the next packet and
increase the drop counter. Fixing this by just drop the page refcnt
for the original page.
Fixes: 186b3c998c ("virtio-net: support XDP_REDIRECT")
Reported-by: David Ahern <dsahern@gmail.com>
Tested-by: David Ahern <dsahern@gmail.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 6870de435b ("bpf: make virtio compatible w/
bpf_xdp_adjust_tail") i didn't account for vi->hdr_len during new
packet's length calculation after bpf_prog_run in receive_mergeable.
because of this all packets, if they were passed to the kernel,
were truncated by 12 bytes.
Fixes:6870de435b90 ("bpf: make virtio compatible w/ bpf_xdp_adjust_tail")
Reported-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
offloads is a buffer in virtio format, should use
the __virtio64 tag.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Programming vids (adding or removing them) still passes
guest-endian values in the DMA buffer. That's wrong
if guest is big-endian and when virtio 1 is enabled.
Note: this is on top of a previous patch:
virtio_net: split out ctrl buffer
Fixes: 9465a7a6f ("virtio_net: enable v1.0 support")
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When sending control commands, virtio net sets up several buffers for
DMA. The buffers are all part of the net device which means it's
actually allocated by kvmalloc so it's in theory (on extreme memory
pressure) possible to get a vmalloc'ed buffer which on some platforms
means we can't DMA there.
Fix up by moving the DMA buffers into a separate structure.
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
w/ bpf_xdp_adjust_tail helper xdp's data_end pointer could be changed as
well (only "decrease" of pointer's location is going to be supported).
changing of this pointer will change packet's size.
for virtio driver we need to adjust XDP_PASS handling by recalculating
length of the packet if it was passed to the TCP/IP stack
Reviewed-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Changing API ndo_xdp_xmit to take a struct xdp_frame instead of struct
xdp_buff. This brings xdp_return_frame and ndp_xdp_xmit in sync.
This builds towards changing the API further to become a bulk API,
because xdp_buff is not a queue-able object while xdp_frame is.
V4: Adjust for commit 59655a5b6c ("tuntap: XDP_TX can use native XDP")
V7: Adjust for commit d9314c474d ("i40e: add support for XDP_REDIRECT")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Changing API xdp_return_frame() to take struct xdp_frame as argument,
seems like a natural choice. But there are some subtle performance
details here that needs extra care, which is a deliberate choice.
When de-referencing xdp_frame on a remote CPU during DMA-TX
completion, result in the cache-line is change to "Shared"
state. Later when the page is reused for RX, then this xdp_frame
cache-line is written, which change the state to "Modified".
This situation already happens (naturally) for, virtio_net, tun and
cpumap as the xdp_frame pointer is the queued object. In tun and
cpumap, the ptr_ring is used for efficiently transferring cache-lines
(with pointers) between CPUs. Thus, the only option is to
de-referencing xdp_frame.
It is only the ixgbe driver that had an optimization, in which it can
avoid doing the de-reference of xdp_frame. The driver already have
TX-ring queue, which (in case of remote DMA-TX completion) have to be
transferred between CPUs anyhow. In this data area, we stored a
struct xdp_mem_info and a data pointer, which allowed us to avoid
de-referencing xdp_frame.
To compensate for this, a prefetchw is used for telling the cache
coherency protocol about our access pattern. My benchmarks show that
this prefetchw is enough to compensate the ixgbe driver.
V7: Adjust for commit d9314c474d ("i40e: add support for XDP_REDIRECT")
V8: Adjust for commit bd658dda42 ("net/mlx5e: Separate dma base address
and offset in dma_sync call")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the IDA infrastructure for getting a cyclic increasing ID number,
that is used for keeping track of each registered allocator per
RX-queue xdp_rxq_info. Instead of using the IDR infrastructure, which
uses a radix tree, use a dynamic rhashtable, for creating ID to
pointer lookup table, because this is faster.
The problem that is being solved here is that, the xdp_rxq_info
pointer (stored in xdp_buff) cannot be used directly, as the
guaranteed lifetime is too short. The info is needed on a
(potentially) remote CPU during DMA-TX completion time . In an
xdp_frame the xdp_mem_info is stored, when it got converted from an
xdp_buff, which is sufficient for the simple page refcnt based recycle
schemes.
For more advanced allocators there is a need to store a pointer to the
registered allocator. Thus, there is a need to guard the lifetime or
validity of the allocator pointer, which is done through this
rhashtable ID map to pointer. The removal and validity of of the
allocator and helper struct xdp_mem_allocator is guarded by RCU. The
allocator will be created by the driver, and registered with
xdp_rxq_info_reg_mem_model().
It is up-to debate who is responsible for freeing the allocator
pointer or invoking the allocator destructor function. In any case,
this must happen via RCU freeing.
Use the IDA infrastructure for getting a cyclic increasing ID number,
that is used for keeping track of each registered allocator per
RX-queue xdp_rxq_info.
V4: Per req of Jason Wang
- Use xdp_rxq_info_reg_mem_model() in all drivers implementing
XDP_REDIRECT, even-though it's not strictly necessary when
allocator==NULL for type MEM_TYPE_PAGE_SHARED (given it's zero).
V6: Per req of Alex Duyck
- Introduce rhashtable_lookup() call in later patch
V8: Address sparse should be static warnings (from kbuild test robot)
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The virtio_net driver assumes XDP frames are always released based on
page refcnt (via put_page). Thus, is only queues the XDP data pointer
address and uses virt_to_head_page() to retrieve struct page.
Use the XDP return API to get away from such assumptions. Instead
queue an xdp_frame, which allow us to use the xdp_return_frame API,
when releasing the frame.
V8: Avoid endianness issues (found by kbuild test robot)
V9: Change __virtnet_xdp_xmit from bool to int return value (found by Dan Carpenter)
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We tends to batch submitting packets during XDP_TX. This requires to
kick virtqueue after a batch, we tried to do it through
xdp_do_flush_map() which only makes sense for devmap not XDP_TX. So
explicitly kick the virtqueue in this case.
Reported-by: Kimitoshi Takahashi <ktaka@nii.ac.jp>
Tested-by: Kimitoshi Takahashi <ktaka@nii.ac.jp>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Fixes: 186b3c998c ("virtio-net: support XDP_REDIRECT")
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The operstate update logic will leave an interface in the
default UNKNOWN operstate if the interface carrier state never changes
from the default carrier up state set at creation. This includes the
case of an explicit call to netif_carrier_on, as the carrier on to on
transition has no effect on operstate.
This affects virtio-net for the case that the virtio peer does
not support VIRTIO_NET_F_STATUS (the feature that provides carrier state
updates). Without this feature, the virtio specification states that
"the link should be assumed active," so, logically, the operstate should
be UP instead of UNKNOWN. This has impact on user space applications
that use the operstate to make availability decisions for the interface.
Resolve this by changing the virtio probe logic slightly to call
netif_carrier_off for both the "with" and "without" VIRTIO_NET_F_STATUS
cases, and then the existing call to netif_carrier_on for the "without"
case will cause an operstate transition.
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
XDP_REDIRECT support for mergeable buffer was removed since commit
7324f5399b ("virtio_net: disable XDP_REDIRECT in receive_mergeable()
case"). This is because we don't reserve enough tailroom for struct
skb_shared_info which breaks XDP assumption. So this patch fixes this
by reserving enough tailroom and using fixed size of rx buffer.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We try to disable NAPI to prevent a single XDP TX queue being used by
multiple cpus. But we don't check if device is up (NAPI is enabled),
this could result stall because of infinite wait in
napi_disable(). Fixing this by checking device state through
netif_running() before.
Fixes: 4941d472bf ("virtio-net: do not reset during XDP set")
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a driver implements the ndo_xdp_xmit() function, there is
(currently) no generic way to determine whether it is safe to call.
It is e.g. unsafe to call the drivers ndo_xdp_xmit, if it have not
allocated the needed XDP TX queues yet. This is the case for
virtio_net, which first allocates the XDP TX queues once an XDP/bpf
prog is attached (in virtnet_xdp_set()).
Thus, a crash will occur for virtio_net when redirecting to another
virtio_net device's ndo_xdp_xmit, which have not attached a XDP prog.
The sample xdp_redirect_map tries to attach a dummy XDP prog to take
this into account, but it can also easily fail if the virtio_net (or
actually underlying vhost driver) have not allocated enough extra
queues for the device.
Allocating more queue this is currently a manual config.
Hint for libvirt XML add:
<driver name='vhost' queues='16'>
<host mrg_rxbuf='off'/>
<guest tso4='off' tso6='off' ecn='off' ufo='off'/>
</driver>
The solution in this patch is to check that the device have loaded an
XDP/bpf prog before proceeding. This is similar to the check
performed in driver ixgbe.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
XDP_REDIRECT calling xdp_do_redirect() can fail for multiple reasons
(which can be inspected by tracepoints). The current semantics is that
on failure the driver calling xdp_do_redirect() must handle freeing or
recycling the page associated with this frame. This can be seen as an
optimization, as drivers usually have an optimized XDP_DROP code path
for frame recycling in place already.
The virtio_net driver didn't handle when xdp_do_redirect() failed.
This caused a memory leak as the page refcnt wasn't decremented on
failures.
The function __virtnet_xdp_xmit() did handle one type of failure,
when the xmit queue virtqueue_add_outbuf() is full, which "hides"
releasing a refcnt on the page. Instead the function __virtnet_xdp_xmit()
must follow API of xdp_do_redirect(), which on errors leave it up to
the caller to free the page, of the failed send operation.
Fixes: 186b3c998c ("virtio-net: support XDP_REDIRECT")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When configuring virtio_net to use the code path 'receive_small()',
in-order to get correct XDP_REDIRECT support, I discovered TCP packets
would get silently dropped when loading an XDP program action XDP_PASS.
The bug seems to be that receive_small() when XDP is loaded check that
hdr->hdr.flags is zero, which seems wrong as hdr.flags contains the
flags VIRTIO_NET_HDR_F_* :
#define VIRTIO_NET_HDR_F_NEEDS_CSUM 1 /* Use csum_start, csum_offset */
#define VIRTIO_NET_HDR_F_DATA_VALID 2 /* Csum is valid */
TCP got dropped as it had the VIRTIO_NET_HDR_F_DATA_VALID flag set.
The flags that are relevant here are the VIRTIO_NET_HDR_GSO_* flags
stored in hdr->hdr.gso_type. Thus, the fix is just check that none of
the gso_type flags have been set.
Fixes: bb91accf27 ("virtio-net: XDP support for small buffers")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The virtio_net code have three different RX code-paths in receive_buf().
Two of these code paths can handle XDP, but one of them is broken for
at least XDP_REDIRECT.
Function(1): receive_big() does not support XDP.
Function(2): receive_small() support XDP fully and uses build_skb().
Function(3): receive_mergeable() broken XDP_REDIRECT uses napi_alloc_skb().
The simple explanation is that receive_mergeable() is broken because
it uses napi_alloc_skb(), which violates XDP given XDP assumes packet
header+data in single page and enough tail room for skb_shared_info.
The longer explaination is that receive_mergeable() tries to
work-around and satisfy these XDP requiresments e.g. by having a
function xdp_linearize_page() that allocates and memcpy RX buffers
around (in case packet is scattered across multiple rx buffers). This
does currently satisfy XDP_PASS, XDP_DROP and XDP_TX (but only because
we have not implemented bpf_xdp_adjust_tail yet).
The XDP_REDIRECT action combined with cpumap is broken, and cause hard
to debug crashes. The main issue is that the RX packet does not have
the needed tail-room (SKB_DATA_ALIGN(skb_shared_info)), causing
skb_shared_info to overlap the next packets head-room (in which cpumap
stores info).
Reproducing depend on the packet payload length and if RX-buffer size
happened to have tail-room for skb_shared_info or not. But to make
this even harder to troubleshoot, the RX-buffer size is runtime
dynamically change based on an Exponentially Weighted Moving Average
(EWMA) over the packet length, when refilling RX rings.
This patch only disable XDP_REDIRECT support in receive_mergeable()
case, because it can cause a real crash.
IMHO we should consider NOT supporting XDP in receive_mergeable() at
all, because the principles behind XDP are to gain speed by (1) code
simplicity, (2) sacrificing memory and (3) where possible moving
runtime checks to setup time. These principles are clearly being
violated in receive_mergeable(), that e.g. runtime track average
buffer size to save memory consumption.
In the longer run, we should consider introducing a separate receive
function when attaching an XDP program, and also change the memory
model to be compatible with XDP when attaching an XDP prog.
Fixes: 186b3c998c ("virtio-net: support XDP_REDIRECT")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The main purpose of this patch is adding a way of checking per-queue stats.
It's useful to debug performance problems on multiqueue environment.
$ ethtool -S ens10
NIC statistics:
rx_queue_0_packets: 2090408
rx_queue_0_bytes: 3164825094
rx_queue_1_packets: 2082531
rx_queue_1_bytes: 3152932314
tx_queue_0_packets: 2770841
tx_queue_0_bytes: 4194955474
tx_queue_1_packets: 3084697
tx_queue_1_bytes: 4670196372
This change converts existing per-cpu stats structure into per-queue one.
This should not impact on performance since each queue counter is not
updated concurrently by multiple cpus.
Performance numbers:
- Guest has 2 vcpus and 2 queues
- Guest runs netserver
- Host runs 100-flow super_netperf
Before After Diff
UDP_STREAM 18byte 86.22 87.00 +0.90%
UDP_STREAM 1472byte 4055.27 4042.18 -0.32%
TCP_STREAM 16956.32 16890.63 -0.39%
UDP_RR 178667.11 185862.70 +4.03%
TCP_RR 128473.04 124985.81 -2.71%
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ability to set speed and duplex for virtio_net is useful in various
scenarios as described here:
16032be virtio_net: add ethtool support for set and get of settings
However, it would be nice to be able to set this from the hypervisor,
such that virtio_net doesn't require custom guest ethtool commands.
Introduce a new feature flag, VIRTIO_NET_F_SPEED_DUPLEX, which allows
the hypervisor to export a linkspeed and duplex setting. The user can
subsequently overwrite it later if desired via: 'ethtool -s'.
Note that VIRTIO_NET_F_SPEED_DUPLEX is defined as bit 63, the intention
is that device feature bits are to grow down from bit 63, since the
transports are starting from bit 24 and growing up.
Signed-off-by: Jason Baron <jbaron@akamai.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: virtio-dev@lists.oasis-open.org
Signed-off-by: David S. Miller <davem@davemloft.net>
The virtio_net driver doesn't dynamically change the RX-ring queue
layout and backing pages, but instead reject XDP setup if all the
conditions for XDP is not meet. Thus, the xdp_rxq_info also remains
fairly static. This allow us to simply add the reg/unreg to
net_device open/close functions.
Driver hook points for xdp_rxq_info:
* reg : virtnet_open
* unreg: virtnet_close
V3:
- bugfix, also setup xdp.rxq in receive_mergeable()
- Tested bpf-sample prog inside guest on a virtio_net device
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
A couple of minor bugfixes.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJaKW26AAoJECgfDbjSjVRpDwIH/1tq1vcjd5i00nh+gQD7jWm2
qWrpcKbsS0ANTvhm9UNFY24BVoPuNpNrApQpuBxmgxE/XGqx7A+8xXhwPAM5lLiG
uDSPB2nsfjvEUOde7bgeR+t6ay+Ki2UWKzY46lSoCTcIN7BSVFWQZonsGu2xLDzz
kKpCtlobXRnXzeWm+fh1oOZu/cn/TuAF0mbb+6TUQqSsHAt6PQ3Hwsly2EmQV5xR
of6Si/TnxFOkhZUKpezbuTU/g/20ZkHUSzgfrOuZLGonaIvZIE6BUm8m21/IgCbw
WnwSqe66WsWF2vd0RDLFq4L3qaRyGn+W7iYo3KPbwIPEEotSRFMUgixBWBB3IgU=
=Pf4w
-----END PGP SIGNATURE-----
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio bugfixes from Michael Tsirkin:
"A couple of minor bugfixes"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
virtio_net: fix return value check in receive_mergeable()
virtio_mmio: add cleanup for virtio_mmio_remove
virtio_mmio: add cleanup for virtio_mmio_probe
Since commit 39e6c8208d ("net: solve a NAPI race") napi has been able
to be rescheduled within napi_complete_done() even in non-busypoll case,
but virtnet_poll() always enabled interrupts before complete, and when
napi was rescheduled within napi_complete_done() it did not disable
interrupts.
This caused more interrupts when event idx is disabled.
According to commit cbdadbbf0c ("virtio_net: fix race in RX VQ
processing") we cannot place virtqueue_enable_cb_prepare() after
NAPI_STATE_SCHED is cleared, so disable interrupts again if
napi_complete_done() returned false.
Tested with vhost-user of OVS 2.7 on host, which does not have the event
idx feature.
* Before patch:
$ netperf -t UDP_STREAM -H 192.168.150.253 -l 60 -- -m 1472
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.150.253 () port 0 AF_INET
Socket Message Elapsed Messages
Size Size Time Okay Errors Throughput
bytes bytes secs # # 10^6bits/sec
212992 1472 60.00 32763206 0 6430.32
212992 60.00 23384299 4589.56
Interrupts on guest: 9872369
Packets/interrupt: 2.37
* After patch
$ netperf -t UDP_STREAM -H 192.168.150.253 -l 60 -- -m 1472
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.150.253 () port 0 AF_INET
Socket Message Elapsed Messages
Size Size Time Okay Errors Throughput
bytes bytes secs # # 10^6bits/sec
212992 1472 60.00 32794646 0 6436.49
212992 60.00 32793501 6436.27
Interrupts on guest: 4941299
Packets/interrupt: 6.64
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function virtqueue_get_buf_ctx() could return NULL, the return
value 'buf' need to be checked with NULL, not value 'ctx'.
Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Merge updates from Andrew Morton:
- a few misc bits
- ocfs2 updates
- almost all of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (131 commits)
memory hotplug: fix comments when adding section
mm: make alloc_node_mem_map a void call if we don't have CONFIG_FLAT_NODE_MEM_MAP
mm: simplify nodemask printing
mm,oom_reaper: remove pointless kthread_run() error check
mm/page_ext.c: check if page_ext is not prepared
writeback: remove unused function parameter
mm: do not rely on preempt_count in print_vma_addr
mm, sparse: do not swamp log with huge vmemmap allocation failures
mm/hmm: remove redundant variable align_end
mm/list_lru.c: mark expected switch fall-through
mm/shmem.c: mark expected switch fall-through
mm/page_alloc.c: broken deferred calculation
mm: don't warn about allocations which stall for too long
fs: fuse: account fuse_inode slab memory as reclaimable
mm, page_alloc: fix potential false positive in __zone_watermark_ok
mm: mlock: remove lru_add_drain_all()
mm, sysctl: make NUMA stats configurable
shmem: convert shmem_init_inodecache() to void
Unify migrate_pages and move_pages access checks
mm, pagevec: rename pagevec drained field
...
As the page free path makes no distinction between cache hot and cold
pages, there is no real useful ordering of pages in the free list that
allocation requests can take advantage of. Juding from the users of
__GFP_COLD, it is likely that a number of them are the result of copying
other sites instead of actually measuring the impact. Remove the
__GFP_COLD parameter which simplifies a number of paths in the page
allocator.
This is potentially controversial but bear in mind that the size of the
per-cpu pagelists versus modern cache sizes means that the whole per-cpu
list can often fit in the L3 cache. Hence, there is only a potential
benefit for microbenchmarks that alloc/free pages in a tight loop. It's
even worse when THP is taken into account which has little or no chance
of getting a cache-hot page as the per-cpu list is bypassed and the
zeroing of multiple pages will thrash the cache anyway.
The truncate microbenchmarks are not shown as this patch affects the
allocation path and not the free path. A page fault microbenchmark was
tested but it showed no sigificant difference which is not surprising
given that the __GFP_COLD branches are a miniscule percentage of the
fault path.
Link: http://lkml.kernel.org/r/20171018075952.10627-9-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ndo_xdp is a control path callback for setting up XDP in the
driver. We can reuse it for other forms of communication
between the eBPF stack and the drivers. Rename the callback
and associated structures and definitions.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This work enables generic transfer of metadata from XDP into skb. The
basic idea is that we can make use of the fact that the resulting skb
must be linear and already comes with a larger headroom for supporting
bpf_xdp_adjust_head(), which mangles xdp->data. Here, we base our work
on a similar principle and introduce a small helper bpf_xdp_adjust_meta()
for adjusting a new pointer called xdp->data_meta. Thus, the packet has
a flexible and programmable room for meta data, followed by the actual
packet data. struct xdp_buff is therefore laid out that we first point
to data_hard_start, then data_meta directly prepended to data followed
by data_end marking the end of packet. bpf_xdp_adjust_head() takes into
account whether we have meta data already prepended and if so, memmove()s
this along with the given offset provided there's enough room.
xdp->data_meta is optional and programs are not required to use it. The
rationale is that when we process the packet in XDP (e.g. as DoS filter),
we can push further meta data along with it for the XDP_PASS case, and
give the guarantee that a clsact ingress BPF program on the same device
can pick this up for further post-processing. Since we work with skb
there, we can also set skb->mark, skb->priority or other skb meta data
out of BPF, thus having this scratch space generic and programmable
allows for more flexibility than defining a direct 1:1 transfer of
potentially new XDP members into skb (it's also more efficient as we
don't need to initialize/handle each of such new members). The facility
also works together with GRO aggregation. The scratch space at the head
of the packet can be multiple of 4 byte up to 32 byte large. Drivers not
yet supporting xdp->data_meta can simply be set up with xdp->data_meta
as xdp->data + 1 as bpf_xdp_adjust_meta() will detect this and bail out,
such that the subsequent match against xdp->data for later access is
guaranteed to fail.
The verifier treats xdp->data_meta/xdp->data the same way as we treat
xdp->data/xdp->data_end pointer comparisons. The requirement for doing
the compare against xdp->data is that it hasn't been modified from it's
original address we got from ctx access. It may have a range marking
already from prior successful xdp->data/xdp->data_end pointer comparisons
though.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We should set xdp_xmit only when xdp_do_redirect() succeed.
Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch tries to add XDP_REDIRECT for virtio-net. The changes are
not complex as we could use exist XDP_TX helpers for most of the
work. The rest is passing the XDP_TX to NAPI handler for implementing
batching.
Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There's no need to add packet len average in the case of XDP_PASS
since it will be done soon after skb is created.
Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change is needed to not fool drop monitor.
(perf record ... -e skb:kfree_skb )
Packets were properly sent and are consumed after TX completion.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The show and store functions don't need/use the attribute.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kernel coding style is to put paren around operand of sizeof.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The array guest_offloads is local to the source and does not need to
be in global scope, so make it static. Also tweak formatting.
Cleans up sparse warnings:
symbol 'guest_offloads' was not declared. Should it be static?
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Two minor conflicts in virtio_net driver (bug fix overlapping addition
of a helper) and MAINTAINERS (new driver edit overlapping revamp of
PHY entry).
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Handle notifier registry failures properly in tun/tap driver, from
Tonghao Zhang.
2) Fix bpf verifier handling of subtraction bounds and add a testcase
for this, from Edward Cree.
3) Increase reset timeout in ftgmac100 driver, from Ben Herrenschmidt.
4) Fix use after free in prd_retire_rx_blk_timer_exired() in AF_PACKET,
from Cong Wang.
5) Fix SElinux regression due to recent UDP optimizations, from Paolo
Abeni.
6) We accidently increment IPSTATS_MIB_FRAGFAILS in the ipv6 code
paths, fix from Stefano Brivio.
7) Fix some mem leaks in dccp, from Xin Long.
8) Adjust MDIO_BUS kconfig deps to avoid build errors, from Arnd
Bergmann.
9) Mac address length check and buffer size fixes from Cong Wang.
10) Don't leak sockets in ipv6 udp early demux, from Paolo Abeni.
11) Fix return value when copy_from_user() fails in
bpf_prog_get_info_by_fd(), from Daniel Borkmann.
12) Handle PHY_HALTED properly in phy library state machine, from
Florian Fainelli.
13) Fix OOPS in fib_sync_down_dev(), from Ido Schimmel.
14) Fix truesize calculation in virtio_net which led to performance
regressions, from Michael S Tsirkin.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (76 commits)
samples/bpf: fix bpf tunnel cleanup
udp6: fix jumbogram reception
ppp: Fix a scheduling-while-atomic bug in del_chan
Revert "net: bcmgenet: Remove init parameter from bcmgenet_mii_config"
virtio_net: fix truesize for mergeable buffers
mv643xx_eth: fix of_irq_to_resource() error check
MAINTAINERS: Add more files to the PHY LIBRARY section
ipv4: fib: Fix NULL pointer deref during fib_sync_down_dev()
net: phy: Correctly process PHY_HALTED in phy_stop_machine()
sunhme: fix up GREG_STAT and GREG_IMASK register offsets
bpf: fix bpf_prog_get_info_by_fd to dump correct xlated_prog_len
tcp: avoid bogus gcc-7 array-bounds warning
net: tc35815: fix spelling mistake: "Intterrupt" -> "Interrupt"
bpf: don't indicate success when copy_from_user fails
udp6: fix socket leak on early demux
net: thunderx: Fix BGX transmit stall due to underflow
Revert "vhost: cache used event for better performance"
team: use a larger struct for mac address
net: check dev->addr_len for dev_set_mac_address()
phy: bcm-ns-usb3: fix MDIO_BUS dependency
...
Seth Forshee noticed a performance degradation with some workloads.
This turns out to be due to packet drops. Euan Kemp noticed that this
is because we drop all packets where length exceeds the truesize, but
for some packets we add in extra memory without updating the truesize.
This in turn was kept around unchanged from ab7db91705 ("virtio-net:
auto-tune mergeable rx buffer size for improved performance"). That
commit had an internal reason not to account for the extra space: not
enough bits to do it. No longer true so let's account for the allocated
length exactly.
Many thanks to Seth Forshee for the report and bisecting and Euan Kemp
for debugging the issue.
Fixes: 680557cf79 ("virtio_net: rework mergeable buffer handling")
Reported-by: Euan Kemp <euan.kemp@coreos.com>
Tested-by: Euan Kemp <euan.kemp@coreos.com>
Reported-by: Seth Forshee <seth.forshee@canonical.com>
Tested-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After removing the reset function, the freeze and restore functions
are now unused when CONFIG_PM_SLEEP is disabled:
drivers/net/virtio_net.c:1881:12: error: 'virtnet_restore_up' defined but not used [-Werror=unused-function]
static int virtnet_restore_up(struct virtio_device *vdev)
drivers/net/virtio_net.c:1859:13: error: 'virtnet_freeze_down' defined but not used [-Werror=unused-function]
static void virtnet_freeze_down(struct virtio_device *vdev)
A more robust way to do this is to remove the #ifdef around the callers
and instead mark them as __maybe_unused. The compiler will now just
silently drop the unused code.
Fixes: 4941d472bf ("virtio-net: do not reset during XDP set")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Unregister the driver before removing multi-instance hotplug
callbacks. This order avoids the warning issued from
__cpuhp_remove_state_cpuslocked when the number of remaining
instances isn't yet zero.
Fixes: 8017c27919 ("net/virtio-net: Convert to hotplug state machine")
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Current XDP implementation wants guest offloads feature to be disabled
on device. This is inconvenient and means guest can't benefit from
offloads if XDP is not used. This patch tries to address this
limitation by disabling the offloads on demand through control guest
offloads. Guest offloads will be disabled and enabled on demand on XDP
set.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently reset the device during XDP set, the main reason is
that we allocate more headroom with XDP (for header adjustment).
This works but causes network downtime for users.
Previous patches encoded the headroom in the buffer context,
this makes it possible to detect the case where a buffer
with headroom insufficient for XDP is added to the queue and
XDP is enabled afterwards.
Upon detection, we handle this case by copying the packet
(slow, but it's a temporary condition).
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use ctx API to store headroom for small buffers.
Following patches will retrieve this info and use it for XDP.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pack headroom into ctx - this way when we get a buffer we can figure out
the actual headroom that was allocated for the buffer. Will be helpful
to optimize switching between XDP and non-XDP modes which have different
headroom requirements.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes: commit d45b897b11 ("virtio_net: allow specifying context for rx")
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We don't hold any tx lock when trying to disable TX during reset, this
would lead a use after free since ndo_start_xmit() tries to access
the virtqueue which has already been freed. Fix this by using
netif_tx_disable() before freeing the vqs, this could make sure no tx
after vq freeing.
Reported-by: Jean-Philippe Menil <jpmenil@gmail.com>
Tested-by: Jean-Philippe Menil <jpmenil@gmail.com>
Fixes commit f600b69050 ("virtio_net: Add XDP support")
Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Robert McCabe <robert.mccabe@rockwellcollins.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support to virtio_net to report bpf_prog ID during XDP_QUERY_PROG.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jason Wang <jasowang@redhat.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
A common pattern with skb_put() is to just want to memcpy()
some data into the new space, introduce skb_put_data() for
this.
An spatch similar to the one for skb_put_zero() converts many
of the places using it:
@@
identifier p, p2;
expression len, skb, data;
type t, t2;
@@
(
-p = skb_put(skb, len);
+p = skb_put_data(skb, data, len);
|
-p = (t)skb_put(skb, len);
+p = skb_put_data(skb, data, len);
)
(
p2 = (t2)p;
-memcpy(p2, data, len);
|
-memcpy(p, data, len);
)
@@
type t, t2;
identifier p, p2;
expression skb, data;
@@
t *p;
...
(
-p = skb_put(skb, sizeof(t));
+p = skb_put_data(skb, data, sizeof(t));
|
-p = (t *)skb_put(skb, sizeof(t));
+p = skb_put_data(skb, data, sizeof(t));
)
(
p2 = (t2)p;
-memcpy(p2, data, sizeof(*p));
|
-memcpy(p, data, sizeof(*p));
)
@@
expression skb, len, data;
@@
-memcpy(skb_put(skb, len), data, len);
+skb_put_data(skb, data, len);
(again, manually post-processed to retain some comments)
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Reviewed-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit d85b758f72 ("virtio_net: fix support for small rings")
was supposed to increase the buffer size for small rings but had an
unintentional side effect of decreasing it for large rings. This seems
to break some setups - it's not yet clear why, but increasing buffer
size back to what it was before helps.
Fixes: d85b758f72 ("virtio_net: fix support for small rings")
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Reported-by: "J. Bruce Fields" <bfields@fieldses.org>
Tested-by: Mikulas Patocka <mpatocka@redhat.com>
Tested-by: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since virtio does not provide it's own ndo_features_check handler,
TSO, and now checksum offload, are disabled for stacked vlans.
Re-enable the support and let the host take care of it. This
restores/improves Guest-to-Guest performance over Q-in-Q vlans.
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A bunch of changes to virtio, most affecting virtio net.
ptr_ring batched zeroing - first of batching enhancements
that seems ready.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJZEceWAAoJECgfDbjSjVRpg8YIAIbB1UJZkrHh/fdCQjM2O53T
ESS62W91LBT+weYH/N79RrfnGWzDOHrCQ8Or1nAKQZx2vU1aroqYXeJ9o0liRBYr
hGZB1/bg8obA5XkKUfME2+XClakvuXABUJbky08iDL9nILlrvIVLoUgZ9ouL0vTj
oUv4n6jtguNFV7tk/injGNRparEVdcefX0dbPxXomx5tSeD2fOE96/Q4q2eD3f7r
NHD4DailEJC7ndJGa6b4M9g8shkXzgEnSw+OJHpcJcxCnAzkVT94vsU7OluiDvmG
bfdGZNb0ohDYZLbJDR8aiMkoad8bIVLyGlhqnMBiZQEOF4oiWM9UJLvp5Lq9g7A=
=Sb7L
-----END PGP SIGNATURE-----
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio updates from Michael Tsirkin:
"Fixes, cleanups, performance
A bunch of changes to virtio, most affecting virtio net. Also ptr_ring
batched zeroing - first of batching enhancements that seems ready."
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
s390/virtio: change maintainership
tools/virtio: fix spelling mistake: "wakeus" -> "wakeups"
virtio_net: tidy a couple debug statements
ptr_ring: support testing different batching sizes
ringtest: support test specific parameters
ptr_ring: batch ring zeroing
virtio: virtio_driver doc
virtio_net: don't reset twice on XDP on/off
virtio_net: fix support for small rings
virtio_net: reduce alignment for buffers
virtio_net: rework mergeable buffer handling
virtio_net: allow specifying context for rx
virtio: allow extra context per descriptor
tools/virtio: fix build breakage
virtio: add context flag to find vqs
virtio: wrap find_vqs
ringtest: fix an assert statement
We are printing a decimal value for truesize so we shouldn't use an "0x"
prefix.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
We already do a reset once in remove_vq_common -
there appears to be no point in doing another one
when we add/remove XDP.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
When ring size is small (<32 entries) making buffers smaller means a
full ring might not be able to hold enough buffers to fit a single large
packet.
Make sure a ring full of buffers is large enough to allow at least one
packet of max size.
Fixes: 2613af0ed1 ("virtio_net: migrate mergeable rx buffers to page frag allocators")
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
We don't need to align length to any particular
value anymore. Aligning to L1 cache size probably
sill makes sense to reduce false sharing.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Small follow-up to d74a32acd5 ("xdp: use netlink extended ACK reporting")
in order to let drivers all use the same NL_SET_ERR_MSG_MOD() helper macro
for reporting. This also ensures that we consistently add the driver's
prefix for dumping the report in user space to indicate that the error
message is driver specific and not coming from core code. Furthermore,
NL_SET_ERR_MSG_MOD() now reuses NL_SET_ERR_MSG() and thus makes all macros
check the pointer as suggested.
References: https://www.spinics.net/lists/netdev/msg433267.html
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
We are going to add more parameters to find_vqs, let's wrap the call so
we don't need to tweak all drivers every time.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Try to carry error messages to the user via the netlink extended
ack message attribute.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Avoid hashing the tx napi struct into napi_hash[], which is used for
busy polling receive queues.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As of tx napi, device down (`ip link set dev $dev down`) hangs unless
tx napi is enabled. Else napi_enable is not called, so napi_disable
will spin on test_and_set_bit NAPI_STATE_SCHED.
Only call napi_disable if tx napi is enabled.
Fixes: 5a719c2552ca ("virtio-net: transmit napi")
Reported-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tx napi mode increases the rate of transmit interrupts. Suppress some
by masking interrupts while more packets are expected. The interrupts
will be reenabled before the last packet is sent.
This optimization reduces the througput drop with tx napi for
unidirectional flows such as UDP_STREAM that do not benefit from
cleaning tx completions in the the receive napi handler.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Amortize the cost of virtual interrupts by doing both rx and tx work
on reception of a receive interrupt if tx napi is enabled. With
VIRTIO_F_EVENT_IDX, this suppresses most explicit tx completion
interrupts for bidirectional workloads.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
An upcoming patch will call free_old_xmit_skbs indirectly from
virtnet_poll. Move the function above this to avoid having to
introduce a forward declaration.
This is a pure move: no code changes.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert virtio-net to a standard napi tx completion path. This enables
better TCP pacing using TCP small queues and increases single stream
throughput.
The virtio-net driver currently cleans tx descriptors on transmission
of new packets in ndo_start_xmit. Latency depends on new traffic, so
is unbounded. To avoid deadlock when a socket reaches its snd limit,
packets are orphaned on tranmission. This breaks socket backpressure,
including TSQ.
Napi increases the number of interrupts generated compared to the
current model, which keeps interrupts disabled as long as the ring
has enough free descriptors. Keep tx napi optional and disabled for
now. Follow-on patches will reduce the interrupt cost.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Prepare virtio-net for tx napi by converting existing napi code to
use helper functions. This also deduplicates some logic.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts were simply overlapping changes. In the net/ipv4/route.c
case the code had simply moved around a little bit and the same fix
was made in both 'net' and 'net-next'.
In the net/sched/sch_generic.c case a fix in 'net' happened at
the same time that a new argument was added to qdisc_hash_add().
Signed-off-by: David S. Miller <davem@davemloft.net>
virtio attempts to clear the MTU feature bit if the value is out of the
supported range, but this has no real effect since FEATURES_OK has
already been set.
Fix this up by checking the MTU in the new validate callback.
Fixes: 14de9d114a ("virtio-net: Add initial MTU advice feature")
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
If one enables e.g. jumbo frames without mergeable
buffers, packets won't fit in 1500 byte buffers
we use. Switch to big packet mode instead.
TODO: make sizing more exact, possibly extend small
packet mode to use larger pages.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Fix double-free in batman-adv, from Sven Eckelmann.
2) Fix packet stats for fast-RX path, from Joannes Berg.
3) Netfilter's ip_route_me_harder() doesn't handle request sockets
properly, fix from Florian Westphal.
4) Fix sendmsg deadlock in rxrpc, from David Howells.
5) Add missing RCU locking to transport hashtable scan, from Xin Long.
6) Fix potential packet loss in mlxsw driver, from Ido Schimmel.
7) Fix race in NAPI handling between poll handlers and busy polling,
from Eric Dumazet.
8) TX path in vxlan and geneve need proper RCU locking, from Jakub
Kicinski.
9) SYN processing in DCCP and TCP need to disable BH, from Eric
Dumazet.
10) Properly handle net_enable_timestamp() being invoked from IRQ
context, also from Eric Dumazet.
11) Fix crash on device-tree systems in xgene driver, from Alban Bedel.
12) Do not call sk_free() on a locked socket, from Arnaldo Carvalho de
Melo.
13) Fix use-after-free in netvsc driver, from Dexuan Cui.
14) Fix max MTU setting in bonding driver, from WANG Cong.
15) xen-netback hash table can be allocated from softirq context, so use
GFP_ATOMIC. From Anoob Soman.
16) Fix MAC address change bug in bgmac driver, from Hari Vyas.
17) strparser needs to destroy strp_wq on module exit, from WANG Cong.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (69 commits)
strparser: destroy workqueue on module exit
sfc: fix IPID endianness in TSOv2
sfc: avoid max() in array size
rds: remove unnecessary returned value check
rxrpc: Fix potential NULL-pointer exception
nfp: correct DMA direction in XDP DMA sync
nfp: don't tell FW about the reserved buffer space
net: ethernet: bgmac: mac address change bug
net: ethernet: bgmac: init sequence bug
xen-netback: don't vfree() queues under spinlock
xen-netback: keep a local pointer for vif in backend_disconnect()
netfilter: nf_tables: don't call nfnetlink_set_err() if nfnetlink_send() fails
netfilter: nft_set_rbtree: incorrect assumption on lower interval lookups
netfilter: nf_conntrack_sip: fix wrong memory initialisation
can: flexcan: fix typo in comment
can: usb_8dev: Fix memory leak of priv->cmd_msg_buffer
can: gs_usb: fix coding style
can: gs_usb: Don't use stack memory for USB transfers
ixgbe: Limit use of 2K buffers on architectures with 256B or larger cache lines
ixgbe: update the rss key on h/w, when ethtool ask for it
...
Looks like a quiet cycle for vhost/virtio, just a couple of minor
tweaks. Most notable is automatic interrupt affinity for blk and scsi.
Hopefully other devices are not far behind.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJYt1rRAAoJECgfDbjSjVRpEZsIALSHevdXWtRHBZUb0ZkqPLQb
/x2Vn49CcALS1p7iSuP9L027MPeaLKyr0NBT9hptBChp/4b9lnZWyyAo6vYQrzfx
Ia/hLBYsK4ml6lEwbyfLwqkF2cmYCrZhBSVAILifn84lTPoN7CT0PlYDfA+OCaNR
geo75qF8KR+AUO0aqchwMRL3RV3OxZKxQr2AR6LttCuhiBgnV3Xqxffg/M3x6ONM
0ffFFdodm6slem3hIEiGUMwKj4NKQhcOleV+y0fVBzWfLQG9210pZbQyRBRikIL0
7IsaarpaUr7OrLAZFMGF6nJnyRAaRrt6WknTHZkyvyggrePrGcmGgPm4jrODwY4=
=2zwv
-----END PGP SIGNATURE-----
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull vhost updates from Michael Tsirkin:
"virtio, vhost: optimizations, fixes
Looks like a quiet cycle for vhost/virtio, just a couple of minor
tweaks. Most notable is automatic interrupt affinity for blk and scsi.
Hopefully other devices are not far behind"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
virtio-console: avoid DMA from stack
vhost: introduce O(1) vq metadata cache
virtio_scsi: use virtio IRQ affinity
virtio_blk: use virtio IRQ affinity
blk-mq: provide a default queue mapping for virtio device
virtio: provide a method to get the IRQ affinity mask for a virtqueue
virtio: allow drivers to request IRQ affinity when creating VQs
virtio_pci: simplify MSI-X setup
virtio_pci: don't duplicate the msix_enable flag in struct pci_dev
virtio_pci: use shared interrupts for virtqueues
virtio_pci: remove struct virtio_pci_vq_info
vhost: try avoiding avail index access when getting descriptor
virtio_mmio: expose header to userspace
Declaring the factor is counter-intuitive, and people are prone
to using small(-ish) values even when that makes no sense.
Change the DECLARE_EWMA() macro to take the fractional precision,
in bits, rather than a factor, and update all users.
While at it, add some more documentation.
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add a struct irq_affinity pointer to the find_vqs methods, which if set
is used to tell the PCI layer to create the MSI-X vectors for our I/O
virtqueues with the proper affinity from the start. Compared to after
the fact affinity hints this gives us an instantly working setup and
allows to allocate the irq descritors node-local and avoid interconnect
traffic. Last but not least this will allow blk-mq queues are created
based on the interrupt affinity for storage drivers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
This patch switch to use build_skb() for small buffer which can have
better performance for both TCP and XDP (since we can work at page
before skb creation). It also remove lots of XDP codes since both
mergeable and small buffer use page frag during refill now.
Before | After
XDP_DROP(xdp1) 64B : 11.1Mpps | 14.4Mpps
Tested with xdp1/xdp2/xdp_ip_tx_tunnel and netperf.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We already have counters for sent/recv packets and sent/recv bytes.
Doing a batched update to reduce the number of
u64_stats_update_begin/end().
Take care not to bother with stats update when called
speculatively.
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for XDP adjust head by allocating a 256B header region
that XDP programs can grow into. This is only enabled when a XDP
program is loaded.
In order to ensure that we do not have to unwind queue headroom push
queue setup below bpf_prog_add. It reads better to do a prog ref
unwind vs another queue setup call.
At the moment this code must do a full reset to ensure old buffers
without headroom on program add or with headroom on program removal
are not used incorrectly in the datapath. Ideally we would only
have to disable/enable the RX queues being updated but there is no
API to do this at the moment in virtio so use the big hammer. In
practice it is likely not that big of a problem as this will only
happen when XDP is enabled/disabled changing programs does not
require the reset. There is some risk that the driver may either
have an allocation failure or for some reason fail to correctly
negotiate with the underlying backend in this case the driver will
be left uninitialized. I have not seen this ever happen on my test
systems and for what its worth this same failure case can occur
from probe and other contexts in virtio framework.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For XDP we will need to reset the queues to allow for buffer headroom
to be configured. In order to do this we need to essentially run the
freeze()/restore() code path. Unfortunately the locking requirements
between the freeze/restore and reset paths are different however so
we can not simply reuse the code.
This patch refactors the code path and adds a reset helper routine.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Factor out qp assignment.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
At this point the do_xdp_prog is mostly if/else branches handling
the different modes of virtio_net. So remove it and handle running
the program in the per mode handlers.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For XDP use case and to allow ethtool reset tests it is useful to be
able to use reset paths from contexts where rtnl lock is already
held.
This requries updating virtnet_set_queues and free_receive_bufs the
two places where rtnl_lock is taken in virtio_net. To do this we
use the following pattern,
_foo(...) { do stuff }
foo(...) { rtnl_lock(); _foo(...); rtnl_unlock()};
this allows us to use freeze()/restore() flow from both contexts.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit 364b605573 ("net: busy-poll: return busypolling status to
drivers"), napi_complete_done() returns a boolean that can be used
by drivers to conditionally rearm interrupts.
This patch changes virtio_net to use this boolean to avoid a bit of
overhead for busy-poll users.
Jason reports about 1.1% improvement for 1 byte TCP_RR (burst 100).
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Generic NAPI busy polling allows us to remove custom implementations
found in drivers.
It is possible further optimization could be done by testing
napi_complete_done() return value.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 17bedab272 ("bpf: xdp: Allow head adjustment in XDP prog")
added a new XDP helper to prepend and remove data from a frame.
Make virtio_net reject programs making use of this helper until
proper support is added.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the small buffer case during driver unload we currently use
put_page instead of dev_kfree_skb. Resolve this by adding a check
for virtnet mode when checking XDP queue type. Also name the
function so that the code reads correctly to match the additional
check.
Fixes: bb91accf27 ("virtio-net: XDP support for small buffers")
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This work adds a number of tracepoints to paths that are either
considered slow-path or exception-like states, where monitoring or
inspecting them would be desirable.
For bpf(2) syscall, tracepoints have been placed for main commands
when they succeed. In XDP case, tracepoint is for exceptions, that
is, f.e. on abnormal BPF program exit such as unknown or XDP_ABORTED
return code, or when error occurs during XDP_TX action and the packet
could not be forwarded.
Both have been split into separate event headers, and can be further
extended. Worst case, if they unexpectedly should get into our way in
future, they can also removed [1]. Of course, these tracepoints (like
any other) can be analyzed by eBPF itself, etc. Example output:
# ./perf record -a -e bpf:* sleep 10
# ./perf script
sock_example 6197 [005] 283.980322: bpf:bpf_map_create: map type=ARRAY ufd=4 key=4 val=8 max=256 flags=0
sock_example 6197 [005] 283.980721: bpf:bpf_prog_load: prog=a5ea8fa30ea6849c type=SOCKET_FILTER ufd=5
sock_example 6197 [005] 283.988423: bpf:bpf_prog_get_type: prog=a5ea8fa30ea6849c type=SOCKET_FILTER
sock_example 6197 [005] 283.988443: bpf:bpf_map_lookup_elem: map type=ARRAY ufd=4 key=[06 00 00 00] val=[00 00 00 00 00 00 00 00]
[...]
sock_example 6197 [005] 288.990868: bpf:bpf_map_lookup_elem: map type=ARRAY ufd=4 key=[01 00 00 00] val=[14 00 00 00 00 00 00 00]
swapper 0 [005] 289.338243: bpf:bpf_prog_put_rcu: prog=a5ea8fa30ea6849c type=SOCKET_FILTER
[1] https://lwn.net/Articles/705270/
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
I don't have any guests with PAGE_SIZE > 64k but the
code seems to be clearly broken in that case
as PAGE_SIZE / MERGEABLE_BUFFER_ALIGN will need
more than 8 bit and so the code in mergeable_ctx_to_buf_address
does not give us the actual true size.
Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 501db51139 ("virtio: don't set VIRTIO_NET_HDR_F_DATA_VALID on
xmit") in fact disables VIRTIO_HDR_F_DATA_VALID on receiving path too,
fixing this by adding a hint (has_data_valid) and set it only on the
receiving path.
Cc: Rolf Neugebauer <rolf.neugebauer@docker.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Rolf Neugebauer <rolf.neugebauer@docker.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The network device operation for reading statistics is only called
in one place, and it ignores the return value. Having a structure
return value is potentially confusing because some future driver could
incorrectly assume that the return value was used.
Fix all drivers with ndo_get_stats64 to have a void function.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
when some other buffer is immediately copied into allocated region.
Replace calls to kmalloc followed by a memcpy with a direct
call to kmemdup.
Signed-off-by: Shyam Saini <mayhs11saini@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the state names got added a script was used to add the extra argument
to the calls. The script basically converted the state constant to a
string, but the cleanup to convert these strings into meaningful ones did
not happen.
Replace all the useless strings with 'subsys/xxx/yyy:state' strings which
are used in all the other places already.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Link: http://lkml.kernel.org/r/20161221192112.085444152@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Commit f600b69050 ("virtio_net: Add XDP support") leaves the case of
small receive buffer untouched. This will confuse the user who want to
set XDP but use small buffers. Other than forbid XDP in small buffer
mode, let's make it work. XDP then can only work at skb->data since
virtio-net create skbs during refill, this is sub optimal which could
be optimized in the future.
Cc: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now we in fact don't allow XDP for big packets, remove its codes.
Cc: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When VIRTIO_NET_F_GUEST_UFO is negotiated, host could still send UFO
packet that exceeds a single page which could not be handled
correctly by XDP. So this patch forbids setting XDP when GUEST_UFO is
supported. While at it, forbid XDP for ECN (which comes only from GRO)
too to prevent user from misconfiguration.
Cc: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We don't update ewma rx buf size in the case of XDP. This will lead
underestimation of rx buf size which causes host to produce more than
one buffers. This will greatly increase the possibility of XDP page
linearization.
Cc: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We drop csumed packet when do XDP for packets. This breaks
XDP_PASS when GUEST_CSUM is supported. Fix this by allowing csum flag
to be set. With this patch, simple TCP works for XDP_PASS.
Cc: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When XDP_PASS were determined for linearized packets, we try to get
new buffers in the virtqueue and build skbs from them. This is wrong,
we should create skbs based on existed buffers instead. Fixing them by
creating skb based on xdp_page.
With this patch "ping 192.168.100.4 -s 3900 -M do" works for XDP_PASS.
Cc: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We don't put page during linearizing, the would cause leaking when
xmit through XDP_TX or the packet exceeds PAGE_SIZE. Fix them by
put page accordingly. Also decrease the number of buffers during
linearizing to make sure caller can free buffers correctly when packet
exceeds PAGE_SIZE. With this patch, we won't get OOM after linearize
huge number of packets.
Cc: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After we linearize page, we should xmit this page instead of the page
of first buffer which may lead unexpected result. With this patch, we
can see correct packet during XDP_TX.
Cc: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since we use EWMA to estimate the size of rx buffer. When rx buffer
size is underestimated, it's usual to have a packet with more than one
buffers. Consider this is not a bug, remove the warning and correct
the comment before XDP linearizing.
Cc: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
virtio_net XDP support expects receive buffers to be contiguous.
If this is not the case we enable a slowpath to allow connectivity
to continue but at a significan performance overhead associated with
linearizing data. To make it painfully aware to users that XDP is
running in a degraded mode we throw an xdp buffer error.
To linearize packets we allocate a page and copy the segments of
the data, including the header, into it. After this the page can be
handled by XDP code flow as normal.
Then depending on the return code the page is either freed or sent
to the XDP xmit path. There is no attempt to optimize this path.
This case is being handled simple as a precaution in case some
unknown backend were to generate packets in this form. To test this
I had to hack qemu and force it to generate these packets. I do not
expect this case to be generated by "real" backends.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds support for the XDP_TX action to virtio_net. When an XDP
program is run and returns the XDP_TX action the virtio_net XDP
implementation will transmit the packet on a TX queue that aligns
with the current CPU that the XDP packet was processed on.
Before sending the packet the header is zeroed. Also XDP is expected
to handle checksum correctly so no checksum offload support is
provided.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
XDP requires using isolated transmit queues to avoid interference
with normal networking stack (BQL, NETDEV_TX_BUSY, etc). This patch
adds a XDP queue per cpu when a XDP program is loaded and does not
expose the queues to the OS via the normal API call to
netif_set_real_num_tx_queues(). This way the stack will never push
an skb to these queues.
However virtio/vhost/qemu implementation only allows for creating
TX/RX queue pairs at this time so creating only TX queues was not
possible. And because the associated RX queues are being created I
went ahead and exposed these to the stack and let the backend use
them. This creates more RX queues visible to the network stack than
TX queues which is worth mentioning but does not cause any issues as
far as I can tell.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds XDP support to virtio_net. Some requirements must be
met for XDP to be enabled depending on the mode. First it will
only be supported with LRO disabled so that data is not pushed
across multiple buffers. Second the MTU must be less than a page
size to avoid having to handle XDP across multiple pages.
If mergeable receive is enabled this patch only supports the case
where header and data are in the same buf which we can check when
a packet is received by looking at num_buf. If the num_buf is
greater than 1 and a XDP program is loaded the packet is dropped
and a warning is thrown. When any_header_sg is set this does not
happen and both header and data is put in a single buffer as expected
so we check this when XDP programs are loaded. Subsequent patches
will process the packet in a degraded mode to ensure connectivity
and correctness is not lost even if backend pushes packets into
multiple buffers.
If big packets mode is enabled and MTU/LRO conditions above are
met then XDP is allowed.
This patch was tested with qemu with vhost=on and vhost=off where
mergeable and big_packet modes were forced via hard coding feature
negotiation. Multiple buffers per packet was forced via a small
test patch to vhost.c in the vhost=on qemu mode.
Suggested-by: Shrijeet Mukherjee <shrijeet@gmail.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 4490001029 ("virtio-net: enable
multiqueue by default") blindly set the affinity instead of queues
during probe which can cause a mismatch of #queues between guest and
host. This patch fixes it by setting queues.
Reported-by: Theodore Ts'o <tytso@mit.edu>
Tested-by: Theodore Ts'o <tytso@mit.edu>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Fixes: 49000102901 ("virtio-net: enable multiqueue by default")
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With CONFIG_VMAP_STACK=y, virtnet_set_mac_address() can be passed a
pointer to the stack and it will OOPS. Copy the address to the heap
to prevent the crash.
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Laura Abbott <labbott@redhat.com>
Reported-by: zbyszek@in.waw.pl
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We use single queue even if multiqueue is enabled and let admin to
enable it through ethtool later. This is used to avoid possible
regression (small packet TCP stream transmission). But looks like an
overkill since:
- single queue user can disable multiqueue when launching qemu
- brings extra troubles for the management since it needs extra admin
tool in guest to enable multiqueue
- multiqueue performs much better than single queue in most of the
cases
So this patch enables multiqueue by default: if #queues is less than or
equal to #vcpu, enable as much as queue pairs; if #queues is greater
than #vcpu, enable #vcpu queue pairs.
Cc: Hannes Frederic Sowa <hannes@redhat.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Neil Horman <nhorman@redhat.com>
Cc: Jeremy Eder <jeder@redhat.com>
Cc: Marko Myllynen <myllynen@redhat.com>
Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
All conflicts were simple overlapping changes except perhaps
for the Thunder driver.
That driver has a change_mtu method explicitly for sending
a message to the hardware. If that fails it returns an
error.
Normally a driver doesn't need an ndo_change_mtu method becuase those
are usually just range changes, which are now handled generically.
But since this extra operation is needed in the Thunder driver, it has
to stay.
However, if the message send fails we have to restore the original
MTU before the change because the entire call chain expects that if
an error is thrown by ndo_change_mtu then the MTU did not change.
Therefore code is added to nicvf_change_mtu to remember the original
MTU, and to restore it upon nicvf_update_hw_max_frs() failue.
Signed-off-by: David S. Miller <davem@davemloft.net>
It seems many drivers do not respect napi_hash_del() contract.
When napi_hash_del() is used before netif_napi_del(), an RCU grace
period is needed before freeing NAPI object.
Fixes: 91815639d8 ("virtio-net: rx busy polling support")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Virtio 1.0 spec says VIRTIO_F_ANY_LAYOUT and VIRTIO_NET_F_GSO are
legacy-only feature bits. Do not negotiate them in virtio 1 mode. Note
this is a spec violation so we need to backport it to stable/downstream
kernels.
Cc: stable@vger.kernel.org
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The virtio committee recently ratified a change, VIRTIO-152, which
defines the mtu field to be 'max' MTU, not simply desired MTU.
This commit brings the virtio-net device in compliance with VIRTIO-152.
Additionally, drop the max_mtu branch - it cannot be taken since the u16
returned by virtio_cread16 will never exceed the initial value of
max_mtu.
Signed-off-by: Aaron Conole <aconole@redhat.com>
Acked-by: "Michael S. Tsirkin" <mst@redhat.com>
Acked-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
hyperv_net:
- set min/max_mtu, per Haiyang, after rndis_filter_device_add
virtio_net:
- set min/max_mtu
- remove virtnet_change_mtu
vmxnet3:
- set min/max_mtu
xen-netback:
- min_mtu = 0, max_mtu = 65517
xen-netfront:
- min_mtu = 0, max_mtu = 65535
unisys/visor:
- clean up defines a little to not clash with network core or add
redundat definitions
CC: netdev@vger.kernel.org
CC: virtualization@lists.linux-foundation.org
CC: "K. Y. Srinivasan" <kys@microsoft.com>
CC: Haiyang Zhang <haiyangz@microsoft.com>
CC: "Michael S. Tsirkin" <mst@redhat.com>
CC: Shrikrishna Khare <skhare@vmware.com>
CC: "VMware, Inc." <pv-drivers@vmware.com>
CC: Wei Liu <wei.liu2@citrix.com>
CC: Paul Durrant <paul.durrant@citrix.com>
CC: David Kershner <david.kershner@unisys.com>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Install the callbacks via the state machine.
The driver supports multiple instances and therefore the new
cpuhp_state_add_instance_nocalls() infrastrucure is used. The driver
currently uses get_online_cpus() to avoid missing a CPU hotplug event while
invoking virtnet_set_affinity(). This could be avoided by using
cpuhp_state_add_instance() variant which holds the hotplug lock and invokes
callback during registration. This is more or less a 1:1 conversion of the
current code.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: netdev@vger.kernel.org
Cc: Will Deacon <will.deacon@arm.com>
Cc: virtualization@lists.linux-foundation.org
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/1471024183-12666-7-git-send-email-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
VLAN and MQ control was doing DMA from the stack. Fix it.
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: "netdev@vger.kernel.org" <netdev@vger.kernel.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit e858fae2b0 ("virtio_net: use common code for virtio_net_hdr
and skb GSO conversion") replaced the tun code for header manipulation
with the generic helpers. While doing so, it implictly moved the
skb_partial_csum_set() invocation after eth_type_trans(), which
invalidate the current gso start/offset values.
Fix it by moving the helper invocation before the mac pulling.
Fixes: e858fae2b0 ("virtio_net: use common code for virtio_net_hdr and
skb GSO conversion")
Reported-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace open coded conversion between virtio_net_hdr to skb GSO info with
virtio_net_hdr_{from,to}_skb
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit adds the feature bit and associated mtu device entry for the
virtio network device. When a virtio device comes up, it checks the
feature bit for the VIRTIO_NET_F_MTU feature. If such feature bit is
enabled, the driver will read the advised MTU and use it as the initial
value.
Signed-off-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In function virtnet_open() and virtnet_probe(), func try_fill_recv() may
be executed at the same time. VQ in virtqueue_add() has not been protected
well and BUG_ON will be triggered when virito_net.ko being removed.
Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds basic polling support for vhost.
Reworks virtio to optionally use DMA API, fixing it on Xen.
Balloon stats gained a new entry.
Using the new napi_alloc_skb speeds up virtio net.
virtio blk stats can now be read while another VCPU
us busy inflating or deflating the balloon.
Plus misc cleanups in various places.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJW7qRJAAoJECgfDbjSjVRpVNoH/A7z+lZ6nooSJ9fUBtAAlwit
mE1VKi8g0G6naV1NVLFVe7hPAejExGiHfR3ZUrVoenJKj2yeW/DFojFC10YR/KTe
ac7Imuc+owA3UOE/QpeGBs59+EEWKTZUYt6r8HSJVwoodeosw9v2ecP/Iwhbax8H
a4V3HqOADjKnHg73R9o3u+bAgA1GrGYHeK0AfhCBSTNwlPdxkvf0463HgfOpM4nl
/sNoFWO3vOyekk+loIk+jpmWVIoIfG2NFzW4lPwEPkfqUBX7r0ei/NR23hIqHL7r
QZ6vMj1Ew9qctUONbJu4kXjuV2Vk9NhxwbDjoJtm8plKL2hz2prJynUEogkHh2g=
=VMD0
-----END PGP SIGNATURE-----
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio/vhost updates from Michael Tsirkin:
"New features, performance improvements, cleanups:
- basic polling support for vhost
- rework virtio to optionally use DMA API, fixing it on Xen
- balloon stats gained a new entry
- using the new napi_alloc_skb speeds up virtio net
- virtio blk stats can now be read while another VCPU is busy
inflating or deflating the balloon
plus misc cleanups in various places"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
virtio_net: replace netdev_alloc_skb_ip_align() with napi_alloc_skb()
vhost_net: basic polling support
vhost: introduce vhost_vq_avail_empty()
vhost: introduce vhost_has_work()
virtio_balloon: Allow to resize and update the balloon stats in parallel
virtio_balloon: Use a workqueue instead of "vballoon" kthread
virtio/s390: size of SET_IND payload
virtio/s390: use dev_to_virtio
vhost: rename vhost_init_used()
vhost: rename cross-endian helpers
virtio_blk: VIRTIO_BLK_F_WCE->VIRTIO_BLK_F_FLUSH
vring: Use the DMA API on Xen
virtio_pci: Use the DMA API if enabled
virtio_mmio: Use the DMA API if enabled
virtio: Add improved queue allocation API
virtio_ring: Support DMA APIs
vring: Introduce vring_use_dma_api()
s390/dma: Allow per device dma ops
alpha/dma: use common noop dma ops
dma: Provide simple noop dma ops
This gives small but noticeable rx performance improvement (2-3%)
and will allow exploiting future napi improvement.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
We should validate the port setting that we got from the user and check
if it's what we've set it to (PORT_OTHER), also add explanation that
ignoring advertising is good as long as we don't have autonegotiation.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch allows the user to set and retrieve speed and duplex of the
virtio_net device via ethtool. Having this functionality is very helpful
for simulating different environments and also enables the virtio_net
device to participate in operations where proper speed and duplex are
required (e.g. currently bonding lacp mode requires full duplex). Custom
speed and duplex are not allowed, the user-supplied settings are validated
before applying.
Example:
$ ethtool eth1
Settings for eth1:
...
Speed: Unknown!
Duplex: Unknown! (255)
$ ethtool -s eth1 speed 1000 duplex full
$ ethtool eth1
Settings for eth1:
...
Speed: 1000Mb/s
Duplex: Full
Based on a patch by Roopa Prabhu.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/geneve.c
Here we had an overlapping change, where in 'net' the extraneous stats
bump was being removed whilst in 'net-next' the final argument to
udp_tunnel6_xmit_skb() was being changed.
Signed-off-by: David S. Miller <davem@davemloft.net>
Once virtio starts using the DMA API, we won't be able to safely DMA
from the stack. virtio-net does a couple of config DMA requests
from small stack buffers -- switch to using dynamically-allocated
memory.
This should have no effect on any performance-critical code paths.
Reported-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Andy Lutomirski <luto@kernel.org>
NAPI drivers no longer need to observe a particular protocol
to benefit from busy polling (CONFIG_NET_RX_BUSY_POLL=y)
napi_hash_add() and napi_hash_del() are automatically called
from core networking stack, respectively from
netif_napi_add() and netif_napi_del()
This patch depends on free_netdev() and netif_napi_del() being
called from process context, which seems to be the norm.
Drivers might still prefer to call napi_hash_del() on their
own, since they might combine all the rcu grace periods into
a single one, knowing their NAPI structures lifetime, while
core networking stack has no idea of a possible combining.
Once this patch proves to not bring serious regressions,
we will cleanup drivers to either remove napi_hash_del()
or provide appropriate rcu grace periods combining.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We would like to automatically provide busy polling support
to all NAPI drivers, without them having to implement anything.
skb_mark_napi_id() can be called from napi_gro_receive() and
napi_get_frags().
Few drivers are still calling skb_mark_napi_id() because
they use netif_receive_skb(). They should eventually call
napi_gro_receive() instead. I will leave this to drivers
maintainers.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Usually an skb does not have up to MAX_SKB_FRAGS frags. So no need to
initialize the unuse part of sg. This patch initialize the sg based on
the real number it will used:
- during xmit, it could be inferred from nr_frags and can_push.
- for small receive buffer, it will also be 2.
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of using the out-of-line EWMA calculation, use DECLARE_EWMA()
to create static inlines. On x86/64 this results in no change in code
size for me, but reduces the struct receive_queue size by the two
unsigned long values that store the parameters.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/cavium/Kconfig
The cavium conflict was overlapping dependency
changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
virtio declares support for NETIF_F_FRAGLIST, but assumes
that there are at most MAX_SKB_FRAGS + 2 fragments which isn't
always true with a fraglist.
A longer fraglist in the skb will make the call to skb_to_sgvec overflow
the sg array, leading to memory corruption.
Drop NETIF_F_FRAGLIST so we only get what we can handle.
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Straightforward patch to add GRO processing to virtio_net.
napi_complete_done() usage allows more aggressive aggregation,
opted-in by setting /sys/class/net/xxx/gro_flush_timeout
Tested:
Setting /sys/class/net/xxx/gro_flush_timeout to 1000 nsec,
Rick Jones reported following results.
One VM of each on a pair of OpenStack compute nodes with E5-2650Lv3 CPUs
and Intel 82599ES-based NICs. So, two "before" and two "after" VMs.
The OpenStack compute nodes were running OpenStack Kilo, with VxLAN
encapsulation being used through OVS so no GRO coming-up the host
stack. The compute nodes themselves were running a 3.14-based kernel.
Single-stream netperf, CPU utilizations and thus service demands are
based on intra-guest reported CPU.
Throughput Mbit/s, bigger is better
Min Median Average Max
4.2.0-rc3+ 1364 1686 1678 1938
4.2.0-rc3+flush1k 1824 2269 2275 2647
Send Service Demand, smaller is better
Min Median Average Max
4.2.0-rc3+ 0.236 0.558 0.524 0.802
4.2.0-rc3+flush1k 0.176 0.503 0.471 0.738
Receive Service Demand, smaller is better.
Min Median Average Max
4.2.0-rc3+ 1.906 2.188 2.191 2.531
4.2.0-rc3+flush1k 0.448 0.529 0.533 0.692
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Rick Jones <rick.jones2@hp.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ANY_LAYOUT is a compatibility feature. It's implied
for VERSION_1 devices, and non-transitional devices
might not offer it. Change code to behave accordingly.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit d631b94e7a
virtio: change comment in transmit
started clarifying the logic behind queue state management,
but introduced an inaccuracy: TX_BUSY does not cause
a BUG message.
Clean this up some more, explaining the tradeoffs in detail.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
received is 0, no need to minus it and use "+=" to reassign it
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The original comment was not really informative or funny
as well as sexist. Replace it with a better explanation of
why the driver does stop and what the impacts are.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We don't delete napi from hash list during module exit. This will
cause the following panic when doing module load and unload:
BUG: unable to handle kernel paging request at 0000004e00000075
IP: [<ffffffff816bd01b>] napi_hash_add+0x6b/0xf0
PGD 3c5d5067 PUD 0
Oops: 0000 [#1] SMP
...
Call Trace:
[<ffffffffa0a5bfb7>] init_vqs+0x107/0x490 [virtio_net]
[<ffffffffa0a5c9f2>] virtnet_probe+0x562/0x791815639d880be [virtio_net]
[<ffffffff8139e667>] virtio_dev_probe+0x137/0x200
[<ffffffff814c7f2a>] driver_probe_device+0x7a/0x250
[<ffffffff814c81d3>] __driver_attach+0x93/0xa0
[<ffffffff814c8140>] ? __device_attach+0x40/0x40
[<ffffffff814c6053>] bus_for_each_dev+0x63/0xa0
[<ffffffff814c7a79>] driver_attach+0x19/0x20
[<ffffffff814c76f0>] bus_add_driver+0x170/0x220
[<ffffffffa0a60000>] ? 0xffffffffa0a60000
[<ffffffff814c894f>] driver_register+0x5f/0xf0
[<ffffffff8139e41b>] register_virtio_driver+0x1b/0x30
[<ffffffffa0a60010>] virtio_net_driver_init+0x10/0x12 [virtio_net]
This patch fixes this by doing this in virtnet_free_queues(). And also
don't delete napi in virtnet_freeze() since it will call
virtnet_free_queues() which has already did this.
Fixes 91815639d8 ("virtio-net: rx busy polling support")
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On top of tht is the major rework of lguest, to use PCI and virtio 1.0, to
double-check the implementation.
Then comes the inevitable fixes and cleanups from that work.
Thanks,
Rusty.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJU5B9cAAoJENkgDmzRrbjxPacP/jajliXX353JJ/g/hkZ6oDN5
o7FhELBKiUMr7enVZYwj2BBYk5OM36nB9pQkiqHMSbjJGoS5IK70enxb4YRxSHBn
YCLblZMNqutGS0kclZ9DDysztjAhxH7CvLM6pMZ7eHP0f3+FM/QhbxHfbG9DTBUH
2U/nybvd3M/+YBe7ptwQdrH8aOCAD6RTIsXellfm99dNMK6K/5lqnWQ98WSXmNXq
vyvdaAQsqqUkmxtajjcBumaCH4/SehOJJjUqojCMsR3aBkgOBWDZJURMek+KA5Dt
X996fBsTAlvTtCUKRrmLTb2ScDH7fu+jwbWRqMYDk8zpEr3XqiLTTPV4/TiHGmi7
Wiw3g1wIY1YbETlZyongB5MIoVyUfmDAd+bT8nBsj3KIITD84gOUQFDMl6d63c0I
z6A9Pu/UzpJGsXZT3WoFLi6TO67QyhOseqZnhS4wBgLabjxffNM7yov9RVKUVH/n
JHunnpUk2iTtSgscBarOBz5867dstuurnaUIspZthVBo6y6N0z+GrU+agJ8Y4DXx
mvwzeYLhQH2208PjxPFiah/kA/gHNm1m678TbpS+CUsgmpQiJ4gTwtazDSi4TwZY
Hs9T9GulkzpZIzEyKL3qG2TsfyDhW5Avn+GvKInAT9+Fkig4BnP3DUONBxcwGZ78
eI3FDUWsE36NqE5ECWmz
=ivCe
-----END PGP SIGNATURE-----
Merge tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull virtio updates from Rusty Russell:
"OK, this has the big virtio 1.0 implementation, as specified by OASIS.
On top of tht is the major rework of lguest, to use PCI and virtio
1.0, to double-check the implementation.
Then comes the inevitable fixes and cleanups from that work"
* tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (80 commits)
virtio: don't set VIRTIO_CONFIG_S_DRIVER_OK twice.
virtio_net: unconditionally define struct virtio_net_hdr_v1.
tools/lguest: don't use legacy definitions for net device in example launcher.
virtio: Don't expose legacy net features when VIRTIO_NET_NO_LEGACY defined.
tools/lguest: use common error macros in the example launcher.
tools/lguest: give virtqueues names for better error messages
tools/lguest: more documentation and checking of virtio 1.0 compliance.
lguest: don't look in console features to find emerg_wr.
tools/lguest: don't start devices until DRIVER_OK status set.
tools/lguest: handle indirect partway through chain.
tools/lguest: insert driver references from the 1.0 spec (4.1 Virtio Over PCI)
tools/lguest: insert device references from the 1.0 spec (4.1 Virtio Over PCI)
tools/lguest: rename virtio_pci_cfg_cap field to match spec.
tools/lguest: fix features_accepted logic in example launcher.
tools/lguest: handle device reset correctly in example launcher.
virtual: Documentation: simplify and generalize paravirt_ops.txt
lguest: remove NOTIFY call and eventfd facility.
lguest: remove NOTIFY facility from demonstration launcher.
lguest: use the PCI console device's emerg_wr for early boot messages.
lguest: always put console in PCI slot #1.
...
Conflicts:
drivers/net/vxlan.c
drivers/vhost/net.c
include/linux/if_vlan.h
net/core/dev.c
The net/core/dev.c conflict was the overlap of one commit marking an
existing function static whilst another was adding a new function.
In the include/linux/if_vlan.h case, the type used for a local
variable was changed in 'net', whereas the function got rewritten
to fix a stacked vlan bug in 'net-next'.
In drivers/vhost/net.c, Al Viro's iov_iter conversions in 'net-next'
overlapped with an endainness fix for VHOST 1.0 in 'net'.
In drivers/net/vxlan.c, vxlan_find_vni() added a 'flags' parameter
in 'net-next' whereas in 'net' there was a bug fix to pass in the
correct network namespace pointer in calls to this function.
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 3d0ad09412.
Now that GSO functionality can correctly track if the fragment
id has been selected and select a fragment id if necessary,
we can re-enable UFO on tap/macvap and virtio devices.
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch enables the use of software timestamping via the virtio_net
driver.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Some devices might not implement config space access
(e.g. remoteproc used not to - before 3.9).
virtio/net needs config space access so make it
fail gracefully if not there.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
There's no need to do header check for virtio-net since:
- Host sets dodgy for all gso packets from guest and check the header.
- Host should be prepared for all kinds of evil packets from guest, since
malicious guest can send any kinds of packet.
So this patch sets NETIF_F_GSO_ROBUST for virtio-net to skip the check.
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit d75b1ade56 (net: less
interrupt masking in NAPI) breaks virtio_net in an insidious way.
It is now required that if the entire budget is consumed when poll
returns, the napi poll_list must remain empty. However, like some
other drivers virtio_net tries to do a last-ditch check and if
there is more work it will call napi_schedule and then immediately
process some of this new work. Should the entire budget be consumed
while processing such new work then we will violate the new caller
contract.
This patch fixes this by not touching any work when we reschedule
in virtio_net.
The worst part of this bug is that the list corruption causes other
napi users to be moved off-list. In my case I was chasing a stall
in IPsec (IPsec uses netif_rx) and I only belatedly realised that it
was virtio_net which caused the stall even though the virtio_net
poll was still functioning perfectly after IPsec stalled.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The spec states that mac in config space is only driver-writable in the
legacy case. Fence writing it in virtnet_set_mac_address() in the
virtio 1.0 case.
Suggested-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
With VERSION_1 virtio_net uses same header size
whether mergeable buffers are enabled or not.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Reviewed-by: Jason Wang <jasowang@redhat.com>
Our buffer length check is not strict enough for mergeable
buffers: buffer can still be shorter that header + address
by 2 bytes.
Fix that up.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Reviewed-by: Jason Wang <jasowang@redhat.com>
virtio 1.0 doesn't use virtio_net_hdr anymore, and in fact, it's not
really useful since virtio_net_hdr_mrg_rxbuf includes that as the first
field anyway.
Let's drop it, precalculate header len and store within vi instead.
This way we can also remove struct skb_vnet_hdr.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Reviewed-by: Jason Wang <jasowang@redhat.com>
Too many places poke at [rs]q->vq->vdev->priv just to get
the vi structure. Let's just pass the pointer around: seems
cleaner, and might even be faster.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Based on patches by Rusty Russell, Cornelia Huck.
Note: more code changes are needed for 1.0 support
(due to different header size).
So we don't advertize support for 1.0 yet.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
We currently trigger BUG when VIRTIO_NET_F_CTRL_VQ
is not set but one of features depending on it is.
That's not a friendly way to report errors to
hypervisors.
Let's check, and fail probe instead.
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Cc: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv6 does not allow fragmentation by routers, so there is no
fragmentation ID in the fixed header. UFO for IPv6 requires the ID to
be passed separately, but there is no provision for this in the virtio
net protocol.
Until recently our software implementation of UFO/IPv6 generated a new
ID, but this was a bug. Now we will use ID=0 for any UFO/IPv6 packet
passed through a tap, which is even worse.
Unfortunately there is no distinction between UFO/IPv4 and v6
features, so disable UFO on taps and virtio_net completely until we
have a proper solution.
We cannot depend on VM managers respecting the tap feature flags, so
keep accepting UFO packets but log a warning the first time we do
this.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Fixes: 916e4cf46d ("ipv6: reuse ip6_frag_id from ip6_ufo_append_data")
Signed-off-by: David S. Miller <davem@davemloft.net>
been sitting in MST's tree during my vacation. I changed a function name
and made one trivial change, then they spent two days in linux-next.
Thanks,
Rusty.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUQFBQAAoJENkgDmzRrbjxJRIP/1yCQRElQewxURSmJelyqCdU
0mHYB0R9Mf3tfre1xnofqs2lWeSMc/4ptKHsVR6pupoztSwnz7HsLHfEFvFJh4mj
KsaqYElxkNxTcfyHwLjyJS0/J6tG1tYypXGiimTBS0bvFHL3XZdimVgJ6WvX+gO7
YSaDEX8/EqCERafslS5+gKJlz3drDOnCZCe9y4BDSmsvl2k7bkpSxIn8vsR6jIC0
c5JpUy6QVF+3XA/J932M7yRs+xpqxNoUWiyY3ar9o3CtQAaQB0ZAetSxY6hTfvVc
GlNFzCifdsaQwsl2SVsE2h6tWaRhtMtcGWQuhHThIPyIf8XxhYyBRY2FLo70LMz1
eqtwy6F/Bg/nzUsdee4PZBMeoKHlAEL12RpsEKgfUoLzj16Aqa8ll+Agbglbkw8G
f3d2FwzKAlpY5NwHETC1wYy52PJ3efqksRWuhokmYpxNSbHJS/lsiJOE7272/4Qr
MtXuvRmo22tf34XFd5y7zqWjgZ58eeFOqQWi/K+6ZgpqVOvikjrXXKEuiVdjO0ZD
kTVR/sQKiR+79rzENk80XBhWaMveECNXF1TiZ/3MmURkmEOBRQMxRQ20BX3exvna
AJ/WVA5DcfXZc1yyqknE1NLGrvSBMJENH13x2QPwrqNWAryOOKuF1VKKIwWlDw5j
vtx5nXiJa8YYdxI2TJCN
=JK6x
-----END PGP SIGNATURE-----
Merge tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull virtio updates from Rusty Russell:
"One cc: stable commit, the rest are a series of minor cleanups which
have been sitting in MST's tree during my vacation. I changed a
function name and made one trivial change, then they spent two days in
linux-next"
* tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (25 commits)
virtio-rng: refactor probe error handling
virtio_scsi: drop scan callback
virtio_balloon: enable VQs early on restore
virtio_scsi: fix race on device removal
virito_scsi: use freezable WQ for events
virtio_net: enable VQs early on restore
virtio_console: enable VQs early on restore
virtio_scsi: enable VQs early on restore
virtio_blk: enable VQs early on restore
virtio_scsi: move kick event out from virtscsi_init
virtio_net: fix use after free on allocation failure
9p/trans_virtio: enable VQs early
virtio_console: enable VQs early
virtio_blk: enable VQs early
virtio_net: enable VQs early
virtio: add API to enable VQs early
virtio_net: minor cleanup
virtio-net: drop config_mutex
virtio_net: drop config_enable
virtio-blk: drop config_mutex
...
commit 0b725a2ca6
net: Remove ndo_xmit_flush netdev operation, use signalling instead.
added code that looks at skb->xmit_more after the skb has
been put in TX VQ. Since some paths process the ring and free the skb
immediately, this can cause use after free.
Fix by storing xmit_more in a local variable.
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
virtio spec requires drivers to set DRIVER_OK before using VQs.
This is set automatically after restore returns, virtio net violated this
rule by using receive VQs within restore.
To fix, call virtio_device_ready before using VQs.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
In the extremely unlikely event that driver initialization fails after
RX buffers are added, virtio net frees RX buffers while VQs are
still active, potentially causing device to use a freed buffer.
To fix, reset device first - same as we do on device removal.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
virtio spec requires drivers to set DRIVER_OK before using VQs.
This is set automatically after probe returns, virtio net violated this
rule by using receive VQs within probe.
To fix, call virtio_device_ready before using VQs.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
goto done;
done:
return;
is ugly, it was put there to make diff review easier.
replace by open-coded return.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
config_mutex served two purposes: prevent multiple concurrent config
change handlers, and synchronize access to config_enable flag.
Since commit dbf2576e37
workqueue: make all workqueues non-reentrant
all workqueues are non-reentrant, and config_enable
is now gone.
Get rid of the unnecessary lock.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Now that virtio core ensures config changes don't arrive during probing,
drop config_enable flag in virtio net.
On removal, flush is now sufficient to guarantee that no change work is
queued.
This help simplify the driver, and will allow setting DRIVER_OK earlier
without losing config change notifications.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This is the only driver which doesn't hand virtqueue_add_inbuf and
virtqueue_add_outbuf a well-formed, well-terminated sg. Fix it,
so we can make virtio_add_* simpler.
pktgen results:
modprobe pktgen
echo 'add_device eth0' > /proc/net/pktgen/kpktgend_0
echo nowait 1 > /proc/net/pktgen/eth0
echo count 1000000 > /proc/net/pktgen/eth0
echo clone_skb 100000 > /proc/net/pktgen/eth0
echo dst_mac 4e:14:25:a9:30:ac > /proc/net/pktgen/eth0
echo dst 192.168.1.2 > /proc/net/pktgen/eth0
for i in `seq 20`; do echo start > /proc/net/pktgen/pgctrl; tail -n1 /proc/net/pktgen/eth0; done
Before:
746547-793084(786421+/-9.6e+03)pps 346-367(364.4+/-4.4)Mb/sec (346397808-367990976(3.649e+08+/-4.5e+06)bps) errors: 0
After:
767390-792966(785159+/-6.5e+03)pps 356-367(363.75+/-2.9)Mb/sec (356068960-367936224(3.64314e+08+/-3e+06)bps) errors: 0
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mirror the changes made to ixgbe in commit 2367a17390
("ixgbe: flush when in xmit_more mode and under descriptor pressure")
Signed-off-by: David S. Miller <davem@davemloft.net>
As reported by Jesper Dangaard Brouer, for high packet rates the
overhead of having another indirect call in the TX path is
non-trivial.
There is the indirect call itself, and then there is all of the
reloading of the state to refetch the tail pointer value and
then write the device register.
Move to a more passive scheme, which requires very light modifications
to the device drivers.
The signal is a new skb->xmit_more value, if it is non-zero it means
that more SKBs are pending to be transmitted on the same queue as the
current SKB. And therefore, the driver may elide the tail pointer
update.
Right now skb->xmit_more is always zero.
Signed-off-by: David S. Miller <davem@davemloft.net>
Add basic support for rx busy polling. Instead of introducing new
states and spinlock to synchronize between NAPI and polling method,
this patch just reuse NAPI state to avoid extra overhead for fast path
and simplified the codes.
Test was done between a kvm guest and an external host. Two hosts were
connected through 40gb mlx4 cards. With both busy_poll and busy_read
are set to 50 in guest, 1 byte netperf tcp_rr shows 127% improvement:
transaction rate was increased from 8353.33 to 18966.87.
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Vlad Yasevich <vyasevic@redhat.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move common receive logic to a new helper virtnet_receive(). It will
also be used by rx busy polling method.
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Vlad Yasevich <vyasevic@redhat.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net: get rid of SET_ETHTOOL_OPS
Dave Miller mentioned he'd like to see SET_ETHTOOL_OPS gone.
This does that.
Mostly done via coccinelle script:
@@
struct ethtool_ops *ops;
struct net_device *dev;
@@
- SET_ETHTOOL_OPS(dev, ops);
+ dev->ethtool_ops = ops;
Compile tested only, but I'd seriously wonder if this broke anything.
Suggested-by: Dave Miller <davem@davemloft.net>
Signed-off-by: Wilfried Klaebe <w-lkml@lebenslange-mailadresse.de>
Acked-by: Felipe Balbi <balbi@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a small supplement for commit e7428e95a0
("virtio-net: put virtio-net header inline with data"). TCP packages have
enough room to put virtio-net header in, but UDP packages do not. By
setting dev->needed_headroom for virtio-net device, UDP packages could have
enough room.
For UDP packages, sk_buff is alloced in fun __ip_append_data. The size is
"alloclen + hh_len + 15", and "hh_len = LL_RESERVED_SPACE(rt-dst.dev);".
The Macro is defined as follows:
#define LL_RESERVED_SPACE(dev) \
((((dev)->hard_header_len+(dev)->needed_headroom)\
&~(HH_DATA_MOD - 1)) + HH_DATA_MOD)
By default, for UDP packages, after skb is allocated, only 16 bytes
reserved. And 2 bytes remained after mac header is set. That is not enough
to put virtio-net header in. If we set dev->needed_headroom to 12 or 10
(according to mergeable_rx_bufs is on or off ), more room can be reserved.
Then there is enough room for UDP packages to put the header in.
test result list as below:
guest and host: suse11sp3, netperf, intel 2.4GHz
+-------+---------+---------+---------+---------+
| | old | new |
+-------+---------+---------+---------+---------+
| UDP | Gbit/s | pps | Gbit/s | pps |
| 64 | 0.57 | 692232 | 0.61 | 742420 |
| 256 | 1.60 | 686860 | 1.71 | 733331 |
| 512 | 2.92 | 674576 | 3.07 | 710446 |
| 1024 | 4.99 | 598977 | 5.17 | 620821 |
| 1460 | 5.68 | 483757 | 7.16 | 610519 |
| 4096 | 6.98 | 637468 | 7.21 | 658471 |
+-------+---------+---------+---------+---------+
Signed-off-by: Zhang Jie <zhangjie14@huawei.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Execute "ethtool -L eth0 combined 0" in guest, if multiqueue
is enabled, virtnet_send_command() will return -EINVAL error,
there is a validation in QEMU.
But if multiqueue is disabled, virtnet_set_queues() will just
return zero (success). We should return error for this situation.
Signed-off-by: Amos Kong <akong@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking updates from David Miller:
"Here is my initial pull request for the networking subsystem during
this merge window:
1) Support for ESN in AH (RFC 4302) from Fan Du.
2) Add full kernel doc for ethtool command structures, from Ben
Hutchings.
3) Add BCM7xxx PHY driver, from Florian Fainelli.
4) Export computed TCP rate information in netlink socket dumps, from
Eric Dumazet.
5) Allow IPSEC SA to be dumped partially using a filter, from Nicolas
Dichtel.
6) Convert many drivers to pci_enable_msix_range(), from Alexander
Gordeev.
7) Record SKB timestamps more efficiently, from Eric Dumazet.
8) Switch to microsecond resolution for TCP round trip times, also
from Eric Dumazet.
9) Clean up and fix 6lowpan fragmentation handling by making use of
the existing inet_frag api for it's implementation.
10) Add TX grant mapping to xen-netback driver, from Zoltan Kiss.
11) Auto size SKB lengths when composing netlink messages based upon
past message sizes used, from Eric Dumazet.
12) qdisc dumps can take a long time, add a cond_resched(), From Eric
Dumazet.
13) Sanitize netpoll core and drivers wrt. SKB handling semantics.
Get rid of never-used-in-tree netpoll RX handling. From Eric W
Biederman.
14) Support inter-address-family and namespace changing in VTI tunnel
driver(s). From Steffen Klassert.
15) Add Altera TSE driver, from Vince Bridgers.
16) Optimizing csum_replace2() so that it doesn't adjust the checksum
by checksumming the entire header, from Eric Dumazet.
17) Expand BPF internal implementation for faster interpreting, more
direct translations into JIT'd code, and much cleaner uses of BPF
filtering in non-socket ocntexts. From Daniel Borkmann and Alexei
Starovoitov"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1976 commits)
netpoll: Use skb_irq_freeable to make zap_completion_queue safe.
net: Add a test to see if a skb is freeable in irq context
qlcnic: Fix build failure due to undefined reference to `vxlan_get_rx_port'
net: ptp: move PTP classifier in its own file
net: sxgbe: make "core_ops" static
net: sxgbe: fix logical vs bitwise operation
net: sxgbe: sxgbe_mdio_register() frees the bus
Call efx_set_channels() before efx->type->dimension_resources()
xen-netback: disable rogue vif in kthread context
net/mlx4: Set proper build dependancy with vxlan
be2net: fix build dependency on VxLAN
mac802154: make csma/cca parameters per-wpan
mac802154: allow only one WPAN to be up at any given time
net: filter: minor: fix kdoc in __sk_run_filter
netlink: don't compare the nul-termination in nla_strcmp
can: c_can: Avoid led toggling for every packet.
can: c_can: Simplify TX interrupt cleanup
can: c_can: Store dlc private
can: c_can: Reduce register access
can: c_can: Make the code readable
...
doubling of the default queue length though.
Cheers,
Rusty.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.14 (GNU/Linux)
iQIcBAABAgAGBQJTOirvAAoJENkgDmzRrbjxpqAP/3114WOubf4MNoi24eXNkU2x
ZC++orq408A72g8dB7A0XAhatKc8Ay0bDP5hOZNAgTHm2hjYxmhpC9UlKEzzElpW
yjwR0wYBXA0RoJuZqCp9MNgOtkU54QoQZ0c4EZblagslUZBmKQPUDE7XdgkaqO6o
A3azfCjAFDu523Azep9Npj1sk+H+VH3OIXyMGY+uyHBw1a2rnmIhn4lQCQ+pX+YO
3wpqxEragpjBAizs1CAAB9wWm2O8zVACxkoUXVYI8Tu60n99lwr9Abxlc0oHSIig
E7kBnyzQVHfkDrPXR3EdwTi3Hwd/BaOiW4dPvQ3IJKvPOzoiS4H3IpJCnCV5PfRb
VHl3q//SzPQ+GXH7WH2Fhb9JxoLxBRzFcy3kdIR1wBHYahAOiQjcLgapOO5mVq3X
PJy9CDs2L9rjbtxvQWtnl62V3JFGw+ZdhhG/BjeC5Who/aSh/mDoss7/qdfrxKJx
z5IWYSlJw7ighOuF0dPdCKAX9WiWqENvga31Q2svrH4Hxlx6eGumEmX+YQw0iRAf
ddMYA+1qLT4myPTN0nORFM+T/mkZHNkNMCr0qylRFH0j6hSiDxwWqQG0eXA661By
W6nIkW++sj2Vkk4knMGCXSyMmy9Nrv+1R+8unQJCXixYevotP5JEY0DoCQwlGuuq
xa0UR+2q9Htnbytu8S0K
=AoMS
-----END PGP SIGNATURE-----
Merge tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull virtio updates from Rusty Russell:
"Nothing exciting: virtio-blk users might see a bit of a boost from the
doubling of the default queue length though"
* tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
virtio-blk: base queue-depth on virtqueue ringsize or module param
Revert a02bbb1ccfe8: MAINTAINERS: add virtio-dev ML for virtio
virtio: fail adding buffer on broken queues.
virtio-rng: don't crash if virtqueue is broken.
virtio_balloon: don't crash if virtqueue is broken.
virtio_blk: don't crash, report error if virtqueue is broken.
virtio_net: don't crash if virtqueue is broken.
virtio_balloon: don't softlockup on huge balloon changes.
virtio: Use pci_enable_msix_exact() instead of pci_enable_msix()
MAINTAINERS: virtio-dev is subscribers only
tools/virtio: add a missing )
tools/virtio: fix missing kmemleak_ignore symbol
tools/virtio: update internal copies of headers
Conflicts:
drivers/net/ethernet/marvell/mvneta.c
The mvneta.c conflict is a case of overlapping changes,
a conversion to devm_ioremap_resource() vs. a conversion
to netdev_alloc_pcpu_stats.
Signed-off-by: David S. Miller <davem@davemloft.net>
Current error handling of virtqueue_kick() was wrong in two places:
- The skb were freed immediately when virtqueue_kick() fail during
xmit. This may lead double free since the skb was not detached from
the virtqueue.
- try_fill_recv() returns false when virtqueue_kick() fail. This will
lead unnecessary rescheduling of refill work.
Actually, it's safe to just ignore the kick failure in those two
places. So this patch fixes this by partially revert commit
6797590118.
Fixes 6797590118
(virtio_net: verify if virtqueue_kick() succeeded).
Cc: Heinz Graalfs <graalfs@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace dev_kfree_skb with dev_kfree_skb_any in start_xmit which can
be called in hard irq and other contexts.
start_xmit only frees skbs that it is dropping.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Replace the bh safe variant with the hard irq safe variant.
We need a hard irq safe variant to deal with netpoll transmitting
packets from hard irq context, and we need it in most if not all of
the places using the bh safe variant.
Except on 32bit uni-processor the code is exactly the same so don't
bother with a bh variant, just have a hard irq safe variant that
everyone can use.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A bad implementation of virtio might cause us to mark the virtqueue
broken: we'll dev_err() in that case, and the device is useless, but
let's not BUG_ON().
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We should alloc big buffers also when guest can receive UFO
packets to let the big packets fit into guest rx buffer.
Fixes 5c5167515d
(virtio-net: Allow UFO feature to be set and advertised.)
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add initial support for per-rx queue sysfs attributes to virtio-net. If
mergeable packet buffers are enabled, adds a read-only mergeable packet
buffer size sysfs attribute for each RX queue.
Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael Dalton <mwdalton@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 2613af0ed1 ("virtio_net: migrate mergeable rx buffers to page frag
allocators") changed the mergeable receive buffer size from PAGE_SIZE to
MTU-size, introducing a single-stream regression for benchmarks with large
average packet size. There is no single optimal buffer size for all
workloads. For workloads with packet size <= MTU bytes, MTU + virtio-net
header-sized buffers are preferred as larger buffers reduce the TCP window
due to SKB truesize. However, single-stream workloads with large average
packet sizes have higher throughput if larger (e.g., PAGE_SIZE) buffers
are used.
This commit auto-tunes the mergeable receiver buffer packet size by
choosing the packet buffer size based on an EWMA of the recent packet
sizes for the receive queue. Packet buffer sizes range from MTU_SIZE +
virtio-net header len to PAGE_SIZE. This improves throughput for
large packet workloads, as any workload with average packet size >=
PAGE_SIZE will use PAGE_SIZE buffers.
These optimizations interact positively with recent commit
ba27524103 ("virtio-net: coalesce rx frags when possible during rx"),
which coalesces adjacent RX SKB fragments in virtio_net. The coalescing
optimizations benefit buffers of any size.
Benchmarks taken from an average of 5 netperf 30-second TCP_STREAM runs
between two QEMU VMs on a single physical machine. Each VM has two VCPUs
with all offloads & vhost enabled. All VMs and vhost threads run in a
single 4 CPU cgroup cpuset, using cgroups to ensure that other processes
in the system will not be scheduled on the benchmark CPUs. Trunk includes
SKB rx frag coalescing.
net-next w/ virtio_net before 2613af0ed1 (PAGE_SIZE bufs): 14642.85Gb/s
net-next (MTU-size bufs): 13170.01Gb/s
net-next + auto-tune: 14555.94Gb/s
Jason Wang also reported a throughput increase on mlx4 from 22Gb/s
using MTU-sized buffers to about 26Gb/s using auto-tuning.
Signed-off-by: Michael Dalton <mwdalton@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The virtio-net driver currently uses netdev_alloc_frag() for GFP_ATOMIC
mergeable rx buffer allocations. This commit migrates virtio-net to use
per-receive queue page frags for GFP_ATOMIC allocation. This change unifies
mergeable rx buffer memory allocation, which now will use skb_refill_frag()
for both atomic and GFP-WAIT buffer allocations.
To address fragmentation concerns, if after buffer allocation there
is too little space left in the page frag to allocate a subsequent
buffer, the remaining space is added to the current allocated buffer
so that the remaining space can be used to store packet data.
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael Dalton <mwdalton@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It looks like there's no need for those two fields:
- Unless there's a failure for the first refill try, rq->max should be always
equal to the vring size.
- rq->num is only used to determine the condition that we need to do the refill,
we could check vq->num_free instead.
- rq->num was required to be increased or decreased explicitly after each
get/put which results a bad API.
So this patch removes them both to make the code simpler.
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c
net/ipv6/ip6_tunnel.c
net/ipv6/ip6_vti.c
ipv6 tunnel statistic bug fixes conflicting with consolidation into
generic sw per-cpu net stats.
qlogic conflict between queue counting bug fix and the addition
of multiple MAC address support.
Signed-off-by: David S. Miller <davem@davemloft.net>
During restoring, try_fill_recv() was called with neither napi lock nor napi
disabled. This can lead two try_fill_recv() was called in the same time. Fix
this by refilling before trying to enable napi.
Fixes 0741bcb558
(virtio: net: Add freeze, restore handlers to support S4).
Cc: Amit Shah <amit.shah@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
All the code passes NULL for the last sg list (in).
Simplify by just removing it.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge 'net' into 'net-next' to get the AF_PACKET bug fix that
Daniel's direct transmit changes depend upon.
Signed-off-by: David S. Miller <davem@davemloft.net>
When a packet with invalid length arrives, ensure that the packet
is freed correctly if mergeable packet buffers and big packets
(GUEST_TSO4) are both enabled.
Signed-off-by: Michael Dalton <mwdalton@google.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Several files refer to an old address for the Free Software Foundation
in the file header comment. Resolve by replacing the address with
the URL <http://www.gnu.org/licenses/> so that we do not have to keep
updating the header comments anytime the address changes.
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Veaceslav Falico <vfalico@redhat.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: Haiyang Zhang <haiyangz@microsoft.com>
CC: "K. Y. Srinivasan" <kys@microsoft.com>
CC: Paul Mackerras <paulus@samba.org>
CC: Ian Campbell <ian.campbell@citrix.com>
CC: Wei Liu <wei.liu2@citrix.com>
CC: Rusty Russell <rusty@rustcorp.com.au>
CC: "Michael S. Tsirkin" <mst@redhat.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
receive mergeable now handles errors internally.
Do same for big and small packet paths, otherwise
the logic is too hard to follow.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet noticed that if we encounter an error
when processing a mergeable buffer, we don't
dequeue all of the buffers from this packet,
the result is almost sure to be loss of networking.
Jason Wang noticed that we also leak a page and that we don't decrement
the rq buf count, so we won't repost buffers (a resource leak).
Fix both issues.
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michael Dalton <mwdalton@google.com>
Reported-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
"MAC filter" sounds more reasonable than "MAC fitler".
Signed-off-by: Thomas Huth <thuth@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
"Mostly these are fixes for fallout due to merge window changes, as
well as cures for problems that have been with us for a much longer
period of time"
1) Johannes Berg noticed two major deficiencies in our genetlink
registration. Some genetlink protocols we passing in constant
counts for their ops array rather than something like
ARRAY_SIZE(ops) or similar. Also, some genetlink protocols were
using fixed IDs for their multicast groups.
We have to retain these fixed IDs to keep existing userland tools
working, but reserve them so that other multicast groups used by
other protocols can not possibly conflict.
In dealing with these two problems, we actually now use less state
management for genetlink operations and multicast groups.
2) When configuring interface hardware timestamping, fix several
drivers that simply do not validate that the hwtstamp_config value
is one the driver actually supports. From Ben Hutchings.
3) Invalid memory references in mwifiex driver, from Amitkumar Karwar.
4) In dev_forward_skb(), set the skb->protocol in the right order
relative to skb_scrub_packet(). From Alexei Starovoitov.
5) Bridge erroneously fails to use the proper wrapper functions to make
calls to netdev_ops->ndo_vlan_rx_{add,kill}_vid. Fix from Toshiaki
Makita.
6) When detaching a bridge port, make sure to flush all VLAN IDs to
prevent them from leaking, also from Toshiaki Makita.
7) Put in a compromise for TCP Small Queues so that deep queued devices
that delay TX reclaim non-trivially don't have such a performance
decrease. One particularly problematic area is 802.11 AMPDU in
wireless. From Eric Dumazet.
8) Fix crashes in tcp_fastopen_cache_get(), we can see NULL socket dsts
here. Fix from Eric Dumzaet, reported by Dave Jones.
9) Fix use after free in ipv6 SIT driver, from Willem de Bruijn.
10) When computing mergeable buffer sizes, virtio-net fails to take the
virtio-net header into account. From Michael Dalton.
11) Fix seqlock deadlock in ip4_datagram_connect() wrt. statistic
bumping, this one has been with us for a while. From Eric Dumazet.
12) Fix NULL deref in the new TIPC fragmentation handling, from Erik
Hugne.
13) 6lowpan bit used for traffic classification was wrong, from Jukka
Rissanen.
14) macvlan has the same issue as normal vlans did wrt. propagating LRO
disabling down to the real device, fix it the same way. From Michal
Kubecek.
15) CPSW driver needs to soft reset all slaves during suspend, from
Daniel Mack.
16) Fix small frame pacing in FQ packet scheduler, from Eric Dumazet.
17) The xen-netfront RX buffer refill timer isn't properly scheduled on
partial RX allocation success, from Ma JieYue.
18) When ipv6 ping protocol support was added, the AF_INET6 protocol
initialization cleanup path on failure was borked a little. Fix
from Vlad Yasevich.
19) If a socket disconnects during a read/recvmsg/recvfrom/etc that
blocks we can do the wrong thing with the msg_name we write back to
userspace. From Hannes Frederic Sowa. There is another fix in the
works from Hannes which will prevent future problems of this nature.
20) Fix route leak in VTI tunnel transmit, from Fan Du.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (106 commits)
genetlink: make multicast groups const, prevent abuse
genetlink: pass family to functions using groups
genetlink: add and use genl_set_err()
genetlink: remove family pointer from genl_multicast_group
genetlink: remove genl_unregister_mc_group()
hsr: don't call genl_unregister_mc_group()
quota/genetlink: use proper genetlink multicast APIs
drop_monitor/genetlink: use proper genetlink multicast APIs
genetlink: only pass array to genl_register_family_with_ops()
tcp: don't update snd_nxt, when a socket is switched from repair mode
atm: idt77252: fix dev refcnt leak
xfrm: Release dst if this dst is improper for vti tunnel
netlink: fix documentation typo in netlink_set_err()
be2net: Delete secondary unicast MAC addresses during be_close
be2net: Fix unconditional enabling of Rx interface options
net, virtio_net: replace the magic value
ping: prevent NULL pointer dereference on write to msg_name
bnx2x: Prevent "timeout waiting for state X"
bnx2x: prevent CFC attention
bnx2x: Prevent panic during DMAE timeout
...
It is more appropriate to use # of queue pairs currently used by
the driver instead of a magic value.
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
some robustness fixes for broken virtio devices, plus minor tweaks.
[vs last pull request: added the virtio-scsi broken vq escape patch, which
I somehow lost.]
Cheers,
Rusty.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.14 (GNU/Linux)
iQIcBAABAgAGBQJSgDsJAAoJENkgDmzRrbjxEE4P/jXqZHS/HdlxW9k0BjKKlEIF
PdtCoP3UhWTdskXvy2pD8m6nYn214MEJYUIa4HFlIEZsdxhuexzQHY19Ynkjagyv
57sRsUUm5fYQLIL7IUh2DUD1VU38hUFinno/y333szzvCj9qITDA/QABsiWxK8NO
dq+Lmeixgrhc5yN9iryW+gZV+hekJIZ4LsU5ejSaJucKblzXUH8qIbmSthG7RTYJ
tr4J7xTTXbhxY4CoC5Dpx2hvsFkvzaAIvI4Nr1mDjfq5cR8BaYvnC89U1IbhdAey
p1AbZE58JLrY+Z8K8LBRGV2KjO8qSZ6R47hbZ9nAnodJYB7sZLyj6jUe1q+/htuC
Dh9Xm9O4eW2xNaFk20dYeIF4UU5/HzdsbvG/IlH8x4sm8/K706ocYyAOHlzYUg2T
k7gltrgDzDokMgb2R44gwnr4oaJ2q8Gne6JXswlPEv2eRs6vNnA5Xhc0rEHGkU6C
gYn1vNFN6yx0vf2syG/Ce5pZtMxGpefKQkHzzWdq8FKr1B9s54dDuf2hls7J8A9t
OQT1gE33yURSelf4Kh4k9zWXaWk/Ohv9l2R1cqpALnJ4/+q0fP5t7HdK500S7aax
DxLeFeqvsBw7nlWgsGxQmt+fjITQFHhcDiwst0ehnt6RbDEW7XPIguz0K/gyhxYG
+UNbl/5Gr64jnUX3YCzm
=vY2L
-----END PGP SIGNATURE-----
Merge tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull virtio updates from Rusty Russell:
"Nothing really exciting: some groundwork for changing virtio endian,
and some robustness fixes for broken virtio devices, plus minor
tweaks"
* tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
virtio_scsi: verify if queue is broken after virtqueue_get_buf()
x86, asmlinkage, lguest: Pass in globals into assembler statement
virtio: mmio: fix signature checking for BE guests
virtio_ring: adapt to notify() returning bool
virtio_net: verify if queue is broken after virtqueue_get_buf()
virtio_console: verify if queue is broken after virtqueue_get_buf()
virtio_blk: verify if queue is broken after virtqueue_get_buf()
virtio_ring: add new function virtqueue_is_broken()
virtio_test: verify if virtqueue_kick() succeeded
virtio_net: verify if virtqueue_kick() succeeded
virtio_ring: let virtqueue_{kick()/notify()} return a bool
virtio_ring: change host notification API
virtio_config: remove virtio_config_val
virtio: use size-based config accessors.
virtio_config: introduce size-based accessors.
virtio_ring: plug kmemleak false positive.
virtio: pm: use CONFIG_PM_SLEEP instead of CONFIG_PM
Commit 2613af0ed1 ("virtio_net: migrate mergeable rx buffers to page
frag allocators") changed the mergeable receive buffer size from PAGE_SIZE
to MTU-size. However, the merge buffer size does not take into account the
size of the virtio-net header. Consequently, packets that are MTU-size
will take two buffers intead of one (to store the virtio-net header),
substantially decreasing the throughput of MTU-size traffic due to TCP
window / SKB truesize effects.
This commit changes the mergeable buffer size to include the virtio-net
header. The buffer size is cacheline-aligned because skb_page_frag_refill
will not automatically align the requested size.
Benchmarks taken from an average of 5 netperf 30-second TCP_STREAM runs
between two QEMU VMs on a single physical machine. Each VM has two VCPUs and
vhost enabled. All VMs and vhost threads run in a single 4 CPU cgroup
cpuset, using cgroups to ensure that other processes in the system will not
be scheduled on the benchmark CPUs. Transmit offloads and mergeable receive
buffers are enabled, but guest_tso4 / guest_csum are explicitly disabled to
force MTU-sized packets on the receiver.
next-net trunk before 2613af0ed1 (PAGE_SIZE buf): 3861.08Gb/s
net-next trunk (MTU 1500- packet uses two buf due to size bug): 4076.62Gb/s
net-next trunk (MTU 1480- packet fits in one buf): 6301.34Gb/s
net-next trunk w/ size fix (MTU 1500 - packet fits in one buf): 6445.44Gb/s
Suggested-by: Eric Northup <digitaleric@google.com>
Signed-off-by: Michael Dalton <mwdalton@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull core locking changes from Ingo Molnar:
"The biggest changes:
- add lockdep support for seqcount/seqlocks structures, this
unearthed both bugs and required extra annotation.
- move the various kernel locking primitives to the new
kernel/locking/ directory"
* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
block: Use u64_stats_init() to initialize seqcounts
locking/lockdep: Mark __lockdep_count_forward_deps() as static
lockdep/proc: Fix lock-time avg computation
locking/doc: Update references to kernel/mutex.c
ipv6: Fix possible ipv6 seqlock deadlock
cpuset: Fix potential deadlock w/ set_mems_allowed
seqcount: Add lockdep functionality to seqcount/seqlock structures
net: Explicitly initialize u64_stats_sync structures for lockdep
locking: Move the percpu-rwsem code to kernel/locking/
locking: Move the lglocks code to kernel/locking/
locking: Move the rwsem code to kernel/locking/
locking: Move the rtmutex code to kernel/locking/
locking: Move the semaphore core to kernel/locking/
locking: Move the spinlock code to kernel/locking/
locking: Move the lockdep code to kernel/locking/
locking: Move the mutex code to kernel/locking/
hung_task debugging: Add tracepoint to report the hang
x86/locking/kconfig: Update paravirt spinlock Kconfig description
lockstat: Report avg wait and hold times
lockdep, x86/alternatives: Drop ancient lockdep fixup message
...
In order to enable lockdep on seqcount/seqlock structures, we
must explicitly initialize any locks.
The u64_stats_sync structure, uses a seqcount, and thus we need
to introduce a u64_stats_init() function and use it to initialize
the structure.
This unfortunately adds a lot of fairly trivial initialization code
to a number of drivers. But the benefit of ensuring correctness makes
this worth while.
Because these changes are required for lockdep to be enabled, and the
changes are quite trivial, I've not yet split this patch out into 30-some
separate patches, as I figured it would be better to get the various
maintainers thoughts on how to best merge this change along with
the seqcount lockdep enablement.
Feedback would be appreciated!
Signed-off-by: John Stultz <john.stultz@linaro.org>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: James Morris <jmorris@namei.org>
Cc: Jesse Gross <jesse@nicira.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Mirko Lindner <mlindner@marvell.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Roger Luethi <rl@hellgate.ch>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Simon Horman <horms@verge.net.au>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Wensong Zhang <wensong@linux-vs.org>
Cc: netdev@vger.kernel.org
Link: http://lkml.kernel.org/r/1381186321-4906-2-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We used to use a percpu structure vq_index to record the cpu to queue
mapping, this is suboptimal since it duplicates the work of XPS and
loses all other XPS functionality such as allowing user to configure
their own transmission steering strategy.
So this patch switches to use XPS and suggest a default mapping when
the number of cpus is equal to the number of queues. With XPS support,
there's no need for keeping per-cpu vq_index and .ndo_select_queue(),
so they were removed also.
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 2613af0ed1 (virtio_net: migrate mergeable
rx buffers to page frag allocators) try to increase the payload/truesize for
MTU-sized traffic. But this will introduce the extra overhead for GSO packets
received because of the frag list. This commit tries to reduce this issue by
coalesce the possible rx frags when possible during rx. Test result shows the
about 15% improvement on full size GSO packet receiving (and even better than
before commit 2613af0ed1).
Before this commit:
./netperf -H 192.168.100.4
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.100.4
() port 0 AF_INET : demo
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
87380 16384 16384 10.00 20303.87
After this commit:
./netperf -H 192.168.100.4
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.100.4
() port 0 AF_INET : demo
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
87380 16384 16384 10.00 23841.26
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Michael Dalton <mwdalton@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/emulex/benet/be.h
drivers/net/netconsole.c
net/bridge/br_private.h
Three mostly trivial conflicts.
The net/bridge/br_private.h conflict was a function signature (argument
addition) change overlapping with the extern removals from Joe Perches.
In drivers/net/netconsole.c we had one change adjusting a printk message
whilst another changed "printk(KERN_INFO" into "pr_info(".
Lastly, the emulex change was a new inline function addition overlapping
with Joe Perches's extern removals.
Signed-off-by: David S. Miller <davem@davemloft.net>
The virtio_net driver's mergeable receive buffer allocator
uses 4KB packet buffers. For MTU-sized traffic, SKB truesize
is > 4KB but only ~1500 bytes of the buffer is used to store
packet data, reducing the effective TCP window size
substantially. This patch addresses the performance concerns
with mergeable receive buffers by allocating MTU-sized packet
buffers using page frag allocators. If more than MAX_SKB_FRAGS
buffers are needed, the SKB frag_list is used.
Signed-off-by: Michael Dalton <mwdalton@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a virtqueue_get_buf() call returns a NULL pointer a possibly endless while
loop should be avoided by checking for a broken virtqueue.
Signed-off-by: Heinz Graalfs <graalfs@linux.vnet.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Verify if a host kick succeeded by checking return value of virtqueue_kick().
Signed-off-by: Heinz Graalfs <graalfs@linux.vnet.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We used to schedule the refill work unconditionally after changing the
number of queues. This may lead an issue if the device is not
up. Since we only try to cancel the work in ndo_stop(), this may cause
the refill work still work after removing the device. Fix this by only
schedule the work when device is up.
The bug were introduce by commit 9b9cd8024a.
(virtio-net: fix the race between channels setting and refill)
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We're trying to re-configure the affinity unconditionally in cpu hotplug
callback. This may lead the issue during resuming from s3/s4 since
- virt queues haven't been allocated at that time.
- it's unnecessary since thaw method will re-configure the affinity.
Fix this issue by checking the config_enable and do nothing is we're not ready.
The bug were introduced by commit 8de4b2f3ae
(virtio-net: reset virtqueue affinity when doing cpu hotplug).
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This lets the transport do endian conversion if necessary, and insulates
the drivers from the difference.
Most drivers can use the simple helpers virtio_cread() and virtio_cwrite().
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The freeze and restore functions defined in virtio drivers are used
for suspend and hibernate, so CONFIG_PM_SLEEP is more appropriate than
CONFIG_PM. This patch replace all CONFIG_PM with CONFIG_PM_SLEEP for
virtio drivers that implement freeze and restore callbacks.
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Reviewed-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
If the VIRTIO_NET_F_GUEST_CSUM virtio feature is available, the guest
does not have to calculate the checksums on all received packets. This
is pretty much the same feature as RX checksum offloading on real
network cards, so the virtio-net driver should report this by setting
the NETIF_F_RXCSUM flag. When the user now runs "ethtool -k", he or she
can see whether the virtio-net interface has to calculate RX checksums
or not.
Signed-off-by: Thomas Huth <thuth@linux.vnet.ibm.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
For small packets we can simplify xmit processing
by linearizing buffers with the header:
most packets seem to have enough head room
we can use for this purpose.
Since existing hypervisors require that header
is the first s/g element, we need a feature bit
for this.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
virtio net called virtqueue_enable_cq on RX path after napi_complete, so
with NAPI_STATE_SCHED clear - outside the implicit napi lock.
This violates the requirement to synchronize virtqueue_enable_cq wrt
virtqueue_add_buf. In particular, used event can move backwards,
causing us to lose interrupts.
In a debug build, this can trigger panic within START_USE.
Jason Wang reports that he can trigger the races artificially,
by adding udelay() in virtqueue_enable_cb() after virtio_mb().
However, we must call napi_complete to clear NAPI_STATE_SCHED before
polling the virtqueue for used buffers, otherwise napi_schedule_prep in
a callback will fail, causing us to lose RX events.
To fix, call virtqueue_enable_cb_prepare with NAPI_STATE_SCHED
set (under napi lock), later call virtqueue_poll with
NAPI_STATE_SCHED clear (outside the lock).
Reported-by: Jason Wang <jasowang@redhat.com>
Tested-by: Jason Wang <jasowang@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 55257d72bd (virtio-net: fill only rx queues
which are being used) tries to refill on demand when changing the number of
channels by call try_refill_recv() directly, this may race:
- the refill work who may do the refill in the same time
- the try_refill_recv() called in bh since napi was not disabled
Which may led guest complain during setting channels:
virtio_net virtio0: input.1:id 0 is not a head!
Solve this issue by scheduling a refill work which can guarantee the
serialization of refill.
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Commit 55257d72bd (virtio-net: fill only rx
queues which are being used) only does the napi enabling during open for
curr_queue_pairs. This will break multiqueue receiving since napi of new queues
were still disabled after changing the number of queues.
This patch fixes this by enabling napi for all possible queues during open.
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit 82dc3c63c6 ("net: introduce NAPI_POLL_WEIGHT")
we warn drivers when they use napi weight higher than NAPI_POLL_WEIGHT,
but virtio_net still uses 128 by default. This patch makes its default
value to NAPI_POLL_WEIGHT.
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I dived into lguest again, reworking the pagetable code so we can move
the switcher page: our fixmaps sometimes take more than 2MB now...
Cheers,
Rusty.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJRga7lAAoJENkgDmzRrbjx/yIQAKpqIBtxOJeYH3SY+Uoe7Cfp
toNYcpJEldvb0UcWN8M2cSZpHoxl1SUoq9djwcM29tcKa7EZAjHaGtb/Q1qMTDgv
+B3WAfiGU2pmXFxLAkbrlLNGnysy24JspqJQ5hcYV84EiBxQdZp+nCYgOphd+GMK
ww16vo9ya8jFjzt3GeRp/Heb3vEzV4Cp6BC3i0m8A3WNpEpbRb66pqXNk5o8ggJO
SxQOKSXmUM+0m+jKSul5xn3e2Ls2LOrZZ8/DIHA+gW66N4Zab7n2/j1Q9VRxb4lh
FqnR7KwgBX8OCh9IsBDqQYS7MohvMYge6eUdLtFrq84jvMleMEhrC8q9v2tucFUb
5t18CLwvyK7Gdg6UCKiZ7YSPcuURAILO16al9bh5IseeBDsuX+43VsvQoBmFn9k6
cLOVTZ6BlOmahK5PyRYFSvLa9Rxzr/05Mr7oYq9UgshD9io78dnqczFYIORF53rW
zD7C4HuTZfYJFfNd0wAJ0RfVXnf8QvDlMdo7zPC26DSXNWqj8OexCY0qqSWUB+2F
vcfJP6NkV4fZB8aawWIFUVwc64yqtt2uPVLa7ATZWqk16PgKrchGewmw3tiEwOgu
1l7xgffTRRUIJsqaCZoXdgw3yezcKRjuUBcOxL09lDAAhc+NxWNvzZBsKp66DwDk
yZQKn0OdXnuf0CeEOfFf
=1tYL
-----END PGP SIGNATURE-----
Merge tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull virtio & lguest updates from Rusty Russell:
"Lots of virtio work which wasn't quite ready for last merge window.
Plus I dived into lguest again, reworking the pagetable code so we can
move the switcher page: our fixmaps sometimes take more than 2MB now..."
Ugh. Annoying conflicts with the tcm_vhost -> vhost_scsi rename.
Hopefully correctly resolved.
* tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (57 commits)
caif_virtio: Remove bouncing email addresses
lguest: improve code readability in lg_cpu_start.
virtio-net: fill only rx queues which are being used
lguest: map Switcher below fixmap.
lguest: cache last cpu we ran on.
lguest: map Switcher text whenever we allocate a new pagetable.
lguest: don't share Switcher PTE pages between guests.
lguest: expost switcher_pages array (as lg_switcher_pages).
lguest: extract shadow PTE walking / allocating.
lguest: make check_gpte et. al return bool.
lguest: assume Switcher text is a single page.
lguest: rename switcher_page to switcher_pages.
lguest: remove RESERVE_MEM constant.
lguest: check vaddr not pgd for Switcher protection.
lguest: prepare to make SWITCHER_ADDR a variable.
virtio: console: replace EMFILE with EBUSY for already-open port
virtio-scsi: reset virtqueue affinity when doing cpu hotplug
virtio-scsi: introduce multiqueue support
virtio-scsi: push vq lock/unlock into virtscsi_vq_done
virtio-scsi: pass struct virtio_scsi to virtqueue completion function
...
Due to MQ support we may allocate a whole bunch of rx queues but
never use them. With this patch we'll safe the space used by
the receive buffers until they are actually in use:
sh-4.2# free -h
total used free shared buffers cached
Mem: 490M 35M 455M 0B 0B 4.1M
-/+ buffers/cache: 31M 459M
Swap: 0B 0B 0B
sh-4.2# ethtool -L eth0 combined 8
sh-4.2# free -h
total used free shared buffers cached
Mem: 490M 162M 327M 0B 0B 4.1M
-/+ buffers/cache: 158M 331M
Swap: 0B 0B 0B
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Change the rx_{add,kill}_vid callbacks to take a protocol argument in
preparation of 802.1ad support. The protocol argument used so far is
always htons(ETH_P_8021Q).
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename the hardware VLAN acceleration features to include "CTAG" to indicate
that they only support CTAGs. Follow up patches will introduce 802.1ad
server provider tagging (STAGs) and require the distinction for hardware not
supporting acclerating both.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>