Merge tag 'for-linus-ioctl' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma

Pull rdma updates from Doug Ledford:
 "This is a big pull request.

  Of note is that I'm sending you the new ioctl API for the rdma
  subsystem. We put it up on linux-api@, but didn't get much response.
  The API is complex, but it solves two different problems in one go:

   1) The bi-directional nature of the RDMA file write calls, which
      created the security hole we had to handle (and for which the fix
      is now causing problems for systems in production: we were a bit
      overzealous in the fix, and the ability to open a device, then
      fork, then create new queue pairs on the device and use them is
      broken).

   2) The bloat caused by different vendors implementing extensions to
      the base verbs API. Each vendor's hardware is slightly different,
      and the hardware might be suitable for one extension but not
      another.

      By the time we add generic extensions for all the different ways
      that the different hardware can offload things, the API becomes
      bloated. Things like our completion structs have started to exceed
      a cache line in size because of all the elements needed to support
      this. That in turn shows up heavily in the performance graphs with
      a noticeable drop in performance on 100Gigabit links as our
      completion structs go from occupying one cache line to more than
      one.

      This API makes things like the completion structs modular in a
      very similar way to netlink so that your structs include only the
      items needed for the offloads/features you are actually using on a
      given queue pair. In that way we support everything, but only use
      what we need, and our structs stay smaller.

  The ioctl API is better explained by the posting on linux-api@ than I
  can explain it here, so I'll just leave it at that.
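
  To make the idea concrete, here is a minimal sketch of what a
  netlink-style modular completion could look like. The names and layout
  below are hypothetical, for illustration only; they are not the actual
  uverbs ioctl ABI:

      /* Hypothetical TLV layout, echoing struct nlattr; not the real ABI. */
      #include <stdint.h>

      struct cqe_attr_hdr {
              uint16_t type;  /* which optional completion field follows */
              uint16_t len;   /* payload length in bytes */
              /* payload follows, padded to 4 bytes, as with netlink */
      };

      struct cqe_base {
              uint64_t wr_id;   /* fields every consumer needs ... */
              uint32_t status;
              uint32_t nattrs;  /* ... plus a count of trailing TLVs */
              /* struct cqe_attr_hdr attrs[] follow only for features
               * (e.g. timestamps) enabled on this queue pair, so the
               * common case stays within one cache line */
      };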

  The rest of the pull request is typical stuff.

  Updates for 4.14 kernel merge window

   - Lots of hfi1 driver updates (mixed with a few qib and core updates
     as well)

   - rxe updates

   - various mlx updates

   - Set default RoCE type to RoCEv2

   - Several larger fixes for bnxt_re that were too big for -rc

   - Several larger fixes for qedr that, likewise, were too big for -rc

   - Misc core changes

   - Make the hns_roce driver compilable on arches other than aarch64 so
     we can more easily debug build issues related to it

   - Add rdma-netlink infrastructure updates

   - Add automatic IRQ affinity infrastructure

   - Add 32-bit LID support

   - Lots of misc fixes across the subsystem from random people

   - Autoloading of RDMA netlink modules

   - PCI pool cleanups from Romain Perier

   - mlx5 driver feature additions and fixes

   - Hardware tag matching feature

   - Fix sleeping in atomic context when resolving a RoCE AH

   - Add experimental ioctl interface as posted to linux-api@"

* tag 'for-linus-ioctl' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (328 commits)
  IB/core: Expose ioctl interface through experimental Kconfig
  IB/core: Assign root to all drivers
  IB/core: Add completion queue (cq) object actions
  IB/core: Add legacy driver's user-data
  IB/core: Export ioctl enum types to user-space
  IB/core: Explicitly destroy an object while keeping uobject
  IB/core: Add macros for declaring methods and attributes
  IB/core: Add uverbs merge trees functionality
  IB/core: Add DEVICE object and root tree structure
  IB/core: Declare an object instead of declaring only type attributes
  IB/core: Add new ioctl interface
  RDMA/vmw_pvrdma: Fix a signedness
  RDMA/vmw_pvrdma: Report network header type in WC
  IB/core: Add might_sleep() annotation to ib_init_ah_from_wc()
  IB/cm: Fix sleeping in atomic when RoCE is used
  IB/core: Add support to finalize objects in one transaction
  IB/core: Add a generic way to execute an operation on a uobject
  Documentation: Hardware tag matching
  IB/mlx5: Support IB_SRQT_TM
  net/mlx5: Add XRQ support
  ...
Linus Torvalds committed Sep 4, 2017
2 parents 906dde0 + 8eb19e8 commit aa9d464
Showing 275 changed files with 14,866 additions and 5,221 deletions.
64 changes: 64 additions & 0 deletions Documentation/infiniband/tag_matching.txt
@@ -0,0 +1,64 @@
Tag matching logic

The MPI standard defines a set of rules, known as tag matching, for matching
source send operations to destination receives. The following parameters must
match between the source and the destination:
* Communicator
* User tag - a wild card may be specified by the receiver
* Source rank - a wild card may be specified by the receiver
* Destination rank
The ordering rules require that when more than one pair of send and receive
message envelopes may match, the pair that includes the earliest posted send
and the earliest posted receive is the pair that must be used to satisfy the
matching operation. However, this does not imply that tags are consumed in
the order they are created; e.g., a later generated tag may be consumed if
earlier tags cannot be used to satisfy the matching rules.
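
As an illustration, the matching predicate itself is just a few comparisons.
A minimal C sketch, with made-up names (tm_match, ANY_TAG and ANY_SOURCE
stand in for the MPI wildcards):

    #include <stdbool.h>

    #define ANY_SOURCE (-1) /* stands in for MPI_ANY_SOURCE */
    #define ANY_TAG    (-2) /* stands in for MPI_ANY_TAG */

    struct envelope {
            int comm;       /* communicator id */
            int tag;        /* user tag */
            int source;     /* source rank */
    };

    /* The destination rank matches implicitly: the message was delivered
     * to this rank. Only the receiver may use wildcards. */
    static bool tm_match(const struct envelope *send,
                         const struct envelope *recv)
    {
            return send->comm == recv->comm &&
                   (recv->tag == ANY_TAG || send->tag == recv->tag) &&
                   (recv->source == ANY_SOURCE || send->source == recv->source);
    }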

When a message is sent from the sender to the receiver, the communication
library may attempt to process the operation either before or after the
corresponding matching receive is posted. If a matching receive is already
posted, this is an expected message; otherwise it is called an unexpected
message. Implementations frequently use different matching schemes for these
two cases.

To keep MPI library memory footprint down, MPI implementations typically use
two different protocols for this purpose:

1. The Eager protocol - the complete message is sent when the send is
processed by the sender. A send completion is received in the send_cq,
notifying that the buffer can be reused.

2. The Rendezvous protocol - the sender sends the tag-matching header,
and perhaps a portion of the data, when first notifying the receiver. When
the corresponding buffer is posted, the responder will use the information
from the header to initiate an RDMA READ operation directly to the matching
buffer. A FIN message needs to be received in order for the buffer to be
reused.
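
The choice between the two protocols is typically a size threshold. A
minimal sketch, assuming a configurable eager limit (the names here are
illustrative, not UCX's):

    #include <stddef.h>

    enum tm_proto { TM_EAGER, TM_RNDV };

    /* Small messages go out in full; large ones send only the
     * tag-matching header and are fetched by RDMA READ once the
     * receiver posts a matching buffer. */
    static enum tm_proto tm_choose_proto(size_t msg_len, size_t eager_limit)
    {
            return msg_len <= eager_limit ? TM_EAGER : TM_RNDV;
    }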

Tag matching implementation

There are two types of matching objects used: the posted receive list and
the unexpected message list. The application posts receive buffers through
calls to the MPI receive routines, which are added to the posted receive
list, and posts send messages using the MPI send routines. The head of the
posted receive list may be maintained by the hardware, with the software
expected to shadow this list.

When a send is initiated and arrives at the receive side, if there is no
pre-posted receive for this arriving message, it is passed to the software
and placed in the unexpected message list. Otherwise the match is processed,
including rendezvous processing if appropriate, delivering the data to the
specified receive buffer. This allows receive-side MPI tag matching to
overlap with computation.

When a receive message is posted, the communication library will first check
the software unexpected message list for a matching send. If a match is
found, data is delivered to the user buffer using a software-controlled
protocol; the UCX implementation uses either an eager or a rendezvous
protocol, depending on data size. If no match is found, and if the entire
pre-posted receive list is maintained by the hardware and there is space to
add one more pre-posted receive to this list, the receive is passed to the
hardware. Software is expected to shadow this list, to help with processing
MPI cancel operations. In addition, because hardware and software are not
expected to be tightly synchronized with respect to the tag-matching
operation, this shadow list is used to detect the case where a pre-posted
receive is passed to the hardware while the matching unexpected message is
being passed from the hardware to the software.
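
Putting the above together, the receive-posting decision can be sketched as
follows. Every helper named here is hypothetical, standing in for library
internals; none of this is a kernel or UCX API:

    #include <stdbool.h>

    struct tm_recv;  /* a posted receive: envelope, buffer, length */
    struct tm_msg;   /* an arrived, not yet matched message */

    extern struct tm_msg *unexpected_list_find(struct tm_recv *recv);
    extern int deliver_in_software(struct tm_msg *msg, struct tm_recv *recv);
    extern bool hw_owns_posted_list(void);
    extern bool hw_list_has_room(void);
    extern void shadow_list_add(struct tm_recv *recv);
    extern int hw_post_receive(struct tm_recv *recv);
    extern int sw_list_add(struct tm_recv *recv);

    int tm_post_receive(struct tm_recv *recv)
    {
            struct tm_msg *msg;

            /* 1. Check for an already-arrived (unexpected) message. */
            msg = unexpected_list_find(recv);
            if (msg)
                    return deliver_in_software(msg, recv); /* eager or rendezvous */

            /* 2. Offload only if hardware owns the whole posted list and
             *    has room for one more entry; shadow it in software for
             *    MPI cancel and for the race described above. */
            if (hw_owns_posted_list() && hw_list_has_room()) {
                    shadow_list_add(recv);
                    return hw_post_receive(recv);
            }

            /* 3. Otherwise keep the receive in the software posted list. */
            return sw_list_add(recv);
    }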
5 changes: 5 additions & 0 deletions block/Kconfig
@@ -206,4 +206,9 @@ config BLK_MQ_VIRTIO
 	depends on BLOCK && VIRTIO
 	default y
 
+config BLK_MQ_RDMA
+	bool
+	depends on BLOCK && INFINIBAND
+	default y
+
 source block/Kconfig.iosched
1 change: 1 addition & 0 deletions block/Makefile
@@ -29,6 +29,7 @@ obj-$(CONFIG_BLK_CMDLINE_PARSER) += cmdline-parser.o
 obj-$(CONFIG_BLK_DEV_INTEGRITY) += bio-integrity.o blk-integrity.o t10-pi.o
 obj-$(CONFIG_BLK_MQ_PCI) += blk-mq-pci.o
 obj-$(CONFIG_BLK_MQ_VIRTIO) += blk-mq-virtio.o
+obj-$(CONFIG_BLK_MQ_RDMA) += blk-mq-rdma.o
 obj-$(CONFIG_BLK_DEV_ZONED) += blk-zoned.o
 obj-$(CONFIG_BLK_WBT) += blk-wbt.o
 obj-$(CONFIG_BLK_DEBUG_FS) += blk-mq-debugfs.o
52 changes: 52 additions & 0 deletions block/blk-mq-rdma.c
@@ -0,0 +1,52 @@
/*
 * Copyright (c) 2017 Sagi Grimberg.
 *
 * This program is free software; you can redistribute it and/or modify it
 * under the terms and conditions of the GNU General Public License,
 * version 2, as published by the Free Software Foundation.
 *
 * This program is distributed in the hope it will be useful, but WITHOUT
 * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
 * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
 * more details.
 */
#include <linux/blk-mq.h>
#include <linux/blk-mq-rdma.h>
#include <rdma/ib_verbs.h>

/**
 * blk_mq_rdma_map_queues - provide a default queue mapping for an rdma device
 * @set:	tagset to provide the mapping for
 * @dev:	rdma device associated with @set.
 * @first_vec:	first interrupt vector to use for queues (usually 0)
 *
 * This function assumes the rdma device @dev has at least as many available
 * interrupt vectors as @set has queues. It will then query its affinity mask
 * and build a queue mapping that maps each queue to the CPUs that have irq
 * affinity for the corresponding vector.
 *
 * In case either the driver passed a @dev with fewer vectors than
 * @set->nr_hw_queues, or @dev does not provide an affinity mask for a
 * vector, we fall back to the naive mapping.
 */
int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
		struct ib_device *dev, int first_vec)
{
	const struct cpumask *mask;
	unsigned int queue, cpu;

	for (queue = 0; queue < set->nr_hw_queues; queue++) {
		mask = ib_get_vector_affinity(dev, first_vec + queue);
		if (!mask)
			goto fallback;

		for_each_cpu(cpu, mask)
			set->mq_map[cpu] = queue;
	}

	return 0;

fallback:
	return blk_mq_map_queues(set);
}
EXPORT_SYMBOL_GPL(blk_mq_rdma_map_queues);
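
A caller wires this into its tag set through the map_queues callback. Below
is a minimal sketch modeled on how an RDMA block driver (e.g. nvme-rdma)
might use it; the my_rdma_ctrl structure and callback name are illustrative:

    #include <linux/blk-mq.h>
    #include <linux/blk-mq-rdma.h>
    #include <rdma/ib_verbs.h>

    struct my_rdma_ctrl {
            struct ib_device *ibdev; /* device the queues are bound to */
    };

    /* .map_queues callback for a blk_mq_tag_set whose driver_data
     * points at the controller above. */
    static int my_rdma_map_queues(struct blk_mq_tag_set *set)
    {
            struct my_rdma_ctrl *ctrl = set->driver_data;

            /* first_vec of 0 assumes I/O queues start at the device's
             * first completion vector; a driver that reserves vector 0
             * for admin/CM events would pass 1 instead. */
            return blk_mq_rdma_map_queues(set, ctrl->ibdev, 0);
    }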
9 changes: 9 additions & 0 deletions drivers/infiniband/Kconfig
@@ -34,6 +34,15 @@ config INFINIBAND_USER_ACCESS
 	  libibverbs, libibcm and a hardware driver library from
 	  <http://www.openfabrics.org/git/>.
 
+config INFINIBAND_EXP_USER_ACCESS
+	bool "Allow experimental support for Infiniband ABI"
+	depends on INFINIBAND_USER_ACCESS
+	---help---
+	  IOCTL-based ABI support for Infiniband. This allows userspace
+	  to invoke the experimental IOCTL-based ABI. These commands are
+	  parsed via a per-device parsing tree and enable per-device
+	  features.
+
 config INFINIBAND_USER_MEM
 	bool
 	depends on INFINIBAND_USER_ACCESS != n
6 changes: 4 additions & 2 deletions drivers/infiniband/core/Makefile
@@ -11,7 +11,8 @@ ib_core-y := packer.o ud_header.o verbs.o cq.o rw.o sysfs.o \
 	device.o fmr_pool.o cache.o netlink.o \
 	roce_gid_mgmt.o mr_pool.o addr.o sa_query.o \
 	multicast.o mad.o smi.o agent.o mad_rmpp.o \
-	security.o
+	security.o nldev.o
 
 ib_core-$(CONFIG_INFINIBAND_USER_MEM) += umem.o
 ib_core-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += umem_odp.o umem_rbtree.o
 ib_core-$(CONFIG_CGROUP_RDMA) += cgroup.o
@@ -31,4 +32,5 @@ ib_umad-y := user_mad.o
 ib_ucm-y := ucm.o
 
 ib_uverbs-y := uverbs_main.o uverbs_cmd.o uverbs_marshall.o \
-	rdma_core.o uverbs_std_types.o
+	rdma_core.o uverbs_std_types.o uverbs_ioctl.o \
+	uverbs_ioctl_merge.o
12 changes: 5 additions & 7 deletions drivers/infiniband/core/addr.c
@@ -130,13 +130,11 @@ static void ib_nl_process_good_ip_rsep(const struct nlmsghdr *nlh)
 }
 
 int ib_nl_handle_ip_res_resp(struct sk_buff *skb,
-			     struct netlink_callback *cb)
+			     struct nlmsghdr *nlh,
+			     struct netlink_ext_ack *extack)
 {
-	const struct nlmsghdr *nlh = (struct nlmsghdr *)cb->nlh;
-
 	if ((nlh->nlmsg_flags & NLM_F_REQUEST) ||
-	    !(NETLINK_CB(skb).sk) ||
-	    !netlink_capable(skb, CAP_NET_ADMIN))
+	    !(NETLINK_CB(skb).sk))
 		return -EPERM;
 
 	if (ib_nl_is_good_ip_resp(nlh))
@@ -186,7 +184,7 @@ static int ib_nl_ip_send_msg(struct rdma_dev_addr *dev_addr,
 
 	/* Repair the nlmsg header length */
 	nlmsg_end(skb, nlh);
-	ibnl_multicast(skb, nlh, RDMA_NL_GROUP_LS, GFP_KERNEL);
+	rdma_nl_multicast(skb, RDMA_NL_GROUP_LS, GFP_KERNEL);
 
 	/* Make the request retry, so when we get the response from userspace
 	 * we will have something.
@@ -326,7 +324,7 @@ static void queue_req(struct addr_req *req)
 static int ib_nl_fetch_ha(struct dst_entry *dst, struct rdma_dev_addr *dev_addr,
 			  const void *daddr, u32 seq, u16 family)
 {
-	if (ibnl_chk_listeners(RDMA_NL_GROUP_LS))
+	if (rdma_nl_chk_listeners(RDMA_NL_GROUP_LS))
 		return -EADDRNOTAVAIL;
 
 	/* We fill in what we can, the response will fill the rest */
23 changes: 8 additions & 15 deletions drivers/infiniband/core/cache.c
@@ -1199,30 +1199,23 @@ int ib_cache_setup_one(struct ib_device *device)
 	device->cache.ports =
 		kzalloc(sizeof(*device->cache.ports) *
 			(rdma_end_port(device) - rdma_start_port(device) + 1), GFP_KERNEL);
-	if (!device->cache.ports) {
-		err = -ENOMEM;
-		goto out;
-	}
+	if (!device->cache.ports)
+		return -ENOMEM;
 
 	err = gid_table_setup_one(device);
-	if (err)
-		goto out;
+	if (err) {
+		kfree(device->cache.ports);
+		device->cache.ports = NULL;
+		return err;
+	}
 
 	for (p = 0; p <= rdma_end_port(device) - rdma_start_port(device); ++p)
 		ib_cache_update(device, p + rdma_start_port(device), true);
 
 	INIT_IB_EVENT_HANDLER(&device->cache.event_handler,
 			      device, ib_cache_event);
-	err = ib_register_event_handler(&device->cache.event_handler);
-	if (err)
-		goto err;
-
+	ib_register_event_handler(&device->cache.event_handler);
 	return 0;
-
-err:
-	gid_table_cleanup_one(device);
-out:
-	return err;
 }
 
 void ib_cache_release_one(struct ib_device *device)