Merge branch 'for-4.14/block-postmerge' of git://git.kernel.dk/linux-…

…block Pull followup block layer updates from Jens Axboe: "I ended up splitting the main pull request for this series into two, mainly because of clashes between NVMe fixes that went into 4.13 after the for-4.14 branches were split off. This pull request is mostly NVMe, but not exclusively. In detail, it contains: - Two pull request for NVMe changes from Christoph. Nothing new on the feature front, basically just fixes all over the map for the core bits, transport, rdma, etc. - Series from Bart, cleaning up various bits in the BFQ scheduler. - Series of bcache fixes, which has been lingering for a release or two. Coly sent this in, but patches from various people in this area. - Set of patches for BFQ from Paolo himself, updating both documentation and fixing some corner cases in performance. - Series from Omar, attempting to now get the 4k loop support correct. Our confidence level is higher this time. - Series from Shaohua for loop as well, improving O_DIRECT performance and fixing a use-after-free" * 'for-4.14/block-postmerge' of git://git.kernel.dk/linux-block: (74 commits) bcache: initialize dirty stripes in flash_dev_run() loop: set physical block size to logical block size bcache: fix bch_hprint crash and improve output bcache: Update continue_at() documentation bcache: silence static checker warning bcache: fix for gc and write-back race bcache: increase the number of open buckets bcache: Correct return value for sysfs attach errors bcache: correct cache_dirty_target in __update_writeback_rate() bcache: gc does not work when triggering by manual command bcache: Don't reinvent the wheel but use existing llist API bcache: do not subtract sectors_to_gc for bypassed IO bcache: fix sequential large write IO bypass bcache: Fix leak of bdev reference block/loop: remove unused field block/loop: fix use after free bfq: Use icq_to_bic() consistently bfq: Suppress compiler warnings about comparisons bfq: Check kstrtoul() return value bfq: Declare local functions static ...
mariux64 · Sep 9, 2017 · 126e76f · 126e76f
2 parents fbd0141 + 175206c
commit 126e76f
Show file tree

Hide file tree

Showing 35 changed files with 1,143 additions and 817 deletions.
diff --git a/Documentation/block/bfq-iosched.txt b/Documentation/block/bfq-iosched.txt
@@ -16,14 +16,16 @@ throughput. So, when needed for achieving a lower latency, BFQ builds
 schedules that may lead to a lower throughput. If your main or only
 goal, for a given device, is to achieve the maximum-possible
 throughput at all times, then do switch off all low-latency heuristics
-for that device, by setting low_latency to 0. Full details in Section 3.
+for that device, by setting low_latency to 0. See Section 3 for
+details on how to configure BFQ for the desired tradeoff between
+latency and throughput, or on how to maximize throughput.
 
 On average CPUs, the current version of BFQ can handle devices
 performing at most ~30K IOPS; at most ~50 KIOPS on faster CPUs. As a
 reference, 30-50 KIOPS correspond to very high bandwidths with
 sequential I/O (e.g., 8-12 GB/s if I/O requests are 256 KB large), and
-to 120-200 MB/s with 4KB random I/O. BFQ has not yet been tested on
-multi-queue devices.
+to 120-200 MB/s with 4KB random I/O. BFQ is currently being tested on
+multi-queue devices too.
 
 The table of contents follow. Impatients can just jump to Section 3.
 
@@ -33,7 +35,7 @@ CONTENTS
  1-1 Personal systems
  1-2 Server systems
 2. How does BFQ work?
-3. What are BFQ's tunable?
+3. What are BFQ's tunables and how to properly configure BFQ?
 4. BFQ group scheduling
  4-1 Service guarantees provided
  4-2 Interface
@@ -145,19 +147,28 @@ plus a lot of code, are borrowed from CFQ.
       contrast, BFQ may idle the device for a short time interval,
       giving the process the chance to go on being served if it issues
       a new request in time. Device idling typically boosts the
-      throughput on rotational devices, if processes do synchronous
-      and sequential I/O. In addition, under BFQ, device idling is
-      also instrumental in guaranteeing the desired throughput
-      fraction to processes issuing sync requests (see the description
-      of the slice_idle tunable in this document, or [1, 2], for more
-      details).
+      throughput on rotational devices and on non-queueing flash-based
+      devices, if processes do synchronous and sequential I/O. In
+      addition, under BFQ, device idling is also instrumental in
+      guaranteeing the desired throughput fraction to processes
+      issuing sync requests (see the description of the slice_idle
+      tunable in this document, or [1, 2], for more details).
 
       - With respect to idling for service guarantees, if several
 	processes are competing for the device at the same time, but
-	all processes (and groups, after the following commit) have
-	the same weight, then BFQ guarantees the expected throughput
-	distribution without ever idling the device. Throughput is
-	thus as high as possible in this common scenario.
+	all processes and groups have the same weight, then BFQ
+	guarantees the expected throughput distribution without ever
+	idling the device. Throughput is thus as high as possible in
+	this common scenario.
+
+     - On flash-based storage with internal queueing of commands
+       (typically NCQ), device idling happens to be always detrimental
+       for throughput. So, with these devices, BFQ performs idling
+       only when strictly needed for service guarantees, i.e., for
+       guaranteeing low latency or fairness. In these cases, overall
+       throughput may be sub-optimal. No solution currently exists to
+       provide both strong service guarantees and optimal throughput
+       on devices with internal queueing.
 
   - If low-latency mode is enabled (default configuration), BFQ
     executes some special heuristics to detect interactive and soft
@@ -191,10 +202,7 @@ plus a lot of code, are borrowed from CFQ.
   - Queues are scheduled according to a variant of WF2Q+, named
     B-WF2Q+, and implemented using an augmented rb-tree to preserve an
     O(log N) overall complexity.  See [2] for more details. B-WF2Q+ is
-    also ready for hierarchical scheduling. However, for a cleaner
-    logical breakdown, the code that enables and completes
-    hierarchical support is provided in the next commit, which focuses
-    exactly on this feature.
+    also ready for hierarchical scheduling, details in Section 4.
 
   - B-WF2Q+ guarantees a tight deviation with respect to an ideal,
     perfectly fair, and smooth service. In particular, B-WF2Q+
@@ -249,13 +257,24 @@ plus a lot of code, are borrowed from CFQ.
   the Idle class, to prevent it from starving.
 
 
-3. What are BFQ's tunable?
-==========================
+3. What are BFQ's tunables and how to properly configure BFQ?
+=============================================================
+
+Most BFQ tunables affect service guarantees (basically latency and
+fairness) and throughput. For full details on how to choose the
+desired tradeoff between service guarantees and throughput, see the
+parameters slice_idle, strict_guarantees and low_latency. For details
+on how to maximise throughput, see slice_idle, timeout_sync and
+max_budget. The other performance-related parameters have been
+inherited from, and have been preserved mostly for compatibility with
+CFQ. So far, no performance improvement has been reported after
+changing the latter parameters in BFQ.
 
-The tunables back_seek-max, back_seek_penalty, fifo_expire_async and
-fifo_expire_sync below are the same as in CFQ. Their description is
-just copied from that for CFQ. Some considerations in the description
-of slice_idle are copied from CFQ too.
+In particular, the tunables back_seek-max, back_seek_penalty,
+fifo_expire_async and fifo_expire_sync below are the same as in
+CFQ. Their description is just copied from that for CFQ. Some
+considerations in the description of slice_idle are copied from CFQ
+too.
 
 per-process ioprio and weight
 -----------------------------
@@ -285,15 +304,17 @@ number of seeks and see improved throughput.
 
 Setting slice_idle to 0 will remove all the idling on queues and one
 should see an overall improved throughput on faster storage devices
-like multiple SATA/SAS disks in hardware RAID configuration.
+like multiple SATA/SAS disks in hardware RAID configuration, as well
+as flash-based storage with internal command queueing (and
+parallelism).
 
 So depending on storage and workload, it might be useful to set
 slice_idle=0.  In general for SATA/SAS disks and software RAID of
 SATA/SAS disks keeping slice_idle enabled should be useful. For any
 configurations where there are multiple spindles behind single LUN
-(Host based hardware RAID controller or for storage arrays), setting
-slice_idle=0 might end up in better throughput and acceptable
-latencies.
+(Host based hardware RAID controller or for storage arrays), or with
+flash-based fast storage, setting slice_idle=0 might end up in better
+throughput and acceptable latencies.
 
 Idling is however necessary to have service guarantees enforced in
 case of differentiated weights or differentiated I/O-request lengths.
@@ -312,13 +333,14 @@ There is an important flipside for idling: apart from the above cases
 where it is beneficial also for throughput, idling can severely impact
 throughput. One important case is random workload. Because of this
 issue, BFQ tends to avoid idling as much as possible, when it is not
-beneficial also for throughput. As a consequence of this behavior, and
-of further issues described for the strict_guarantees tunable,
-short-term service guarantees may be occasionally violated. And, in
-some cases, these guarantees may be more important than guaranteeing
-maximum throughput. For example, in video playing/streaming, a very
-low drop rate may be more important than maximum throughput. In these
-cases, consider setting the strict_guarantees parameter.
+beneficial also for throughput (as detailed in Section 2). As a
+consequence of this behavior, and of further issues described for the
+strict_guarantees tunable, short-term service guarantees may be
+occasionally violated. And, in some cases, these guarantees may be
+more important than guaranteeing maximum throughput. For example, in
+video playing/streaming, a very low drop rate may be more important
+than maximum throughput. In these cases, consider setting the
+strict_guarantees parameter.
 
 strict_guarantees
 -----------------
@@ -420,58 +442,20 @@ The default value is 0, which enables auto-tuning: BFQ sets max_budget
 to the maximum number of sectors that can be served during
 timeout_sync, according to the estimated peak rate.
 
+For specific devices, some users have occasionally reported to have
+reached a higher throughput by setting max_budget explicitly, i.e., by
+setting max_budget to a higher value than 0. In particular, they have
+set max_budget to higher values than those to which BFQ would have set
+it with auto-tuning. An alternative way to achieve this goal is to
+just increase the value of timeout_sync, leaving max_budget equal to 0.
+
 weights
 -------
 
 Read-only parameter, used to show the weights of the currently active
 BFQ queues.
 
 
-wr_ tunables
-------------
-
-BFQ exports a few parameters to control/tune the behavior of
-low-latency heuristics.
-
-wr_coeff
-
-Factor by which the weight of a weight-raised queue is multiplied. If
-the queue is deemed soft real-time, then the weight is further
-multiplied by an additional, constant factor.
-
-wr_max_time
-
-Maximum duration of a weight-raising period for an interactive task
-(ms). If set to zero (default value), then this value is computed
-automatically, as a function of the peak rate of the device. In any
-case, when the value of this parameter is read, it always reports the
-current duration, regardless of whether it has been set manually or
-computed automatically.
-
-wr_max_softrt_rate
-
-Maximum service rate below which a queue is deemed to be associated
-with a soft real-time application, and is then weight-raised
-accordingly (sectors/sec).
-
-wr_min_idle_time
-
-Minimum idle period after which interactive weight-raising may be
-reactivated for a queue (in ms).
-
-wr_rt_max_time
-
-Maximum weight-raising duration for soft real-time queues (in ms). The
-start time from which this duration is considered is automatically
-moved forward if the queue is detected to be still soft real-time
-before the current soft real-time weight-raising period finishes.
-
-wr_min_inter_arr_async
-
-Minimum period between I/O request arrivals after which weight-raising
-may be reactivated for an already busy async queue (in ms).
-
-
 4. Group scheduling with BFQ
 ============================
 

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
@@ -206,7 +206,7 @@ static void bfqg_get(struct bfq_group *bfqg)
 	bfqg->ref++;
 }
 
-void bfqg_put(struct bfq_group *bfqg)
+static void bfqg_put(struct bfq_group *bfqg)
 {
 	bfqg->ref--;
 
@@ -385,7 +385,7 @@ static struct bfq_group_data *blkcg_to_bfqgd(struct blkcg *blkcg)
 	return cpd_to_bfqgd(blkcg_to_cpd(blkcg, &blkcg_policy_bfq));
 }
 
-struct blkcg_policy_data *bfq_cpd_alloc(gfp_t gfp)
+static struct blkcg_policy_data *bfq_cpd_alloc(gfp_t gfp)
 {
 	struct bfq_group_data *bgd;
 
@@ -395,20 +395,20 @@ struct blkcg_policy_data *bfq_cpd_alloc(gfp_t gfp)
 	return &bgd->pd;
 }
 
-void bfq_cpd_init(struct blkcg_policy_data *cpd)
+static void bfq_cpd_init(struct blkcg_policy_data *cpd)
 {
 	struct bfq_group_data *d = cpd_to_bfqgd(cpd);
 
 	d->weight = cgroup_subsys_on_dfl(io_cgrp_subsys) ?
 		CGROUP_WEIGHT_DFL : BFQ_WEIGHT_LEGACY_DFL;
 }
 
-void bfq_cpd_free(struct blkcg_policy_data *cpd)
+static void bfq_cpd_free(struct blkcg_policy_data *cpd)
 {
 	kfree(cpd_to_bfqgd(cpd));
 }
 
-struct blkg_policy_data *bfq_pd_alloc(gfp_t gfp, int node)
+static struct blkg_policy_data *bfq_pd_alloc(gfp_t gfp, int node)
 {
 	struct bfq_group *bfqg;
 
@@ -426,7 +426,7 @@ struct blkg_policy_data *bfq_pd_alloc(gfp_t gfp, int node)
 	return &bfqg->pd;
 }
 
-void bfq_pd_init(struct blkg_policy_data *pd)
+static void bfq_pd_init(struct blkg_policy_data *pd)
 {
 	struct blkcg_gq *blkg = pd_to_blkg(pd);
 	struct bfq_group *bfqg = blkg_to_bfqg(blkg);
@@ -445,15 +445,15 @@ void bfq_pd_init(struct blkg_policy_data *pd)
 	bfqg->rq_pos_tree = RB_ROOT;
 }
 
-void bfq_pd_free(struct blkg_policy_data *pd)
+static void bfq_pd_free(struct blkg_policy_data *pd)
 {
 	struct bfq_group *bfqg = pd_to_bfqg(pd);
 
 	bfqg_stats_exit(&bfqg->stats);
 	bfqg_put(bfqg);
 }
 
-void bfq_pd_reset_stats(struct blkg_policy_data *pd)
+static void bfq_pd_reset_stats(struct blkg_policy_data *pd)
 {
 	struct bfq_group *bfqg = pd_to_bfqg(pd);
 
@@ -740,7 +740,7 @@ static void bfq_reparent_active_entities(struct bfq_data *bfqd,
  * blkio already grabs the queue_lock for us, so no need to use
  * RCU-based magic
  */
-void bfq_pd_offline(struct blkg_policy_data *pd)
+static void bfq_pd_offline(struct blkg_policy_data *pd)
 {
 	struct bfq_service_tree *st;
 	struct bfq_group *bfqg = pd_to_bfqg(pd);