1. 02 Nov, 2017 1 commit
    • Greg Kroah-Hartman's avatar
      License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman authored
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      How this work was done:
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information it it.
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information,
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      The analysis to determine which SPDX License Identifier to be applied to
      a file was done in a spreadsheet of side by side results from of the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      Criteria used to select files for SPDX license identifier tagging was:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
      All documentation files were explicitly excluded.
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
         For non */uapi/* files that summary was:
         SPDX license identifier                            # files
         GPL-2.0                                              11139
         and resulted in the first patch in this series.
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
         SPDX license identifier                            # files
         GPL-2.0 WITH Linux-syscall-note                        930
         and resulted in the second patch in this series.
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
         SPDX license identifier                            # files
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
         and that resulted in the third patch in this series.
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there was new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      In initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
      Reviewed-by: default avatarKate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: default avatarPhilippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  2. 02 Mar, 2017 2 commits
  3. 10 Feb, 2017 1 commit
  4. 17 Jan, 2017 2 commits
  5. 07 Nov, 2015 1 commit
    • Mel Gorman's avatar
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep... · d0164adc
      Mel Gorman authored
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
      __GFP_WAIT has been used to identify atomic context in callers that hold
      spinlocks or are in interrupts.  They are expected to be high priority and
      have access one of two watermarks lower than "min" which can be referred
      to as the "atomic reserve".  __GFP_HIGH users get access to the first
      lower watermark and can be called the "high priority reserve".
      Over time, callers had a requirement to not block when fallback options
      were available.  Some have abused __GFP_WAIT leading to a situation where
      an optimisitic allocation with a fallback option can access atomic
      This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
      cannot sleep and have no alternative.  High priority users continue to use
      __GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
      are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM to identify
      callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
      redefined as a caller that is willing to enter direct reclaim and wake
      kswapd for background reclaim.
      This patch then converts a number of sites
      o __GFP_ATOMIC is used by callers that are high priority and have memory
        pools for those requests. GFP_ATOMIC uses this flag.
      o Callers that have a limited mempool to guarantee forward progress clear
        __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
        into this category where kswapd will still be woken but atomic reserves
        are not used as there is a one-entry mempool to guarantee progress.
      o Callers that are checking if they are non-blocking should use the
        helper gfpflags_allow_blocking() where possible. This is because
        checking for __GFP_WAIT as was done historically now can trigger false
        positives. Some exceptions like dm-crypt.c exist where the code intent
        is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
        flag manipulations.
      o Callers that built their own GFP flags instead of starting with GFP_KERNEL
        and friends now also need to specify __GFP_KSWAPD_RECLAIM.
      The first key hazard to watch out for is callers that removed __GFP_WAIT
      and was depending on access to atomic reserves for inconspicuous reasons.
      In some cases it may be appropriate for them to use __GFP_HIGH.
      The second key hazard is callers that assembled their own combination of
      GFP flags instead of starting with something like GFP_KERNEL.  They may
      now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
      if it's missed in most cases as other activity will wake kswapd.
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  6. 18 Feb, 2014 1 commit
  7. 09 Nov, 2013 1 commit
  8. 11 Sep, 2013 1 commit
    • Jan Kara's avatar
      lib/radix-tree.c: make radix_tree_node_alloc() work correctly within interrupt · 5e4c0d97
      Jan Kara authored
      With users of radix_tree_preload() run from interrupt (block/blk-ioc.c is
      one such possible user), the following race can happen:
          if (rtp->nr) {
            ret = rtp->nodes[rtp->nr - 1];
          if (rtp->nr) {
            ret = rtp->nodes[rtp->nr - 1];
      And we give out one radix tree node twice.  That clearly results in radix
      tree corruption with different results (usually OOPS) depending on which
      two users of radix tree race.
      We fix the problem by making radix_tree_node_alloc() always allocate fresh
      radix tree nodes when in interrupt.  Using preloading when in interrupt
      doesn't make sense since all the allocations have to be atomic anyway and
      we cannot steal nodes from process-context users because some users rely
      on radix_tree_insert() succeeding after radix_tree_preload().
      in_interrupt() check is somewhat ugly but we cannot simply key off passed
      gfp_mask as that is acquired from root_gfp_mask() and thus the same for
      all preload users.
      Another part of the fix is to avoid node preallocation in
      radix_tree_preload() when passed gfp_mask doesn't allow waiting.  Again,
      preallocation in such case doesn't make sense and when preallocation would
      happen in interrupt we could possibly leak some allocated nodes.  However,
      some users of radix_tree_preload() require following radix_tree_insert()
      to succeed.  To avoid unexpected effects for these users,
      radix_tree_preload() only warns if passed gfp mask doesn't allow waiting
      and we provide a new function radix_tree_maybe_preload() for those users
      which get different gfp mask from different call sites and which are
      prepared to handle radix_tree_insert() failure.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Cc: Jens Axboe <jaxboe@fusionio.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  9. 14 May, 2013 1 commit
    • Viresh Kumar's avatar
      block: queue work on power efficient wq · 695588f9
      Viresh Kumar authored
      Block layer uses workqueues for multiple purposes. There is no real dependency
      of scheduling these on the cpu which scheduled them.
      On a idle system, it is observed that and idle cpu wakes up many times just to
      service this work. It would be better if we can schedule it on a cpu which the
      scheduler believes to be the most appropriate one.
      This patch replaces normal workqueues with power efficient versions.
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarViresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
  10. 28 Feb, 2013 1 commit
    • Sasha Levin's avatar
      hlist: drop the node parameter from iterators · b67bfe0d
      Sasha Levin authored
      I'm not sure why, but the hlist for each entry iterators were conceived
              list_for_each_entry(pos, head, member)
      The hlist ones were greedy and wanted an extra parameter:
              hlist_for_each_entry(tpos, pos, head, member)
      Why did they need an extra pos parameter? I'm not quite sure. Not only
      they don't really need it, it also prevents the iterator from looking
      exactly like the list iterator, which is unfortunate.
      Besides the semantic patch, there was some manual work required:
       - Fix up the actual hlist iterators in linux/list.h
       - Fix up the declaration of other iterators based on the hlist ones.
       - A very small amount of places were using the 'node' parameter, this
       was modified to use 'obj->member' instead.
       - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
       properly, so those had to be fixed up manually.
      The semantic patch which is mostly the work of Peter Senna Tschudin is here:
      iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
      type T;
      expression a,c,d,e;
      identifier b;
      statement S;
      -T b;
          <+... when != b
      - b,
      c, d) S
      - b,
      c) S
      - b,
      c) S
      - b,
      c, d) S
      - b,
      c, d) S
      - b,
      c) S
      for_each_busy_worker(a, c,
      - b,
      d) S
      - b,
      c) S
      - b,
      c) S
      - b,
      c) S
      - b,
      c) S
      - b,
      c) S
      - b,
      c) S
      -(a, b)
      + sk_for_each_from(a) S
      - b,
      c, d) S
      - b,
      c) S
      - b,
      c, d, e) S
      - b,
      c) S
      - b,
      c) S
      - b,
      c, d) S
      - b,
      c) S
      - b,
      c, d) S
      - for_each_gfn_sp(a, c, d, b) S
      + for_each_gfn_sp(a, c, d) S
      - for_each_gfn_indirect_valid_sp(a, c, d, b) S
      + for_each_gfn_indirect_valid_sp(a, c, d) S
      - b,
      c) S
      - b,
      c, d) S
      - b,
      c, d) S
      [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
      [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: fix warnings]
      [akpm@linux-foudnation.org: redo intrusive kvm changes]
      Tested-by: default avatarPeter Senna Tschudin <peter.senna@gmail.com>
      Acked-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  11. 01 Aug, 2012 1 commit
    • Olof Johansson's avatar
      block: uninitialized ioc->nr_tasks triggers WARN_ON · 4638a83e
      Olof Johansson authored
      I'm using the old-fashioned 'dump' backup tool, and I noticed that it spews the
      below warning as of 3.5-rc1 and later (3.4 is fine):
      [   10.886893] ------------[ cut here ]------------
      [   10.886904] WARNING: at include/linux/iocontext.h:140 copy_process+0x1488/0x1560()
      [   10.886905] Hardware name: Bochs
      [   10.886906] Modules linked in:
      [   10.886908] Pid: 2430, comm: dump Not tainted 3.5.0-rc7+ #27
      [   10.886908] Call Trace:
      [   10.886911]  [<ffffffff8107ce8a>] warn_slowpath_common+0x7a/0xb0
      [   10.886912]  [<ffffffff8107ced5>] warn_slowpath_null+0x15/0x20
      [   10.886913]  [<ffffffff8107c088>] copy_process+0x1488/0x1560
      [   10.886914]  [<ffffffff8107c244>] do_fork+0xb4/0x340
      [   10.886918]  [<ffffffff8108effa>] ? recalc_sigpending+0x1a/0x50
      [   10.886919]  [<ffffffff8108f6b2>] ? __set_task_blocked+0x32/0x80
      [   10.886920]  [<ffffffff81091afa>] ? __set_current_blocked+0x3a/0x60
      [   10.886923]  [<ffffffff81051db3>] sys_clone+0x23/0x30
      [   10.886925]  [<ffffffff8179bd73>] stub_clone+0x13/0x20
      [   10.886927]  [<ffffffff8179baa2>] ? system_call_fastpath+0x16/0x1b
      [   10.886928] ---[ end trace 32a14af7ee6a590b ]---
      Reproducing is easy, I can hit it on a KVM system with a very basic
      config (x86_64 make defconfig + enable the drivers needed). To hit it,
      just install dump (on debian/ubuntu, not sure what the package might be
      called on Fedora), and:
      dump -o -f /tmp/foo /
      You'll see the warning in dmesg once it forks off the I/O process and
      starts dumping filesystem contents.
      I bisected it down to the following commit:
      commit f6e8d01b
      Author: Tejun Heo <tj@kernel.org>
      Date:   Mon Mar 5 13:15:26 2012 -0800
          block: add io_context->active_ref
          Currently ioc->nr_tasks is used to decide two things - whether an ioc
          is done issuing IOs and whether it's shared by multiple tasks.  This
          patch separate out the first into ioc->active_ref, which is acquired
          and released using {get|put}_io_context_active() respectively.
          This will be used to associate bio's with a given task.  This patch
          doesn't introduce any visible behavior change.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
          Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      It seems like the init of ioc->nr_tasks was removed in that patch,
      so it starts out at 0 instead of 1.
      Tejun, is the right thing here to add back the init, or should something else
      be done?
      The below patch removes the warning, but I haven't done any more extensive
      testing on it.
      Signed-off-by: default avatarOlof Johansson <olof@lixom.net>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Cc: stable@kernel.org
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
  12. 31 May, 2012 1 commit
    • Eric Dumazet's avatar
      block: avoid infinite loop in get_task_io_context() · 3c9c708c
      Eric Dumazet authored
      Calling get_task_io_context() on a exiting task which isn't %current can
      loop forever. This triggers at boot time on my dev machine.
      BUG: soft lockup - CPU#3 stuck for 22s ! [mountall.1603]
      Fix this by making create_task_io_context() returns -EBUSY in this case
      to break the loop.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Alan Cox <alan@linux.intel.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
  13. 20 Mar, 2012 1 commit
  14. 14 Mar, 2012 1 commit
    • Xiaotian Feng's avatar
      block: fix ioc leak in put_io_context · ff8c1474
      Xiaotian Feng authored
      When put_io_context is called, if ioc->icq_list is empty and refcount
      is 1, kernel will not free the ioc.
      This is caught by following kmemleak:
      unreferenced object 0xffff880036349fe0 (size 216):
        comm "sh", pid 2137, jiffies 4294931140 (age 290579.412s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
          01 00 01 00 ad 4e ad de ff ff ff ff 00 00 00 00  .....N..........
          [<ffffffff8169f926>] kmemleak_alloc+0x26/0x50
          [<ffffffff81195a9c>] kmem_cache_alloc_node+0x1cc/0x2a0
          [<ffffffff81356b67>] create_io_context_slowpath+0x27/0x130
          [<ffffffff81356d2b>] get_task_io_context+0xbb/0xf0
          [<ffffffff81055f0e>] copy_process+0x188e/0x18b0
          [<ffffffff8105609b>] do_fork+0x11b/0x420
          [<ffffffff810247f8>] sys_clone+0x28/0x30
          [<ffffffff816d3373>] stub_clone+0x13/0x20
          [<ffffffffffffffff>] 0xffffffffffffffff
      ioc should be freed if ioc->icq_list is empty.
      Signed-off-by: default avatarXiaotian Feng <dannyfeng@tencent.com>
      Acked-by: default avatarVivek Goyal <vgoyal@redhat.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
  15. 06 Mar, 2012 2 commits
    • Tejun Heo's avatar
      block: add io_context->active_ref · f6e8d01b
      Tejun Heo authored
      Currently ioc->nr_tasks is used to decide two things - whether an ioc
      is done issuing IOs and whether it's shared by multiple tasks.  This
      patch separate out the first into ioc->active_ref, which is acquired
      and released using {get|put}_io_context_active() respectively.
      This will be used to associate bio's with a given task.  This patch
      doesn't introduce any visible behavior change.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tejun Heo's avatar
      block: interface update for ioc/icq creation functions · 24acfc34
      Tejun Heo authored
      Make the following interface updates to prepare for future ioc related
      * create_io_context() returning ioc only works for %current because it
        doesn't increment ref on the ioc.  Drop @task parameter from it and
        always assume %current.
      * Make create_io_context_slowpath() return 0 or -errno and rename it
        to create_task_io_context().
      * Make ioc_create_icq() take @ioc as parameter instead of assuming
        that of %current.  The caller, get_request(), is updated to create
        ioc explicitly and then pass it into ioc_create_icq().
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
  16. 15 Feb, 2012 3 commits
    • Tejun Heo's avatar
      block: exit_io_context() should call elevator_exit_icq_fn() · 621032ad
      Tejun Heo authored
      While updating locking, b2efa052 "block, cfq: unlink
      cfq_io_context's immediately" moved elevator_exit_icq_fn() invocation
      from exit_io_context() to the final ioc put.  While this doesn't cause
      catastrophic failure, it effectively removes task exit notification to
      elevator and cause noticeable IO performance degradation with CFQ.
      On task exit, CFQ used to immediately expire the slice if it was being
      used by the exiting task as no more IO would be issued by the task;
      however, after b2efa052, the notification is lost and disk could sit
      idle needlessly, leading to noticeable IO performance degradation for
      certain workloads.
      This patch renames ioc_exit_icq() to ioc_destroy_icq(), separates
      elevator_exit_icq_fn() invocation into ioc_exit_icq() and invokes it
      from exit_io_context().  ICQ_EXITED flag is added to avoid invoking
      the callback more than once for the same icq.
      Walking icq_list from ioc side and invoking elevator callback requires
      reverse double locking.  This may be better implemented using RCU;
      unfortunately, using RCU isn't trivial.  e.g. RCU protection would
      need to cover request_queue and queue_lock switch on cleanup makes
      grabbing queue_lock from RCU unsafe.  Reverse double locking should
      do, at least for now.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-and-bisected-by: default avatarShaohua Li <shli@kernel.org>
      LKML-Reference: <CANejiEVzs=pUhQSTvUppkDcc2TNZyfohBRLygW5zFmXyk5A-xQ@mail.gmail.com>
      Tested-by: default avatarShaohua Li <shaohua.li@intel.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tejun Heo's avatar
      block: simplify ioc_release_fn() · 2274b029
      Tejun Heo authored
      Reverse double lock dancing in ioc_release_fn() can be simplified by
      just using trylock on the queue_lock and back out from ioc lock on
      trylock failure.  Simplify it.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Tested-by: default avatarShaohua Li <shaohua.li@intel.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tejun Heo's avatar
      block: replace icq->changed with icq->flags · d705ae6b
      Tejun Heo authored
      icq->changed was used for ICQ_*_CHANGED bits.  Rename it to flags and
      access it under ioc->lock instead of using atomic bitops.
      ioc_get_changed() is added so that the changed part can be fetched and
      cleared as before.
      icq->flags will be used to carry other flags.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Tested-by: default avatarShaohua Li <shaohua.li@intel.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
  17. 11 Feb, 2012 1 commit
    • Tejun Heo's avatar
      block: fix lockdep warning on io_context release put_io_context() · d8c66c5d
      Tejun Heo authored
      11a3122f "block: strip out locking optimization in put_io_context()"
      removed ioc_lock depth lockdep annoation along with locking
      optimization; however, while recursing from put_io_context() is no
      longer possible, ioc_release_fn() may still end up putting the last
      reference of another ioc through elevator, which wlil grab ioc->lock
      triggering spurious (as the ioc is always different one) A-A deadlock
      As this can only happen one time from ioc_release_fn(), using non-zero
      subclass from ioc_release_fn() is enough.  Use subclass 1.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
  18. 07 Feb, 2012 1 commit
    • Tejun Heo's avatar
      block: strip out locking optimization in put_io_context() · 11a3122f
      Tejun Heo authored
      put_io_context() performed a complex trylock dancing to avoid
      deferring ioc release to workqueue.  It was also broken on UP because
      trylock was always assumed to succeed which resulted in unbalanced
      preemption count.
      While there are ways to fix the UP breakage, even the most
      pathological microbench (forced ioc allocation and tight fork/exit
      loop) fails to show any appreciable performance benefit of the
      optimization.  Strip it out.  If there turns out to be workloads which
      are affected by this change, simpler optimization from the discussion
      thread can be applied later.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      LKML-Reference: <1328514611.21268.66.camel@sli10-conroe>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
  19. 06 Feb, 2012 1 commit
    • Shaohua Li's avatar
      block: fix ioc locking warning · 9fa73472
      Shaohua Li authored
      Meelis reported a warning:
      WARNING: at kernel/timer.c:1122 run_timer_softirq+0x199/0x1ec()
      Hardware name: 939Dual-SATA2
      timer: cfq_idle_slice_timer+0x0/0xaa preempt leak: 00000102 -> 00000103
      Modules linked in: sr_mod cdrom videodev media drm_kms_helper ohci_hcd ehci_hcd v4l2_compat_ioctl32 usbcore i2c_ali15x3 snd_seq drm snd_timer snd_seq
      Pid: 0, comm: swapper Not tainted 3.3.0-rc2-00110-gd1256667 #176
      Call Trace:
       <IRQ>  [<ffffffff81022aaa>] warn_slowpath_common+0x7e/0x96
       [<ffffffff8114c485>] ? cfq_slice_expired+0x1d/0x1d
       [<ffffffff81022b56>] warn_slowpath_fmt+0x41/0x43
       [<ffffffff8114c526>] ? cfq_idle_slice_timer+0xa1/0xaa
       [<ffffffff8114c485>] ? cfq_slice_expired+0x1d/0x1d
       [<ffffffff8102c124>] run_timer_softirq+0x199/0x1ec
       [<ffffffff81047a53>] ? timekeeping_get_ns+0x12/0x31
       [<ffffffff810145fd>] ? apic_write+0x11/0x13
       [<ffffffff81027475>] __do_softirq+0x74/0xfa
       [<ffffffff812f337a>] call_softirq+0x1a/0x30
       [<ffffffff81002ff9>] do_softirq+0x31/0x68
       [<ffffffff810276cf>] irq_exit+0x3d/0xa3
       [<ffffffff81014aca>] smp_apic_timer_interrupt+0x6b/0x77
       [<ffffffff812f2de9>] apic_timer_interrupt+0x69/0x70
       <EOI>  [<ffffffff81040136>] ? sched_clock_cpu+0x73/0x7d
       [<ffffffff81040136>] ? sched_clock_cpu+0x73/0x7d
       [<ffffffff8100801f>] ? default_idle+0x1e/0x32
       [<ffffffff81008019>] ? default_idle+0x18/0x32
       [<ffffffff810008b1>] cpu_idle+0x87/0xd1
       [<ffffffff812de861>] rest_init+0x85/0x89
       [<ffffffff81659a4d>] start_kernel+0x2eb/0x2f8
       [<ffffffff8165926e>] x86_64_start_reservations+0x7e/0x82
       [<ffffffff81659362>] x86_64_start_kernel+0xf0/0xf7
      this_q == locked_q is possible. There are two problems here:
      1. In UP case, there is preemption counter issue as spin_trylock always
      2. In SMP case, the loop breaks too earlier.
      Signed-off-by: default avatarShaohua Li <shaohua.li@intel.com>
      Reported-by: default avatarMeelis Roos <mroos@linux.ee>
      Reported-by: default avatarKnut Petersen <Knut_Petersen@t-online.de>
      Tested-by: default avatarKnut Petersen <Knut_Petersen@t-online.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
  20. 27 Dec, 2011 1 commit
    • Tejun Heo's avatar
      block: remove WARN_ON_ONCE() in exit_io_context() · c98b2cc2
      Tejun Heo authored
      6e736be7 "block: make ioc get/put interface more conventional and fix
      race on alloction" added WARN_ON_ONCE() in exit_io_context() which
      triggers if !PF_EXITING.  All tasks hitting exit_io_context() from
      task exit should have PF_EXITING set but task struct tearing down
      after fork failure calls into the function without PF_EXITING,
      triggering the condition.
        WARNING: at block/blk-ioc.c:234 exit_io_context+0x40/0x92()
        Pid: 17090, comm: trinity Not tainted 3.2.0-rc6-next-20111222-sasha-dirty #77
        Call Trace:
         [<ffffffff810b69a3>] warn_slowpath_common+0x8f/0xb2
         [<ffffffff810b6a77>] warn_slowpath_null+0x18/0x1a
         [<ffffffff8181a7a2>] exit_io_context+0x40/0x92
         [<ffffffff810b58c9>] copy_process+0x126f/0x1453
         [<ffffffff810b5c1b>] do_fork+0x120/0x3e9
         [<ffffffff8106242f>] sys_clone+0x26/0x28
         [<ffffffff82425803>] stub_clone+0x13/0x20
        ---[ end trace a2e4eb670b375238 ]---
      Reported-by: default avatarSasha Levin <levinsasha928@gmail.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
  21. 25 Dec, 2011 1 commit
    • Tejun Heo's avatar
      block: an exiting task should be allowed to create io_context · fd638368
      Tejun Heo authored
      While fixing io_context creation / task exit race condition,
      6e736be7 "block: make ioc get/put interface more conventional and
      fix race on alloction" also prevented an exiting (%PF_EXITING) task
      from creating its own io_context.  This is incorrect as exit path may
      issue IOs, e.g. from exit_files(), and if those IOs are the first ones
      issued by the task, io_context needs to be created to process the IOs.
      Combined with the existing problem of io_context / io_cq creation
      failure having the possibility of stalling IO, this problem results in
      deterministic full IO lockup with certain workloads.
      Fix it by allowing io_context creation regardless of %PF_EXITING for
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reported-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
  22. 19 Dec, 2011 1 commit
  23. 13 Dec, 2011 9 commits
    • Tejun Heo's avatar
      block, cfq: move icq creation and rq->elv.icq association to block core · f1f8cc94
      Tejun Heo authored
      Now block layer knows everything necessary to create and associate
      icq's with requests.  Move ioc_create_icq() to blk-ioc.c and update
      get_request() such that, if elevator_type->icq_size is set, requests
      are automatically associated with their matching icq's before
      elv_set_request().  io_context reference is also managed by block core
      on request alloc/free.
      * Only ioprio/cgroup changed handling remains from cfq_get_cic().
        Collapsed into cfq_set_request().
      * This removes queue kicking on icq allocation failure (for now).  As
        icq allocation failure is rare and the only effect of queue kicking
        achieved was possibily accelerating queue processing, this change
        shouldn't be noticeable.
        There is a larger underlying problem.  Unlike request allocation,
        icq allocation is not guaranteed to succeed eventually after
        retries.  The number of icq is unbound and thus mempool can't be the
        solution either.  This effectively adds allocation dependency on
        memory free path and thus possibility of deadlock.
        This usually wouldn't happen because icq allocation is not a hot
        path and, even when the condition triggers, it's highly unlikely
        that none of the writeback workers already has icq.
        However, this is still possible especially if elevator is being
        switched under high memory pressure, so we better get it fixed.
        Probably the only solution is just bypassing elevator and appending
        to dispatch queue on any elevator allocation failure.
      * Comment added to explain how icq's are managed and synchronized.
      This completes cleanup of io_context interface.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tejun Heo's avatar
      block, cfq: move io_cq exit/release to blk-ioc.c · 7e5a8794
      Tejun Heo authored
      With kmem_cache managed by blk-ioc, io_cq exit/release can be moved to
      blk-ioc too.  The odd ->io_cq->exit/release() callbacks are replaced
      with elevator_ops->elevator_exit_icq_fn() with unlinking from both ioc
      and q, and freeing automatically handled by blk-ioc.  The elevator
      operation only need to perform exit operation specific to the elevator
      - in cfq's case, exiting the cfqq's.
      Also, clearing of io_cq's on q detach is moved to block core and
      automatically performed on elevator switch and q release.
      Because the q io_cq points to might be freed before RCU callback for
      the io_cq runs, blk-ioc code should remember to which cache the io_cq
      needs to be freed when the io_cq is released.  New field
      io_cq->__rcu_icq_cache is added for this purpose.  As both the new
      field and rcu_head are used only after io_cq is released and the
      q/ioc_node fields aren't, they are put into unions.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tejun Heo's avatar
      block, cfq: move io_cq lookup to blk-ioc.c · 47fdd4ca
      Tejun Heo authored
      Now that all io_cq related data structures are in block core layer,
      io_cq lookup can be moved from cfq-iosched.c to blk-ioc.c.
      Lookup logic from cfq_cic_lookup() is moved to ioc_lookup_icq() with
      parameter return type changes (cfqd -> request_queue, cfq_io_cq ->
      io_cq) and cfq_cic_lookup() becomes thin wrapper around
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tejun Heo's avatar
      block, cfq: reorganize cfq_io_context into generic and cfq specific parts · c5869807
      Tejun Heo authored
      Currently io_context and cfq logics are mixed without clear boundary.
      Most of io_context is independent from cfq but cfq_io_context handling
      logic is dispersed between generic ioc code and cfq.
      cfq_io_context represents association between an io_context and a
      request_queue, which is a concept useful outside of cfq, but it also
      contains fields which are useful only to cfq.
      This patch takes out generic part and put it into io_cq (io
      context-queue) and the rest into cfq_io_cq (cic moniker remains the
      same) which contains io_cq.  The following changes are made together.
      * cfq_ttime and cfq_io_cq now live in cfq-iosched.c.
      * All related fields, functions and constants are renamed accordingly.
      * ioc->ioc_data is now "struct io_cq *" instead of "void *" and
        renamed to icq_hint.
      This prepares for io_context API cleanup.  Documentation is currently
      sparse.  It will be added later.
      Changes in this patch are mechanical and don't cause functional
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tejun Heo's avatar
      block, cfq: replace current_io_context() with create_io_context() · f2dbd76a
      Tejun Heo authored
      When called under queue_lock, current_io_context() triggers lockdep
      warning if it hits allocation path.  This is because io_context
      installation is protected by task_lock which is not IRQ safe, so it
      triggers irq-unsafe-lock -> irq -> irq-safe-lock -> irq-unsafe-lock
      deadlock warning.
      Given the restriction, accessor + creator rolled into one doesn't work
      too well.  Drop current_io_context() and let the users access
      task->io_context directly inside queue_lock combined with explicit
      creation using create_io_context().
      Future ioc updates will further consolidate ioc access and the create
      interface will be unexported.
      While at it, relocate ioc internal interface declarations in blk.h and
      add section comments before and after.
      This patch does not introduce functional change.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tejun Heo's avatar
      block, cfq: unlink cfq_io_context's immediately · b2efa052
      Tejun Heo authored
      cic is association between io_context and request_queue.  A cic is
      linked from both ioc and q and should be destroyed when either one
      goes away.  As ioc and q both have their own locks, locking becomes a
      bit complex - both orders work for removal from one but not from the
      Currently, cfq tries to circumvent this locking order issue with RCU.
      ioc->lock nests inside queue_lock but the radix tree and cic's are
      also protected by RCU allowing either side to walk their lists without
      grabbing lock.
      This rather unconventional use of RCU quickly devolves into extremely
      fragile convolution.  e.g. The following is from cfqd going away too
      soon after ioc and q exits raced.
       general protection fault: 0000 [#1] PREEMPT SMP
       CPU 2
       Modules linked in:
       [   88.503444]
       Pid: 599, comm: hexdump Not tainted 3.1.0-rc10-work+ #158 Bochs Bochs
       RIP: 0010:[<ffffffff81397628>]  [<ffffffff81397628>] cfq_exit_single_io_context+0x58/0xf0
       Call Trace:
        [<ffffffff81395a4a>] call_for_each_cic+0x5a/0x90
        [<ffffffff81395ab5>] cfq_exit_io_context+0x15/0x20
        [<ffffffff81389130>] exit_io_context+0x100/0x140
        [<ffffffff81098a29>] do_exit+0x579/0x850
        [<ffffffff81098d5b>] do_group_exit+0x5b/0xd0
        [<ffffffff81098de7>] sys_exit_group+0x17/0x20
        [<ffffffff81b02f2b>] system_call_fastpath+0x16/0x1b
      The only real hot path here is cic lookup during request
      initialization and avoiding extra locking requires very confined use
      of RCU.  This patch makes cic removal from both ioc and request_queue
      perform double-locking and unlink immediately.
      * From q side, the change is almost trivial as ioc->lock nests inside
        queue_lock.  It just needs to grab each ioc->lock as it walks
        cic_list and unlink it.
      * From ioc side, it's a bit more difficult because of inversed lock
        order.  ioc needs its lock to walk its cic_list but can't grab the
        matching queue_lock and needs to perform unlock-relock dancing.
        Unlinking is now wholly done from put_io_context() and fast path is
        optimized by using the queue_lock the caller already holds, which is
        by far the most common case.  If the ioc accessed multiple devices,
        it tries with trylock.  In unlikely cases of fast path failure, it
        falls back to full double-locking dance from workqueue.
      Double-locking isn't the prettiest thing in the world but it's *far*
      simpler and more understandable than RCU trick without adding any
      meaningful overhead.
      This still leaves a lot of now unnecessary RCU logics.  Future patches
      will trim them.
      -v2: Vivek pointed out that cic->q was being dereferenced after
           cic->release() was called.  Updated to use local variable @this_q
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tejun Heo's avatar
      block, cfq: move ioc ioprio/cgroup changed handling to cic · dc86900e
      Tejun Heo authored
      ioprio/cgroup change was handled by marking the changed state in ioc
      and, on the following access to the ioc, performing RCU-protected
      iteration through all cic's grabbing the matching queue_lock.
      This patch moves the changed state to each cic.  When ioprio or cgroup
      changes, the respective bit is set on all cic's of the ioc and when
      each of those cic (not ioc) is accessed, change is applied for that
      specific ioc-queue pair.
      This also fixes the following two race conditions between setting and
      clearing of changed states.
      * Missing barrier between assign/load of ioprio and ioprio_changed
        allowed applying old ioprio.
      * Change requests could happen between application of change and
        clearing of changed variables.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tejun Heo's avatar
      block: make ioc get/put interface more conventional and fix race on alloction · 6e736be7
      Tejun Heo authored
      Ignoring copy_io() during fork, io_context can be allocated from two
      places - current_io_context() and set_task_ioprio().  The former is
      always called from local task while the latter can be called from
      different task.  The synchornization between them are peculiar and
      * current_io_context() doesn't grab task_lock() and assumes that if it
        saw %NULL ->io_context, it would stay that way until allocation and
        assignment is complete.  It has smp_wmb() between alloc/init and
      * set_task_ioprio() grabs task_lock() for assignment and does
        smp_read_barrier_depends() between "ioc = task->io_context" and "if
        (ioc)".  Unfortunately, this doesn't achieve anything - the latter
        is not a dependent load of the former.  ie, if ioc itself were being
        dereferenced "ioc->xxx", it would mean something (not sure what tho)
        but as the code currently stands, the dependent read barrier is
      As only one of the the two test-assignment sequences is task_lock()
      protected, the task_lock() can't do much about race between the two.
      Nothing prevents current_io_context() and set_task_ioprio() allocating
      its own ioc for the same task and overwriting the other's.
      Also, set_task_ioprio() can race with exiting task and create a new
      ioc after exit_io_context() is finished.
      ioc get/put doesn't have any reason to be complex.  The only hot path
      is accessing the existing ioc of %current, which is simple to achieve
      given that ->io_context is never destroyed as long as the task is
      alive.  All other paths can happily go through task_lock() like all
      other task sub structures without impacting anything.
      This patch updates ioc get/put so that it becomes more conventional.
      * alloc_io_context() is replaced with get_task_io_context().  This is
        the only interface which can acquire access to ioc of another task.
        On return, the caller has an explicit reference to the object which
        should be put using put_io_context() afterwards.
      * The functionality of current_io_context() remains the same but when
        creating a new ioc, it shares the code path with
        get_task_io_context() and always goes through task_lock().
      * get_io_context() now means incrementing ref on an ioc which the
        caller already has access to (be that an explicit refcnt or implicit
        %current one).
      * PF_EXITING inhibits creation of new io_context and once
        exit_io_context() is finished, it's guaranteed that both ioc
        acquisition functions return %NULL.
      * All users are updated.  Most are trivial but
        smp_read_barrier_depends() removal from cfq_get_io_context() needs a
        bit of explanation.  I suppose the original intention was to ensure
        ioc->ioprio is visible when set_task_ioprio() allocates new
        io_context and installs it; however, this wouldn't have worked
        because set_task_ioprio() doesn't have wmb between init and install.
        There are other problems with this which will be fixed in another
      * While at it, use NUMA_NO_NODE instead of -1 for wildcard node
      -v2: Vivek spotted contamination from debug patch.  Removed.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Tejun Heo's avatar
      block: misc ioc cleanups · 42ec57a8
      Tejun Heo authored
      * int return from put_io_context() wasn't used by anybody.  Make it
        return void like other put functions and docbook-fy the function
      * Reorder dummy declarations for !CONFIG_BLOCK case a bit.
      * Make alloc_ioc_context() use __GFP_ZERO allocation, take init out of
        if block and drop 0'ing.
      * Docbook-fy current_io_context() comment.
      This patch doesn't introduce any functional change.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
  24. 06 Jun, 2011 1 commit
  25. 02 Jun, 2011 1 commit
  26. 23 May, 2011 1 commit
  27. 21 Dec, 2010 1 commit