1. 17 Sep, 2015 1 commit
    • Ming Lei's avatar
      block: fix bounce_end_io · 99451879
      Ming Lei authored
      When bio bounce is involved, one new bio and its biovecs are
      cloned from the comming bio, which can be one fast-cloned bio
      from upper layer(such as dm).
      So it is obviously wrong to assume the start index of the coming(
      original) bio's io vector is zero, which can be any value between
      0 and (bi_max_vecs - 1), especially in case of bio split.
      This patch fixes Fedora's booting oops on i386, often with the
      following kernel log together:
      > [    9.026738] systemd[1]: Switching root.
      > [    9.036467] systemd-journald[149]: Received SIGTERM from PID 1
      > (systemd).
      > [    9.082262] BUG: Bad page state in process kworker/u5:1  pfn:372ac
      > [    9.083989] page:f3d32ae0 count:0 mapcount:0 mapping:f2252178
      > index:0x16a
      > [    9.085755] flags: 0x40020021(locked|lru|mappedtodisk)
      > [    9.087284] page dumped because: page still charged to cgroup
      > [    9.088772] bad because of flags:
      > [    9.089731] flags: 0x21(locked|lru)
      > [    9.090818] page->mem_cgroup:f2c3e400
      Reported-by: default avatarJosh Boyer <jwboyer@fedoraproject.org>
      Tested-by: default avatarAdam Williamson <awilliam@redhat.com>
      Cc: Ming Lin <mlin@kernel.org>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarMing Lei <ming.lei@canonical.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
  2. 29 Jul, 2015 2 commits
    • Jens Axboe's avatar
      block: manipulate bio->bi_flags through helpers · b7c44ed9
      Jens Axboe authored
      Some places use helpers now, others don't. We only have the 'is set'
      helper, add helpers for setting and clearing flags too.
      It was a bit of a mess of atomic vs non-atomic access. With
      BIO_UPTODATE gone, we don't have any risk of concurrent access to the
      flags. So relax the restriction and don't make any of them atomic. The
      flags that do have serialization issues (reffed and chained), we
      already handle those separately.
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
    • Christoph Hellwig's avatar
      block: add a bi_error field to struct bio · 4246a0b6
      Christoph Hellwig authored
      Currently we have two different ways to signal an I/O error on a BIO:
       (1) by clearing the BIO_UPTODATE flag
       (2) by returning a Linux errno value to the bi_end_io callback
      The first one has the drawback of only communicating a single possible
      error (-EIO), and the second one has the drawback of not beeing persistent
      when bios are queued up, and are not passed along from child to parent
      bio in the ever more popular chaining scenario.  Having both mechanisms
      available has the additional drawback of utterly confusing driver authors
      and introducing bugs where various I/O submitters only deal with one of
      them, and the others have to add boilerplate code to deal with both kinds
      of error returns.
      So add a new bi_error field to store an errno value directly in struct
      bio and remove the existing mechanisms to clean all this up.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarHannes Reinecke <hare@suse.de>
      Reviewed-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
  3. 23 Jul, 2015 1 commit
  4. 02 Jun, 2015 1 commit
    • Tejun Heo's avatar
      writeback: separate out include/linux/backing-dev-defs.h · 66114cad
      Tejun Heo authored
      With the planned cgroup writeback support, backing-dev related
      declarations will be more widely used across block and cgroup;
      unfortunately, including backing-dev.h from include/linux/blkdev.h
      makes cyclic include dependency quite likely.
      This patch separates out backing-dev-defs.h which only has the
      essential definitions and updates blkdev.h to include it.  c files
      which need access to more backing-dev details now include
      backing-dev.h directly.  This takes backing-dev.h off the common
      include dependency chain making it a lot easier to use it across block
      and cgroup.
      v2: fs/fat build failure fixed.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
  5. 19 May, 2015 1 commit
  6. 27 Apr, 2015 1 commit
    • Wang YanQing's avatar
      block:bounce: fix call inc_|dec_zone_page_state on different pages confuse value of NR_BOUNCE · 393a3397
      Wang YanQing authored
      Commit d2c5e30c
      ("[PATCH] zoned vm counters: conversion of nr_bounce to per zone counter")
      convert statistic of nr_bounce to per zone and one global value in vm_stat,
      but it call inc_|dec_zone_page_state on different pages, then different
      zones, and cause us to get unexpected value of NR_BOUNCE.
      Below is the result on my machine:
      Mar  2 09:26:08 udknight kernel: [144766.778265] Mem-Info:
      Mar  2 09:26:08 udknight kernel: [144766.778266] DMA per-cpu:
      Mar  2 09:26:08 udknight kernel: [144766.778268] CPU    0: hi:    0, btch:   1 usd:   0
      Mar  2 09:26:08 udknight kernel: [144766.778269] CPU    1: hi:    0, btch:   1 usd:   0
      Mar  2 09:26:08 udknight kernel: [144766.778270] Normal per-cpu:
      Mar  2 09:26:08 udknight kernel: [144766.778271] CPU    0: hi:  186, btch:  31 usd:   0
      Mar  2 09:26:08 udknight kernel: [144766.778273] CPU    1: hi:  186, btch:  31 usd:   0
      Mar  2 09:26:08 udknight kernel: [144766.778274] HighMem per-cpu:
      Mar  2 09:26:08 udknight kernel: [144766.778275] CPU    0: hi:  186, btch:  31 usd:   0
      Mar  2 09:26:08 udknight kernel: [144766.778276] CPU    1: hi:  186, btch:  31 usd:   0
      Mar  2 09:26:08 udknight kernel: [144766.778279] active_anon:46926 inactive_anon:287406 isolated_anon:0
      Mar  2 09:26:08 udknight kernel: [144766.778279]  active_file:105085 inactive_file:139432 isolated_file:0
      Mar  2 09:26:08 udknight kernel: [144766.778279]  unevictable:653 dirty:0 writeback:0 unstable:0
      Mar  2 09:26:08 udknight kernel: [144766.778279]  free:178957 slab_reclaimable:6419 slab_unreclaimable:9966
      Mar  2 09:26:08 udknight kernel: [144766.778279]  mapped:4426 shmem:305277 pagetables:784 bounce:0
      Mar  2 09:26:08 udknight kernel: [144766.778279]  free_cma:0
      Mar  2 09:26:08 udknight kernel: [144766.778286] DMA free:3324kB min:68kB low:84kB high:100kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15976kB managed:15900kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
      Mar  2 09:26:08 udknight kernel: [144766.778287] lowmem_reserve[]: 0 822 3754 3754
      Mar  2 09:26:08 udknight kernel: [144766.778293] Normal free:26828kB min:3632kB low:4540kB high:5448kB active_anon:4872kB inactive_anon:68kB active_file:1796kB inactive_file:1796kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:892920kB managed:842560kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:4144kB slab_reclaimable:25676kB slab_unreclaimable:39864kB kernel_stack:1944kB pagetables:3136kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2412612 all_unreclaimable? yes
      Mar  2 09:26:08 udknight kernel: [144766.778294] lowmem_reserve[]: 0 0 23451 23451
      Mar  2 09:26:08 udknight kernel: [144766.778299] HighMem free:685676kB min:512kB low:3748kB high:6984kB active_anon:182832kB inactive_anon:1149556kB active_file:418544kB inactive_file:555932kB unevictable:2612kB isolated(anon):0kB isolated(file):0kB present:3001732kB managed:3001732kB mlocked:0kB dirty:0kB writeback:0kB mapped:17704kB shmem:1216964kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:75771152kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
      Mar  2 09:26:08 udknight kernel: [144766.778300] lowmem_reserve[]: 0 0 0 0
      You can see bounce:75771152kB for HighMem, but bounce:0 for lowmem and global.
      This patch fix it.
      Signed-off-by: default avatarWang YanQing <udknight@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
  7. 06 Jun, 2014 1 commit
  8. 20 May, 2014 1 commit
  9. 24 Nov, 2013 1 commit
    • Kent Overstreet's avatar
      block: Convert bio_for_each_segment() to bvec_iter · 7988613b
      Kent Overstreet authored
      More prep work for immutable biovecs - with immutable bvecs drivers
      won't be able to use the biovec directly, they'll need to use helpers
      that take into account bio->bi_iter.bi_bvec_done.
      This updates callers for the new usage without changing the
      implementation yet.
      Signed-off-by: default avatarKent Overstreet <kmo@daterainc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Ed L. Cashin" <ecashin@coraid.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Paul Clements <Paul.Clements@steeleye.com>
      Cc: Jim Paris <jim@jtan.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Yehuda Sadeh <yehuda@inktank.com>
      Cc: Sage Weil <sage@inktank.com>
      Cc: Alex Elder <elder@inktank.com>
      Cc: ceph-devel@vger.kernel.org
      Cc: Joshua Morris <josh.h.morris@us.ibm.com>
      Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: linux390@de.ibm.com
      Cc: Nagalakshmi Nandigama <Nagalakshmi.Nandigama@lsi.com>
      Cc: Sreekanth Reddy <Sreekanth.Reddy@lsi.com>
      Cc: support@lsi.com
      Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Guo Chao <yan@linux.vnet.ibm.com>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Cc: Quoc-Son Anh <quoc-sonx.anh@intel.com>
      Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: linux-m68k@lists.linux-m68k.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: drbd-user@lists.linbit.com
      Cc: nbd-general@lists.sourceforge.net
      Cc: cbe-oss-dev@lists.ozlabs.org
      Cc: xen-devel@lists.xensource.com
      Cc: virtualization@lists.linux-foundation.org
      Cc: linux-raid@vger.kernel.org
      Cc: linux-s390@vger.kernel.org
      Cc: DL-MPTFusionLinux@lsi.com
      Cc: linux-scsi@vger.kernel.org
      Cc: devel@driverdev.osuosl.org
      Cc: linux-fsdevel@vger.kernel.org
      Cc: cluster-devel@redhat.com
      Cc: linux-mm@kvack.org
      Acked-by: default avatarGeoff Levand <geoff@infradead.org>
  10. 30 Sep, 2013 1 commit
  11. 29 Apr, 2013 1 commit
    • Darrick J. Wong's avatar
      mm: make snapshotting pages for stable writes a per-bio operation · 71368511
      Darrick J. Wong authored
      Walking a bio's page mappings has proved problematic, so create a new
      bio flag to indicate that a bio's data needs to be snapshotted in order
      to guarantee stable pages during writeback.  Next, for the one user
      (ext3/jbd) of snapshotting, hook all the places where writes can be
      initiated without PG_writeback set, and set BIO_SNAP_STABLE there.
      We must also flag journal "metadata" bios for stable writeout, since
      file data can be written through the journal.  Finally, the
      MS_SNAP_STABLE mount flag (only used by ext3) is now superfluous, so get
      rid of it.
      [akpm@linux-foundation.org: rename _submit_bh()'s `flags' to `bio_flags', delobotomize the _submit_bh declaration]
      [akpm@linux-foundation.org: teeny cleanup]
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Artem Bityutskiy <dedekind1@gmail.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  12. 23 Mar, 2013 3 commits
    • Kent Overstreet's avatar
      block: Convert some code to bio_for_each_segment_all() · cb34e057
      Kent Overstreet authored
      More prep work for immutable bvecs:
      A few places in the code were either open coding or using the wrong
      version - fix.
      After we introduce the bvec iter, it'll no longer be possible to modify
      the biovec through bio_for_each_segment_all() - it doesn't increment a
      pointer to the current bvec, you pass in a struct bio_vec (not a
      pointer) which is updated with what the current biovec would be (taking
      into account bi_bvec_done and bi_size).
      So because of that it's more worthwhile to be consistent about
      bio_for_each_segment()/bio_for_each_segment_all() usage.
      Signed-off-by: default avatarKent Overstreet <koverstreet@google.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: NeilBrown <neilb@suse.de>
      CC: Alasdair Kergon <agk@redhat.com>
      CC: dm-devel@redhat.com
      CC: Alexander Viro <viro@zeniv.linux.org.uk>
    • Kent Overstreet's avatar
      block: Add bio_for_each_segment_all() · d74c6d51
      Kent Overstreet authored
      __bio_for_each_segment() iterates bvecs from the specified index
      instead of bio->bv_idx.  Currently, the only usage is to walk all the
      bvecs after the bio has been advanced by specifying 0 index.
      For immutable bvecs, we need to split these apart;
      bio_for_each_segment() is going to have a different implementation.
      This will also help document the intent of code that's using it -
      bio_for_each_segment_all() is only legal to use for code that owns the
      Signed-off-by: default avatarKent Overstreet <koverstreet@google.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: Neil Brown <neilb@suse.de>
      CC: Boaz Harrosh <bharrosh@panasas.com>
    • Kent Overstreet's avatar
      bounce: Refactor __blk_queue_bounce to not use bi_io_vec · 6bc454d1
      Kent Overstreet authored
      A bunch of what __blk_queue_bounce() was doing was problematic for the
      immutable bvec work; this cleans that up and the code is quite a bit
      smaller, too.
      The __bio_for_each_segment() in copy_to_high_bio_irq() was changed
      because that one's looping over the original bio, not the bounce bio -
      a later patch renames __bio_for_each_segment() ->
      bio_for_each_segment_all(), and documents that
      bio_for_each_segment_all() is only for code that owns the bio.
      Signed-off-by: default avatarKent Overstreet <koverstreet@google.com>
      CC: Jens Axboe <axboe@kernel.dk>
  13. 22 Feb, 2013 1 commit
    • Darrick J. Wong's avatar
      block: optionally snapshot page contents to provide stable pages during write · ffecfd1a
      Darrick J. Wong authored
      This provides a band-aid to provide stable page writes on jbd without
      needing to backport the fixed locking and page writeback bit handling
      schemes of jbd2.  The band-aid works by using bounce buffers to snapshot
      page contents instead of waiting.
      For those wondering about the ext3 bandage -- fixing the jbd locking
      (which was done as part of ext4dev years ago) is a lot of surgery, and
      setting PG_writeback on data pages when we actually hold the page lock
      dropped ext3 performance by nearly an order of magnitude.  If we're
      going to migrate iscsi and raid to use stable page writes, the
      complaints about high latency will likely return.  We might as well
      centralize their page snapshotting thing to one place.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Tested-by: default avatarAndy Lutomirski <luto@amacapital.net>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Artem Bityutskiy <dedekind1@gmail.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Eric Van Hensbergen <ericvh@gmail.com>
      Cc: Ron Minnich <rminnich@sandia.gov>
      Cc: Latchesar Ionkov <lucho@ionkov.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  14. 18 Jul, 2012 1 commit
  15. 20 Mar, 2012 1 commit
  16. 31 Oct, 2011 1 commit
  17. 20 Oct, 2011 1 commit
    • David Vrabel's avatar
      block: initialize the bounce pool if high memory may be added later · 3bcfeaf9
      David Vrabel authored
      init_emergency_pool() does not create the page pool for bouncing block
      requests if the current count of high pages is zero.  If high memory
      may be added later (either via memory hotplug or a balloon driver in a
      virtualized system) then a oops occurs if a request with a high page
      need bouncing because the pool does not exist.
      So, always create the pool if memory hotplug is enabled and change the
      test so it's valid even if all high pages are currently in the balloon
      (the balloon drivers adjust totalhigh_pages but not max_pfn).
      Signed-off-by: default avatarDavid Vrabel <david.vrabel@citrix.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
  18. 10 Sep, 2010 1 commit
  19. 30 Mar, 2010 1 commit
    • Tejun Heo's avatar
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking... · 5a0e3ad6
      Tejun Heo authored
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
      percpu.h is included by sched.h and module.h and thus ends up being
      included when building most .c files.  percpu.h includes slab.h which
      in turn includes gfp.h making everything defined by the two files
      universally available and complicating inclusion dependencies.
      percpu.h -> slab.h dependency is about to be removed.  Prepare for
      this change by updating users of gfp and slab facilities include those
      headers directly instead of assuming availability.  As this conversion
      needs to touch large number of source files, the following script is
      used as the basis of conversion.
      The script does the followings.
      * Scan files for gfp and slab usages and update includes such that
        only the necessary includes are there.  ie. if only gfp is used,
        gfp.h, if slab is used, slab.h.
      * When the script inserts a new include, it looks at the include
        blocks and try to put the new include such that its order conforms
        to its surrounding.  It's put in the include block which contains
        core kernel includes, in the same order that the rest are ordered -
        alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
        doesn't seem to be any matching order.
      * If the script can't find a place to put a new include (mostly
        because the file doesn't have fitting include block), it prints out
        an error message indicating which .h file needs to be added to the
      The conversion was done in the following steps.
      1. The initial automatic conversion of all .c files updated slightly
         over 4000 files, deleting around 700 includes and adding ~480 gfp.h
         and ~3000 slab.h inclusions.  The script emitted errors for ~400
      2. Each error was manually checked.  Some didn't need the inclusion,
         some needed manual addition while adding it to implementation .h or
         embedding .c file was more appropriate for others.  This step added
         inclusions to around 150 files.
      3. The script was run again and the output was compared to the edits
         from #2 to make sure no file was left behind.
      4. Several build tests were done and a couple of problems were fixed.
         e.g. lib/decompress_*.c used malloc/free() wrappers around slab
         APIs requiring slab.h to be added manually.
      5. The script was run on all .h files but without automatically
         editing them as sprinkling gfp.h and slab.h inclusions around .h
         files could easily lead to inclusion dependency hell.  Most gfp.h
         inclusion directives were ignored as stuff from gfp.h was usually
         wildly available and often used in preprocessor macros.  Each
         slab.h inclusion directive was examined and added manually as
      6. percpu.h was updated not to include slab.h.
      7. Build test were done on the following configurations and failures
         were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
         distributed build env didn't work with gcov compiles) and a few
         more options had to be turned off depending on archs to make things
         build (like ipr on powerpc/64 which failed due to missing writeq).
         * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
         * powerpc and powerpc64 SMP allmodconfig
         * sparc and sparc64 SMP allmodconfig
         * ia64 SMP allmodconfig
         * s390 SMP allmodconfig
         * alpha SMP allmodconfig
         * um on x86_64 SMP allmodconfig
      8. percpu.h modifications were reverted so that it could be applied as
         a separate patch and serve as bisection point.
      Given the fact that I had only a couple of failures from tests on step
      6, I'm fairly confident about the coverage of this conversion patch.
      If there is a breakage, it's likely to be something in one of the arch
      headers which should be easily discoverable easily on most builds of
      the specific arch.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Guess-its-ok-by: default avatarChristoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
  20. 16 Jun, 2009 1 commit
  21. 09 Jun, 2009 1 commit
    • Li Zefan's avatar
      tracing/events: convert block trace points to TRACE_EVENT() · 55782138
      Li Zefan authored
      TRACE_EVENT is a more generic way to define tracepoints. Doing so adds
      these new capabilities to this tracepoint:
        - zero-copy and per-cpu splice() tracing
        - binary tracing without printf overhead
        - structured logging records exposed under /debug/tracing/events
        - trace events embedded in function tracer output and other plugins
        - user-defined, per tracepoint filter expressions
        - no dev_t info for the output of plug, unplug_timer and unplug_io events.
          no dev_t info for getrq and sleeprq events if bio == NULL.
          no dev_t info for rq_abort,...,rq_requeue events if rq->rq_disk == NULL.
          This is mainly because we can't get the deivce from a request queue.
          But this may change in the future.
        - A packet command is converted to a string in TP_assign, not TP_print.
          While blktrace do the convertion just before output.
          Since pc requests should be rather rare, this is not a big issue.
        - In blktrace, an event can have 2 different print formats, but a TRACE_EVENT
          has a unique format, which means we have some unused data in a trace entry.
          The overhead is minimized by using __dynamic_array() instead of __array().
      I've benchmarked the ioctl blktrace vs the splice based TRACE_EVENT tracing:
            dd                   dd + ioctl blktrace       dd + TRACE_EVENT (splice)
      1     7.36s, 42.7 MB/s     7.50s, 42.0 MB/s          7.41s, 42.5 MB/s
      2     7.43s, 42.3 MB/s     7.48s, 42.1 MB/s          7.43s, 42.4 MB/s
      3     7.38s, 42.6 MB/s     7.45s, 42.2 MB/s          7.41s, 42.5 MB/s
      So the overhead of tracing is very small, and no regression when using
      those trace events vs blktrace.
      And the binary output of TRACE_EVENT is much smaller than blktrace:
       # ls -l -h
       -rw-r--r-- 1 root root 8.8M 06-09 13:24 sda.blktrace.0
       -rw-r--r-- 1 root root 195K 06-09 13:24 sda.blktrace.1
       -rw-r--r-- 1 root root 2.7M 06-09 13:25 trace_splice.out
      Following are some comparisons between TRACE_EVENT and blktrace:
        kjournald-480   [000]   303.084981: block_plug: [kjournald]
        kjournald-480   [000]   303.084981:   8,0    P   N [kjournald]
        kblockd/0-118   [000]   300.052973: block_unplug_io: [kblockd/0] 1
        kblockd/0-118   [000]   300.052974:   8,0    U   N [kblockd/0] 1
        kjournald-480   [000]   303.085042: block_remap: 8,0 W 102736992 + 8 <- (8,8) 33384
        kjournald-480   [000]   303.085043:   8,0    A   W 102736992 + 8 <- (8,8) 33384
        kjournald-480   [000]   303.085086: block_bio_backmerge: 8,0 W 102737032 + 8 [kjournald]
        kjournald-480   [000]   303.085086:   8,0    M   W 102737032 + 8 [kjournald]
        kjournald-480   [000]   303.084974: block_getrq: 8,0 W 102736984 + 8 [kjournald]
        kjournald-480   [000]   303.084975:   8,0    G   W 102736984 + 8 [kjournald]
        bash-2066  [001]  1072.953770:   8,0    G   N [bash]
        bash-2066  [001]  1072.953773: block_getrq: 0,0 N 0 + 0 [bash]
        konsole-2065  [001]   300.053184: block_rq_complete: 8,0 W () 103669040 + 16 [0]
        konsole-2065  [001]   300.053191:   8,0    C   W 103669040 + 16 [0]
        ksoftirqd/1-7   [001]  1072.953811:   8,0    C   N (5a 00 08 00 00 00 00 00 24 00) [0]
        ksoftirqd/1-7   [001]  1072.953813: block_rq_complete: 0,0 N (5a 00 08 00 00 00 00 00 24 00) 0 + 0 [0]
        kjournald-480   [000]   303.084985: block_rq_insert: 8,0 W 0 () 102736984 + 8 [kjournald]
        kjournald-480   [000]   303.084986:   8,0    I   W 102736984 + 8 [kjournald]
      Changelog from v2 -> v3:
      - use the newly introduced __dynamic_array().
      Changelog from v1 -> v2:
      - use __string() instead of __array() to minimize the memory required
        to store hex dump of rq->cmd().
      - support large pc requests.
      - add missing blk_fill_rwbs_rq() in block_rq_requeue TRACE_EVENT.
      - some cleanups.
      Signed-off-by: default avatarLi Zefan <lizf@cn.fujitsu.com>
      LKML-Reference: <4A2DF669.5070905@cn.fujitsu.com>
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
  22. 22 May, 2009 1 commit
  23. 29 Dec, 2008 1 commit
  24. 26 Nov, 2008 2 commits
  25. 09 Oct, 2008 1 commit
  26. 16 Oct, 2007 1 commit
  27. 10 Oct, 2007 1 commit
  28. 24 Jul, 2007 1 commit
  29. 27 Mar, 2007 1 commit
    • Vasily Tarasov's avatar
      block: blk_max_pfn is somtimes wrong · f772b3d9
      Vasily Tarasov authored
      There is a small problem in handling page bounce.
      At the moment blk_max_pfn equals max_pfn, which is in fact not maximum
      possible _number_ of a page frame, but the _amount_ of page frames.  For
      example for the 32bit x86 node with 4Gb RAM, max_pfn = 0x100000, but not
      request_queue structure has a member q->bounce_pfn and queue needs bounce
      pages for the pages _above_ this limit.  This routine is handled by
      blk_queue_bounce(), where the following check is produced:
      	if (q->bounce_pfn >= blk_max_pfn)
      Assume, that a driver has set q->bounce_pfn to 0xFFFF, but blk_max_pfn
      equals 0x10000.  In such situation the check above fails and for each bio
      we always fall down for iterating over pages tied to the bio.
      I want to notice, that for quite a big range of device drivers (ide, md,
      ...) such problem doesn't happen because they use BLK_BOUNCE_ANY for
      bounce_pfn.  BLK_BOUNCE_ANY is defined as blk_max_pfn << PAGE_SHIFT, and
      then the check above doesn't fail.  But for other drivers, which obtain
      reuired value from drivers, it fails.  For example sata_nv uses
      ATA_DMA_MASK or dev->dma_mask.
      I propose to use (max_pfn - 1) for blk_max_pfn.  And the same for
      blk_max_low_pfn.  The patch also cleanses some checks related with
      Signed-off-by: default avatarVasily Tarasov <vtaras@openvz.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
  30. 12 Jan, 2007 1 commit
  31. 30 Sep, 2006 1 commit