1. 29 Feb, 2012 1 commit
  2. 14 Jan, 2012 1 commit
    • Unused iocbs in a batch should not be accounted as active. · 69e4747e
      Gleb Natapov authored
      Since commit 080d676d ("aio: allocate kiocbs in batches") iocbs are
      allocated in a batch during processing of the first iocb.  All iocbs in a
      batch are automatically added to the ctx->active_reqs list and accounted in
      ctx->reqs_active.
      If one (not the last one) of the iocbs submitted by a user fails, further
      iocbs are not processed, but they are still present in ctx->active_reqs
      and accounted in ctx->reqs_active.  This causes the process to get stuck in
      D state in wait_for_all_aios() on exit since ctx->reqs_active will never
      go down to zero.  Furthermore, since kiocb_batch_free() frees an iocb
      without removing it from the active_reqs list, the list becomes corrupted,
      which may cause an oops.
      Fix this by removing iocb from ctx->active_reqs and updating
      ctx->reqs_active in kiocb_batch_free().
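      The accounting problem can be illustrated with a small userland model.
      This is only a sketch: struct req, struct ctx, struct batch, the used
      flag and BATCH_MAX are hypothetical stand-ins named after the commit
      text, not the kernel's real data structures.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model; names follow the commit text, not the kernel structs. */
struct req {
	struct req *prev, *next;	/* links in ctx->active_reqs */
	bool used;			/* true once the request was really submitted */
};

struct ctx {
	struct req active_reqs;		/* circular list head */
	int reqs_active;		/* requests accounted as active */
};

#define BATCH_MAX 8

struct batch {
	struct req *reqs[BATCH_MAX];
	int nr;
};

static void ctx_init(struct ctx *ctx)
{
	ctx->active_reqs.prev = ctx->active_reqs.next = &ctx->active_reqs;
	ctx->reqs_active = 0;
}

/* Preallocate up to nr requests; each is linked and counted immediately. */
static int batch_alloc(struct ctx *ctx, struct batch *b, int nr)
{
	b->nr = 0;
	for (int i = 0; i < nr && i < BATCH_MAX; i++) {
		struct req *r = calloc(1, sizeof(*r));
		if (!r)
			break;
		r->next = &ctx->active_reqs;
		r->prev = ctx->active_reqs.prev;
		r->prev->next = r;
		ctx->active_reqs.prev = r;
		ctx->reqs_active++;
		b->reqs[b->nr++] = r;
	}
	return b->nr;
}

/* Free the never-submitted requests, undoing their accounting first. */
static void batch_free(struct ctx *ctx, struct batch *b)
{
	for (int i = 0; i < b->nr; i++) {
		struct req *r = b->reqs[i];
		if (r->used)
			continue;		/* in flight, still owned by ctx */
		r->prev->next = r->next;	/* unlink before freeing */
		r->next->prev = r->prev;
		ctx->reqs_active--;
		free(r);
	}
	b->nr = 0;
}

/* Count the entries currently linked into ctx->active_reqs. */
static int active_len(const struct ctx *ctx)
{
	int n = 0;
	for (const struct req *r = ctx->active_reqs.next;
	     r != &ctx->active_reqs; r = r->next)
		n++;
	return n;
}
```

      Dropping the unlink and the decrement from batch_free() in this model
      reproduces both symptoms described above: a counter that never reaches
      zero and a list that points at freed memory.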
      Signed-off-by: Gleb Natapov <gleb@redhat.com>
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Cc: stable@kernel.org   # 3.2
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 02 Nov, 2011 1 commit
    • aio: allocate kiocbs in batches · 080d676d
      Jeff Moyer authored
      In testing aio on a fast storage device, I found that the context lock
      takes up a fair amount of cpu time in the I/O submission path.  The reason
      is that we take it for every I/O submitted (see __aio_get_req).  Since we
      know how many I/Os are passed to io_submit, we can preallocate the kiocbs
      in batches, reducing the number of times we take and release the lock.
      In my testing, I was able to reduce the amount of time spent in
      _raw_spin_lock_irq by .56% (average of 3 runs).  The command I used to
      test this was:
         aio-stress -O -o 2 -o 3 -r 8 -d 128 -b 32 -i 32 -s 16384 <dev>
      I also tested the patch with various numbers of events passed to
      io_submit, and I ran the xfstests aio group of tests to ensure I didn't
      break anything.
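      The effect of batching on lock traffic can be sketched in userland as
      follows.  This is an illustration only: struct pool, BATCH_SIZE and both
      helper functions are hypothetical, and a pthread mutex stands in for the
      context lock.

```c
#include <pthread.h>
#include <stddef.h>

#define BATCH_SIZE 32	/* hypothetical batch size for this sketch */

struct pool {
	pthread_mutex_t lock;
	long lock_acquisitions;	/* how many times the lock was taken */
	long allocated;		/* requests handed out so far */
};

static void pool_init(struct pool *p)
{
	pthread_mutex_init(&p->lock, NULL);
	p->lock_acquisitions = 0;
	p->allocated = 0;
}

/* One lock round trip per request. */
static void alloc_one_by_one(struct pool *p, long nr)
{
	for (long i = 0; i < nr; i++) {
		pthread_mutex_lock(&p->lock);
		p->lock_acquisitions++;
		p->allocated++;
		pthread_mutex_unlock(&p->lock);
	}
}

/* One lock round trip per batch of up to BATCH_SIZE requests. */
static void alloc_in_batches(struct pool *p, long nr)
{
	while (nr > 0) {
		long n = nr < BATCH_SIZE ? nr : BATCH_SIZE;

		pthread_mutex_lock(&p->lock);
		p->lock_acquisitions++;
		p->allocated += n;
		pthread_mutex_unlock(&p->lock);
		nr -= n;
	}
}
```

      With 100 requests, the first variant takes the lock 100 times and the
      batched variant takes it 4 times, while both hand out the same number of
      requests.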
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Cc: Daniel Ehrenberg <dehrenberg@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 01 Nov, 2011 1 commit
    • Cross Memory Attach · fcf63409
      Christopher Yeoh authored
      The basic idea behind cross memory attach is to allow MPI programs doing
      intra-node communication to do a single copy of the message rather than a
      double copy of the message via shared memory.
      The following patch attempts to achieve this by allowing a destination
      process, given an address and size from a source process, to copy memory
      directly from the source process into its own address space via a system
      call.  There is also a symmetrical ability to copy from the current
      process's address space into a destination process's address space.
      - Use of /proc/pid/mem has been considered, but there are issues with
        using it:
        - Does not allow for specifying iovecs for both src and dest; assuming
          preadv or pwritev was implemented, either the area read from or
          written to would need to be contiguous.
        - Currently mem_read allows only processes who are currently
          ptrace'ing the target and are still able to ptrace the target to read
          from the target.  This check could possibly be moved to the open call,
          but it's not clear exactly what race this restriction is stopping
          (reason appears to have been lost).
        - Having to send the fd of /proc/self/mem via SCM_RIGHTS on a unix
          domain socket is a bit ugly from a userspace point of view,
          especially when you may have hundreds if not (eventually) thousands
          of processes that all need to do this with each other.
        - Doesn't allow for some future use of the interface we would like to
          consider adding in the future (see below).
        - Interestingly, reading from /proc/pid/mem currently actually
          involves two copies!  (But this could be fixed pretty easily.)
      As mentioned previously, use of vmsplice instead was considered, but it
      has problems.  Since you need the reader and writer working
      co-operatively, if the pipe is not drained then you block, which requires
      some wrapping to do non-blocking on the send side or polling on the
      receive side.  In all-to-all communication it requires ordering,
      otherwise you can deadlock.  And in the example of many MPI tasks writing
      to one MPI task, vmsplice serialises the copying.
      There are some cases of MPI collectives where even a single copy interface
      does not get us the performance gain we could.  For example in an
      MPI_Reduce rather than copy the data from the source we would like to
      instead use it directly in a mathops (say the reduce is doing a sum) as
      this would save us doing a copy.  We don't need to keep a copy of the data
      from the source.  I haven't implemented this, but I think this interface
      could in the future do all this through the use of the flags - eg could
      specify the math operation and type and the kernel rather than just
      copying the data would apply the specified operation between the source
      and destination and store it in the destination.
      Although we don't have a "second user" of the interface (though I've had
      some nibbles from people who may be interested in using it for intra
      process messaging which is not MPI), this interface is something which
      hardware vendors are already doing for their custom drivers to implement
      fast local communication.  So in addition to this being useful for
      OpenMPI, it would mean the driver maintainers don't have to fix things up
      when the mm changes.
      There was some discussion about how much faster a true zero copy would
      go. Here's a link back to the email with some testing I did on that:
      There is a basic man page for the proposed interface here:
      This has been implemented for x86 and powerpc; other architectures should
      mainly (I think) just need to add syscall numbers for process_vm_readv
      and process_vm_writev.  There are 32-bit compatibility versions for
      64-bit kernels.
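      From userspace, a single-copy read through the new syscall might look
      like the sketch below.  The helper name read_from_process() is
      hypothetical.  The raw syscall() form is used here for portability;
      glibc 2.15 and later also provide a process_vm_readv() wrapper in
      <sys/uio.h> with the same arguments.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Copy len bytes from address remote_addr in process pid into local_buf.
 * Returns the number of bytes copied, or -1 with errno set.
 */
static ssize_t read_from_process(pid_t pid, void *local_buf,
				 void *remote_addr, size_t len)
{
	struct iovec local = { .iov_base = local_buf, .iov_len = len };
	struct iovec remote = { .iov_base = remote_addr, .iov_len = len };

	/* One local and one remote segment; the flags argument is 0. */
	return (ssize_t)syscall(SYS_process_vm_readv, (long)pid,
				&local, 1UL, &remote, 1UL, 0UL);
}
```

      The caller needs ptrace-style permission over the target process, so the
      call can fail with EPERM or be filtered out in restricted environments.
      Both iovec arrays may hold several segments, which is the scatter/gather
      ability that /proc/pid/mem lacks.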
      For arch maintainers there are some simple tests to be able to quickly
      verify that the syscalls are working correctly here:
      http://ozlabs.org/~cyeoh/cma/cma-test-20110718.tgz
      Signed-off-by: Chris Yeoh <yeohc@au1.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: <linux-man@vger.kernel.org>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 23 Mar, 2011 1 commit
    • aio: wake all waiters when destroying ctx · e91f90bb
      Roland Dreier authored
      The test program below will hang because io_getevents() uses
      add_wait_queue_exclusive(), which means the wake_up() in io_destroy() only
      wakes up one of the threads.  Fix this by using wake_up_all() in the aio
      code paths where we want to make sure no one gets stuck.
      	// t.c -- compile with gcc t.c -lpthread -laio
      	#include <libaio.h>
      	#include <pthread.h>
      	#include <stdio.h>
      	#include <unistd.h>
      	static const int nthr = 2;
      	void *getev(void *ctx)
      	{
      		struct io_event ev;
      		// blocks until an event arrives or the context is destroyed
      		io_getevents(ctx, 1, 1, &ev, NULL);
      		printf("io_getevents returned\n");
      		return NULL;
      	}
      	int main(int argc, char *argv[])
      	{
      		io_context_t ctx = 0;
      		pthread_t thread[nthr];
      		int i;
      		io_setup(1024, &ctx);
      		for (i = 0; i < nthr; ++i)
      			pthread_create(&thread[i], NULL, getev, ctx);
      		// give the threads time to block, then destroy the context
      		sleep(1);
      		io_destroy(ctx);
      		for (i = 0; i < nthr; ++i)
      			pthread_join(thread[i], NULL);
      		return 0;
      	}
      Signed-off-by: Roland Dreier <roland@purestorage.com>
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 10 Mar, 2011 3 commits
  7. 25 Feb, 2011 2 commits
  8. 26 Jan, 2011 1 commit
    • fs/aio: aio_wq isn't used in memory reclaim path · d37adaa1
      Tejun Heo authored
      aio_wq isn't used during memory reclaim.  Convert to alloc_workqueue()
      without WQ_MEM_RECLAIM.  It's possible to use system_wq but given that
      the number of work items is determined from userland and the work item
      may block, enforcing strict concurrency limit would be a good idea.
      Also, move fput_work to system_wq so that aio_wq is used solely to
      throttle the max concurrency of aio work items and fput_work doesn't
      interact with other work items.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Jeff Moyer <jmoyer@redhat.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: linux-aio@kvack.org
  9. 17 Jan, 2011 1 commit
  10. 13 Jan, 2011 2 commits
  11. 26 Oct, 2010 2 commits
    • new helper: ihold() · 7de9c6ee
      Al Viro authored
      Clones an existing reference to inode; caller must already hold one.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • aio: bump i_count instead of using igrab · 306fb097
      Chris Mason authored
      The aio batching code is using igrab to get an extra reference on the
      inode so it can safely batch.  igrab will go ahead and take the global
      inode spinlock, which can be a bottleneck on large machines doing lots
      of AIO.
      In this case, igrab isn't required because we already have a reference
      on the file handle.  It is safe to just bump the i_count directly
      on the inode.
      Benchmarking shows this patch brings IOP/s on tons of flash up by about
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  12. 23 Sep, 2010 1 commit
    • aio: do not return ERESTARTSYS as a result of AIO · a0c42bac
      Jan Kara authored
      OCFS2 can return ERESTARTSYS from its write function when the process is
      signalled while waiting for a cluster lock (and the filesystem is mounted
      with intr mount option).  Generally, it seems reasonable to allow
      filesystems to return this error code from their IO functions.  As we must
      not leak ERESTARTSYS (and similar error codes) to userspace as a result of
      an AIO operation, we have to properly convert it to EINTR inside AIO code
      (restarting the syscall isn't really an option because other AIO could
      have been already submitted by the same io_submit syscall).
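      The conversion described above can be sketched as a small helper.  This
      is an illustration, not the patch itself: the function name
      aio_sanitize_result() is hypothetical, and the K_ constants are local
      copies of the kernel-internal restart codes (values as in the kernel's
      include/linux/errno.h), since userland <errno.h> does not define them.

```c
#include <errno.h>

/* Kernel-internal restart codes, copied here under a K_ prefix. */
#define K_ERESTARTSYS		512
#define K_ERESTARTNOINTR	513
#define K_ERESTARTNOHAND	514
#define K_ERESTART_RESTARTBLOCK	516

/*
 * Map restart codes, which must never reach userspace as an AIO result,
 * to -EINTR.  Every other value (byte counts, ordinary errors) is
 * passed through unchanged.
 */
static long aio_sanitize_result(long ret)
{
	switch (ret) {
	case -K_ERESTARTSYS:
	case -K_ERESTARTNOINTR:
	case -K_ERESTARTNOHAND:
	case -K_ERESTART_RESTARTBLOCK:
		return -EINTR;
	default:
		return ret;
	}
}
```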
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Zach Brown <zach.brown@oracle.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 15 Sep, 2010 1 commit
    • aio: check for multiplication overflow in do_io_submit · 75e1c70f
      Jeff Moyer authored
      Tavis Ormandy pointed out that do_io_submit does not do proper bounds
      checking on the passed-in iocb array:
             if (unlikely(nr < 0))
                     return -EINVAL;
             if (unlikely(!access_ok(VERIFY_READ, iocbpp, (nr*sizeof(iocbpp)))))
                     return -EFAULT;                      ^^^^^^^^^^^^^^^^^^
      The attached patch checks for overflow, and if it is detected, the
      number of iocbs submitted is scaled down to a number that will fit in
      the long.  This is an ok thing to do, as sys_io_submit is documented as
      returning the number of iocbs submitted, so callers should handle a
      return value of less than the 'nr' argument passed in.
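      The scaling-down step can be sketched as below.  The helper name
      clamp_submit_count() is hypothetical, and returning -1 for a negative
      count is a simplification of the -EINVAL path shown above.

```c
#include <limits.h>
#include <stddef.h>

/*
 * Clamp nr so that nr * elem_size cannot overflow a long.
 * Returns the (possibly reduced) count, or -1 if nr is negative.
 * elem_size must be non-zero; callers are expected to pass a sizeof.
 */
static long clamp_submit_count(long nr, size_t elem_size)
{
	if (nr < 0)
		return -1;
	if ((unsigned long)nr > LONG_MAX / elem_size)
		nr = (long)(LONG_MAX / elem_size);
	return nr;
}
```

      After clamping, the product nr * elem_size is at most LONG_MAX, so the
      size passed to the access check can no longer wrap around.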
      Reported-by: Tavis Ormandy <taviso@cmpxchg8b.com>
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. 05 Aug, 2010 1 commit
  15. 28 May, 2010 1 commit
    • get rid of the magic around f_count in aio · d7065da0
      Al Viro authored
      __aio_put_req() plays sick games with file refcount.  What
      it wants is fput() from atomic context; it's almost always
      done with f_count > 1, so they only have to deal with delayed
      work in rare cases when their reference happens to be the
      last one.  Current code decrements f_count and if it hasn't
      hit 0, everything is fine.  Otherwise it keeps a pointer
      to struct file (with zero f_count!) around and has delayed
      work do __fput() on it.
      Better way to do it: use atomic_long_add_unless( , -1, 1)
      instead of !atomic_long_dec_and_test().  IOW, decrement it
      only if it's not the last reference, leave refcount alone
      if it was.  And use normal fput() in delayed work.
      I've made that atomic_long_add_unless call a new helper -
      fput_atomic().  Drops a reference to file if it's safe to
      do in atomic (i.e. if that's not the last one), tells if
      it had been able to do that.  aio.c converted to it, __fput()
      use is gone.  req->ki_file *always* contributes to refcount
      now.  And __fput() became static.
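      The "decrement unless it is the last reference" idea can be sketched in
      userland with C11 atomics.  The names long_add_unless() and
      put_ref_unless_last() are hypothetical stand-ins for the kernel's
      atomic_long_add_unless() and the new fput_atomic() helper; this is not
      the kernel implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Add a to *v unless *v == u; returns true if the add happened. */
static bool long_add_unless(atomic_long *v, long a, long u)
{
	long cur = atomic_load(v);

	while (cur != u) {
		/* On failure, cur is refreshed with the current value. */
		if (atomic_compare_exchange_weak(v, &cur, cur + a))
			return true;
	}
	return false;
}

/*
 * Drop one reference only if it is not the last one.  A false return
 * means the count was left alone and the final put has to be done
 * later from a context where it is safe.
 */
static bool put_ref_unless_last(atomic_long *count)
{
	return long_add_unless(count, -1, 1);
}
```

      Because the count never reaches zero on this path, the object is never
      left around with a zero refcount while the deferred work is pending.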
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  16. 27 May, 2010 1 commit
    • aio: fix the compat vectored operations · 9d85cba7
      Jeff Moyer authored
      The aio compat code was not converting the struct iovecs from 32bit to
      64bit pointers, causing either EINVAL to be returned from io_getevents, or
      EFAULT as the result of the I/O.  This patch passes a compat flag to
      io_submit to signal that pointer conversion is necessary for a given iocb
      A variant of this was tested by Michael Tokarev.  I have also updated the
      libaio test harness to exercise this code path with good success.
      Further, I grabbed a copy of ltp and ran the
      testcases/kernel/syscall/readv and writev tests there (compiled with -m32
      on my 64bit system).  All seems happy, but extra eyes on this would be
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: fix CONFIG_COMPAT=n build]
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Reported-by: Michael Tokarev <mjt@tls.msk.ru>
      Cc: Zach Brown <zach.brown@oracle.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  17. 16 Dec, 2009 1 commit
  18. 29 Oct, 2009 1 commit
  19. 28 Oct, 2009 1 commit
    • aio: implement request batching · cfb1e33e
      Jeff Moyer authored
      Some workloads issue batches of small I/O, and the performance is poor
      due to the call to blk_run_address_space for every single iocb.  Nathan
      Roberts pointed this out, and suggested that by deferring this call
      until all I/Os in the iocb array are submitted to the block layer, we
      can realize some impressive performance gains (up to 30% for sequential
      4k reads in batches of 16).
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  20. 23 Sep, 2009 1 commit
  21. 22 Sep, 2009 1 commit
  22. 01 Jul, 2009 1 commit
  23. 19 Mar, 2009 2 commits
  24. 14 Jan, 2009 1 commit
  25. 29 Dec, 2008 1 commit
    • aio: make the lookup_ioctx() lockless · abf137dd
      Jens Axboe authored
      The mm->ioctx_list is currently protected by a reader-writer lock,
      so we always grab that lock on the read side for doing ioctx
      lookups. As the workload is extremely reader biased, turn this into
      an rcu hlist so we can make lookup_ioctx() lockless. Get rid of
      the rwlock and use a spinlock for providing update side exclusion.
      There's usually only 1 entry on this list, so it doesn't make sense
      to look into fancier data structures.
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  26. 27 Jul, 2008 1 commit
  27. 25 Jul, 2008 1 commit
    • kill PF_BORROWED_MM in favour of PF_KTHREAD · 246bb0b1
      Oleg Nesterov authored
      Kill PF_BORROWED_MM.  Change use_mm/unuse_mm to not play with ->flags, and
      do s/PF_BORROWED_MM/PF_KTHREAD/ for a couple of other users.
      No functional changes yet.  But this allows us to do further
      oom_kill/ptrace/etc often check "p->mm != NULL" to filter out the
      kthreads, this is wrong because of use_mm().  The problem with
      PF_BORROWED_MM is that we need task_lock() to avoid races.  With this
      patch we can check PF_KTHREAD directly, or use a simple lockless helper:
      	/* The result must not be dereferenced !!! */
      	struct mm_struct *__get_task_mm(struct task_struct *tsk)
      	{
      		if (tsk->flags & PF_KTHREAD)
      			return NULL;
      		return tsk->mm;
      	}
      Note also ecard_task().  It runs with ->mm != NULL, but it's the kernel
      thread without PF_BORROWED_MM.
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Roland McGrath <roland@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  28. 06 Jun, 2008 1 commit
  29. 30 Apr, 2008 1 commit
  30. 29 Apr, 2008 3 commits
  31. 28 Apr, 2008 1 commit
  32. 11 Apr, 2008 1 commit