1. 29 Jan, 2013 1 commit
    • Steven Whitehouse's avatar
      GFS2: Split gfs2_trans_add_bh() into two · 350a9b0a
      Steven Whitehouse authored
      There is little common content in gfs2_trans_add_bh() between the data
      and meta classes by the time that the functions which it calls are
      taken into account. The intent here is to split this into two
      separate functions. Stage one is to introduce gfs2_trans_add_data()
      and gfs2_trans_add_meta() and update the callers accordingly.
      Later patches will then pull in the content of gfs2_trans_add_bh()
      and its dependent functions in order to clean up the code in this
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
  2. 18 Dec, 2012 1 commit
  3. 07 Nov, 2012 3 commits
    • Steven Whitehouse's avatar
      GFS2: Add Orlov allocator · 9dbe9610
      Steven Whitehouse authored
      Just like ext3, this works on the root directory and any directory
      with the +T flag set. Also, just like ext3, any subdirectory created
      in one of the just mentioned cases will be allocated to a random
      resource group (GFS2 equivalent of a block group).
      If you are creating a set of directories, each of which will contain a
      job running on a different node, then by setting +T on the parent
      directory before creating the subdirectories, each will land up in a
      different resource group, and thus resource group contention between
      nodes will be kept to a minimum.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
    • Benjamin Marzinski's avatar
      GFS2: Don't call file_accessed() with a shared glock · 3d162688
      Benjamin Marzinski authored
      file_accessed() was being called by gfs2_mmap() with a shared glock. If it
      needed to update the atime, it was crashing because it dirtied the inode in
      gfs2_dirty_inode() without holding an exclusive lock. gfs2_dirty_inode()
      checked if the caller was already holding a glock, but it didn't make sure that
      the glock was in the exclusive state. Now, instead of calling file_accessed()
      while holding the shared lock in gfs2_mmap(), file_accessed() is called after
      grabbing and releasing the glock to update the inode.  If file_accessed() needs
      to update the atime, it will grab an exclusive lock in gfs2_dirty_inode().
      gfs2_dirty_inode() now also checks to make sure that if the calling process has
      already locked the glock, it has an exclusive lock.
      Signed-off-by: default avatarBenjamin Marzinski <bmarzins@redhat.com>
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
    • Andrew Price's avatar
      GFS2: Clean up some unused assignments · 73738a77
      Andrew Price authored
      Cleans up two cases where variables were assigned values but then never
      used again.
      Signed-off-by: default avatarAndrew Price <anprice@redhat.com>
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
  4. 09 Oct, 2012 1 commit
    • Konstantin Khlebnikov's avatar
      mm: kill vma flag VM_CAN_NONLINEAR · 0b173bc4
      Konstantin Khlebnikov authored
      Move actual pte filling for non-linear file mappings into the new special
      vma operation: ->remap_pages().
      Filesystems must implement this method to get non-linear mapping support,
      if it uses filemap_fault() then generic_file_remap_pages() can be used.
      Now device drivers can implement this method and obtain nonlinear vma support.
      Signed-off-by: default avatarKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>	#arch/tile
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  5. 24 Sep, 2012 1 commit
    • Steven Whitehouse's avatar
      GFS2: Remove rs_requested field from reservations · 71f890f7
      Steven Whitehouse authored
      The rs_requested field is left over from the original allocation
      code, however this should have been a parameter passed to the
      various functions from gfs2_inplace_reserve() and not a member of the
      reservation structure as the value is not required after the
      initial allocation.
      This also helps simplify the code since we no longer need to set
      the rs_requested to zero. Also the gfs2_inplace_release()
      function can also be simplified since the reservation structure
      will always be defined when it is called, and the only remaining
      task is to unlock the rgrp if required. It can also now be
      called unconditionally too, resulting in a further simplification.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
  6. 13 Sep, 2012 1 commit
  7. 31 Jul, 2012 1 commit
  8. 30 Jul, 2012 1 commit
  9. 20 Jul, 2012 1 commit
  10. 19 Jul, 2012 1 commit
    • Bob Peterson's avatar
      GFS2: Reduce file fragmentation · 8e2e0047
      Bob Peterson authored
      This patch reduces GFS2 file fragmentation by pre-reserving blocks. The
      resulting improved on disk layout greatly speeds up operations in cases
      which would have resulted in interlaced allocation of blocks previously.
      A typical example of this is 10 parallel dd processes, each writing to a
      file in a common dirctory.
      The implementation uses an rbtree of reservations attached to each
      resource group (and each inode).
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
  11. 06 Jun, 2012 3 commits
    • Steven Whitehouse's avatar
      GFS2: Add "top dir" flag support · 23d0bb83
      Steven Whitehouse authored
      This patch adds support for the "top dir" flag. Currently this is unused
      but a subsequent patch is planned which will add support for the
      Orlov allocation policy when allocating subdirectories in a parent
      with this flag set.
      In order to ensure backward compatible behaviour, mkfs.gfs2 does
      not currently tag the root directory with this flag, it must always be
      set manually.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
    • Bob Peterson's avatar
      GFS2: Fold quota data into the reservations struct · 5407e242
      Bob Peterson authored
      This patch moves the ancillary quota data structures into the
      block reservations structure. This saves GFS2 some time and
      effort in allocating and deallocating the qadata structure.
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
    • Bob Peterson's avatar
      GFS2: Extend the life of the reservations · 0a305e49
      Bob Peterson authored
      This patch lengthens the lifespan of the reservations structure for
      inodes. Before, they were allocated and deallocated for every write
      operation. With this patch, they are allocated when the first write
      occurs, and deallocated when the last process closes the file.
      It's more efficient to do it this way because it saves GFS2 a lot of
      unnecessary allocates and frees. It also gives us more flexibility
      for the future: (1) we can now fold the qadata structure back into
      the structure and save those alloc/frees, (2) we can use this for
      multi-block reservations.
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
  12. 24 Apr, 2012 1 commit
  13. 31 Mar, 2012 1 commit
  14. 09 Mar, 2012 1 commit
    • Benjamin Marzinski's avatar
      GFS2: call gfs2_write_alloc_required for each chunk · 58a7d5fb
      Benjamin Marzinski authored
      gfs2_fallocate was calling gfs2_write_alloc_required() once at the start of
      the function. This caused problems since gfs2_write_alloc_required used a
      long unsigned int for the len, but gfs2_fallocate could allocate a much
      larger amount.  This patch will move the call into the loop where the
      chunks are actually allocated and zeroed out. This will keep the allocation
      size under the limit, and also allow gfs2_fallocate to quickly skip over
      sections of the file that are already completely allocated.
      fallcate_chunk was also not correctly setting the file size.  It was using the
      len veriable to find the last block written to, but by the time it was setting
      the size, the len variable had already been decremented to 0.
      Signed-off-by: default avatarBenjamin Marzinski <bmarzins@redhat.com>
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
  15. 28 Feb, 2012 3 commits
    • Steven Whitehouse's avatar
      GFS2: FITRIM ioctl support · 66fc061b
      Steven Whitehouse authored
      The FITRIM ioctl provides an alternative way to send discard requests to
      the underlying device. Using the discard mount option results in every
      freed block generating a discard request to the block device. This can
      be slow, since many block devices can only process discard requests of
      larger sizes, and also such operations can be time consuming.
      Rather than using the discard mount option, FITRIM allows a sweep of the
      filesystem on an occasional basis, and also to optionally avoid sending
      down discard requests for smaller regions.
      In GFS2 FITRIM will work at resource group granularity. There is a flag
      for each resource group which keeps track of which resource groups have
      been trimmed. This flag is reset whenever a deallocation occurs in the
      resource group, and set whenever a successful FITRIM of that resource
      group has taken place. This helps to reduce repeated discard requests
      for the same block ranges, again improving performance.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
    • Steven Whitehouse's avatar
      GFS2: Read resource groups on mount · a365fbf3
      Steven Whitehouse authored
      This makes mount take slightly longer, but at the same time, the first
      write to the filesystem will be faster too. It also means that if there
      is a problem in the resource index, then we can refuse to mount rather
      than having to try and report that when the first write occurs.
      In addition, to avoid recursive locking, we hvae to take account of
      instances when the rindex glock may already be held when we are
      trying to update the rbtree of resource groups.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
    • Bob Peterson's avatar
      GFS2: Ensure rindex is uptodate for fallocate · 9e73f571
      Bob Peterson authored
      This patch fixes a problem whereby gfs2_grow was failing and causing GFS2
      to assert. The problem was that when GFS2's fallocate operation tried to
      acquire an "allocation" it made sure the rindex was up to date, and if not,
      it called gfs2_rindex_update. However, if the file being fallocated was
      the rindex itself, it was already locked at that point. By calling
      gfs2_rindex_update at an earlier point in time, we bring rindex up to date
      and thereby avoid trying to lock it when the "allocation" is acquired.
      Signed-off-by: default avatarBob Peterson <rpeterso@redhat.com>
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
  16. 04 Jan, 2012 2 commits
  17. 22 Nov, 2011 1 commit
  18. 21 Nov, 2011 1 commit
    • Steven Whitehouse's avatar
      GFS2: O_(D)SYNC support for fallocate · 4442f2e0
      Steven Whitehouse authored
      Add sync of metadata after fallocate for O_SYNC files to ensure that we
      meet expectations for everything being on disk in this case.
      Unfortunately, the offset and len parameters are modified during the
      course of the fallocate function, so I've had to add a couple of new
      variables to call generic_write_sync() at the end.
      I know that potentially this will sync data as well within the range,
      but I think that is a fairly harmless side-effect overall, since we
      would not normally expect there to be any dirty data within the range in
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Benjamin Marzinski <bmarzins@redhat.com>
  19. 08 Nov, 2011 2 commits
  20. 28 Oct, 2011 1 commit
    • Andi Kleen's avatar
      vfs: do (nearly) lockless generic_file_llseek · ef3d0fd2
      Andi Kleen authored
      The i_mutex lock use of generic _file_llseek hurts.  Independent processes
      accessing the same file synchronize over a single lock, even though
      they have no need for synchronization at all.
      Under high utilization this can cause llseek to scale very poorly on larger
      This patch does some rethinking of the llseek locking model:
      First the 64bit f_pos is not necessarily atomic without locks
      on 32bit systems. This can already cause races with read() today.
      This was discussed on linux-kernel in the past and deemed acceptable.
      The patch does not change that.
      Let's look at the different seek variants:
      SEEK_SET: Doesn't really need any locking.
      If there's a race one writer wins, the other loses.
      For 32bit the non atomic update races against read()
      stay the same. Without a lock they can also happen
      against write() now.  The read() race was deemed
      acceptable in past discussions, and I think if it's
      ok for read it's ok for write too.
      => Don't need a lock.
      SEEK_END: This behaves like SEEK_SET plus it reads
      the maximum size too. Reading the maximum size would have the
      32bit atomic problem. But luckily we already have a way to read
      the maximum size without locking (i_size_read), so we
      can just use that instead.
      Without i_mutex there is no synchronization with write() anymore,
      however since the write() update is atomic on 64bit it just behaves
      like another racy SEEK_SET.  On non atomic 32bit it's the same
      as SEEK_SET.
      => Don't need a lock, but need to use i_size_read()
      SEEK_CUR: This has a read-modify-write race window
      on the same file. One could argue that any application
      doing unsynchronized seeks on the same file is already broken.
      But for the sake of not adding a regression here I'm
      using the file->f_lock to synchronize this. Using this
      lock is much better than the inode mutex because it doesn't
      synchronize between processes.
      => So still need a lock, but can use a f_lock.
      This patch implements this new scheme in generic_file_llseek.
      I dropped generic_file_llseek_unlocked and changed all callers.
      Signed-off-by: default avatarAndi Kleen <ak@linux.intel.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
  21. 21 Oct, 2011 8 commits
    • Benjamin Marzinski's avatar
      GFS2: rewrite fallocate code to write blocks directly · 64dd153c
      Benjamin Marzinski authored
      GFS2's fallocate code currently goes through the page cache. Since it's only
      writing to the end of the file or to holes in it, it doesn't need to, and it
      was causing issues on low memory environments. This patch pulls in some of
      Steve's block allocation work, and uses it to simply allocate the blocks for
      the file, and zero them out at allocation time.  It provides a slight
      performance increase, and it dramatically simplifies the code.
      Signed-off-by: default avatarBenjamin Marzinski <bmarzins@redhat.com>
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
    • Steven Whitehouse's avatar
      GFS2: Clean up ->page_mkwrite · 13d921e3
      Steven Whitehouse authored
      This patch brings gfs2's ->page_mkwrite uptodate with respect to the
      expectations set by the VM. Also added is a check to wait if the fs
      is frozen, before we attempt to get a glock. This will only work on
      the node which initiates the freeze, but thats ok since the transaction
      lock will still provide the expected barrier on other nodes.
      The major change here is that we return a locked page now, except when
      we don't return a page at all (error cases). This removes the race
      which required rechecking the page after it was returned.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
    • Steven Whitehouse's avatar
      GFS2: Fix AIL flush issue during fsync · b5b24d7a
      Steven Whitehouse authored
      Unfortunately, it is not enough to just ignore locked buffers during
      the AIL flush from fsync. We need to be able to ignore all buffers
      which are locked, dirty or pinned at this stage as they might have
      been added subsequent to the log flush earlier in the fsync function.
      In addition, this means that we no longer need to rely on i_mutex to
      keep out writes during fsync, so we can, as a side-effect, remove
      that protection too.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      Tested-By: default avatarAbhijith Das <adas@redhat.com>
    • Steven Whitehouse's avatar
      GFS2: Cache the most recently used resource group in the inode · 54335b1f
      Steven Whitehouse authored
      This means that after the initial allocation for any inode, the
      last used resource group is cached in the inode for future use.
      This drastically reduces the number of lookups of resource
      groups in the common case, and this the contention on that
      data structure.
      The allocation algorithm is the same as previously, except that we
      always check to see if the goal block is within the cached rgrp
      first before going to the rbtree to look one up.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
    • Steven Whitehouse's avatar
      GFS2: Fix lseek after SEEK_DATA, SEEK_HOLE have been added · 9453615a
      Steven Whitehouse authored
      We need to take the inode's glock whenever the inode's size
      is referenced, otherwise it might not be uptodate. Even
      though generic_file_llseek_unlocked() doesn't implement
      SEEK_DATA, SEEK_HOLE directly, it does reference the inode's
      size in those cases, so we need to add them to the list
      of origins which need the glock.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
    • Steven Whitehouse's avatar
      GFS2: Use ->dirty_inode() · ab9bbda0
      Steven Whitehouse authored
      The aim of this patch is to use the newly enhanced ->dirty_inode()
      super block operation to deal with atime updates, rather than
      piggy backing that code into ->write_inode() as is currently
      The net result is a simplification of the code in various places
      and a reduction of the number of gfs2_dinode_out() calls since
      this is now implied by ->dirty_inode().
      Some of the mark_inode_dirty() calls have been moved under glocks
      in order to take advantage of then being able to avoid locking in
      ->dirty_inode() when we already have suitable locks.
      One consequence is that generic_write_end() now correctly deals
      with file size updates, so that we do not need a separate check
      for that afterwards. This also, indirectly, means that fdatasync
      should work correctly on GFS2 - the current code always syncs the
      metadata whether it needs to or not.
      Has survived testing with postmark (with and without atime) and
      also fsx.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
    • Steven Whitehouse's avatar
      GFS2: Fix bug trap and journaled data fsync · f1818529
      Steven Whitehouse authored
      Journaled data requires that a complete flush of all dirty data for
      the file is done, in order that the ail flush which comes after
      will succeed.
      Also the recently enhanced bug trap can trigger falsely in case
      an ail flush from fsync races with a page read. This updates the
      bug trap such that it will ignore buffers which are locked and
      only trigger on dirty and/or pinned buffers when the ail flush
      is run from fsync. The original bug trap is retained when ail
      flush is run from ->go_sync()
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
    • Steven Whitehouse's avatar
      GFS2: Split data write & wait in fsync · 2f0264d5
      Steven Whitehouse authored
      Now that the data writing is part of fsync proper, we can split
      the waiting part out and do it later on. This reduces the
      number of waits that we do during fsync on average.
      There is also no need to take the i_mutex unless we are flushing
      metadata to disk, so we can move that to within the metadata
      flushing code.
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
  22. 21 Jul, 2011 1 commit
    • Josef Bacik's avatar
      fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers · 02c24a82
      Josef Bacik authored
      Btrfs needs to be able to control how filemap_write_and_wait_range() is called
      in fsync to make it less of a painful operation, so push down taking i_mutex and
      the calling of filemap_write_and_wait() down into the ->fsync() handlers.  Some
      file systems can drop taking the i_mutex altogether it seems, like ext3 and
      ocfs2.  For correctness sake I just pushed everything down in all cases to make
      sure that we keep the current behavior the same for everybody, and then each
      individual fs maintainer can make up their mind about what to do from there.
      Acked-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarJosef Bacik <josef@redhat.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
  23. 20 Jul, 2011 1 commit
  24. 15 Jul, 2011 1 commit
  25. 03 May, 2011 1 commit
    • Benjamin Marzinski's avatar
      GFS2: make sure fallocate bytes is a multiple of blksize · 6905d9e4
      Benjamin Marzinski authored
      The GFS2 fallocate code chooses a target size to for allocating chunks of
      space.  Whenever it can't find any resource groups with enough space free, it
      halves its target. Since this target is in bytes, eventually it will no longer
      be a multiple of blksize.  As long as there is more space available in the
      resource group than the target, this isn't a problem, since gfs2 will use the
      actual space available, which is always a multiple of blksize.  However,
      when gfs couldn't fallocate a bigger chunk than the target, it was using the
      non-blksize aligned number. This caused a BUG in later code that required
      blksize aligned offsets.  GFS2 now ensures that bytes is always a multiple of
      Signed-off-by: default avatarBenjamin Marzinski <bmarzins@redhat.com>
      Signed-off-by: default avatarSteven Whitehouse <swhiteho@redhat.com>