1. 11 Oct, 2016 1 commit
    • Vlastimil Babka's avatar
      fs/select: add vmalloc fallback for select(2) · 2d19309c
      Vlastimil Babka authored
      The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows
      with the number of fds passed. We had a customer report page allocation
      failures of order-4 for this allocation. This is a costly order, so it might
      easily fail, as the VM expects such allocation to have a lower-order fallback.
      Such trivial fallback is vmalloc(), as the memory doesn't have to be physically
      contiguous and the allocation is temporary for the duration of the syscall
      only. There were some concerns, whether this would have negative impact on the
      system by exposing vmalloc() to userspace. Although an excessive use of vmalloc
      can cause some system wide performance issues - TLB flushes etc. - a large
      order allocation is not for free either and an excessive reclaim/compaction can
      have a similar effect. Also note that the size is effectively limited by
      RLIMIT_NOFILE which defaults to 1024 on the systems I checked. That means the
      bitmaps will fit well within single page and thus the vmalloc() fallback could
      be only excercised for processes where root allows a higher limit.
      Note that the poll(2) syscall seems to use a linked list of order-0 pages, so
      it doesn't need this kind of fallback.
      [eric.dumazet@gmail.com: fix failure path logic]
      [akpm@linux-foundation.org: use proper type for size]
      Link: http://lkml.kernel.org/r/20160927084536.5923-1-vbabka@suse.czSigned-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Jason Baron <jbaron@akamai.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  2. 20 May, 2016 1 commit
  3. 17 Mar, 2016 1 commit
    • John Stultz's avatar
      timer: convert timer_slack_ns from unsigned long to u64 · da8b44d5
      John Stultz authored
      This patchset introduces a /proc/<pid>/timerslack_ns interface which
      would allow controlling processes to be able to set the timerslack value
      on other processes in order to save power by avoiding wakeups (Something
      Android currently does via out-of-tree patches).
      The first patch tries to fix the internal timer_slack_ns usage which was
      defined as a long, which limits the slack range to ~4 seconds on 32bit
      systems.  It converts it to a u64, which provides the same basically
      unlimited slack (500 years) on both 32bit and 64bit machines.
      The second patch introduces the /proc/<pid>/timerslack_ns interface
      which allows the full 64bit slack range for a task to be read or set on
      both 32bit and 64bit machines.
      With these two patches, on a 32bit machine, after setting the slack on
      bash to 10 seconds:
      $ time sleep 1
      real    0m10.747s
      user    0m0.001s
      sys     0m0.005s
      The first patch is a little ugly, since I had to chase the slack delta
      arguments through a number of functions converting them to u64s.  Let me
      know if it makes sense to break that up more or not.
      Other than that things are fairly straightforward.
      This patch (of 2):
      The timer_slack_ns value in the task struct is currently a unsigned
      long.  This means that on 32bit applications, the maximum slack is just
      over 4 seconds.  However, on 64bit machines, its much much larger (~500
      This disparity could make application development a little (as well as
      the default_slack) to a u64.  This means both 32bit and 64bit systems
      have the same effective internal slack range.
      Now the existing ABI via PR_GET_TIMERSLACK and PR_SET_TIMERSLACK specify
      the interface as a unsigned long, so we preserve that limitation on
      32bit systems, where SET_TIMERSLACK can only set the slack to a unsigned
      long value, and GET_TIMERSLACK will return ULONG_MAX if the slack is
      actually larger then what can be stored by an unsigned long.
      This patch also modifies hrtimer functions which specified the slack
      delta as a unsigned long.
      Signed-off-by: default avatarJohn Stultz <john.stultz@linaro.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Oren Laadan <orenl@cellrox.com>
      Cc: Ruchi Kandoi <kandoiruchi@google.com>
      Cc: Rom Lemarchand <romlem@android.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Android Kernel Team <kernel-team@android.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  4. 06 Jan, 2016 1 commit
  5. 19 May, 2015 1 commit
  6. 13 Feb, 2015 1 commit
    • Andy Lutomirski's avatar
      all arches, signal: move restart_block to struct task_struct · f56141e3
      Andy Lutomirski authored
      If an attacker can cause a controlled kernel stack overflow, overwriting
      the restart block is a very juicy exploit target.  This is because the
      restart_block is held in the same memory allocation as the kernel stack.
      Moving the restart block to struct task_struct prevents this exploit by
      making the restart_block harder to locate.
      Note that there are other fields in thread_info that are also easy
      targets, at least on some architectures.
      It's also a decent simplification, since the restart code is more or less
      identical on all architectures.
      [james.hogan@imgtec.com: metag: align thread_info::supervisor_stack]
      Signed-off-by: default avatarAndy Lutomirski <luto@amacapital.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: David Miller <davem@davemloft.net>
      Acked-by: Richard Weinberger's avatarRichard Weinberger <richard@nod.at>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Steven Miao <realmz6@gmail.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Tested-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Chen Liqin <liqin.linux@gmail.com>
      Cc: Lennox Wu <lennox.wu@gmail.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: default avatarJames Hogan <james.hogan@imgtec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  7. 30 Oct, 2013 1 commit
    • Rafael J. Wysocki's avatar
      Revert "select: use freezable blocking call" · 59612d18
      Rafael J. Wysocki authored
      This reverts commit 9745cdb3 (select: use freezable blocking call)
      that triggers problems during resume from suspend to RAM on Paul Bolle's
      32-bit x86 machines.  Paul says:
        Ever since I tried running (release candidates of) v3.11 on the two
        working i686s I still have lying around I ran into issues on resuming
        from suspend. Reverting 9745cdb3 (select: use freezable blocking
        call) resolves those issues.
        Resuming from suspend on i686 on (release candidates of) v3.11 and
        later triggers issues like:
        traps: systemd[1] general protection ip:b738e490 sp:bf882fc0 error:0 in libc-2.16.so[b731c000+1b0000]
        traps: rtkit-daemon[552] general protection ip:804d6e5 sp:b6cb32f0 error:0 in rtkit-daemon[8048000+d000]
        Once I hit the systemd error I can only get out of the mess that the
        system is at that point by power cycling it.
      Since we are reverting another freezer-related change causing similar
      problems to happen, this one should be reverted as well.
      References: https://lkml.org/lkml/2013/10/29/583Reported-by: default avatarPaul Bolle <pebolle@tiscali.nl>
      Fixes: 9745cdb3 (select: use freezable blocking call)
      Signed-off-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: 3.11+ <stable@vger.kernel.org> # 3.11+
  8. 25 Oct, 2013 1 commit
  9. 11 Jul, 2013 1 commit
  10. 09 Jul, 2013 2 commits
  11. 02 Jul, 2013 1 commit
  12. 01 Jul, 2013 1 commit
  13. 25 Jun, 2013 1 commit
    • Eliezer Tamir's avatar
      net: poll/select low latency socket support · 2d48d67f
      Eliezer Tamir authored
      select/poll busy-poll support.
      Split sysctl value into two separate ones, one for read and one for poll.
      updated Documentation/sysctl/net.txt
      Add a new poll flag POLL_LL. When this flag is set, sock_poll will call
      sk_poll_ll if possible. sock_poll sets this flag in its return value
      to indicate to select/poll when a socket that can busy poll is found.
      When poll/select have nothing to report, call the low-level
      sock_poll again until we are out of time or we find something.
      Once the system call finds something, it stops setting POLL_LL, so it can
      return the result to the user ASAP.
      Signed-off-by: default avatarEliezer Tamir <eliezer.tamir@linux.intel.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  14. 12 May, 2013 1 commit
    • Colin Cross's avatar
      select: use freezable blocking call · 9745cdb3
      Colin Cross authored
      Avoid waking up every thread sleeping in a select call during
      suspend and resume by calling a freezable blocking call.  Previous
      patches modified the freezer to avoid sending wakeups to threads
      that are blocked in freezable blocking calls.
      This call was selected to be converted to a freezable call because
      it doesn't hold any locks or release any resources when interrupted
      that might be needed by another freezing task or a kernel driver
      during suspend, and is a common site where idle userspace tasks are
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarColin Cross <ccross@android.com>
      Signed-off-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
  15. 07 Feb, 2013 1 commit
  16. 27 Sep, 2012 2 commits
  17. 26 Jul, 2012 1 commit
    • Josh Boyer's avatar
      posix_types.h: Cleanup stale __NFDBITS and related definitions · 8ded2bbc
      Josh Boyer authored
      Recently, glibc made a change to suppress sign-conversion warnings in
      FD_SET (glibc commit ceb9e56b3d1).  This uncovered an issue with the
      kernel's definition of __NFDBITS if applications #include
      <linux/types.h> after including <sys/select.h>.  A build failure would
      be seen when passing the -Werror=sign-compare and -D_FORTIFY_SOURCE=2
      flags to gcc.
      It was suggested that the kernel should either match the glibc
      definition of __NFDBITS or remove that entirely.  The current in-kernel
      uses of __NFDBITS can be replaced with BITS_PER_LONG, and there are no
      uses of the related __FDELT and __FDMASK defines.  Given that, we'll
      continue the cleanup that was started with commit 8b3d1cda
      ("posix_types: Remove fd_set macros") and drop the remaining unused
      Additionally, linux/time.h has similar macros defined that expand to
      nothing so we'll remove those at the same time.
      Reported-by: default avatarJeff Law <law@redhat.com>
      Suggested-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      CC: <stable@vger.kernel.org>
      Signed-off-by: default avatarJosh Boyer <jwboyer@redhat.com>
      [ .. and fix up whitespace as per akpm ]
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  18. 01 Jun, 2012 1 commit
  19. 23 Mar, 2012 1 commit
    • Hans Verkuil's avatar
      poll: add poll_requested_events() and poll_does_not_wait() functions · 626cf236
      Hans Verkuil authored
      In some cases the poll() implementation in a driver has to do different
      things depending on the events the caller wants to poll for.  An example
      is when a driver needs to start a DMA engine if the caller polls for
      POLLIN, but doesn't want to do that if POLLIN is not requested but instead
      only POLLOUT or POLLPRI is requested.  This is something that can happen
      in the video4linux subsystem among others.
      Unfortunately, the current epoll/poll/select implementation doesn't
      provide that information reliably.  The poll_table_struct does have it: it
      has a key field with the event mask.  But once a poll() call matches one
      or more bits of that mask any following poll() calls are passed a NULL
      poll_table pointer.
      Also, the eventpoll implementation always left the key field at ~0 instead
      of using the requested events mask.
      This was changed in eventpoll.c so the key field now contains the actual
      events that should be polled for as set by the caller.
      The solution to the NULL poll_table pointer is to set the qproc field to
      NULL in poll_table once poll() matches the events, not the poll_table
      pointer itself.  That way drivers can obtain the mask through a new
      poll_requested_events inline.
      The poll_table_struct can still be NULL since some kernel code calls it
      internally (netfs_state_poll() in ./drivers/staging/pohmelfs/netfs.h).  In
      that case poll_requested_events() returns ~0 (i.e.  all events).
      Very rarely drivers might want to know whether poll_wait will actually
      wait.  If another earlier file descriptor in the set already matched the
      events the caller wanted to wait for, then the kernel will return from the
      select() call without waiting.  This might be useful information in order
      to avoid doing expensive work.
      A new helper function poll_does_not_wait() is added that drivers can use
      to detect this situation.  This is now used in sock_poll_wait() in
      include/net/sock.h.  This was the only place in the kernel that needed
      this information.
      Drivers should no longer access any of the poll_table internals, but use
      the poll_requested_events() and poll_does_not_wait() access functions
      instead.  In order to enforce that the poll_table fields are now prepended
      with an underscore and a comment was added warning against using them
      This required a change in unix_dgram_poll() in unix/af_unix.c which used
      the key field to get the requested events.  It's been replaced by a call
      to poll_requested_events().
      For qproc it was especially important to change its name since the
      behavior of that field changes with this patch since this function pointer
      can now be NULL when that wasn't possible in the past.
      Any driver accessing the qproc or key fields directly will now fail to compile.
      Some notes regarding the correctness of this patch: the driver's poll()
      function is called with a 'struct poll_table_struct *wait' argument.  This
      pointer may or may not be NULL, drivers can never rely on it being one or
      the other as that depends on whether or not an earlier file descriptor in
      the select()'s fdset matched the requested events.
      There are only three things a driver can do with the wait argument:
      1) obtain the key field:
      	events = wait ? wait->key : ~0;
         This will still work although it should be replaced with the new
         poll_requested_events() function (which does exactly the same).
         This will now even work better, since wait is no longer set to NULL
      2) use the qproc callback. This could be deadly since qproc can now be
         NULL. Renaming qproc should prevent this from happening. There are no
         kernel drivers that actually access this callback directly, BTW.
      3) test whether wait == NULL to determine whether poll would return without
         waiting. This is no longer sufficient as the correct test is now
         wait == NULL || wait->_qproc == NULL.
         However, the worst that can happen here is a slight performance hit in
         the case where wait != NULL and wait->_qproc == NULL. In that case the
         driver will assume that poll_wait() will actually add the fd to the set
         of waiting file descriptors. Of course, poll_wait() will not do that
         since it tests for wait->_qproc. This will not break anything, though.
         There is only one place in the whole kernel where this happens
         (sock_poll_wait() in include/net/sock.h) and that code will be replaced
         by a call to poll_does_not_wait() in the next patch.
         Note that even if wait->_qproc != NULL drivers cannot rely on poll_wait()
         actually waiting. The next file descriptor from the set might match the
         event mask and thus any possible waits will never happen.
      Signed-off-by: default avatarHans Verkuil <hans.verkuil@cisco.com>
      Reviewed-by: default avatarJonathan Corbet <corbet@lwn.net>
      Reviewed-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: default avatarHans de Goede <hdegoede@redhat.com>
      Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  20. 29 Feb, 2012 1 commit
  21. 22 Feb, 2012 1 commit
    • Linus Torvalds's avatar
      sys_poll: fix incorrect type for 'timeout' parameter · faf30900
      Linus Torvalds authored
      The 'poll()' system call timeout parameter is supposed to be 'int', not
      Now, the reason this matters is that right now 32-bit compat mode is
      broken on at least x86-64, because the 32-bit code just calls
      'sys_poll()' directly on x86-64, and the 32-bit argument will have been
      zero-extended, turning a signed 'int' into a large unsigned 'long'
      We could just introduce a 'compat_sys_poll()' function for this, and
      that may eventually be what we have to do, but since the actual standard
      poll() semantics is *supposed* to be 'int', and since at least on x86-64
      glibc sign-extends the argument before invocing the system call (so
      nobody can actually use a 64-bit timeout value in user space _anyway_,
      even in 64-bit binaries), the simpler solution would seem to be to just
      fix the definition of the system call to match what it should have been
      from the very start.
      If it turns out that somebody somehow circumvents the user-level libc
      64-bit sign extension and actually uses a large unsigned 64-bit timeout
      despite that not being how poll() is supposed to work, we will need to
      do the compat_sys_poll() approach.
      Reported-by: default avatarThomas Meyer <thomas@m3y3r.de>
      Acked-by: default avatarEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  22. 19 Feb, 2012 1 commit
    • David Howells's avatar
      Replace the fd_sets in struct fdtable with an array of unsigned longs · 1fd36adc
      David Howells authored
      Replace the fd_sets in struct fdtable with an array of unsigned longs and then
      use the standard non-atomic bit operations rather than the FD_* macros.
       (1) Removes the abuses of struct fd_set:
           (a) Since we don't want to allocate a full fd_set the vast majority of the
           	 time, we actually, in effect, just allocate a just-big-enough array of
           	 unsigned longs and cast it to an fd_set type - so why bother with the
           	 fd_set at all?
           (b) Some places outside of the core fdtable handling code (such as
           	 SELinux) want to look inside the array of unsigned longs hidden inside
           	 the fd_set struct for more efficient iteration over the entire set.
       (2) Eliminates the use of FD_*() macros in the kernel completely.
       (3) Permits the __FD_*() macros to be deleted entirely where not exposed to
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Link: http://lkml.kernel.org/r/20120216174954.23314.48147.stgit@warthog.procyon.org.ukSigned-off-by: default avatarH. Peter Anvin <hpa@zytor.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
  23. 21 Mar, 2011 1 commit
  24. 13 Jan, 2011 1 commit
  25. 28 Oct, 2010 2 commits
  26. 12 Mar, 2010 1 commit
  27. 06 Mar, 2010 1 commit
  28. 04 Oct, 2009 1 commit
  29. 23 Sep, 2009 1 commit
  30. 16 Aug, 2009 1 commit
  31. 17 Jun, 2009 1 commit
  32. 14 Jan, 2009 3 commits
  33. 13 Jan, 2009 1 commit
  34. 06 Jan, 2009 1 commit
    • Tejun Heo's avatar
      poll: allow f_op->poll to sleep · 5f820f64
      Tejun Heo authored
      f_op->poll is the only vfs operation which is not allowed to sleep.  It's
      because poll and select implementation used task state to synchronize
      against wake ups, which doesn't have to be the case anymore as wait/wake
      interface can now use custom wake up functions.  The non-sleep restriction
      can be a bit tricky because ->poll is not called from an atomic context
      and the result of accidentally sleeping in ->poll only shows up as
      temporary busy looping when the timing is right or rather wrong.
      This patch converts poll/select to use custom wake up function and use
      separate triggered variable to synchronize against wake up events.  The
      only added overhead is an extra function call during wake up and
      This patch removes the one non-sleep exception from vfs locking rules and
      is beneficial to userland filesystem implementations like FUSE, 9p or
      peculiar fs like spufs as it's very difficult for those to implement
      non-sleeping poll method.
      While at it, make the following cosmetic changes to make poll.h and
      select.c checkpatch friendly.
      * s/type * symbol/type *symbol/		   : three places in poll.h
      * remove blank line before EXPORT_SYMBOL() : two places in select.c
      Oleg: spotted missing barrier in poll_schedule_timeout()
      Davide: spotted missing write barrier in pollwake()
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Eric Van Hensbergen <ericvh@gmail.com>
      Cc: Ron Minnich <rminnich@sandia.gov>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@suse.cz>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Cc: Brad Boyer <flar@allandria.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  35. 26 Oct, 2008 1 commit