    hmm: heterogeneous memory management documentation · bffc33ec
    Jérôme Glisse authored
    Patch series "HMM (Heterogeneous Memory Management)", v25.
    
    Heterogeneous Memory Management (HMM) (description and justification)
    
    Today device drivers expose dedicated memory allocation APIs through their
    device files, often relying on a combination of IOCTL and mmap calls.
    The device can only access and use memory allocated through this API.
    This effectively splits the program address space into objects allocated
    for the device and usable by the device, and other regular memory
    (malloc, mmap of a file, shared memory, ...) only accessible by the
    CPU (or in a very limited way by a device by pinning memory).
    
    Allowing different isolated components of a program to use a device thus
    requires duplicating the input data structures using the device memory
    allocator.  This is reasonable for simple data structures (array, grid,
    image, ...) but it gets extremely complex with advanced data
    structures (list, tree, graph, ...) that rely on a web of memory
    pointers.  This is becoming a serious limitation on the kind of
    workloads that can be offloaded to devices like GPUs.
    
    New industry standards like C++, OpenCL and CUDA are pushing to remove
    this barrier.  This requires a shared address space between the GPU
    device and the CPU so that the GPU can access any memory of a process
    (while still obeying memory protections like read only).  This kind of
    feature is also appearing in various other operating systems.
    
    HMM is a set of helpers to facilitate several aspects of address space
    sharing and device memory management.  Unlike existing sharing mechanisms
    that rely on pinning the pages used by a device, HMM relies on
    mmu_notifier to propagate CPU page table updates to the device page table.
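
    The mirroring idea can be illustrated with a small userspace model.  This
    is only a sketch and not kernel code: struct toy_mm, toy_invalidate_range(),
    toy_cpu_update() and toy_device_fault() are hypothetical names invented
    for this example and are not part of mmu_notifier or HMM.  The model
    assumes that the device copy of an entry is dropped before the CPU entry
    changes, and is refilled on the next device fault.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 8  /* size of the toy address space, in pages */

/* One page table entry: frame number plus validity and write permission. */
struct pte {
    bool valid;
    bool writable;
    unsigned long pfn;
};

/* Toy address space: the CPU page table and the device's mirror of it. */
struct toy_mm {
    struct pte cpu[NPAGES];
    struct pte dev[NPAGES];
};

/* Notifier step: drop the device entries for [start, end). */
static void toy_invalidate_range(struct toy_mm *mm, size_t start, size_t end)
{
    assert(mm && start <= end && end <= NPAGES);
    for (size_t i = start; i < end; i++)
        mm->dev[i].valid = false;
}

/* CPU-side update: invalidate the mirror first, then change the CPU entry. */
static void toy_cpu_update(struct toy_mm *mm, size_t idx, struct pte new_pte)
{
    toy_invalidate_range(mm, idx, idx + 1);
    mm->cpu[idx] = new_pte;
}

/*
 * Device fault: copy the current CPU entry into the mirror.  Returns false
 * if the CPU entry is not valid or lacks the requested write permission.
 */
static bool toy_device_fault(struct toy_mm *mm, size_t idx, bool write)
{
    assert(mm && idx < NPAGES);
    if (!mm->cpu[idx].valid || (write && !mm->cpu[idx].writable))
        return false;
    mm->dev[idx] = mm->cpu[idx];
    return true;
}
```

    In this model the device never holds an entry that disagrees with the CPU
    page table, and no page has to be pinned to achieve that.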
    
    Duplicating the CPU page table is only one aspect necessary for
    efficiently using a device like a GPU.  GPU local memory has bandwidth in
    the terabytes/second range, but it is connected to main memory through a
    system bus like PCIe that is limited to 32 gigabytes/second (PCIe 4.0
    x16).  Thus it is necessary to allow migration of process memory from main
    system memory to device memory.  The issue is that on platforms that only
    have PCIe, the device memory is not accessible by the CPU with the same
    properties as main memory (cache coherency, atomic operations, ...).
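
    To put rough numbers on that gap, here is a small arithmetic sketch.  The
    32 GB/s bus figure comes from the text above; the 1 TB/s local memory
    figure and the 8 GB data set are assumptions picked only for
    illustration, and the function names are hypothetical.

```c
#include <assert.h>

/* Seconds needed to move `bytes` over a link sustaining `bytes_per_sec`. */
static double transfer_seconds(double bytes, double bytes_per_sec)
{
    assert(bytes >= 0.0 && bytes_per_sec > 0.0);
    return bytes / bytes_per_sec;
}

/* How many times faster device-local memory is than the system bus. */
static double bandwidth_ratio(double local_bytes_per_sec,
                              double bus_bytes_per_sec)
{
    assert(local_bytes_per_sec > 0.0 && bus_bytes_per_sec > 0.0);
    return local_bytes_per_sec / bus_bytes_per_sec;
}
```

    With these assumptions, transfer_seconds(8e9, 32e9) gives 0.25 seconds
    for a single pass over the data set across the bus, and
    bandwidth_ratio(1e12, 32e9) gives 31.25, so a job that reads its input
    many times is better served by migrating the data once.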
    
    To allow migration from main memory to device memory, HMM provides a set
    of helpers to hotplug device memory as a new type of ZONE_DEVICE memory
    which is un-addressable by the CPU but still has struct pages representing
    it.  This allows most of the core kernel logic that deals with process
    memory to stay oblivious to the peculiarities of device memory.
    
    When a page backing an address of a process is migrated to device memory,
    the CPU page table entry is set to a new specific swap entry.  CPU access
    to such an address triggers a migration back to system memory, just as if
    the page had been swapped out to disk.  HMM also blocks anyone from
    pinning a ZONE_DEVICE page so that it can always be migrated back to
    system memory if the CPU accesses it.  Conversely, HMM does not migrate to
    device memory any page that is pinned in system memory.
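
    The rules in this paragraph can be written down as a tiny state model.
    This is a userspace sketch only; struct toy_page and the toy_*()
    functions are hypothetical and do not correspond to kernel structures or
    to the HMM API.

```c
#include <assert.h>
#include <stdbool.h>

/* Where the data of a toy page currently lives. */
enum toy_location { TOY_SYSTEM_MEMORY, TOY_DEVICE_MEMORY };

struct toy_page {
    enum toy_location where;
    bool pinned;       /* pinned in system memory */
    unsigned faults;   /* CPU faults taken on the special entry */
};

/* Migrate to device memory; refused for pinned pages. */
static bool toy_migrate_to_device(struct toy_page *p)
{
    assert(p);
    if (p->pinned || p->where == TOY_DEVICE_MEMORY)
        return false;
    /* The CPU entry would now become the special swap-like entry. */
    p->where = TOY_DEVICE_MEMORY;
    return true;
}

/* Pin a page; refused while the page lives in device memory. */
static bool toy_pin(struct toy_page *p)
{
    assert(p);
    if (p->where == TOY_DEVICE_MEMORY)
        return false;
    p->pinned = true;
    return true;
}

/* CPU access: a page in device memory faults and migrates back first. */
static void toy_cpu_access(struct toy_page *p)
{
    assert(p);
    if (p->where == TOY_DEVICE_MEMORY) {
        p->faults++;
        p->where = TOY_SYSTEM_MEMORY;
    }
}
```

    The two refusals are what keep the model consistent: a page in device
    memory can always come back on CPU access, and a pinned page never leaves
    system memory.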
    
    To allow efficient migration between device memory and main memory, a new
    migrate_vma() helper is added with this patchset.  It allows leveraging
    the device DMA engine to perform the copy operation.
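
    The division of labour (a generic helper walks the range, the driver
    supplies the copy engine) can be sketched in userspace as follows.  This
    is not the migrate_vma() interface: toy_migrate_range(), toy_copy_fn and
    the skip[] array are hypothetical, and memcpy() merely stands in for a
    device DMA engine.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

/* Driver-supplied copy routine, standing in for a device DMA engine. */
typedef void (*toy_copy_fn)(void *dst, const void *src, size_t len);

/*
 * Copy npages pages from src to dst using the supplied engine, skipping
 * pages flagged in skip[] (for example because they are pinned).
 * Returns the number of pages actually copied.
 */
static size_t toy_migrate_range(unsigned char *dst, const unsigned char *src,
                                const unsigned char *skip, size_t npages,
                                toy_copy_fn copy)
{
    size_t migrated = 0;

    assert(dst && src && skip && copy);
    for (size_t i = 0; i < npages; i++) {
        if (skip[i])
            continue;
        copy(dst + i * TOY_PAGE_SIZE, src + i * TOY_PAGE_SIZE, TOY_PAGE_SIZE);
        migrated++;
    }
    return migrated;
}

/* Fallback engine for the sketch: a plain CPU copy. */
static void toy_cpu_copy(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
}
```

    Because the engine is a parameter, the same loop works for copies to the
    device, from the device, or between any other source and destination.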
    
    This feature will be used by upstream drivers like nouveau and mlx5, and
    probably others in the future (amdgpu is the next suspect in line).  We
    are actively working on nouveau and mlx5 support.  To test this patchset
    we also worked with the NVidia closed source driver team; they have more
    resources than us to test this kind of infrastructure and also a bigger
    and better userspace ecosystem with various real industry workloads that
    can be used to test and profile HMM.
    
    The expected workload is a program that builds a data set on the CPU (from
    disk, from network, from sensors, ...).  The program uses a GPU API
    (OpenCL, CUDA, ...) to give hints on memory placement for the input data
    and also for the output buffer.  The program calls the GPU API to schedule
    a GPU job; this happens using a device driver specific ioctl.  All this is
    hidden from the programmer's point of view in the case of a C++ compiler
    that transparently offloads some part of a program to the GPU.  The
    program can keep doing other work on the CPU while the GPU is crunching
    numbers.
    
    It is expected that the CPU will not access the same data set as the GPU
    while the GPU is working on it, but this is not mandatory.  In fact we
    expect some small memory objects to be actively accessed by both GPU and
    CPU concurrently as a synchronization channel and/or for monitoring
    purposes.  Such objects will stay in system memory and should not be
    bottlenecked by system bus bandwidth (rare write and read accesses from
    both CPU and GPU).
    
    As we are relying on the device driver API, HMM does not introduce any new
    syscall nor does it modify any existing ones.  It does not change any
    POSIX semantics or behaviors.  For instance, the child after a fork of a
    process that is using HMM will not be impacted in any way, nor is there
    any data hazard between child COW or parent COW of memory that was
    migrated to device prior to fork.
    
    HMM assumes a number of hardware features.  The device must allow the
    device page table to be updated at any time (i.e. device jobs must be
    preemptable).  The device page table must provide memory protection such
    as read only.  The device must track write access (dirty bit).  The device
    must have a minimum granularity that matches PAGE_SIZE (i.e. 4k).
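
    As a checklist, these requirements could be expressed like this.  The
    capability record and the function are hypothetical and exist only for
    this sketch; "matches PAGE_SIZE" is read here as equality, which is an
    assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical capability record; not a kernel structure. */
struct toy_dev_caps {
    bool preemptable;           /* page table can be updated at any time */
    bool read_only_protection;  /* page table can express read only */
    bool dirty_tracking;        /* device tracks write access */
    size_t min_granularity;     /* smallest mapping size, in bytes */
};

/* True only when every hardware feature listed above is present. */
static bool toy_meets_hmm_requirements(const struct toy_dev_caps *caps,
                                       size_t page_size)
{
    assert(caps && page_size > 0);
    return caps->preemptable &&
           caps->read_only_protection &&
           caps->dirty_tracking &&
           caps->min_granularity == page_size;
}
```

    A device that lacks any one of the four, for example one that can only
    map 64k units when PAGE_SIZE is 4k, fails the check.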
    
    Reviewers (just hints):
    Patch 1  HMM documentation
    Patch 2  introduces the core infrastructure and definitions of HMM, a
             pretty small patch and easy to review
    Patch 3  introduces the mirror functionality of HMM; it relies on
             mmu_notifier and thus someone familiar with that part would be
             in a better position to review
    Patch 4  is a helper to snapshot the CPU page table while synchronizing
             with concurrent page table updates. Understanding mmu_notifier
             makes review easier.
    Patch 5  is mostly a wrapper around handle_mm_fault()
    Patch 6  adds a new add_pages() helper to avoid modifying each arch's
             memory hotplug function
    Patch 7  adds a new memory type for ZONE_DEVICE and also adds all the
             logic in various core mm code to support this new type. Dan
             Williams and any core mm contributor are the best people to
             review each half of this patch
    Patch 8  special-cases HMM ZONE_DEVICE pages inside put_page(). Kirill and
             Dan Williams are the best people to review this
    Patch 9  allows uncharging a page from a memory cgroup without using the
             lru list field of struct page (best reviewer: Johannes Weiner or
             Vladimir Davydov or Michal Hocko)
    Patch 10 adds support for uncharging a ZONE_DEVICE page from a memory
             cgroup (best reviewer: Johannes Weiner or Vladimir Davydov or
             Michal Hocko)
    Patch 11 adds a helper to hotplug un-addressable device memory as a new
             type of ZONE_DEVICE memory (new type introduced in patch 7 of
             this series). This is boilerplate code around memory hotplug and
             it also picks a free range of physical addresses for the device
             memory. Note that the physical addresses do not point to
             anything (at least as far as the kernel knows).
    Patch 12 introduces a new hmm_device class as a helper for device drivers
             that want to expose multiple device memories under a common fake
             device driver. This is useful for multi-GPU configurations.
             Anyone familiar with device driver infrastructure can review
             this. Boilerplate code really.
    Patch 13 adds a new migrate mode. Anyone familiar with page migration is
             welcome to review.
    Patch 14 introduces a new migration helper (migrate_vma()) that allows
             migrating a range of virtual addresses of a process using the
             device DMA engine to perform the copy. It is not limited to
             copies from and to the device but can also copy between any kind
             of source and destination memory. Again, anyone familiar with
             migration code should be able to verify the logic.
    Patch 15 optimizes the new migrate_vma() by unmapping pages while we are
             collecting them. This can be reviewed by any mm folks.
    Patch 16 adds unaddressable memory migration to the helper introduced in
             patch 14; this can be reviewed by anyone familiar with migration
             code
    Patch 17 adds a feature that allows a device to allocate non-present pages
             on the GPU when migrating a range of addresses to device memory.
             This is a helper for device drivers to avoid having to first
             allocate system memory before migration to device memory
    Patch 18 adds a new kind of ZONE_DEVICE memory for cache coherent device
             memory (CDM)
    Patch 19 adds a helper to hotplug CDM memory
    
    Previous patchset postings:
    v1 http://lwn.net/Articles/597289/
    v2 https://lkml.org/lkml/2014/6/12/559
    v3 https://lkml.org/lkml/2014/6/13/633
    v4 https://lkml.org/lkml/2014/8/29/423
    v5 https://lkml.org/lkml/2014/11/3/759
    v6 http://lwn.net/Articles/619737/
    v7 http://lwn.net/Articles/627316/
    v8 https://lwn.net/Articles/645515/
    v9 https://lwn.net/Articles/651553/
    v10 https://lwn.net/Articles/654430/
    v11 http://www.gossamer-threads.com/lists/linux/kernel/2286424
    v12 http://www.kernelhub.org/?msg=972982&p=2
    v13 https://lwn.net/Articles/706856/
    v14 https://lkml.org/lkml/2016/12/8/344
    v15 http://www.mail-archive.com/linux-kernel@xxxxxxxxxxxxxxx/msg1304107.html
    v16 http://www.spinics.net/lists/linux-mm/msg119814.html
    v17 https://lkml.org/lkml/2017/1/27/847
    v18 https://lkml.org/lkml/2017/3/16/596
    v19 https://lkml.org/lkml/2017/4/5/831
    v20 https://lwn.net/Articles/720715/
    v21 https://lkml.org/lkml/2017/4/24/747
    v22 http://lkml.iu.edu/hypermail/linux/kernel/1705.2/05176.html
    v23 https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1404788.html
    v24 https://lwn.net/Articles/726691/
    
    This patch (of 19):
    
    This adds documentation for HMM (Heterogeneous Memory Management).  It
    presents the motivation behind it, the features necessary for it to be
    useful, and gives an overview of how it is implemented.
    
    Link: http://lkml.kernel.org/r/20170817000548.32038-2-jglisse@redhat.com
    
    
    Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: David Nellans <dnellans@nvidia.com>
    Cc: Balbir Singh <bsingharora@gmail.com>
    Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
    Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Mark Hairgrove <mhairgrove@nvidia.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
    Cc: Sherry Cheung <SCheung@nvidia.com>
    Cc: Subhash Gutti <sgutti@nvidia.com>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Cc: Bob Liu <liubo95@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>