    mm: zero reserved and unavailable struct pages · a4a3ede2
    Pavel Tatashin authored
    Some memory is reserved but unavailable: not present in memblock.memory
    (because not backed by physical pages), but present in memblock.reserved.
    Such memory has backing struct pages, but they are not initialized by
    going through __init_single_page().
    
    In some cases these struct pages are accessed even if they do not
    contain any data.  One example is page_to_pfn() might access page->flags
    if this is where section information is stored (CONFIG_SPARSEMEM,
    SECTION_IN_PAGE_FLAGS).
    
    One example of such memory: trim_low_memory_range() unconditionally
    reserves from pfn 0, but e820__memblock_setup() might provide the
    existing memory from pfn 1 (e.g. under KVM).
    
    Since struct pages are zeroed in __init_single_page(), and not at
    allocation time, we must zero such struct pages explicitly.
    
    The patch adds a new memblock iterator:
    	for_each_resv_unavail_range(i, p_start, p_end)
    
    which iterates through ranges that are in memblock.reserved but not in
    memblock.memory; for each such range we zero the struct pages explicitly
    by calling mm_zero_struct_page().
    
    ===
    
    Here is a more detailed example of the problem that this patch addresses:
    
    Test run on qemu with the following arguments:
    
    	-enable-kvm -cpu kvm64 -m 512 -smp 2
    
    This patch reports that there are 98 unavailable pages.
    
    They are: pfn 0 and pfns in range [159, 255].
    
    Note that trim_low_memory_range() reserves only pfns in the range
    [0, 15]; it does not reserve the [159, 255] ones.
    
    e820__memblock_setup() reports to Linux that the following physical ranges
    are available:
        [  1,    158]
        [256, 130783]
    
    Notice that exactly the unavailable pfns are missing!
    
    Now, let's check what we have in zone 0: [1, 131039]
    
    pfn 0 is not part of the zone, but pfns [1, 158] are.
    
    However, the bigger problem we have if we do not initialize these struct
    pages is with memory hotplug.  That path operates at 2M boundaries
    (section_nr) and checks whether a 2M range of pages is hot-removable.
    It starts with the first pfn from a zone, rounds it down to a 2M
    boundary (struct pages are allocated at 2M boundaries when the vmemmap
    is created), and checks whether that section is hot-removable.  In this
    case we start with pfn 1 and round it down to pfn 0.  Later the pfn is
    converted to a struct page, and some of its fields are checked.  If we
    do not zero struct pages, we get unpredictable results.
    
    In fact, when CONFIG_DEBUG_VM is enabled and we explicitly set all
    vmemmap memory to ones, the following panic is observed with a kernel
    test without this patch applied:
    
      BUG: unable to handle kernel NULL pointer dereference at          (null)
      IP: is_pageblock_removable_nolock+0x35/0x90
      PGD 0 P4D 0
      Oops: 0000 [#1] PREEMPT
      ...
      task: ffff88001f4e2900 task.stack: ffffc90000314000
      RIP: 0010:is_pageblock_removable_nolock+0x35/0x90
      Call Trace:
       ? is_mem_section_removable+0x5a/0xd0
       show_mem_removable+0x6b/0xa0
       dev_attr_show+0x1b/0x50
       sysfs_kf_seq_show+0xa1/0x100
       kernfs_seq_show+0x22/0x30
       seq_read+0x1ac/0x3a0
       kernfs_fop_read+0x36/0x190
       ? security_file_permission+0x90/0xb0
       __vfs_read+0x16/0x30
       vfs_read+0x81/0x130
       SyS_read+0x44/0xa0
       entry_SYSCALL_64_fastpath+0x1f/0xbd
    
    Link: http://lkml.kernel.org/r/20171013173214.27300-7-pasha.tatashin@oracle.com
    
    
    Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
    Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
    Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
    Reviewed-by: Bob Picco <bob.picco@oracle.com>
    Tested-by: Bob Picco <bob.picco@oracle.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
    Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Christian Borntraeger <borntraeger@de.ibm.com>
    Cc: David S. Miller <davem@davemloft.net>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
    Cc: "H. Peter Anvin" <hpa@zytor.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Sam Ravnborg <sam@ravnborg.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Will Deacon <will.deacon@arm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>