linux-imx/Documentation/vm/split_page_table_lock

Split page table lock

Originally, mm->page_table_lock spinlock protected all page tables of the mm_struct. But this approach leads to poor page fault scalability of multi-threaded applications due to high contention on the lock. To improve scalability, split page table lock was introduced.

With split page table lock we have a separate per-table lock to serialize access to the table. At the moment we use split lock for PTE and PMD tables. Access to higher level tables is protected by mm->page_table_lock.

There are helpers to lock/unlock a table and other accessor functions:

  • pte_offset_map_lock() maps PTE and takes PTE table lock, returns pointer to PTE together with pointer to the taken lock;
  • pte_unmap_unlock() unlocks and unmaps PTE table;
  • pte_alloc_map_lock() allocates PTE table if needed and takes its lock, returns pointer to PTE together with pointer to the taken lock, or NULL if allocation failed;
  • pte_lockptr() returns pointer to PTE table lock;
  • pmd_lock() takes PMD table lock, returns pointer to taken lock;
  • pmd_lockptr() returns pointer to PMD table lock.
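
As a rough illustration of the per-table idea, here is a user-space sketch. It is not kernel code: every model_* name is hypothetical, and a C11 atomic_flag stands in for spinlock_t. model_table_lock() follows the pmd_lock() convention of taking the table's own lock and returning a pointer to it.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for spinlock_t (not the kernel type). */
typedef struct { atomic_flag held; } model_spinlock_t;

static void model_spin_lock(model_spinlock_t *lock)
{
	while (atomic_flag_test_and_set_explicit(&lock->held,
						 memory_order_acquire))
		;	/* spin until the holder releases the lock */
}

static void model_spin_unlock(model_spinlock_t *lock)
{
	atomic_flag_clear_explicit(&lock->held, memory_order_release);
}

#define MODEL_ENTRIES 8

/* Each table carries its own lock, next to the mm-wide one. */
struct model_table {
	model_spinlock_t ptl;
	unsigned long entry[MODEL_ENTRIES];
};

struct model_mm {
	model_spinlock_t page_table_lock;	/* would guard higher levels */
	struct model_table tables[2];
};

static void model_mm_init(struct model_mm *mm)
{
	atomic_flag_clear(&mm->page_table_lock.held);
	for (int t = 0; t < 2; t++) {
		atomic_flag_clear(&mm->tables[t].ptl.held);
		for (int i = 0; i < MODEL_ENTRIES; i++)
			mm->tables[t].entry[i] = 0;
	}
}

/* Take the per-table lock and return a pointer to the taken lock. */
static model_spinlock_t *model_table_lock(struct model_mm *mm,
					  struct model_table *table)
{
	(void)mm;	/* the mm-wide lock is not needed in the split case */
	model_spin_lock(&table->ptl);
	return &table->ptl;
}

/* Update one entry under the lock of the table that holds it. */
static void model_set_entry(struct model_mm *mm, int t, int idx,
			    unsigned long val)
{
	struct model_table *table = &mm->tables[t];
	model_spinlock_t *ptl = model_table_lock(mm, table);

	table->entry[idx] = val;
	model_spin_unlock(ptl);
}
```

Because each table has its own lock, two threads updating entries in different tables of the same model_mm do not contend with each other.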

Split page table lock for PTE tables is enabled at compile time if CONFIG_SPLIT_PTLOCK_CPUS (usually 4) is less than or equal to NR_CPUS. If split lock is disabled, all tables are guarded by mm->page_table_lock.

Split page table lock for PMD tables is enabled if it's enabled for PTE tables and the architecture supports it (see below).
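
The two compile-time conditions can be sketched as plain C macros. This is a stand-alone model under assumed values, not the kernel's configuration machinery: the MODEL_* names and the numbers chosen for them are hypothetical.

```c
#include <assert.h>

/* Hypothetical values; in a real build they would come from Kconfig. */
#define MODEL_NR_CPUS				8
#define MODEL_SPLIT_PTLOCK_CPUS			4
#define MODEL_ARCH_ENABLE_SPLIT_PMD_PTLOCK	1

/* PTE split lock: enabled when the threshold does not exceed NR_CPUS. */
#define MODEL_USE_SPLIT_PTE_PTLOCKS \
	(MODEL_NR_CPUS >= MODEL_SPLIT_PTLOCK_CPUS)

/* PMD split lock: needs the PTE split lock plus architecture support. */
#define MODEL_USE_SPLIT_PMD_PTLOCKS \
	(MODEL_USE_SPLIT_PTE_PTLOCKS && MODEL_ARCH_ENABLE_SPLIT_PMD_PTLOCK)

enum model_lock_kind { MODEL_MM_LOCK, MODEL_TABLE_LOCK };

/* Which lock guards PTE tables under the configuration above. */
static enum model_lock_kind model_pte_lock_kind(void)
{
	return MODEL_USE_SPLIT_PTE_PTLOCKS ? MODEL_TABLE_LOCK : MODEL_MM_LOCK;
}

/* Which lock guards PMD tables under the configuration above. */
static enum model_lock_kind model_pmd_lock_kind(void)
{
	return MODEL_USE_SPLIT_PMD_PTLOCKS ? MODEL_TABLE_LOCK : MODEL_MM_LOCK;
}
```

With the assumed values both levels use per-table locks. Setting MODEL_NR_CPUS to 2 would make both functions fall back to the mm-wide lock.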

Hugetlb and split page table lock

Hugetlb can support several page sizes. We use split lock only for PMD level, but not for PUD.

Hugetlb-specific helpers:

  • huge_pte_lock() takes PMD split lock for PMD_SIZE page, mm->page_table_lock otherwise;
  • huge_pte_lockptr() returns pointer to table lock.
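
The lock selection described above can be sketched in user space as follows. The model_* names are hypothetical, and MODEL_PMD_SIZE is assumed to be 2 MiB (as on x86-64) purely for illustration.

```c
#include <assert.h>

/* Assumption: 2 MiB PMD-sized huge pages, as on x86-64. */
#define MODEL_PMD_SIZE	(2UL << 20)

/* Hypothetical placeholder for spinlock_t; only its address matters here. */
struct model_lock { int unused; };

struct model_mm { struct model_lock page_table_lock; };
struct model_pmd_table { struct model_lock ptl; };

/*
 * Pick the lock for a huge page mapping: the PMD table's split lock
 * for PMD_SIZE pages, the mm-wide lock for any other size.
 */
static struct model_lock *
model_huge_pte_lockptr(unsigned long huge_page_size, struct model_mm *mm,
		       struct model_pmd_table *table)
{
	if (huge_page_size == MODEL_PMD_SIZE)
		return &table->ptl;
	return &mm->page_table_lock;
}
```

A 1 GiB page, which would be mapped at PUD level, therefore ends up under the mm-wide lock in this model.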

Support of split page table lock by an architecture

There's no need to specially enable PTE split page table lock: everything required is done by pgtable_page_ctor() and pgtable_page_dtor(), which must be called on PTE table allocation / freeing.

Make sure the architecture doesn't use the slab allocator for page table allocation: slab uses page->slab_cache for its pages, and this field shares storage with page->ptl.

PMD split lock only makes sense if you have more than two page table levels.

Enabling PMD split lock requires a pgtable_pmd_page_ctor() call on PMD table allocation and a pgtable_pmd_page_dtor() call on freeing.

Allocation usually happens in pmd_alloc_one(), freeing in pmd_free() and pmd_free_tlb(), but make sure you cover all PMD table allocation / freeing paths: e.g. X86_PAE preallocates a few PMDs in pgd_alloc().

With everything in place you can set CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK.

NOTE: pgtable_page_ctor() and pgtable_pmd_page_ctor() can fail -- it must be handled properly.
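
One way to handle a failing constructor is to release the freshly allocated table and report the failure to the caller. The sketch below models that pattern in user space. All model_* names are hypothetical, malloc()/free() stand in for page allocation, and the constructor is assumed to return true on success.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical page descriptor: only the lock pointer is modelled. */
struct model_page { void *ptl; };

/* Knob for this sketch only: forces the constructor to fail. */
static bool model_fail_ctor;

/* Set up the table lock; returns false if that is not possible. */
static bool model_pgtable_ctor(struct model_page *page)
{
	if (model_fail_ctor)
		return false;
	page->ptl = malloc(sizeof(long));
	return page->ptl != NULL;
}

/* Undo what the constructor did. */
static void model_pgtable_dtor(struct model_page *page)
{
	free(page->ptl);
	page->ptl = NULL;
}

/* Allocate a table; a constructor failure must not leak the page. */
static struct model_page *model_table_alloc(void)
{
	struct model_page *page = calloc(1, sizeof(*page));

	if (!page)
		return NULL;
	if (!model_pgtable_ctor(page)) {
		free(page);	/* give the page back before bailing out */
		return NULL;
	}
	return page;
}

/* Free a table: destructor first, then the page itself. */
static void model_table_free(struct model_page *page)
{
	model_pgtable_dtor(page);
	free(page);
}
```

The important detail is the order in the failure path: the page is freed before NULL is returned, so a failed constructor leaves nothing behind.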

page->ptl

page->ptl is used to access the split page table lock, where 'page' is the struct page of the page containing the table. It shares storage with page->private (and a few other fields in the union).

To avoid increasing size of struct page and have best performance, we use a trick:

  • if spinlock_t fits into long, we use page->ptl as spinlock, so we can avoid indirect access and save a cache line;
  • if size of spinlock_t is bigger than size of long, we use page->ptl as pointer to spinlock_t and allocate it dynamically. This allows using split lock with DEBUG_SPINLOCK or DEBUG_LOCK_ALLOC enabled, but costs one more cache line for indirect access.

The spinlock_t is allocated in pgtable_page_ctor() for PTE tables and in pgtable_pmd_page_ctor() for PMD tables.
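
The embedded-versus-allocated choice can be sketched in user space as below. This is a model under assumptions, not the kernel's definitions: model_spinlock_t, MODEL_DEBUG_LOCK, MODEL_ALLOC_PTLOCK and the model_* helpers are all hypothetical names. Building with -DMODEL_DEBUG_LOCK grows the lock beyond a long and switches the model to the pointer variant.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for spinlock_t. With MODEL_DEBUG_LOCK defined
 * it gains debug fields and no longer fits into a long.
 */
typedef struct {
	int locked;
#ifdef MODEL_DEBUG_LOCK
	void *owner;
	const char *name;
#endif
} model_spinlock_t;

#ifdef MODEL_DEBUG_LOCK
#define MODEL_ALLOC_PTLOCK 1
#else
#define MODEL_ALLOC_PTLOCK 0
#endif

/* Keep the hand-set switch honest: it must agree with the real sizes. */
_Static_assert(MODEL_ALLOC_PTLOCK == (sizeof(model_spinlock_t) > sizeof(long)),
	       "MODEL_ALLOC_PTLOCK does not match sizeof(model_spinlock_t)");

struct model_page {
#if MODEL_ALLOC_PTLOCK
	model_spinlock_t *ptl;	/* pointer to a separately allocated lock */
#else
	model_spinlock_t ptl;	/* lock embedded in the page descriptor */
#endif
};

/* Returns 0 on success, -1 if the lock could not be allocated. */
static int model_ptlock_init(struct model_page *page)
{
#if MODEL_ALLOC_PTLOCK
	page->ptl = calloc(1, sizeof(*page->ptl));
	return page->ptl ? 0 : -1;
#else
	page->ptl.locked = 0;
	return 0;
#endif
}

/* The only accessor callers should use; it hides both layouts. */
static model_spinlock_t *model_ptlock_ptr(struct model_page *page)
{
#if MODEL_ALLOC_PTLOCK
	return page->ptl;
#else
	return &page->ptl;
#endif
}

static void model_ptlock_free(struct model_page *page)
{
#if MODEL_ALLOC_PTLOCK
	free(page->ptl);
	page->ptl = NULL;
#else
	(void)page;	/* embedded lock: nothing to free */
#endif
}
```

Callers that go through model_ptlock_ptr() compile unchanged in both configurations, which is the reason for hiding the field behind a helper.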

Please never access page->ptl directly -- use the appropriate helper.