sched/x86: Optimize switch_mm() for multi-threaded workloads

Dick Fowles, Don Zickus and Joe Mario have been working on
improvements to perf, and noticed heavy cache line contention
on the mm_cpumask, running linpack on a 60 core / 120 thread
system.

The cause turned out to be unnecessary atomic accesses to the
mm_cpumask. When in lazy TLB mode, the CPU is only removed from
the mm_cpumask if there is a TLB flush event.

Most of the time, no such TLB flush happens, and the kernel
skips the TLB reload. It can also skip the atomic memory
set & test.

Here is a summary of Joe's test results:

 * The __schedule function dropped from 24% of all program cycles down
   to 5.5%.

 * The cacheline contention/hotness for accesses to that bitmask went
   from being the 1st/2nd hottest - down to the 84th hottest (0.3% of
   all shared misses which is now quite cold)

 * The average load latency for the bit-test-n-set instruction in
   __schedule dropped from 10k-15k cycles down to an average of 600 cycles.

 * The linpack program results improved from 133 GFlops to 144 GFlops.
   Peak GFlops rose from 133 to 153.

Reported-by: Don Zickus <dzickus@redhat.com>
Reported-by: Joe Mario <jmario@redhat.com>
Tested-by: Joe Mario <jmario@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Paul Turner <pjt@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20130731221421.616d3d20@annuminas.surriel.com
[ Made the comments consistent around the modified code. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This commit is contained in:
Rik van Riel 2013-07-31 22:14:21 -04:00 committed by Ingo Molnar
parent 46591962cb
commit 8f898fbbe5
1 changed files with 13 additions and 7 deletions

View File

@ -45,12 +45,10 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
/* Re-load page tables */ /* Re-load page tables */
load_cr3(next->pgd); load_cr3(next->pgd);
/* stop flush ipis for the previous mm */ /* Stop flush ipis for the previous mm */
cpumask_clear_cpu(cpu, mm_cpumask(prev)); cpumask_clear_cpu(cpu, mm_cpumask(prev));
/* /* Load the LDT, if the LDT is different: */
* load the LDT, if the LDT is different:
*/
if (unlikely(prev->context.ldt != next->context.ldt)) if (unlikely(prev->context.ldt != next->context.ldt))
load_LDT_nolock(&next->context); load_LDT_nolock(&next->context);
} }
@ -59,8 +57,16 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK); this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next); BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
if (!cpumask_test_and_set_cpu(cpu, mm_cpumask(next))) { if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
/* We were in lazy tlb mode and leave_mm disabled /*
* On established mms, the mm_cpumask is only changed
* from irq context, from ptep_clear_flush() while in
* lazy tlb mode, and here. Irqs are blocked during
* schedule, protecting us from simultaneous changes.
*/
cpumask_set_cpu(cpu, mm_cpumask(next));
/*
* We were in lazy tlb mode and leave_mm disabled
* tlb flush IPI delivery. We must reload CR3 * tlb flush IPI delivery. We must reload CR3
* to make sure to use no freed page tables. * to make sure to use no freed page tables.
*/ */