qemu/include/exec
Xin Tong 88e89a57f9 implementing victim TLB for QEMU system emulated TLB
QEMU system mode page table walks are expensive. Taken by running QEMU
qemu-system-x86_64 system mode on Intel PIN , a TLB miss and walking a
4-level page tables in guest Linux OS takes ~450 X86 instructions on
average.

QEMU system mode TLB is implemented using a directly-mapped hashtable.
This structure suffers from conflict misses. Increasing the
associativity of the TLB may not be the solution to conflict misses as
all the ways may have to be walked in serial.

A victim TLB is a TLB used to hold translations evicted from the
primary TLB upon replacement. The victim TLB lies between the main TLB
and its refill path. Victim TLB is of greater associativity (fully
associative in this patch). It takes longer to lookup the victim TLB,
but its likely better than a full page table walk. The memory
translation path is changed as follows :

Before Victim TLB:
1. Inline TLB lookup
2. Exit code cache on TLB miss.
3. Check for unaligned, IO accesses
4. TLB refill.
5. Do the memory access.
6. Return to code cache.

After Victim TLB:
1. Inline TLB lookup
2. Exit code cache on TLB miss.
3. Check for unaligned, IO accesses
4. Victim TLB lookup.
5. If victim TLB misses, TLB refill
6. Do the memory access.
7. Return to code cache

The advantage is that victim TLB can offer more associativity to a
directly mapped TLB and thus potentially fewer page table walks while
still keeping the time taken to flush within reasonable limits.
However, placing a victim TLB before the refill path increase TLB
refill path as the victim TLB is consulted before the TLB refill. The
performance results demonstrate that the pros outweigh the cons.

some performance results taken on SPECINT2006 train
datasets and kernel boot and qemu configure script on an
Intel(R) Xeon(R) CPU  E5620  @ 2.40GHz Linux machine are shown in the
Google Doc link below.

https://docs.google.com/spreadsheets/d/1eiItzekZwNQOal_h-5iJmC4tMDi051m9qidi5_nwvH4/edit?usp=sharing

In summary, victim TLB improves the performance of qemu-system-x86_64 by
11% on average on SPECINT2006, kernelboot and qemu configscript and with
highest improvement of in 26% in 456.hmmer. And victim TLB does not result
in any performance degradation in any of the measured benchmarks. Furthermore,
the implemented victim TLB is architecture independent and is expected to
benefit other architectures in QEMU as well.

Although there are measurement fluctuations, the performance
improvement is very significant and by no means in the range of
noises.

Signed-off-by: Xin Tong <trent.tong@gmail.com>
Message-id: 1407202523-23553-1-git-send-email-trent.tong@gmail.com
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2014-09-01 17:43:06 +01:00
..
user abitypes.h: Remove incorrect ARM ABI_LLONG_ALIGNMENT 2013-09-10 19:09:33 +01:00
address-spaces.h exec: move include files to include/exec/ 2012-12-19 08:31:31 +01:00
cpu-all.h linux-user: /proc/self/maps content 2014-08-22 15:06:33 +03:00
cpu-common.h NUMA: move numa related code to new file numa.c 2014-06-19 18:44:18 +03:00
cpu-defs.h implementing victim TLB for QEMU system emulated TLB 2014-09-01 17:43:06 +01:00
cpu_ldst.h softmmu: move all load/store functions to cpu_ldst.h 2014-06-05 16:10:33 +02:00
cpu_ldst_template.h softmmu: move all load/store functions to cpu_ldst.h 2014-06-05 16:10:33 +02:00
cputlb.h exec: Change memory_region_section_get_iotlb() argument to CPUState 2014-03-13 19:20:48 +01:00
exec-all.h tcg-ppc: Use uintptr_t in ppc_tb_set_jmp_target 2014-06-23 07:29:30 -07:00
gdbstub.h cpu: Introduce CPUClass::gdb_{read,write}_register() 2013-07-27 00:04:17 +02:00
gen-icount.h cpu: Move icount_decr field from CPU_COMMON to CPUState 2014-03-13 19:20:46 +01:00
helper-gen.h trace: [tcg] Include TCG-tracing helpers 2014-08-12 14:26:12 +01:00
helper-head.h tcg: Move size effects out of dh_arg 2014-05-28 09:33:55 -07:00
helper-proto.h trace: [tcg] Include TCG-tracing helpers 2014-08-12 14:26:12 +01:00
helper-tcg.h trace: [tcg] Include TCG-tracing helpers 2014-08-12 14:26:12 +01:00
hwaddr.h hwaddr: Make hwaddr type usable beyond softmmu 2013-06-28 13:25:13 +02:00
ioport.h portio: Allow to mark portio lists as coalesced MMIO flushing 2013-10-17 17:24:15 +02:00
memory-internal.h memory: split cpu_physical_memory_* functions to its own include 2014-01-13 14:04:54 +01:00
memory.h Revert "memory: Use canonical path component as the name" 2014-08-19 20:05:46 +01:00
poison.h exec: Remove env from list of poisoned names 2013-07-27 11:22:54 +04:00
ram_addr.h exec: fix migration with devices that use address_space_rw 2014-07-22 10:38:50 +02:00
softmmu-semi.h exec: Change cpu_memory_rw_debug() argument to CPUState 2013-07-23 02:41:33 +02:00
spinlock.h exec: move include files to include/exec/ 2012-12-19 08:31:31 +01:00