mirror of https://github.com/python/cpython.git
gh-119786: Add jit.md. Move adaptive.md to a section of interpreter.md. (#127175)
parent 67b18a18b6
commit 89fa7ec74e
@ -34,9 +34,9 @@ # CPython Internals Documentation
 Program Execution
 ---
-- [The Interpreter](interpreter.md)
+- [The Bytecode Interpreter](interpreter.md)
-- [Adaptive Instruction Families](adaptive.md)
+- [The JIT](jit.md)
 - [Garbage Collector Design](garbage_collector.md)
@ -1,146 +0,0 @@
(adaptive.md, "Adding or extending a family of adaptive instructions", is
deleted in its entirety; its content moves, essentially verbatim, into the new
"Specialization" section of interpreter.md in the hunk below.)
@ -18,6 +18,11 @@ # Code objects
 although they are often written to disk by one process and read back in by another.
 The disk version of a code object is serialized using the
 [marshal](https://docs.python.org/dev/library/marshal.html) protocol.
+When a `CodeObject` is created, the function `_PyCode_Quicken()` from
+[`Python/specialize.c`](../Python/specialize.c) is called to initialize
+the caches of all adaptive instructions. This is required because the
+on-disk format is a sequence of bytes, and some of the caches need to be
+initialized with 16-bit values.
 Code objects are nominally immutable.
 Some fields (including `co_code_adaptive` and fields for runtime
@ -595,16 +595,6 @@
 * [Exception Handling](exception_handling.md): Describes the exception table
-
-Specializing Adaptive Interpreter
-=================================
-
-Adding a specializing, adaptive interpreter to CPython will bring significant
-performance improvements. These documents provide more information:
-
-* [PEP 659: Specializing Adaptive Interpreter](https://peps.python.org/pep-0659/).
-* [Adding or extending a family of adaptive instructions](adaptive.md)
-
 References
 ==========
@ -1,8 +1,4 @@
-The bytecode interpreter
-========================
-
-Overview
---------
+# The bytecode interpreter

 This document describes the workings and implementation of the bytecode
 interpreter, the part of Python that executes compiled Python code. Its
@ -47,8 +43,7 @@
 `_PyEval_EvalFrameDefault()`.

-Instruction decoding
---------------------
+## Instruction decoding

 The first task of the interpreter is to decode the bytecode instructions.
 Bytecode is stored as an array of 16-bit code units (`_Py_CODEUNIT`).
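The decoding step referred to in this hunk can be sketched as follows. This is a simplified model, not CPython's real code: `_Py_CODEUNIT` is reduced to a plain `uint16_t` with the opcode in the high byte, and the `EXTENDED_ARG` opcode number is illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified model of a CPython code unit: high byte = opcode, low byte =
   oparg (the real _Py_CODEUNIT layout is an implementation detail). */
typedef uint16_t codeunit_t;

enum { EXTENDED_ARG = 144 };  /* opcode number made up for illustration */

static uint8_t unit_opcode(codeunit_t u) { return (uint8_t)(u >> 8); }
static uint8_t unit_oparg(codeunit_t u)  { return (uint8_t)(u & 0xFF); }

/* Decode one logical instruction starting at *pc, folding any EXTENDED_ARG
   prefixes into the effective oparg. */
static uint32_t decode_oparg(const codeunit_t *code, size_t *pc, uint8_t *opcode)
{
    uint32_t oparg = 0;
    for (;;) {
        codeunit_t u = code[(*pc)++];
        *opcode = unit_opcode(u);
        oparg = (oparg << 8) | unit_oparg(u);
        if (*opcode != EXTENDED_ARG)
            return oparg;
    }
}
```

The loop shows why `EXTENDED_ARG` exists: a single code unit only has room for an 8-bit argument, so larger arguments are built up one byte at a time.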
@ -110,8 +105,7 @@
 For various reasons we'll get to later (mostly efficiency, given that `EXTENDED_ARG`
 is rare) the actual code is different.

-Jumps
-=====
+## Jumps

 Note that when the `switch` statement is reached, `next_instr` (the "instruction offset")
 already points to the next instruction.
@ -120,25 +114,26 @@
 - A jump forward (`JUMP_FORWARD`) sets `next_instr += oparg`.
 - A jump backward sets `next_instr -= oparg`.

-Inline cache entries
-====================
+## Inline cache entries

 Some (specialized or specializable) instructions have an associated "inline cache".
 The inline cache consists of one or more two-byte entries included in the bytecode
 array as additional words following the `opcode`/`oparg` pair.
 The size of the inline cache for a particular instruction is fixed by its `opcode`.
 Moreover, the inline cache size for all instructions in a
-[family of specialized/specializable instructions](adaptive.md)
+[family of specialized/specializable instructions](#Specialization)
 (for example, `LOAD_ATTR`, `LOAD_ATTR_SLOT`, `LOAD_ATTR_MODULE`) must all be
 the same. Cache entries are reserved by the compiler and initialized with zeros.
 Although they are represented by code units, cache entries do not conform to the
 `opcode` / `oparg` format.

-If an instruction has an inline cache, the layout of its cache is described by
-a `struct` definition in [`pycore_code.h`](../Include/internal/pycore_code.h).
-This allows us to access the cache by casting `next_instr` to a pointer to this `struct`.
-The size of such a `struct` must be independent of the machine architecture, word size
-and alignment requirements. For a 32-bit field, the `struct` should use `_Py_CODEUNIT field[2]`.
+If an instruction has an inline cache, the layout of its cache is described in
+the instruction's definition in [`Python/bytecodes.c`](../Python/bytecodes.c).
+The structs defined in [`pycore_code.h`](../Include/internal/pycore_code.h)
+allow us to access the cache by casting `next_instr` to a pointer to the relevant
+`struct`. The size of such a `struct` must be independent of the machine
+architecture, word size and alignment requirements. For a 32-bit field, the
+`struct` should use `_Py_CODEUNIT field[2]`.

 The instruction implementation is responsible for advancing `next_instr` past the inline cache.
 For example, if an instruction's inline cache is four bytes (that is, two code units) in size,
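The `_Py_CODEUNIT field[2]` convention described in this hunk can be sketched like this. The struct and field names are stand-ins for illustration, not CPython's real cache definitions:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint16_t codeunit_t;  /* stand-in for _Py_CODEUNIT */

/* Model of an inline-cache struct: a counter entry plus a 32-bit value
   stored as two 16-bit code units, so the struct's size and layout do not
   depend on the machine's word size or alignment rules. */
typedef struct {
    codeunit_t counter;      /* first entry is always the counter */
    codeunit_t version[2];   /* 32-bit field split across two code units */
} toy_attr_cache_t;

static uint32_t read_u32(const codeunit_t field[2])
{
    uint32_t v;
    memcpy(&v, field, sizeof v);  /* reassemble in native byte order */
    return v;
}

static void write_u32(codeunit_t field[2], uint32_t v)
{
    memcpy(field, &v, sizeof v);
}
```

Because every member is a code unit, the struct occupies exactly three bytecode words, which is what lets the interpreter overlay it on the instruction stream at `next_instr`.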
@ -153,8 +148,7 @@
 More information about the use of inline caches can be found in
 [PEP 659](https://peps.python.org/pep-0659/#ancillary-data).

-The evaluation stack
---------------------
+## The evaluation stack

 Most instructions read or write some data in the form of object references (`PyObject *`).
 The CPython bytecode interpreter is a stack machine, meaning that its instructions operate
|
||||||
> Do not confuse the evaluation stack with the call stack, which is used to implement calling
|
> Do not confuse the evaluation stack with the call stack, which is used to implement calling
|
||||||
> and returning from functions.
|
> and returning from functions.
|
||||||
|
|
||||||
Error handling
|
## Error handling
|
||||||
--------------
|
|
||||||
|
|
||||||
When the implementation of an opcode raises an exception, it jumps to the
|
When the implementation of an opcode raises an exception, it jumps to the
|
||||||
`exception_unwind` label in [Python/ceval.c](../Python/ceval.c).
|
`exception_unwind` label in [Python/ceval.c](../Python/ceval.c).
|
||||||
The exception is then handled as described in the
|
The exception is then handled as described in the
|
||||||
[`exception handling documentation`](exception_handling.md#handling-exceptions).
|
[`exception handling documentation`](exception_handling.md#handling-exceptions).
|
||||||
|
|
||||||
Python-to-Python calls
|
## Python-to-Python calls
|
||||||
----------------------
|
|
||||||
|
|
||||||
The `_PyEval_EvalFrameDefault()` function is recursive, because sometimes
|
The `_PyEval_EvalFrameDefault()` function is recursive, because sometimes
|
||||||
the interpreter calls some C function that calls back into the interpreter.
|
the interpreter calls some C function that calls back into the interpreter.
|
||||||
|
@ -227,8 +219,7 @@
 A similar check is performed when an unhandled exception occurs.

-The call stack
---------------
+## The call stack

 Up through 3.10, the call stack was implemented as a singly-linked list of
 [frame objects](frames.md). This was expensive because each call would require a
@ -262,8 +253,7 @@
 <!--

-All sorts of variables
-----------------------
+## All sorts of variables

 The bytecode compiler determines the scope in which each variable name is defined,
 and generates instructions accordingly. For example, loading a local variable
@ -297,8 +287,7 @@
 -->

-Introducing a new bytecode instruction
---------------------------------------
+## Introducing a new bytecode instruction

 It is occasionally necessary to add a new opcode in order to implement
 a new feature or change the way that existing features are compiled.
@ -355,6 +344,169 @@
 bytecode of frozen importlib files. You have to run `make` again after this
 to recompile the generated C files.
+## Specialization
+
+Bytecode specialization, which was introduced in
+[PEP 659](https://peps.python.org/pep-0659/), speeds up program execution by
+rewriting instructions based on runtime information. This is done by replacing
+a generic instruction with a faster version that works for the case that this
+program encounters. Each specializable instruction is responsible for rewriting
+itself, using its [inline caches](#inline-cache-entries) for bookkeeping.
+
+When an adaptive instruction executes, it may attempt to specialize itself,
+depending on the argument and the contents of its cache. This is done
+by calling one of the `_Py_Specialize_XXX` functions in
+[`Python/specialize.c`](../Python/specialize.c).
+
+The specialized instructions are responsible for checking that the special-case
+assumptions still apply, and de-optimizing back to the generic version if not.
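The adaptive rhythm this hunk describes — count down, try to specialize when the counter hits zero, otherwise run the generic path — can be sketched as follows. The names and the fixed backoff threshold are illustrative; CPython's actual counters use an exponential-backoff encoding.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint16_t counter;     /* first inline cache entry is the counter */
} adaptive_cache_t;

enum { BACKOFF = 53 };    /* illustrative threshold, not CPython's value */

/* Returns 1 if this execution of the adaptive instruction should attempt
   to specialize itself; 0 if it should just run the base implementation. */
static int adaptive_tick(adaptive_cache_t *cache)
{
    if (cache->counter == 0) {
        cache->counter = BACKOFF;  /* reset, then try to specialize */
        return 1;
    }
    cache->counter--;              /* keep executing the generic path */
    return 0;
}
```

Resetting the counter on each attempt is what keeps specialization overhead low: a value that repeatedly fails to specialize is only retried at intervals, not on every execution.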
+## Families of instructions
+
+A *family* of instructions consists of an adaptive instruction along with the
+specialized instructions that it can be replaced by.
+It has the following fundamental properties:
+
+* It corresponds to a single instruction in the code
+  generated by the bytecode compiler.
+* It has a single adaptive instruction that records an execution count and,
+  at regular intervals, attempts to specialize itself. If not specializing,
+  it executes the base implementation.
+* It has at least one specialized form of the instruction that is tailored
+  for a particular value or set of values at runtime.
+* All members of the family must have the same number of inline cache entries,
+  to ensure correct execution.
+  Individual family members do not need to use all of the entries,
+  but must skip over any unused entries when executing.
+
+The current implementation also requires the following,
+although these are not fundamental and may change:
+
+* All families use one or more inline cache entries;
+  the first entry is always the counter.
+* All instruction names should start with the name of the adaptive
+  instruction.
+* Specialized forms should have names describing their specialization.
+## Example family
+
+The `LOAD_GLOBAL` instruction (in [Python/bytecodes.c](../Python/bytecodes.c))
+already has an adaptive family that serves as a relatively simple example.
+
+The `LOAD_GLOBAL` instruction performs adaptive specialization,
+calling `_Py_Specialize_LoadGlobal()` when the counter reaches zero.
+
+There are two specialized instructions in the family: `LOAD_GLOBAL_MODULE`,
+which is specialized for global variables in the module, and
+`LOAD_GLOBAL_BUILTIN`, which is specialized for builtin variables.
+## Performance analysis
+
+The benefit of a specialization can be assessed with the following formula:
+`Tbase/Tadaptive`,
+where `Tbase` is the mean time to execute the base instruction,
+and `Tadaptive` is the mean time to execute the specialized and adaptive forms:
+
+`Tadaptive = (sum(Ti*Ni) + Tmiss*Nmiss)/(sum(Ni)+Nmiss)`
+
+`Ti` is the time to execute the `i`th instruction in the family and `Ni` is
+the number of times that instruction is executed.
+`Tmiss` is the time to process a miss, including de-optimization
+and the time to execute the base instruction.
+
+The ideal situation is where misses are rare and the specialized
+forms are much faster than the base instruction.
+`LOAD_GLOBAL` is near ideal: `Nmiss/sum(Ni) ≈ 0`,
+in which case `Tadaptive ≈ sum(Ti*Ni)/sum(Ni)`.
+Since we can expect the specialized forms `LOAD_GLOBAL_MODULE` and
+`LOAD_GLOBAL_BUILTIN` to be much faster than the adaptive base instruction,
+we would expect the specialization of `LOAD_GLOBAL` to be profitable.
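Plugging illustrative numbers into the formula above makes the trade-off concrete. All timings and counts here are invented for the example:

```c
#include <assert.h>
#include <stddef.h>

/* Mean adaptive execution time per the formula above:
   Tadaptive = (sum(Ti*Ni) + Tmiss*Nmiss) / (sum(Ni) + Nmiss) */
static double t_adaptive(const double *ti, const double *ni, size_t n,
                         double t_miss, double n_miss)
{
    double time_sum = t_miss * n_miss;
    double count_sum = n_miss;
    for (size_t i = 0; i < n; i++) {
        time_sum += ti[i] * ni[i];
        count_sum += ni[i];
    }
    return time_sum / count_sum;
}
```

With, say, two specialized forms taking 10 ns and 12 ns, hit 900 and 100 times, and no misses, the mean is (10·900 + 12·100)/1000 = 10.2 ns; a 50 ns miss cost only matters in proportion to `Nmiss`.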
+## Design considerations
+
+While `LOAD_GLOBAL` may be ideal, instructions like `LOAD_ATTR` and
+`CALL_FUNCTION` are not. For maximum performance we want to keep `Ti`
+low for all specialized instructions and `Nmiss` as low as possible.
+
+Keeping `Nmiss` low means that there should be specializations for almost
+all values seen by the base instruction. Keeping `sum(Ti*Ni)` low means
+keeping `Ti` low, which means minimizing branches and dependent memory
+accesses (pointer chasing). These two objectives may be in conflict,
+requiring judgement and experimentation to design the family of instructions.
+
+The size of the inline cache should be as small as possible,
+without impairing performance, to reduce the number of
+`EXTENDED_ARG` jumps, and to reduce pressure on the CPU's data cache.
+### Gathering data
+
+Before choosing how to specialize an instruction, it is important to gather
+some data. What are the patterns of usage of the base instruction?
+Data can best be gathered by instrumenting the interpreter. Since a
+specialization function and adaptive instruction are going to be required,
+instrumentation can most easily be added in the specialization function.
+### Choice of specializations
+
+The performance of the specializing adaptive interpreter relies on the
+quality of specialization and keeping the overhead of specialization low.
+
+Specialized instructions must be fast. In order to be fast,
+specialized instructions should be tailored for a particular
+set of values that allows them to:
+
+1. Verify that the incoming value is part of that set with low overhead.
+2. Perform the operation quickly.
+
+This requires that the set of values is chosen such that membership can be
+tested quickly and that membership is sufficient to allow the operation to be
+performed quickly.
+
+For example, `LOAD_GLOBAL_MODULE` is specialized for `globals()`
+dictionaries that have keys with the expected version.
+
+This can be tested quickly:
+
+* `globals->keys->dk_version == expected_version`
+
+and the operation can be performed quickly:
+
+* `value = entries[cache->index].me_value;`
+
+Because it is impossible to measure the performance of an instruction without
+also measuring unrelated factors, the assessment of the quality of a
+specialization will require some judgement.
+
+As a general rule, specialized instructions should be much faster than the
+base instruction.
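The guard-then-load pattern in the `LOAD_GLOBAL_MODULE` example can be sketched with a toy dictionary model. All types and field names here are simplified stand-ins, not CPython's real `PyDictObject` structures:

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-ins for a dict's keys table and its entries. */
typedef struct { int key; int value; } entry_t;
typedef struct {
    uint32_t dk_version;   /* bumped whenever the key layout changes */
    entry_t entries[8];
} keys_t;

typedef struct {
    uint32_t version;      /* version recorded at specialization time */
    uint16_t index;        /* slot where the wanted key was found */
} cache_t;

/* Returns the cached value, or -1 to signal "deoptimize".
   One cheap comparison (the membership test), then one indexed load. */
static int load_global_module(const keys_t *keys, const cache_t *cache)
{
    if (keys->dk_version != cache->version)
        return -1;                              /* guard failed */
    return keys->entries[cache->index].value;   /* the fast operation */
}
```

The point of the version guard is that it subsumes many individual checks: if the version matches, the cached index is guaranteed still to point at the right entry, so no key comparison or hashing is needed on the fast path.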
+### Implementation of specialized instructions
+
+In general, specialized instructions should be implemented in two parts:
+
+1. A sequence of guards, each of the form
+   `DEOPT_IF(guard-condition-is-false, BASE_NAME)`.
+2. The operation, which should ideally have no branches and
+   a minimum number of dependent memory accesses.
+
+In practice, the parts may overlap, as data required for guards
+can be re-used in the operation.
+
+If there are branches in the operation, then consider further specialization
+to eliminate the branches.
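A minimal model of that two-part structure, with a `DEOPT_IF`-style macro standing in for CPython's (which jumps back to the generic instruction inside the interpreter loop rather than calling a function). The macro deliberately relies on the surrounding `a` and `b` names, mirroring how the real macro uses the enclosing instruction's context:

```c
#include <assert.h>

static int deopts = 0;   /* count de-optimizations, for the test below */

/* Generic slow path: always correct, never assumed away. */
static int binary_op_generic(int a, int b)
{
    deopts++;
    return a + b;
}

/* Stand-in for CPython's DEOPT_IF: on guard failure, fall back. */
#define DEOPT_IF(cond) do { if (cond) return binary_op_generic(a, b); } while (0)

/* Specialized form: guards first, then a branch-free operation. */
static int binary_op_add_int(int a, int b)
{
    DEOPT_IF(a < 0);   /* illustrative guard: only handle non-negatives */
    DEOPT_IF(b < 0);
    return a + b;      /* the operation itself has no branches */
}
```

Note that both paths compute the same result; de-optimization changes only the cost, never the semantics.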
+### Maintaining stats
+
+Finally, take care that stats are gathered correctly.
+After the last `DEOPT_IF` has passed, a hit should be recorded with
+`STAT_INC(BASE_INSTRUCTION, hit)`.
+After an optimization has been deferred in the adaptive instruction,
+that should be recorded with `STAT_INC(BASE_INSTRUCTION, deferred)`.
 Additional resources
 --------------------
@ -0,0 +1,134 @@
+# The JIT
+
+The [adaptive interpreter](interpreter.md) consists of a main loop that
+executes the bytecode instructions generated by the
+[bytecode compiler](compiler.md) and their
+[specializations](interpreter.md#Specialization). Runtime optimization in
+this interpreter can only be done for one instruction at a time. The JIT
+is based on a mechanism to replace an entire sequence of bytecode instructions,
+and this enables optimizations that span multiple instructions.
+
+Historically, the adaptive interpreter was referred to as `tier 1` and
+the JIT as `tier 2`. You will see remnants of this in the code.
+## The Optimizer and Executors
+
+The program begins running on the adaptive interpreter, until a `JUMP_BACKWARD`
+instruction determines that it is "hot" because the counter in its
+[inline cache](interpreter.md#inline-cache-entries) indicates that it
+executed more than some threshold number of times (see
+[`backoff_counter_triggers`](../Include/internal/pycore_backoff.h)).
+It then calls the function `_PyOptimizer_Optimize()` in
+[`Python/optimizer.c`](../Python/optimizer.c), passing it the current
+[frame](frames.md) and instruction pointer. `_PyOptimizer_Optimize()`
+constructs an object of type
+[`_PyExecutorObject`](../Include/internal/pycore_optimizer.h) which implements
+an optimized version of the instruction trace beginning at this jump.
+
+The optimizer determines where the trace ends, and the executor is set up
+to either return to the adaptive interpreter and resume execution, or
+transfer control to another executor (see `_PyExitData` in
+[`Include/internal/pycore_optimizer.h`](../Include/internal/pycore_optimizer.h)).
+
+The executor is stored on the [`code object`](code_objects.md) of the frame,
+in the `co_executors` field which is an array of executors. The start
+instruction of the trace (the `JUMP_BACKWARD`) is replaced by an
+`ENTER_EXECUTOR` instruction whose `oparg` is equal to the index of the
+executor in `co_executors`.
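The `ENTER_EXECUTOR` bookkeeping just described can be modeled in a few lines. Everything here is a simplified stand-in; the real `co_executors` array grows dynamically and the real bytecode is not split into parallel opcode/oparg arrays:

```c
#include <assert.h>
#include <stdint.h>

enum { JUMP_BACKWARD = 1, ENTER_EXECUTOR = 2, MAX_EXECUTORS = 4 };

typedef struct { int trace_start; } executor_t;   /* stand-in executor */

typedef struct {
    uint8_t opcode[64];
    uint8_t oparg[64];
    executor_t *co_executors[MAX_EXECUTORS];
    int n_executors;
} code_t;

/* Install an executor for the hot JUMP_BACKWARD at `pc`, patching the
   instruction to ENTER_EXECUTOR with the executor's index as its oparg. */
static int install_executor(code_t *co, int pc, executor_t *ex)
{
    if (co->n_executors >= MAX_EXECUTORS)
        return -1;
    int index = co->n_executors++;
    co->co_executors[index] = ex;
    co->opcode[pc] = ENTER_EXECUTOR;
    co->oparg[pc] = (uint8_t)index;
    return index;
}
```

Patching the instruction in place is what makes entry into the executor free for the interpreter: the next time execution reaches that offset, dispatch on `ENTER_EXECUTOR` finds the executor directly via its `oparg`.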
+## The micro-op optimizer
+
+The optimizer that `_PyOptimizer_Optimize()` runs is configurable via the
+`_Py_SetTier2Optimizer()` function (this is used in tests via
+`_testinternalcapi.set_optimizer()`).
+
+The micro-op (abbreviated `uop` to approximate `μop`) optimizer is defined in
+[`Python/optimizer.c`](../Python/optimizer.c) as the type `_PyUOpOptimizer_Type`.
+It translates an instruction trace into a sequence of micro-ops by replacing
+each bytecode by an equivalent sequence of micro-ops (see
+`_PyOpcode_macro_expansion` in
+[pycore_opcode_metadata.h](../Include/internal/pycore_opcode_metadata.h),
+which is generated from [`Python/bytecodes.c`](../Python/bytecodes.c)).
+The micro-op sequence is then optimized by
+`_Py_uop_analyze_and_optimize` in
+[`Python/optimizer_analysis.c`](../Python/optimizer_analysis.c)
+and an instance of `_PyUOpExecutor_Type` is created to contain it.
## The JIT interpreter
|
||||||
|
|
||||||
|
After a `JUMP_BACKWARD` instruction invokes the uop optimizer to create a uop
|
||||||
|
executor, it transfers control to this executor via the `GOTO_TIER_TWO` macro.
|
||||||
|
|
||||||
|
CPython implements two executors. Here we describe the JIT interpreter,
which is the simpler of them and is therefore useful for debugging and analyzing
the uop generation and optimization stages. To run it, we configure the
JIT to run on its interpreter (i.e., Python is configured with
[`--enable-experimental-jit=interpreter`](https://docs.python.org/dev/using/configure.html#cmdoption-enable-experimental-jit)).

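A build using this configuration might look as follows (standard CPython
build steps; only the configure flag is specific to the JIT interpreter):

```shell
# Configure CPython so the JIT runs on the uop (Tier 2) interpreter
# instead of emitting machine code, then build as usual.
./configure --enable-experimental-jit=interpreter
make -j4
```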
When invoked, the executor jumps to the `tier2_dispatch:` label in
[`Python/ceval.c`](../Python/ceval.c), where there is a loop that
executes the micro-ops. The body of this loop is a switch statement over
the uop IDs, resembling the one used in the adaptive interpreter.

The switch implementing the uops is in [`Python/executor_cases.c.h`](../Python/executor_cases.c.h),
which is generated by the build script
[`Tools/cases_generator/tier2_generator.py`](../Tools/cases_generator/tier2_generator.py)
from the bytecode definitions in
[`Python/bytecodes.c`](../Python/bytecodes.c).

When an `_EXIT_TRACE` or `_DEOPT` uop is reached, the uop interpreter exits
and execution returns to the adaptive interpreter.

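The dispatch loop described above can be modeled in miniature. The uop
IDs, the stack layout, and the dispatch function here are all invented;
the real loop lives at `tier2_dispatch` and switches over the generated
cases in `executor_cases.c.h`:

```c
/* A toy model of the Tier 2 dispatch loop: a switch over uop IDs that
 * runs until an _EXIT_TRACE or _DEOPT uop hands control back to the
 * Tier 1 interpreter. Uop names and semantics are hypothetical. */
enum uop { _PUSH_ONE, _ADD, _EXIT_TRACE, _DEOPT };

typedef struct { enum uop opcode; } UopInstruction;

/* Returns the final top-of-stack value when the trace exits. */
static int
tier2_dispatch(const UopInstruction *trace)
{
    int stack[16];
    int sp = 0;
    for (const UopInstruction *ip = trace; ; ip++) {
        switch (ip->opcode) {
        case _PUSH_ONE:
            stack[sp++] = 1;
            break;
        case _ADD: {
            int rhs = stack[--sp];
            stack[sp - 1] += rhs;
            break;
        }
        case _EXIT_TRACE:   /* leave Tier 2, resume the adaptive interpreter */
        case _DEOPT:        /* a guard failed: fall back to Tier 1 */
            return sp > 0 ? stack[sp - 1] : 0;
        }
    }
}
```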
## Invalidating Executors

In addition to being stored on the code object, each executor is also
inserted into a list of all executors, which is stored in the interpreter
state's `executor_list_head` field. This list is used when it is necessary
to invalidate executors because values they used in their construction may
have changed.

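The mechanism can be sketched as a linked-list walk. The types below are
toy stand-ins, not CPython's actual executor or interpreter-state structs:

```c
/* A sketch of keeping all executors on an interpreter-wide linked list
 * so they can be invalidated in bulk, in the spirit of the list headed
 * by executor_list_head. Types are invented for the example. */
#include <stddef.h>

typedef struct Executor {
    int valid;
    struct Executor *next;   /* link in the interpreter-wide list */
} Executor;

typedef struct {
    Executor *executor_list_head;
} InterpreterState;

/* Prepend a newly created executor to the interpreter-wide list. */
static void
link_executor(InterpreterState *interp, Executor *e)
{
    e->next = interp->executor_list_head;
    interp->executor_list_head = e;
}

/* Walk the list and mark every executor invalid, e.g. because a value
 * baked into the executors at construction time has changed. */
static void
invalidate_all_executors(InterpreterState *interp)
{
    for (Executor *e = interp->executor_list_head; e != NULL; e = e->next) {
        e->valid = 0;
    }
}
```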
## The JIT

When the full JIT is enabled (Python was configured with
[`--enable-experimental-jit`](https://docs.python.org/dev/using/configure.html#cmdoption-enable-experimental-jit)),
the uop executor's `jit_code` field is populated with a pointer to a compiled
C function that implements the executor logic. This function's signature is
defined by `jit_func` in [`pycore_jit.h`](../Include/internal/pycore_jit.h).
When the executor is invoked by `ENTER_EXECUTOR`, instead of jumping to
the uop interpreter at `tier2_dispatch`, the executor runs the function
that `jit_code` points to. This function returns the instruction pointer
of the next Tier 1 instruction that needs to execute.

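The `jit_code` calling convention can be modeled with a function pointer.
Everything below is a toy stand-in: the real `jit_func` signature in
`pycore_jit.h` takes interpreter state that is elided here:

```c
/* A toy model of the jit_code field: the executor holds a function
 * pointer; invoking the executor calls it, and the compiled function
 * returns the next Tier 1 instruction pointer to resume at. */
#include <stddef.h>

typedef struct { int opcode; } Tier1Instr;

/* Shape of a jitted entry point (simplified: real jit_func also takes
 * frame and thread state). */
typedef Tier1Instr *(*jit_func)(Tier1Instr *resume_at);

typedef struct {
    jit_func jit_code;   /* NULL when running under the uop interpreter */
} Executor;

/* A stand-in "compiled" function: pretends it executed one instruction
 * and asks Tier 1 to resume at the next one. */
static Tier1Instr *
demo_jitted(Tier1Instr *resume_at)
{
    return resume_at + 1;
}

static Tier1Instr *
run_executor(Executor *e, Tier1Instr *fallback)
{
    if (e->jit_code != NULL) {
        return e->jit_code(fallback);   /* run the compiled code */
    }
    return fallback;   /* would instead dispatch to the uop interpreter */
}
```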
The generation of the jitted functions uses the copy-and-patch technique,
which is described in
[Haoran Xu's article](https://sillycross.github.io/2023/05/12/2023-05-12/).
At its core are statically generated `stencils` for the implementations
of the micro-ops, which are completed with runtime information while
the jitted code is constructed for an executor by
[`_PyJIT_Compile`](../Python/jit.c).

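Copy-and-patch can be illustrated at the data level, without emitting any
real machine code. The stencil bytes, hole offset, and helper below are
invented purely to show the copy-then-patch idea:

```c
/* A data-level sketch of copy-and-patch: a stencil is a byte template
 * with "holes" at known offsets; emitting code for a uop means copying
 * the template and patching the holes with runtime values. The bytes
 * here are made up and are not machine code. */
#include <stdint.h>
#include <string.h>

#define HOLE_OFFSET 2   /* where the 4-byte runtime constant goes */

static const uint8_t stencil[8] = {
    0xAA, 0xBB,          /* fixed "instruction" bytes (invented) */
    0, 0, 0, 0,          /* hole: patched at JIT time */
    0xCC, 0xDD,          /* more fixed bytes */
};

/* Copy the stencil into the output buffer, then patch the hole. */
static void
emit_uop(uint8_t *out, uint32_t runtime_value)
{
    memcpy(out, stencil, sizeof(stencil));
    memcpy(out + HOLE_OFFSET, &runtime_value, sizeof(runtime_value));
}
```

Because the fixed bytes never change, the expensive compilation happens
once at build time; at run time only the cheap copy and patch remain.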
The stencils are generated at build time under the Makefile target `regen-jit`
by the scripts in [`Tools/jit`](../Tools/jit). These scripts read
[`Python/executor_cases.c.h`](../Python/executor_cases.c.h) (which is
generated from [`Python/bytecodes.c`](../Python/bytecodes.c)). For
each opcode, they construct a `.c` file that contains a function
implementing this opcode, with some runtime information injected.
This is done by replacing `CASE` with the bytecode definition in the
template file [`Tools/jit/template.c`](../Tools/jit/template.c).

Each of the `.c` files is compiled by LLVM to produce an object file
that contains a function that executes the opcode. These compiled
functions are used to generate the file
[`jit_stencils.h`](../jit_stencils.h), which contains the functions
that the JIT can use to emit code for each of the bytecodes.

For Python maintainers this means that changes to the bytecodes and
their implementations do not require changes related to the stencils,
because everything is automatically generated from
[`Python/bytecodes.c`](../Python/bytecodes.c) at build time.

See Also:

* [Copy-and-Patch Compilation: A fast compilation algorithm for high-level languages and bytecode](https://arxiv.org/abs/2011.13127)

* [PyCon 2024: Building a JIT compiler for CPython](https://www.youtube.com/watch?v=kMO3Ju0QCDo)