mirror of https://github.com/python/cpython.git
gh-119786: Add jit.md. Move adaptive.md to a section of interpreter.md. (#127175)
parent 67b18a18b6
commit 89fa7ec74e
@ -34,9 +34,9 @@ # CPython Internals Documentation
 Program Execution
 ---
-- [The Interpreter](interpreter.md)
+- [The Bytecode Interpreter](interpreter.md)
-- [Adaptive Instruction Families](adaptive.md)
+- [The JIT](jit.md)
 - [Garbage Collector Design](garbage_collector.md)
@ -1,146 +0,0 @@
(adaptive.md, "Adding or extending a family of adaptive instructions", is
deleted in its entirety; its content moves, essentially verbatim, into the new
"Specialization" section of interpreter.md in the hunk below.)
@ -18,6 +18,11 @@ # Code objects
 although they are often written to disk by one process and read back in by another.
 The disk version of a code object is serialized using the
 [marshal](https://docs.python.org/dev/library/marshal.html) protocol.
+When a `CodeObject` is created, the function `_PyCode_Quicken()` from
+[`Python/specialize.c`](../Python/specialize.c) is called to initialize
+the caches of all adaptive instructions. This is required because the
+on-disk format is a sequence of bytes, and some of the caches need to be
+initialized with 16-bit values.
 Code objects are nominally immutable.
 Some fields (including `co_code_adaptive` and fields for runtime
@ -595,16 +595,6 @@
 * [Exception Handling](exception_handling.md): Describes the exception table
-
-Specializing Adaptive Interpreter
-=================================
-
-Adding a specializing, adaptive interpreter to CPython will bring significant
-performance improvements. These documents provide more information:
-
-* [PEP 659: Specializing Adaptive Interpreter](https://peps.python.org/pep-0659/).
-* [Adding or extending a family of adaptive instructions](adaptive.md)
-
 References
 ==========
@ -1,8 +1,4 @@
-The bytecode interpreter
-========================
-
-Overview
---------
+# The bytecode interpreter

 This document describes the workings and implementation of the bytecode
 interpreter, the part of Python that executes compiled Python code. Its
@ -47,8 +43,7 @@
 `_PyEval_EvalFrameDefault()`.

-Instruction decoding
---------------------
+## Instruction decoding

 The first task of the interpreter is to decode the bytecode instructions.
 Bytecode is stored as an array of 16-bit code units (`_Py_CODEUNIT`).
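The decoding step referred to in this hunk can be sketched as follows. This is a simplified model, not CPython's real code: `_Py_CODEUNIT` is reduced to a plain `uint16_t` with the opcode in the high byte, and the `EXTENDED_ARG` opcode number is illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified model of a CPython code unit: high byte = opcode, low byte =
   oparg (the real _Py_CODEUNIT layout is an implementation detail). */
typedef uint16_t codeunit_t;

enum { EXTENDED_ARG = 144 };  /* opcode number made up for illustration */

static uint8_t unit_opcode(codeunit_t u) { return (uint8_t)(u >> 8); }
static uint8_t unit_oparg(codeunit_t u)  { return (uint8_t)(u & 0xFF); }

/* Decode one logical instruction starting at *pc, folding any EXTENDED_ARG
   prefixes into the effective oparg. */
static uint32_t decode_oparg(const codeunit_t *code, size_t *pc, uint8_t *opcode)
{
    uint32_t oparg = 0;
    for (;;) {
        codeunit_t u = code[(*pc)++];
        *opcode = unit_opcode(u);
        oparg = (oparg << 8) | unit_oparg(u);
        if (*opcode != EXTENDED_ARG)
            return oparg;
    }
}
```

The loop shows why `EXTENDED_ARG` exists: a single code unit only has room for an 8-bit argument, so larger arguments are built up one byte at a time.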
@ -110,8 +105,7 @@
 For various reasons we'll get to later (mostly efficiency, given that `EXTENDED_ARG`
 is rare) the actual code is different.

-Jumps
-=====
+## Jumps

 Note that when the `switch` statement is reached, `next_instr` (the "instruction offset")
 already points to the next instruction.
@ -120,25 +114,26 @@
 - A jump forward (`JUMP_FORWARD`) sets `next_instr += oparg`.
 - A jump backward sets `next_instr -= oparg`.

-Inline cache entries
-====================
+## Inline cache entries

 Some (specialized or specializable) instructions have an associated "inline cache".
 The inline cache consists of one or more two-byte entries included in the bytecode
 array as additional words following the `opcode`/`oparg` pair.
 The size of the inline cache for a particular instruction is fixed by its `opcode`.
 Moreover, the inline cache size for all instructions in a
-[family of specialized/specializable instructions](adaptive.md)
+[family of specialized/specializable instructions](#Specialization)
 (for example, `LOAD_ATTR`, `LOAD_ATTR_SLOT`, `LOAD_ATTR_MODULE`) must all be
 the same. Cache entries are reserved by the compiler and initialized with zeros.
 Although they are represented by code units, cache entries do not conform to the
 `opcode` / `oparg` format.

-If an instruction has an inline cache, the layout of its cache is described by
-a `struct` definition in [`pycore_code.h`](../Include/internal/pycore_code.h).
-This allows us to access the cache by casting `next_instr` to a pointer to this `struct`.
-The size of such a `struct` must be independent of the machine architecture, word size
-and alignment requirements. For a 32-bit field, the `struct` should use `_Py_CODEUNIT field[2]`.
+If an instruction has an inline cache, the layout of its cache is described in
+the instruction's definition in [`Python/bytecodes.c`](../Python/bytecodes.c).
+The structs defined in [`pycore_code.h`](../Include/internal/pycore_code.h)
+allow us to access the cache by casting `next_instr` to a pointer to the relevant
+`struct`. The size of such a `struct` must be independent of the machine
+architecture, word size and alignment requirements. For a 32-bit field, the
+`struct` should use `_Py_CODEUNIT field[2]`.

 The instruction implementation is responsible for advancing `next_instr` past the inline cache.
 For example, if an instruction's inline cache is four bytes (that is, two code units) in size,
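The `_Py_CODEUNIT field[2]` convention described in this hunk can be sketched like this. The struct and field names are stand-ins for illustration, not CPython's real cache definitions:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint16_t codeunit_t;  /* stand-in for _Py_CODEUNIT */

/* Model of an inline-cache struct: a counter entry plus a 32-bit value
   stored as two 16-bit code units, so the struct's size and layout do not
   depend on the machine's word size or alignment rules. */
typedef struct {
    codeunit_t counter;      /* first entry is always the counter */
    codeunit_t version[2];   /* 32-bit field split across two code units */
} toy_attr_cache_t;

static uint32_t read_u32(const codeunit_t field[2])
{
    uint32_t v;
    memcpy(&v, field, sizeof v);  /* reassemble in native byte order */
    return v;
}

static void write_u32(codeunit_t field[2], uint32_t v)
{
    memcpy(field, &v, sizeof v);
}
```

Because every member is a code unit, the struct occupies exactly three bytecode words, which is what lets the interpreter overlay it on the instruction stream at `next_instr`.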
@ -153,8 +148,7 @@
 More information about the use of inline caches can be found in
 [PEP 659](https://peps.python.org/pep-0659/#ancillary-data).

-The evaluation stack
---------------------
+## The evaluation stack

 Most instructions read or write some data in the form of object references (`PyObject *`).
 The CPython bytecode interpreter is a stack machine, meaning that its instructions operate
|
||||||
> Do not confuse the evaluation stack with the call stack, which is used to implement calling
|
> Do not confuse the evaluation stack with the call stack, which is used to implement calling
|
||||||
> and returning from functions.
|
> and returning from functions.
|
||||||
|
|
||||||
Error handling
|
## Error handling
|
||||||
--------------
|
|
||||||
|
|
||||||
When the implementation of an opcode raises an exception, it jumps to the
|
When the implementation of an opcode raises an exception, it jumps to the
|
||||||
`exception_unwind` label in [Python/ceval.c](../Python/ceval.c).
|
`exception_unwind` label in [Python/ceval.c](../Python/ceval.c).
|
||||||
The exception is then handled as described in the
|
The exception is then handled as described in the
|
||||||
[`exception handling documentation`](exception_handling.md#handling-exceptions).
|
[`exception handling documentation`](exception_handling.md#handling-exceptions).
|
||||||
|
|
||||||
Python-to-Python calls
|
## Python-to-Python calls
|
||||||
----------------------
|
|
||||||
|
|
||||||
The `_PyEval_EvalFrameDefault()` function is recursive, because sometimes
|
The `_PyEval_EvalFrameDefault()` function is recursive, because sometimes
|
||||||
the interpreter calls some C function that calls back into the interpreter.
|
the interpreter calls some C function that calls back into the interpreter.
|
||||||
|
@ -227,8 +219,7 @@
 A similar check is performed when an unhandled exception occurs.

-The call stack
---------------
+## The call stack

 Up through 3.10, the call stack was implemented as a singly-linked list of
 [frame objects](frames.md). This was expensive because each call would require a
@ -262,8 +253,7 @@
 <!--

-All sorts of variables
-----------------------
+## All sorts of variables

 The bytecode compiler determines the scope in which each variable name is defined,
 and generates instructions accordingly. For example, loading a local variable
@ -297,8 +287,7 @@
 -->

-Introducing a new bytecode instruction
---------------------------------------
+## Introducing a new bytecode instruction

 It is occasionally necessary to add a new opcode in order to implement
 a new feature or change the way that existing features are compiled.
@ -355,6 +344,169 @@
 bytecode of frozen importlib files. You have to run `make` again after this
 to recompile the generated C files.
+## Specialization
+
+Bytecode specialization, which was introduced in
+[PEP 659](https://peps.python.org/pep-0659/), speeds up program execution by
+rewriting instructions based on runtime information. This is done by replacing
+a generic instruction with a faster version that works for the case that this
+program encounters. Each specializable instruction is responsible for rewriting
+itself, using its [inline caches](#inline-cache-entries) for bookkeeping.
+
+When an adaptive instruction executes, it may attempt to specialize itself,
+depending on the argument and the contents of its cache. This is done
+by calling one of the `_Py_Specialize_XXX` functions in
+[`Python/specialize.c`](../Python/specialize.c).
+
+The specialized instructions are responsible for checking that the special-case
+assumptions still apply, and de-optimizing back to the generic version if not.
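The adaptive rhythm this hunk describes — count down, try to specialize when the counter hits zero, otherwise run the generic path — can be sketched as follows. The names and the fixed backoff threshold are illustrative; CPython's actual counters use an exponential-backoff encoding.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint16_t counter;     /* first inline cache entry is the counter */
} adaptive_cache_t;

enum { BACKOFF = 53 };    /* illustrative threshold, not CPython's value */

/* Returns 1 if this execution of the adaptive instruction should attempt
   to specialize itself; 0 if it should just run the base implementation. */
static int adaptive_tick(adaptive_cache_t *cache)
{
    if (cache->counter == 0) {
        cache->counter = BACKOFF;  /* reset, then try to specialize */
        return 1;
    }
    cache->counter--;              /* keep executing the generic path */
    return 0;
}
```

Resetting the counter on each attempt is what keeps specialization overhead low: a value that repeatedly fails to specialize is only retried at intervals, not on every execution.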
+## Families of instructions
+
+A *family* of instructions consists of an adaptive instruction along with the
+specialized instructions that it can be replaced by.
+It has the following fundamental properties:
+
+* It corresponds to a single instruction in the code
+  generated by the bytecode compiler.
+* It has a single adaptive instruction that records an execution count and,
+  at regular intervals, attempts to specialize itself. If not specializing,
+  it executes the base implementation.
+* It has at least one specialized form of the instruction that is tailored
+  for a particular value or set of values at runtime.
+* All members of the family must have the same number of inline cache entries,
+  to ensure correct execution.
+  Individual family members do not need to use all of the entries,
+  but must skip over any unused entries when executing.
+
+The current implementation also requires the following,
+although these are not fundamental and may change:
+
+* All families use one or more inline cache entries;
+  the first entry is always the counter.
+* All instruction names should start with the name of the adaptive
+  instruction.
+* Specialized forms should have names describing their specialization.
+## Example family
+
+The `LOAD_GLOBAL` instruction (in [Python/bytecodes.c](../Python/bytecodes.c))
+already has an adaptive family that serves as a relatively simple example.
+
+The `LOAD_GLOBAL` instruction performs adaptive specialization,
+calling `_Py_Specialize_LoadGlobal()` when the counter reaches zero.
+
+There are two specialized instructions in the family: `LOAD_GLOBAL_MODULE`,
+which is specialized for global variables in the module, and
+`LOAD_GLOBAL_BUILTIN`, which is specialized for builtin variables.
+## Performance analysis
+
+The benefit of a specialization can be assessed with the following formula:
+`Tbase/Tadaptive`,
+where `Tbase` is the mean time to execute the base instruction,
+and `Tadaptive` is the mean time to execute the specialized and adaptive forms:
+
+`Tadaptive = (sum(Ti*Ni) + Tmiss*Nmiss)/(sum(Ni)+Nmiss)`
+
+`Ti` is the time to execute the `i`th instruction in the family and `Ni` is
+the number of times that instruction is executed.
+`Tmiss` is the time to process a miss, including de-optimization
+and the time to execute the base instruction.
+
+The ideal situation is where misses are rare and the specialized
+forms are much faster than the base instruction.
+`LOAD_GLOBAL` is near ideal: `Nmiss/sum(Ni) ≈ 0`,
+in which case `Tadaptive ≈ sum(Ti*Ni)/sum(Ni)`.
+Since we can expect the specialized forms `LOAD_GLOBAL_MODULE` and
+`LOAD_GLOBAL_BUILTIN` to be much faster than the adaptive base instruction,
+we would expect the specialization of `LOAD_GLOBAL` to be profitable.
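Plugging illustrative numbers into the formula above makes the trade-off concrete. All timings and counts here are invented for the example:

```c
#include <assert.h>
#include <stddef.h>

/* Mean adaptive execution time per the formula above:
   Tadaptive = (sum(Ti*Ni) + Tmiss*Nmiss) / (sum(Ni) + Nmiss) */
static double t_adaptive(const double *ti, const double *ni, size_t n,
                         double t_miss, double n_miss)
{
    double time_sum = t_miss * n_miss;
    double count_sum = n_miss;
    for (size_t i = 0; i < n; i++) {
        time_sum += ti[i] * ni[i];
        count_sum += ni[i];
    }
    return time_sum / count_sum;
}
```

With, say, two specialized forms taking 10 ns and 12 ns, hit 900 and 100 times, and no misses, the mean is (10·900 + 12·100)/1000 = 10.2 ns; a 50 ns miss cost only matters in proportion to `Nmiss`.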
+## Design considerations
+
+While `LOAD_GLOBAL` may be ideal, instructions like `LOAD_ATTR` and
+`CALL_FUNCTION` are not. For maximum performance we want to keep `Ti`
+low for all specialized instructions and `Nmiss` as low as possible.
+
+Keeping `Nmiss` low means that there should be specializations for almost
+all values seen by the base instruction. Keeping `sum(Ti*Ni)` low means
+keeping `Ti` low, which means minimizing branches and dependent memory
+accesses (pointer chasing). These two objectives may be in conflict,
+requiring judgement and experimentation to design the family of instructions.
+
+The size of the inline cache should be as small as possible,
+without impairing performance, to reduce the number of
+`EXTENDED_ARG` jumps, and to reduce pressure on the CPU's data cache.
+### Gathering data
+
+Before choosing how to specialize an instruction, it is important to gather
+some data. What are the patterns of usage of the base instruction?
+Data can best be gathered by instrumenting the interpreter. Since a
+specialization function and adaptive instruction are going to be required,
+instrumentation can most easily be added in the specialization function.
+### Choice of specializations
+
+The performance of the specializing adaptive interpreter relies on the
+quality of specialization and keeping the overhead of specialization low.
+
+Specialized instructions must be fast. In order to be fast,
+specialized instructions should be tailored for a particular
+set of values that allows them to:
+
+1. Verify that the incoming value is part of that set with low overhead.
+2. Perform the operation quickly.
+
+This requires that the set of values is chosen such that membership can be
+tested quickly and that membership is sufficient to allow the operation to be
+performed quickly.
+
+For example, `LOAD_GLOBAL_MODULE` is specialized for `globals()`
+dictionaries that have keys with the expected version.
+
+This can be tested quickly:
+
+* `globals->keys->dk_version == expected_version`
+
+and the operation can be performed quickly:
+
+* `value = entries[cache->index].me_value;`
+
+Because it is impossible to measure the performance of an instruction without
+also measuring unrelated factors, the assessment of the quality of a
+specialization will require some judgement.
+
+As a general rule, specialized instructions should be much faster than the
+base instruction.
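The guard-then-load pattern in the `LOAD_GLOBAL_MODULE` example can be sketched with a toy dictionary model. All types and field names here are simplified stand-ins, not CPython's real `PyDictObject` structures:

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-ins for a dict's keys table and its entries. */
typedef struct { int key; int value; } entry_t;
typedef struct {
    uint32_t dk_version;   /* bumped whenever the key layout changes */
    entry_t entries[8];
} keys_t;

typedef struct {
    uint32_t version;      /* version recorded at specialization time */
    uint16_t index;        /* slot where the wanted key was found */
} cache_t;

/* Returns the cached value, or -1 to signal "deoptimize".
   One cheap comparison (the membership test), then one indexed load. */
static int load_global_module(const keys_t *keys, const cache_t *cache)
{
    if (keys->dk_version != cache->version)
        return -1;                              /* guard failed */
    return keys->entries[cache->index].value;   /* the fast operation */
}
```

The point of the version guard is that it subsumes many individual checks: if the version matches, the cached index is guaranteed still to point at the right entry, so no key comparison or hashing is needed on the fast path.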
+### Implementation of specialized instructions
+
+In general, specialized instructions should be implemented in two parts:
+
+1. A sequence of guards, each of the form
+   `DEOPT_IF(guard-condition-is-false, BASE_NAME)`.
+2. The operation, which should ideally have no branches and
+   a minimum number of dependent memory accesses.
+
+In practice, the parts may overlap, as data required for guards
+can be re-used in the operation.
+
+If there are branches in the operation, then consider further specialization
+to eliminate the branches.
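A minimal model of that two-part structure, with a `DEOPT_IF`-style macro standing in for CPython's (which jumps back to the generic instruction inside the interpreter loop rather than calling a function). The macro deliberately relies on the surrounding `a` and `b` names, mirroring how the real macro uses the enclosing instruction's context:

```c
#include <assert.h>

static int deopts = 0;   /* count de-optimizations, for the test below */

/* Generic slow path: always correct, never assumed away. */
static int binary_op_generic(int a, int b)
{
    deopts++;
    return a + b;
}

/* Stand-in for CPython's DEOPT_IF: on guard failure, fall back. */
#define DEOPT_IF(cond) do { if (cond) return binary_op_generic(a, b); } while (0)

/* Specialized form: guards first, then a branch-free operation. */
static int binary_op_add_int(int a, int b)
{
    DEOPT_IF(a < 0);   /* illustrative guard: only handle non-negatives */
    DEOPT_IF(b < 0);
    return a + b;      /* the operation itself has no branches */
}
```

Note that both paths compute the same result; de-optimization changes only the cost, never the semantics.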
+### Maintaining stats
+
+Finally, take care that stats are gathered correctly.
+After the last `DEOPT_IF` has passed, a hit should be recorded with
+`STAT_INC(BASE_INSTRUCTION, hit)`.
+After an optimization has been deferred in the adaptive instruction,
+that should be recorded with `STAT_INC(BASE_INSTRUCTION, deferred)`.
 Additional resources
 --------------------
@ -0,0 +1,134 @@
+# The JIT
+
+The [adaptive interpreter](interpreter.md) consists of a main loop that
+executes the bytecode instructions generated by the
+[bytecode compiler](compiler.md) and their
+[specializations](interpreter.md#Specialization). Runtime optimization in
+this interpreter can only be done for one instruction at a time. The JIT
+is based on a mechanism to replace an entire sequence of bytecode instructions,
+and this enables optimizations that span multiple instructions.
+
+Historically, the adaptive interpreter was referred to as `tier 1` and
+the JIT as `tier 2`. You will see remnants of this in the code.
+## The Optimizer and Executors
+
+The program begins running on the adaptive interpreter, until a `JUMP_BACKWARD`
+instruction determines that it is "hot" because the counter in its
+[inline cache](interpreter.md#inline-cache-entries) indicates that it
+executed more than some threshold number of times (see
+[`backoff_counter_triggers`](../Include/internal/pycore_backoff.h)).
+It then calls the function `_PyOptimizer_Optimize()` in
+[`Python/optimizer.c`](../Python/optimizer.c), passing it the current
+[frame](frames.md) and instruction pointer. `_PyOptimizer_Optimize()`
+constructs an object of type
+[`_PyExecutorObject`](../Include/internal/pycore_optimizer.h) which implements
+an optimized version of the instruction trace beginning at this jump.
+
+The optimizer determines where the trace ends, and the executor is set up
+to either return to the adaptive interpreter and resume execution, or
+transfer control to another executor (see `_PyExitData` in
+[`Include/internal/pycore_optimizer.h`](../Include/internal/pycore_optimizer.h)).
+
+The executor is stored on the [`code object`](code_objects.md) of the frame,
+in the `co_executors` field which is an array of executors. The start
+instruction of the trace (the `JUMP_BACKWARD`) is replaced by an
+`ENTER_EXECUTOR` instruction whose `oparg` is equal to the index of the
+executor in `co_executors`.
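The `ENTER_EXECUTOR` bookkeeping just described can be modeled in a few lines. Everything here is a simplified stand-in; the real `co_executors` array grows dynamically and the real bytecode is not split into parallel opcode/oparg arrays:

```c
#include <assert.h>
#include <stdint.h>

enum { JUMP_BACKWARD = 1, ENTER_EXECUTOR = 2, MAX_EXECUTORS = 4 };

typedef struct { int trace_start; } executor_t;   /* stand-in executor */

typedef struct {
    uint8_t opcode[64];
    uint8_t oparg[64];
    executor_t *co_executors[MAX_EXECUTORS];
    int n_executors;
} code_t;

/* Install an executor for the hot JUMP_BACKWARD at `pc`, patching the
   instruction to ENTER_EXECUTOR with the executor's index as its oparg. */
static int install_executor(code_t *co, int pc, executor_t *ex)
{
    if (co->n_executors >= MAX_EXECUTORS)
        return -1;
    int index = co->n_executors++;
    co->co_executors[index] = ex;
    co->opcode[pc] = ENTER_EXECUTOR;
    co->oparg[pc] = (uint8_t)index;
    return index;
}
```

Patching the instruction in place is what makes entry into the executor free for the interpreter: the next time execution reaches that offset, dispatch on `ENTER_EXECUTOR` finds the executor directly via its `oparg`.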
+## The micro-op optimizer
+
+The optimizer that `_PyOptimizer_Optimize()` runs is configurable via the
+`_Py_SetTier2Optimizer()` function (this is used in tests via
+`_testinternalcapi.set_optimizer()`).
+
+The micro-op (abbreviated `uop` to approximate `μop`) optimizer is defined in
+[`Python/optimizer.c`](../Python/optimizer.c) as the type `_PyUOpOptimizer_Type`.
+It translates an instruction trace into a sequence of micro-ops by replacing
+each bytecode by an equivalent sequence of micro-ops (see
+`_PyOpcode_macro_expansion` in
+[pycore_opcode_metadata.h](../Include/internal/pycore_opcode_metadata.h),
+which is generated from [`Python/bytecodes.c`](../Python/bytecodes.c)).
+The micro-op sequence is then optimized by
+`_Py_uop_analyze_and_optimize` in
+[`Python/optimizer_analysis.c`](../Python/optimizer_analysis.c)
+and an instance of `_PyUOpExecutor_Type` is created to contain it.
## The JIT interpreter
|
||||||
|
|
||||||
|
After a `JUMP_BACKWARD` instruction invokes the uop optimizer to create a uop
|
||||||
|
executor, it transfers control to this executor via the `GOTO_TIER_TWO` macro.
|
||||||
|
|
||||||
|
CPython implements two executors. Here we describe the JIT interpreter,
which is the simpler of them and is therefore useful for debugging and analyzing
the uop generation and optimization stages. To run it, we configure the
JIT to run on its interpreter (i.e., Python is configured with
[`--enable-experimental-jit=interpreter`](https://docs.python.org/dev/using/configure.html#cmdoption-enable-experimental-jit)).

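A build using this configuration might look as follows (standard CPython
build steps; only the configure flag is specific to the JIT interpreter):

```shell
# Configure CPython so the JIT runs on the uop (Tier 2) interpreter
# instead of emitting machine code, then build as usual.
./configure --enable-experimental-jit=interpreter
make -j4
```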
When invoked, the executor jumps to the `tier2_dispatch:` label in
[`Python/ceval.c`](../Python/ceval.c), where there is a loop that
executes the micro-ops. The body of this loop is a switch statement over
the uop IDs, resembling the one used in the adaptive interpreter.

The switch implementing the uops is in [`Python/executor_cases.c.h`](../Python/executor_cases.c.h),
which is generated by the build script
[`Tools/cases_generator/tier2_generator.py`](../Tools/cases_generator/tier2_generator.py)
from the bytecode definitions in
[`Python/bytecodes.c`](../Python/bytecodes.c).

When an `_EXIT_TRACE` or `_DEOPT` uop is reached, the uop interpreter exits
and execution returns to the adaptive interpreter.

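The dispatch loop described above can be modeled in miniature. The uop
IDs, the stack layout, and the dispatch function here are all invented;
the real loop lives at `tier2_dispatch` and switches over the generated
cases in `executor_cases.c.h`:

```c
/* A toy model of the Tier 2 dispatch loop: a switch over uop IDs that
 * runs until an _EXIT_TRACE or _DEOPT uop hands control back to the
 * Tier 1 interpreter. Uop names and semantics are hypothetical. */
enum uop { _PUSH_ONE, _ADD, _EXIT_TRACE, _DEOPT };

typedef struct { enum uop opcode; } UopInstruction;

/* Returns the final top-of-stack value when the trace exits. */
static int
tier2_dispatch(const UopInstruction *trace)
{
    int stack[16];
    int sp = 0;
    for (const UopInstruction *ip = trace; ; ip++) {
        switch (ip->opcode) {
        case _PUSH_ONE:
            stack[sp++] = 1;
            break;
        case _ADD: {
            int rhs = stack[--sp];
            stack[sp - 1] += rhs;
            break;
        }
        case _EXIT_TRACE:   /* leave Tier 2, resume the adaptive interpreter */
        case _DEOPT:        /* a guard failed: fall back to Tier 1 */
            return sp > 0 ? stack[sp - 1] : 0;
        }
    }
}
```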
## Invalidating Executors

In addition to being stored on the code object, each executor is also
inserted into a list of all executors, which is stored in the interpreter
state's `executor_list_head` field. This list is used when it is necessary
to invalidate executors because values they used in their construction may
have changed.

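The mechanism can be sketched as a linked-list walk. The types below are
toy stand-ins, not CPython's actual executor or interpreter-state structs:

```c
/* A sketch of keeping all executors on an interpreter-wide linked list
 * so they can be invalidated in bulk, in the spirit of the list headed
 * by executor_list_head. Types are invented for the example. */
#include <stddef.h>

typedef struct Executor {
    int valid;
    struct Executor *next;   /* link in the interpreter-wide list */
} Executor;

typedef struct {
    Executor *executor_list_head;
} InterpreterState;

/* Prepend a newly created executor to the interpreter-wide list. */
static void
link_executor(InterpreterState *interp, Executor *e)
{
    e->next = interp->executor_list_head;
    interp->executor_list_head = e;
}

/* Walk the list and mark every executor invalid, e.g. because a value
 * baked into the executors at construction time has changed. */
static void
invalidate_all_executors(InterpreterState *interp)
{
    for (Executor *e = interp->executor_list_head; e != NULL; e = e->next) {
        e->valid = 0;
    }
}
```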
## The JIT

When the full JIT is enabled (Python was configured with
[`--enable-experimental-jit`](https://docs.python.org/dev/using/configure.html#cmdoption-enable-experimental-jit)),
the uop executor's `jit_code` field is populated with a pointer to a compiled
C function that implements the executor logic. This function's signature is
defined by `jit_func` in [`pycore_jit.h`](../Include/internal/pycore_jit.h).
When the executor is invoked by `ENTER_EXECUTOR`, instead of jumping to
the uop interpreter at `tier2_dispatch`, the executor runs the function
that `jit_code` points to. This function returns the instruction pointer
of the next Tier 1 instruction that needs to execute.

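The `jit_code` calling convention can be modeled with a function pointer.
Everything below is a toy stand-in: the real `jit_func` signature in
`pycore_jit.h` takes interpreter state that is elided here:

```c
/* A toy model of the jit_code field: the executor holds a function
 * pointer; invoking the executor calls it, and the compiled function
 * returns the next Tier 1 instruction pointer to resume at. */
#include <stddef.h>

typedef struct { int opcode; } Tier1Instr;

/* Shape of a jitted entry point (simplified: real jit_func also takes
 * frame and thread state). */
typedef Tier1Instr *(*jit_func)(Tier1Instr *resume_at);

typedef struct {
    jit_func jit_code;   /* NULL when running under the uop interpreter */
} Executor;

/* A stand-in "compiled" function: pretends it executed one instruction
 * and asks Tier 1 to resume at the next one. */
static Tier1Instr *
demo_jitted(Tier1Instr *resume_at)
{
    return resume_at + 1;
}

static Tier1Instr *
run_executor(Executor *e, Tier1Instr *fallback)
{
    if (e->jit_code != NULL) {
        return e->jit_code(fallback);   /* run the compiled code */
    }
    return fallback;   /* would instead dispatch to the uop interpreter */
}
```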
The generation of the jitted functions uses the copy-and-patch technique,
which is described in
[Haoran Xu's article](https://sillycross.github.io/2023/05/12/2023-05-12/).
At its core are statically generated `stencils` for the implementations
of the micro-ops, which are completed with runtime information while
the jitted code is constructed for an executor by
[`_PyJIT_Compile`](../Python/jit.c).

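Copy-and-patch can be illustrated at the data level, without emitting any
real machine code. The stencil bytes, hole offset, and helper below are
invented purely to show the copy-then-patch idea:

```c
/* A data-level sketch of copy-and-patch: a stencil is a byte template
 * with "holes" at known offsets; emitting code for a uop means copying
 * the template and patching the holes with runtime values. The bytes
 * here are made up and are not machine code. */
#include <stdint.h>
#include <string.h>

#define HOLE_OFFSET 2   /* where the 4-byte runtime constant goes */

static const uint8_t stencil[8] = {
    0xAA, 0xBB,          /* fixed "instruction" bytes (invented) */
    0, 0, 0, 0,          /* hole: patched at JIT time */
    0xCC, 0xDD,          /* more fixed bytes */
};

/* Copy the stencil into the output buffer, then patch the hole. */
static void
emit_uop(uint8_t *out, uint32_t runtime_value)
{
    memcpy(out, stencil, sizeof(stencil));
    memcpy(out + HOLE_OFFSET, &runtime_value, sizeof(runtime_value));
}
```

Because the fixed bytes never change, the expensive compilation happens
once at build time; at run time only the cheap copy and patch remain.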
The stencils are generated at build time under the Makefile target `regen-jit`
by the scripts in [`Tools/jit`](../Tools/jit). These scripts read
[`Python/executor_cases.c.h`](../Python/executor_cases.c.h) (which is
generated from [`Python/bytecodes.c`](../Python/bytecodes.c)). For
each opcode, they construct a `.c` file that contains a function
implementing this opcode, with some runtime information injected.
This is done by replacing `CASE` with the bytecode definition in the
template file [`Tools/jit/template.c`](../Tools/jit/template.c).

Each of the `.c` files is compiled by LLVM to produce an object file
that contains a function that executes the opcode. These compiled
functions are used to generate the file
[`jit_stencils.h`](../jit_stencils.h), which contains the functions
that the JIT can use to emit code for each of the bytecodes.

For Python maintainers this means that changes to the bytecodes and
their implementations do not require changes related to the stencils,
because everything is automatically generated from
[`Python/bytecodes.c`](../Python/bytecodes.c) at build time.

See Also:

* [Copy-and-Patch Compilation: A fast compilation algorithm for high-level languages and bytecode](https://arxiv.org/abs/2011.13127)

* [PyCon 2024: Building a JIT compiler for CPython](https://www.youtube.com/watch?v=kMO3Ju0QCDo)