We use the same approach that was used for specialization of LOAD_GLOBAL in free-threaded builds:
_CHECK_ATTR_MODULE is renamed to _CHECK_ATTR_MODULE_PUSH_KEYS; it pushes the keys object for the following _LOAD_ATTR_MODULE_FROM_KEYS (nee _LOAD_ATTR_MODULE). This arrangement avoids having to recheck the keys version.
_LOAD_ATTR_MODULE is renamed to _LOAD_ATTR_MODULE_FROM_KEYS; it loads the value from the keys object pushed by the preceding _CHECK_ATTR_MODULE_PUSH_KEYS at the cached index.
* Enable specialization of CALL_KW
* Fix bug pushing frame in _PY_FRAME_KW
`_PY_FRAME_KW` pushes a pointer to the new frame onto the stack for
consumption by the next uop. When pushing the frame fails, we do not
want to push the result, `NULL`, to the stack because it is not
a valid stackref. This works in the default build because `PyStackRef_NULL`
and `NULL` are the same value, so the `PyStackRef_XCLOSE()` in the error
handler ignores it. In the free-threaded build the values are not the same;
`PyStackRef_XCLOSE()` will attempt to decref a null pointer.
Adds a `use_system_log` config item to enable stdout/stderr redirection for
Apple platforms. This log streaming is then used by a new iOS test runner
script, allowing the display of test suite output at runtime. The iOS test
runner script can be used by any Python project, not just the CPython test
suite.
The CALL family of instructions were mostly thread-safe already and only required a small number of changes, which are documented below.
A few changes were needed to make CALL_ALLOC_AND_ENTER_INIT thread-safe:
Added _PyType_LookupRefAndVersion, which returns the type version corresponding to the returned ref.
Added _PyType_CacheInitForSpecialization, which takes an init method and the corresponding type version and only populates the specialization cache if the current type version matches the supplied version. This prevents potentially caching a stale value in free-threaded builds if we race with an update to __init__.
Only cache __init__ functions that are deferred in free-threaded builds. This ensures that the reference to __init__ that is stored in the specialization cache is valid if the type version guard in _CHECK_AND_ALLOCATE_OBJECT passes.
Fix a bug in _CREATE_INIT_FRAME where the frame is pushed to the stack on failure.
A few other miscellaneous changes were also needed:
Use {LOCK,UNLOCK}_OBJECT in LIST_APPEND. This ensures that the list's per-object lock is held while we are appending to it.
Add missing co_tlbc for _Py_InitCleanup.
Stop/start the world around setting the eval frame hook. This allows us to read interp->eval_frame non-atomically and preserves the behavior of _CHECK_PEP_523 documented below.
* Replace uses of `PyCell_GET` and `PyCell_SET`. These macros are not
safe to use in the free-threaded build. Use `PyCell_GetRef()` and
`PyCell_SetTakeRef()` instead.
* Since `PyCell_GetRef()` returns a strong rather than borrowed ref, some
code restructuring was required, e.g. `frame_get_var()` returns a strong
ref now.
* Add critical sections to `PyCell_GET` and `PyCell_SET`.
* Move critical_section.h earlier in the Python.h file.
* Add `PyCell_GET` to the free-threading howto table of APIs that return
borrowed refs.
* Add additional unit tests for free-threading.
No additional thread safety changes are required. Note that sending to
a generator that is shared between threads is currently not safe in the
free-threaded build.
Use existing helpers to atomically modify the bytecode. Add unit tests
to ensure specializing is happening as expected. Add test_specialize.py
that can be used with ThreadSanitizer to detect data races.
Fix thread safety issue with cell_set_contents().
* Mark almost all reachable objects before doing collection phase
* Add stats for objects marked
* Visit new frames before each increment
* Update docs
* Clearer calculation of work to do.
The specialization only depends on the type, so no special thread-safety
considerations there.
STORE_SUBSCR_LIST_INT needs to lock the list before modifying it.
`_PyDict_SetItem_Take2` already internally locks the dictionary using a
critical section.
"Generally, mixed-mode arithmetic combining real and complex variables should
be performed directly, not by first coercing the real to complex, lest the sign
of zero be rendered uninformative; the same goes for combinations of pure
imaginary quantities with complex variables." (c) Kahan, W: Branch cuts for
complex elementary functions.
This patch implements mixed-mode arithmetic rules, combining real and
complex variables as specified by C standards since C99 (in particular,
there is no special version for the true division with real lhs
operand). Most C compilers implementing C99+ Annex G have only these
special rules (without support for imaginary type, which is going to be
deprecated in C2y).
The interpreter now handles `_PyStackRef`s pointing to immortal objects
without the deferred bit set, so `_PyEvalFramePushAndInit_UnTagged` is
no longer necessary.
Record success in `specialize`
This matches the existing behavior where we increment the success
stat for the generic opcode each time we successfully specialize
an instruction.
If Python fails to start newly created thread
due to failure of underlying PyThread_start_new_thread() call,
its state should be removed from interpreter' thread states list
to avoid its double cleanup.
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
This gets rid of the immortal check in `PyStackRef_FromPyObjectSteal()`.
Overall, this improves performance about 2% in the free threading
build.
This also renames `PyStackRef_Is()` to `PyStackRef_IsExactly()` because
the macro requires that the tag bits of the arguments match, which is
only true in certain special cases.
Add free-threaded specialization for `UNPACK_SEQUENCE` opcode.
`UNPACK_SEQUENCE_TUPLE/UNPACK_SEQUENCE_TWO_TUPLE` are already thread safe since tuples are immutable.
`UNPACK_SEQUENCE_LIST` is not thread safe because of nature of lists (there is nothing preventing another thread from adding items to or removing them the list while the instruction is executing). To achieve thread safety we add a critical section to the implementation of `UNPACK_SEQUENCE_LIST`, especially around the parts where we check the size of the list and push items onto the stack.
---------
Co-authored-by: Matt Page <mpage@meta.com>
Co-authored-by: mpage <mpage@cs.stanford.edu>
Enable specialization of LOAD_GLOBAL in free-threaded builds.
Thread-safety of specialization in free-threaded builds is provided by the following:
A critical section is held on both the globals and builtins objects during specialization. This ensures we get an atomic view of both builtins and globals during specialization.
Generation of new keys versions is made atomic in free-threaded builds.
Existing helpers are used to atomically modify the opcode.
Thread-safety of specialized instructions in free-threaded builds is provided by the following:
Relaxed atomics are used when loading and storing dict keys versions. This avoids potential data races as the dict keys versions are read without holding the dictionary's per-object lock in version guards.
Dicts keys objects are passed from keys version guards to the downstream uops. This ensures that we are loading from the correct offset in the keys object. Once a unicode key has been stored in a keys object for a combined dictionary in free-threaded builds, the offset that it is stored in will never be reused for a different key. Once the version guard passes, we know that we are reading from the correct offset.
The dictionary read fast-path is used to read values from the dictionary once we know the correct offset.
This is a precursor to the actual fix for gh-114940, where we will change these macros to use the new lock. This change is almost entirely mechanical; the exceptions are the loops in codeobject.c and ceval.c, which now hold the "head" lock. Note that almost all of the uses of _Py_FOR_EACH_TSTATE_UNLOCKED() here will change to _Py_FOR_EACH_TSTATE_BEGIN() once we add the new per-interpreter lock.
Don't take a reason in unspecialize
We only want to compute the reason if stats are enabled. Optimizing
compilers should optimize this away for us (gcc and clang do), but
it's better to be safe than sorry.
This approach eliminates the originally reported race. It also gets rid of the deadlock reported in gh-96071, so we can remove the workaround added then.
* Mark almost all reachable objects before doing collection phase
* Add stats for objects marked
* Visit new frames before each increment
* Remove lazy dict tracking
* Update docs
* Clearer calculation of work to do.
* Document that slices can be marshalled
* Deduplicate and organize the list of supported types
in docs
* Organize the type code list in marshal.c, to make
it more obvious that this is a versioned format
* Back-fill some historical info
Co-authored-by: Michael Droettboom <mdboom@gmail.com>
The PyMutex implementation supports unlocking after fork because we
clear the list of waiters in parking_lot.c. This doesn't work as well
for _PyRecursiveMutex because on some systems, such as SerenityOS, the
thread id is not preserved across fork().
These changes makes it easier to backport the _interpreters, _interpqueues, and _interpchannels modules to Python 3.12.
This involves the following:
* add the _PyXI_GET_STATE() and _PyXI_GET_GLOBAL_STATE() macros
* add _PyXIData_lookup_context_t and _PyXIData_GetLookupContext()
* add _Py_xi_state_init() and _Py_xi_state_fini()
These changes makes it easier to backport the _interpreters, _interpqueues, and _interpchannels modules to Python 3.12.
This involves the following:
* rename several structs and typedefs
* add several typedefs
* stop using the PyThreadState.state field directly in parking_lot.c
Move creation of a tuple for var-positional parameter out of
_PyArg_UnpackKeywordsWithVararg().
Merge _PyArg_UnpackKeywordsWithVararg() with _PyArg_UnpackKeywords().
Add a new parameter in _PyArg_UnpackKeywords().
The "parameters" and "converters" attributes of ParseArgsCodeGen no
longer contain the var-positional parameter. It is now available as the
"varpos" attribute. Optimize code generation for var-positional
parameter and reuse the same generating code for functions with and without
keyword parameters.
Add special converters for var-positional parameter. "tuple" represents it as
a Python tuple and "array" represents it as a continuous array of PyObject*.
"object" is a temporary alias of "tuple".
The primary objective here is to allow some later changes to be cleaner. Mostly this involves renaming things and moving a few things around.
* CrossInterpreterData -> XIData
* crossinterpdatafunc -> xidatafunc
* split out pycore_crossinterp_data_registry.h
* add _PyXIData_lookup_t
Introduce helpers for (un)specializing instructions
Consolidate the code to specialize/unspecialize instructions into
two helper functions and use them in `_Py_Specialize_BinaryOp`.
The resulting code is more concise and keeps all of the logic at
the point where we decide to specialize/unspecialize an instruction.
- The specialization logic determines the appropriate specialization using only the operand's type, which is safe to read non-atomically (changing it requires stopping the world). We are guaranteed that the type will not change in between when it is checked and when we specialize the bytecode because the types involved are immutable (you cannot assign to `__class__` for exact instances of `dict`, `set`, or `frozenset`). The bytecode is mutated atomically using helpers.
- The specialized instructions rely on the operand type not changing in between the `DEOPT_IF` checks and the calls to the appropriate type-specific helpers (e.g. `_PySet_Contains`). This is a correctness requirement in the default builds and there are no changes to the opcodes in the free-threaded builds that would invalidate this.
Each thread specializes a thread-local copy of the bytecode, created on the first RESUME, in free-threaded builds. All copies of the bytecode for a code object are stored in the co_tlbc array on the code object. Threads reserve a globally unique index identifying its copy of the bytecode in all co_tlbc arrays at thread creation and release the index at thread destruction. The first entry in every co_tlbc array always points to the "main" copy of the bytecode that is stored at the end of the code object. This ensures that no bytecode is copied for programs that do not use threads.
Thread-local bytecode can be disabled at runtime by providing either -X tlbc=0 or PYTHON_TLBC=0. Disabling thread-local bytecode also disables specialization.
Concurrent modifications to the bytecode made by the specializing interpreter and instrumentation use atomics, with specialization taking care not to overwrite an instruction that was instrumented concurrently.
* Fix comprehensions comment to inlined by pep 709
* Update spacing
Co-authored-by: RUANG (James Roy) <longjinyii@outlook.com>
* Add reference to PEP 709
---------
Co-authored-by: Carol Willing <carolcode@willingconsulting.com>
Co-authored-by: RUANG (James Roy) <longjinyii@outlook.com>
Temporarily ignore warnings about JIT deactivation when perf support is active.
This will be reverted as soon as a way is found to determine at run time whether the interpreter was built with JIT. Currently, this is not possible on Windows.
Co-authored-by: Kirill Podoprigora <kirill.bast9@mail.ru>
Co-authored-by: Ken Jin <28750310+Fidget-Spinner@users.noreply.github.com>
Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
Previously, if the `ast.AST._fields` attribute was deleted, attempts to create a new `as`t node would crash due to the assumption that `_fields` always had a non-NULL value. Now it has been fixed by adding an extra check to ensure that `_fields` does not have a NULL value (this can happen when you manually remove `_fields` attribute).
* Remove `@suppress_immortalization` decorator
* Make suppression flag per-thread instead of per-interpreter
* Suppress immortalization in `eval()` to avoid refleaks in three tests
(test_datetime.test_roundtrip, test_logging.test_config8_ok, and
test_random.test_after_fork).
* frozenset() is constant, but not a singleton. When run multiple times,
the test could fail due to constant interning.
This replaces `_PyEval_BuiltinsFromGlobals` with
`_PyDict_LoadBuiltinsFromGlobals`, which returns a new reference
instead of a borrowed reference. Internally, the new function uses
per-thread reference counting when possible to avoid contention on the
refcount fields on the builtins module.
On Windows, `long` is a signed 32-bit integer so it can't represent
`0xffff_ffff` without overflow. Windows exit codes are unsigned 32-bit
integers, so if a child process exits with `-1`, it will be represented
as `0xffff_ffff`.
Also fix a number of other possible cases where `_Py_HandleSystemExit`
could return with an exception set, leading to a `SystemError` (or
fatal error in debug builds) later on during shutdown.
This fixes a crash when `gc.get_objects()` or `gc.get_referrers()` is
called during a GC in the free threading build.
Switch to `_PyObjectStack` to avoid corrupting the `struct worklist`
linked list maintained by the GC. Also, don't return objects that are frozen
(`gc.freeze()`) or in the process of being collected to more closely match
the behavior of the default build.
They used to be shared, before 3.12. Returning to sharing them resolves a failure on Py_TRACE_REFS builds.
Co-authored-by: Petr Viktorin <encukou@gmail.com>
* Fix usage of PyStackRef_FromPyObjectSteal in CALL_TUPLE_1
This was missed in gh-124894
* Fix usage of PyStackRef_FromPyObjectSteal in _CALL_STR_1
This was missed in gh-124894
* Regenerate code
This is essentially a cleanup, moving a handful of API declarations to the header files where they fit best, creating new ones when needed.
We do the following:
* add pycore_debug_offsets.h and move _Py_DebugOffsets, etc. there
* inline struct _getargs_runtime_state and struct _gilstate_runtime_state in _PyRuntimeState
* move struct _reftracer_runtime_state to the existing pycore_object_state.h
* add pycore_audit.h and move to it _Py_AuditHookEntry , _PySys_Audit(), and _PySys_ClearAuditHooks
* add audit.h and cpython/audit.h and move the existing audit-related API there
*move the perfmap/trampoline API from cpython/sysmodule.h to cpython/ceval.h, and remove the now-empty cpython/sysmodule.h
Users want to know when the current context switches to a different
context object. Right now this happens when and only when a context
is entered or exited, so the enter and exit events are synonymous with
"switched". However, if the changes proposed for gh-99633 are
implemented, the current context will also switch for reasons other
than context enter or exit. Since users actually care about context
switches and not enter or exit, replace the enter and exit events with
a single switched event.
The former exit event was emitted just before exiting the context.
The new switched event is emitted after the context is exited to match
the semantics users expect of an event with a past-tense name. If
users need the ability to clean up before the switch takes effect,
another event type can be added in the future. It is not added here
because YAGNI.
I skipped 0 in the enum as a matter of practice. Skipping 0 makes it
easier to troubleshoot when code forgets to set zeroed memory, and it
aligns with best practices for other tools (e.g.,
https://protobuf.dev/programming-guides/dos-donts/#unspecified-enum).
Co-authored-by: Richard Hansen <rhansen@rhansen.org>
Co-authored-by: Victor Stinner <vstinner@python.org>
Use per-thread refcounting for the reference from function objects to
their corresponding code object. This can be a source of contention when
frequently creating nested functions. Deferred refcounting alone isn't a
great fit here because these references are on the heap and may be
modified by other libraries.
This fixes a crash when running the PyO3 test suite on the free-threaded
build. The `qsbr` field is initialized after the `PyThreadState` is
added to the interpreter's linked list -- it might still be NULL.
Instead, we "steal" the queue of to-be-freed memory blocks. This is
always initialized (possibly empty) and protected by the stop the world
pause.
Users want to know when the current context switches to a different
context object. Right now this happens when and only when a context
is entered or exited, so the enter and exit events are synonymous with
"switched". However, if the changes proposed for gh-99633 are
implemented, the current context will also switch for reasons other
than context enter or exit. Since users actually care about context
switches and not enter or exit, replace the enter and exit events with
a single switched event.
The former exit event was emitted just before exiting the context.
The new switched event is emitted after the context is exited to match
the semantics users expect of an event with a past-tense name. If
users need the ability to clean up before the switch takes effect,
another event type can be added in the future. It is not added here
because YAGNI.
I skipped 0 in the enum as a matter of practice. Skipping 0 makes it
easier to troubleshoot when code forgets to set zeroed memory, and it
aligns with best practices for other tools (e.g.,
https://protobuf.dev/programming-guides/dos-donts/#unspecified-enum).
When formatting the AST as a string, infinite values are replaced by
1e309, which evaluates to infinity. The initialization of this string
replacement was not thread-safe in the free threading build.
Each of the `LOAD_GLOBAL` specializations is implemented roughly as:
1. Load keys version.
2. Load cached keys version.
3. Deopt if (1) and (2) don't match.
4. Load keys.
5. Load cached index into keys.
6. Load object from (4) at offset from (5).
This is not thread-safe in free-threaded builds; the keys object may be replaced
in between steps (3) and (4).
This change refactors the specializations to avoid reloading the keys object and
instead pass the keys object from guards to be consumed by downstream uops.
* Replace unicode_compare_eq() with unicode_eq().
* Use unicode_eq() in setobject.c.
* Replace _PyUnicode_EQ() with _PyUnicode_Equal().
* Remove unicode_compare_eq() and _PyUnicode_EQ().
* Make slices marshallable
* Emit slices as constants
* Update Python/marshal.c
Co-authored-by: Peter Bierma <zintensitydev@gmail.com>
* Refactor codegen_slice into two functions so it
always has the same net effect
* Fix for free-threaded builds
* Simplify marshal loading of slices
* Only return SUCCESS/ERROR from codegen_slice
---------
Co-authored-by: Mark Shannon <mark@hotpy.org>
Co-authored-by: Peter Bierma <zintensitydev@gmail.com>
Stop the world when invalidating function versions
The tier1 interpreter specializes `CALL` instructions based on the values
of certain function attributes (e.g. `__code__`, `__defaults__`). The tier1
interpreter uses function versions to verify that the attributes of a function
during execution of a specialization match those seen during specialization.
A function's version is initialized in `MAKE_FUNCTION` and is invalidated when
any of the critical function attributes are changed. The tier1 interpreter stores
the function version in the inline cache during specialization. A guard is used by
the specialized instruction to verify that the version of the function on the operand
stack matches the cached version (and therefore has all of the expected attributes).
It is assumed that once the guard passes, all attributes will remain unchanged
while executing the rest of the specialized instruction.
Stopping the world when invalidating function versions ensures that all critical
function attributes will remain unchanged after the function version guard passes
in free-threaded builds. It's important to note that this is only true if the remainder
of the specialized instruction does not enter and exit a stop-the-world point.
We will stop the world the first time any of the following function attributes
are mutated:
- defaults
- vectorcall
- kwdefaults
- closure
- code
This should happen rarely and only happens once per function, so the performance
impact on majority of code should be minimal.
Additionally, refactor the API for manipulating function versions to more clearly
match the stated semantics.
* Spill the evaluation around escaping calls in the generated interpreter and JIT.
* The code generator tracks live, cached values so they can be saved to memory when needed.
* Spills the stack pointer around escaping calls, so that the exact stack is visible to the cycle GC.
Instead of surprise crashes and memory corruption, we now hang threads that attempt to re-enter the Python interpreter after Python runtime finalization has started. These are typically daemon threads (our long standing mis-feature) but could also be threads spawned by extension modules that then try to call into Python. This marks the `PyThread_exit_thread` public C API as deprecated as there is no plausible safe way to accomplish that on any supported platform in the face of things like C++ code with finalizers anywhere on a thread's stack. Doing this was the least bad option.
Co-authored-by: Gregory P. Smith <greg@krypto.org>
Currently, we only use per-thread reference counting for heap type objects and
the naming reflects that. We will extend it to a few additional types in an
upcoming change to avoid scaling bottlenecks when creating nested functions.
Rename some of the files and functions in preparation for this change.
Instead of be limited just by the size of addressable memory (2**63
bytes), Python integers are now also limited by the number of bits, so
the number of bit now always fit in a 64-bit integer.
Both limits are much larger than what might be available in practice,
so it doesn't affect users.
_PyLong_NumBits() and _PyLong_Frexp() are now always successful.
Fix a bug that can cause a crash when sub-interpreters use "basic"
single-phase extension modules. Shared objects could refer to PyGC_Head
nodes that had been freed as part of interpreter shutdown.
Use a `_PyStackRef` and defer the reference to `f_funcobj` when
possible. This avoids some reference count contention in the common case
of executing the same code object from multiple threads concurrently in
the free-threaded build.
* Detect source file encoding.
* Use the "replace" error handler even for UTF-8 (default) encoding.
* Remove the BOM.
* Fix detection of too long lines if they contain NUL.
* Return the head rather than the tail for truncated long lines.
If the generator is already cleared, then most fields in the
generator's frame are not valid other than f_executable. The invalid
fields may contain dangling pointers and should not be used.
Use a `_PyStackRef` and defer the reference to `f_executable` when
possible. This avoids some reference count contention in the common case
of executing the same code object from multiple threads concurrently in
the free-threaded build.
Setting dev_mode to 1 in an isolated configuration now enables also
faulthandler.
Moreover, setting "module_search_paths" option with
PyInitConfig_SetStrList() now sets "module_search_paths_set" to 1.
Add PyConfig_Get(), PyConfig_GetInt(), PyConfig_Set() and
PyConfig_Names() functions to get and set the current runtime Python
configuration.
Add visibility and "sys spec" to config and preconfig specifications.
_PyConfig_AsDict() now converts PyConfig.xoptions as a dictionary.
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
Switch more _Py_IsImmortal(...) assertions to _Py_IsImmortalLoose(...)
The remaining calls to _Py_IsImmortal are in free-threaded-only code,
initialization of core objects, tests, and guards that fall back to
code that works with mortal objects.
The `zip_next` function uses a common optimization technique for methods
that generate tuples. The iterator maintains an internal reference to
the returned tuple. When the method is called again, it checks if the
internal tuple's reference count is 1. If so, the tuple can be reused.
However, this approach is not safe under the free-threading build:
after checking the reference count, another thread may perform the same
check and also reuse the tuple. This can result in a double decref on
the items of the replaced tuple and a double incref (memory leak) on
the items of the tuple being set.
This adds a function, `_PyObject_IsUniquelyReferenced` that
encapsulates the stricter logic necessary for the free-threaded build:
the internal tuple must be owned by the current thread, have a local
refcount of one, and a shared refcount of zero.
`Py_DECREF` and `PyStackRef_CLOSE` are now implemented as macros in the
free-threaded build in ceval.c. There are two motivations;
* MSVC has problems inlining functions in ceval.c in the PGO build.
* We will want to mark escaping calls in order to spill the stack
pointer in ceval.c and we will want to do this around `_Py_Dealloc`
(or `_Py_MergeZeroLocalRefcount` or `_Py_DecRefShared`), not around
the entire `Py_DECREF` or `PyStackRef_CLOSE` call.
The free-threaded GC now visits interpreter stacks to keep objects
that use deferred reference counting alive.
Interpreter frames are zero initialized in the free-threaded GC so
that the GC doesn't see garbage data. This is a temporary measure
until stack spilling around escaping calls is implemented.
Co-authored-by: Ken Jin <kenjin@python.org>
As of 529a160 (gh-118204), building with HAVE_DYNAMIC_LOADING stopped working. This is a minimal fix just to get builds working again. There are actually a number of long-standing deficiencies with HAVE_DYNAMIC_LOADING builds that need to be resolved separately.
This replaces `_PyList_FromArraySteal` with `_PyList_FromStackRefSteal`.
It's functionally equivalent, but takes a `_PyStackRef` array instead of
an array of `PyObject` pointers.
Co-authored-by: Ken Jin <kenjin@python.org>
`BUILD_SET` should use a borrow instead of a steal. The cleanup in `_DO_CALL`
`CONVERSION_FAILED` was incorrect.
Co-authored-by: Ken Jin <kenjin@python.org>
We were not properly accounting for interpreter memory leaks at
shutdown and had two sources of leaks:
* Objects that use deferred reference counting and were reachable via
static types outlive the final GC. We now disable deferred reference
counting on all objects if we are calling the GC due to interpreter
shutdown.
* `_PyMem_FreeDelayed` did not properly check for interpreter shutdown
so we had some memory blocks that were enqueued to be freed, but
never actually freed.
* `_PyType_FinalizeIdPool` wasn't called at interpreter shutdown.
Fix _PyArg_UnpackKeywordsWithVararg for the case when argument for
positional-or-keyword parameter is passed by keyword.
There was only one such case in the stdlib -- the TypeVar constructor.
This automatically spills the results from `_PyStackRef_FromPyObjectNew`
to the in-memory stack so that the deferred references are visible to
the GC before we make any possibly escaping call.
Co-authored-by: Ken Jin <kenjin@python.org>