Replace the private _PyUnicodeWriter with the public PyUnicodeWriter.
Replace PyObject_Repr() + _PyUnicodeWriter_WriteStr()
with PyUnicodeWriter_WriteRepr().
Fix a crash caused by immortal interned strings being shared between
sub-interpreters that use basic single-phase init. In that case, the string
can be used by an interpreter that outlives the interpreter that created and
interned it. For interpreters that share obmalloc state, also share the
interned dict with the main interpreter.
This is an un-revert of gh-124646 that then addresses the Py_TRACE_REFS
failures identified by gh-124785.
* Replace unicode_compare_eq() with unicode_eq().
* Use unicode_eq() in setobject.c.
* Replace _PyUnicode_EQ() with _PyUnicode_Equal().
* Remove unicode_compare_eq() and _PyUnicode_EQ().
* Make slices marshallable
* Emit slices as constants
* Update Python/marshal.c
Co-authored-by: Peter Bierma <zintensitydev@gmail.com>
* Refactor codegen_slice into two functions so it
always has the same net effect
* Fix for free-threaded builds
* Simplify marshal loading of slices
* Only return SUCCESS/ERROR from codegen_slice
---------
Co-authored-by: Mark Shannon <mark@hotpy.org>
Co-authored-by: Peter Bierma <zintensitydev@gmail.com>
Stop the world when invalidating function versions
The tier1 interpreter specializes `CALL` instructions based on the values
of certain function attributes (e.g. `__code__`, `__defaults__`). The tier1
interpreter uses function versions to verify that the attributes of a function
during execution of a specialization match those seen during specialization.
A function's version is initialized in `MAKE_FUNCTION` and is invalidated when
any of the critical function attributes are changed. The tier1 interpreter stores
the function version in the inline cache during specialization. A guard is used by
the specialized instruction to verify that the version of the function on the operand
stack matches the cached version (and therefore has all of the expected attributes).
It is assumed that once the guard passes, all attributes will remain unchanged
while executing the rest of the specialized instruction.
Stopping the world when invalidating function versions ensures that all critical
function attributes will remain unchanged after the function version guard passes
in free-threaded builds. It's important to note that this is only true if the remainder
of the specialized instruction does not enter and exit a stop-the-world point.
We will stop the world the first time any of the following function attributes
are mutated:
- defaults
- vectorcall
- kwdefaults
- closure
- code
This should happen rarely and only happens once per function, so the performance
impact on majority of code should be minimal.
Additionally, refactor the API for manipulating function versions to more clearly
match the stated semantics.
Currently, we only use per-thread reference counting for heap type objects and
the naming reflects that. We will extend it to a few additional types in an
upcoming change to avoid scaling bottlenecks when creating nested functions.
Rename some of the files and functions in preparation for this change.
Instead of be limited just by the size of addressable memory (2**63
bytes), Python integers are now also limited by the number of bits, so
the number of bit now always fit in a 64-bit integer.
Both limits are much larger than what might be available in practice,
so it doesn't affect users.
_PyLong_NumBits() and _PyLong_Frexp() are now always successful.
* Setting the __module__ attribute for a class now removes the
__firstlineno__ item from the type's dict.
* The _collections_abc and _pydecimal modules now completely replace the
collections.abc and decimal modules after importing them. This
allows to get the source of classes and functions defined in these
modules.
* inspect.findsource() now checks whether the first line number for a
class is out of bound.
Use a `_PyStackRef` and defer the reference to `f_funcobj` when
possible. This avoids some reference count contention in the common case
of executing the same code object from multiple threads concurrently in
the free-threaded build.
We need to return immediately if there's an error during dictionary
lookup.
Also avoid the conditional-if operator. MSVC versions through v19.27 miscompile
compound literals with side effects within a conditional operator. This caused
crashes in the Windows10 buildbot.
Use a `_PyStackRef` and defer the reference to `f_executable` when
possible. This avoids some reference count contention in the common case
of executing the same code object from multiple threads concurrently in
the free-threaded build.
In gh-121602, I applied a fix to a builtin types initialization bug.
That fix made sense in the context of some broader future changes,
but introduced a little bit of extra complexity. That fix has turned
out to be incomplete for some of the builtin types we haven't
been testing. I found that out while improving the tests.
A while back, @markshannon suggested a simpler fix that doesn't
have that problem, which I've already applied to 3.12 and 3.13.
I'm switching to that here. Given the potential long-term
benefits of the more complex (but still incomplete) approach,
I'll circle back to it in the future, particularly after I've improved
the tests so no corner cases slip through the cracks.
(This is effectively a "forward-port" of 716c677 from 3.13.)
Switch more _Py_IsImmortal(...) assertions to _Py_IsImmortalLoose(...)
The remaining calls to _Py_IsImmortal are in free-threaded-only code,
initialization of core objects, tests, and guards that fall back to
code that works with mortal objects.
* Replace PyLong_AS_LONG() with PyLong_AsLong().
* Call PyLong_AsLong() only once per the replacement code.
* Use PyMapping_GetOptionalItem() instead of PyObject_GetItem().
Check that the current default heap is initialized in
`_mi_os_get_aligned_hint` and `mi_os_claim_huge_pages`.
The mimalloc function `_mi_os_get_aligned_hint` assumes that there is an
initialized default heap. This is true for our main thread, but not for
background threads. The problematic code path is usually called during
initialization (i.e., `Py_Initialize`), but it may also be called if the
program allocates large amounts of memory in total.
The crash only affected the free-threaded build.
There were a still a number of gaps in the tests, including not looking
at all the builtin types and not checking wrappers in subinterpreters
that weren't in the main interpreter. This fixes all that.
I considered incorporating the names of the PyTypeObject fields
(a la gh-122866), but figured doing so doesn't add much value.
This replaces `_PyList_FromArraySteal` with `_PyList_FromStackRefSteal`.
It's functionally equivalent, but takes a `_PyStackRef` array instead of
an array of `PyObject` pointers.
Co-authored-by: Ken Jin <kenjin@python.org>
* Parameters after the var-positional parameter are now keyword-only
instead of positional-or-keyword.
* Correctly calculate min_kw_only.
* Raise errors for invalid combinations of the var-positional parameter
with "*", "/" and deprecation markers.
We were not properly accounting for interpreter memory leaks at
shutdown and had two sources of leaks:
* Objects that use deferred reference counting and were reachable via
static types outlive the final GC. We now disable deferred reference
counting on all objects if we are calling the GC due to interpreter
shutdown.
* `_PyMem_FreeDelayed` did not properly check for interpreter shutdown
so we had some memory blocks that were enqueued to be freed, but
never actually freed.
* `_PyType_FinalizeIdPool` wasn't called at interpreter shutdown.
Return -1 and set an exception on error; return 0 if the iterator is
exhausted, and return 1 if the next item was fetched successfully.
Prefer this API to PyIter_Next(), which requires the caller to use
PyErr_Occurred() to differentiate between iterator exhaustion and errors.
Co-authered-by: Irit Katriel <iritkatriel@yahoo.com>
The free-threaded build partially stores heap type reference counts in
distributed manner in per-thread arrays. This avoids reference count
contention when creating or destroying instances.
Co-authored-by: Ken Jin <kenjin@python.org>
The `PyStructSequence` destructor would crash if it was deallocated after
its type's dictionary was cleared by the GC, because it couldn't compute
the "real size" of the instance. This could occur with relatively
straightforward code in the free-threaded build or with a reference
cycle involving the type in the default build, due to differing orders
in which `tp_clear()` was called.
Account for the non-sequence fields in `tp_basicsize` and use that,
along with `Py_SIZE()`, to compute the "real" size of a
`PyStructSequence` in the dealloc function. This avoids the accesses to
the type's dictionary during dealloc, which were unsafe.
* gh-120974: Make _asyncio._leave_task atomic in the free-threaded build
Update `_PyDict_DelItemIf` to allow for an argument to be passed to the
predicate.
This refactors asyncio to use the common freelist helper functions and
macros. As a side effect, the freelist for _asyncio.Future is now
re-enabled in the free-threaded build.
This combines and updates our freelist handling to use a consistent
implementation. Objects in the freelist are linked together using the
first word of memory block.
If configured with freelists disabled, these operations are essentially
no-ops.
compare_unicode_generic(), compare_unicode_unicode() and
compare_generic() are callbacks used by do_lookup(). When enabling
assertions, it's not possible to inline these functions.
* Switch PyUnicode_InternInPlace to _PyUnicode_InternMortal, clarify docs
* Document immortality in some functions that take `const char *`
This is PyUnicode_InternFromString;
PyDict_SetItemString, PyObject_SetAttrString;
PyObject_DelAttrString; PyUnicode_InternFromString;
and the PyModule_Add convenience functions.
Always point out a non-immortalizing alternative.
* Don't immortalize user-provided attr names in _ctypes
We should maintain the invariant that a zero `ob_tid` implies the
refcount fields are merged.
* Move the assignment in `_Py_MergeZeroLocalRefcount` to immediately
before the refcount merge.
* Update `_PyTrash_thread_destroy_chain` to set `ob_ref_shared` to
`_Py_REF_MERGED` when setting `ob_tid` to zero.
Also check this invariant with assertions in the GC in debug builds.
That uncovered a bug when running out of memory during GC.
They are alternate constructors which only accept numbers
(including objects with special methods __float__, __complex__
and __index__), but not strings.
Performance improvement to `float.fromhex`: use a lookup table
for computing the hexadecimal value of a character, in place of the
previous switch-case construct. Patch by Bruno Lima.
* The result has type Py_ssize_t, not intptr_t.
* Type cast between unsigned and signdet integer types should be explicit.
* Downcasting should be explicit.
* Fix integer overflow check in sum().
When builtin static types are initialized for a subinterpreter, various "tp" slots have already been inherited (for the main interpreter). This was interfering with the logic in add_operators() (in Objects/typeobject.c), causing a wrapper to get created when it shouldn't. This change fixes that by preserving the original data from the static type struct and checking that.
The `used` field must be written using atomic stores because `set_len`
and iterators may access the field concurrently without holding the
per-object lock.
The `_PySeqLock_EndRead` function needs an acquire fence to ensure that
the load of the sequence happens after any loads within the read side
critical section. The missing fence can trigger bugs on macOS arm64.
Additionally, we need a release fence in `_PySeqLock_LockWrite` to
ensure that the sequence update is visible before any modifications to
the cache entry.
Make error message for index() methods consistent
Remove the repr of the searched value (which can be arbitrary large)
from ValueError messages for list.index(), range.index(), deque.index(),
deque.remove() and ShareableList.index(). Make the error messages
consistent with error messages for other index() and remove()
methods.
Refactor the fast Unicode hash check into `_PyObject_HashFast` and use relaxed
atomic loads in the free-threaded build.
After this change, the TSAN doesn't report data races for this method.
In some cases, previously computed as (nan+nanj), we could
recover meaningful component values in the result, see
e.g. the C11, Annex G.5.2, routine _Cdivd().
Fix warnings when using -Wimplicit-fallthrough compiler flag.
Annotate explicitly "fall through" switch cases with a new
_Py_FALLTHROUGH macro which uses __attribute__((fallthrough)) if
available. Replace "fall through" comments with _Py_FALLTHROUGH.
Add _Py__has_attribute() macro. No longer define __has_attribute()
macro if it's not defined. Move also _Py__has_builtin() at the top
of pyport.h.
Co-Authored-By: Nikita Sobolev <mail@sobolevn.me>
This PR sets up tagged pointers for CPython.
The general idea is to create a separate struct _PyStackRef for everything on the evaluation stack to store the bits. This forces the C compiler to warn us if we try to cast things or pull things out of the struct directly.
Only for free threading: We tag the low bit if something is deferred - that means we skip incref and decref operations on it. This behavior may change in the future if Mark's plans to defer all objects in the interpreter loop pans out.
This implies a strict stack reference discipline is required. ALL incref and decref operations on stackrefs must use the stackref variants. It is unsafe to untag something then do normal incref/decref ops on it.
The new incref and decref variants are called dup and close. They mimic a "handle" API operating on these stackrefs.
Please read Include/internal/pycore_stackref.h for more information!
---------
Co-authored-by: Mark Shannon <9448417+markshannon@users.noreply.github.com>
Remove the const qualifier of the argument of functions:
* _PyLong_IsCompact()
* _PyLong_CompactValue()
Py_TYPE() argument is not const.
Fix the compiler warning:
Include/cpython/longintrepr.h: In function ‘_PyLong_CompactValue’:
Include/pyport.h:19:31: error: cast discards ‘const’ qualifier from
pointer target type [-Werror=cast-qual]
(...)
Include/cpython/longintrepr.h:133:30: note: in expansion of macro
‘Py_TYPE’
assert(PyType_HasFeature(Py_TYPE(op), Py_TPFLAGS_LONG_SUBCLASS));
PyDict_Next no longer locks the dictionary in the free-threaded build. Locking
around individual PyDict_Next calls is not sufficient because the function
returns borrowed references and because it allows concurrent modifications
during the iteraiton loop.
The internal locking also interferes with correct external synchronization
because it may suspend outer critical sections created by the caller.
Moves the logic to update the type's dictionary to its own function in order
to make the lock scoping more clear.
Also, ensure that `name` is decref'd on the error path.
PyUnicode_FromFormat() no longer produces the ending \ufffd
character for truncated C string when use precision with %s and %V.
It now truncates the string before the start of truncated multibyte sequences.
This makes the following macros public as part of the non-limited C-API for
locking a single object or two objects at once.
* `Py_BEGIN_CRITICAL_SECTION(op)` / `Py_END_CRITICAL_SECTION()`
* `Py_BEGIN_CRITICAL_SECTION2(a, b)` / `Py_END_CRITICAL_SECTION2()`
The supporting functions and structs used by the macros are also exposed for
cases where C macros are not available.
* Add an InternalDocs file describing how interning should work and how to use it.
* Add internal functions to *explicitly* request what kind of interning is done:
- `_PyUnicode_InternMortal`
- `_PyUnicode_InternImmortal`
- `_PyUnicode_InternStatic`
* Switch uses of `PyUnicode_InternInPlace` to those.
* Disallow using `_Py_SetImmortal` on strings directly.
You should use `_PyUnicode_InternImmortal` instead:
- Strings should be interned before immortalization, otherwise you're possibly
interning a immortalizing copy.
- `_Py_SetImmortal` doesn't handle the `SSTATE_INTERNED_MORTAL` to
`SSTATE_INTERNED_IMMORTAL` update, and those flags can't be changed in
backports, as they are now part of public API and version-specific ABI.
* Add private `_only_immortal` argument for `sys.getunicodeinternedsize`, used in refleak test machinery.
* Make sure the statically allocated string singletons are unique. This means these sets are now disjoint:
- `_Py_ID`
- `_Py_STR` (including the empty string)
- one-character latin-1 singletons
Now, when you intern a singleton, that exact singleton will be interned.
* Add a `_Py_LATIN1_CHR` macro, use it instead of `_Py_ID`/`_Py_STR` for one-character latin-1 singletons everywhere (including Clinic).
* Intern `_Py_STR` singletons at startup.
* For free-threaded builds, intern `_Py_LATIN1_CHR` singletons at startup.
* Beef up the tests. Cover internal details (marked with `@cpython_only`).
* Add lots of assertions
Co-Authored-By: Eric Snow <ericsnowcurrently@gmail.com>
The public PyUnicodeWriter API enables overallocation by default and
so is more efficient.
Benchmark:
python -m pyperf timeit \
-s 't = int | float | complex | str | bytes | bytearray' \
' | memoryview | list | dict' \
'str(t)'
Result:
1.29 us +- 0.02 us -> 1.00 us +- 0.02 us: 1.29x faster
The public PyUnicodeWriter API enables overallocation by default and
so is more efficient.
Benchmark:
python -m pyperf timeit \
-s 't = list[int, float, complex, str, bytes, bytearray, ' \
'memoryview, list, dict]' \
'str(t)'
Result:
1.49 us +- 0.03 us -> 1.10 us +- 0.02 us: 1.35x faster
This exposes `PyUnstable_Object_ClearWeakRefsNoCallbacks` as an unstable
C-API function to provide a thread-safe mechanism for clearing weakrefs
without executing callbacks.
Some C-API extensions need to clear weakrefs without calling callbacks,
such as after running finalizers like we do in subtype_dealloc.
Previously they could use `_PyWeakref_ClearRef` on each weakref, but
that's not thread-safe in the free-threaded build.
Co-authored-by: Petr Viktorin <encukou@gmail.com>
In gh-120009 I used an atexit hook to finalize the _datetime module's static types at interpreter shutdown. However, atexit hooks are executed very early in finalization, which is a problem in the few cases where a subclass of one of those static types is still alive until the final GC collection. The static builtin types don't have this probably because they are finalized toward the end, after the final GC collection. To avoid the problem for _datetime, I have applied a similar approach here.
Also, credit goes to @mgorny and @neonene for the new tests.
FYI, I would have liked to take a slightly cleaner approach with managed static types, but wanted to get a smaller fix in first for the sake of backporting. I'll circle back to the cleaner approach with a future change on the main branch.
We need to write to `ob_ref_local` and `ob_tid` before `ob_ref_shared`.
Once we mark `ob_ref_shared` as merged, some other thread may free the
object because the caller also passes in `-1` as `extra` to give up its
only reference.
We make use of the same mechanism that we use for the static builtin types. This required a few tweaks.
The relevant code could use some cleanup but I opted to avoid the significant churn in this change. I'll tackle that separately.
This change is the final piece needed to make _datetime support multiple interpreters. I've updated the module slot accordingly.
The free-threaded build currently immortalizes objects that use deferred
reference counting (see gh-117783). This typically happens once the
first non-main thread is created, but the behavior can be suppressed for
tests, in subinterpreters, or during a compile() call.
This fixes a race condition involving the tracking of whether the
behavior is suppressed.
Remove the delegation of `int` to the `__trunc__` special method: `int` will now only delegate to `__int__` and `__index__` (in that order). `__trunc__` continues to exist, but its sole purpose is to support `math.trunc`.
---------
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
* Add docs for new APIs
* Add soft-deprecation notices
* Add What's New porting entries
* Update comments referencing `PyFrame_LocalsToFast()` to mention the proxy instead
* Other related cleanups found when looking for refs to the deprecated APIs
* Passing a string as the "real" keyword argument is now an error;
it should only be passed as a single positional argument.
* Passing a complex number as the "real" or "imag" argument is now deprecated;
it should only be passed as a single positional argument.
* Remove the equivalence with real+imag*1j which can be incorrect in corner
cases (non-finite numbers, the sign of zeroes).
* Separately document the three roles of the constructor: parsing a string,
converting a number, and constructing a complex from components.
* Document positional-only parameters of complex(), float(), int() and bool()
as positional-only.
* Add examples for complex() and int().
* Specify the grammar of the string for complex().
* Improve the grammar of the string for float().
* Describe more explicitly the behavior when real and/or imag arguments are
complex numbers. (This will be deprecated in future.)
The deadlock only affected the free-threaded build and only occurred
when the GIL was enabled at runtime. The `Py_DECREF(old_name)` call
might temporarily release the GIL while holding the type seqlock.
Another thread may spin trying to acquire the seqlock while holding the
GIL.
The deadlock occurred roughly 1 in ~1,000 runs of `pool_in_threads.py`
from `test_multiprocessing_pool_circular_import`.
Add unicode_decode_utf8_writer() to write directly characters into a
_PyUnicodeWriter writer: avoid the creation of a temporary string.
Optimize PyUnicode_FromFormat() by using the new
unicode_decode_utf8_writer().
Rename unicode_fromformat_write_cstr() to
unicode_fromformat_write_utf8().
Microbenchmark on the code:
return PyUnicode_FromFormat(
"%s %s %s %s %s.",
"format", "multiple", "utf8", "short", "strings");
Result: 620 ns +- 8 ns -> 382 ns +- 2 ns: 1.62x faster.
Add `Py_BEGIN_CRITICAL_SECTION_SEQUENCE_FAST` and
`Py_END_CRITICAL_SECTION_SEQUENCE_FAST` macros and update `str.join` to use
them. Also add a regression test that would crash reliably without this
patch.
As reported in #117847 and #115366, an unpaired backtick in a docstring
tends to confuse e.g. Sphinx running on subclasses of standard library
objects, and the typographic style of using a backtick as an opening
quote is no longer in favor. Convert almost all uses of the form
The variable `foo' should do xyz
to
The variable 'foo' should do xyz
and also fix up miscellaneous other unpaired backticks (extraneous /
missing characters).
No functional change is intended here other than in human-readable
docstrings.
The PEP 649 implementation will require a way to load NotImplementedError
from the bytecode. @markshannon suggested implementing this by converting
LOAD_ASSERTION_ERROR into a more general mechanism for loading constants.
This PR adds this new opcode. I will work on the rest of the implementation
of the PEP separately.
Co-authored-by: Irit Katriel <1055913+iritkatriel@users.noreply.github.com>
* BaseException_vectorcall() now creates a tuple from 'args' array.
* Creation an exception using BaseException_vectorcall() is now a
single function call, rather than having to call
BaseException_new() and then BaseException_init().
Calling BaseException_init() is inefficient since it overrides
the 'args' attribute.
* _PyErr_SetKeyError() now uses PyObject_CallOneArg() to create the
KeyError instance to use BaseException_vectorcall().
The `list_preallocate_exact` function did not zero initialize array
contents. In the free-threaded build, this could expose uninitialized
memory to concurrent readers between the call to
`list_preallocate_exact` and the filling of the array contents with
items.
Use relaxed atomics when reading / writing to the field. There are still a
few places in the GC where we do not use atomics. Those should be safe as
the world is stopped.
We already intern and immortalize most string constants. In the
free-threaded build, other constants can be a source of reference count
contention because they are shared by all threads running the same code
objects.
Add _PyType_LookupRef and use incref before setting attribute on type
Makes setting an attribute on a class and signaling type modified atomic
Avoid adding re-entrancy exposing the type cache in an inconsistent state by decrefing after type is updated
This interns the strings for `co_filename`, `co_name`, and `co_qualname`
on codeobjects in the free-threaded build. This partially addresses a
reference counting bottleneck when creating closures concurrently. The
closures take the name and qualified name from the code object.
This PR adds the ability to enable the GIL if it was disabled at
interpreter startup, and modifies the multi-phase module initialization
path to enable the GIL when loading a module, unless that module's spec
includes a slot indicating it can run safely without the GIL.
PEP 703 called the constant for the slot `Py_mod_gil_not_used`; I went
with `Py_MOD_GIL_NOT_USED` for consistency with gh-104148.
A warning will be issued up to once per interpreter for the first
GIL-using module that is loaded. If `-v` is given, a shorter message
will be printed to stderr every time a GIL-using module is loaded
(including the first one that issues a warning).
The module itself is a thin wrapper around calls to functions in
`Python/codecs.c`, so that's where the meaningful changes happened:
- Move codecs-related state that lives on `PyInterpreterState` to a
struct declared in `pycore_codecs.h`.
- In free-threaded builds, add a mutex to `codecs_state` to synchronize
operations on `search_path`. Because `search_path_mutex` is used as a
normal mutex and not a critical section, we must be extremely careful
with operations called while holding it.
- The codec registry is explicitly initialized as part of
`_PyUnicode_InitEncodings` to simplify thread-safety.
The code for Tier 2 is now only compiled when configured
with `--enable-experimental-jit[=yes|interpreter]`.
We drop support for `PYTHON_UOPS` and -`Xuops`,
but you can disable the interpreter or JIT
at runtime by setting `PYTHON_JIT=0`.
You can also build it without enabling it by default
using `--enable-experimental-jit=yes-off`;
enable with `PYTHON_JIT=1`.
On Windows, the `build.bat` script supports
`--experimental-jit`, `--experimental-jit-off`,
`--experimental-interpreter`.
In the C code, `_Py_JIT` is defined as before
when the JIT is enabled; the new variable
`_Py_TIER2` is defined when the JIT *or* the
interpreter is enabled. It is actually a bitmask:
1: JIT; 2: default-off; 4: interpreter.
Deferred reference counting is not fully implemented yet. As a temporary
measure, we immortalize objects that would use deferred reference
counting to avoid multi-threaded scaling bottlenecks.
This is only performed in the free-threaded build once the first
non-main thread is started. Additionally, some tests, including refleak
tests, suppress this behavior.
It's not safe to raise an exception in `PyObject_ClearWeakRefs()` if one
is not already set, since it may be called by `_Py_Dealloc()`, which
requires that the active exception does not change.
Additionally, make sure we clear the weakrefs even when tuple allocation
fails.
Fix data races in the method cache in free-threaded builds
These are technically data races, but I think they're benign (to
the extent that that is actually possible). We update cache entries
non-atomically but read them atomically from another thread, and there's
nothing that establishes a happens-before relationship between the
reads and writes that I can see.
Fix mimalloc allocator for huge memory allocation (around
8,589,934,592 GiB) on s390x.
Abort allocation early in mimalloc if the number of slices doesn't
fit into uint32_t, to prevent a integer overflow (cast 64-bit
size_t to uint32_t).
We want code objects to use deferred reference counting in the
free-threaded build. This requires them to be tracked by the GC, so we
set `Py_TPFLAGS_HAVE_GC` in the free-threaded build, but not the default
build.
Guido pointed out to me that some details about the per-interpreter state for the builtin types aren't especially clear. I'm addressing that by:
* adding a comment explaining that state
* adding some asserts to point out the relationship between each index and the interp/global runtime state
This change gives a significant speedup, as the METH_FASTCALL calling
convention is now used. The following bytes and bytearray methods are adapted:
- count()
- find()
- index()
- rfind()
- rindex()
Co-authored-by: Inada Naoki <songofacandy@gmail.com>
Fall back to tp_call() for cases when arguments are passed by name.
Co-authored-by: Donghee Na <donghee.na@python.org>
Co-authored-by: Victor Stinner <vstinner@python.org>
This keeps track of the per-thread total reference count operations in
PyThreadState in the free-threaded builds. The count is merged into the
interpreter's total when the thread exits.
Most mutable data is protected by a striped lock that is keyed on the
referenced object's address. The weakref's hash is protected using the
weakref's per-object lock.
Note that this only affects free-threaded builds. Apart from some minor
refactoring, the added code is all either gated by `ifdef`s or is a no-op
(e.g. `Py_BEGIN_CRITICAL_SECTION`).
I had meant to switch everything to InterpreterError when I added it a while back. At the time I missed a few key spots.
As part of this, I've added print-the-exception to _PyXI_InitTypes() and fixed an error case in `_PyStaticType_InitBuiltin().
This change gives a significant speedup, as the METH_FASTCALL calling
convention is now used. The following methods are adapted:
- str.count
- str.find
- str.index
- str.rfind
- str.rindex
Use the fully qualified type name in repr() of weakref.ref and
weakref.proxy types.
Fix a crash in proxy_repr() when the reference is dead.
Add also test_ref_repr() and test_proxy_repr().
Add a special case for `list.extend(dict)` and `list(dict)` so that those
patterns behave atomically with respect to modifications to the list or
dictionary.
This is required by multiprocessing, which assumes that
`list(_finalizer_registry)` is atomic.
Read the MRO in a thread-unsafe way in `PyType_IsSubtype` to avoid locking. Fixing this is tracked in #117306.
The motivation for this change is in support of making weakrefs thread-safe in free-threaded builds:
`WeakValueDictionary` uses a special dictionary function, `_PyDict_DelItemIf`
to remove dead weakrefs from the dictionary. `_PyDict_DelItemIf` removes a key
if a user supplied predicate evaluates to true for the value associated with
the key. Crucially for the `WeakValueDictionary` use case, the predicate
evaluation + deletion sequence is atomic, provided that the predicate doesn’t
suspend. The predicate used by `WeakValueDictionary` includes a subtype check,
which we must ensure doesn't suspend in free-threaded builds.
Rewrote binarysort() for clarity.
Also changed the signature to be more coherent (it was mixing sortslice with raw pointers).
No change in method or functionality. However, I left some experiments in, disabled for now
via `#if` tricks. Since this code was first written, some kinds of comparisons have gotten
enormously faster (like for lists of floats), which changes the tradeoffs.
For example, plain insertion sort's simpler innermost loop and highly predictable branches
leave it very competitive (even beating, by a bit) binary insertion when comparisons are
very cheap, despite that it can do many more compares. And it wins big on runs that
are already sorted (moving the next one in takes only 1 compare then).
So I left code for a plain insertion sort, to make future experimenting easier.
Also made the maximum value of minrun a `#define` (``MAX_MINRUN`) to make
experimenting with that easier too.
And another bit of `#if``-disabled code rewrites binary insertion's innermost loop to
remove its unpredictable branch. Surprisingly, this doesn't really seem to help
overall. I'm unclear on why not. It certainly adds more instructions, but they're very
simple, and it's hard to be believe they cost as much as a branch miss.
Changes to the function version cache:
- In addition to the function object, also store the code object,
and allow the latter to be retrieved even if the function has been evicted.
- Stop assigning new function versions after a critical attribute (e.g. `__code__`)
has been modified; the version is permanently reset to zero in this case.
- Changes to `__annotations__` are no longer considered critical. (This fixes gh-109998.)
Changes to the Tier 2 optimization machinery:
- If we cannot map a function version to a function, but it is still mapped to a code object,
we continue projecting the trace.
The operand of the `_PUSH_FRAME` and `_POP_FRAME` opcodes can be either NULL,
a function object, or a code object with the lowest bit set.
This allows us to trace through code that calls an ephemeral function,
i.e., a function that may not be alive when we are constructing the executor,
e.g. a generator expression or certain nested functions.
We will lose globals removal inside such functions,
but we can still do other peephole operations
(and even possibly [call inlining](https://github.com/python/cpython/pull/116290),
if we decide to do it), which only need the code object.
As before, if we cannot retrieve the code object from the cache, we stop projecting.
I added it quite a while ago as a strategy for managing interpreter lifetimes relative to the PEP 554 (now 734) implementation. Relatively recently I refactored that implementation to no longer rely on InterpreterID objects. Thus now I'm removing it.
Add Py_GetConstant() and Py_GetConstantBorrowed() functions.
In the limited C API version 3.13, getting Py_None, Py_False,
Py_True, Py_Ellipsis and Py_NotImplemented singletons is now
implemented as function calls at the stable ABI level to hide
implementation details. Getting these constants still return borrowed
references.
Add _testlimitedcapi/object.c and test_capi/test_object.py to test
Py_GetConstant() and Py_GetConstantBorrowed() functions.
Mostly we unify the two different implementations of the conversion code (from PyObject * to int64_t. We also drop the PyArg_ParseTuple()-style converter function, as well as rename and move PyInterpreterID_LookUp().
Starting in Python 3.12, we prevented calling fork() and starting new threads
during interpreter finalization (shutdown). This has led to a number of
regressions and flaky tests. We should not prevent starting new threads
(or `fork()`) until all non-daemon threads exit and finalization starts in
earnest.
This changes the checks to use `_PyInterpreterState_GetFinalizing(interp)`,
which is set immediately before terminating non-daemon threads.
Somehow we ended up with two separate counter variables tracking "the next function version".
Most likely this was a historical accident where an old branch was updated incorrectly.
This PR merges the two counters into a single one: `interp->func_state.next_version`.
Since 3.12, allocating a GC object cannot immediately trigger GC. This
allows us to simplify the logic for creating the canonical callback-less
weakref.
* GH-116554: Relax list.sort()'s notion of "descending" run
Rewrote `count_run()` so that sub-runs of equal elements no longer end a descending run. Both ascending and descending runs can have arbitrarily many sub-runs of arbitrarily many equal elements now. This is tricky, because we only use ``<`` comparisons, so checking for equality doesn't come "for free". Surprisingly, it turned out there's a very cheap (one comparison) way to determine whether an ascending run consisted of all-equal elements. That sealed the deal.
In addition, after a descending run is reversed in-place, we now go on to see whether it can be extended by an ascending run that just happens to be adjacent. This succeeds in finding at least one additional element to append about half the time, and so appears to more than repay its cost (the savings come from getting to skip a binary search, when a short run is artificially forced to length MIINRUN later, for each new element `count_run()` can add to the initial run).
While these have been in the back of my mind for years, a question on StackOverflow pushed it to action:
https://stackoverflow.com/questions/78108792/
They were wondering why it took about 4x longer to sort a list like:
[999_999, 999_999, ..., 2, 2, 1, 1, 0, 0]
than "similar" lists. Of course that runs very much faster after this patch.
Co-authored-by: Alex Waygood <Alex.Waygood@Gmail.com>
Co-authored-by: Pieter Eendebak <pieter.eendebak@gmail.com>
This makes nearly all the operations on set thread-safe in the free-threaded build, with the exception of `_PySet_NextEntry` and `setiter_iternext`.
Co-authored-by: Sam Gross <colesbury@gmail.com>
Co-authored-by: Erlend E. Aasland <erlend.aasland@protonmail.com>
This implements the delayed reuse of mimalloc pages that contain Python
objects in the free-threaded build.
Allocations of the same size class are grouped in data structures called
pages. These are different from operating system pages. For thread-safety, we
want to ensure that memory used to store PyObjects remains valid as long as
there may be concurrent lock-free readers; we want to delay using it for
other size classes, in other heaps, or returning it to the operating system.
When a mimalloc page becomes empty, instead of immediately freeing it, we tag
it with a QSBR goal and insert it into a per-thread state linked list of
pages to be freed. When mimalloc needs a fresh page, we process the queue and
free any still empty pages that are now deemed safe to be freed. Pages
waiting to be freed are still available for allocations of the same size
class and allocating from a page prevent it from being freed. There is
additional logic to handle abandoned pages when threads exit.