In the free-threaded build, avoid data races caused by updating type slots
or type flags after the type was initially created. For those (typically
rare) cases, use the stop-the-world mechanism. Remove the use of atomics
when reading or writing type flags. The use of atomics is not sufficient to
avoid races (since flags are sometimes read without a lock and without
atomics) and are no longer required.
The PyThreadState field gains a reference count field to avoid
issues with PyThreadState being a dangling pointer to freed memory.
The refcount starts with a value of two: one reference is owned by the
interpreter's linked list of thread states and one reference is owned by
the OS thread. The reference count is decremented when the thread state
is removed from the interpreter's linked list and before the OS thread
calls `PyThread_hang_thread()`. The thread that decrements it to zero
frees the `PyThreadState` memory.
The `holds_gil` field is moved out of the `_status` bit field, to avoid
a data race where on thread calls `PyThreadState_Clear()`, modifying the
`_status` bit field while the OS thread reads `holds_gil` when
attempting to acquire the GIL.
The `PyThreadState.state` field now has `_Py_THREAD_SHUTTING_DOWN` as a
possible value. This corresponds to the `_PyThreadState_MustExit()`
check. This avoids race conditions in the free threading build when
checking `_PyThreadState_MustExit()`.
The race condition with `free_threadstate` and daemon threads exists in
both the free threading and default builds. We were missing a
suppression in the default build.
CPython current temporarily changes `PYMEM_DOMAIN_RAW` to the default
allocator during initialization and shutdown. The motivation is to
ensure that core runtime structures are allocated and freed using the
same allocator. However, modifying the current allocator changes global
state and is not thread-safe even with the GIL. Other threads may be
allocating or freeing objects use PYMEM_DOMAIN_RAW; they are not
required to hold the GIL to call PyMem_RawMalloc/PyMem_RawFree.
This adds new internal-only functions like `_PyMem_DefaultRawMalloc`
that aren't affected by calls to `PyMem_SetAllocator()`, so they're
appropriate for Python runtime initialization and finalization. Use
these calls in places where we previously swapped to the default raw
allocator.
Make tuple iteration more thread-safe, and actually test concurrent iteration of tuple, range and list. (This is prep work for enabling specialization of FOR_ITER in free-threaded builds.) The basic premise is:
Iterating over a shared iterable (list, tuple or range) should be safe, not involve data races, and behave like iteration normally does.
Using a shared iterator should not crash or involve data races, and should only produce items regular iteration would produce. It is not guaranteed to produce all items, or produce each item only once. (This is not the case for range iteration even after this PR.)
Providing stronger guarantees is possible for some of these iterators, but it's not always straight-forward and can significantly hamper the common case. Since iterators in general aren't shared between threads, and it's simply impossible to concurrently use many iterators (like generators), better to make sharing iterators without explicit synchronization clearly wrong.
Specific issues fixed in order to make the tests pass:
- List iteration could occasionally fail an assertion when a shared list was shrunk and an item past the new end was retrieved concurrently. There's still some unsafety when deleting/inserting multiple items through for example slice assignment, which uses memmove/memcpy.
- Tuple iteration could occasionally crash when the iterator's reference to the tuple was cleared on exhaustion. Like with list iteration, in free-threaded builds we can't safely and efficiently clear the iterator's reference to the iterable (doing it safely would mean extra, slow refcount operations), so just keep the iterable reference around.
Fix a few thread-safety bugs to enable test_opcache when run with TSAN:
* Use relaxed atomics when clearing `ht->_spec_cache.getitem`
(gh-115999)
* Add temporary suppression for type slot modifications (gh-127266)
* Use atomic load when reading `*dictptr`
The adaptive counter doesn't do anything currently in the free-threaded
build and TSan reports a data race due to concurrent modifications to
the counter.
In the free-threaded build, we need to lock pending->mutex when clearing
the handling_thread in order not to race with a concurrent
make_pending_calls in the same interpreter.
The `used` field must be written using atomic stores because `set_len`
and iterators may access the field concurrently without holding the
per-object lock.
Refactor the fast Unicode hash check into `_PyObject_HashFast` and use relaxed
atomic loads in the free-threaded build.
After this change, the TSAN doesn't report data races for this method.
This adds a `_PyRecursiveMutex` type based on `PyMutex` and uses that
for the import lock. This fixes some data races in the free-threaded
build and generally simplifies the import lock code.
The `_PyThreadState_Bind()` function is called before the first
`PyEval_AcquireThread()` so it's not synchronized with the stop the
world GC. We had a race where `gc_visit_heaps()` might visit a thread's
heap while it's being initialized.
Use a simple atomic int to avoid visiting heaps for threads that are not
yet fully initialized (i.e., before `tstate_mimalloc_bind()` is called).
The race was reproducible by running:
`python Lib/test/test_importlib/partial/pool_in_threads.py`.
The free-threaded build currently immortalizes objects that use deferred
reference counting (see gh-117783). This typically happens once the
first non-main thread is created, but the behavior can be suppressed for
tests, in subinterpreters, or during a compile() call.
This fixes a race condition involving the tracking of whether the
behavior is suppressed.
Only call `gc_restore_tid()` from stop-the-world contexts.
`worklist_pop()` can be called while other threads are running, so use a
relaxed atomic to modify `ob_tid`.
This ensures we don't lose races that occur in subprocesses or
interleave races from workers running in parallel.
Log files are collected and packaged into a zipfile that can be
downloaded from the "Artifacts" section of the workflow run.
`_Py_qsbr_unregister` is called when the PyThreadState is already
detached, so the access to `tstate->qsbr` isn't safe without locking the
shared mutex. Grab the `struct _qsbr_shared` from the interpreter
instead.
Using `race:` filters out warnings if the function appears anywhere in
the stack trace. This can hide a lot of unrelated warnings, especially
for a function like `_PyEval_EvalFrameDefault`, which is somewhere on
the stack more often than not.
Change all free-threaded suppressions to `race_top:`, which only matches
the top frame, and add any new suppressions this exposes.
Use relaxed atomics when reading / writing to the field. There are still a
few places in the GC where we do not use atomics. Those should be safe as
the world is stopped.
Quiet erroneous TSAN reports of data races in `_PySeqLock`
TSAN reports a couple of data races between the compare/exchange in
`_PySeqLock_LockWrite` and the non-atomic loads in `_PySeqLock_{Abandon,Unlock}Write`.
This is another instance of TSAN incorrectly modeling failed compare/exchange
as a write instead of a load.
Fix data races in the method cache in free-threaded builds
These are technically data races, but I think they're benign (to
the extent that that is actually possible). We update cache entries
non-atomically but read them atomically from another thread, and there's
nothing that establishes a happens-before relationship between the
reads and writes that I can see.
Additionally, reduce the iterations for a few weakref tests that would
otherwise take a prohibitively long amount of time (> 1 hour) when TSAN
is enabled and the GIL is disabled.