Adds `_PyObject_GetMethodStackRef` which uses stackrefs and takes advantage of deferred reference counting in free-threading while calling method objects in vectorcall.
Add a separate benchmark that measures the effect of
`_PyObject_LookupSpecial()` on scaling.
In the process of cleaning up the scaling benchmarks for inclusion, I
unintentionally changed the "cmodule_function" benchmark to pass an
`int` to `math.floor()` instead of a `float`, which causes it to use the
`_PyObject_LookupSpecial()` code path. `_PyObject_LookupSpecial()` has
its own scaling issues that we want to measure separately from calling a
function on a C module.
These consist of a number of short snippets that help identify scaling
bottlenecks in the free threaded interpreter.
The current bottlenecks are in calling functions in benchmarks that call
functions (due to `LOAD_ATTR` not yet using deferred reference counting)
and when accessing thread-local data.