The PR introduces a new shared secret that is shared across all the nodes
of the Redis cluster. The main idea is to leverage the cluster bus to
share a secret between all the nodes such that later the nodes will be
able to authenticate using this secret and send internal commands to
each other (see #13740 for more information about internal commands).
The shared secret is chosen as follows:
1. Each node, on startup, randomly generates its own internal secret.
2. Each node shares its internal secret over the cluster ping messages.
3. If a node gets a ping message with a secret smaller than its current
secret, it adopts it.
4. Eventually all nodes converge on the minimal secret.
The secret converges as quickly as the cluster topology does.
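A minimal sketch of the adopt-the-minimum rule (steps 1-4), assuming a fixed-length binary secret; the names are illustrative, not the actual cluster code:
```c
#include <string.h>

#define INTERNAL_SECRET_LEN 32

static unsigned char my_secret[INTERNAL_SECRET_LEN]; /* randomized at startup */

/* Called when a ping extension carrying a peer's secret arrives: adopt the
 * lexicographically smaller secret so every node converges on the minimum. */
void maybeAdoptPeerSecret(const unsigned char *peer_secret) {
    if (memcmp(peer_secret, my_secret, INTERNAL_SECRET_LEN) < 0)
        memcpy(my_secret, peer_secret, INTERNAL_SECRET_LEN);
}
```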
To extend the ping messages to contain the secret, we leverage the
extension mechanism. Nodes that run an older Redis version will just
ignore those extensions.
Specific tests were added to verify that eventually all nodes see the
same secret. In addition, a verification of the secret was added to the
test infra, in `cluster_config_consistent` and `assert_cluster_state`.
This PR adds a flag to the `RM_GetContextFlags` module-API function that
indicates whether the context may execute debug commands, according to
Redis's standards.
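A hedged usage sketch; the exact flag constant is an assumption, as the PR text does not name it:
```c
/* Returns non-zero if this context may run debug commands. The flag name
 * REDISMODULE_CTX_FLAGS_DEBUG_ENABLED is hypothetical. */
int canRunDebugCommands(RedisModuleCtx *ctx) {
    return (RedisModule_GetContextFlags(ctx) & REDISMODULE_CTX_FLAGS_DEBUG_ENABLED) != 0;
}
```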
This PR introduces Codecov to automate code coverage tracking for our
project's tests.
For more information about the Codecov platform, please refer to
https://docs.codecov.com/docs/quick-start
---------
Co-authored-by: debing.sun <debing.sun@redis.com>
This PR addresses an issue where if a module does not provide a
defragmentation callback, we cannot defragment the fragmentation it
generates. However, the defragmentation process still considers a large
amount of fragmentation to be present, leading to more aggressive
defragmentation efforts that ultimately have no effect.
To mitigate this, the PR introduces a mechanism to gradually reduce the
CPU consumption for defragmentation when the defragmentation
effectiveness is poor. This occurs when the fragmentation rate drops
below 2% and the hit ratio is less than 1%, or when the fragmentation
rate increases by no more than 2%. The CPU consumption will be gradually
decreased until it reaches the minimum threshold defined by
`active-defrag-cycle-min`.
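A rough sketch of the back-off rule under the thresholds above; the variable and helper names are hypothetical, not the actual defrag.c code:
```c
/* Decay the active-defrag CPU budget while effectiveness stays poor. */
static int defrag_cpu_pct; /* current CPU percentage used by active defrag */

void maybeLowerDefragCpu(float frag_rate, float hit_ratio,
                         float frag_rate_growth, int cycle_min) {
    /* Poor effectiveness: frag rate < 2% with hit ratio < 1%, or the
     * frag rate grew by no more than 2%. */
    int ineffective = (frag_rate < 2.0f && hit_ratio < 0.01f) ||
                      (frag_rate_growth <= 2.0f);
    if (ineffective && defrag_cpu_pct > cycle_min)
        defrag_cpu_pct--; /* gradually approach active-defrag-cycle-min */
}
```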
---------
Co-authored-by: oranagra <oran@redislabs.com>
After upgrading to Ubuntu 24.04, clang 18 can detect the runtime error
"call to function XXX through pointer to incorrect function type", and
our daily CI reports these errors via UndefinedBehaviorSanitizer (UBSan):
https://github.com/redis/redis/actions/runs/12738281720/job/35500380251#step:6:346
We now add generic versions of some existing `free` functions so they
can be called through a `(void *)` pointer; they are just wrapper
functions that cast the data type and call the corresponding typed
functions.
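The wrapper pattern, sketched with an illustrative type (`foo` is hypothetical):
```c
typedef struct foo foo;

/* Existing typed destructor keeps its signature. */
void freeFooObject(foo *f);

/* Generic wrapper matching void (*)(void *), so calling it through a
 * generic function pointer has the correct function type for UBSan. */
void freeFooObjectGeneric(void *f) {
    freeFooObject((foo *)f);
}
```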
This PR is based on:
https://github.com/redis/redis/pull/12109
https://github.com/valkey-io/valkey/pull/60
Closes: https://github.com/redis/redis/issues/11678
**Motivation**
During a full sync, when the master is delivering the RDB to the replica,
incoming write commands are kept in a replication buffer in order to be
sent to the replica once RDB delivery is completed. If RDB delivery
takes a long time, it might create memory pressure on the master. Also,
once a replica connection accumulates replication data larger than the
output buffer limits, the master will kill the replica connection. This
may cause a replication failure.
The main benefit of the rdb channel replication is streaming incoming
commands in parallel to the RDB delivery. This approach shifts
replication stream buffering to the replica and reduces load on master.
We do this by opening another connection for RDB delivery. The main
channel on replica will be receiving replication stream while rdb
channel is receiving the RDB.
This feature also helps to reduce master's main process CPU load. By
opening a dedicated connection for the RDB transfer, the bgsave process
has access to the new connection and it will stream RDB directly to the
replicas. Before this change, due to the TLS connection restriction, the
bgsave process was writing RDB bytes to a pipe and the main process was
forwarding it to the replica. This is no longer necessary; the main
process can avoid these expensive socket read/write syscalls. It also
means RDB delivery to the replica will be faster as it avoids this step.
In summary, replication will be faster and master's performance during
full syncs will improve.
**Implementation steps**
1. When the replica connects to the master, it sends 'rdb-channel-repl'
as part of the capability exchange to let the master know the replica
supports rdb channel.
2. When the replica lacks sufficient data for PSYNC, the master sends a
+RDBCHANNELSYNC reply with the replica's client id. As the next step,
the replica opens a new connection (rdb-channel) and configures it
against the master with the appropriate capabilities and requirements.
It also sends the given client id back to the master over the
rdbchannel, so that the master can associate these channels. (The
initial replica connection will be referred to as the main channel.)
Then, the replica requests a full sync using the RDB channel.
3. Prior to forking, master attaches the replica's main channel to the
replication backlog to deliver replication stream starting at the
snapshot end offset.
4. The master main process sends replication stream via the main
channel, while the bgsave process sends the RDB directly to the replica
via the rdb-channel. Replica accumulates replication stream in a local
buffer, while the RDB is being loaded into the memory.
5. Once the replica completes loading the rdb, it drops the rdb channel
and streams the accumulated replication stream into the db. Sync is
completed.
**Some details**
- Currently, rdbchannel replication is supported only if
`repl-diskless-sync` is enabled on the master. Otherwise, replication
will happen over a single connection as before.
- On the replica, there is a limit to replication stream buffering. The
replica uses a new config `replica-full-sync-buffer-limit` to limit the
number of bytes to accumulate. If it is not set, the replica inherits
the `client-output-buffer-limit <replica>` hard limit config. If we
reach this limit, the replica stops accumulating. This is not a failure
scenario though; further accumulation will happen on the master side.
Depending on the configured limits on the master, the master may kill
the replica connection.
**API changes in INFO output:**
1. New replica state: `send_bulk_and_stream`. Indicates full sync is
still in progress for this replica. It is receiving replication stream
and rdb in parallel.
```
slave0:ip=127.0.0.1,port=5002,state=send_bulk_and_stream,offset=0,lag=0
```
Replica state changes in steps:
- First, the replica sends PSYNC and receives +RDBCHANNELSYNC:
`state=wait_bgsave`
- After the replica connects with the rdbchannel and delivery starts:
`state=send_bulk_and_stream`
- After full sync: `state=online`
2. On the replica side, replication stream buffering metrics:
- `replica_full_sync_buffer_size`: Currently accumulated replication
stream data in bytes.
- `replica_full_sync_buffer_peak`: Peak number of bytes that this
instance accumulated in the lifetime of the process.
```
replica_full_sync_buffer_size:20485
replica_full_sync_buffer_peak:1048560
```
**API changes in CLIENT LIST**
In the `client list` output, rdbchannel clients will have the 'C' flag
in addition to the 'S' replica flag:
```
id=11 addr=127.0.0.1:39108 laddr=127.0.0.1:5001 fd=14 name= age=5 idle=5 flags=SC db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1920 events=r cmd=psync user=default redir=-1 resp=2 lib-name= lib-ver= io-thread=0
```
**Config changes:**
- `replica-full-sync-buffer-limit`: Controls how much replication data
the replica can accumulate during rdbchannel replication. If it is not
set (a value of 0), the replica inherits the
`client-output-buffer-limit <replica>` hard limit config to limit
accumulated data.
- `repl-rdb-channel` is added as a hidden config. This is mostly for
testing, as we need to support both rdbchannel replication and the older
single-connection replication (to keep compatibility with older
versions; also, rdbchannel replication will not be enabled if
repl-diskless-sync is not enabled). It affects both the master (not to
respond to rdb channel requests) and the replica (not to declare the
capability).
**Internal API changes:**
Changes that were introduced to Redis replication:
- New replication capability is added to replconf command: `capa
rdb-channel-repl`. Indicates replica is capable of rdb channel
replication. Replica sends it when it connects to master along with
other capabilities.
- If replica needs fullsync, master replies `+RDBCHANNELSYNC
<client-id>` to the replica's PSYNC request.
- When the replica opens the rdbchannel connection, it sends
`rdb-channel 1` as part of the replconf command to let the master know
this is the rdb channel. It also sends `main-ch-client-id <client-id>`
as part of the replconf command so the master can associate the channels.
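Putting these pieces together, a sketch of the wire exchange (the ordering and exact reply formats are assumptions based on the description above):
```
replica -> master (main ch): REPLCONF capa eof capa psync2 capa rdb-channel-repl
replica -> master (main ch): PSYNC <replid> <offset>
master -> replica (main ch): +RDBCHANNELSYNC <client-id>
replica -> master (rdb ch):  REPLCONF rdb-channel 1
replica -> master (rdb ch):  REPLCONF main-ch-client-id <client-id>
replica -> master (rdb ch):  PSYNC ? -1
master -> replica (rdb ch):  +FULLRESYNC <replid> <offset>, then the RDB payload
```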
**Testing:**
As rdbchannel replication is enabled by default, we run the whole test
suite with it. However, as we need to support both rdbchannel and
single-connection replication, we'll be running some tests twice with
the `repl-rdb-channel yes/no` config.
**Replica state diagram**
```
* * Replica state machine *
*
* Main channel state
* ┌───────────────────┐
* │RECEIVE_PING_REPLY │
* └────────┬──────────┘
* │ +PONG
* ┌────────▼──────────┐
* │SEND_HANDSHAKE │ RDB channel state
* └────────┬──────────┘ ┌───────────────────────────────┐
* │+OK ┌───► RDB_CH_SEND_HANDSHAKE │
* ┌────────▼──────────┐ │ └──────────────┬────────────────┘
* │RECEIVE_AUTH_REPLY │ │ REPLCONF main-ch-client-id <clientid>
* └────────┬──────────┘ │ ┌──────────────▼────────────────┐
* │+OK │ │ RDB_CH_RECEIVE_AUTH_REPLY │
* ┌────────▼──────────┐ │ └──────────────┬────────────────┘
* │RECEIVE_PORT_REPLY │ │ │ +OK
* └────────┬──────────┘ │ ┌──────────────▼────────────────┐
* │+OK │ │ RDB_CH_RECEIVE_REPLCONF_REPLY│
* ┌────────▼──────────┐ │ └──────────────┬────────────────┘
* │RECEIVE_IP_REPLY │ │ │ +OK
* └────────┬──────────┘ │ ┌──────────────▼────────────────┐
* │+OK │ │ RDB_CH_RECEIVE_FULLRESYNC │
* ┌────────▼──────────┐ │ └──────────────┬────────────────┘
* │RECEIVE_CAPA_REPLY │ │ │+FULLRESYNC
* └────────┬──────────┘ │ │Rdb delivery
* │ │ ┌──────────────▼────────────────┐
* ┌────────▼──────────┐ │ │ RDB_CH_RDB_LOADING │
* │SEND_PSYNC │ │ └──────────────┬────────────────┘
* └─┬─────────────────┘ │ │ Done loading
* │PSYNC (use cached-master) │ │
* ┌─▼─────────────────┐ │ │
* │RECEIVE_PSYNC_REPLY│ │ ┌────────────►│ Replica streams replication
* └─┬─────────────────┘ │ │ │ buffer into memory
* │ │ │ │
* │+RDBCHANNELSYNC client-id │ │ │
* ├──────┬───────────────────┘ │ │
* │ │ Main channel │ │
* │ │ accumulates repl data │ │
* │ ┌──▼────────────────┐ │ ┌───────▼───────────┐
* │ │ REPL_TRANSFER ├───────┘ │ CONNECTED │
* │ └───────────────────┘ └────▲───▲──────────┘
* │ │ │
* │ │ │
* │ +FULLRESYNC ┌───────────────────┐ │ │
* ├────────────────► REPL_TRANSFER ├────┘ │
* │ └───────────────────┘ │
* │ +CONTINUE │
* └──────────────────────────────────────────────┘
*/
```
-----
This PR also contains changes and ideas from:
https://github.com/valkey-io/valkey/pull/837
https://github.com/valkey-io/valkey/pull/1173
https://github.com/valkey-io/valkey/pull/804
https://github.com/valkey-io/valkey/pull/945
https://github.com/valkey-io/valkey/pull/989
---------
Co-authored-by: Yuan Wang <wangyuancode@163.com>
Co-authored-by: debing.sun <debing.sun@redis.com>
Co-authored-by: Moti Cohen <moticless@gmail.com>
Co-authored-by: naglera <anagler123@gmail.com>
Co-authored-by: Amit Nagler <58042354+naglera@users.noreply.github.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Ping Xie <pingxie@outlook.com>
Co-authored-by: Ran Shidlansik <ranshid@amazon.com>
Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com>
Co-authored-by: xbasel <103044017+xbasel@users.noreply.github.com>
This update refactors prepareClientToWrite by introducing
_prepareClientToWrite for inline checks within the networking.c file,
and separates replica and non-replica handling for pending replies and
writes (_clientHasPendingRepliesSlave/NonSlave and
_writeToClientSlave/NonSlave).
---------
Co-authored-by: debing.sun <debing.sun@redis.com>
Co-authored-by: Yuan Wang <wangyuancode@163.com>
Found by @ShooterIT
## Description
If a client first creates a command with a very large number of
parameters, such as 10,000 parameters, the argv will be expanded to
accommodate 10,000. If the subsequent commands have fewer than 10,000
parameters, this argv will continue to be reused and will never be
shrunk.
## Solution
When determining whether it is necessary to rebuild argv: if the length
of the previous argv has already exceeded 1024, we re-create argv
regardless, instead of reusing the oversized allocation.
## Free argv in cron
Add a new condition to determine whether argv needs to be resized in
cron: when the number of parameters exceeds 128, we resize it
regardless, to avoid a single client consuming too much memory. A
client's retained argv will now occupy a maximum of (128 * 8 bytes).
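A minimal sketch of the cron-side resize, assuming the thresholds above (the helper name is hypothetical):
```c
/* Shrink an oversized argv back to at most 128 slots. */
void clientCronShrinkArgv(client *c) {
    if (c->argv && c->argv_len > 128 && c->argc <= 128) {
        c->argv = zrealloc(c->argv, sizeof(robj *) * 128);
        c->argv_len = 128; /* at most 128 * 8 bytes retained on 64-bit */
    }
}
```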
---------
Co-authored-by: Yuan Wang <wangyuancode@163.com>
Introduced by https://github.com/redis/redis/issues/13521
If the client argv was released due to a timeout before the complete
command was received, `argv_len` is reset to 0.
When argv is parsed again and resized, requesting a length of 0 may
result in argv being NULL, leading to a crash.
Also fix a bug where `argv_len` was not updated correctly in
`replaceClientCommandVector()`.
---------
Co-authored-by: ShooterIT <wangyuancode@163.com>
Co-authored-by: meiravgri <109056284+meiravgri@users.noreply.github.com>
The main job of the IO threads is to read queries and write replies, so
read/write metrics can reflect the workload of the IO threads. We now
also report these metrics, `io_threaded_reads/writes_processed`, in
detail for each IO thread.
Of course, to avoid breaking changes, the aggregate
`io_threaded_reads/writes_processed` fields are still there. However,
before the async IO threads change, the totals could include IO done by
the main thread while IO threads were active; now we only sum the IO
done by the IO threads.
The threads section in the `info` command output now looks as follows:
```
# Threads
io_thread_0:clients=0,reads=0,writes=0
io_thread_1:clients=54,reads=6546940,writes=6546919
io_thread_2:clients=54,reads=6513650,writes=6513625
io_thread_3:clients=54,reads=6396571,writes=6396525
io_thread_4:clients=53,reads=6511120,writes=6511097
io_thread_5:clients=53,reads=6539302,writes=6539280
io_thread_6:clients=53,reads=6502269,writes=6502248
```
Closes #13709
Fix the index error of the CRLF characters for integer-encoded strings
in the addReplyBulk function.
---------
Co-authored-by: debing.sun <debing.sun@redis.com>
This PR is based on the commits from PR
https://github.com/valkey-io/valkey/pull/1212.
While exploring the cycle distribution of the main thread with
io-threads enabled, we found that this security attack check takes
significant time in the main thread: **~3%** of main-thread cycles were
spent on the command security check.
This patch avoids doing it in the hot path entirely: we perform the
check only after the command has been looked up and not found, just
before we call commandCheckExistence.
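Roughly, the new placement looks like this (the helper name is hypothetical; only the position relative to the lookup is the point):
```c
c->cmd = lookupCommand(c->argv, c->argc);
if (c->cmd == NULL) {
    /* Off the hot path: only unknown commands pay for the security check,
     * right before commandCheckExistence() reports the unknown command. */
    if (isSuspectedSecurityAttack(c)) return C_ERR; /* hypothetical helper */
}
```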
---------
Co-authored-by: Lipeng Zhu <lipeng.zhu@intel.com>
Co-authored-by: Wangyang Guo <wangyang.guo@intel.com>
Fixes four cases where `stringmatchlen` could overrun the pattern if it is not
terminated with NUL.
These commits are cherry-picked from my
[fork](https://github.com/thaliaarchi/antirez-stringmatch), which
extracts `stringmatch` as a library and compares it to other projects by
antirez that use the same matcher.
From the flame graph, we could see that `lookupCommand` in the main
thread costs much CPU, so we let the IO threads perform `lookupCommand`.
To avoid race conditions among multiple IO threads, we made the
following changes:
- Pause all IO threads when registering or unregistering commands
- Force a full rehashing of the command table dict when resizing
## Introduction
Redis introduced IO Thread in 6.0, allowing IO threads to handle client
request reading, command parsing and reply writing, thereby improving
performance. The current IO thread implementation has a few drawbacks.
- The main thread is blocked during IO thread read/write operations and
must wait for all IO threads to complete their current tasks before it
can continue execution. In other words, the entire process is
synchronous. This prevents the efficient utilization of multi-core CPUs
for parallel processing.
- When the number of clients and requests increases moderately, it
causes all IO threads to reach full CPU utilization due to the busy wait
mechanism used by the IO threads. This makes it challenging for us to
determine which part of Redis has reached its bottleneck.
- When IO threads are enabled with TLS and io-threads-do-reads, a
disconnection of a connection with pending data may result in it being
assigned to multiple IO threads simultaneously. This can cause race
conditions and trigger assertion failures. Related issue:
redis#12540
Therefore, we designed an asynchronous IO threads solution. The IO
threads adopt an event-driven model, with the main thread dedicated to
command processing, meanwhile, the IO threads handle client read and
write operations in parallel.
## Implementation
### Overall
As before, we did not change the fact that all client commands must be
executed on the main thread, because Redis was originally designed to be
single-threaded, and processing commands in a multi-threaded manner
would inevitably introduce numerous race and synchronization issues. But
now each IO thread has an independent event loop; therefore, IO threads
can use a multiplexing approach to handle client read and write
operations, eliminating the CPU overhead caused by busy-waiting.
The execution process can be briefly described as follows:
1. The main thread assigns clients to IO threads after accepting
connections.
2. IO threads notify the main thread when clients finish reading and
parsing queries.
3. The main thread processes queries from the IO threads and generates
replies.
4. IO threads write replies to clients after receiving the client list
from the main thread, then continue handling client read and write
events.
### Each IO thread has independent event loop
We now assign each IO thread its own event loop. This approach
eliminates the need for the main thread to perform the costly
`epoll_wait` operation for handling connections (except for specific
ones). Instead, the main thread processes requests from the IO threads
and hands them back once completed, fully offloading read and write
events to the IO threads.
Additionally, all TLS operations, including handling pending data, have
been moved entirely to the IO threads. This resolves the issue where
io-threads-do-reads could not be used with TLS.
### Event-notified client queue
To facilitate communication between the IO threads and the main thread,
we designed an event-notified client queue. Each IO thread and the main
thread have two such queues to store clients waiting to be processed.
These queues are also integrated with the event loop to enable handling.
We use pthread_mutex to ensure the safety of queue operations, as well
as data visibility and ordering. Race conditions are minimized, since
each IO thread and the main thread operate on independent queues, which
avoids thread suspension due to lock contention. We also implemented an
event notifier based on `eventfd` or `pipe` to support event-driven
handling.
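A minimal sketch of the `eventfd`-based notifier (the `pipe` fallback and the queue handling are omitted; names are illustrative):
```c
#include <stdint.h>
#include <unistd.h>
#include <sys/eventfd.h>

int notifierCreate(void) {
    return eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
}

/* Producer: after pushing a client onto the mutex-protected queue,
 * wake the peer; its event loop watches the fd for readability. */
void notifierWake(int efd) {
    uint64_t one = 1;
    (void)write(efd, &one, sizeof(one));
}

/* Consumer: reset the counter, then drain the client queue. */
void notifierDrain(int efd) {
    uint64_t n;
    (void)read(efd, &n, sizeof(n));
}
```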
### Thread safety
Since the main thread and IO threads can execute in parallel, we must
handle data race issues carefully.
**client->flags**
The primary tasks of IO threads are reading and writing, i.e.
`readQueryFromClient` and `writeToClient`. However, IO threads and the
main thread may concurrently modify or access `client->flags`, leading
to potential race conditions. To address this, we introduced an io-flags
variable to record operations performed by IO threads, thereby avoiding
race conditions on `client->flags`.
**Pause IO thread**
In the main thread, we may want to operate on data of the IO threads,
for example to uninstall an event handler, access or modify the
query/output buffer, or resize the event loop; we need a clean and safe
context to do that. We pause the IO thread in `IOThreadBeforeSleep`, do
some jobs and then resume it. To avoid thread suspension, we use busy
waiting to confirm the target status. Besides, we use atomic variables
to ensure memory visibility and ordering. We introduce the functions
below to pause/resume IO threads.
```
pauseIOThread, resumeIOThread
pauseAllIOThreads, resumeAllIOThreads
pauseIOThreadsRange, resumeIOThreadsRange
```
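A hedged usage sketch of the pause/resume pair; the field holding the client's thread id is an assumed name:
```c
/* Safely touch a client owned by an IO thread from the main thread. */
void mainThreadTouchClient(client *c) {
    pauseIOThread(c->io_thread_id);  /* busy-waits until the thread parks */
    /* ... uninstall handlers or modify c's query/output buffers ... */
    resumeIOThread(c->io_thread_id);
}
```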
Testing has shown that `pauseIOThread` is highly efficient, allowing the
main thread to execute nearly 200,000 operations per second during
stress tests. Similarly, `pauseAllIOThreads` with 8 IO threads can
handle up to nearly 56,000 operations per second. But operations
performed between pausing and resuming IO threads must be quick;
otherwise, they could cause the IO threads to reach full CPU
utilization.
**freeClient and freeClientAsync**
The main thread may need to terminate a client currently running on an
IO thread, for example, due to ACL rule changes, reaching the output
buffer limit, or evicting a client. In such cases, we need to pause the
IO thread to safely operate on the client.
**maxclients and maxmemory-clients updating**
When adjusting `maxclients`, we need to resize the event loop for all IO
threads. Similarly, when modifying `maxmemory-clients`, we need to
traverse all clients to calculate their memory usage. To ensure safe
operations, we pause all IO threads during these adjustments.
**Client info reading**
The main thread may need to read a client’s fields to generate a
descriptive string, such as for the `CLIENT LIST` command or logging
purposes. In such cases, we need to pause the IO thread handling that
client. If information for all clients needs to be displayed, all IO
threads must be paused.
**Tracking redirect**
Redis supports the tracking feature and can even send invalidation
messages to a connection with a specified ID. But the target client may
be running on IO thread, directly manipulating the client’s output
buffer is not thread-safe, and the IO thread may not be aware that the
client requires a response. In such cases, we pause the IO thread
handling the client, modify the output buffer, and install a write event
handler to ensure proper handling.
**clientsCron**
In the `clientsCron` function, the main thread needs to traverse all
clients to perform operations such as timeout checks, verifying whether
they have reached the soft output buffer limit, resizing the
output/query buffer, or updating memory usage. To safely operate on a
client, the IO thread handling that client must be paused.
If we were to pause the IO thread for each client individually, the
efficiency would be very low. Conversely, pausing all IO threads
simultaneously would be costly, especially when there are many IO
threads, as clientsCron is invoked relatively frequently.
To address this, we adopted a batched approach for pausing IO threads.
At most, 8 IO threads are paused at a time. The operations mentioned
above are only performed on clients running in the paused IO threads,
significantly reducing overhead while maintaining safety.
### Observability
In the current design, the main thread always assigns clients to the IO
thread with the fewest clients. To clearly observe the number of clients
handled by each IO thread, we added a new section to the INFO output.
The `INFO THREADS` section shows the client count for each IO thread.
```
# Threads
io_thread_0:clients=0
io_thread_1:clients=2
io_thread_2:clients=2
```
Additionally, in the `CLIENT LIST` output, we also added a field to
indicate the thread to which each client is assigned.
`id=244 addr=127.0.0.1:41870 laddr=127.0.0.1:6379 ... resp=2 lib-name=
lib-ver= io-thread=1`
## Trade-off
### Special Clients
For certain special types of clients, keeping them running on IO threads
would result in severe race issues that are difficult to resolve.
Therefore, we chose not to offload these clients to the IO threads.
For replica, monitor, subscribe, and tracking clients, the main thread
may directly write a reply to them when conditions are met. The race
issues are difficult to resolve, so we have them processed in the main
thread. This includes Lua debug clients as well, since we may operate on
the connection directly.
For blocking clients, after the IO thread reads and parses a command and
hands it over to the main thread, if the client is identified as a
blocking type, it remains in the main thread. Once the blocking
operation completes and the reply is generated, the client is
transferred back to the IO thread to send the reply and wait for event
triggers.
### Clients Eviction
To support client eviction, it is necessary to update each client’s
memory usage promptly during operations such as read, write, or command
execution. However, when a client operates on an IO thread, it is not
feasible to update the memory usage immediately due to the risk of data
races. As a result, memory usage can only be updated either in the main
thread while processing commands or in the `ClientsCron` periodically.
The downside of this approach is that updates might experience a delay
of up to one second, which could impact the precision of memory
management for eviction.
To avoid incorrectly evicting clients, we adopted a best-effort
compensation solution: when we decide to evict a client, we update its
memory usage again before evicting; if the memory used by the client has
not decreased, or its memory usage bucket has not changed, we evict it;
otherwise, we do not.
However, we have not completely solved this problem. Due to the delay in
memory usage updates, it may lead us to make incorrect decisions about
the need to evict clients.
### Defragment
In the majority of cases we do NOT use the data from argv directly in
the db.
1. key names
We store a copy that we allocate in the main thread, see `sdsdup()` in
`dbAdd()`.
2. hash key and value
We store key as hfield and store value as sds, see `hfieldNew()` and
`sdsdup()` in `hashTypeSet()`.
3. other datatypes
They don't even use SDS, so there is no reference issues.
But in some cases the data from a client's argv may be retained by the
main thread.
As a result, during fragmentation cleanup, we need to move allocations
from the IO thread’s arena to the main thread’s arena. We always
allocate new memory in the main thread’s arena, but the memory released
by IO threads may not yet have been reclaimed. This ultimately causes
the fragmentation rate to be higher compared to creating and allocating
entirely within a single thread.
The following cases will lead to memory allocated by the IO thread
being kept by the main thread:
1. string related commands: `append`, `getset`, `mset` and `set`.
If `tryObjectEncoding()` does not change argv, we will keep it directly
in the main thread, see the code in `tryObjectEncoding()` (specifically
`trimStringObjectIfNeeded()`).
2. block related command.
the key names will be kept in `c->db->blocking_keys`.
3. watch command
the key names will be kept in `c->db->watched_keys`.
4. [s]subscribe command
channel name will be kept in `serverPubSubChannels`.
5. script load command
script will be kept in `server.lua_scripts`.
6. some module APIs: `RM_RetainString`, `RM_HoldString`
Those issues will be handled in other PRs.
## Testing
### Functional Testing
The commit enabling IO threads has passed all TCL tests, but we made
some changes:
**Client query buffer**: In the original code, when using a reusable
query buffer, ownership of the query buffer would be released after the
command was processed. However, with IO threads enabled, the client
transitions from an IO thread to the main thread for processing. This
causes the ownership release to occur earlier than the command
execution. As a result, when IO threads are enabled, the client's
information will never indicate that a shared query buffer is in use.
Therefore, we skip the corresponding query buffer tests in this case.
**Defragment**: Add a new defragmentation test to verify the effect of
io threads on defragmentation.
**Command delay**: For deferred clients in TCL tests, due to clients
being assigned to different threads for execution, delays may occur. To
address this, we introduced conditional waiting: the process proceeds to
the next step only when the `client list` contains the corresponding
commands.
### Sanitizer Testing
The commit passed all TCL tests and reported no errors when compiled
with the `-fsanitize=thread` and `-fsanitize=address` options enabled.
But we made the following modifications: we suppressed the sanitizer
warnings for clients with watched keys when updating `client->flags`.
We believe IO threads read `client->flags` but never modify it or read
the `CLIENT_DIRTY_CAS` bit, and only the main thread modifies this bit,
so there is no actual data race.
## Others
### IO thread number
In the new multi-threaded design, the main thread is primarily focused
on command processing to improve performance. Typically, the main thread
does not handle regular client I/O operations but is responsible for
clients such as replication and tracking clients. To avoid breaking
changes, we still consider the main thread as the first IO thread.
When the io-threads configuration is set to a low value (e.g., 2),
performance does not show a significant improvement compared to a
single-threaded setup for simple commands (such as SET or GET), as the
main thread does not consume much CPU for these simple operations. This
results in underutilized multi-core capacity. However, for more complex
commands, having a low number of IO threads may still be beneficial.
Therefore, it’s important to adjust the `io-threads` based on your own
performance tests.
Additionally, you can clearly monitor the CPU utilization of the main
thread and IO threads using `top -H -p $redis_pid`. This allows you to
easily identify where the bottleneck is. If the IO thread is the
bottleneck, increasing the `io-threads` will improve performance. If the
main thread is the bottleneck, the overall performance can only be
scaled by increasing the number of shards or replicas.
---------
Co-authored-by: debing.sun <debing.sun@redis.com>
Co-authored-by: oranagra <oran@redislabs.com>
This pull request enhances the no_value flag option in the dict implementation,
which is used to store keys without associated values. Previously, when a key
had an odd memory address and was the only item in a table entry, it could be
directly stored as a pointer without requiring an intermediate dictEntry. With
this update, the optimization has been extended to also handle keys with even
memory addresses in the same manner.
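For intuition, a generic pointer-tagging sketch; this illustrates the idea of distinguishing bare keys from entry pointers via a low bit, and is not the actual dict.c encoding:
```c
#include <stdint.h>

/* dictEntry allocations are aligned, so the low bit of a bucket pointer
 * is free to mark buckets that store a bare key instead of an entry. */
#define BUCKET_KEY_TAG 1UL

static inline void *tagBareKey(void *key) {
    return (void *)((uintptr_t)key | BUCKET_KEY_TAG);
}
static inline int bucketIsBareKey(void *b) {
    return ((uintptr_t)b & BUCKET_KEY_TAG) != 0;
}
static inline void *bucketGetKey(void *b) {
    return (void *)((uintptr_t)b & ~BUCKET_KEY_TAG);
}
```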
This PR is based on the commits from PR
https://github.com/valkey-io/valkey/pull/1442.
We deprecate the usage of classic malloc and free, but under certain
circumstances they might get imported from intrinsics. The original
thought was that we should just override malloc and free to use zmalloc
and zfree, but I think we should continue to deprecate them to avoid
accidental imports of allocations.
---------
Co-authored-by: Madelyn Olson <matolson@amazon.com>
The bug was introduced in #13558.
When merging dense hll structures, `hllDenseCompress` writes to the
wrong location and the result will be zero. The unit tests didn't cover
this case.
This PR
+ fixes the bug
+ adds `PFDEBUG SIMD (ON|OFF)` for unit tests
+ adds a new TCL test to cover the cases
Synchronized from https://github.com/valkey-io/valkey/pull/1293
---------
Signed-off-by: Xuyang Wang <xuyangwang@link.cuhk.edu.cn>
Co-authored-by: debing.sun <debing.sun@redis.com>
This PR eliminates unnecessary creation and destruction of hfield
objects, ensuring only required updates or insertions are performed.
This reduces overhead and improves performance by streamlining field
management in hash dictionaries, particularly in scenarios involving
frequent updates, like the benchmarks in:
-
[memtier_benchmark-100Kkeys-load-hash-50-fields-with-100B-values](https://github.com/redis/redis-benchmarks-specification/blob/main/redis_benchmarks_specification/test-suites/memtier_benchmark-100Kkeys-load-hash-50-fields-with-100B-values.yml)
-
[memtier_benchmark-10Mkeys-load-hash-5-fields-with-100B-values-pipeline-10](https://github.com/redis/redis-benchmarks-specification/blob/main/redis_benchmarks_specification/test-suites/memtier_benchmark-10Mkeys-load-hash-5-fields-with-100B-values-pipeline-10.yml)
To test it we can simply focus on the hfield related tests
```
tclsh tests/test_helper.tcl --single unit/type/hash-field-expire
tclsh tests/test_helper.tcl --single unit/type/hash
tclsh tests/test_helper.tcl --dump-logs --single unit/other
```
Extra check on full CI:
- [x] https://github.com/filipecosta90/redis/actions/runs/12225788759
## microbenchmark results
16.7% improvement (drop in time) in dictAddNonExistingRaw vs dictAddRaw
```
make REDIS_CFLAGS="-g -fno-omit-frame-pointer -O3 -DREDIS_TEST" -j
$ ./src/redis-server test dict --accurate
(...)
Inserting via dictAddRaw() non existing: 5000000 items in 2592 ms
(...)
Inserting via dictAddNonExistingRaw() non existing: 5000000 items in 2160 ms
```
8% improvement (drop in time) in find (non existing) and adding via
`dictGetHash()+dictFindWithHash()+dictAddNonExistingRaw()` vs
`dictFind()+dictAddRaw()`
```
make REDIS_CFLAGS="-g -fno-omit-frame-pointer -O3 -DREDIS_TEST" -j
$ ./src/redis-server test dict --accurate
(...)
Find() and inserting via dictFind()+dictAddRaw() non existing: 5000000 items in 2983 ms
Find() and inserting via dictGetHash()+dictFindWithHash()+dictAddNonExistingRaw() non existing: 5000000 items in 2740 ms
```
## benchmark results
To benchmark:
```
pip3 install redis-benchmarks-specification==0.1.250
taskset -c 0 ./src/redis-server --save '' --protected-mode no --daemonize yes
redis-benchmarks-spec-client-runner --tests-regexp ".*load-hash.*" --flushall_on_every_test_start --flushall_on_every_test_end --cpuset_start_pos 2 --override-memtier-test-time 60
```
Improvements on achievable throughput in:
test | ops/sec unstable (59953d2df6) | ops/sec this PR (24af7190fd) | % change
-- | -- | -- | --
memtier_benchmark-1key-load-hash-1K-fields-with-5B-values | 4097 | 5032 | 22.8%
memtier_benchmark-100Kkeys-load-hash-50-fields-with-100B-values | 37658 | 44688 | 18.7%
memtier_benchmark-100Kkeys-load-hash-50-fields-with-1000B-values | 14736 | 17350 | 17.7%
memtier_benchmark-1Mkeys-load-hash-5-fields-with-1000B-values-pipeline-10 | 131848 | 143485 | 8.8%
memtier_benchmark-1Mkeys-load-hash-hmset-5-fields-with-1000B-values | 82071 | 85681 | 4.4%
memtier_benchmark-1Mkeys-load-hash-5-fields-with-1000B-values | 82882 | 86336 | 4.2%
memtier_benchmark-10Mkeys-load-hash-5-fields-with-100B-values-pipeline-10 | 262502 | 273376 | 4.1%
memtier_benchmark-10Kkeys-load-hash-50-fields-with-10000B-values | 2821 | 2936 | 4.1%
---------
Co-authored-by: Moti Cohen <moticless@gmail.com>
- Add an empty string test for the new API
`RedisModule_ACLCheckKeyPrefixPermissions`.
- Fix the order of the checks: evaluate `patternLen == 0` before
`pattern[patternLen - 1] != '*'`, so an empty pattern is rejected
without reading out of bounds.
---------
Co-authored-by: debing.sun <debing.sun@redis.com>
This PR introduces API to query Expiration time of hash fields.
# New `RedisModule_HashFieldMinExpire()`
For a given hash, retrieves the minimum expiration time across all
fields. If no fields have expiration or if the key is not a hash, it
returns `REDISMODULE_NO_EXPIRE` (-1).
```
mstime_t RM_HashFieldMinExpire(RedisModuleKey *hash);
```
# Extension to `RedisModule_HashGet()`
Adds a new flag, `REDISMODULE_HASH_EXPIRE_TIME`, to retrieve the
expiration time of a specific hash field. If the field does not exist or
has no expiration, returns `REDISMODULE_NO_EXPIRE`. It is fully
backward-compatible (RM_HashGet retains its original behavior unless the
new flag is used).
Example:
```
mstime_t expiry1, expiry2;
RedisModule_HashGet(mykey, REDISMODULE_HASH_EXPIRE_TIME, "field1", &expiry1, NULL);
RedisModule_HashGet(mykey, REDISMODULE_HASH_EXPIRE_TIME, "field1", &expiry1, "field2", &expiry2, NULL);
```
This PR focuses on refining listpack encoding/decoding functions and
optimizing reply handling mechanisms related to them.
Each commit lists the measured improvement, up to the final accumulated
improvement of 16.3% on the
[memtier_benchmark-1key-list-100-elements-lrange-all-elements-pipeline-10](https://github.com/redis/redis-benchmarks-specification/blob/main/redis_benchmarks_specification/test-suites/memtier_benchmark-1key-list-100-elements-lrange-all-elements-pipeline-10.yml)
benchmark.
Connection mode | CE Baseline (Nov 14th) 701f06657d | CE PR #13652 | CE PR vs CE Unstable
-- | -- | -- | --
TCP | 155696 | 178874 | 14.9%
Unix socket | 169743 | 197428 | 16.3%
To test it we can simply focus on the reply buffer size tests:
```
tclsh tests/test_helper.tcl --single unit/replybufsize
```
### Commit details:
- 2e58d048fd +
29c6c86c6b : Eliminate an indirect memory
access on lpCurrentEncodedSizeBytes and completely avoid passing p*
fully to lpCurrentEncodedSizeBytes + Add lpNextWithBytes helper function
and optimize addListListpackRangeReply
**- Improvement of 3.1%, from 168969.88 ops/sec to 174239.75 ops/sec**
- af52aacff8 Refactor lpDecodeBacklen for
loop-based decoding, improving readability and branch efficiency.
**- NO CHANGE. REVERTED in 09f6680ba0d0b5acabca537c651008f0c8ec061b**
- 048bfe4eda +
03e8ff3af7 : reducing condition checks in
_addReplyToBuffer, inlining it, and avoid entering it when there are
there already entries in the reply list
and check if the reply length exceeds available buffer space before
calling _addReplyToBuffer
**- accumulated Improvement of 12.4%, from 168969.88 ops/sec to
189726.81 ops/sec**
- 9a63d4d6a9fa946505e31ecce4c7796845fc022c: always update the buf_peak
on _addReplyToBufferOrList
**- accumulated Improvement of 14.2%, from 168969.88 ops/sec to 193887
ops/sec**
- b544ade67628a1feaf714d6cfd114930e0c7670b: Introduce
lpEncodeBacklenBytes to avoid any indirect memory access on previous
usage of lpEncodeBacklen(NULL,...). inline lpEncodeBacklenBytes().
**- accumulated Improvement of 16.3%, from 168969.88 ops/sec to
197427.70 ops/sec**
---------
Co-authored-by: debing.sun <debing.sun@redis.com>
This pull request optimizes the `dictFind` function by adding software
prefetching and branch prediction hints to improve cache efficiency and
reduce memory latency.
It introduces 2 prefetch hints (read/write) that become no-ops if the
compiler does not support them.
Baseline profiling with Intel VTune indicated that dictFind was
significantly back-end bound, with memory latency accounting for 59.6%
of clockticks, with frequent stalls from DRAM-bound operations due to
cache misses during hash table lookups.

---------
Co-authored-by: Yuan Wang <wangyuancode@163.com>
In https://github.com/redis/redis/pull/13495, we introduced a feature to
reply -LOADING while flushing a large db on a replica.
While `_dictClear()` is in progress, it calls a callback for every 65k
items and we yield back to eventloop to reply -LOADING.
This change has made some tests unstable as those tests don't expect the
new -LOADING reply.
One observation: inside `_dictClear()`, we call the callback even if the
db has only a few keys. Most tests run with a small number of keys, so
each replication and cluster test would have to handle a potential
-LOADING reply now.
This PR changes this behavior and skips calling the callback when `i=0`
to stabilize replication tests.
The callback will be called after the first 65k items. Most tests use
fewer than 65k keys, so they won't get a -LOADING reply.
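The likely shape of the change inside `_dictClear()`'s loop (simplified):
```c
for (unsigned long i = 0; i < size && d->ht_used[htidx] > 0; i++) {
    /* Yield via the callback every 65536 buckets, but skip i == 0 so
     * small databases never observe the -LOADING yield. */
    if (callback && i != 0 && (i & 65535) == 0) callback(d);
    /* ... free the entries in bucket i ... */
}
```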
This PR introduces a new API function to the Redis Module API:
```
int RedisModule_ACLCheckKeyPrefixPermissions(RedisModuleUser *user, RedisModuleString *prefix, int flags);
```
Purpose:
The function checks whether a given user has access permissions to any
key that matches a specific prefix. This validation is based on the
user's ACL permissions and the specified flags.
Note, this prefix-based API may fail to detect prefixes that are
individually uncovered but collectively covered by a set of patterns.
For example, the prefix `ID-*` is not fully included in the pattern
`ID-[0]*`, nor in the pattern `ID-[^0]*`, but it is fully included in
the set of patterns `{ID-[0]*, ID-[^0]*}`.
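A hedged usage sketch; the flag constants follow the existing key-permission module API, and the return convention is an assumption:
```c
RedisModuleString *prefix = RedisModule_CreateString(ctx, "ID-", 3);
if (RedisModule_ACLCheckKeyPrefixPermissions(user, prefix,
        REDISMODULE_CMD_KEY_ACCESS | REDISMODULE_CMD_KEY_UPDATE) == REDISMODULE_OK) {
    /* The user may access and update every key beginning with "ID-". */
}
RedisModule_FreeString(ctx, prefix);
```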
1. Ubuntu Lunar reached End of Life on January 25, 2024, so upgrade the
Ubuntu version to plucky in the `test-ubuntu-jemalloc-fortify` action to
pass the daily CI.
2. The macOS-12 environment is deprecated, so upgrade macos-12 to
macos-13 in the daily CI.
---------
Co-authored-by: debing.sun <debing.sun@redis.com>
### Summary
By profiling a 1KiB, 100% GET use-case under high pipelining, we can see
that addReplyBulk and its inner calls take 8.30% of the CPU cycles. This
PR reduces the CPU time spent on addReplyBulk by 2.2% to 4% of total CPU
time. Specifically for GET use-cases, we saw an improvement of 2.7% to
9.1% in the achievable ops/sec.
### Improvement
By reducing the duplicate work we can improve by around 2.7% on
sds-encoded strings, and around 9% on int-encoded strings. This PR does
the following:
- Avoid a duplicate sdslen() call in addReplyBulk() for sds-encoded
objects
- Avoid a duplicate sdigits10() call for int-encoded objects in
addReplyBulk()
- Avoid the final "\r\n" addReplyProto() call for the OBJ_ENCODING_INT
type in addReplyBulk()
Altogether, these improvements result in the following gains in the
achievable ops/sec:

Encoding | unstable (commit 9906daf5c9) | this PR | % improvement
-- | -- | -- | --
1KiB Values string SDS encoded | 1478081.88 | 1517635.38 | 2.7%
Values string "1" OBJ_ENCODING_INT | 1521139.36 | 1658876.59 | 9.1%
### CPU Time: Total of addReplyBulk

Encoding | unstable (commit 9906daf5c9) | this PR | reduction of CPU Time: Total
-- | -- | -- | --
1KiB Values string SDS encoded | 8.30% | 6.10% | 2.2%
Values string "1" OBJ_ENCODING_INT | 7.20% | 3.20% | 4.0%
### To reproduce
Run redis with unix socket enabled
```
taskset -c 0 /root/redis/src/redis-server --unixsocket /tmp/1.socket --save '' --enable-debug-command local
```
#### 1KiB Values string SDS encoded
Load data
```
taskset -c 2-5 memtier_benchmark --ratio 1:0 -n allkeys --key-pattern P:P --key-maximum 1000000 --hide-histogram --pipeline 10 -S /tmp/1.socket
```
Benchmark
```
taskset -c 2-6 memtier_benchmark --ratio 0:1 -c 1 -t 5 --test-time 60 --hide-histogram -d 1000 --pipeline 500 -S /tmp/1.socket --key-maximum 1000000 --json-out-file results.json
```
#### Values string "1" OBJ_ENCODING_INT
Load data
```
$ taskset -c 2-5 memtier_benchmark --command "SET __key__ 1" -n allkeys --command-key-pattern P --key-maximum 1000000 --hide-histogram -c 1 -t 1 --pipeline 100 -S /tmp/1.socket
# confirm we have the expected reply and format
$ redis-cli get memtier-1
"1"
$ redis-cli debug object memtier-1
Value at:0x7f14cec57570 refcount:2147483647 encoding:int serializedlength:2 lru:2861503 lru_seconds_idle:8
```
Benchmark
```
taskset -c 2-6 memtier_benchmark --ratio 0:1 -c 1 -t 5 --test-time 60 --hide-histogram -d 1000 --pipeline 500 -S /tmp/1.socket --key-maximum 1000000 --json-out-file results.json
```
Starting from https://github.com/redis/redis/pull/13133, we allocate a
jemalloc thread cache and use it for lua vm.
On certain cases, like `script flush` or `function flush` command, we
free the existing thread cache and create a new one.
Though, for `function flush`, we were not actually destroying the
existing thread cache itself. Each call creates a new thread cache on
jemalloc, and we leak the previous thread cache instances. Jemalloc
allows a maximum of 4096 thread cache instances. If we reach this limit,
Redis prints the "Failed creating the lua jemalloc tcache" log and aborts.
There are other cases that can cause this memory leak, including
replication scenarios when emptyData() is called.
The implication is that it looks like redis `used_memory` is low, but
`allocator_allocated` and RSS remain high.
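For reference, a sketch of the destroy-then-create sequence using jemalloc's mallctl interface (the tcache variable name is illustrative):
```c
#include <jemalloc/jemalloc.h>

unsigned lua_tcache;                 /* id of the lua VM's thread cache */

/* On flush: destroy the existing tcache instead of leaking it... */
je_mallctl("tcache.destroy", NULL, NULL, &lua_tcache, sizeof(lua_tcache));

/* ...then create the replacement. */
size_t sz = sizeof(lua_tcache);
je_mallctl("tcache.create", &lua_tcache, &sz, NULL, 0);
```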
Co-authored-by: debing.sun <debing.sun@redis.com>
PR #10285 introduced support for modules to register four types of
configurations (Bool, Numeric, String, and Enum), accessible through the
Redis config file and the CONFIG command.
With this PR, it will be possible to register configuration parameters
without automatically prefixing the parameter names. This provides
greater flexibility in configuration naming, enabling, for instance,
either `bf-initial-size` or `initial-size` to be defined in the module
without the automatic `<MODULE-NAME>.` prefix. In addition, it will also
be possible to create a single additional alias via the same API. This
brings us another step closer to integrating modules into the Redis
core.
**Example:** Register a configuration parameter `bf-initial-size` with
an alias `initial-size` without the automatic module name prefix, set
with new `REDISMODULE_CONFIG_UNPREFIXED` flag:
```
RedisModule_RegisterBoolConfig(ctx, "bf-initial-size|initial-size", default_val, optflags | REDISMODULE_CONFIG_UNPREFIXED, getfn, setfn, applyfn, privdata);
```
# API changes
Related functions that now support unprefixed configuration flag
(`REDISMODULE_CONFIG_UNPREFIXED`) along with optional alias:
```
RedisModule_RegisterBoolConfig
RedisModule_RegisterEnumConfig
RedisModule_RegisterNumericConfig
RedisModule_RegisterStringConfig
```
# Implementation Details:
`config.c`: On load server configuration, at function
`loadServerConfigFromString()`, it collects all unknown configurations
into `module_configs_queue` dictionary. These may include valid module
configurations or invalid ones. They will be validated later by
`loadModuleConfigs()` against the configurations declared by the loaded
module(s).
`Module.c`: The `ModuleConfig` structure has been modified to now store:
(1) the full configuration name, (2) the alias, and (3) the unprefixed
flag status, ensuring that configurations retain their original
registration format when triggered in notifications.
Added error printout:
This change introduces an error printout for unresolved configurations,
detailing each unresolved parameter detected during startup. The last
line in the output existed prior to this change and has been retained
because systems rely on it:
```
595011:M 18 Nov 2024 08:26:23.616 # Unresolved Configuration(s) Detected:
595011:M 18 Nov 2024 08:26:23.616 # >>> 'bf-initiel-size 8'
595011:M 18 Nov 2024 08:26:23.616 # >>> 'search-sizex 32'
595011:M 18 Nov 2024 08:26:23.616 # Module Configuration detected without loadmodule directive or no ApplyConfig call: aborting
```
# Backward Compatibility:
Existing modules will function without modification, as the new
functionality only applies if REDISMODULE_CONFIG_UNPREFIXED is
explicitly set.
# Module vs. Core API Conflict Behavior
The new API allows modules to load duplicates of the same configuration
name or the same configuration alias, just like Redis core configuration
allows (i.e., the user sets two configs with different values, but these
two configs are actually the same one). Unlike Redis core, given a name
and its alias, it does not allow having both configurations on load.
Supporting that would require modifying the `module_configs_queue` DS to
reflect the order of loading and, later on during `loadModuleConfigs()`,
resolving pairs of names and aliases and determining which one applies
last. "Relaxing" this limitation can be deferred to a future update if
necessary, but for now, we error in this case.
To complement the work done in #13133, which added the script VMs'
memory to be counted as part of zmalloc: that memory should also be
counted as part of the non-value overhead.
This commit contains some refactoring to make variable names and
function names less confusing.
It also adds a new field named `script.VMs` to the `MEMORY STATS`
command.
Additionally, it clears scripts and stats between tests in external mode
(which is related to how this issue was discovered).
Fix for https://github.com/redis/redis/issues/13650:
providing an invalid config to a module with a datatype crashes when
Redis tries to unload the module due to the invalid config.
---------
Co-authored-by: debing.sun <debing.sun@redis.com>
Inspired by https://github.com/redis/redis/pull/12730
Before this PR, we allocated new memory to store the user command
arguments; however, if the current `c->argv` allocation is large enough
for the current command, we can reuse it instead of allocating new
memory.
We also free `c->argv` in the client cron when the client has been idle
for 2 seconds.
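A minimal sketch of the reuse check when the next command's argv is built (the surrounding parsing logic is omitted):
```c
/* Reuse c->argv if the previous allocation is large enough. */
if (c->argv == NULL || c->argv_len < (size_t)argc) {
    zfree(c->argv); /* zfree(NULL) is a no-op */
    c->argv = zmalloc(sizeof(robj *) * argc);
    c->argv_len = argc;
}
c->argc = argc; /* extra slots stay allocated for later commands */
```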
---------
Co-authored-by: Ozan Tezcan <ozantezcan@gmail.com>
Improve the performance of crc64 for large batches by processing a large
number of bytes in parallel and combining the results.
---------
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Josiah Carlson <josiah.carlson@gmail.com>
If `hide-user-data-from-log` config is enabled, we don't print client
argv in the crashlog to avoid leaking user info.
Though, debugging a crash becomes harder as we don't see the command
arguments causing the crash.
With this PR, we'll be printing command tokens to the log. As we have
command tokens defined in json schema for each command, using this data,
we can find tokens in the client argv.
e.g.
`SET key value GET EX 10` ---> we'll print `SET * * GET EX *` in the
log.
Modules should introduce their command structure via
`RM_SetCommandInfo()`.
Then, on a crash, we'll be able to know the module command tokens.
This PR replaces old .../topics/... links with current links,
specifically for the modules-api-ref.md file and the new automation that
Paolo Lazzari is working on. A few of the topics links have redirects,
but some don't. Best to use updated links.