This PR fixes
https://github.com/redis/redis/issues/14056#issuecomment-3026114590
## Summary
Because evport uses `eventLoop->events[fd].mask` to determine whether to
remove the event, but in ae.c we call `aeApiDelEvent()` before updating
`eventLoop->events[fd].mask`, this causes evport to always see the old
value, and as a result, `port_dissociate()` is never called to remove
the fd.
This issue may not surface easily in a non-multithreaded, but since in
the multi-threaded case we frequently reassign fds to different threads,
it makes the crash much more likely to occur.
## Description
`updateClientMemUsageAndBucket` is called from the main thread to update
memory usage and memory bucket of a client. That's why it has assertion
that it's being called by the main thread.
But it may also be called from a thread spawned by a module.
Specifically, when a module calls `RedisModule_Call` which in turn calls
`call`->`replicationFeedMonitors`->`updateClientMemUsageAndBucket`.
This is generally safe as module calls inside a spawned thread should be
guarded by a call to `ThreadSafeContextLock`, i.e the module is holding
the GIL at this point.
This commit fixes the assertion inside `updateClientMemUsageAndBucket`
so that it encompasses that case also. Generally calls from
module-spawned threads are safe to operate on clients that are not
running on IO-threads when the module is holding the GIL.
---------
Co-authored-by: Yuan Wang <wangyuancode@163.com>
Co-authored-by: debing.sun <debing.sun@redis.com>
Add CLUSTER SLOT-STATS command for key count, cpu time and network IO
per slot currently.
The command has the following syntax
CLUSTER SLOT-STATS SLOTSRANGE start-slot end-slot
or
CLUSTER SLOT-STATS ORDERBY metric [LIMIT limit] [ASC/DESC]
where metric can currently be one of the following
key-count -- Number of keys in a given slot
cpu-usec -- Amount of CPU time (in microseconds) spent on a given slot
network-bytes-in -- Amount of network ingress (in bytes) received for
given slot
network-bytes-out -- Amount of network egress (in bytes) sent out for
given slot
This PR is based on:
valkey-io/valkey#351valkey-io/valkey#709valkey-io/valkey#710valkey-io/valkey#720valkey-io/valkey#840
Co-authored-by: Kyle Kim <kimkyle@amazon.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Harkrishn Patro <harkrisp@amazon.com>
---------
Co-authored-by: Kyle Kim <kimkyle@amazon.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
## Summary and detailed design for new stream command
## XDELEX
### Syntax
```
XDELEX key [KEEPREF | DELREF | ACKED] IDS numids id [id ...]
```
### Description
The `XDELEX` command extends the Redis Streams `XDEL` command, offering
enhanced control over message entry deletion with respect to consumer
groups. It accepts optional `DELREF` or `ACKED` parameters to modify its
behavior:
- **KEEPREF:** Deletes the specified entries from the stream, but
preserves existing references to these entries in all consumer groups'
PEL. This behavior is similar to XDEL.
- **DELREF:** Deletes the specified entries from the stream and also
removes all references to these entries from all consumer groups'
pending entry lists, effectively cleaning up all traces of the messages.
- **ACKED:** Only trims entries that were read and acknowledged by all
consumer groups.
**Note:** The `IDS` block can appear at any position in the command,
consistent with other commands.
### Reply
Array reply, for each `id`:
- `-1`: No such `id` exists in the provided stream `key`.
- `1`: Entry was deleted from the stream.
- `2`: Entry was not deleted, but there are still dangling references.
(ACKED option)
## XACKDEL
### Syntax
```
XACKDEL key group [KEEPREF | DELREF | ACKED] IDS numids id [id ...]
```
### Description
The `XACKDEL` command combines `XACK` and `XDEL` functionalities in
Redis Streams. It acknowledges specified message IDs in the given
consumer group and attempts to delete corresponding stream entries. It
accepts optional `DELREF` or `ACKED` parameters:
- **KEEPREF:** Acknowledges the messages in the specified consumer group
and deletes the entries from the stream, but preserves existing
references to these entries in all consumer groups' PEL.
- **DELREF:** Acknowledges the messages in the specified consumer group,
deletes the entries from the stream, and also removes all references to
these entries from all consumer groups' pending entry lists, effectively
cleaning up all traces of the messages.
- **ACKED:** Acknowledges the messages in the specified consumer group
and only trims entries that were read and acknowledged by all consumer
groups.
### Reply
Array reply, for each `id`:
- `-1`: No such `id` exists in the provided stream `key`.
- `1`: Entry was acknowledged and deleted from the stream.
- `2`: Entry was acknowledged but not deleted, but there are still
dangling references. (ACKED option)
# Redis Streams Commands Extension
## XTRIM
### Syntax
```
XTRIM key <MAXLEN | MINID> [= | ~] threshold [LIMIT count] [KEEPREF | DELREF | ACKED]
```
### Description
The `XTRIM` command trims a stream by removing entries based on
specified criteria, extended to include optional `DELREF` or `ACKED`
parameters for consumer group handling:
- **KEEPREF:** Trims the stream according to the specified strategy
(MAXLEN or MINID) regardless of whether entries are referenced by any
consumer groups, but preserves existing references to these entries in
all consumer groups' PEL.
- **DELREF:** Trims the stream according to the specified strategy and
also removes all references to the trimmed entries from all consumer
groups' PEL.
- **ACKED:** Only trims entries that were read and acknowledged by all
consumer groups.
### Reply
No change.
## XADD
### Syntax
```
XADD key [NOMKSTREAM] [<MAXLEN | MINID> [= | ~] threshold [LIMIT count]] [KEEPREF | DELREF | ACKED] <* | id> field value [field value ...]
```
### Description
The `XADD` command appends a new entry to a stream and optionally trims
it in the same operation, extended to include optional `DELREF` or
`ACKED` parameters for trimming behavior:
- **KEEPREF:** When trimming, removes entries from the stream according
to the specified strategy (MAXLEN or MINID), regardless of whether they
are referenced by any consumer groups, but preserves existing references
to these entries in all consumer groups' PEL.
- **DELREF:** When trimming, removes entries from the stream according
to the specified strategy and also removes all references to these
entries from all consumer groups' PEL.
- **ACKED:** When trimming, only removes entries that were read and
acknowledged by all consumer groups. Note that if the number of
referenced entries is bigger than MAXLEN, we will still stop.
### Reply
No change.
## Key implementation
Since we currently have no simple way to track the association between
an entry and consumer groups without iterating over all groups, we
introduce two mechanisms to establish this link. This allows us to
determine whether an entry has been seen by all consumer groups, and to
identify which groups are referencing it. With this links, we can break
the association when the entry is either acknowledged or deleted.
1) Added reference tracking between stream messages and consumer groups
using `cgroups_ref`
The cgroups_ref is implemented as a rax that maps stream message IDs to
lists of consumer groups that reference those messages, and streamNACK
stores the corresponding nodes of this list, so that the corresponding
groups can be deleted during `ACK`.
In this way, we can determine whether an entry has been seen but not
ack.
2) Store a cache minimum last_id in the stream structure.
The reason for doing this is that there is a situation where an entry
has never been seen by the consume group. In this case, we think this
entry has not been consumed either. If there is an "ACKED" option, we
cannot directly delete this entry either.
When a consumer group updates its last_id, we don’t immediately update
the cached minimum last_id. Instead, we check whether the group’s
previous last_id was equal to the current minimum, or whether the new
last_id is smaller than the current minimum (when using `XGROUP SETID`).
If either is true, we mark the cached minimum last_id as invalid, and
defer the actual update until the next time it’s needed.
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: moticless <moticless@github.com>
Co-authored-by: Ozan Tezcan <ozantezcan@gmail.com>
Co-authored-by: Slavomir Kaslev <slavomir.kaslev@gmail.com>
Co-authored-by: Yuan Wang <yuan.wang@redis.com>
When the PEL is empty, the reply of `XPENDING` without `start` option
will be:
```
1) (integer) 0
2) (nil)
3) (nil)
4) (nil)
```
It is not an empty array, so we need to create an individual reply
schema for it.
In this PR we added hidden config - `lazyexpire-nested-arbitrary-keys`,
which can take:
* yes - the default. produce and propagate lazy-expire DELs as usual.
* no - avoid lazy-expire from commands that touch arbitrary keys (such
as SCAN, RANDOMKEY), if generated within a transactions (MULTI/EXEC,
LUA). This ensures such commands won't induce CROSSSLOT on remote proxy,
as happened in when replicating one db into another (possibly sharded
differently).
Since the issue is relevant only in replicated servers (RE's replica-of
mode or CRDT) - it was added to the core as a hidden config.
Please note that this config will always apply to read-only commands
(see EXPIRE_FORCE_DELETE_EXPIRED flag).
Since write commands may require key expiration to operate correctly.
---------
Co-authored-by: debing.sun <debing.sun@redis.com>
it aims to create listpacks of 500k, but did that with 5 insertions of
100k each, instead do that in one insertion, reducing the need for
listpack gradual growth, and reducing the number of commands we send.
apparently there are some stalls reading the replies of the commands,
specifically in GH actions, reducing the number of commands seems to
eliminate that.
The main thread needs to check clients in every cron iteration. During
this check, the corresponding I/O threads must not operate on these
clients to avoid data-race. As a result, the main thread is blocked
until the I/O threads finish processing and are suspended, allowing the
main thread to proceed with client checks.
Since the main thread's resources are more valuable than those of I/O
threads in Redis, this blocking behavior should be avoided. To address
this, the I/O threads check during their cron whether any of their
maintained clients need to be inspected by the main thread. If so, the
I/O threads send those clients to the main thread for processing, then
the main thread runs cron jobs for these clients.
In addition, an always-active client might not be in thread->clients, so
before processing the client’s command, we also check whether the client
has skipped running its cron job for over 1 second. If it has, we run
the cron job for the client.
The main thread does not need to actively pause the IO threads, thus
avoiding potential blocking behavior, fixes
https://github.com/redis/redis/issues/13885
Besides, this approach also can let all clients run cron task in a
second, but before, we pause IO threads in multiple batches when there
are more than 8 IO threads, that may cause some clients are not be
processed in a second.
---------
Co-authored-by: Yuan Wang <yuan.wang@redis.com>
In `sendBulkToSlave`, the `lseek()` call used to position the RDB file
descriptor before reading the next data chunk was not checked for
errors.
If the `lseek()` system call were to fail, the file descriptor would
remain at an incorrect position. The subsequent `read()` would then
fetch the wrong data, leading to a corrupted RDB stream being sent to
the replica. This could cause the replication to fail or result in data
inconsistency.
This patch introduces a check for the `lseek()` return value. On
failure, it logs a detailed warning and aborts the replication by
freeing the client, mirroring the existing error handling for `read()`
and `write()` calls within the same function. This improves the
robustness of the RDB transfer process.
---------
Co-authored-by: Yuan Wang <wangyuancode@163.com>
This PR introduces "IN" overloading for strings in Vector Sets VSIM
FILTER expressions.
Now it is possible to do something like:
"foo" IN "foobar"
IN continues to work as usually if the second operand is an array,
checking for membership of the left operand.
Ping @rowantrollope that requested this feature. I'm evaluating if to
add glob matching functionalities via the `=~` operator but I need to do
an optimization round in our glob matching function probably. Glob
matching can be slower, at the same time the complexity of the greedy
search in the graph remains unchanged, so it may be a good idea to have
it.
Case insensitive search will be likely not be added however, since this
would require handling unicode that is kinda outside the scope of Redis
filters. The user is still able to perform `"foo" in "foobar" || "FOO"
in "foobar"` at least.
- tests/unit/maxmemory.tcl
If multithreaded, we need to let IO threads have chance to reply output
buffer, to avoid next commands causing eviction. After eviction is
performed,
the next command becomes ready immediately in IO threads, and now we
enqueue
the client to be processed in main thread’s beforeSleep without
notification.
However, invalidation messages generated by eviction may not have been
fully
delivered by that time. As a result, executing the command in
beforeSleep of
the event loop (running eviction) can cause additional keys to be
evicted.
```
Expected '73' to be between to '200' and '300' (context: type source line 473 file
redis/tests/unit/maxmemory.tcl cmd {assert_range [r dbsize] 200 300} proc ::test)
```
the reason why CI doesn't find this issue is that we skill this test
`tsan:skip` as below
`start_server {tags {"maxmemory external:skip tsan:skip"}} `,so remove
this tag.
- tests/integration/aof.tcl
Because IO and the main thread are working in better parallelism without
notification,
the main thread may haven't write AOF buffer into file, but the IO
thread just writes
the reply, so the clients receive the reply before AOF file is changed.
We should use `appendfsync always` policy to make the command is written
into
AOF file when receiving reply.
```
Expected '0' to be equal to '54' (context: type source line 249 file
redis/tests/integration/aof.tcl cmd {assert_equal $before $after} proc ::test)
```
#13969 makes these scenarios easy to appear.
If replica detects broken connection while reading min expiration time
of hfe key, it calls exit().
Fixed it to handle the error gracefully without calling exit.
To reproduce the issue, the short-read test was modified to generate
many small hfe keys, increasing the likelihood of a connection drop
while reading min expiration time:
```tcl
for {set k 0} {$k < 50000} {incr k} {
for {set i 0} {$i < 1} {incr i} {
r hsetex "$k hfe_small" EX [expr {int(rand()*10)}] FIELDS 1 [string repeat A [expr {int(rand()*10)}]] 0[string repeat A [expr {int(rand()*10)}]]
}
}
```
We can't have the test use only hfe keys, so a few were added alongside
other data. I couldn't reproduce the issue this way but with the test's
randomization, it should hit this scenario in one of the runs.
Merge fork counters with https://github.com/redis/redis/pull/12957
repl_current_sync_attempts - Total number of attempts to connect to a
master since the last time we disconnected from a good connection (or a
configuration change). any number greater than 1 (even if the link is
currently up), indicates an issue.
repl_total_sync_attempts - Number of times in current configuration, the
replica attempted to sync to a master. (dosent reset on master
reconnect.)
repl_total_disconnect_time - Total cumulative time we've been
disconnected as a replica, visible when the link is up too.
master_link_up_since_seconds - Number of seconds since the link is down,
just maintain symmetry with master_link_down_since_seconds.
### Summary
This pull request improves the performance of quicklistCompare and
lpCompare by avoiding repeated calls to string2ll when comparing many
quicklist/listpack entries against the same string value. The
optimization targets use cases like LREM, LPOS, LINSERT, and ZRANK where
comparisons are made repeatedly in a loop.
By caching the result of string2ll during a single command execution, we
avoid re-parsing the same input string thousands of times—resulting in
up to **30% higher throughput and up to 25% lower p50 latency** in LREM
LINSERT benchmarks, and **5% higher throughput** in ZRANK (listpack)
command.
### Changes
- Updated quicklistCompare and lpCompare to accept two optional
parameters:
- `long long *cached_val`
- `int *cached_valid`
- If caching parameters are provided, string2ll is invoked only once and
its result is reused across comparisons.
- listTypeEqual was updated to forward these parameters.
- Commands such as LREM, LPOS, LINSERT, and ZRANK now use this
optimization.
- All internal tests and usage of quicklistCompare/lpCompare were
updated accordingly.
### Behavior
- If cached_valid is NULL, quicklistCompare/lpCompare behaves as before
(no caching).
- If cached_valid is non-NULL:
- 0 means uninitialized: string2ll is attempted.
- 1 means valid: cached_val is used.
- -1 means invalid: string2ll previously failed and is skipped.
---------
Co-authored-by: debing.sun <debing.sun@redis.com>
We need to check the command arity in IO threads, if it is not correct,
we should reset it, as we may do memory prefetching according to the
`iolookedcmd`. Accessing `argv` using the key positions returned by
`getKeysFromCommand` is unsafe and must be avoided for invalid commands.
This bug starts to have an impact after #14017
This PR optimizes scanGenericCommand by moving type filtering and
expiration checks from post-processing (Step 3) to the scan callback,
eliminating expensive `lookupKeyReadWithFlags()` calls and adding an
optimization to skip expiration checks via dict lookup given we can now
check the expiration with expiry flag in the kvobj (due to the move to
the scanCallback).
Profiling data (https://pprof.me/1ac456b6d1a46b2184a5e2ef314aa0a2)
showed that scanGenericCommand accounted for 2.8% of total CPU time and
was a notable hotspot during SCAN-heavy workloads.
## Key optimizations:
1. **Type filtering moved to scanCallback**: Uses existing `kvobj *kv`
instead of expensive lookups
2. **Expiration check optimization**: Skips `expireIfNeeded()` calls
when database has no volatile keys
3. **Better cache locality**: Processes filtering during iteration
rather than post-processing
This changes improve a bit the Vector Sets tests:
* DB9 is used instead of the target DB. After a successful test the DB
is left empty.
* If the replica is not available, the replication tests are skipped
without errors but just a warning.
* Other refactoring stuff.
This PR introduces the initial configuration infrastructure for
vector-sets, along with a new option:
`vset-force-single-threaded-execution`. When enabled, it applies the
`NOTHREAD` flag to VSIM and disables the `CAS` option for VADD, thereby
enforcing single-threaded execution.
Note: This mode is not optimized for single-threaded performance.
---------
Co-authored-by: GuyAv46 <47632673+GuyAv46@users.noreply.github.com>
Co-authored-by: debing.sun <debing.sun@redis.com>
Adds a software prefetch with a 2K stride to the scalar popcount loop in
redisPopcount().
Prefetching improved BITCOUNT throughput by up to 41.6%, reduced p50
latency by up to 43.9%, and significantly lowered L3 memory stalls,
confirming effective mitigation of memory-bound bottlenecks, with no
negative impact on L1/L2 usage or cache pollution (confirmed with HW
counters).
Note: The 2K stride was the best starting from 128,256,512,1024,2048,4096.
4K gave the same outcome so it's best to avoid larger strides without reason.
Currently, in the zmalloc and zfree family functions, we rely on
`je_malloc_usable_size()` to obtain the usable size of a pointer for
memory statistics or to return it to the caller. However, this function
is relatively expensive, as it involves locking and rbtree lookups
within jemalloc. Reducing the frequency of these calls can yield
significant performance improvements.
---------
Co-authored-by: oranagra <oran@redislabs.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Vector Sets deserialization was not designed to resist corrupted data,
assuming that a good checksum would mean everything is fine. However
Redis allows the user to specify extra protection via a specific
configuration option.
This commit makes the implementation more resistant, at the cost of some
slowdown. This also fixes a serialization bug that is unrelated (and has
no memory corruption effects) about the lack of the worst index /
distance serialization, that could lower the quality of a graph after
links are replaced. I'll address the serialization issues in a new PR
that will focus on that aspect alone (already work in progress).
The net result is that loading vector sets is, when the serialization of
worst index/distance is missing (always, for now) 100% slower, that is 2
times the loading time we had before. Instead when the info will be
added it will be just 10/15% slower, that is, just making the new sanity
checks.
It may be worth to export to modules if advanced sanity check if needed
or not. Anyway most of the slowdown in this patch comes from having to
recompute the worst neighbor, since duplicated and non reciprocal links
detection was heavy optimized with probabilistic algorithms.
---------
Co-authored-by: debing.sun <debing.sun@redis.com>
This bug was introduced by https://github.com/redis/redis/issues/13814
When defragmenting `db->expires`, if the process exits early and
`db->expires` was modified in the meantime (e.g., FLUSHDB), we need to
check whether the previously defragmented expires is still the same as
the current one when resuming. If they differ, we should abort the
current defragmentation of expires.
However, in https://github.com/redis/redis/issues/13814, I made a
mistake by using `db->keys` and `db->expires`, as expires will never be
defragged.
This PR is based on: https://github.com/valkey-io/valkey/pull/861
> ### Memory Access Amortization
> (Designed and implemented by [dan
touitou](https://github.com/touitou-dan))
>
> Memory Access Amortization (MAA) is a technique designed to optimize
the performance of dynamic data structures by reducing the impact of
memory access latency. It is applicable when multiple operations need to
be executed concurrently. The principle behind it is that for certain
dynamic data structures, executing operations in a batch is more
efficient than executing each one separately.
>
> Rather than executing operations sequentially, this approach
interleaves the execution of all operations. This is done in such a way
that whenever a memory access is required during an operation, the
program prefetches the necessary memory and transitions to another
operation. This ensures that when one operation is blocked awaiting
memory access, other memory accesses are executed in parallel, thereby
reducing the average access latency.
>
> We applied this method in the development of dictPrefetch, which takes
as parameters a vector of keys and dictionaries. It ensures that all
memory addresses required to execute dictionary operations for these
keys are loaded into the L1-L3 caches when executing commands.
Essentially, dictPrefetch is an interleaved execution of dictFind for
all the keys.
### Implementation of Redis
When the main thread processes clients with ready-to-execute commands
(i.e., clients for which the IO thread has parsed the commands), a batch
of up to 16 commands is created. Initially, the command's argv, which
were allocated by the IO thread, is prefetched to the main thread's L1
cache. Subsequently, all the dict entries and values required for the
commands are prefetched from the dictionary before the command
execution.
#### Memory prefetching for main hash table
As shown in the picture, after https://github.com/redis/redis/pull/13806
, we unify key value and the dict uses no_value optimization, so the
memory prefetching has 4 steps:
1. prefetch the bucket of the hash table
2. prefetch the entry associated with the given key's hash
3. prefetch the kv object of the entry
4. prefetch the value data of the kv object
we also need to handle the case that the dict entry is the pointer of kv
object, just skip step 3.
MAA can improves single-threaded memory access efficiency by
interleaving the execution of multiple independent operations, allowing
memory-level parallelism and better CPU utilization. Its key point is
batch-wise interleaved execution. Split a batch of independent
operations (such as multiple key lookups) into multiple state machines,
and interleave their progress within a single thread to hide the memory
access latency of individual requests.
The difference between serial execution and interleaved execution:
**naive serial execution**
```
key1: step1 → wait → step2 → wait → done
key2: step1 → wait → step2 → wait → done
```
**interleaved execution**
```
key1: step1 → step2 → done
key2: step1 → step2 → done
key3: step1 → step2 → done
↑ While waiting for key1’s memory, progress key2/key3
```
#### New configuration
This PR involves a new configuration `prefetch-batch-max-size`, but we
think it is a low level optimization, so we hide this config:
When multiple commands are parsed by the I/O threads and ready for
execution, we take advantage of knowing the next set of commands and
prefetch their required dictionary entries in a batch. This reduces
memory access costs. The optimal batch size depends on the specific
workflow of the user. The default batch size is 16, which can be
modified using the 'prefetch-batch-max-size' config.
When the config is set to 0, prefetching is disabled.
---------
Co-authored-by: Uri Yagelnik <uriy@amazon.com>
Co-authored-by: Ozan Tezcan <ozantezcan@gmail.com>
Add thread sanitizer run to daily CI.
Few tests are skipped in tsan runs for two reasons:
* Stack trace producing tests (oom, `unit/moduleapi/crash`, etc) are
tagged `tsan:skip` because redis calls `backtrace()` in signal handler
which turns out to be signal-unsafe since it might allocate memory (e.g.
glibc 2.39 does it through a call to `_dl_map_object_deps()`).
* Few tests become flaky with thread sanitizer builds and don't finish
in expected deadlines because of the additional tsan overhead. Instead
of skipping those tests, this can improved in the future by allowing
more iterations when waiting for tsan builds.
Deadlock detection is disabled for now because of tsan limitation where
max 64 locks can be taken at once.
There is one outstanding (false-positive?) race in jemalloc which is
suppressed in `tsan.sup`.
Fix few races thread sanitizer reported having to do with writes from
signal handlers. Since in multi-threaded setting signal handlers might
be called on any thread (modulo pthread_sigmask) while the main thread
is running, `volatile sig_atomic_t` type is not sufficient and atomics
are used instead.
When `repl-diskless-load` is enabled on a replica, and it is in the
process of loading an RDB file, a broken connection detected by the main
channel may trigger a call to rioAbort(). This sets a flag to cause the
rdb channel to fail on the next rioRead() call, allowing it to perform
necessary cleanup.
However, there are specific scenarios where the error is checked using
rioGetReadError(), which does not account for the RIO_ABORT flag (see
[source](79b37ff535/src/rdb.c (L3098))).
As a result, the error goes undetected. The code then proceeds to
validate a module type, fails to find a match, and calls
rdbReportCorruptRDB() which logs the following error and exits the
process:
```
The RDB file contains module data I can't load: no matching module type '_________'
```
To fix this issue, the RIO_ABORT flag has been removed. Now, rioAbort()
sets both read and write error flags, so that subsequent operations and
error checks properly detect the failure.
Additional keys were added to the short read test. It reproduces the
issue with this change. We hit that problematic line once per key. My
guess is that with many smaller keys, the likelihood of the connection
being killed at just the right moment increases.
Compiled Redis with COVERAGE_TEST, while using the fork API encountered
the following issue:
- Forked process calls `RedisModule_ExitFromChild` - child process
starts to report its COW while performing IO operations
- Parent process terminates child process with
`RedisModule_KillForkChild`
- Child process signal handler gets called while an IO operation is
called
- exit() is called because COVERAGE_TEST was on during compilation.
- exit() tries to perform more IO operations in its exit handlers.
- process gets deadlocked
Backtrace snippet:
```
#0 futex_wait (private=0, expected=2, futex_word=0x7e1220000c50) at ../sysdeps/nptl/futex-internal.h:146
#1 __GI___lll_lock_wait_private (futex=0x7e1220000c50) at ./nptl/lowlevellock.c:34
#2 0x00007e1234696429 in __GI__IO_flush_all () at ./libio/genops.c:698
#3 0x00007e123469680d in _IO_cleanup () at ./libio/genops.c:843
#4 0x00007e1234647b74 in __run_exit_handlers (status=status@entry=255, listp=<optimized out>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at ./stdlib/exit.c:129
#5 0x00007e1234647bbe in __GI_exit (status=status@entry=255) at ./stdlib/exit.c:138
#6 0x00005ef753264e13 in exitFromChild (retcode=255) at /home/jonathan/CLionProjects/redis/src/server.c:263
#7 sigKillChildHandler (sig=<optimized out>) at /home/jonathan/CLionProjects/redis/src/server.c:6794
#8 <signal handler called>
#9 0x00007e1234685b94 in _IO_fgets (buf=buf@entry=0x7e122dafdd90 "KSM:", ' ' <repeats 19 times>, "0 kB\n", n=n@entry=1024, fp=fp@entry=0x7e1220000b70) at ./libio/iofgets.c:47
#10 0x00005ef75326c5e0 in fgets (__stream=<optimized out>, __n=<optimized out>, __s=<optimized out>, __s=<optimized out>, __n=<optimized out>, __stream=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/stdio2.h:200
#11 zmalloc_get_smap_bytes_by_field (field=0x5ef7534c42fd "Private_Dirty:", pid=<optimized out>) at /home/jonathan/CLionProjects/redis/src/zmalloc.c:928
#12 0x00005ef75338ab1f in zmalloc_get_private_dirty (pid=-1) at /home/jonathan/CLionProjects/redis/src/zmalloc.c:978
#13 sendChildInfoGeneric (info_type=CHILD_INFO_TYPE_MODULE_COW_SIZE, keys=0, progress=-1, pname=0x5ef7534c95b2 "Module fork") at /home/jonathan/CLionProjects/redis/src/childinfo.c:71
#14 0x00005ef75337962c in sendChildCowInfo (pname=0x5ef7534c95b2 "Module fork", info_type=CHILD_INFO_TYPE_MODULE_COW_SIZE) at /home/jonathan/CLionProjects/redis/src/server.c:6895
#15 RM_ExitFromChild (retcode=0) at /home/jonathan/CLionProjects/redis/src/module.c:11468
```
Change is to make the exit() _exit() calls conditional based on a
parameter to exitFromChild function.
The signal handler should exit without io operations since it doesn't
know its history.(If we were in the middle of IO operations before it
was called)
---------
Co-authored-by: Yuan Wang <wangyuancode@163.com>
restoreCommand() creates a key-value object (kv) with a TTL in two steps.
During the second step, setExpire() may reallocate the kv object. To ensure
correct behavior, kv must be updated after this call, as it might be used later
in the function.
Hi, as described, this implements WITHATTRIBS, a feature requested by a
few users, and indeed needed.
This was requested the first time by @rowantrollope but I was not sure
how to make it work with RESP2 and RESP3 in a clean way, hopefully
that's it.
The patch includes tests and documentation updates.