Hi all, this PR fixes two things:
1. An assertion, that prevented the RDB loading from recovery if there
was a quantization type mismatch (with regression test).
2. Two code paths that just returned NULL without proper cleanup during
RDB loading.
This PR adds support for REDISMODULE_OPTIONS_HANDLE_IO_ERRORS.
and tests for short read and corrupted RESTORE payload.
Please: note that I also removed the comment about async loading support
since we should be already covered. No manipulation of global data
structures in Vector Sets, if not for the unique ID used to create new
vector sets with different IDs.
This PR fixes an issue in the CI test for client-output-buffer-limit,
which was causing an infinite loop when running on macOS 15.4.
### Problem
This test start two clients, R and R1:
```c
R1 subscribe foo
R publish foo bar
```
When R executes `PUBLISH foo bar`, the server first stores the message
`bar` in R1‘s buf. Only when the space in buf is insufficient does it
call `_addReplyProtoToList`.
Inside this function, `closeClientOnOutputBufferLimitReached` is invoked
to check whether the client’s R1 output buffer has reached its
configured limit.
On macOS 15.4, because the server writes to the client at a high speed,
R1’s buf never gets full. As a result,
`closeClientOnOutputBufferLimitReached` in the test is never triggered,
causing the test to never exit and fall into an infinite loop.
---------
Co-authored-by: debing.sun <debing.sun@redis.com>
CI / build-libc-malloc (push) Failing after 31sDetails
CI / build-old-chain-jemalloc (push) Failing after 31sDetails
Codecov / code-coverage (push) Failing after 31sDetails
Spellcheck / Spellcheck (push) Failing after 30sDetails
CI / build-debian-old (push) Failing after 43sDetails
CI / build-centos-jemalloc (push) Failing after 1m31sDetails
CI / build-macos-latest (push) Has been cancelledDetails
Coverity Scan / coverity (push) Has been skippedDetails
External Server Tests / test-external-standalone (push) Failing after 1m52sDetails
External Server Tests / test-external-nodebug (push) Failing after 1m58sDetails
External Server Tests / test-external-cluster (push) Failing after 3m3sDetails
Since after https://github.com/redis/redis/pull/13695,
`io-threads-do-reads` config is deprecated, we should remove it from
normal config list and only keep it in deprecated config list, but we
forgot to do this, this PR fixes this.
thanks @YaacovHazan for reporting this
### Problem
A previous PR (https://github.com/redis/redis/pull/13932) fixed the TCP
port issue in CLUSTER SLOTS, but it seems the handling of the TLS port
was overlooked.
There is this comment in the `addNodeToNodeReply` function in the
`cluster.c` file:
```c
/* Report TLS ports to TLS client, and report non-TLS port to non-TLS client. */
addReplyLongLong(c, clusterNodeClientPort(node, shouldReturnTlsInfo()));
addReplyBulkCBuffer(c, clusterNodeGetName(node), CLUSTER_NAMELEN);
```
### Fixed
This PR fixes the TLS port issue and adds relevant tests.
CI / build-old-chain-jemalloc (push) Failing after 31sDetails
Codecov / code-coverage (push) Failing after 32sDetails
CI / test-sanitizer-address (push) Failing after 1m22sDetails
Spellcheck / Spellcheck (push) Failing after 31sDetails
CI / build-macos-latest (push) Has been cancelledDetails
Coverity Scan / coverity (push) Has been skippedDetails
External Server Tests / test-external-standalone (push) Failing after 31sDetails
External Server Tests / test-external-cluster (push) Failing after 31sDetails
External Server Tests / test-external-nodebug (push) Failing after 31sDetails
This PR fix the lag calculation by ensuring that when consumer group's last_id
is behind the first entry, the consumer group's entries read is considered
invalid and recalculated from the start of the stream
Supplement to PR #13473Close#13957
Signed-off-by: Ernesto Alejandro Santana Hidalgo <ernesto.alejandrosantana@gmail.com>
Close https://github.com/redis/redis/issues/13892
config set port cmd updates server.port. cluster slot retrieves
information about cluster slots and their associated nodes. the fix
updates this info when config set port cmd is done, so cluster slots cmd
returns the right value.
from the master's perspective, the replica can become online before it's
actually done loading the rdb file.
this was always like that, in disk-based repl, and thus ok with diskless
and rdb channel.
in this test, because all the keys are added before the backlog is
created, the replication offset is 0, so the test proceeds and could get
a LOADING error when trying to run the function.
If HGETEX command deletes the only field due to lazy expiry, Redis
currently sends `del` KSN (Keyspace Notification) first, followed by
`hexpired` KSN. The order should be reversed, `hexpired` should be sent
first and `del` later.
Additonal changes: More test coverage for HGETDEL KSN
---------
Co-authored-by: hristosko <hristosko.chaushev@redis.com>
This test was introduced by https://github.com/redis/redis/issues/13853
We determine if the client is in blocked status, but if async flushdb is
completed before checking the blocked status, the test will fail.
So modify the test to only determine if `lazyfree_pending_objects` is
correct to ensure that flushdb is async, that is, the client must be
blocked.
CI / build-libc-malloc (push) Failing after 31sDetails
CI / build-centos-jemalloc (push) Failing after 31sDetails
CI / build-old-chain-jemalloc (push) Failing after 32sDetails
Codecov / code-coverage (push) Failing after 31sDetails
Spellcheck / Spellcheck (push) Failing after 31sDetails
Coverity Scan / coverity (push) Has been skippedDetails
External Server Tests / test-external-standalone (push) Failing after 32sDetails
External Server Tests / test-external-cluster (push) Failing after 32sDetails
External Server Tests / test-external-nodebug (push) Failing after 1m56sDetails
CI / build-macos-latest (push) Has been cancelledDetails
Test fails time to time:
```
*** [err]: Slave is able to detect timeout during handshake in tests/integration/replication.tcl
Replica is not able to detect timeout
```
Depending on the timing, "debug sleep" may occur during rdbchannel
handshake and required log statement won't be printed to the log in that
case. Better to wait after rdbchannel handshake.
With RDB channel replication, we introduced parallel replication stream
and RDB delivery to the replica during a full sync. Currently, after the
replica loads the RDB and begins streaming the accumulated buffer to the
database, it does not read from the master connection during this
period. Although streaming the local buffer is generally a fast
operation, it can take some time if the buffer is large. This PR
introduces buffering during the streaming of the local buffer. One
important consideration is ensuring that we consume more than we read
during this operation; otherwise, it could take indefinitely. To
guarantee that it will eventually complete, we limit the read to at most
half of what we consume, e.g. read at most 1 mb once we consume at least
2 mb.
**Additional changes**
**Bug fix**
- Currently, when replica starts draining accumulated buffer, we call
protectClient() for the master client as we occasionally yield back to
event loop via processEventsWhileBlocked(). So, it prevents freeing the
master client. While we are in this loop, if replica receives "replicaof
newmaster" command, we call replicaSetMaster() which expects to free the
master client and trigger a new connection attempt. As the client object
is protected, its destruction will happen asynchronously. Though, a new
connection attempt to new master will be made immediately. Later, when
the replication buffer is drained, we realize master client was marked
as CLOSE_ASAP, and freeing master client triggers another connection
attempt to the new master. In most cases, we realize something is wrong
in the replication state machine and abort the second attempt later. So,
the bug may go undetected. Fix is not calling protectClient() for the
master client. Instead, trying to detect if master client is
disconnected during processEventsWhileBlocked() and if so, breaking the
loop immediately.
**Related improvement:**
- Currently, the replication buffer is a linked list of buffers, each of
which is 1 MB in size. While consuming the buffer, we process one buffer
at a time and check if we need to yield back to
`processEventsWhileBlocked()`. However, if
`loading-process-events-interval-bytes` is set to less than 1 MB, this
approach doesn't handle it well. To improve this, I've modified the code
to process 16KB at a time and check
`loading-process-events-interval-bytes` more frequently. This way,
depending on the configuration, we may yield back to networking more
often.
- In replication.c, `disklessLoadingRio` will be set before a call to
`emptyData()`. This change should not introduce any behavioral change
but it is logically more correct as emptyData() may yield to networking
and we may need to call rioAbort() on disklessLoadingRio. Otherwise,
failure of main channel may go undetected until a failure on rdb channel
on a corner case.
**Config changes**
- The default value for the `loading-process-events-interval-bytes`
configuration is being lowered from 2MB to 512KB. This configuration
primarily used for testing and controls the frequency of networking
during the loading phase, specifically when loading the RDB or applying
accumulated buffers during a full sync on the replica side.
Before the introduction of RDB channel replication, the 2MB value was
sufficient for occasionally yielding to networking, mainly to reply
-loading to the clients. However, with RDB channel replication, during a
full sync on the replica side (either while loading the RDB or applying
the accumulated buffer), we need to yield back to networking more
frequently to continue accumulating the replication stream. If this
doesn’t happen often enough, the replication stream can accumulate on
the master side, which is undesirable.
To address this, we’ve decided to lower the default value to 512KB. One
concern with frequent yielding to networking is the potential
performance impact, as each call to processEventsWhileBlocked() involves
4 syscalls, which could slow down the RDB loading phase. However,
benchmarking with various configuration values has shown that using
512KB or higher does not negatively impact RDB loading performance.
Based on these results, 512KB is now selected as the default value.
**Test changes**
- Added improved version of a replication test which checks memory usage
on master during full sync.
---------
Co-authored-by: Oran Agra <oran@redislabs.com>
Before https://github.com/redis/redis/pull/13732, replicas were brought
online immediately after master wrote the last bytes of the RDB file to
the socket. This behavior remains unchanged if rdbchannel replication is
not used. However, with rdbchannel replication, the replica is brought
online after receiving the first ack which is sent by replica after rdb
is loaded.
To align the behavior, reverting this change to put replica online once
bgsave is done.
Additonal changes:
- INFO field `mem_total_replication_buffers` will also contain
`server.repl_full_sync_buffer.mem_used` which shows accumulated
replication stream during rdbchannel replication on replica side.
- Deleted debug level logging from some replication tests. These tests
generate thousands of keys and it may cause per key logging on some
cases.
CI / build-libc-malloc (push) Failing after 32sDetails
CI / build-centos-jemalloc (push) Failing after 32sDetails
CI / build-debian-old (push) Failing after 48sDetails
CI / build-old-chain-jemalloc (push) Failing after 31sDetails
Codecov / code-coverage (push) Failing after 31sDetails
Spellcheck / Spellcheck (push) Failing after 31sDetails
CodeQL / Analyze (cpp) (push) Failing after 31sDetails
CI / build-macos-latest (push) Has been cancelledDetails
Coverity Scan / coverity (push) Has been skippedDetails
External Server Tests / test-external-standalone (push) Failing after 31sDetails
External Server Tests / test-external-cluster (push) Failing after 31sDetails
External Server Tests / test-external-nodebug (push) Failing after 31sDetails
When the `restore foo 0 $encoded freq 100` command and `set freq [r
object freq foo]` run in different minute timestamps (i.e., when
server.unixtime/60 changes between these operations), the assertion may
fail due to the LFU decay.
This PR updates the “RESTORE can set LFU” test to verify the actual freq
value based on minute timestamps.
---------
Co-authored-by: debing.sun <debing.sun@redis.com>
CI / build-old-chain-jemalloc (push) Failing after 32sDetails
Codecov / code-coverage (push) Failing after 31sDetails
External Server Tests / test-external-standalone (push) Failing after 32sDetails
External Server Tests / test-external-cluster (push) Failing after 32sDetails
External Server Tests / test-external-nodebug (push) Failing after 32sDetails
CI / test-ubuntu-latest (push) Failing after 1m37sDetails
Spellcheck / Spellcheck (push) Failing after 32sDetails
When the diskless load configuration is set to on-empty-db, we retain a
pointer to the function library context. When emptyData() is called, it
frees this function library context pointer, leading to a use-after-free
situation.
I refactored code to ensure that emptyData() is called first, followed
by retrieving the valid pointer to the function library context.
Refactored code should not introduce any runtime implications.
Bug introduced by https://github.com/redis/redis/pull/13495 (Redis 8.0)
Co-authored-by: Oran Agra <oran@redislabs.com>
Codecov / code-coverage (push) Failing after 8sDetails
CI / build-libc-malloc (push) Successful in 50sDetails
CI / test-ubuntu-latest (push) Failing after 2m9sDetails
CI / test-sanitizer-address (push) Failing after 2m40sDetails
Spellcheck / Spellcheck (push) Successful in 9m2sDetails
External Server Tests / test-external-standalone (push) Failing after 32sDetails
External Server Tests / test-external-cluster (push) Failing after 32sDetails
Coverity Scan / coverity (push) Has been skippedDetails
External Server Tests / test-external-nodebug (push) Failing after 31sDetails
This fixes an error that occurs in the job
[test-valgrind-no-malloc-usable-size-test](https://github.com/redis/redis/actions/runs/13912357739/job/38929051397)
of the Daily workflow:
```
*** [err]: HEXPIREAT - Set time and then get TTL (listpackex) in tests/unit/type/hash-field-expire.tcl
Expected '999' to be between to '1000' and '2000' (context: type eval line 6 cmd {assert_range [r hpttl myhash FIELDS 1 field1] 1000 2000} proc ::test)
```
CI / build-libc-malloc (push) Successful in 53sDetails
CI / test-sanitizer-address (push) Failing after 1m6sDetails
CI / test-ubuntu-latest (push) Failing after 2m57sDetails
Spellcheck / Spellcheck (push) Successful in 9m5sDetails
Coverity Scan / coverity (push) Has been skippedDetails
External Server Tests / test-external-cluster (push) Failing after 31sDetails
External Server Tests / test-external-standalone (push) Failing after 6m35sDetails
External Server Tests / test-external-nodebug (push) Failing after 15m1sDetails
CI / build-macos-latest (push) Has been cancelledDetails
in #13505, we changed the code to use the string value of the key rather
than the integer value on the stack, but we have a test in
unit/moduleapi/keyspace_events that uses keyspace notification hook to
modify the value with RM_StringDMA, which can cause this value to be
released before used. the reason it didn't happen so far is because we
were using shared integers, so releasing the object doesn't free it.
CI / test-ubuntu-latest (push) Failing after 31sDetails
CI / build-debian-old (push) Failing after 32sDetails
CI / build-libc-malloc (push) Failing after 31sDetails
CI / build-centos-jemalloc (push) Failing after 31sDetails
CI / build-old-chain-jemalloc (push) Failing after 32sDetails
Codecov / code-coverage (push) Failing after 32sDetails
Spellcheck / Spellcheck (push) Failing after 32sDetails
CI / test-sanitizer-address (push) Failing after 4m35sDetails
CI / build-32bit (push) Failing after 5m35sDetails
CI / build-macos-latest (push) Has been cancelledDetails
CodeQL / Analyze (cpp) (push) Failing after 32sDetails
Coverity Scan / coverity (push) Has been skippedDetails
External Server Tests / test-external-standalone (push) Failing after 31sDetails
External Server Tests / test-external-cluster (push) Failing after 31sDetails
External Server Tests / test-external-nodebug (push) Failing after 6m47sDetails
The bug mentioned in this
[#13862](https://github.com/redis/redis/issues/13862) has been fixed.
---------
Signed-off-by: li-benson <1260437731@qq.com>
Signed-off-by: youngmore1024 <youngmore1024@outlook.com>
Co-authored-by: youngmore1024 <youngmore1024@outlook.com>
After #13840, the data we populate becomes more complex and slower, we
always wait for a defragmentation cycle to end before verifying that the
test is okay.
However, in some slow environments, an entire defragmentation cycle can
exceed 5 seconds, and in my local test using 'taskset -c 0' it can reach
6 seconds, so increase the threshold to avoid test failures.
CI / build-libc-malloc (push) Failing after 31sDetails
CI / build-debian-old (push) Failing after 1m32sDetails
CI / build-old-chain-jemalloc (push) Failing after 31sDetails
Codecov / code-coverage (push) Failing after 31sDetails
CI / test-ubuntu-latest (push) Failing after 3m21sDetails
Spellcheck / Spellcheck (push) Failing after 31sDetails
CI / test-sanitizer-address (push) Failing after 6m36sDetails
CI / build-centos-jemalloc (push) Failing after 6m36sDetails
External Server Tests / test-external-standalone (push) Failing after 2m10sDetails
Coverity Scan / coverity (push) Has been skippedDetails
External Server Tests / test-external-nodebug (push) Failing after 2m12sDetails
External Server Tests / test-external-cluster (push) Failing after 2m16sDetails
### Background
The program runs normally in standalone mode, but migrating to cluster
mode may cause errors, this is because some cross slot commands can not
run in cluster mode. We should provide an approach to detect this issue
when running in standalone mode, and need to expose a metric which
indicates the usage of no incompatible commands.
### Solution
To avoid perf impact, we introduce a new config
`cluster-compatibility-sample-ratio` which define the sampling ratio
(0-100) for checking command compatibility in cluster mode. When a
command is executed, it is sampled at the specified ratio to determine
if it complies with Redis cluster constraints, such as cross-slot
restrictions.
A new metric is exposed: `cluster_incompatible_ops` in `info stats`
output.
The following operations will be considered incompatible operations.
- cross-slot command
If a command has multiple cross slot keys, it is incompatible
- `swap, copy, move, select` command
These commands involve multi databases in some cases, we don't allow
multiple DB in cluster mode, so there are not compatible
- Module command with `no-cluster` flag
If a module command has `no-cluster` flag, we will encounter an error
when loading module, leading to fail to load module if cluster is
enabled, so this is incompatible.
- Script/function with `no-cluster` flag
Similar with module command, if we declare `no-cluster` in shebang of
script/function, we also can not run it in cluster mode
- `sort` command by/get pattern
When `sort` command has `by/get` pattern option, we must ask that the
pattern slot is equal with the slot of keys, otherwise it is
incompatible in cluster mode.
- The script/function command accesses the keys and declared keys have
different slots
For the script/function command, we not only check the slot of declared
keys, but only check the slot the accessing keys, if they are different,
we think it is incompatible.
**Besides**, commands like `keys, scan, flushall, script/function
flush`, that in standalone mode iterate over all data to perform the
operation, are only valid for the server that executes the command in
cluster mode and are not broadcasted. However, this does not lead to
errors, so we do not consider them as incompatible commands.
### Performance impact test
**cross slot test**
Below are the test commands and results. When using MSET with 8 keys,
performance drops by approximately 3%.
**single key test**
It may be due to the overhead of the sampling function, and single-key
commands could cause a 1-2% performance drop.
CI / build-macos-latest (push) Waiting to runDetails
CI / build-debian-old (push) Failing after 6sDetails
CI / build-centos-jemalloc (push) Failing after 5sDetails
CI / build-old-chain-jemalloc (push) Failing after 3sDetails
Codecov / code-coverage (push) Failing after 7sDetails
CI / build-libc-malloc (push) Successful in 56sDetails
CI / test-sanitizer-address (push) Failing after 1m8sDetails
CI / test-ubuntu-latest (push) Failing after 2m13sDetails
CI / build-32bit (push) Failing after 3m28sDetails
Coverity Scan / coverity (push) Has been skippedDetails
External Server Tests / test-external-nodebug (push) Failing after 1m48sDetails
External Server Tests / test-external-standalone (push) Failing after 2m9sDetails
External Server Tests / test-external-cluster (push) Failing after 2m14sDetails
Spellcheck / Spellcheck (push) Successful in 9m3sDetails
Since https://github.com/redis/redis/pull/11884, what was previously
accepted as a valid input (hexadecimal string) before 8.0 returned an
error. This PR addresses it. To avoid performance penalties if hints the
compiler that the fallbacks are not likely to happen.
Furthermore, we were ignoring std::result_out_of_range outputs from
fast_float. This PR addresses it as well and includes tests for both
identified scenarios.
---------
Co-authored-by: debing.sun <debing.sun@redis.com>
CI / build-debian-old (push) Failing after 7sDetails
CI / build-centos-jemalloc (push) Failing after 6sDetails
CI / build-old-chain-jemalloc (push) Failing after 4sDetails
Codecov / code-coverage (push) Failing after 7sDetails
CI / build-libc-malloc (push) Successful in 53sDetails
CI / test-sanitizer-address (push) Failing after 1m4sDetails
CI / test-ubuntu-latest (push) Failing after 2m9sDetails
CI / build-32bit (push) Failing after 9m50sDetails
Spellcheck / Spellcheck (push) Successful in 9m0sDetails
Coverity Scan / coverity (push) Has been skippedDetails
External Server Tests / test-external-standalone (push) Failing after 31sDetails
External Server Tests / test-external-cluster (push) Failing after 6m36sDetails
External Server Tests / test-external-nodebug (push) Failing after 9m54sDetails
CI / build-macos-latest (push) Has been cancelledDetails
After https://github.com/redis/redis/pull/13167, when a client calls
`FLUSHDB` command, we still async empty database, and the client was
blocked until the lazyfree completes.
1) If another client calls `SLAVEOF` command during this time, the
server will unblock all blocked clients, including those blocked by the
lazyfree. However, when unblocking a lazyfree blocked client, we forgot
to call `updateStatsOnUnblock()`, which ultimately triggered the
following assertion.
2) If a client blocked by Lazyfree is unblocked midway, and at this
point the `bio_comp_list` has already received the completion
notification for the bio, we might end up processing a client that has
already been unblocked in `flushallSyncBgDone()`. Therefore, we need to
filter it out.
---------
Co-authored-by: oranagra <oran@redislabs.com>
After https://github.com/redis/redis/pull/13816, we make a new API to
defrag RedisModuleDict.
Currently, we only support incremental defragmentation of the dictionary
itself, but the defragmentation of values is still not incremental. If
the values are very large, it could lead to significant blocking.
Therefore, in this PR, we have added incremental defragmentation for the
values.
The main change is to the `RedisModuleDefragDictValueCallback`, we
modified the return value of this callback.
When the callback returns 1, we will save the `seekTo` as the key of the
current unfinished node, and the next time we enter, we will continue
defragmenting this node.
When the return value is 0, we will proceed to the next node.
## Test
Since each dictionary in the global dict originally contained only 10
strings, but now it has been changed to a nested dictionary, each
dictionary now has 10 sub-dictionaries, with each sub-dictionary
containing 10 strings, this has led to a corresponding reduction in the
defragmentation time obtained from other tests.
Therefore, the other tests have been modified to always wait for
defragmentation to be turned off before the test begins, then start it
after creating fragmentation, ensuring that they can always run for a
full defragmentation cycle.
---------
Co-authored-by: ephraimfeldblum <ephraim.feldblum@redis.com>
1) Enable the callback to be NULL for RM_DefragRedisModuleDict()
Because the dictionary may store only the key without the value.
2) Reduce the system calls of RM_DefragShouldStop()
The API checks the following thresholds before performing a time check:
over 512 defrag hits, or over 1024 defrag misses, and performs the time
judgment if any of these thresholds are reached.
3) Added defragmentation statistics for dictionary items to cover the
associated code for RM_DefragRedisModuleDict().
4) Removed `module_ctx` from `defragModuleCtx` struct, which can be
replaced by a temporary variable.
---------
Co-authored-by: oranagra <oran@redislabs.com>
Recently encountered some errors as bellow,
HGETEX/HSETEX with PXAT/EXAT options, after getting ttl, we calculate
current time by `[clock seconds]` that may have a delay that causes
results greater than expected.
Dismiss memory test error, now we introduced rdb-channel replication,
the full synchronization might finish before the child process exits. So
we may fail if calling `bgsave` immediately after full sync.
Close#13628
This PR changes behavior of special `+` id of XREAD command. Now it uses
`streamLastValidID` to find last entry instead of `last_id` field of
stream object.
This PR adds test for the issue.
**Notes**
Initial idea to update `last_id` while executing XDEL seems to be wrong.
`last_id` is used to strore last generated id and not id of last entry.
---------
Co-authored-by: debing.sun <debing.sun@redis.com>
Co-authored-by: guybe7 <guy.benoish@redislabs.com>
This commit addresses several issues related to the `INFO KEYSIZES` feature:
- HyperLogLog commands: `KEYSIZES` hooks were not properly set or tested.
- HFE lazy expiration: `KEYSIZES` hooks were not properly set or tested.
- Empty DB & SYNC flow: On `blocking_async=0` flow, global `keysizes`
histogram were not reset (can reproduced using `DEBUG RELOAD`).
- Empty string handling: Fix histogram for strings of size 0. Not
relevant to other data-types.
1) Fix a bug that passing an incorrect endtime to module.
This bug was found by @ShooterIT.
After #13814, all endtime will be monotonic time, and we should no
longer convert it to ustime relative.
Add assertions to prevent endtime from being much larger thatn the
current time.
2) Fix a race in test `Reduce defrag CPU usage when module data can't be
defragged`
---------
Co-authored-by: ShooterIT <wangyuancode@163.com>
After #13815, we introduced incremental defragmentation for global data
for module.
Now we added a new module API `RM_DefragRedisModuleDict` to incremental
defrag `RedisModuleDict`.
This PR adds a new APIs and a new defrag callback:
```c
RedisModuleDict *RM_DefragRedisModuleDict(RedisModuleDefragCtx *ctx, RedisModuleDict *dict, RedisModuleDefragDictValueCallback valueCB, RedisModuleString **seekTo);
typedef void *(*RedisModuleDefragDictValueCallback)(RedisModuleDefragCtx *ctx, void *data, unsigned char *key, size_t keylen);
```
Usage:
```c
RedisModuleString *seekTo = NULL;
RedisModuleDict *dict = = RedisModule_CreateDict(ctx);
... populate the dict code ...
/* Defragment a dictionary completely */
do {
RedisModuleDict *new = RedisModule_DefragRedisModuleDict(ctx, dict, defragGlobalDictValueCB, &seekTo);
if (new != NULL) {
dict = new;
}
} while (seekTo);
```
---------
Co-authored-by: ShooterIT <wangyuancode@163.com>
Co-authored-by: oranagra <oran@redislabs.com>
## Description
Currently, when performing defragmentation on non-key data within the
module, we cannot process the defragmentation incrementally. This
limitation affects the efficiency and flexibility of defragmentation in
certain scenarios.
The primary goal of this PR is to introduce support for incremental
defragmentation of global module data.
## Interface Change
New module API `RegisterDefragFunc2`
This is a more advanced version of `RM_RegisterDefragFunc`, in that it
takes a new callbacks(`RegisterDefragFunc2`) that has a return value,
and can use RM_DefragShouldStop in and indicate that it should be called
again later, or is it done (returned 0).
## Note
The `RegisterDefragFunc` API remains available.
---------
Co-authored-by: ShooterIT <wangyuancode@163.com>
Co-authored-by: oranagra <oran@redislabs.com>
This PR is based on: https://github.com/valkey-io/valkey/pull/1462
## Issue/Problems
Duty Cycle: Active Defrag has configuration values which determine the
intended percentage of CPU to be used based on a gradient of the
fragmentation percentage. However, Active Defrag performs its work on
the 100ms serverCron timer. It then computes a duty cycle and performs a
single long cycle. For example, if the intended CPU is computed to be
10%, Active Defrag will perform 10ms of work on this 100ms timer cron.
* This type of cycle introduces large latencies on the client (up to
25ms with default configurations)
* This mechanism is subject to starvation when slow commands delay the
serverCron
Maintainability: The current Active Defrag code is difficult to read &
maintain. Refactoring of the high level control mechanisms and functions
will allow us to more seamlessly adapt to new defragmentation needs.
Specific examples include:
* A single function (activeDefragCycle) includes the logic to
start/stop/modify the defragmentation as well as performing one "step"
of the defragmentation. This should be separated out, so that the actual
defrag activity can be performed on an independent timer (see duty cycle
above).
* The code is focused on kvstores, with other actions just thrown in at
the end (defragOtherGlobals). There's no mechanism to break this up to
reduce latencies.
* For the main dictionary (only), there is a mechanism to set aside
large keys to be processed in a later step. However this code creates a
separate list in each kvstore (main dict or not), bleeding/exposing
internal defrag logic. We only need 1 list - inside defrag. This logic
should be more contained for the main key store.
* The structure is not well suited towards other non-main-dictionary
items. For example, pub-sub and pub-sub-shard was added, but it's added
in such a way that in CMD mode, with multiple DBs, we will defrag
pub-sub repeatedly after each DB.
## Description of the feature
Primarily, this feature will split activeDefragCycle into 2 functions.
1. One function will be called from serverCron to determine if a defrag
cycle (a complete scan) needs to be started. It will also determine if
the CPU expenditure needs to be adjusted.
2. The 2nd function will be a timer proc dedicated to performing defrag.
This will be invoked independently from serverCron.
Once the functions are split, there is more control over the latency
created by the defrag process. A new configuration will be used to
determine the running time for the defrag timer proc. The default for
this will be 500us (one-half of the current minimum time). Then the
timer will be adjusted to achieve the desired CPU. As an example, 5% of
CPU will run the defrag process for 500us every 10ms. This is much
better than running for 5ms every 100ms.
The timer function will also adjust to compensate for starvation. If a
slow command delays the timer, the process will run proportionately
longer to ensure that the configured CPU is achieved. Given the presence
of slow commands, the proportional extra time is insignificant to
latency. This also addresses the overload case. At 100% CPU, if the
event loop slows, defrag will run proportionately longer to achieve the
configured CPU utilization.
Optionally, in low CPU situations, there would be little impact in
utilizing more than the configured CPU. We could optionally allow the
timer to pop more often (even with a 0ms delay) and the (tail) latency
impact would not change.
And we add a time limit for the defrag duty cycle to prevent excessive
latency. When latency is already high (indicated by a long time between
calls), we don't want to make it worse by running defrag for too long.
Addressing maintainability:
* The basic code structure can more clearly be organized around a
"cycle".
* Have clear begin/end functions and a set of "stages" to be executed.
* Rather than stages being limited to "kvstore" type data, a cycle
should be more flexible, incorporating the ability to incrementally
perform arbitrary work. This will likely be necessary in the future for
certain module types. It can be used today to address oddballs like
defragOtherGlobals.
* We reduced some of the globals, and reduce some of the coupling.
defrag_later should be removed from serverDb.
* Each stage should begin on a fresh cycle. So if there are
non-time-bounded operations like kvstoreDictLUTDefrag, these would be
less likely to introduce additional latency.
Signed-off-by: Jim Brunner
[brunnerj@amazon.com](mailto:brunnerj@amazon.com)
Signed-off-by: Madelyn Olson
[madelyneolson@gmail.com](mailto:madelyneolson@gmail.com)
Co-authored-by: Madelyn Olson
[madelyneolson@gmail.com](mailto:madelyneolson@gmail.com)
---------
Signed-off-by: Jim Brunner brunnerj@amazon.com
Signed-off-by: Madelyn Olson madelyneolson@gmail.com
Co-authored-by: Madelyn Olson madelyneolson@gmail.com
Co-authored-by: ShooterIT <wangyuancode@163.com>
This PR adds three new hash commands: HGETDEL, HGETEX and HSETEX. These
commands enable user to do multiple operations in one step atomically
e.g. set a hash field and update its TTL with a single command.
Previously, it was only possible to do it by calling hset and hexpire
commands subsequently.
- **HGETDEL command**
```
HGETDEL <key> FIELDS <numfields> field [field ...]
```
**Description**
Get and delete the value of one or more fields of a given hash key
**Reply**
Array reply: list of the value associated with each field or nil if the
field doesn’t exist.
- **HGETEX command**
```
HGETEX <key>
[EX seconds | PX milliseconds | EXAT unix-time-seconds | PXAT
unix-time-milliseconds | PERSIST]
FIELDS <numfields> field [field ...]
```
**Description**
Get the value of one or more fields of a given hash key, and optionally
set their expiration
**Options:**
EX seconds: Set the specified expiration time, in seconds.
PX milliseconds: Set the specified expiration time, in milliseconds.
EXAT timestamp-seconds: Set the specified Unix time at which the field
will expire, in seconds.
PXAT timestamp-milliseconds: Set the specified Unix time at which the
field will expire, in milliseconds.
PERSIST: Remove the time to live associated with the field.
**Reply**
Array reply: list of the value associated with each field or nil if the
field doesn’t exist.
- **HSETEX command**
```
HSETEX <key>
[FNX | FXX]
[EX seconds | PX milliseconds | EXAT unix-time-seconds | PXAT
unix-time-milliseconds | KEEPTTL]
FIELDS <numfields> field value [field value...]
```
**Description**
Set the value of one or more fields of a given hash key, and optionally
set their expiration
**Options:**
FNX: Only set the fields if all do not already exist.
FXX: Only set the fields if all already exist.
EX seconds: Set the specified expiration time, in seconds.
PX milliseconds: Set the specified expiration time, in milliseconds.
EXAT timestamp-seconds: Set the specified Unix time at which the field
will expire, in seconds.
PXAT timestamp-milliseconds: Set the specified Unix time at which the
field will expire, in milliseconds.
KEEPTTL: Retain the time to live associated with the field.
Note: If no option is provided, any associated expiration time will be
discarded similar to how SET command behaves.
**Reply**
Integer reply: 0 if no fields were set
Integer reply: 1 if all the fields were set
### Background
AOF is often used as an effective data recovery method, but now if we
have two AOFs from different nodes, it is hard to learn which one has
latest data. Generally, we determine whose data is more up-to-date by
reading the latest modification time of the AOF file, but because of
replication delay, even if both master and replica write to the AOF at
the same time, the data in the master is more up-to-date (there are
commands that didn't arrive at the replica yet, or a large number of
commands have accumulated on replica side ), so we may make wrong
decision.
### Solution
The replication offset always increments when AOF is enabled even if
there is no replica, we think replication offset is better method to
determine which one has more up-to-date data, whoever has a larger
offset will have newer data, so we add the start replication offset info
for AOF, as bellow.
```
file appendonly.aof.2.base.rdb seq 2 type b
file appendonly.aof.2.incr.aof seq 2 type i startoffset 224
```
And if we close gracefully the AOF file, not a crash, such as
`shutdown`, `kill signal 15` or `config set appendonly no`, we will add
the end replication offset, as bellow.
```
file appendonly.aof.2.base.rdb seq 2 type b
file appendonly.aof.2.incr.aof seq 2 type i startoffset 224 endoffset 532
```
#### Things to pay attention to
- For BASE AOF, we do not add `startoffset` and `endoffset` info, since
we could not know the start replication replication of data, and it is
useless to help us to determine which one has more up-to-date data.
- For AOFs from old version, we also don't add `startoffset` and
`endoffset` info, since we also don't know start replication replication
of them. If we add the start offset from 0, we might make the judgment
even less accurate. For example, if the master has just rewritten the
AOF, its INCR AOF will inevitably be very small. However, if the replica
has not rewritten AOF for a long time, its INCR AOF might be much
larger. By applying the following method, we might make incorrect
decisions, so we still just check timestamp instead of adding offset
info
- If the last INCR AOF has `startoffset` or `endoffset`, we need to
restore `server.master_repl_offset` according to them to avoid the
rollback of the `startoffset` of next INCR AOF. If it has `endoffset`,
we just use this value as `server.master_repl_offset`, and a very
important thing is to remove this information from the manifest file to
avoid the next time we load the manifest file with wrong `endoffset`. If
it only has `startoffset`, we calculate `server.master_repl_offset` by
the `startoffset` plus the file size.
### How to determine which one has more up-to-date data
If AOF has a larger replication offset, it will have more up-to-date
data. The following is how to get AOF offset:
Read the AOF manifest file to obtain information about **the last INCR
AOF**
1. If the last INCR AOF has `endoffset` field, we can directly use the
`endoffset` to present the replication offset of AOF
2. If there is no `endoffset`(such as redis crashes abnormally), but
there is `startoffset` filed of the last INCR AOF, we can get the
replication offset of AOF by `startoffset` plus the file size
3. Finally, if the AOF doesn’t have both `startoffset` and `endoffset`,
maybe from old version, and new version redis has not rewritten AOF yet,
we still need to check the modification timestamp of the last INCR AOF
### TODO
Fix ping causing inconsistency between AOF size and replication
offset in the future PR. Because we increment the replication offset
when sending PING/REPLCONF to the replica but do not write data to the
AOF file, this might cause the starting offset of the AOF file plus its
size to be inconsistent with the actual replication offset.
Although the commit #6ceadfb58 improves GETRANGE command behavior,
we can't accept it as we should avoid breaking changes for non-critical bug fixes.
This reverts commit 6ceadfb580.
Although the commit #7f0a7f0a6 improves the performance of the SCAN command,
we can't accept it as we should avoid breaking changes for non-critical bug fixes.
This reverts commit 7f0a7f0a69.
# PR: Add Mechanism for Internal Commands and Connections in Redis
This PR introduces a mechanism to handle **internal commands and
connections** in Redis. It includes enhancements for command
registration, internal authentication, and observability.
## Key Features
1. **Internal Command Flag**:
- Introduced a new **module command registration flag**: `internal`.
- Commands marked with `internal` can only be executed by **internal
connections**, AOF loading flows, and master-replica connections.
- For any other connection, these commands will appear as non-existent.
2. **Support for internal authentication added to `AUTH`**:
- Used by depicting the special username `internal connection` with the
right internal password, i.e.,: `AUTH "internal connection"
<internal_secret>`.
- No user-defined ACL username can have this name, since spaces are not
aloud in the ACL parser.
- Allows connections to authenticate as **internal connections**.
- Authenticated internal connections can execute internal commands
successfully.
4. **Module API for Internal Secret**:
- Added the `RedisModule_GetInternalSecret()` API, that exposes the
internal secret that should be used as the password for the new `AUTH
"internal connection" <password>` command.
- This API enables the modules to authenticate against other shards as
local connections.
## Notes on Behavior
- **ACL validation**:
- Commands dispatched by internal connections bypass ACL validation, to
give the caller full access regardless of the user with which it is
connected.
- **Command Visibility**:
- Internal commands **do not appear** in `COMMAND <subcommand>` and
`MONITOR` for non-internal connections.
- Internal commands **are logged** in the slow log, latency report and
commands' statistics to maintain observability.
- **`RM_Call()` Updates**:
- **Non-internal connections**:
- Cannot execute internal commands when the command is sent with the `C`
flag (otherwise can).
- Internal connections bypass ACL validations (i.e., run as the
unrestricted user).
- **Internal commands' success**:
- Internal commands succeed upon being sent from either an internal
connection (i.e., authenticated via the new `AUTH "internal connection"
<internal_secret>` API), an AOF loading process, or from a master via
the replication link.
Any other connections that attempt to execute an internal command fail
with the `unknown command` error message raised.
- **`CLIENT LIST` flags**:
- Added the `I` flag, to indicate that the connection is internal.
- **Lua Scripts**:
- Prevented internal commands from being executed via Lua scripts.
---------
Co-authored-by: Meir Shpilraien <meir@redis.com>
During fullsync, before loading RDB on the replica, we stop aof child to
prevent copy-on-write disaster.
Once rdb is loaded, aof is started again and it will trigger aof
rewrite. With https://github.com/redis/redis/pull/13732 , for rdbchannel
replication, this behavior was changed. Currently, we start aof after
replication buffer is streamed to db. This PR changes it back to start
aof just after rdb is loaded (before repl buffer is streamed)
Both approaches may have pros and cons. If we start aof before streaming
repl buffers, we may still face with copy-on-write issues as repl
buffers potentially include large amount of changes. If we wait until
replication buffer drained, it means we are delaying starting aof
persistence.
Additional changes are introduced as part of this PR:
- Interface change:
Added `mem_replica_full_sync_buffer` field to the `INFO MEMORY` command
reply. During full sync, it shows total memory consumed by accumulated
replication stream buffer on replica. Added same metric to `MEMORY
STATS` command reply as `replica.fullsync.buffer` field.
- Fixes:
- Count repl stream buffer size of replica as part of 'memory overhead'
calculation for fields in "INFO MEMORY" and "MEMORY STATS" outputs.
Before this PR, repl buffer was not counted as part of memory overhead
calculation, causing misreports for fields like `used_memory_overhead`
and `used_memory_dataset` in "INFO STATS" and for `overhead.total` field
in "MEMORY STATS" command reply.
- Dismiss replication stream buffers memory of replica in the fork to
reduce COW impact during a fork.
- Fixed a few time sensitive flaky tests, deleted a noop statement,
fixed some comments and fail messages in rdbchannel tests.
The PR introduces a new shared secret that is shared over all the nodes
on the Redis cluster. The main idea is to leverage the cluster bus to
share a secret between all the nodes such that later the nodes will be
able to authenticate using this secret and send internal commands to
each other (see #13740 for more information about internal commands).
The way the shared secret is chosen is the following:
1. Each node, when start, randomly generate its own internal secret.
2. Each node share its internal secret over the cluster ping messages.
3. If a node gets a ping message with secret smaller then his current
secret, it embrace it.
4. Eventually all nodes should embrace the minimal secret
The converges of the secret is as good as the topology converges.
To extend the ping messages to contain the secret, we leverage the
extension mechanism. Nodes that runs an older Redis version will just
ignore those extensions.
Specific tests were added to verify that eventually all nodes see the
secrets. In addition, a verification was added to the test infra to
verify the secret on `cluster_config_consistent` and to
`assert_cluster_state`.