Before https://github.com/redis/redis/pull/13732, replicas were brought
online immediately after master wrote the last bytes of the RDB file to
the socket. This behavior remains unchanged if rdbchannel replication is
not used. However, with rdbchannel replication, the replica is brought
online after receiving the first ack which is sent by replica after rdb
is loaded.
To align the behavior, reverting this change to put replica online once
bgsave is done.
Additonal changes:
- INFO field `mem_total_replication_buffers` will also contain
`server.repl_full_sync_buffer.mem_used` which shows accumulated
replication stream during rdbchannel replication on replica side.
- Deleted debug level logging from some replication tests. These tests
generate thousands of keys and it may cause per key logging on some
cases.
CI / build-libc-malloc (push) Failing after 31sDetails
CI / build-centos-jemalloc (push) Failing after 31sDetails
CI / build-old-chain-jemalloc (push) Failing after 31sDetails
Codecov / code-coverage (push) Failing after 31sDetails
Spellcheck / Spellcheck (push) Failing after 31sDetails
CI / build-macos-latest (push) Has been cancelledDetails
Coverity Scan / coverity (push) Has been skippedDetails
External Server Tests / test-external-standalone (push) Failing after 32sDetails
External Server Tests / test-external-cluster (push) Failing after 32sDetails
External Server Tests / test-external-nodebug (push) Failing after 2m18sDetails
Close https://github.com/redis/redis/issues/13868
This bug was introduced by https://github.com/redis/redis/pull/13468
## Issue
To maintain compatibility with older versions that do not support
shardid, when a replica passes a shardid, we also update the master’s
shardid accordingly.
However, when both the master and replica support shardid, an issue
arises: in one moment, the master may pass a shardid, causing us to
update both the master and all its replicas to match the master’s
shardid. But if the replica later passes a different shardid, we would
then update the master’s shardid again, leading to continuous changes in
shardid.
## Solution
Regardless of the situation, we always ensure that the replica’s shardid
remains consistent with the master’s shardid.
CI / build-libc-malloc (push) Failing after 32sDetails
CI / build-centos-jemalloc (push) Failing after 32sDetails
CI / build-debian-old (push) Failing after 48sDetails
CI / build-old-chain-jemalloc (push) Failing after 31sDetails
Codecov / code-coverage (push) Failing after 31sDetails
Spellcheck / Spellcheck (push) Failing after 31sDetails
CodeQL / Analyze (cpp) (push) Failing after 31sDetails
CI / build-macos-latest (push) Has been cancelledDetails
Coverity Scan / coverity (push) Has been skippedDetails
External Server Tests / test-external-standalone (push) Failing after 31sDetails
External Server Tests / test-external-cluster (push) Failing after 31sDetails
External Server Tests / test-external-nodebug (push) Failing after 31sDetails
When the `restore foo 0 $encoded freq 100` command and `set freq [r
object freq foo]` run in different minute timestamps (i.e., when
server.unixtime/60 changes between these operations), the assertion may
fail due to the LFU decay.
This PR updates the “RESTORE can set LFU” test to verify the actual freq
value based on minute timestamps.
---------
Co-authored-by: debing.sun <debing.sun@redis.com>
This way we don't need to mess with node->value at a latter time
where an explicit lock would be required. Now we have:
1. Prepare context (neighbors).
2. Commit, and set the associated value.
CI / test-sanitizer-address (push) Failing after 32sDetails
CI / build-libc-malloc (push) Failing after 32sDetails
CI / build-debian-old (push) Failing after 32sDetails
CI / build-centos-jemalloc (push) Failing after 31sDetails
CI / build-old-chain-jemalloc (push) Failing after 31sDetails
Codecov / code-coverage (push) Failing after 31sDetails
CI / test-ubuntu-latest (push) Failing after 2m7sDetails
Spellcheck / Spellcheck (push) Failing after 31sDetails
CI / build-macos-latest (push) Has been cancelledDetails
External Server Tests / test-external-standalone (push) Failing after 31sDetails
Coverity Scan / coverity (push) Has been skippedDetails
External Server Tests / test-external-cluster (push) Failing after 32sDetails
External Server Tests / test-external-nodebug (push) Failing after 31sDetails
This PR is based on: https://github.com/valkey-io/valkey/pull/1801
[SoftlyRaining](https://github.com/SoftlyRaining) was hunting for defrag
bugs with Jim and found a couple of improvements to make. Jim pointed
out that in several of the callbacks, if the encoding were to change it
simply returns without doing anything to `cursor` to make it reach 0,
meaning that it would continue no-op working on that item without making
any progress. Type and encoding can change while the defrag scan is in
progress if the value is mutated or replaced by something else with the
same key.
---------
Signed-off-by: Rain Valentine <rsg000@gmail.com>
Co-authored-by: Rain Valentine <rsg000@gmail.com>
We pass our aborting allocation function to the HNSW lib, the
only other reason for it to fail is pthread mutex locking failing
but this is also practically impossible AFAIK in modern systems,
and if it happens (for kernel reosurces shortage) anyway to
abort is the best thing to do: otherwise we would have to return
that we could not complete the operation for some reason, which
is not uniform with everything Redis does. In Redis under
normal conditions writes must succeed if they are semantically
correct, or the server crash for OOM.
CI / build-old-chain-jemalloc (push) Failing after 32sDetails
Codecov / code-coverage (push) Failing after 31sDetails
External Server Tests / test-external-standalone (push) Failing after 32sDetails
External Server Tests / test-external-cluster (push) Failing after 32sDetails
External Server Tests / test-external-nodebug (push) Failing after 32sDetails
CI / test-ubuntu-latest (push) Failing after 1m37sDetails
Spellcheck / Spellcheck (push) Failing after 32sDetails
When the diskless load configuration is set to on-empty-db, we retain a
pointer to the function library context. When emptyData() is called, it
frees this function library context pointer, leading to a use-after-free
situation.
I refactored code to ensure that emptyData() is called first, followed
by retrieving the valid pointer to the function library context.
Refactored code should not introduce any runtime implications.
Bug introduced by https://github.com/redis/redis/pull/13495 (Redis 8.0)
Co-authored-by: Oran Agra <oran@redislabs.com>
Codecov / code-coverage (push) Failing after 8sDetails
CI / build-libc-malloc (push) Successful in 50sDetails
CI / test-ubuntu-latest (push) Failing after 2m9sDetails
CI / test-sanitizer-address (push) Failing after 2m40sDetails
Spellcheck / Spellcheck (push) Successful in 9m2sDetails
External Server Tests / test-external-standalone (push) Failing after 32sDetails
External Server Tests / test-external-cluster (push) Failing after 32sDetails
Coverity Scan / coverity (push) Has been skippedDetails
External Server Tests / test-external-nodebug (push) Failing after 31sDetails
This fixes an error that occurs in the job
[test-valgrind-no-malloc-usable-size-test](https://github.com/redis/redis/actions/runs/13912357739/job/38929051397)
of the Daily workflow:
```
*** [err]: HEXPIREAT - Set time and then get TTL (listpackex) in tests/unit/type/hash-field-expire.tcl
Expected '999' to be between to '1000' and '2000' (context: type eval line 6 cmd {assert_range [r hpttl myhash FIELDS 1 field1] 1000 2000} proc ::test)
```
CI / build-libc-malloc (push) Successful in 53sDetails
CI / test-sanitizer-address (push) Failing after 1m6sDetails
CI / test-ubuntu-latest (push) Failing after 2m57sDetails
Spellcheck / Spellcheck (push) Successful in 9m5sDetails
Coverity Scan / coverity (push) Has been skippedDetails
External Server Tests / test-external-cluster (push) Failing after 31sDetails
External Server Tests / test-external-standalone (push) Failing after 6m35sDetails
External Server Tests / test-external-nodebug (push) Failing after 15m1sDetails
CI / build-macos-latest (push) Has been cancelledDetails
in #13505, we changed the code to use the string value of the key rather
than the integer value on the stack, but we have a test in
unit/moduleapi/keyspace_events that uses keyspace notification hook to
modify the value with RM_StringDMA, which can cause this value to be
released before used. the reason it didn't happen so far is because we
were using shared integers, so releasing the object doesn't free it.
First, when we do `raxSeek()` and then call raxNext, we will get the
`RAX_ITER_JUST_SEEKED` flag and return success directly.
We always set the node defrag callback after `raxSeek()`, which means
that when we break from defragmentation, the first node that comes in
again will never be defragged.
In this PR, we save the last as the next node to be processed, not the
last node to be completed.
This way we defrag the next node when we exit to avoid it being skipped
on the next resume.
---------
Co-authored-by: oranagra <oran@redislabs.com>
CI / test-ubuntu-latest (push) Failing after 31sDetails
CI / build-debian-old (push) Failing after 32sDetails
CI / build-libc-malloc (push) Failing after 31sDetails
CI / build-centos-jemalloc (push) Failing after 31sDetails
CI / build-old-chain-jemalloc (push) Failing after 32sDetails
Codecov / code-coverage (push) Failing after 32sDetails
Spellcheck / Spellcheck (push) Failing after 32sDetails
CI / test-sanitizer-address (push) Failing after 4m35sDetails
CI / build-32bit (push) Failing after 5m35sDetails
CI / build-macos-latest (push) Has been cancelledDetails
CodeQL / Analyze (cpp) (push) Failing after 32sDetails
Coverity Scan / coverity (push) Has been skippedDetails
External Server Tests / test-external-standalone (push) Failing after 31sDetails
External Server Tests / test-external-cluster (push) Failing after 31sDetails
External Server Tests / test-external-nodebug (push) Failing after 6m47sDetails
The bug mentioned in this
[#13862](https://github.com/redis/redis/issues/13862) has been fixed.
---------
Signed-off-by: li-benson <1260437731@qq.com>
Signed-off-by: youngmore1024 <youngmore1024@outlook.com>
Co-authored-by: youngmore1024 <youngmore1024@outlook.com>
After #13840, the data we populate becomes more complex and slower, we
always wait for a defragmentation cycle to end before verifying that the
test is okay.
However, in some slow environments, an entire defragmentation cycle can
exceed 5 seconds, and in my local test using 'taskset -c 0' it can reach
6 seconds, so increase the threshold to avoid test failures.
CI / build-libc-malloc (push) Failing after 31sDetails
CI / build-debian-old (push) Failing after 1m32sDetails
CI / build-old-chain-jemalloc (push) Failing after 31sDetails
Codecov / code-coverage (push) Failing after 31sDetails
CI / test-ubuntu-latest (push) Failing after 3m21sDetails
Spellcheck / Spellcheck (push) Failing after 31sDetails
CI / test-sanitizer-address (push) Failing after 6m36sDetails
CI / build-centos-jemalloc (push) Failing after 6m36sDetails
External Server Tests / test-external-standalone (push) Failing after 2m10sDetails
Coverity Scan / coverity (push) Has been skippedDetails
External Server Tests / test-external-nodebug (push) Failing after 2m12sDetails
External Server Tests / test-external-cluster (push) Failing after 2m16sDetails
### Background
The program runs normally in standalone mode, but migrating to cluster
mode may cause errors, this is because some cross slot commands can not
run in cluster mode. We should provide an approach to detect this issue
when running in standalone mode, and need to expose a metric which
indicates the usage of no incompatible commands.
### Solution
To avoid perf impact, we introduce a new config
`cluster-compatibility-sample-ratio` which define the sampling ratio
(0-100) for checking command compatibility in cluster mode. When a
command is executed, it is sampled at the specified ratio to determine
if it complies with Redis cluster constraints, such as cross-slot
restrictions.
A new metric is exposed: `cluster_incompatible_ops` in `info stats`
output.
The following operations will be considered incompatible operations.
- cross-slot command
If a command has multiple cross slot keys, it is incompatible
- `swap, copy, move, select` command
These commands involve multi databases in some cases, we don't allow
multiple DB in cluster mode, so there are not compatible
- Module command with `no-cluster` flag
If a module command has `no-cluster` flag, we will encounter an error
when loading module, leading to fail to load module if cluster is
enabled, so this is incompatible.
- Script/function with `no-cluster` flag
Similar with module command, if we declare `no-cluster` in shebang of
script/function, we also can not run it in cluster mode
- `sort` command by/get pattern
When `sort` command has `by/get` pattern option, we must ask that the
pattern slot is equal with the slot of keys, otherwise it is
incompatible in cluster mode.
- The script/function command accesses the keys and declared keys have
different slots
For the script/function command, we not only check the slot of declared
keys, but only check the slot the accessing keys, if they are different,
we think it is incompatible.
**Besides**, commands like `keys, scan, flushall, script/function
flush`, that in standalone mode iterate over all data to perform the
operation, are only valid for the server that executes the command in
cluster mode and are not broadcasted. However, this does not lead to
errors, so we do not consider them as incompatible commands.
### Performance impact test
**cross slot test**
Below are the test commands and results. When using MSET with 8 keys,
performance drops by approximately 3%.
**single key test**
It may be due to the overhead of the sampling function, and single-key
commands could cause a 1-2% performance drop.
CI / build-macos-latest (push) Waiting to runDetails
CI / build-debian-old (push) Failing after 6sDetails
CI / build-centos-jemalloc (push) Failing after 5sDetails
CI / build-old-chain-jemalloc (push) Failing after 3sDetails
Codecov / code-coverage (push) Failing after 7sDetails
CI / build-libc-malloc (push) Successful in 56sDetails
CI / test-sanitizer-address (push) Failing after 1m8sDetails
CI / test-ubuntu-latest (push) Failing after 2m13sDetails
CI / build-32bit (push) Failing after 3m28sDetails
Coverity Scan / coverity (push) Has been skippedDetails
External Server Tests / test-external-nodebug (push) Failing after 1m48sDetails
External Server Tests / test-external-standalone (push) Failing after 2m9sDetails
External Server Tests / test-external-cluster (push) Failing after 2m14sDetails
Spellcheck / Spellcheck (push) Successful in 9m3sDetails
Since https://github.com/redis/redis/pull/11884, what was previously
accepted as a valid input (hexadecimal string) before 8.0 returned an
error. This PR addresses it. To avoid performance penalties if hints the
compiler that the fallbacks are not likely to happen.
Furthermore, we were ignoring std::result_out_of_range outputs from
fast_float. This PR addresses it as well and includes tests for both
identified scenarios.
---------
Co-authored-by: debing.sun <debing.sun@redis.com>