#13495 introduced a change to reply -LOADING while flushing existing db on a replica. Some of our tests are
sensitive to this change and do no expect -LOADING reply.
Fixing a couple of tests that fail time to time.
This PR is based on the commits from PR
https://github.com/valkey-io/valkey/pull/258,
https://github.com/valkey-io/valkey/pull/593,
https://github.com/valkey-io/valkey/pull/639
This PR optimizes client query buffer handling in Redis by introducing
a reusable query buffer that is used by default for client reads. This
reduces memory usage by ~20KB per client by avoiding allocations for
most clients using short (<16KB) complete commands. For larger or
partial commands, the client still gets its own private buffer.
The primary changes are:
* Adding a reusable query buffer `thread_shared_qb` that clients use by
default.
* Modifying client querybuf initialization and reset logic.
* Freeing idle client query buffers when empty to allow reuse of the
reusable query buffer.
* Master client query buffers are kept private as their contents need to
be preserved for replication stream.
* When nested commands is executed, only the first user uses the reuse
buffer, and subsequent users will still use the private buffer.
In addition to the memory savings, this change shows a 3% improvement in
latency and throughput when running with 1000 active clients.
The memory reduction may also help reduce the need to evict clients when
reaching max memory limit, as the query buffer is the main memory
consumer per client.
This PR is different from https://github.com/valkey-io/valkey/pull/258
1. When a client is in the mid of requiring a reused buffer and
returning it, regardless of whether the query buffer has changed
(expanded), we do not update the reused query buffer in the middle, but
return the reused query buffer (expanded or with data remaining) or
reset it at the end.
2. Adding a new thread variable `thread_shared_qb_used` to avoid
multiple clients requiring the reusable query buffer at the same time.
---------
Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Signed-off-by: Madelyn Olson <matolson@amazon.com>
Co-authored-by: Uri Yagelnik <uriy@amazon.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: oranagra <oran@redislabs.com>
After https://github.com/redis/redis/pull/13499, If the length set by
`addReplySetLen()` does not match the actual number of elements in the
reply, it will cause protocol broken and result in the client hanging.
RM_RdbLoad() disables AOF temporarily while loading RDB.
Later, it does not enable it back as it checks AOF state (disabled by then)
rather than AOF config parameter.
Added a change to restart AOF according to config parameter.
- Avoid addReplyLongLong (which converts back to string) the value we
already have as a robj, by using addReplyProto + addReply
- Avoid doing dbFind Twice for the same dictEntry on
INCR*/DECR*/SETRANGE/APPEND commands.
- Avoid multiple sdslen calls with the same input on setrangeCommand and
appendCommand
- Introduce setKeyWithDictEntry, which is like setKey(), but accepts an
optional dictEntry input: Avoids the second dictFind in SET command
---------
Co-authored-by: debing.sun <debing.sun@redis.com>
In #13279 (found by @filipecosta90), for custom lookups, we introduce a
comparison function for `lpFind()` to compare entry, but it also
introduces some overhead.
To avoid the overhead of function pointer calls:
1. Extract the lpFindCb() method into a lpFindCbInternal() method that
is easier to inline.
2. Use unlikely to annotate the comparison method, as can only success
once.
---------
Co-authored-by: Ozan Tezcan <ozantezcan@gmail.com>
All the defrag allocations API expects to get a value and replace it, leaving the old value untouchable.
In some cases a value might be shared between multiple keys, in such cases we can not simply replace
it when the defrag callback is called.
To allow support such use cases, the PR adds two new API's to the defrag API:
1. `RM_DefragAllocRaw` - allocate memory base on a given size.
2. `RM_DefragFreeRaw` - Free the given pointer.
Those API's avoid using tcache so they operate just like `RM_DefragAlloc` but allows the user to split
the allocation and the memory free operations into two stages and control when those happen.
In addition the PR adds new API to allow the module to receive notifications when defrag start and end: `RM_RegisterDefragCallbacks`
Those callbacks are the same as `RM_RegisterDefragFunc` but promised to be called and the start
and the end of the defrag process.
This PR introduces a dedicated implementation for the SMEMBERS command
that avoids using the more generalized sinterGenericCommand function.
By tailoring the logic specifically for SMEMBERS, we reduce unnecessary
processing and memory overheads that were previously incurred by
handling more complex cases like set intersections.
---------
Co-authored-by: debing.sun <debing.sun@redis.com>
# Overall improvement
TBD ( current is approximately 6% on the achievable ops/sec), coming
from:
- In case of no module we can skip 1.3% CPU cycles on dict Iterator
creation/deletion
- Use addReplyBulkCBuffer instead of addReplyBulkCString to avoid
runtime strlen overhead within HELLO reply on string constants.
## Optimization 1: In case of no module we can skip 1.3% CPU cycles on
dict Iterator creation/deletion.
## Optimization 2: Use addReplyBulkCBuffer instead of
addReplyBulkCString to avoid runtime strlen overhead within HELLO reply
on string constants.
Currently aeApiPoll panic does not record error code information. Added
variable parameter formatting to _serverPanic to fix the issue
---------
Co-authored-by: yingyin.chen <15816602944@163.com>
On a full sync, replica starts discarding existing db. If the existing
db is huge and flush is happening synchronously, replica may become
unresponsive.
Adding a change to yield back to event loop while flushing db on
a replica. Replica will reply -LOADING in this case. Note that while
replica is loading the new rdb, it may get an error and start flushing
the partial db. This step may take a long time as well. Similarly,
replica will reply -LOADING in this case.
To call processEventsWhileBlocked() and reply -LOADING, we need to do:
- Set connSetReadHandler() null not to process further data from the master
- Set server.loading flag
- Call blockingOperationStarts()
rdbload() already does these steps and calls processEventsWhileBlocked()
while loading the rdb. Added a new call rdbLoadWithEmptyFunc() which
accepts callback to flush db before loading rdb or when an error
happens while loading.
For diskless replication, doing something similar and calling emptyData()
after setting required flags.
Additional changes:
- Allow `appendonly` config change during loading.
Config can be changed while loading data on startup or on replication
when slave is loading RDB. We allow config change command to update
`server.aof_enabled` and then lazily apply config change after loading
operation is completed.
- Added a test for `replica-lazy-flush` config
so far ./runtest --dump-logs used work for servers started within the
test proc.
now it'll also work on servers started outside the test proc scope.
the downside is that these logs can be huge if they served many tests
and not just the failing one.
but for some rare failures, we rather have that than nothing.
this feature isn't enabled y default, but is used by our GH actions.
Currently, module commands are not returned for the `ACL CAT <category>`
command, but skipped instead. Since now modules can add ACL categories
they should no longer be skipped.
Fixed the issue about GETRANGE and SUBSTR command
return unexpected result caused by the `start` and `end` out of
definition range of string.
---
## break change
Before this PR, when negative `end` was out of range (i.e., end <
-strlen), we would fix it to 0 to get the substring, which also resulted
in the first character still being returned for this kind of out of
range.
After this PR, we ensure that `GETRANGE` returns an empty bulk when the
negative end index is out of range.
Closes#11738
---------
Co-authored-by: debing.sun <debing.sun@redis.com>
Move the TYPE filtering to the scan callback so that avoided the
`lookupKey` operation. This is the follow-up to #12209 . In this thread
we introduced two breaking changes:
1. we will not attempt to do lazy expire (delete) a key that was
filtered by not matching the TYPE (like we already do for MATCH
pattern).
2. when the specified key TYPE filter is an unknown type, server will
reply a error immediately instead of doing a full scan that comes back
empty handed.
This is a missing of the PR https://github.com/redis/redis/pull/13383.
We will call `functionsLibCtxClear()` in bio, so we shouldn't touch
`curr_functions_lib_ctx` in it.
## Describe
When using the `XTRIM` command to trim a stream, it does not update the
maximal tombstone (`max_deleted_entry_id`). This leads to an issue where
the lag calculation incorrectly assumes that there are no tombstones
after the consumer group's last_id, resulting in an inaccurate lag.
The reason XTRIM doesn't need to update the maximal tombstone is that it
always trims from the beginning of the stream. This means that it
consistently changes the position of the first entry, leading to the
following scenarios:
1) First entry trimmed after maximal tombstone:
If the first entry is trimmed to a position after the maximal tombstone,
all tombstones will be before the first entry, so they won't affect the
consumer group's lag.
2) First entry trimmed before maximal tombstone:
If the first entry is trimmed to a position before the maximal
tombstone, the maximal tombstone will not be updated.
## Solution
Therefore, this PR optimizes the lag calculation by ensuring that when
both the consumer group's last_id and the maximal tombstone are behind
the first entry, the consumer group's lag is always equal to the number
of remaining elements in the stream.
Supplement to PR https://github.com/redis/redis/pull/13338
Hash field expiration is optimized to avoid frequent update global HFE DS for
each field deletion. Eventually active-expiration will run and update or remove
the hash from global HFE DS gracefully. Nevertheless, statistic "subexpiry"
might reflect wrong number of hashes with HFE to the user if HDEL deletes
the last field with expiration in hash (yet there are more fields without expiration).
Following this change, if HDEL the last field with expiration in the hash then
take care to remove the hash from global HFE DS as well.
Fix memory leak related to variable slot_nodes in the
clusterManagerFixSlotsCoverage() function.
---------
Co-authored-by: debing.sun <debing.sun@redis.com>
This PR is based on the commits from PR
https://github.com/valkey-io/valkey/pull/52.
Ref: https://github.com/redis/redis/pull/12760
Close https://github.com/redis/redis/issues/13401
This PR will replace https://github.com/redis/redis/pull/13449
Fixes compatibilty of Redis cluster (7.2 - extensions enabled by
default) with older Redis cluster (< 7.0 - extensions not handled) .
With some of the extensions enabled by default in 7.2 version, new nodes
running 7.2 and above start sending out larger clusterbus message
payload including the ping extensions. This caused an incompatibility
with node running engine versions < 7.0. Old nodes (< 7.0) would receive
the payload from new nodes (> 7.2) would observe a payload length
(totlen) > (estlen) and would perform an early exit and won't process
the message.
This fix does the following things:
1. Always set `CLUSTERMSG_FLAG0_EXT_DATA`, because during the meet
phase, we do not know whether the connected node supports ext data, we
need to make sure that it knows and send back its ext data if it has.
2. If another node does not support ext data, we will not send it ext
data to avoid the handshake failure due to the incorrect payload length.
Note: A successful `PING`/`PONG` is required as a sender for a given
node to be marked as `CLUSTERMSG_FLAG0_EXT_DATA` and then extensions
message
will be sent to it. This could cause a slight delay in receiving the
extensions message(s).
---------
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Co-authored-by: Harkrishn Patro <harkrisp@amazon.com>
---------
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Co-authored-by: Harkrishn Patro <harkrisp@amazon.com>
First, we need to ensure that `curmaster` in
`clusterUpdateSlotsConfigWith()` is not NULL in the line
82f00f5179/src/cluster_legacy.c (L2320)
otherwise, it will crash in the
82f00f5179/src/cluster_legacy.c (L2395)
So when loading cluster node config, we need to ensure that the
following conditions are met:
1. A node must be at least one of the master or replica.
2. If a node is a replica, its master can't be NULL.
Make the clusterCommandShards function use only cluster API functions
instead of accessing cluster implementation details.
This way the cluster API implementation doesn't have to have intimate
knowledge of the command reply format, and doesn't need to interact with
the client directly (the addReply function family).
The PR has two commits, one moves the function from cluster_legacy.c to
cluster.c, and the other modifies it's implementation.
**better merge without squashing.**
This and the previous commit make the cluster
shards command a generic implementation instead of a
specific implementation for each cluster API implementation.
This commit (a) adds functions to the cluster API
and (b) modifies the cluster shards cmd implementation
to use cluster API functions instead of directly
accessing the legacy clustering implementation.
Signed-off-by: Josh Hershberg <yehoshua@redis.com>
This and the next following commit makes the cluster
shards command a generic implementation instead of a
specific implementation for each cluster API implementation.
This commit simply moves the cluster shards implementation
from cluster_legacy.c to cluster.c without changing the
implementation at all. The reason for doing so was to
help with reviewing the changes in the diff.
Signed-off-by: Josh Hershberg <yehoshua@redis.com>
## Replace bit shift with `__builtin_ctzll` in HyperLogLog
Builtin function `__builtin_ctzll` is more effective than bit shift even
though "in the average case there are high probabilities to find a 1
after a few iterations" mentioned in the source file comment.
---------
Co-authored-by: debing.sun <debing.sun@redis.com>
When the server restarts while the CLI is connecting, the reconnection
does not automatically select the previous db.
This may lead users to believe they are still in the previous db, in
fact, they are in db0.
This PR will automatically reset the current dbnum and `cliSelect()`
again when reconnecting.
---------
Co-authored-by: debing.sun <debing.sun@redis.com>
Close https://github.com/redis/redis/issues/13414
When the cluster's master node fails and is switched to another node,
the first node in the shard node list (the old master) is no longer
valid.
Add a new method clusterGetMasterFromShard() to obtain the current
master.
1. Fix fuzzer test failure when the key was deleted due to expiration
before sending random traffic for the key.
After HFE, when all fields in a hash are expired, the hash might be
deleted due to expiration.
If the key was expired in the mid of `RESTORE` command and sending rand
trafic, `fuzzer` test will fail in the following code because the 'TYPE
key' will return `none` and then throw an exception because it cannot be
found in `$commands`
94b9072e44/tests/support/util.tcl (L712-L713)
This PR adds a `None` check for the reply of `KEY TYPE` command and adds
a print of `err` to avoid false positives in the future.
failed CI:
https://github.com/redis/redis/actions/runs/10127334121/job/28004985388
2. Fix the issue where key was deleted due to expiration before the
`scan.scan_key` command could execute, caused by premature enabling of
`set-active-expire`.
failed CI:
https://github.com/redis/redis/actions/runs/10153722715/job/28077610552
---------
Co-authored-by: oranagra <oran@redislabs.com>
Fix#13337
Ths PR fixes fixed two bugs that caused lag calculation errors.
1. When the latest tombstone is before the first entry, the tombstone
may stil be after the last id of consume group.
2. When a tombstone is after the last id of consume group, the group's
counter will be invalid, we should caculate the entries_read by using
estimates.
* some tests didn't wait for replication offset sync
* tests that used deferring client, didn't wait for it to get blocked.
an in some cases, the replication offset sync ended before the deferring
client finished, so the digest match failed.
* some tests used deferring clients excessively
* the tests didn't read the client response
* the tests didn't close the client (fd leak)
Modify RDB_TYPE_HASH_METADATA layout to store expiration times relative
to the minimum expiration time, which is written at the start as absolute time.
[exception]: Executing test client: ERR FAILOVER target replica is not
online.. ERR FAILOVER target replica is not online.
while executing
"$node_0 failover to $node_1_host $node_1_port"
("uplevel" body line 16)
invoked from within
"uplevel 1 $code"
(procedure "test" line 58)
invoked from within
"test {failover command to specific replica works} {
[err]: client evicted due to percentage of maxmemory in
tests/unit/client-eviction.tcl
Expected 33622 >= 220200 && 33622 < 440401 (context: type eval line 17
cmd {assert {$tot_mem >= $n && $tot_mem < $maxmemory_clients_actual}}
proc ::test)
Recently in #13361, i attempted to fix a race between FLUSHALL and
BGSAVE, where despite calling killRDBChild, the
backgroundSaveDoneHandler will terminate with success.
Turns out that even if the child didn't yet exit, there's a chance it'll
still miss our signal and exit with success.
in that case, we will still mess up the dirty counter (deducting
dirty_before_bgsave) which is reset by FLUSHALL, and override the
synchronous rdb file we saved.
instead, we'll set a flag to treat the next done handler as a failed
one.
When the tests are run against an external server in this order:
`--single unit/introspection --single unit/moduleapi/blockonbackground
--single integration/redis-cli`
the test would hang when the "ASK redirect test" test attempts to create
a listening socket (it fails, and then redis-cli itself hangs waiting
for a non-responsive socket created by the introspection test).
the reasons are:
1. the blockedbackground test includes util.tcl and resets the
`::last_port_attempted` variable
2. the test in introspection didn't close the listening server, so it's
still alive.
3. find_available_port doesn't properly detect the busy port, and it
thinks that the port is free even though it's busy.
fixing all 3 of these problems, even though fixing just one would be
enough to let the test pass.
- when uploading server logs, make sure they don't overwrite each other.
- sort the test units to get consistent order between them (following
#13220)
- backup and restore the entire server configuration, to protect one
unit from config changes another unit performs
Nowdays we do not trigger LUA GC after loading lua script. This means
that when a large number of scripts are loaded, such as when functions
are propagating from the master to the replica, if the LUA scripts are
never touched on the replica, the garbage might remain there
indefinitely.
Before this PR, we would share a gc_count between scripts and functions.
This means that, under certain circumstances, the GC trigger for scripts
and functions was not fair.
For example, loading a large number of scripts followed by a small
number of functions could result in the functions triggering GC.
In this PR, we assign a unique `gc_count` to each of them, so the GC
triggers between them will no longer affect each other.
on the other hand, this PR will to bring regession for script loading
commands(`FUNCTION LOAD` and `SCRIPT LOAD`), but they are not hot path,
we can ignore it, and it will be replaced
https://github.com/redis/redis/pull/13375 in the future.
---------
Co-authored-by: Oran Agra <oran@redislabs.com>
When we terminate the diskless RDB saving child process and, at the same
time, we start a new BGSAVE for new replicas, we should not delete the
RDB read event. Otherwise, these replicas will never receive a response.
this is a result of the recent change in
https://github.com/redis/redis/pull/13361
---------
Co-authored-by: oranagra <oran@redislabs.com>