* Introduce a new API's: RM_GetContextFlagsAll, and
RM_GetKeyspaceNotificationFlagsAll that will return the
full flags mask of each feature. The module writer can
check base on this value if the Flags he needs are
supported or not.
* For each flag, introduce a new value on redismodule.h,
this value represents the LAST value and should be there
as a reminder to update it when a new value is added,
also it will be used in the code to calculate the full
flags mask (assuming flags are incrementally increasing).
In addition, stated that the module writer should not use
the LAST flag directly and he should use the GetFlagAll API's.
* Introduce a new API: RM_IsSubEventSupported, that returns for a given
event and subevent, whether or not the subevent supported.
* Introduce a new macro RMAPI_FUNC_SUPPORTED(func) that returns whether
or not a function API is supported by comparing it to NULL.
* Introduce a new API: int RM_GetServerVersion();, that will return the
current Redis version in the format 0x00MMmmpp; e.g. 0x00060008;
* Changed unstable version from 999.999.999 to 255.255.255
Co-authored-by: Oran Agra <oran@redislabs.com>
Co-authored-by: Yossi Gottlieb <yossigo@gmail.com>
(cherry picked from commit adc3183cd2)
This API function makes it possible to retrieve the X.509 certificate
used by clients to authenticate TLS connections.
(cherry picked from commit 0aec98dce2)
The main motivation here is to provide a way for modules to create a
single, global context that can be used for logging.
Currently, it is possible to obtain a thread-safe context that is not
attached to any blocked client by using `RM_GetThreadSafeContext`.
However, the attached context is not linked to the module identity so
log messages produced are not tagged with the module name.
Ideally we'd fix this in `RM_GetThreadSafeContext` itself but as it
doesn't accept the current context as an argument there's no way to do
that in a backwards compatible manner.
(cherry picked from commit 907da0580b)
This is essentially the same as calling COMMAND GETKEYS but provides a
more efficient interface that can be used in every context (i.e. not a
Redis command).
(cherry picked from commit 7d117d7591)
I suppose that it was overlooked, since till recently none of the blocked commands were readonly.
other changes:
- add test for the above.
- add better support for additional (and deferring) clients for
cluster tests
- improve a test which left the client in MULTI state.
(cherry picked from commit 216c110609)
track and report memory used by clients argv.
this is very usaful in case clients started sending a command and didn't
complete it. in which case the first args of the command are already
trimmed from the query buffer.
in an effort to avoid cache misses and overheads while keeping track of
these, i avoid calling sdsZmallocSize and instead use the sdslen /
bulk-len which can at least give some insight into the problem.
This memory is now added to the total clients memory usage, as well as
the client list.
(cherry picked from commit bea40e6a41)
PROBLEM:
[$rd1 read] reads invalidation messages one by one, so it's never going to see the second invalidation message produced after INCR b, whether or not it exists. Adding another read will block incase no invalidation message is produced.
FIX:
We switch the order of "INCR a" and "INCR b" - now "INCR b" comes first. We still only read the first invalidation message produces. If an invalidation message is wrongly produces for b - then it will be produced before that of a, since "INCR b" comes before "INCR a".
Co-authored-by: Nitai Caro <caronita@amazon.com>
(cherry picked from commit 8fb89a5728)
Before this commit, we would have continued to add replies to the reply buffer even if client
output buffer limit is reached, so the used memory would keep increasing over the configured limit.
What's more, we shouldn’t write any reply to the client if it is set 'CLIENT_CLOSE_ASAP' flag
because that doesn't conform to its definition and we will close all clients flagged with
'CLIENT_CLOSE_ASAP' in ‘beforeSleep’.
Because of code execution order, before this, we may firstly write to part of the replies to
the socket before disconnecting it, but in fact, we may can’t send the full replies to clients
since OS socket buffer is limited. But this unexpected behavior makes some commands work well,
for instance ACL DELUSER, if the client deletes the current user, we need to send reply to client
and close the connection, but before, we close the client firstly and write the reply to reply
buffer. secondly, we shouldn't do this despite the fact it works well in most cases.
We add a flag 'CLIENT_CLOSE_AFTER_COMMAND' to mark clients, this flag means we will close the
client after executing commands and send all entire replies, so that we can write replies to
reply buffer during executing commands, send replies to clients, and close them later.
We also fix some implicit problems. If client output buffer limit is enforced in 'multi/exec',
all commands will be executed completely in redis and clients will not read any reply instead of
partial replies. Even more, if the client executes 'ACL deluser' the using user in 'multi/exec',
it will not read the replies after 'ACL deluser' just like before executing 'client kill' itself
in 'multi/exec'.
We added some tests for output buffer limit breach during multi-exec and using a pipeline of
many small commands rather than one with big response.
Co-authored-by: Oran Agra <oran@redislabs.com>
(cherry picked from commit 57709c4bc6)
We're already using bg_unlink in several places to delete the rdb file in the background,
and avoid paying the cost of the deletion from our main thread.
This commit uses bg_unlink to remove the temporary rdb file in the background too.
However, in case we delete that rdb file just before exiting, we don't actually wait for the
background thread or the main thread to delete it, and just let the OS clean up after us.
i.e. we open the file, unlink it and exit with the fd still open.
Furthermore, rdbRemoveTempFile can be called from a thread and was using snprintf which is
not async-signal-safe, we now use ll2string instead.
(cherry picked from commit b002d2b4f1)
Before this commit, following command did not show --tls option:
./runtest-cluster --help
./runtest-sentinel --help
(cherry picked from commit 98c8ac0d70)
This test was nearly always failing on MacOS github actions.
This is because of bugs in the test that caused it to nearly always run
all 3 attempts and just look at the last one as the pass/fail creteria.
i.e. the test was nearly always running all 3 attempts and still sometimes
succeed. this is because the break condition was different than the test
completion condition.
The reason the test succeeded is because the break condition tested the
results of all 3 tests (PSETEX/PEXPIRE/PEXPIREAT), but the success check
at the end was only testing the result of PSETEX.
The reason the PEXPIREAT test nearly always failed is because it was
getting the current time wrong: getting the current second and loosing
the sub-section time, so the only chance for it to succeed is if it run
right when a certain second started.
Because i now get the time from redis, adding another round trip, i
added another 100ms to the PEXPIRE test to make it less fragile, and
also added many more attempts.
Adding many more attempts before failure to account for slow platforms,
github actions and valgrind
(cherry picked from commit ed9bfe2262)
The key save delay is too short and on certain systems the child process
is gone before we have a chance to inspect it.
(cherry picked from commit b2a73c404b)
This is a catch-all test to confirm that that rewrite produces a valid
output for all parameters and that this process does not introduce
undesired configuration changes.
(cherry picked from commit a8b7268911)
There was a bug. Although cluster replicas would allow read commands,
they would not allow a MULTI-EXEC that's composed solely of read commands.
Adds tests for coverage.
Co-authored-by: Oran Agra <oran@redislabs.com>
Co-authored-by: Eran Liberty <eranl@amazon.com>
(cherry picked from commit b120366d48)
if there are nested tests and nested servers, we need to restore the
previous value of cur_test when a test exist.
example:
```
test{test 1} {
start_server {
test{test 1.1 - master only} {
}
start_server {
test{test 1.2 - with replication} {
}
}
}
}
```
when `test 1.1 - master only exists`, we're still inside `test 1`
(cherry picked from commit 0a1e734193)
1) cur_test: when restart_server, "no such variable" error occurs
./runtest --single integration/rdb
test {client freed during loading}
SET ::cur_test
restart_server
kill_server
test "Check for memory leaks (pid $pid)"
SET ::cur_test
UNSET ::cur_test
UNSET ::cur_test // This global variable has been unset.
2) `ps --ppid` not available on macOS platform, can be replaced with
`pgrep -P pid`.
(cherry picked from commit f22fa9594d)
This test was failing from time to time see discussion at the bottom of #7635
This was probably due to timing, the DEBUG SLEEP executed by redis-cli
didn't sleep for enough time.
This commit changes:
1) use SET-ACTIVE-EXPIRE instead of DEBUG SLEEP
2) reduce many `after` sleeps with retry loops to speed up the test.
3) add many comment explaining the different steps of the test and
it's purpose.
4) config appendonly before populating the volatile keys, so that they'll
be part of the AOF command stream rather than the preamble RDB portion.
other complications: recently kill_instance switched from SIGKILL to
SIGTERM, and this would sometimes fail since there was an AOFRW running
in the background. now we wait for it to end before attempting the kill.
(cherry picked from commit b491d477c3)
There is an inherent race condition in port allocation for spawned
servers. If a server fails to start because a port is taken, a new port
is allocated. This fixes a problem where the logs are not truncated and
as a result a large number of unmonitored servers are started.
(cherry picked from commit 2df4cb93ac)
2b998de46 added a file for stderr to keep valgrind log but i forgot to
add a similar thing when valgrind isn't being used.
the result is that `glob */err.txt` fails.
(cherry picked from commit 42ba7a1b75)
- redirect valgrind reports to a dedicated file rather than console
- try to avoid killing instances with SIGKILL so that we get the memory
leak report (killing with SIGTERM before resorting to SIGKILL)
- search for valgrind reports when done, print them and fail the tests
- add --dont-clean option to keep the logs on exit
- fix exit error code when crash is found (would have exited with 0)
changes that affect the normal redis test suite:
- refactor check_valgrind_errors into two functions one to search and
one to report
- move the search half into util.tcl to serve the cluster tests too
- ignore "address range perms" valgrind warnings which seem non relevant.
(cherry picked from commit 2b998de460)
in some cases a command that returns an error possibly due to a timing
issue causes the tcl code to crash and thus prevents the rest of the
tests from running. this adds an option to make the test proceed despite
the crash.
maybe it should be the default mode some day.
(cherry picked from commit fe5da2e60d)
- skip full units
- skip a single test (not just a list of tests)
- when skipping tag, skip spinning up servers, not just the tests
- skip tags when running against an external server too
- allow using multiple tags (split them)
(cherry picked from commit 677d14c213)
When runtest-cluster, at first, we need to create a cluster use spawn_instance,
a port which is not used is choosen, however sometimes we can't run server on
the port. possibley due to a race with another process taking it first.
such as redis/redis/runs/896537490. It may be due to the machine problem or
In order to reduce the probability of failure when start redis in
runtest-cluster, we attemp to use another port when find server do not start up.
Co-authored-by: Oran Agra <oran@redislabs.com>
Co-authored-by: yanhui13 <yanhui13@meituan.com>
(cherry picked from commit e2d64485b8)
Add Linux kernel OOM killer control option.
This adds the ability to control the Linux OOM killer oom_score_adj
parameter for all Redis processes, depending on the process role (i.e.
master, replica, background child).
A oom-score-adj global boolean flag control this feature. In addition,
specific values can be configured using oom-score-adj-values if
additional tuning is required.
(cherry picked from commit 2530dc0ebd)
Added RedisModule_HoldString that either returns a
shallow copy of the given String (by increasing
the String ref count) or a new deep copy of String
in case its not possible to get a shallow copy.
Co-authored-by: Itamar Haber <itamar@redislabs.com>
(cherry picked from commit 3f494cc49d)
If the server gets MULTI command followed by only read
commands, and right before it gets the EXEC it reaches OOM,
the client will get OOM response.
So, from now on, it will get OOM response only if there was
at least one command that was tagged with `use-memory` flag
(cherry picked from commit b7289e912c)
When calling to LPOS command when RANK is higher than matches,
the return value is non valid response. For example:
```
LPUSH l a
:1
LPOS l b RANK 5 COUNT 10
*-4
```
It may break client-side parser.
Now, we count how many replies were replied in the array.
```
LPUSH l a
:1
LPOS l b RANK 5 COUNT 10
*0
```
(cherry picked from commit 9204a9b2c2)
After fork, the child process(redis-aof-rewrite) will get the fd opened
by the parent process(redis), when redis killed by kill -9, it will not
graceful exit(call prepareForShutdown()), so redis-aof-rewrite thread may still
alive, the fd(lock) will still be held by redis-aof-rewrite thread, and
redis restart will fail to get lock, means fail to start.
This issue was causing failures in the cluster tests in github actions.
Co-authored-by: Oran Agra <oran@redislabs.com>
(cherry picked from commit cbaf3c5bba)
The `REDISMODULE_CLIENTINFO_FLAG_SSL` flag was already a part of the `RedisModuleClientInfo` structure but was not implemented.
(cherry picked from commit 64c360c515)
besides, hooks test was time sensitive. when the replica managed to
reconnect quickly after the client kill, the test would fail
(cherry picked from commit f7e7775990)
- the test now waits for specific set of log messages rather than wait for
timeout looking for just one message.
- we don't wanna sample the current length of the log after an action, due
to a race, we need to start the search from the line number of the last
message we where waiting for.
- when attempting to trigger a full sync, use multi-exec to avoid a race
where the replica manages to re-connect before we completed the set of
actions that should force a full sync.
- fix verify_log_message which was broken and unused
(cherry picked from commit 109b5ccdcd)
Adds an `optional` value to the previously boolean `tls-auth-clients` configuration keyword.
Co-authored-by: Yossi Gottlieb <yossigo@gmail.com>
(cherry picked from commit f31260b044)
on ci.redis.io the test fails a lot, reporting that bgsave didn't end.
increaseing the timeout we wait for that bgsave to get aborted.
in addition to that, i also verify that it indeed got aborted by
checking that the save counter wasn't reset.
add another test to verify that a successful bgsave indeed resets the
change counter.
(cherry picked from commit 8a57969fd7)
in cases where you have
test name {
start_server {
start_server {
assert
}
}
}
the exception will be thrown to the test proc, and the servers are
supposed to be killed on the way out. but it seems there was always a
bug of not cleaning the server stack, and recently (#7404) we started
relying on that stack in order to kill them, so with that bug sometimes
we would have tried to kill the same server twice, and leave one alive.
luckly, in most cases the pattern is:
start_server {
test name {
}
}
(cherry picked from commit 36b9494385)
This re-implements the redis-cli --pipe test so it no longer depends on a close feature available only in TCL 8.6.
Basically what this test does is run redis-cli --pipe, generates a bunch of commands and pipes them through redis-cli, and inspects the result in both Redis and the redis-cli output.
To do that, we need to close stdin for redis-cli to indicate we're done so it can flush its buffers and exit. TCL has bi-directional channels can only offers a way to "one-way close" a channel with TCL 8.6. To work around that, we now generate the commands into a file and feed that file to redis-cli directly.
As we're writing to an actual file, the number of commands is now reduced.
(cherry picked from commit f57e844b2e)
Before this commit, the output of "./runtest-cluster --help" is incorrect.
After this commit, the format of the following 3 output is consistent:
./runtest --help
./runtest-cluster --help
./runtest-sentinel --help
(cherry picked from commit 8128d39737)
interestingly the latency monitor test fails because valgrind is slow
enough so that the time inside PEXPIREAT command from the moment of
the first mstime() call to get the basetime until checkAlreadyExpired
calls mstime() again is more than 1ms, and that test was too sensitive.
using this opportunity to speed up the test (unrelated to the failure)
the fix is just the longer time passed to PEXPIRE.
(cherry picked from commit e5227aab89)
in the majority of the cases (on this rarely used feature) we want to
stop and be able to connect to the shard with redis-cli.
since these are two different processes interracting with the tty we
need to stop both, and we'll have to hit enter twice, but it's not that
bad considering it is rarely used.
(cherry picked from commit 02ef355f98)
* Tests: fix and reintroduce redis-cli tests.
These tests have been broken and disabled for 10 years now!
* TLS: add remaining redis-cli support.
This adds support for the redis-cli --pipe, --rdb and --replica options
previously unsupported in --tls mode.
* Fix writeConn().
(cherry picked from commit d9f970d8d3)
Similarly to EXPIREAT with TTL in the past, which implicitly deletes the
key and return success, RESTORE should not store key that are already
expired into the db.
When used together with REPLACE it should emit a DEL to keyspace
notification and replication stream.
(cherry picked from commit 5977a94842)
On some platforms strtold("+inf") with valgrind returns a non-inf result
[err]: INCRBYFLOAT does not allow NaN or Infinity in tests/unit/type/incr.tcl
Expected 'ERR*would produce*' to equal or match '1189731495357231765085759.....'
(cherry picked from commit 909bc97c52)
tests were sensitive to additional log lines appearing in the log
causing the search to come empty handed.
instead of just looking for the n last log lines, capture the log lines
before performing the action, and then search from that offset.
(cherry picked from commit 8e76e13472)
* tests/valgrind: don't use debug restart
DEBUG REATART causes two issues:
1. it uses execve which replaces the original process and valgrind doesn't
have a chance to check for errors, so leaks go unreported.
2. valgrind report invalid calls to close() which we're unable to resolve.
So now the tests use restart_server mechanism in the tests, that terminates
the old server and starts a new one, new PID, but same stdout, stderr.
since the stderr can contain two or more valgrind report, it is not enough
to just check for the absence of leaks, we also need to check for some known
errors, we do both, and fail if we either find an error, or can't find a
report saying there are no leaks.
other changes:
- when killing a server that was already terminated we check for leaks too.
- adding DEBUG LEAK which was used to test it.
- adding --trace-children to valgrind, although no longer needed.
- since the stdout contains two or more runs, we need slightly different way
of checking if the new process is up (explicitly looking for the new PID)
- move the code that handles --wait-server to happen earlier (before
watching the startup message in the log), and serve the restarted server too.
* squashme - CR fixes
(cherry picked from commit 69ade87325)
In order to support the use of multi-exec in pipeline, it is important that
MULTI and EXEC are never rejected and it is easy for the client to know if the
connection is still in multi state.
It was easy to make sure MULTI and DISCARD never fail (done by previous
commits) since these only change the client state and don't do any actual
change in the server, but EXEC is a different story.
Since in the past, it was possible for clients to handle some EXEC errors and
retry the EXEC, we now can't affort to return any error on EXEC other than
EXECABORT, which now carries with it the real reason for the abort too.
Other fixes in this commit:
- Some checks that where performed at the time of queuing need to be re-
validated when EXEC runs, for instance if the transaction contains writes
commands, it needs to be aborted. there was one check that was already done
in execCommand (-READONLY), but other checks where missing: -OOM, -MISCONF,
-NOREPLICAS, -MASTERDOWN
- When a command is rejected by processCommand it was rejected with addReply,
which was not recognized as an error in case the bad command came from the
master. this will enable to count or MONITOR these errors in the future.
- make it easier for tests to create additional (non deferred) clients.
- add tests for the fixes of this commit.
(cherry picked from commit 65a3307bc9)
The scan key module API provides the scan callback with the current
field name and value (if it exists). Those arguments are RedisModuleString*
which means it supposes to point to robj which is encoded as a string.
Using createStringObjectFromLongLong function might return robj that
points to an integer and so break a module that tries for example to
use RedisModule_StringPtrLen on the given field/value.
The PR introduces a fix that uses the createObject function and sdsfromlonglong function.
Using those function promise that the field and value pass to the to the
scan callback will be Strings.
The PR also changes the Scan test module to use RedisModule_StringPtrLen
to catch the issue. without this, the issue is hidden because
RedisModule_ReplyWithString knows to handle integer encoding of the
given robj (RedisModuleString).
The PR also introduces a new test to verify the issue is solved.
(cherry picked from commit a89bf734a9)
i.e. don't start the search from scratch hitting the used ones again.
this will also reduce the likelihood of collisions (if there are any
left) by increasing the time until we re-use a port we did use in the
past.
apparently when running tests in parallel (the default of --clients 16),
there's a chance for two tests to use the same port.
specifically, one test might shutdown a master and still have the
replica up, and then another test will re-use the port number of master
for another master, and then that replica will connect to the master of
the other test.
this can cause a master to count too many full syncs and fail a test if
we run the tests with --single integration/psync2 --loop --stop
see Probmem 2 in #7314
these tests create several edge cases that are otherwise uncovered (at
least not consistently) by the test suite, so although they're no longer
testing what they were meant to test, it's still a good idea to keep
them in hope that they'll expose some issue in the future.
There's a rare case which leads to stagnation in the defragger, causing
it to keep scanning the keyspace and do nothing (not moving any
allocation), this happens when all the allocator slabs of a certain bin
have the same % utilization, but the slab from which new allocations are
made have a lower utilization.
this commit fixes it by removing the current slab from the overall
average utilization of the bin, and also eliminate any precision loss in
the utilization calculation and move the decision about the defrag to
reside inside jemalloc.
and also add a test that consistently reproduce this issue.
with the original version of 6.0.0, this test detects an excessive full
sync.
with the fix in 1a7cd2c0e, this test detects memory corruption,
especially when using libc allocator with or without valgrind.
Seems like on some systems choosing specific TLS v1/v1.1 versions no
longer works as expected. Test is reduced for v1.2 now which is still
good enough to test the mechansim, and matters most anyway.
This bug was introduced by a recent change in which readQueryFromClient
is using freeClientAsync, and despite the fact that now
freeClientsInAsyncFreeQueue is in beforeSleep, that's not enough since
it's not called during loading in processEventsWhileBlocked.
furthermore, afterSleep was called in that case but beforeSleep wasn't.
This bug also caused slowness sine the level-triggered mode of epoll
kept signaling these connections as readable causing us to keep doing
connRead again and again for ll of these, which keep accumulating.
now both before and after sleep are called, but not all of their actions
are performed during loading, some are only reserved for the main loop.
fixes issue #7215
this test which has coverage for varoius flows of diskless master was
failing randomly from time to time.
the failure was:
[err]: diskless all replicas drop during rdb pipe in tests/integration/replication.tcl
log message of '*Diskless rdb transfer, last replica dropped, killing fork child*' not found
what seemed to have happened is that the master didn't detect that all
replicas dropped by the time the replication ended, it thought that one
replica is still connected.
now the test takes a few seconds longer but it seems stable.
Currently, there are several types of threads/child processes of a
redis server. Sometimes we need deeply optimise the performance of
redis, so we would like to isolate threads/processes.
There were some discussion about cpu affinity cases in the issue:
https://github.com/antirez/redis/issues/2863
So implement cpu affinity setting by redis.conf in this patch, then
we can config server_cpulist/bio_cpulist/aof_rewrite_cpulist/
bgsave_cpulist by cpu list.
Examples of cpulist in redis.conf:
server_cpulist 0-7:2 means cpu affinity 0,2,4,6
bio_cpulist 1,3 means cpu affinity 1,3
aof_rewrite_cpulist 8-11 means cpu affinity 8,9,10,11
bgsave_cpulist 1,10-11 means cpu affinity 1,10,11
Test on linux/freebsd, both work fine.
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
* fix memlry leaks with diskless replica short read.
* fix a few timing issues with valgrind runs
* fix issue with valgrind and watchdog schedule signal
about the valgrind WD issue:
the stack trace test in logging.tcl, has issues with valgrind:
==28808== Can't extend stack to 0x1ffeffdb38 during signal delivery for thread 1:
==28808== too small or bad protection modes
it seems to be some valgrind bug with SA_ONSTACK.
SA_ONSTACK seems unneeded since WD is not recursive (SA_NODEFER was removed),
also, not sure if it's even valid without a call to sigaltstack()