redis

Commit Graph

Author	SHA1	Message	Date
Harkrishn Patro	9ca8490315	Increase timeout for expiry cluster tests (#12752 ) Test recently added fails on timeout in valgrind in GH actions. Locally with valgrind the test finishes within 1.5 sec(s). Couldn't find any issue due to lack of reproducibility. Increasing the timeout and adding an additional log to the test to understand how many keys were left at the end.	2023-11-11 12:01:04 +02:00
Meir Shpilraien (Spielrein)	0ffb9d2ea9	Before evicted and before expired server events are not executed inside an execution unit. (#12733 ) Redis 7.2 (#9406) introduced a new modules event, `RedisModuleEvent_Key`. This new event allows the module to read the key data just before it is removed from the database (either deleted, expired, evicted, or overwritten). When the key was removed from the database by active expire or eviction, the new event was not called as part of an execution unit. This can cause an issue if the module registers a post notification job inside the event. This job will not be executed atomically with the expiration/eviction operation and will not be replicated inside a Multi/Exec. Moreover, the post notification job will be executed right after the event, where it is still not safe to perform any write operation. This violates the promise that a post notification job will be called atomically with the operation that triggered it and only when it is safe to write. This PR fixes the issue by wrapping each expiration/eviction of a key with an execution unit. This makes sure the entire operation will run atomically and all the post notification jobs will be executed at the end where it is safe to write. Tests were modified to verify the fix.	2023-11-08 09:28:22 +02:00
Yossi Gottlieb	6223355cf3	Use cross-platform-actions for FreeBSD support. (#12732 ) This change overcomes many stability issues experienced with the vmactions action. We need to limit VMs to 8GB for better stability, as the 13GB default seems to hang them occasionally. Shell code has been simplified since this action seems to use `bash -e` which will abort on non-zero exit codes anyway.	2023-11-06 18:07:14 +02:00
Roshan Khatri	15a048d4f0	re-enable defrag tests in cluster mode. (#12710 ) Reverts the skipping of defrag tests in cluster mode (done in #12672). Instead it skips only some defrag tests that are relevant for cluster mode. The tests now run well after investigating and making the changes in #12674 and #12694. Co-authored-by: Oran Agra <oran@redislabs.com>	2023-11-02 13:55:48 +02:00
Viktor Söderqvist	8878817d89	Optimize SCAN with MATCH when pattern implies cluster slot (#12536 ) Optimize the performance of SCAN commands when a match pattern can only contain keys from a single slot in cluster mode. This can happen when the pattern contains a hash tag before any wildcard matchers or when the key contains no matchers.	2023-11-01 00:06:49 -07:00
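The single-slot rule described in 8878817d89 can be illustrated with a small sketch. This is not the code from the PR: the helper name `pattern_single_slot` is hypothetical, and treating every backslash escape as "may span slots" is a conservative assumption made here. The CRC16 variant (XMODEM, polynomial 0x1021) and the `{...}` hash tag rule are the ones documented in the Redis cluster specification.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* CRC16, XMODEM variant (polynomial 0x1021, initial value 0), bit by bit. */
static uint16_t crc16_xmodem(const char *buf, size_t len) {
    uint16_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((unsigned char)buf[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000)
                crc = (uint16_t)((crc << 1) ^ 0x1021);
            else
                crc = (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* Hypothetical helper: returns the slot (0..16383) when every key matching
 * `pattern` must live in a single slot, or -1 when matches may span slots. */
static int pattern_single_slot(const char *pattern) {
    size_t len = strlen(pattern);
    long open = -1; /* index of first '{'; -1 = none yet, -2 = empty "{}" seen */
    for (size_t i = 0; i < len; i++) {
        char c = pattern[i];
        /* A wildcard (or escape, by assumption) before a complete hash tag
         * means the hashed part of the key is not fixed by the pattern. */
        if (c == '*' || c == '?' || c == '[' || c == '\\')
            return -1;
        if (c == '{' && open == -1) {
            open = (long)i;
        } else if (c == '}' && open >= 0) {
            if ((long)i == open + 1) {
                open = -2; /* empty tag: the whole key is hashed */
            } else {
                /* Non-empty tag: only the text between the braces is hashed. */
                return crc16_xmodem(pattern + open + 1, i - (size_t)open - 1) & 0x3FFF;
            }
        }
    }
    /* No wildcard at all: the pattern can only match one literal key. */
    return crc16_xmodem(pattern, len) & 0x3FFF;
}
```

With this sketch, `{user1000}*` and `{user1000}.following` resolve to the same slot, while `user*` and `{user*}` return -1 because a wildcard appears before any complete hash tag.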
Harkrishn Patro	3fac869f02	Fix test, disable expiration until empty buckets are formed (#12689 ) Test failure on freebsd CI: ``` *** [err]: expire scan should skip dictionaries with lot's of empty buckets in tests/unit/expire.tcl scan didn't handle slot skipping logic. ``` Observation: expiry of keys might happen before the empty buckets are formed and won't help with the expiry skip logic validation. Solution: Disable expiration until the empty buckets are formed.	2023-10-24 11:29:40 +03:00
Harkrishn Patro	26eb4ce397	Fix defrag test (#12674 ) Fixing issues started after #11695 when the defrag tests are being executed in cluster mode too. For some reason, it looks like the defragmentation is over too quickly, before the test is able to detect that it's running. so now instead of waiting to see that it's active, we wait to see that it did some work ``` [err]: Active defrag big list: cluster in tests/unit/memefficiency.tcl defrag not started. [err]: Active defrag big keys: cluster in tests/unit/memefficiency.tcl defrag didn't stop. ```	2023-10-22 11:56:45 +03:00
Harkrishn Patro	becd50d0da	Disable flaky defrag tests affecting daily run (#12672 ) Temporarily disabling a few of the defrag tests in cluster mode to make the daily run stable: Active defrag eval scripts, Active defrag big keys, Active defrag big list, Active defrag edge case.	2023-10-19 21:12:58 +03:00
Harkrishn Patro	f3bf8485d8	Fix resize hash table dictionary iterator (#12660 ) Dictionary iterator logic in the `tryResizeHashTables` method is picking the next (incorrect) dictionary while the cursor is at a given slot. This could lead to some dictionary/slot getting skipped from resizing. Also stabilize the test. problem introduced recently in #11695	2023-10-19 13:58:32 +03:00
Vitaly	0270abda82	Replace cluster metadata with slot specific dictionaries (#11695 ) This is an implementation of https://github.com/redis/redis/issues/10589 that eliminates 16 bytes per entry in cluster mode, that are currently used to create a linked list between entries in the same slot. Main idea is splitting main dictionary into 16k smaller dictionaries (one per slot), so we can perform all slot specific operations, such as iteration, without any additional info in the `dictEntry`. For Redis cluster, the expectation is that there will be a larger number of keys, so the fixed overhead of 16k dictionaries will be The expire dictionary is also split up so that each slot is logically decoupled, so that in subsequent revisions we will be able to atomically flush a slot of data. ## Important changes * Incremental rehashing - one big change here is that it's not one, but rather up to 16k dictionaries that can be rehashing at the same time. In order to keep track of them, we introduce a separate queue for dictionaries that are rehashing. Also, instead of rehashing a single dictionary, the cron job will now try to rehash as many as it can in 1ms. * getRandomKey - now needs to not only select a random key from a random bucket, but also needs to select a random dictionary. Fairness is a major concern here, as it's possible that keys can be unevenly distributed across the slots. In order to address this we introduced a binary index tree. With that data structure we are able to efficiently find a random slot using binary search in O(log^2(slot count)) time. * Iteration efficiency - when iterating a dictionary with a lot of empty slots, we want to skip them efficiently. We can do this using the same binary index that is used for random key selection; this index allows us to find a slot for a specific key index. For example, if there are 10 keys in slot 0, then we can quickly find the slot that contains the 11th key using binary search on top of the binary index tree. 
* scan API - in order to perform a scan across the entire DB, the cursor now needs to not only save position within the dictionary but also the slot id. In this change we append slot id into LSB of the cursor so it can be passed around between client and the server. This has an interesting side effect: you'll now be able to start scanning a specific slot by simply providing the slot id as a cursor value. The plan is to not document this as defined behavior, however. It's also worth noting the SCAN API is now technically incompatible with previous versions, although practically we don't believe it's an issue. * Checksum calculation optimizations - During command execution, we know that all of the keys are from the same slot (outside of a few notable exceptions such as cross slot scripts and modules). We don't want to compute the checksum multiple times, hence we are relying on cached slot id in the client during the command executions. All operations that access random keys, either should pass in the known slot or recompute the slot. * Slot info in RDB - in order to resize individual dictionaries correctly, while loading RDB, it's not enough to know total number of keys (of course we could approximate number of keys per slot, but it won't be precise). To address this issue, we've added additional metadata into RDB that contains number of keys in each slot, which can be used as a hint during loading. * DB size - besides `DBSIZE` API, we need to know size of the DB in many places. In order to avoid scanning all dictionaries and summing up their sizes in a loop, we've introduced a new field into `redisDb` that keeps track of `key_count`. This way we can keep DBSIZE operation O(1). This is also kept for O(1) expires computation as well. ## Performance This change improves SET performance in cluster mode by ~5%, most of the gains come from us not having to maintain linked lists for keys in slot, non-cluster mode has same performance. 
For workloads that rely on evictions, the performance is similar because of the extra overhead for finding keys to evict. RDB loading performance is slightly reduced, as the slot of each key needs to be computed during the load. ## Interface changes * Removed `overhead.hashtable.slot-to-keys` from `MEMORY STATS` * Scan API will now require 64 bits to store the cursor, even on 32 bit systems, as the slot information will be stored. * New RDB version to support the new op code for SLOT information. --------- Co-authored-by: Vitaly Arbuzov <arvit@amazon.com> Co-authored-by: Harkrishn Patro <harkrisp@amazon.com> Co-authored-by: Roshan Khatri <rvkhatri@amazon.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Oran Agra <oran@redislabs.com>	2023-10-14 23:58:26 -07:00
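The binary index tree idea from 0270abda82 (find the slot that holds the k-th key, skipping empty slots) can be sketched as a Fenwick tree over per-slot key counts. This is an illustration under assumptions, not the code from the PR: the type `slot_index` and the three function names are hypothetical. The lookup binary-searches over prefix sums, which gives the O(log^2(slot count)) bound mentioned in the commit message.

```c
#define SLOT_COUNT 16384

/* Fenwick (binary indexed) tree over per-slot key counts; tree[0] is unused.
 * A zero-initialized struct represents an empty database. */
typedef struct {
    long long tree[SLOT_COUNT + 1];
} slot_index;

/* Add `delta` keys to `slot` (negative delta removes keys). */
static void slot_index_add(slot_index *ix, int slot, long long delta) {
    for (int i = slot + 1; i <= SLOT_COUNT; i += i & -i)
        ix->tree[i] += delta;
}

/* Total number of keys in slots 0..slot inclusive. */
static long long slot_index_prefix(const slot_index *ix, int slot) {
    long long sum = 0;
    for (int i = slot + 1; i > 0; i -= i & -i)
        sum += ix->tree[i];
    return sum;
}

/* Slot containing the k-th key (1-based), or -1 if k is out of range.
 * Binary search over slots, one O(log n) prefix query per step. */
static int slot_index_find(const slot_index *ix, long long k) {
    if (k < 1 || k > slot_index_prefix(ix, SLOT_COUNT - 1)) return -1;
    int lo = 0, hi = SLOT_COUNT - 1;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (slot_index_prefix(ix, mid) >= k)
            hi = mid;       /* the k-th key is in slot mid or earlier */
        else
            lo = mid + 1;   /* fewer than k keys up to mid: look later */
    }
    return lo;
}
```

With 10 keys in slot 0 and 3 keys in slot 5, `slot_index_find` maps key index 10 to slot 0 and key index 11 to slot 5, which is the example given in the commit message. Picking a uniformly random key index in `[1, total]` and resolving it this way weights each slot by its key count, which is the fairness property the message asks for.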
Oran Agra	f0c1c730d4	test suite: clean server pids after server crashed (#12639 ) when a server in the test suite crashes and is restarted by restart_server, we didn't clean its pid from the list. we can see that when the corrupt-dump-fuzzer hangs, it has a long list of servers to clean, but in fact they're all already dead.	2023-10-13 16:28:52 +03:00
Harkrishn Patro	b784c5375e	Unsubscribe all clients from replica for shard channel if the master ownership changes (#12577 ) Unsubscribe all clients from replica for shard channel if the master ownership changes	2023-10-12 20:48:27 -07:00
zhaozhao.zz	77a65e82b2	support XREAD[GROUP] with BLOCK option in scripts (#12596 ) In #11568 we removed the NOSCRIPT flag from commands and keep the BLOCKING flag. Aiming to allow them in scripts and let them implicitly behave in the non-blocking way. In that sense, the old behavior was to allow LPOP and reject BLPOP, and the new behavior, is to allow BLPOP too, and fail it only in case it ends up blocking. So likewise, so far we allowed XREAD and rejected XREAD BLOCK, and we will now allow that too, and only reject it if it ends up blocking.	2023-10-12 10:54:50 +03:00
Oran Agra	b810384c62	dump server logs on hang in corrupt dump fuzzer test. Recently there were some incidents of hung tests in the CI. When we try to reproduce them, we get an assertion, not a hang. Maybe the server logs will reveal some info.	2023-10-08 16:19:31 +03:00
YaacovHazan	2cf50ddbad	Fix 'load corrupted rdb with no CRC' test (#12629 ) After the change in #12626 (`2e0f6724e`), the is_alive proc gets pid and not server config. This PR aligns it in 'load corrupted rdb with no CRC' test.	2023-10-03 11:09:25 +03:00
meiravgri	4ba9e18ef0	fix crash in crash-report and other improvements (#12623 ) ## Crash fix ### Current behavior We might crash if we fail to collect some of the threads' output, for example if collection exceeds the timeout. The threads mngr API guarantees that the output array length will be `tids_len`, however, some indices can be NULL, in case it fails to collect some of the threads' outputs. When we use the threads mngr to collect the threads' stacktraces, we rely on this and skip NULL entries. Since the output array was allocated with malloc, instead of NULL, it contained garbage, so we got a segmentation fault when trying to read this garbage. (in debug.c:writeStacktraces() ) ### fix Allocate the global output array with zcalloc. ### To reproduce the bug, you'll have to change the code: in threadsmngr:ThreadsManager_runOnThreads(): make sure the g_output_array allocation is initialized with garbage and not 0s (add `memset(g_output_array, 2, sizeof(void*) * tids_len);` below the allocation). Force one of the threads to write to the array: add a global var: `static redisAtomic size_t return_now = 0;` add to `invoke_callback()` before writing to the output array: ``` size_t i_return; atomicGetIncr(return_now, i_return, 1); if(i_return == 1) return; ``` compile, start the server with `--enable-debug-command local` and run `redis-cli debug assert` The assertion triggers the stacktrace collection. Expect to get 2 prints of the stack trace - since we get the segmentation fault after we return from the threads mngr, it can be safely triggered again. ## Added global variables r/w lock in ThreadsManager To avoid a situation where the main thread runs `ThreadsManager_cleanups` while threads are still invoking the signal handler, we use a r/w lock. For cleanups, we will acquire the write lock. The threads will acquire the read lock to enable them to write simultaneously. If we fail to acquire the read lock, it means cleanups are in progress and we return immediately. 
After acquiring the lock we can safely check that the global output array wasn't nullified and proceed to write to it. This way we ensure the threads are not modifying the global variables or trying to write to the output array after they were zeroed/nullified/destroyed (the semaphore). ## other minor logging change 1. removed logging if the semaphore times out because the threads can still write to the output array after this check. Instead, we print the total number of printed stacktraces compared to the expected number (len_tids). 2. use noinline attribute to make sure the uplevel number of ignored stack trace entries stays correct. 3. improve testing Co-authored-by: Oran Agra <oran@redislabs.com>	2023-10-02 20:02:02 +03:00
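The crash fix in 4ba9e18ef0 relies on uncollected entries reading as NULL. A minimal sketch of that contract follows, assuming the standard `calloc` in place of Redis's `zcalloc`; the function names `alloc_output_array` and `count_collected` are hypothetical and only illustrate the pattern.

```c
#include <stdlib.h>

/* Allocate a zero-filled output array so that any slot a thread never
 * wrote to reads as NULL rather than as uninitialized garbage.
 * (Standard calloc is used here as a stand-in for zcalloc.) */
static void **alloc_output_array(size_t tids_len) {
    return calloc(tids_len, sizeof(void *));
}

/* Walk the array the way a consumer would: skip NULL entries, which
 * mean the corresponding thread timed out or failed to report. */
static size_t count_collected(void **output, size_t tids_len) {
    size_t collected = 0;
    for (size_t i = 0; i < tids_len; i++) {
        if (output[i] == NULL) continue;
        collected++;
    }
    return collected;
}
```

Had the array come from `malloc`, the `NULL` check above would be comparing against indeterminate values, which is the failure mode the commit message describes.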
YaacovHazan	2e0f6724e0	Stabilization and improvements around aof tests (#12626 ) In some tests, the code manually searches for a log message, and it uses tail -1 with a delay of 1 second, which can miss the expected line. Also, because the aof tests use start_server_aof and not start_server, the test name doesn't log into the server log. To fix the above, I made the following changes: - Change the start_server_aof to wrap the start_server. This will add the created aof server to the servers list, and make srv() and wait_for_log_messages() available for the tests. - Introduce a new option for start_server. 'wait_ready' - an option to let the caller start the test code without waiting for the server to be ready. useful for tests on a server that is expected to exit on startup. - Create a new start_server_aof_ex. The new proc also accept options as argument and make use of the new 'short_life' option for tests that are expected to exit on startup because of some error in the aof file(s). Because of the above, I had to change many lines and replace every local srv variable (a server config) usage with the srv().	2023-10-02 08:20:53 +03:00
guybe7	c2a4b78491	WAITAOF: Update fsynced_reploff_pending even if there's nothing to fsync (#12622 ) The problem is that WAITAOF could hang in case commands were propagated only to replicas. This can happen if a module uses RM_Call with the REDISMODULE_ARGV_NO_AOF flag. In that case, master_repl_offset would increase, but there would be nothing to fsync, so in the absence of other traffic, fsynced_reploff_pending would stay static, and WAITAOF can hang. This commit updates fsynced_reploff_pending to the latest offset in flushAppendOnlyFile in case there's nothing to fsync, i.e. in case it's behind because of the above-mentioned case, it'll be refreshed and release the WAITAOF. Other changes: Fix a race in wait.tcl (client getting blocked vs. the fsync thread)	2023-09-28 17:19:20 +03:00
guybe7	bfa3931a04	WAITAOF: Update fsynced_reploff_pending just before starting the initial AOFRW fork (#12620 ) If we set `fsynced_reploff_pending` in `startAppendOnly`, and the fork doesn't start immediately (e.g. there's another fork active at the time), any subsequent commands will increment `server.master_repl_offset`, but will not cause a fsync (given they were executed before the fork started, they just ended up in the RDB part of it). Therefore, any WAITAOF will wait on the new master_repl_offset, but it will time out because no fsync will be executed. Release notes: ``` WAITAOF could time out in the absence of write traffic in case a new AOF is created and an AOFRW can't immediately start. This can happen when the appendonly config is changed at runtime, but also after FLUSHALL, and replica full sync. ```	2023-09-28 17:05:53 +03:00
Binbin	9fe63bdc80	Dump server logs when corrupt fuzzer reports crash (#12612 ) Recently we found some signal crashes, but unable to reproduce them. It is a good idea to dump the server logs when a failure happens.	2023-09-27 09:08:18 +03:00
meiravgri	cc2be63997	Print stack trace from all threads in crash report (#12453 ) In this PR we are adding the functionality to collect all the process's threads' backtraces. ## Changes made in this PR ### introduce threads mngr API The threads mngr API has 2 abilities: * `ThreadsManager_init()` - register to SIGUSR2. called on the server start-up. * `ThreadsManager_runOnThreads()` - receives a list of `pid_t` and a callback, tells every thread in the list to invoke the callback, and returns the output collected by each invocation. Elaborating atomicvar API * `atomicIncrGet(var,newvalue_var,count)` -- Increment and get the atomic counter new value * `atomicFlagGetSet` -- Get and set the atomic counter value to 1 ### Always set SIGALRM handler SIGALRM handler prints the process's stacktrace to the log file. Up until now, it was set only if the `server.watchdog_period` > 0. This can be also useful if debugging is needed. However, in situations where the server can't get requests, (a deadlock, for example) we weren't able to change the signal handler. To make it available at run time we set SIGALRM handler on server startup. The signal handler name was changed to a more general `sigalrmSignalHandler`. ### Print all the process' threads' stacktraces `logStackTrace()` now calls `writeStacktraces()`, instead of logging the current thread stacktrace. `writeStacktraces()`: * On Linux systems we use the threads manager API to collect the backtraces of all the process' threads. To get the `tids` list (threads ids) we read the `/proc/<redis-server-pid>/task` directory, which contains one subdirectory per thread. Each directory name corresponds to one tid (including the main thread). For each thread, we also need to check if it can get the signal from the threads manager (meaning it is not blocking/ignoring that signal). We send the threads manager this tids list and `collect_stacktrace_data()` callback, which collects the thread's backtrace addresses, its name, and tid. 
* On other systems, the behavior remained as it was (writing only the current thread stacktrace to the log file). ## compatibility notes 1. The threads mngr API is only supported in linux. 2. glibc earlier than 2.30: We use `syscall(SYS_gettid)` and `syscall(SYS_tgkill...)` because their dedicated alternatives (`gettid()` and `tgkill`) were added in glibc 2.30. ## Output example Each thread backtrace will have the following format: `<tid> <thread_name> [additional_info]` * tid: as read from the `/proc/<redis-server-pid>/task` directory * thread_name: the thread name as it is registered in the OS * additional_info: Sometimes we want to add specific information about one of the threads. Currently, it is only used to mark the thread that handles the backtraces collection by adding "". In case of crash - this also indicates which thread caused the crash. The handling thread won't necessarily appear first. ``` ------ STACK TRACE ------ EIP: /lib/aarch64-linux-gnu/libc.so.6(epoll_pwait+0x9c)[0xffffb9295ebc] 67089 redis-server linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xffffb9437790] /lib/aarch64-linux-gnu/libc.so.6(epoll_pwait+0x9c)[0xffffb9295ebc] redis-server :6379(+0x75e0c)[0xaaaac2fe5e0c] redis-server :6379(aeProcessEvents+0x18c)[0xaaaac2fe6c00] redis-server :6379(aeMain+0x24)[0xaaaac2fe7038] redis-server :6379(main+0xe0c)[0xaaaac3001afc] /lib/aarch64-linux-gnu/libc.so.6(+0x273fc)[0xffffb91d73fc] /lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0x98)[0xffffb91d74cc] redis-server :6379(_start+0x30)[0xaaaac2fe0370] 67093 bio_lazy_free /lib/aarch64-linux-gnu/libc.so.6(+0x79dfc)[0xffffb9229dfc] /lib/aarch64-linux-gnu/libc.so.6(pthread_cond_wait+0x208)[0xffffb922c8fc] redis-server :6379(bioProcessBackgroundJobs+0x174)[0xaaaac30976e8] /lib/aarch64-linux-gnu/libc.so.6(+0x7d5c8)[0xffffb922d5c8] /lib/aarch64-linux-gnu/libc.so.6(+0xe5d1c)[0xffffb9295d1c] 67091 bio_close_file /lib/aarch64-linux-gnu/libc.so.6(+0x79dfc)[0xffffb9229dfc] 
/lib/aarch64-linux-gnu/libc.so.6(pthread_cond_wait+0x208)[0xffffb922c8fc] redis-server :6379(bioProcessBackgroundJobs+0x174)[0xaaaac30976e8] /lib/aarch64-linux-gnu/libc.so.6(+0x7d5c8)[0xffffb922d5c8] /lib/aarch64-linux-gnu/libc.so.6(+0xe5d1c)[0xffffb9295d1c] 67092 bio_aof /lib/aarch64-linux-gnu/libc.so.6(+0x79dfc)[0xffffb9229dfc] /lib/aarch64-linux-gnu/libc.so.6(pthread_cond_wait+0x208)[0xffffb922c8fc] redis-server :6379(bioProcessBackgroundJobs+0x174)[0xaaaac30976e8] /lib/aarch64-linux-gnu/libc.so.6(+0x7d5c8)[0xffffb922d5c8] /lib/aarch64-linux-gnu/libc.so.6(+0xe5d1c)[0xffffb9295d1c] 67089:signal-handler (1693824528) -------- ```	2023-09-24 09:47:23 +03:00
Binbin	96e9dec419	Bump codespell from 2.2.4 to 2.2.5 (#12557 ) and adjustments.	2023-09-08 16:10:17 +03:00
alonre24	044e29dd34	redis-benchmark - add the support for binary strings (#9414 ) Recently, the option of sending an argument from stdin using `-x` flag was added to redis-benchmark (this option is available in redis-cli as well). However, using the `-x` option for sending blobs that contain null characters doesn't work as expected - the argument is trimmed at the first occurrence of `\x00` (unlike in redis-cli). This PR aims to fix this issue and add the support for every binary string input, by sending arguments length to `redisFormatCommandArgv` when processing redis-benchmark command, so we won't treat the arguments as C-strings. Additionally, we add simple test coverage for `-x` (without binary strings), remove an excessive server started in tests, and make sure to select db 0 so that `r` and the benchmark work on the same db. Co-authored-by: Oran Agra <oran@redislabs.com>	2023-09-02 15:37:04 +03:00
Binbin	4ba144a4eb	Add logreqres:skip flag to new INFO obuf limit test (#12537 ) The new test added in #12476 causes reply-schemas-validator to fail. When doing `catch {r get key}`, the req-res output is: ``` 3 get 3 key 12 __argv_end__ $100000 aaaaaaaaaaaaaaaaaaaa...4 info 5 stats 12 __argv_end__ =1670 txt:# Stats ... ``` We can see that the line after `$100000` ends with a 4, which breaks the req-res-log-validator script since the format is wrong. The reason, I guess, is that after the client reconnection (after the output buf limit), we do not add newlines but append args directly. Since obuf-limits.tcl is doing the same thing, and it has the logreqres:skip flag, this PR follows it.	2023-09-01 14:15:11 +03:00
Chen Tianjie	b26e8e3213	Optimize ZRANGE offset location from linear search to skiplist jump. (#12450 ) ZRANGE BYSCORE/BYLEX with [LIMIT offset count] option was using every level in skiplist to jump to the first/last node in range, but only used level[0] in skiplist to locate the node at offset, resulting in sub-optimal performance using LIMIT: ``` while (ln && offset--) { if (reverse) { ln = ln->backward; } else { ln = ln->level[0].forward; } } ``` It could be slow when offset is very big. We can get the total rank of the offset location and use skiplist to jump to it. It is an improvement from O(offset) to O(log rank). Below shows how this is implemented (if the offset is positive): Use the skiplist to search for the first element in the range, record its rank `rank_0`, so we can have the rank of the target node `rank_t`. Meanwhile we record the last node we visited which has zsl->level-1 levels and its rank `rank_1`. Then we start from the zsl->level-1 node, use skiplist to go forward `rank_t-rank_1` nodes to reach the target node. It is very similar when the offset is reversed. Note that if `rank_t` is very close to `rank_0`, we just start from the first element in range and go node by node; this is for the case when the zsl->level-1 node is too far away and it is quicker to reach the target node by node. Here is a test using a randomly generated zset including 10000 elements (with different positive scores), doing a benchmark which compares how fast the `ZRANGE` command is executed before and after the optimization. The start score is set to 0 and the count is set to 1 to make sure that most of the time is spent on locating the offset. 
``` memtier_benchmark -h 127.0.0.1 -p 6379 --command="zrange test 0 +inf byscore limit <offset> 1" ``` \| offset \| QPS(unstable) \| QPS(optimized) \| \|--------\|--------\|--------\| \| 10 \| 73386.02 \| 74819.82 \| \| 1000 \| 48084.96 \| 73177.73 \| \| 2000 \| 31156.79 \| 72805.83 \| \| 5000 \| 10954.83 \| 71218.21 \| With the result above, we can see that the original code is greatly slowed down when offset gets bigger, and with the optimization the speed is almost not affected. Similar results are generated when testing reversed offset: ``` memtier_benchmark -h 127.0.0.1 -p 6379 --command="zrange test +inf 0 byscore rev limit <offset> 1" ``` \| offset \| QPS(unstable) \| QPS(optimized) \| \|--------\|--------\|--------\| \| 10 \| 74505.14 \| 71653.67 \| \| 1000 \| 46829.25 \| 72842.75 \| \| 2000 \| 28985.48 \| 73669.01 \| \| 5000 \| 11066.22 \| 73963.45 \| And the same conclusion is drawn from the tests of ZRANGE BYLEX.	2023-08-31 14:42:08 +03:00
Binbin	9ce8c54d74	Update sort_ro reply_schema to mention the null reply (#12534 ) Also added a test to cover this case, so this can cover the reply schemas check.	2023-08-31 06:36:35 +03:00
Roshan Khatri	7519960527	Allows modules to declare new ACL categories. (#12486 ) This PR adds a new Module API `int RM_AddACLCategory(RedisModuleCtx *ctx, const char *category_name)` to add a new ACL command category. Here, we initialize the ACLCommandCategories array by allocating space for 64 categories and duplicate the 21 default categories from the predefined array 'ACLDefaultCommandCategories' into the ACLCommandCategories array during ACL initialization. Valid ACL category names can only contain alphanumeric characters, underscores, and dashes. When called, the API checks for the onload flag, category name validity, and for a duplicate category name if present. If the conditions are satisfied, the API adds the new category to the trailing end of the ACLCommandCategories array and assigns the acl_categories flag bit according to the index at which the category is added. If any error is encountered the errno is set accordingly by the API. --------- Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>	2023-08-30 13:01:24 -07:00
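The naming rule stated in 7519960527 (alphanumeric characters, underscores, and dashes only) can be sketched as a small validator. The function name `is_valid_category_name` is hypothetical and this is not the check from the PR; rejecting NULL and the empty string is an assumption made here, since the commit message does not say how those are handled.

```c
#include <ctype.h>
#include <stddef.h>

/* Returns 1 if `name` contains only alphanumeric characters, '_' and '-',
 * and 0 otherwise. NULL and "" are rejected (an assumption of this sketch). */
static int is_valid_category_name(const char *name) {
    if (name == NULL || *name == '\0') return 0;
    for (const char *p = name; *p != '\0'; p++) {
        /* Cast to unsigned char: passing a negative char to isalnum()
         * is undefined behavior. */
        unsigned char c = (unsigned char)*p;
        if (!isalnum(c) && c != '_' && c != '-') return 0;
    }
    return 1;
}
```

Note that `isalnum()` follows the current C locale; in the default "C" locale it accepts only ASCII letters and digits.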
bodong.ybd	b59f53efb3	Fix sort_ro get-keys function return wrong key number (#12522 ) Before: ``` 127.0.0.1:6379> command getkeys sort_ro key (empty array) 127.0.0.1:6379> ``` After: ``` 127.0.0.1:6379> command getkeys sort_ro key 1) "key" 127.0.0.1:6379> ```	2023-08-30 22:00:02 +03:00
Chen Tianjie	e3d4b30d09	Add two stats to count client input and output buffer oom. (#12476 ) Add these INFO metrics: * client_query_buffer_limit_disconnections * client_output_buffer_limit_disconnections Sometimes it is useful to monitor whether clients reaches size limit of query buffer and output buffer, to decide whether we need to adjust the buffer size limit or reduce client query payload.	2023-08-30 21:51:14 +03:00
Binbin	e792653753	Add printing for LATENCY related tests (#12514 ) This test failed several times: ``` *** [err]: LATENCY GRAPH can output the event graph in tests/unit/latency-monitor.tcl Expected '478' to be more than or equal to '500' (context: type eval line 8 cmd {assert_morethan_equal $high 500} proc ::test) ``` Not sure why, adding some verbose printing that'll print the command result on the next time.	2023-08-27 11:42:55 +03:00
Binbin	1407ac1f3e	BITCOUNT and BITPOS with non-existing key and illegal arguments should return error, not 0 (#11734 ) BITCOUNT and BITPOS with a non-existing key return 0 even when the arguments are invalid. Before this commit: ``` > flushall OK > bitcount s 0 (integer) 0 > bitpos s 0 0 1 hello (integer) 0 > set s 1 OK > bitcount s 0 (error) ERR syntax error > bitpos s 0 0 1 hello (error) ERR syntax error ``` The reason is that we checked for a non-existing key and returned before checking the parameters. This PR fixes it, and after this commit: ``` > flushall OK > bitcount s 0 (error) ERR syntax error > bitpos s 0 0 1 hello (error) ERR syntax error ``` BITPOS also gets the same fix as #12394: check for wrong arguments before checking for the key. ``` > lpush mylist a b c (integer) 3 > bitpos mylist 1 a b (error) WRONGTYPE Operation against a key holding the wrong kind of value ```	2023-08-21 19:48:30 +03:00
Wen Hui	45d3310694	BITCOUNT: check for argument, before checking for key (#12394 ) Generally, In any command we first check for the argument and then check if key exist. Some of the examples are ``` 127.0.0.1:6379> getrange no-key invalid1 invalid2 (error) ERR value is not an integer or out of range 127.0.0.1:6379> setbit no-key 1 invalid (error) ERR bit is not an integer or out of range 127.0.0.1:6379> xrange no-key invalid1 invalid2 (error) ERR Invalid stream ID specified as stream command argument ``` Before change ``` bitcount no-key invalid1 invalid2 0 ``` After change ``` bitcount no-key invalid1 invalid2 (error) ERR value is not an integer or out of range ```	2023-08-21 12:53:46 +03:00
Wen Hui	e532c95dfc	Added tests for Client commands (#10276 ) Our test suite was missing coverage for some client sub-commands. The goal of this PR is to add test coverage for the following commands: Client caching, Client kill, Client no-evict, Client pause, Client reply, Client tracking, Client setname. At the very least, this is useful to make sure there are no leaks and crashes in these code paths.	2023-08-20 19:17:51 +03:00
Oran Agra	2b8cde71bb	Update supported version list. (#12488 ) Add 7.2, drop 6.0 as per https://redis.io/docs/about/releases/ Also replace a few occurrences of the `’` char with the standard `'`	2023-08-16 08:36:40 +03:00
Madelyn Olson	7c179f9bf4	Fixed a bug where sequential matching ACL rules weren't compressed (#12472 ) When a new ACL rule was added, an attempt was made to remove any "overlapping" rules. However, when a match was found, the search was not resumed at the right location, but instead after the original position of the original command. For example, if the current rules were `-config +config\|get` and a rule `+config` was added, it would identify that `-config` was matched, but it would skip over `+config\|get`, leaving the compacted rule `-config +config`. This would be evaluated safely, but looks weird. This bug can only be triggered with subcommands, since that is the only way to have sequential matching rules. Resolves #12470. This is also only present in 7.2. I think there was also a minor risk of removing another valid rule, since it would start the search of the next command at an arbitrary point. I couldn't find a valid offset that would have caused a match using any of the existing commands that have subcommands with another command.	2023-08-10 09:58:53 +03:00
Binbin	6abfda54c3	Fix flaky SENTINEL RESET test (#12437 ) After SENTINEL RESET, sometimes the sentinel can sense the master again, causing the test to fail. Here we give it a few more chances.	2023-08-10 08:58:52 +03:00
zhaozhao.zz	8226f39fb2	do not call handleClientsBlockedOnKeys inside yielding command (#12459 ) Fix the assertion when a busy script (timeout) signals ready keys (like LPUSH), and then an arbitrary client's `allow-busy` command steps into `handleClientsBlockedOnKeys` and tries to wake up clients blocked on keys (like BLPOP). Reproduction process: 1. start a redis with aof `./redis-server --appendonly yes` 2. exec blpop `127.0.0.1:6379> blpop a 0` 3. use another client to call a busy script that pushes to the blocked key `127.0.0.1:6379> eval "redis.call('lpush','a','b') while(1) do end" 0` 4. use a new client to call an allow-busy command like auth `127.0.0.1:6379> auth a` BTW, this issue also breaks the atomicity of scripts. This bug has been around for many years; the old versions only have the atomicity problem, only 7.0/7.2 have the assertion problem. Co-authored-by: Oran Agra <oran@redislabs.com>	2023-08-05 09:52:03 +03:00
sundb	da9c2804a5	Avoid mostly harmless integer overflow in cjson (#12456 ) This PR mainly fixes a possible integer overflow in `json_append_string()`. When we use `cjson.encode()` to encode a string larger than 2GB, with specific compilation flags, an integer overflow may occur leading to truncation, resulting in the part of the string larger than 2GB not being encoded. On the other hand, this overflow doesn't cause any out-of-range read or write or segmentation fault. 1) using -O0 for lua_cjson (`make LUA_DEBUG=yes`) In this case, `i` will overflow, leading to truncation. When `i` reaches `INT_MAX+1` and overflows to INT_MIN, when compared to len, `i` (1000000..00) is expanded to a 64-bit signed integer (1111111.....000000). At this point i will be greater than len and jump out of the loop, so `for (i = 0; i < len; i++)` will loop up to 2^31 times, and the part larger than 2GB will be truncated. ```asm `i` => -0x24(%rbp) <+253>: addl $0x1,-0x24(%rbp) ; overflow if i is larger than 2^31 <+257>: mov -0x24(%rbp),%eax <+260>: movslq %eax,%rdx ; move a 32-bit value with sign extension into a 64-bit signed <+263>: mov -0x20(%rbp),%rax <+267>: cmp %rax,%rdx ; check `i < len` <+270>: jb 0x212600 <json_append_string+148> ``` 2) using -O2/-O3 for lua_cjson (`make LUA_DEBUG=no`, the default) In this case, because signed integer overflow is undefined behavior, `i` will not overflow. `i` will be optimized by the compiler and use 64-bit registers for all subsequent instructions. ```asm <+180>: add $0x1,%rbx ; Using 64-bit register `rbx` for i++ <+184>: lea 0x1(%rdx),%rsi <+188>: mov %rsi,0x10(%rbp) <+192>: mov %al,(%rcx,%rdx,1) <+195>: cmp %rbx,(%rsp) ; check `i < len` <+199>: ja 0x20b63a <json_append_string+154> ``` 3) using 32bit Because `strbuf_ensure_empty_length()` preallocates memory of length (len * 6 + 2), in 32-bit `cjson.encode()` can only handle strings smaller than ((2 ^ 32) - 3 ) / 6. So 32bit is not affected. Also change `i` in `strbuf_append_string()` to `size_t`. 
Its second argument `str` is taken from the `char2escape` string array, whose entries are never longer than 6, so `strbuf_append_string()` is not at risk of overflow (the bug was unreachable).	2023-08-05 07:57:06 +03:00
zhaozhao.zz	90ab91f00b	fix false success and a memory leak for ACL selector with bad parenthesis combination (#12452 ) When doing merge selector, we should check whether the merge has started (i.e., whether open_bracket_start is -1) every time. Otherwise, encountering an illegal selector pattern could succeed and also cause memory leaks, for example: ``` acl setuser test1 (+PING (+SELECT (+DEL ) ``` The above would leak memory and succeed with only DEL being applied, and would now error after the fix. Co-authored-by: Oran Agra <oran@redislabs.com>	2023-08-02 10:46:06 +03:00
Diego Lopez Recas	b653c759cd	Fix race condition in tests/unit/auth.tcl (#12444 ) Changing the masterauth while turning into a replica is racy. Turn into replica after changing the masterauth instead.	2023-08-01 18:03:33 +03:00
DarrenJiang13	6abb3c4038	change log match to line match in tcl sanitizer_errors_from_file. (#12446 ) In the tcl foreach loop, the function should compare line rather than the whole file.	2023-07-30 08:48:29 +03:00
Harkrishn Patro	42985b00ea	Test coverage for incr/decr operation on robj encoding type optimization (#12435 ) Additional test coverage for incr/decr operation. An integer could be present in raw encoding format due to an operation like APPEND. An incr/decr operation following it optimizes the string to int encoding format.	2023-07-25 16:43:31 -07:00
Harkrishn Patro	34b95f752c	Add test case for APPEND command usage on integer value (#12429 ) Add test coverage to validate object encoding update on APPEND command usage on an integer value	2023-07-24 18:25:50 -07:00
Makdon	2495b90a64	redis-cli: use previous hostip when not provided by redis cluster server (#12273 ) When a redis cluster server runs in cluster-preferred-endpoint-type unknown-endpoint mode and receives a request that should be redirected to another redis server node, it does not reply with the hostip, but with an empty host, like MOVED 3999 :6381. redis-cli would then try to connect to an address without a host, which causes the issue: ``` 127.0.0.1:7002> set bar bar -> Redirected to slot [5061] located at :7000 Could not connect to Redis at :7000: No address associated with hostname Could not connect to Redis at :7000: No address associated with hostname not connected> exit ``` In this case, redis-cli should use the previous hostip when there's no host provided by the server. --------- Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Co-authored-by: Madelyn Olson <madelynolson@gmail.com>	2023-07-20 15:31:06 -07:00
Oran Agra	936cfa464f	Lua cjson and cmsgpack integer overflow issues (CVE-2022-24834) (#12398 ) * Fix integer overflows due to using wrong integer size. * Add assertions / panic when overflow still happens. * Deletion of dead code to avoid need to maintain it * Some changes are not because of bugs, but rather paranoia. * Improve cmsgpack and cjson test coverage. Co-authored-by: Yossi Gottlieb <yossigo@gmail.com>	2023-07-10 10:26:09 +03:00
Sankar	1190f25ca7	Process loss of slot ownership in cluster bus (#12344 ) When a node no longer owns a slot, it clears the bit corresponding to the slot in the cluster bus messages. The receiving nodes currently don't record the fact that the sender stopped claiming a slot until some other node in the cluster starts claiming the slot. This can cause a slot to go missing during slot migration when subjected to inopportune race with addition of new shards or a failover. This fix forces the receiving nodes to process the loss of ownership to avoid spreading wrong information.	2023-07-05 17:46:23 -07:00
Binbin	d56a7d9b05	Fix and increase tolerance of event loop test, add verbose logs (#12385 ) The test fails on freebsd CI: ``` *** [err]: stats: eventloop metrics in tests/unit/info.tcl Expected '31777' to be less than '16183' (context: type eval line 17 cmd {assert_lessthan $el_sum2 [expr $el_sum1+10000] } proc ::test) ``` The test added in #11963 fails on freebsd CI, which is slow. Increase the tolerance and also add some verbose logs; now we can see these logs in verbose mode (for better views): ``` eventloop metrics cycle1: 12, cycle2: 15 eventloop metrics el_sum1: 315, el_sum2: 411 eventloop metrics cmd_sum1: 126, cmd_sum2: 137 [ok]: stats: eventloop metrics (111 ms) instantaneous metrics instantaneous_eventloop_cycles_per_sec: 8 instantaneous metrics instantaneous_eventloop_duration_usec: 55 [ok]: stats: instantaneous metrics (1603 ms) [ok]: stats: debug metrics (112 ms) ```	2023-07-05 09:32:30 +03:00
Lior Lahav	b7559d9f3b	Fix possible crash in command getkeys (#12380 ) When getKeysUsingKeySpecs processes a command with more than one key-spec, and called with a total of more than 256 keys, it'll call getKeysPrepareResult again, but since numkeys isn't updated, getKeysPrepareResult will not bother to copy key names from the old result (leaving these slots uninitialized). Furthermore, it did not consider the keys it already found when allocating more space. Co-authored-by: Oran Agra <oran@redislabs.com>	2023-07-03 12:45:18 +03:00
Binbin	6b57241fa8	Revert zrangeGenericCommand negative offset check (#12377 ) The negative offset check was added in #9052, we realized that this is a non-mandatory breaking change and we would like to add it only in 8.0. This reverts PR #9052, will be re-introduced later in 8.0.	2023-07-02 20:38:30 +03:00
judeng	07ed0eafa9	improve performance for scan command when matching pattern or data type (#12209 ) Optimized the performance of the SCAN command in a few ways: 1. Move the key filtering (by MATCH pattern) into the scan callback, so as to avoid collecting keys for later filtering. 2. Reduce many memory allocations and copies (use a reference to the original sds, instead of creating an robj, an excessive 2 mallocs and one string duplication). 3. Compare TYPE filter directly (as integers), instead of an inefficient string compare per key. 4. Fixed a small bug: when scanning zset and hash types, maxiterations uses a more accurate number to avoid wrongly doubling maxiterations. Changes postponed for a later version (8.0): 1. Prepare to move the TYPE filtering to the scan callback as well. This was put on hold since it has side effects that can be considered a breaking change, which is that we will not attempt to do lazy expire (delete) a key that was filtered by not matching the TYPE (changing it would mean TYPE filter starts behaving the same as MATCH filter already does in that respect). 2. When the specified key TYPE filter is an unknown type, the server will reply with an error immediately instead of doing a full scan that comes back empty handed. Benchmark result: For different scenarios, we obtained about 30% or more performance improvement. Co-authored-by: Oran Agra <oran@redislabs.com>	2023-06-27 16:43:46 +03:00
Chen Tianjie	22a29935ff	Support TLS service when "tls-cluster" is not enabled and persist both plain and TLS port in nodes.conf (#12233 ) Originally, when "tls-cluster" is enabled, `port` is set to TLS port. In order to support non-TLS clients, `pport` is used to propagate TCP port across cluster nodes. However when "tls-cluster" is disabled, `port` is set to TCP port, and `pport` is not used, which means the cluster cannot provide TLS service unless "tls-cluster" is on. ``` typedef struct { // ... uint16_t port; /* Latest known clients port (TLS or plain). */ uint16_t pport; /* Latest known clients plaintext port. Only used if the main clients port is for TLS. */ // ... } clusterNode; ``` ``` typedef struct { // ... uint16_t port; /* TCP base port number. */ uint16_t pport; /* Sender TCP plaintext port, if base port is TLS */ // ... } clusterMsg; ``` This PR renames `port` and `pport` in `clusterNode` to `tcp_port` and `tls_port`, to record both ports regardless of whether "tls-cluster" is enabled or disabled. This allows providing TLS service to clients when "tls-cluster" is disabled: when displaying cluster topology, or giving a `MOVED` error, the server can provide the TLS or TCP port according to the client's connection type, no matter what type of connection the cluster bus is using. For backwards compatibility, `port` and `pport` in `clusterMsg` are preserved: when "tls-cluster" is enabled, `port` is set to TLS port and `pport` is set to TCP port; when "tls-cluster" is disabled, `port` is set to TCP port and `pport` is set to TLS port (instead of 0). Also, in the nodes.conf file, a new aux field displaying an extra port is added to complete the persisted info. We may have `tls_port=xxxxx` or `tcp_port=xxxxx` in the aux field, to complete the cluster topology, while the other port is stored in the normal `<ip>:<port>` field. The format is shown below. 
``` <node-id> <ip>:<tcp_port>@<cport>,<hostname>,shard-id=...,tls-port=6379 myself,master - 0 0 0 connected 0-1000 ``` Or we can switch the position of two ports, both can be correctly resolved. ``` <node-id> <ip>:<tls_port>@<cport>,<hostname>,shard-id=...,tcp-port=6379 myself,master - 0 0 0 connected 0-1000 ```	2023-06-26 07:43:38 -07:00
Meir Shpilraien (Spielrein)	153f8f082e	Fix use after free on blocking RM_Call. (#12342 ) Blocking RM_Call was introduced in #11568. It allows a module to perform blocking commands and get the reply asynchronously. If the command gets blocked, a special promise CallReply is returned that allows setting the unblock handler. The unblock handler will be called when the command invocation finishes, and it gets, as input, the command's real reply. The issue was that the real CallReply was created using a stack allocated RedisModuleCtx which is no longer available after the unblock handler finishes. So if the module keeps the CallReply after the unblock handler finished, the CallReply holds a pointer to invalid memory and will try to access it when the CallReply is released. The solution is to create the CallReply with a NULL context to make it totally detached, so it can be freed whenever the module wants. A test was added to cover this case; running the test with valgrind before the fix shows the use after free error. With the fix, there are no valgrind errors. unrelated: adding a missing `$rd close` in many tests in that file.	2023-06-25 14:12:27 +03:00
guybe7	3230199920	Modules: Unblock from within a timer coverage (#12337 ) Apart from adding the missing coverage, this PR also adds `blockedBeforeSleep` that gathers all block-related functions from `beforeSleep`. The order inside `blockedBeforeSleep` is different: now `handleClientsBlockedOnKeys` (which may unblock clients) is called before `processUnblockedClients` (which handles unblocked clients). It makes sense to have this order. There are no visible effects of the wrong ordering, except some cleanups of the now-unblocked client would have happened in the next `beforeSleep` (will now happen in the current one). The reason we even got into it is because it triggers an assertion in logreqres.c (breaking the assumption that `unblockClient` is called before actually flushing the reply to the socket): `handleClientsBlockedOnKeys` is called, then it calls `moduleUnblockClientOnKey`, which calls `moduleUnblockClient`, which adds the client to `moduleUnblockedClients`. Back in `beforeSleep`, we call `handleClientsWithPendingWritesUsingThreads`, it writes the data of buf to the client, so `client->bufpos` became 0. On the next `beforeSleep`, we call `moduleHandleBlockedClients`, which calls `unblockClient`, which calls `reqresAppendResponse`, triggering the assert (because the `bufpos` is 0) - see https://github.com/redis/redis/pull/12301#discussion_r1226386716	2023-06-22 23:15:16 +03:00
Madelyn Olson	73cf0243df	Make nodename test more consistent (#12330 ) To determine when everything was stable, we couldn't just query the nodename since they aren't API visible by design. Instead, we were using a proxy piece of information which was bumping the epoch and waiting for everyone to observe that. This works for making sure Node 0 and Node 1 had pinged, and Node 0 and Node 2 had pinged, but did not guarantee that Node 1 and Node 2 had pinged. Although unlikely, this can cause this failure message. To fix it I hijacked hostnames and used its validation that it has been propagated, since we know that it is stable. I also noticed while stress testing this that sometimes the test took almost 4.5 seconds to finish, which is really close to the current 5 second limit of the log check, so I bumped that up as well just to make it a bit more consistent.	2023-06-20 18:00:55 -07:00
guybe7	20fa156067	Align RM_ReplyWithErrorFormat with RM_ReplyWithError (#12321 ) Introduced by https://github.com/redis/redis/pull/11923 (Redis 7.2 RC2) It's very weird and counterintuitive that `RM_ReplyWithError` requires the error-code without a hyphen while `RM_ReplyWithErrorFormat` requires either the error-code with a hyphen or no error-code at all ``` RedisModule_ReplyWithError(ctx, "BLA bla bla"); ``` vs. ``` RedisModule_ReplyWithErrorFormat(ctx, "-BLA %s", "bla bla"); ``` This commit aligns RM_ReplyWithErrorFormat to behave like RM_ReplyWithError. It's a breaking change but it's done before 7.2 goes GA.	2023-06-20 20:44:43 +03:00
Oran Agra	8ad8f0f9d8	Fix broken protocol when PUBLISH emits local push inside MULTI (#12326 ) When a connection that's subscribed to a channel emits PUBLISH inside MULTI-EXEC, the push notification messes up the EXEC response. e.g. MULTI, PING, PUBLISH foo bar, PING, EXEC the EXEC's response will contain: PONG, {message foo bar}, 1. and the second PONG will be delivered outside the EXEC's response. Additionally, this PR changes the order of responses in case of a plain PUBLISH (when the current client also subscribed to it), by delivering the push after the command's response instead of before it. This also affects modules calling RM_PublishMessage in a similar way, so that we don't run the risk of getting that push mixed together with the module command's response.	2023-06-20 20:41:41 +03:00
Binbin	d9c2ef8a72	zrangeGenericCommand add check for negative offset (#9052 ) Now we will check the offset in zrangeGenericCommand. With a negative offset, we will throw an error and return. This also resolves the issue of zeroing the destination key in case of the "store" variant when we input a negative offset. ``` 127.0.0.1:6379> set key value OK 127.0.0.1:6379> zrangestore key myzset 0 10 byscore limit -1 10 (integer) 0 127.0.0.1:6379> exists key (integer) 0 ``` This change affects the following commands: - ZRANGE / ZRANGESTORE / ZRANGEBYLEX / ZRANGEBYSCORE - ZREVRANGE / ZREVRANGEBYSCORE / ZREVRANGEBYLEX	2023-06-20 14:33:17 +03:00
Wen Hui	66ea178cb6	adding geo command edge cases tests (#12274 ) For geosearch and georadius we already have test coverage for wrong type, but we don't have it for the geodist, geohash, geopos commands. So adding the wrong type test cases for the geodist, geohash, geopos commands. In the existing code, we have the verify_geo_edge_response_bymember function for wrong type test cases, which has member as an option. But the function is being called in other test cases where the output is not in line with these commands (geodist, geohash, geopos). So I could not include these commands (geodist, geohash, geopos) as part of the existing function, hence implemented a new function verify_geo_edge_response_generic and called it from the test case.	2023-06-20 12:50:03 +03:00
Binbin	d306d86146	Fix ZRANK/ZREVRANK reply_schema description (#12331 ) The parameter name is WITHSCORE instead of WITHSCORES.	2023-06-20 11:15:40 +03:00
Wen Hui	813924b4c3	Sanitizer reported memory leak for '--invalid' option or port number is missed cases to redis-server. (#12322 ) The sanitizer reported a memory leak because cleanup was not done before process termination in the following negative cases: - when we passed '--invalid' as an option to redis-server. ``` -vm:~/mem-leak-issue/redis$ ./src/redis-server --invalid * FATAL CONFIG FILE ERROR (Redis 255.255.255) * Reading the configuration file, at line 2 >>> 'invalid' Bad directive or wrong number of arguments ================================================================= ==865778==ERROR: LeakSanitizer: detected memory leaks Direct leak of 8 byte(s) in 1 object(s) allocated from: #0 0x7f0985f65867 in __interceptor_malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:145 #1 0x558ec86686ec in ztrymalloc_usable_internal /home/ubuntu/mem-leak-issue/redis/src/zmalloc.c:117 #2 0x558ec86686ec in ztrymalloc_usable /home/ubuntu/mem-leak-issue/redis/src/zmalloc.c:135 #3 0x558ec86686ec in ztryrealloc_usable_internal /home/ubuntu/mem-leak-issue/redis/src/zmalloc.c:276 #4 0x558ec86686ec in zrealloc /home/ubuntu/mem-leak-issue/redis/src/zmalloc.c:327 #5 0x558ec865dd7e in sdssplitargs /home/ubuntu/mem-leak-issue/redis/src/sds.c:1172 #6 0x558ec87a1be7 in loadServerConfigFromString /home/ubuntu/mem-leak-issue/redis/src/config.c:472 #7 0x558ec87a13b3 in loadServerConfig /home/ubuntu/mem-leak-issue/redis/src/config.c:718 #8 0x558ec85e6f15 in main /home/ubuntu/mem-leak-issue/redis/src/server.c:7258 #9 0x7f09856e5d8f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58 SUMMARY: AddressSanitizer: 8 byte(s) leaked in 1 allocation(s). ``` - when we passed '--port' as an option to redis-server without a port number. 
``` vm:~/mem-leak-issue/redis$ ./src/redis-server --port * FATAL CONFIG FILE ERROR (Redis 255.255.255) * Reading the configuration file, at line 2 >>> 'port' wrong number of arguments ================================================================= ==865846==ERROR: LeakSanitizer: detected memory leaks Direct leak of 8 byte(s) in 1 object(s) allocated from: #0 0x7fdcdbb1f867 in __interceptor_malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:145 #1 0x557e8b04f6ec in ztrymalloc_usable_internal /home/ubuntu/mem-leak-issue/redis/src/zmalloc.c:117 #2 0x557e8b04f6ec in ztrymalloc_usable /home/ubuntu/mem-leak-issue/redis/src/zmalloc.c:135 #3 0x557e8b04f6ec in ztryrealloc_usable_internal /home/ubuntu/mem-leak-issue/redis/src/zmalloc.c:276 #4 0x557e8b04f6ec in zrealloc /home/ubuntu/mem-leak-issue/redis/src/zmalloc.c:327 #5 0x557e8b044d7e in sdssplitargs /home/ubuntu/mem-leak-issue/redis/src/sds.c:1172 #6 0x557e8b188be7 in loadServerConfigFromString /home/ubuntu/mem-leak-issue/redis/src/config.c:472 #7 0x557e8b1883b3 in loadServerConfig /home/ubuntu/mem-leak-issue/redis/src/config.c:718 #8 0x557e8afcdf15 in main /home/ubuntu/mem-leak-issue/redis/src/server.c:7258 #9 0x7fdcdb29fd8f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58 Indirect leak of 10 byte(s) in 1 object(s) allocated from: #0 0x7fdcdbb1fc18 in __interceptor_realloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:164 #1 0x557e8b04f9aa in ztryrealloc_usable_internal /home/ubuntu/mem-leak-issue/redis/src/zmalloc.c:287 #2 0x557e8b04f9aa in ztryrealloc_usable /home/ubuntu/mem-leak-issue/redis/src/zmalloc.c:317 #3 0x557e8b04f9aa in zrealloc_usable /home/ubuntu/mem-leak-issue/redis/src/zmalloc.c:342 #4 0x557e8b033f90 in _sdsMakeRoomFor /home/ubuntu/mem-leak-issue/redis/src/sds.c:271 #5 0x557e8b033f90 in sdsMakeRoomFor /home/ubuntu/mem-leak-issue/redis/src/sds.c:295 #6 0x557e8b033f90 in sdscatlen /home/ubuntu/mem-leak-issue/redis/src/sds.c:486 #7 0x557e8b044e1f in sdssplitargs 
/home/ubuntu/mem-leak-issue/redis/src/sds.c:1165 #8 0x557e8b188be7 in loadServerConfigFromString /home/ubuntu/mem-leak-issue/redis/src/config.c:472 #9 0x557e8b1883b3 in loadServerConfig /home/ubuntu/mem-leak-issue/redis/src/config.c:718 #10 0x557e8afcdf15 in main /home/ubuntu/mem-leak-issue/redis/src/server.c:7258 #11 0x7fdcdb29fd8f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58 SUMMARY: AddressSanitizer: 18 byte(s) leaked in 2 allocation(s). ``` The analysis found that sdsfreesplitres is not called when these condition checks are hit. Output after the fix: ``` vm:~/mem-leak-issue/redis$ ./src/redis-server --invalid * FATAL CONFIG FILE ERROR (Redis 255.255.255) * Reading the configuration file, at line 2 >>> 'invalid' Bad directive or wrong number of arguments vm:~/mem-leak-issue/redis$ =========================================== vm:~/mem-leak-issue/redis$ ./src/redis-server --jdhg * FATAL CONFIG FILE ERROR (Redis 255.255.255) * Reading the configuration file, at line 2 >>> 'jdhg' Bad directive or wrong number of arguments --------------------------------------------------------------------------- vm:~/mem-leak-issue/redis$ ./src/redis-server --port * FATAL CONFIG FILE ERROR (Redis 255.255.255) * Reading the configuration file, at line 2 >>> 'port' wrong number of arguments ``` Co-authored-by: Oran Agra <oran@redislabs.com>	2023-06-20 10:07:29 +03:00
Shaya Potter	07316f16f0	Add ability for modules to know which client is being cmd filtered (#12219 ) Adds API - RedisModule_CommandFilterGetClientId() Includes addition to commandfilter test module to validate that it works by performing the same command from 2 different clients	2023-06-20 09:03:52 +03:00
Wen Hui	070453eef3	Cluster human readable nodename feature (#9564 ) This PR adds a human readable name to a node in clusters that are visible as part of error logs. This is useful so that admins and operators of Redis cluster have better visibility into failures without having to cross-reference the generated ID with some logical identifier (such as pod-ID or EC2 instance ID). This is mentioned in #8948. Specific nodenames can be set by using the variable cluster-announce-human-nodename. The nodename is gossiped using the clusterbus extension in #9530. Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>	2023-06-17 21:16:51 -07:00
sundb	b00a235186	Use Reservoir Sampling for random sampling of dict, and fix hang during fork (#12276 ) ## Issue: When a dict has a long chain or the length of the chain is longer than the number of samples, we will never be able to sample the elements at the end of the chain using dictGetSomeKeys(). This could mean that SRANDMEMBER can hang in an endless loop. The most severe case is the pathological case of when someone uses SCAN+DEL or SSCAN+SREM, creating an unevenly distributed dict. This was amplified by the recent change in #11692 which prevented a down-sizing rehashing while there is a fork. ## Solution 1. Before, we would stop sampling when we reached the maximum number of samples, even if there was more data after the current chain. Now when we reach the maximum we use the Reservoir Sampling algorithm to fairly sample the end of the chain that could not otherwise be sampled. 2. Fix the rehashing code so that, just as it allows rehashing for up-sizing during fork when the ratio is extreme, it allows it for down-sizing as well. Issue was introduced (or became more severe) by #11692 Co-authored-by: Oran Agra <oran@redislabs.com>	2023-06-16 19:13:07 +03:00
Binbin	439b0315c8	Fix SPOP/RESTORE propagation when doing lazy free (#12320 ) In SPOP, when COUNT is greater than or equal to set's size, we will remove the set. In dbDelete, we will do DEL or UNLINK according to the lazy flag. This is also required for propagation. In RESTORE, we won't store expired keys into the db, see #7472. When used together with REPLACE, it should emit a DEL or UNLINK according to the lazy flag. This PR also adds tests to cover the propagation. The RESTORE test will also cover #7472.	2023-06-16 08:14:11 -07:00
YaacovHazan	5da9eecdb8	Removing duplicated tests (#12318 ) In `4ba47d2d2` the following tests were added in both tracking.tcl and introspection.tcl - Coverage: Basic CLIENT CACHING - Coverage: Basic CLIENT REPLY - Coverage: Basic CLIENT TRACKINGINFO - Coverage: Basic CLIENT GETREDIR	2023-06-16 15:55:24 +03:00
Binbin	9dc6f93e9d	Add command being unblocked cause another command to get unblocked execution order test (#12324 ) * Add command being unblocked cause another command to get unblocked execution order test In #12301, we observed that if the `while(listLength(server.ready_keys) != 0)` in handleClientsBlockedOnKeys is changed to `if(listLength(server.ready_keys) != 0)`, the order of command execution will change. It is wrong to change that. It means that if a command being unblocked causes another command to get unblocked (like a BLMOVE would do), then the new unblocked command will wait for later to get processed rather than right away. It'll not have any real implication if we change that since we do call handleClientsBlockedOnKeys in beforeSleep again, and redis will still behave correctly, but we don't change that. An example: 1. $rd1 blmove src{t} dst{t} left right 0 2. $rd2 blmove dst{t} src{t} right left 0 3. $rd3 set key1{t}, $rd3 lpush src{t}, $rd3 set key2{t} in a pipeline The correct order would be: 1. set key1{t} 2. lpush src{t} 3. lmove src{t} dst{t} left right 4. lmove dst{t} src{t} right left 5. set key2{t} The wrong order would be: 1. set key1{t} 2. lpush src{t} 3. lmove src{t} dst{t} left right 4. set key2{t} 5. lmove dst{t} src{t} right left This PR adds corresponding test to cover it. * Add comment near while(listLength(server.ready_keys) != 0)	2023-06-16 15:39:00 +03:00
Binbin	254475537a	Optimize SET PXAT to reduce calls of rewriteClientCommandVector (#12316 ) In the PXAT case, there is no need to call rewriteClientCommandVector; a simple benchmark shows we gain an improvement of 10%.	2023-06-15 10:07:47 +03:00
Wen Hui	1946083410	Removing the duplicate test case (#12310 ) Looks like the Zadd test case was copied to create the Zincrby test case, but the command was not changed.	2023-06-14 10:03:33 +03:00
Harkrishn Patro	a9e32767f7	Allow cluster slots/shards api to respond during loading (#12269 ) It would be helpful for clients to get cluster slots/shards information during a node failover while the node is loading data.	2023-06-13 18:16:32 +03:00
Binbin	e7129e43e0	Fix XREADGROUP BLOCK stuck in endless loop (#12301 ) For the XREADGROUP BLOCK > scenario, there is an endless loop. Due to #11012, it keeps going: reprocess command -> blockForKeys -> reprocess command. The right fix is to avoid an endless loop in handleClientsBlockedOnKey and handleClientsBlockedOnKeys. It looks like there was some attempt in handleClientsBlockedOnKeys but maybe not a sufficiently good one, and it looks like using a similar trick in handleClientsBlockedOnKey is complicated, i.e. stashing the list on the stack and iterating on it after creating a fresh one for future use is problematic since the code keeps accessing the global list. Co-authored-by: Oran Agra <oran@redislabs.com>	2023-06-13 13:27:05 +03:00
Oran Agra	f228ec1ea5	flushSlavesOutputBuffers should not write to replicas scheduled to drop (#12242 ) This will increase the size of an already large COB (one already passed the threshold for disconnection) This could also mean that we'll attempt to write that data to the socket and the replica will manage to read it, which will result in an undesired partial sync (undesired for the test)	2023-06-12 14:05:34 +03:00
YaacovHazan	0bfb6d5582	Reset command duration for rejected command. (#12247 ) In 7.2, after `971b177fa` we make sure (assert) that the duration has been recorded when resetting the client. This is not true for rejected commands. The use case I found is a blocking command for which an ACL rule changed before it was unblocked; while reprocessing it, the command was rejected and triggered the assert. The PR resets the command duration inside rejectCommand / rejectCommandSds. Co-authored-by: Oran Agra <oran@redislabs.com>	2023-06-11 23:12:05 +03:00
Chen Tianjie	2d7d3911da	Allow bigger tolerance in eventloop duration test. (#12179 ) In #11963, some new tests about eventloop duration were added, which includes time measurement in TCL scripts. This has caused some unexpected CI failures, such as #12169 and #12177, due to slow test servers or some performance jittering.	2023-06-11 09:02:41 +03:00
Wen Hui	d412269ff8	Adding missing test cases for Addslot Command (#12288 ) Added missing test case coverage for below scenarios: 1. The command only works if all the specified slots are, from the point of view of the node receiving the command, currently not assigned. A node will refuse to take ownership for slots that already belong to some other node (including itself). 2. The command fails if the same slot is specified multiple times.	2023-06-11 08:36:26 +03:00
Binbin	da8f7428fa	Try to fix SENTINEL SIMULATE-FAILURE test by re-source init-tests before each test (#12194 ) This test was introduced in #12079, it works well most of the time, but occasionally fails: ``` 00:34:45> SENTINEL SIMULATE-FAILURE crash-after-election works: OK 00:34:45> SENTINEL SIMULATE-FAILURE crash-after-promotion works: FAILED: Sentinel set crash-after-promotion but did not exit ``` Don't know the reason, it may be affected by the exit of the previous crash-after-election test. Because it doesn't really make much sense to go deeper into it now, we re-source init-tests to get a clean environment before each test, to try to fix this. After applying this change, we found a new error: ``` 16:39:33> SENTINEL SIMULATE-FAILURE crash-after-election works: FAILED: caught an error in the test couldn't open socket: connection refused couldn't open socket: connection refused ``` I am guessing the sentinel triggers failover and exits before SENTINEL FAILOVER, added a new || condition in wait_for_condition to fix it.	2023-05-29 13:43:26 +03:00
Oran Agra	2764dc3768	Optimize MSETNX to avoid double lookup (#11944 ) This is a redo of #11594 which got reverted in #11940. It improves performance by avoiding double lookup of the key.	2023-05-28 10:58:29 +03:00
Oran Agra	6117f28822	Fix WAIT for clients being blocked in a module command (#12220 ) So far clients being blocked and unblocked by a module command would not update the c->woff variable and so WAIT was ineffective and got released without waiting for the command actions to propagate. This seems to have existed since forever, but not for RM_BlockClientOnKeys. It is problematic though to know if the module did or didn't propagate anything in that command, so for now, instead of adding an API, we'll just update the woff to the latest offset when unblocking, this will cause the client to possibly wait excessively, but that's not that bad.	2023-05-28 10:10:52 +03:00
Wen Hui	1a188e4ed6	[BUG] Incorrect error msg for XREAD command (#12238 ) XREAD only supports a special ID of $ and XREADGROUP only supports ^. Make sure not to suggest the wrong one when returning an error about unbalanced ID arguments. Co-authored-by: Oran Agra <oran@redislabs.com>	2023-05-28 08:37:32 +03:00
judeng	d71478a889	postpone the initialization of object's lru&lfu until it is added to the db as a value object (#11626 ) This PR brings two performance benefits: 1. Stop redundant initialization when most robj objects are created 2. LRU_CLOCK will no longer be called in io threads, so we can avoid the `atomicGet` Another code optimization: deleted the redundant check in dbSetValue; whether in LFU or LRU mode, the lru field in the old robj is always the freshest (it is always updated in lookupkey), so we don't need to check whether LFU is in use	2023-05-24 09:40:11 +03:00
Wen Hui	d664889992	Adding test case for hvals, hkeys, hexists against wrong type (#12198 ) HVALS, HKEYS and HEXISTS commands wrong type test cases were not covered so added the test cases.	2023-05-24 09:34:13 +03:00
Binbin	d0994c5bca	Sync the new loglevel nothing to sentinel (#12223 ) We add a new loglevel 'nothing' to disable logging in #12133. This PR syncs that config change to sentinel. Because in #11214 we support modifying loglevel in runtime. Although I think sentinel doesn't need this nothing config, it's better to be consistent.	2023-05-24 09:32:39 +03:00
Binbin	ec5721d6ca	Add dummy CLUSTER SLAVES call tests to fix reply ci (#12224 ) In #12166, we removed a call to CLUSTER SLAVES, which then caused reply-schemas ci to fail: ``` WARNING! The following commands were not hit at all: cluster|slaves ERROR! at least one command was not hit by the tests ``` Because we already have command output that covers CLUSTER REPLICAS elsewhere, here we simply add some dummy tests to fix the ci.	2023-05-24 09:28:38 +03:00
Ping Xie	4c74dd986f	Exclude aux fields from "cluster nodes" and "cluster replicas" output (#12166 ) This commit excludes aux fields from the output of the `cluster nodes` and `cluster replicas` command. We may decide to re-introduce them in some form or another in the future, but not in v7.2.	2023-05-23 18:32:37 +03:00
binfeng-xin	38e284f106	optimize spopwithcount propagation (#12082 ) A single SPOP command with a count argument resulted in many SPOP commands being propagated to the replica. This is inefficient because the key name is repeated many times, and is also being looked-up many times. It also results in high QPS metrics on the replica. To solve that, we flush batches of 1024 fields per SPOP command. Co-authored-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>	2023-05-22 10:27:14 +03:00
Binbin	48757934ff	Performance improvement to ZADD and ZRANGESTORE, convert to skiplist and expand dict in advance (#12185 ) For zsets that will eventually be stored as the skiplist encoding (has a dict), we can convert it to skiplist ahead of time. This change checks the number of arguments in the ZADD command, and converts the data-structure if the number of new entries exceeds the listpack-max-entries configuration. This can cause us to over-allocate memory if there are duplicate entries in the input, which is unexpected. For ZRANGESTORE, we know the size of the zset, so we can expand the dict in advance, to avoid the temporary dict from being rehashed while it grows. Simple benchmarks show it provides some 4% improvement in ZADD and 20% in ZRANGESTORE	2023-05-18 15:24:46 +03:00
Hanna Fadida	37cf1984b9	Add BITFIELD_RO basic tests for non-repl use cases (#12187 ) Current tests for BITFIELD_RO command are skipped in the external mode, and therefore reply-schemas-validator reports a coverage error. This PR adds basic tests to increase coverage.	2023-05-18 12:16:46 +03:00
Wen Hui	df1890ef7f	Allow SENTINEL CONFIG SET and SENTINEL CONFIG GET to handle multiple parameters. (#10362 ) Extend SENTINEL CONFIG SET and SENTINEL CONFIG GET to be compatible with variadic CONFIG SET and CONFIG GET and allow multiple parameters to be modified in a single call atomically. Co-authored-by: Oran Agra <oran@redislabs.com>	2023-05-17 10:26:02 +03:00
Binbin	fd566f4050	Fix for set max entries edge case in setTypeCreate / setTypeMaybeConvert (#12183 ) In the judgment in setTypeCreate, we should judge size_hint <= max_entries. This results in the following inconsistencies: ``` 127.0.0.1:6379> config set set-max-intset-entries 5 set-max-listpack-entries 5 OK 127.0.0.1:6379> sadd intset_set1 1 2 3 4 5 (integer) 5 127.0.0.1:6379> object encoding intset_set1 "hashtable" 127.0.0.1:6379> sadd intset_set2 1 2 3 4 (integer) 4 127.0.0.1:6379> sadd intset_set2 5 (integer) 1 127.0.0.1:6379> object encoding intset_set2 "intset" 127.0.0.1:6379> sadd listpack_set1 a 1 2 3 4 (integer) 5 127.0.0.1:6379> object encoding listpack_set1 "hashtable" 127.0.0.1:6379> sadd listpack_set2 a 1 2 3 (integer) 4 127.0.0.1:6379> sadd listpack_set2 4 (integer) 1 127.0.0.1:6379> object encoding listpack_set2 "listpack" ``` This was introduced in #12019, added corresponding tests.	2023-05-16 11:32:21 -07:00
Wen Hui	e45272884e	Adding missing test case for smembers scard commands (#12148 ) Minor missing test case addition. SMEMBERS SCARD against non set SMEMBERS SCARD against non existing key	2023-05-16 09:26:49 -07:00
Oran Agra	2ffde15a1d	increase tolerance of new event loop test, fails on freebsd CI (#12169 ) new test added in #11963, fails on freebsd CI which is slow.	2023-05-14 17:40:29 +03:00
JJ Lu	11cf5cbdcc	Fix bug: LPOS RANK LONG_MIN causes overflow (#12167 ) Limit the range of RANK to -LONG_MAX ~ LONG_MAX. Without this limit, passing -9223372036854775808 would effectively be the same as passing -1.	2023-05-14 09:04:33 +03:00
Chen Tianjie	29ca87955e	Add basic eventloop latency measurement. (#11963 ) The measured latency(duration) includes the list below, which can be shown by `INFO STATS`. ``` eventloop_cycles // ever increasing counter eventloop_duration_sum // cumulative duration of eventloop in microseconds eventloop_duration_cmd_sum // cumulative duration of executing commands in microseconds instantaneous_eventloop_cycles_per_sec // average eventloop count per second in recent 1.6s instantaneous_eventloop_duration_usec // average single eventloop duration in recent 1.6s ``` Also added some experimental metrics, which are shown only when `INFO DEBUG` is called. This section isn't included in the default INFO, or even in `INFO ALL` and the fields in this section can change in the future without considering backwards compatibility. ``` eventloop_duration_aof_sum // cumulative duration of writing AOF eventloop_duration_cron_sum // cumulative duration cron jobs (serverCron, beforeSleep excluding IO and AOF) eventloop_cmd_per_cycle_max // max number of commands executed in one eventloop eventloop_duration_max // max duration of one eventloop ``` All of these are being reset by CONFIG RESETSTAT	2023-05-12 20:13:15 +03:00
Leibale Eidelman	e04ebdb8d3	fix `GEORADIUS[BYMEMBER]` `STORE` & `STOREDIST` args spec (#12151 ) in GEO commands, `STORE` and `STOREDIST` are mutually exclusive. use `oneof` block to contain them	2023-05-09 14:24:37 +03:00
Binbin	6ab2174d37	EXPIRE precision test more attempts and more tolerant (#12150 ) The test failed on MacOS: ``` *** [err]: EXPIRE precision is now the millisecond in tests/unit/expire.tcl Expected 'somevalue {}' to equal or match '{} {}' ``` `set a [r get x]`, even though we tried 10 times, sometimes we still get {}, this is a time-sensitive test. In this PR, we add the following changes: 1. More attempts, change it from 10 to 30. 2. More tolerant, change the `after 900` to `after 800`. In addition, we check $a in advance and change `after 1100` to `after 300`, which saves us some time.	2023-05-09 14:14:22 +03:00
cui fliter	03d50e0c30	Remove several instances of duplicate "the" in comments (#12144 ) Remove several instances of duplicate "the" in comments	2023-05-08 16:12:44 -07:00
Wen Hui	42dd98ec19	adding missing test cases GET and GETEX (#12125 ) adding test case of expired key or not exist for GET and GETEX. for better test coverage.	2023-05-07 11:46:11 +03:00
sundb	ce5f4ea3a9	Delete empty key if fails after moduleCreateEmptyKey() in module (#12129 ) When `RM_ZsetAdd()`/`RM_ZsetIncrby()`/`RM_StreamAdd()` fails, if a new key happens to be created using `moduleCreateEmptyKey()`, we should clean up the empty key. ## Test 1) Add new module commands(`zset.add` and `zset.incrby`) to cover `RM_ZsetAdd()`/`RM_ZsetIncrby()`. 2) Add a large-memory test to cover `RM_StreamAdd()`.	2023-05-07 10:13:19 +03:00
zhaozhao.zz	b0dd7b3245	Free backlog only if rsi is invalid when master reboot (#12088 ) When the master reboots from RDB, if rsi in RDB is valid we should not free replication backlog, even if master_repl_offset or repl-offset is 0. Since if master doesn't send any data to replicas master_repl_offset is 0, it's a valid number. A clear example: 1. start a master and apply some write commands, the master's master_repl_offset is 0 since it has no replicas. 2. stop write commands on master, and start another instance and replicaof the master, trigger a FULLRESYNC 3. the master's master_repl_offset is still 0 (set a large number for repl-ping-replica-period), do BGSAVE and restart the master 4. master loads master_repl_offset from RDB's rsi and it's still 0, and we should make sure replica can partially resync with master.	2023-05-06 11:53:28 +08:00
Binbin	e49c2a5292	Pause cron to prevent premature shrinking in querybuf test (#12126 ) Tests occasionally fail since #12000: ``` * [err]: query buffer resized correctly when not idle in tests/unit/querybuf.tcl Expected 0 > 32768 (context: type eval line 11 cmd {assert {$orig_test_client_qbuf > 32768}} proc ::test) * [err]: query buffer resized correctly with fat argv in tests/unit/querybuf.tcl query buffer should not be resized when client idle time smaller than 2s ``` The reason may be because we set hz to 100, querybuf shrinks before we count client_query_buffer. We avoid this problem by setting pause-cron to 1.	2023-05-04 13:02:08 +03:00
guybe7	857c09b04d	multi.tcl: reset readraw at the end of the test (#12123 ) 1. reset the readraw mode after a test that uses it. undetected since the only test after that on the same server didn't read any replies. 2. fix a cross slot issue that was undetected in cluster mode because readraw doesn't throw exceptions on errors.	2023-05-04 11:58:31 +03:00
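
The boundary condition described in fd566f4050 (#12183) can be illustrated with a toy model. This is a minimal sketch in Python, not the Redis C implementation: `choose_set_encoding` is a hypothetical helper, and treating the member count as `size_hint` and checking every member for integer-ness are simplifying assumptions made for illustration. It only shows why the comparison should be `size_hint <= max_entries` rather than a strict `<`.

```python
def choose_set_encoding(members, max_intset_entries, max_listpack_entries):
    """Toy model (hypothetical helper): pick the initial encoding of a new set.

    size_hint is assumed to be the number of members passed to SADD.
    The inclusive comparison (<=) is the point of the fix: a set with
    exactly max_entries members should still use the compact encoding.
    """
    size_hint = len(members)
    all_ints = all(isinstance(m, int) for m in members)
    if all_ints and size_hint <= max_intset_entries:
        return "intset"
    if size_hint <= max_listpack_entries:
        return "listpack"
    return "hashtable"


# Mirrors the transcript in the commit message, with both limits set to 5.
print(choose_set_encoding([1, 2, 3, 4, 5], 5, 5))
print(choose_set_encoding(["a", 1, 2, 3, 4], 5, 5))
print(choose_set_encoding([1, 2, 3, 4, 5, 6], 5, 5))
```

With the inclusive check, creating a set with exactly five members yields the same encoding as adding the fifth member afterwards, which is the consistency the commit message asks for.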

2198 Commits