redis

Commit Graph

Author	SHA1	Message	Date
Salvatore Sanfilippo	bc71e8fe2d	[Vector sets] More rdb loading fixes (#14032 ) Hi all, this PR fixes two things: 1. An assertion, that prevented the RDB loading from recovery if there was a quantization type mismatch (with regression test). 2. Two code paths that just returned NULL without proper cleanup during RDB loading.	2025-05-27 15:39:18 +03:00
Salvatore Sanfilippo	ad9575dd6d	[Vector sets] RDB IO errors handling (#13978 ) This PR adds support for REDISMODULE_OPTIONS_HANDLE_IO_ERRORS. and tests for short read and corrupted RESTORE payload. Please: note that I also removed the comment about async loading support since we should be already covered. No manipulation of global data structures in Vector Sets, if not for the unique ID used to create new vector sets with different IDs.	2025-05-27 15:39:18 +03:00
Pieter Cailliau	d65102861f	Adding AGPLv3 as a license option to Redis! (#13997 ) Read more about [the new license option](http://redis.io/blog/agplv3/) and [the Redis 8 release](http://redis.io/blog/redis-8-ga/).	2025-05-01 14:04:22 +01:00
Oran Agra	725c71b87a	fix flaky replication test (#13945 ) from the master's perspective, the replica can become online before it's actually done loading the rdb file. this was always like that, in disk-based repl, and thus ok with diskless and rdb channel. in this test, because all the keys are added before the backlog is created, the replication offset is 0, so the test proceeds and could get a LOADING error when trying to run the function.	2025-04-15 17:18:00 +03:00
Ozan Tezcan	eafc365040	Fix flaky replication test (#13909 ) CI / test-ubuntu-latest (push) Failing after 31s Details CI / test-sanitizer-address (push) Failing after 32s Details CI / build-debian-old (push) Failing after 31s Details CI / build-32bit (push) Failing after 31s Details CI / build-libc-malloc (push) Failing after 31s Details CI / build-centos-jemalloc (push) Failing after 31s Details CI / build-old-chain-jemalloc (push) Failing after 32s Details Codecov / code-coverage (push) Failing after 31s Details Spellcheck / Spellcheck (push) Failing after 31s Details Coverity Scan / coverity (push) Has been skipped Details External Server Tests / test-external-standalone (push) Failing after 32s Details External Server Tests / test-external-cluster (push) Failing after 32s Details External Server Tests / test-external-nodebug (push) Failing after 1m56s Details CI / build-macos-latest (push) Has been cancelled Details Test fails time to time: ``` *** [err]: Slave is able to detect timeout during handshake in tests/integration/replication.tcl Replica is not able to detect timeout ``` Depending on the timing, "debug sleep" may occur during rdbchannel handshake and required log statement won't be printed to the log in that case. Better to wait after rdbchannel handshake.	2025-04-07 13:12:33 +03:00
Ozan Tezcan	3cdb8c6046	Improve replication buffering on replica and fix a related bug (#13904 ) With RDB channel replication, we introduced parallel replication stream and RDB delivery to the replica during a full sync. Currently, after the replica loads the RDB and begins streaming the accumulated buffer to the database, it does not read from the master connection during this period. Although streaming the local buffer is generally a fast operation, it can take some time if the buffer is large. This PR introduces buffering during the streaming of the local buffer. One important consideration is ensuring that we consume more than we read during this operation; otherwise, it could take indefinitely. To guarantee that it will eventually complete, we limit the read to at most half of what we consume, e.g. read at most 1 mb once we consume at least 2 mb. Additional changes Bug fix - Currently, when replica starts draining accumulated buffer, we call protectClient() for the master client as we occasionally yield back to event loop via processEventsWhileBlocked(). So, it prevents freeing the master client. While we are in this loop, if replica receives "replicaof newmaster" command, we call replicaSetMaster() which expects to free the master client and trigger a new connection attempt. As the client object is protected, its destruction will happen asynchronously. Though, a new connection attempt to new master will be made immediately. Later, when the replication buffer is drained, we realize master client was marked as CLOSE_ASAP, and freeing master client triggers another connection attempt to the new master. In most cases, we realize something is wrong in the replication state machine and abort the second attempt later. So, the bug may go undetected. Fix is not calling protectClient() for the master client. Instead, trying to detect if master client is disconnected during processEventsWhileBlocked() and if so, breaking the loop immediately. Related improvement: - Currently, the replication buffer is a linked list of buffers, each of which is 1 MB in size. While consuming the buffer, we process one buffer at a time and check if we need to yield back to `processEventsWhileBlocked()`. However, if `loading-process-events-interval-bytes` is set to less than 1 MB, this approach doesn't handle it well. To improve this, I've modified the code to process 16KB at a time and check `loading-process-events-interval-bytes` more frequently. This way, depending on the configuration, we may yield back to networking more often. - In replication.c, `disklessLoadingRio` will be set before a call to `emptyData()`. This change should not introduce any behavioral change but it is logically more correct as emptyData() may yield to networking and we may need to call rioAbort() on disklessLoadingRio. Otherwise, failure of main channel may go undetected until a failure on rdb channel on a corner case. Config changes - The default value for the `loading-process-events-interval-bytes` configuration is being lowered from 2MB to 512KB. This configuration primarily used for testing and controls the frequency of networking during the loading phase, specifically when loading the RDB or applying accumulated buffers during a full sync on the replica side. Before the introduction of RDB channel replication, the 2MB value was sufficient for occasionally yielding to networking, mainly to reply -loading to the clients. However, with RDB channel replication, during a full sync on the replica side (either while loading the RDB or applying the accumulated buffer), we need to yield back to networking more frequently to continue accumulating the replication stream. If this doesn’t happen often enough, the replication stream can accumulate on the master side, which is undesirable. To address this, we’ve decided to lower the default value to 512KB. One concern with frequent yielding to networking is the potential performance impact, as each call to processEventsWhileBlocked() involves 4 syscalls, which could slow down the RDB loading phase. However, benchmarking with various configuration values has shown that using 512KB or higher does not negatively impact RDB loading performance. Based on these results, 512KB is now selected as the default value. Test changes - Added improved version of a replication test which checks memory usage on master during full sync. --------- Co-authored-by: Oran Agra <oran@redislabs.com>	2025-04-03 10:04:29 +03:00
Ozan Tezcan	366c6aff81	Put replica online when bgsave is done (#13895 ) Before https://github.com/redis/redis/pull/13732, replicas were brought online immediately after master wrote the last bytes of the RDB file to the socket. This behavior remains unchanged if rdbchannel replication is not used. However, with rdbchannel replication, the replica is brought online after receiving the first ack which is sent by replica after rdb is loaded. To align the behavior, reverting this change to put replica online once bgsave is done. Additonal changes: - INFO field `mem_total_replication_buffers` will also contain `server.repl_full_sync_buffer.mem_used` which shows accumulated replication stream during rdbchannel replication on replica side. - Deleted debug level logging from some replication tests. These tests generate thousands of keys and it may cause per key logging on some cases.	2025-03-31 13:48:49 +03:00
Ozan Tezcan	a0da8390a2	Fix use-after-free when diskless load config is not swapdb (#13887 ) CI / build-macos-latest (push) Waiting to run Details CI / test-sanitizer-address (push) Failing after 32s Details CI / build-debian-old (push) Failing after 31s Details CI / build-centos-jemalloc (push) Failing after 32s Details CI / build-libc-malloc (push) Failing after 32s Details CI / build-32bit (push) Failing after 32s Details CI / build-old-chain-jemalloc (push) Failing after 32s Details Codecov / code-coverage (push) Failing after 31s Details External Server Tests / test-external-standalone (push) Failing after 32s Details External Server Tests / test-external-cluster (push) Failing after 32s Details External Server Tests / test-external-nodebug (push) Failing after 32s Details CI / test-ubuntu-latest (push) Failing after 1m37s Details Spellcheck / Spellcheck (push) Failing after 32s Details When the diskless load configuration is set to on-empty-db, we retain a pointer to the function library context. When emptyData() is called, it frees this function library context pointer, leading to a use-after-free situation. I refactored code to ensure that emptyData() is called first, followed by retrieving the valid pointer to the function library context. Refactored code should not introduce any runtime implications. Bug introduced by https://github.com/redis/redis/pull/13495 (Redis 8.0) Co-authored-by: Oran Agra <oran@redislabs.com>	2025-03-26 21:50:10 +03:00
Yuan Wang	f1d6542b1a	Stabilize tcl test cases (#13829 ) Recently encountered some errors as bellow, HGETEX/HSETEX with PXAT/EXAT options, after getting ttl, we calculate current time by `[clock seconds]` that may have a delay that causes results greater than expected. Dismiss memory test error, now we introduced rdb-channel replication, the full synchronization might finish before the child process exits. So we may fail if calling `bgsave` immediately after full sync.	2025-02-25 16:31:53 +08:00
Yuan Wang	7f5f588232	AOF offset info (#13773 ) ### Background AOF is often used as an effective data recovery method, but now if we have two AOFs from different nodes, it is hard to learn which one has latest data. Generally, we determine whose data is more up-to-date by reading the latest modification time of the AOF file, but because of replication delay, even if both master and replica write to the AOF at the same time, the data in the master is more up-to-date (there are commands that didn't arrive at the replica yet, or a large number of commands have accumulated on replica side ), so we may make wrong decision. ### Solution The replication offset always increments when AOF is enabled even if there is no replica, we think replication offset is better method to determine which one has more up-to-date data, whoever has a larger offset will have newer data, so we add the start replication offset info for AOF, as bellow. ``` file appendonly.aof.2.base.rdb seq 2 type b file appendonly.aof.2.incr.aof seq 2 type i startoffset 224 ``` And if we close gracefully the AOF file, not a crash, such as `shutdown`, `kill signal 15` or `config set appendonly no`, we will add the end replication offset, as bellow. ``` file appendonly.aof.2.base.rdb seq 2 type b file appendonly.aof.2.incr.aof seq 2 type i startoffset 224 endoffset 532 ``` #### Things to pay attention to - For BASE AOF, we do not add `startoffset` and `endoffset` info, since we could not know the start replication replication of data, and it is useless to help us to determine which one has more up-to-date data. - For AOFs from old version, we also don't add `startoffset` and `endoffset` info, since we also don't know start replication replication of them. If we add the start offset from 0, we might make the judgment even less accurate. For example, if the master has just rewritten the AOF, its INCR AOF will inevitably be very small. However, if the replica has not rewritten AOF for a long time, its INCR AOF might be much larger. By applying the following method, we might make incorrect decisions, so we still just check timestamp instead of adding offset info - If the last INCR AOF has `startoffset` or `endoffset`, we need to restore `server.master_repl_offset` according to them to avoid the rollback of the `startoffset` of next INCR AOF. If it has `endoffset`, we just use this value as `server.master_repl_offset`, and a very important thing is to remove this information from the manifest file to avoid the next time we load the manifest file with wrong `endoffset`. If it only has `startoffset`, we calculate `server.master_repl_offset` by the `startoffset` plus the file size. ### How to determine which one has more up-to-date data If AOF has a larger replication offset, it will have more up-to-date data. The following is how to get AOF offset: Read the AOF manifest file to obtain information about the last INCR AOF 1. If the last INCR AOF has `endoffset` field, we can directly use the `endoffset` to present the replication offset of AOF 2. If there is no `endoffset`(such as redis crashes abnormally), but there is `startoffset` filed of the last INCR AOF, we can get the replication offset of AOF by `startoffset` plus the file size 3. Finally, if the AOF doesn’t have both `startoffset` and `endoffset`, maybe from old version, and new version redis has not rewritten AOF yet, we still need to check the modification timestamp of the last INCR AOF ### TODO Fix ping causing inconsistency between AOF size and replication offset in the future PR. Because we increment the replication offset when sending PING/REPLCONF to the replica but do not write data to the AOF file, this might cause the starting offset of the AOF file plus its size to be inconsistent with the actual replication offset.	2025-02-13 17:31:40 +08:00
Ozan Tezcan	09f8a2f374	Start AOFRW before streaming repl buffer during fullsync (#13758 ) During fullsync, before loading RDB on the replica, we stop aof child to prevent copy-on-write disaster. Once rdb is loaded, aof is started again and it will trigger aof rewrite. With https://github.com/redis/redis/pull/13732 , for rdbchannel replication, this behavior was changed. Currently, we start aof after replication buffer is streamed to db. This PR changes it back to start aof just after rdb is loaded (before repl buffer is streamed) Both approaches may have pros and cons. If we start aof before streaming repl buffers, we may still face with copy-on-write issues as repl buffers potentially include large amount of changes. If we wait until replication buffer drained, it means we are delaying starting aof persistence. Additional changes are introduced as part of this PR: - Interface change: Added `mem_replica_full_sync_buffer` field to the `INFO MEMORY` command reply. During full sync, it shows total memory consumed by accumulated replication stream buffer on replica. Added same metric to `MEMORY STATS` command reply as `replica.fullsync.buffer` field. - Fixes: - Count repl stream buffer size of replica as part of 'memory overhead' calculation for fields in "INFO MEMORY" and "MEMORY STATS" outputs. Before this PR, repl buffer was not counted as part of memory overhead calculation, causing misreports for fields like `used_memory_overhead` and `used_memory_dataset` in "INFO STATS" and for `overhead.total` field in "MEMORY STATS" command reply. - Dismiss replication stream buffers memory of replica in the fork to reduce COW impact during a fork. - Fixed a few time sensitive flaky tests, deleted a noop statement, fixed some comments and fail messages in rdbchannel tests.	2025-02-04 21:40:18 +03:00
Ozan Tezcan	73a9b916c9	Rdb channel replication (#13732 ) This PR is based on: https://github.com/redis/redis/pull/12109 https://github.com/valkey-io/valkey/pull/60 Closes: https://github.com/redis/redis/issues/11678 Motivation During a full sync, when master is delivering RDB to the replica, incoming write commands are kept in a replication buffer in order to be sent to the replica once RDB delivery is completed. If RDB delivery takes a long time, it might create memory pressure on master. Also, once a replica connection accumulates replication data which is larger than output buffer limits, master will kill replica connection. This may cause a replication failure. The main benefit of the rdb channel replication is streaming incoming commands in parallel to the RDB delivery. This approach shifts replication stream buffering to the replica and reduces load on master. We do this by opening another connection for RDB delivery. The main channel on replica will be receiving replication stream while rdb channel is receiving the RDB. This feature also helps to reduce master's main process CPU load. By opening a dedicated connection for the RDB transfer, the bgsave process has access to the new connection and it will stream RDB directly to the replicas. Before this change, due to TLS connection restriction, the bgsave process was writing RDB bytes to a pipe and the main process was forwarding it to the replica. This is no longer necessary, the main process can avoid these expensive socket read/write syscalls. It also means RDB delivery to replica will be faster as it avoids this step. In summary, replication will be faster and master's performance during full syncs will improve. Implementation steps 1. When replica connects to the master, it sends 'rdb-channel-repl' as part of capability exchange to let master to know replica supports rdb channel. 2. When replica lacks sufficient data for PSYNC, master sends +RDBCHANNELSYNC reply with replica's client id. As the next step, the replica opens a new connection (rdb-channel) and configures it against the master with the appropriate capabilities and requirements. It also sends given client id back to master over rdbchannel, so that master can associate these channels. (initial replica connection will be referred as main-channel) Then, replica requests fullsync using the RDB channel. 3. Prior to forking, master attaches the replica's main channel to the replication backlog to deliver replication stream starting at the snapshot end offset. 4. The master main process sends replication stream via the main channel, while the bgsave process sends the RDB directly to the replica via the rdb-channel. Replica accumulates replication stream in a local buffer, while the RDB is being loaded into the memory. 5. Once the replica completes loading the rdb, it drops the rdb channel and streams the accumulated replication stream into the db. Sync is completed. Some details - Currently, rdbchannel replication is supported only if `repl-diskless-sync` is enabled on master. Otherwise, replication will happen over a single connection as in before. - On replica, there is a limit to replication stream buffering. Replica uses a new config `replica-full-sync-buffer-limit` to limit number of bytes to accumulate. If it is not set, replica inherits `client-output-buffer-limit <replica>` hard limit config. If we reach this limit, replica stops accumulating. This is not a failure scenario though. Further accumulation will happen on master side. Depending on the configured limits on master, master may kill the replica connection. API changes in INFO output: 1. New replica state: `send_bulk_and_stream`. Indicates full sync is still in progress for this replica. It is receiving replication stream and rdb in parallel. ``` slave0:ip=127.0.0.1,port=5002,state=send_bulk_and_stream,offset=0,lag=0 ``` Replica state changes in steps: - First, replica sends psync and receives +RDBCHANNELSYNC :`state=wait_bgsave` - After replica connects with rdbchannel and delivery starts: `state=send_bulk_and_stream` - After full sync: `state=online` 2. On replica side, replication stream buffering metrics: - replica_full_sync_buffer_size: Currently accumulated replication stream data in bytes. - replica_full_sync_buffer_peak: Peak number of bytes that this instance accumulated in the lifetime of the process. ``` replica_full_sync_buffer_size:20485 replica_full_sync_buffer_peak:1048560 ``` API changes in CLIENT LIST In `client list` output, rdbchannel clients will have 'C' flag in addition to 'S' replica flag: ``` id=11 addr=127.0.0.1:39108 laddr=127.0.0.1:5001 fd=14 name= age=5 idle=5 flags=SC db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1920 events=r cmd=psync user=default redir=-1 resp=2 lib-name= lib-ver= io-thread=0 ``` Config changes: - `replica-full-sync-buffer-limit`: Controls how much replication data replica can accumulate during rdbchannel replication. If it is not set, a value of 0 means replica will inherit `client-output-buffer-limit <replica>` hard limit config to limit accumulated data. - `repl-rdb-channel` config is added as a hidden config. This is mostly for testing as we need to support both rdbchannel replication and the older single connection replication (to keep compatibility with older versions and rdbchannel replication will not be enabled if repl-diskless-sync is not enabled). it affects both the master (not to respond to rdb channel requests), and the replica (not to declare capability) Internal API changes: Changes that were introduced to Redis replication: - New replication capability is added to replconf command: `capa rdb-channel-repl`. Indicates replica is capable of rdb channel replication. Replica sends it when it connects to master along with other capabilities. - If replica needs fullsync, master replies `+RDBCHANNELSYNC <client-id>` to the replica's PSYNC request. - When replica opens rdbchannel connection, as part of replconf command, it sends `rdb-channel 1` to let master know this is rdb channel. Also, it sends `main-ch-client-id <client-id>` as part of replconf command so master can associate channels. Testing: As rdbchannel replication is enabled by default, we run whole test suite with it. Though, as we need to support both rdbchannel and single connection replication, we'll be running some tests twice with `repl-rdb-channel yes/no` config. Replica state diagram ``` * * Replica state machine * * * Main channel state * ┌───────────────────┐ * │RECEIVE_PING_REPLY │ * └────────┬──────────┘ * │ +PONG * ┌────────▼──────────┐ * │SEND_HANDSHAKE │ RDB channel state * └────────┬──────────┘ ┌───────────────────────────────┐ * │+OK ┌───► RDB_CH_SEND_HANDSHAKE │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_AUTH_REPLY │ │ REPLCONF main-ch-client-id <clientid> * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_AUTH_REPLY │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_PORT_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_REPLCONF_REPLY│ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_IP_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_FULLRESYNC │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_CAPA_REPLY │ │ │+FULLRESYNC * └────────┬──────────┘ │ │Rdb delivery * │ │ ┌──────────────▼────────────────┐ * ┌────────▼──────────┐ │ │ RDB_CH_RDB_LOADING │ * │SEND_PSYNC │ │ └──────────────┬────────────────┘ * └─┬─────────────────┘ │ │ Done loading * │PSYNC (use cached-master) │ │ * ┌─▼─────────────────┐ │ │ * │RECEIVE_PSYNC_REPLY│ │ ┌────────────►│ Replica streams replication * └─┬─────────────────┘ │ │ │ buffer into memory * │ │ │ │ * │+RDBCHANNELSYNC client-id │ │ │ * ├──────┬───────────────────┘ │ │ * │ │ Main channel │ │ * │ │ accumulates repl data │ │ * │ ┌──▼────────────────┐ │ ┌───────▼───────────┐ * │ │ REPL_TRANSFER ├───────┘ │ CONNECTED │ * │ └───────────────────┘ └────▲───▲──────────┘ * │ │ │ * │ │ │ * │ +FULLRESYNC ┌───────────────────┐ │ │ * ├────────────────► REPL_TRANSFER ├────┘ │ * │ └───────────────────┘ │ * │ +CONTINUE │ * └──────────────────────────────────────────────┘ */ ``` ----- This PR also contains changes and ideas from: https://github.com/valkey-io/valkey/pull/837 https://github.com/valkey-io/valkey/pull/1173 https://github.com/valkey-io/valkey/pull/804 https://github.com/valkey-io/valkey/pull/945 https://github.com/valkey-io/valkey/pull/989 --------- Co-authored-by: Yuan Wang <wangyuancode@163.com> Co-authored-by: debing.sun <debing.sun@redis.com> Co-authored-by: Moti Cohen <moticless@gmail.com> Co-authored-by: naglera <anagler123@gmail.com> Co-authored-by: Amit Nagler <58042354+naglera@users.noreply.github.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Co-authored-by: Ping Xie <pingxie@outlook.com> Co-authored-by: Ran Shidlansik <ranshid@amazon.com> Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com> Co-authored-by: xbasel <103044017+xbasel@users.noreply.github.com>	2025-01-13 15:09:52 +03:00
Yuan Wang	64a40b20d9	Async IO Threads (#13695 ) ## Introduction Redis introduced IO Thread in 6.0, allowing IO threads to handle client request reading, command parsing and reply writing, thereby improving performance. The current IO thread implementation has a few drawbacks. - The main thread is blocked during IO thread read/write operations and must wait for all IO threads to complete their current tasks before it can continue execution. In other words, the entire process is synchronous. This prevents the efficient utilization of multi-core CPUs for parallel processing. - When the number of clients and requests increases moderately, it causes all IO threads to reach full CPU utilization due to the busy wait mechanism used by the IO threads. This makes it challenging for us to determine which part of Redis has reached its bottleneck. - When IO threads are enabled with TLS and io-threads-do-reads, a disconnection of a connection with pending data may result in it being assigned to multiple IO threads simultaneously. This can cause race conditions and trigger assertion failures. Related issue: redis#12540 Therefore, we designed an asynchronous IO threads solution. The IO threads adopt an event-driven model, with the main thread dedicated to command processing, meanwhile, the IO threads handle client read and write operations in parallel. ## Implementation ### Overall As before, we did not change the fact that all client commands must be executed on the main thread, because Redis was originally designed to be single-threaded, and processing commands in a multi-threaded manner would inevitably introduce numerous race and synchronization issues. But now each IO thread has independent event loop, therefore, IO threads can use a multiplexing approach to handle client read and write operations, eliminating the CPU overhead caused by busy-waiting. the execution process can be briefly described as follows: the main thread assigns clients to IO threads after accepting connections, IO threads will notify the main thread when clients finish reading and parsing queries, then the main thread processes queries from IO threads and generates replies, IO threads handle writing reply to clients after receiving clients list from main thread, and then continue to handle client read and write events. ### Each IO thread has independent event loop We now assign each IO thread its own event loop. This approach eliminates the need for the main thread to perform the costly `epoll_wait` operation for handling connections (except for specific ones). Instead, the main thread processes requests from the IO threads and hands them back once completed, fully offloading read and write events to the IO threads. Additionally, all TLS operations, including handling pending data, have been moved entirely to the IO threads. This resolves the issue where io-threads-do-reads could not be used with TLS. ### Event-notified client queue To facilitate communication between the IO threads and the main thread, we designed an event-notified client queue. Each IO thread and the main thread have two such queues to store clients waiting to be processed. These queues are also integrated with the event loop to enable handling. We use pthread_mutex to ensure the safety of queue operations, as well as data visibility and ordering, and race conditions are minimized, as each IO thread and the main thread operate on independent queues, avoiding thread suspension due to lock contention. And we implemented an event notifier based on `eventfd` or `pipe` to support event-driven handling. ### Thread safety Since the main thread and IO threads can execute in parallel, we must handle data race issues carefully. client->flags The primary tasks of IO threads are reading and writing, i.e. `readQueryFromClient` and `writeToClient`. However, IO threads and the main thread may concurrently modify or access `client->flags`, leading to potential race conditions. To address this, we introduced an io-flags variable to record operations performed by IO threads, thereby avoiding race conditions on `client->flags`. Pause IO thread In the main thread, we may want to operate data of IO threads, maybe uninstall event handler, access or operate query/output buffer or resize event loop, we need a clean and safe context to do that. We pause IO thread in `IOThreadBeforeSleep`, do some jobs and then resume it. To avoid thread suspended, we use busy waiting to confirm the target status. Besides we use atomic variable to make sure memory visibility and ordering. We introduce these functions to pause/resume IO Threads as below. ``` pauseIOThread, resumeIOThread pauseAllIOThreads, resumeAllIOThreads pauseIOThreadsRange, resumeIOThreadsRange ``` Testing has shown that `pauseIOThread` is highly efficient, allowing the main thread to execute nearly 200,000 operations per second during stress tests. Similarly, `pauseAllIOThreads` with 8 IO threads can handle up to nearly 56,000 operations per second. But operations performed between pausing and resuming IO threads must be quick; otherwise, they could cause the IO threads to reach full CPU utilization. freeClient and freeClientAsync The main thread may need to terminate a client currently running on an IO thread, for example, due to ACL rule changes, reaching the output buffer limit, or evicting a client. In such cases, we need to pause the IO thread to safely operate on the client. maxclients and maxmemory-clients updating When adjusting `maxclients`, we need to resize the event loop for all IO threads. Similarly, when modifying `maxmemory-clients`, we need to traverse all clients to calculate their memory usage. To ensure safe operations, we pause all IO threads during these adjustments. Client info reading The main thread may need to read a client’s fields to generate a descriptive string, such as for the `CLIENT LIST` command or logging purposes. In such cases, we need to pause the IO thread handling that client. If information for all clients needs to be displayed, all IO threads must be paused. Tracking redirect Redis supports the tracking feature and can even send invalidation messages to a connection with a specified ID. But the target client may be running on IO thread, directly manipulating the client’s output buffer is not thread-safe, and the IO thread may not be aware that the client requires a response. In such cases, we pause the IO thread handling the client, modify the output buffer, and install a write event handler to ensure proper handling. clientsCron In the `clientsCron` function, the main thread needs to traverse all clients to perform operations such as timeout checks, verifying whether they have reached the soft output buffer limit, resizing the output/query buffer, or updating memory usage. To safely operate on a client, the IO thread handling that client must be paused. If we were to pause the IO thread for each client individually, the efficiency would be very low. Conversely, pausing all IO threads simultaneously would be costly, especially when there are many IO threads, as clientsCron is invoked relatively frequently. To address this, we adopted a batched approach for pausing IO threads. At most, 8 IO threads are paused at a time. The operations mentioned above are only performed on clients running in the paused IO threads, significantly reducing overhead while maintaining safety. ### Observability In the current design, the main thread always assigns clients to the IO thread with the least clients. To clearly observe the number of clients handled by each IO thread, we added the new section in INFO output. The `INFO THREADS` section can show the client count for each IO thread. ``` # Threads io_thread_0:clients=0 io_thread_1:clients=2 io_thread_2:clients=2 ``` Additionally, in the `CLIENT LIST` output, we also added a field to indicate the thread to which each client is assigned. `id=244 addr=127.0.0.1:41870 laddr=127.0.0.1:6379 ... resp=2 lib-name= lib-ver= io-thread=1` ## Trade-off ### Special Clients For certain special types of clients, keeping them running on IO threads would result in severe race issues that are difficult to resolve. Therefore, we chose not to offload these clients to the IO threads. For replica, monitor, subscribe, and tracking clients, main thread may directly write them a reply when conditions are met. Race issues are difficult to resolve, so we have them processed in the main thread. This includes the Lua debug clients as well, since we may operate connection directly. For blocking client, after the IO thread reads and parses a command and hands it over to the main thread, if the client is identified as a blocking type, it will be remained in the main thread. Once the blocking operation completes and the reply is generated, the client is transferred back to the IO thread to send the reply and wait for event triggers. ### Clients Eviction To support client eviction, it is necessary to update each client’s memory usage promptly during operations such as read, write, or command execution. However, when a client operates on an IO thread, it is not feasible to update the memory usage immediately due to the risk of data races. As a result, memory usage can only be updated either in the main thread while processing commands or in the `ClientsCron` periodically. The downside of this approach is that updates might experience a delay of up to one second, which could impact the precision of memory management for eviction. To avoid incorrectly evicting clients. We adopted a best-effort compensation solution, when we decide to eviction a client, we update its memory usage again before evicting, if the memory used by the client does not decrease or memory usage bucket is not changed, then we will evict it, otherwise, not evict it. However, we have not completely solved this problem. Due to the delay in memory usage updates, it may lead us to make incorrect decisions about the need to evict clients. ### Defragment In the majority of cases we do NOT use the data from argv directly in the db. 1. key names We store a copy that we allocate in the main thread, see `sdsdup()` in `dbAdd()`. 2. hash key and value We store key as hfield and store value as sds, see `hfieldNew()` and `sdsdup()` in `hashTypeSet()`. 3. other datatypes They don't even use SDS, so there is no reference issues. But in some cases client the data from argv may be retain by the main thread. As a result, during fragmentation cleanup, we need to move allocations from the IO thread’s arena to the main thread’s arena. We always allocate new memory in the main thread’s arena, but the memory released by IO threads may not yet have been reclaimed. This ultimately causes the fragmentation rate to be higher compared to creating and allocating entirely within a single thread. The following cases below will lead to memory allocated by the IO thread being kept by the main thread. 1. string related command: `append`, `getset`, `mset` and `set`. If `tryObjectEncoding()` does not change argv, we will keep it directly in the main thread, see the code in `tryObjectEncoding()`(specifically `trimStringObjectIfNeeded()`) 2. block related command. the key names will be kept in `c->db->blocking_keys`. 3. watch command the key names will be kept in `c->db->watched_keys`. 4. [s]subscribe command channel name will be kept in `serverPubSubChannels`. 5. script load command script will be kept in `server.lua_scripts`. 7. some module API: `RM_RetainString`, `RM_HoldString` Those issues will be handled in other PRs. ## Testing ### Functional Testing The commit with enabling IO Threads has passed all TCL tests, but we did some changes: Client query buffer: In the original code, when using a reusable query buffer, ownership of the query buffer would be released after the command was processed. However, with IO threads enabled, the client transitions from an IO thread to the main thread for processing. This causes the ownership release to occur earlier than the command execution. As a result, when IO threads are enabled, the client's information will never indicate that a shared query buffer is in use. Therefore, we skip the corresponding query buffer tests in this case. Defragment: Add a new defragmentation test to verify the effect of io threads on defragmentation. Command delay: For deferred clients in TCL tests, due to clients being assigned to different threads for execution, delays may occur. To address this, we introduced conditional waiting: the process proceeds to the next step only when the `client list` contains the corresponding commands. ### Sanitizer Testing The commit passed all TCL tests and reported no errors when compiled with the `fsanitizer=thread` and `fsanitizer=address` options enabled. But we made the following modifications: we suppressed the sanitizer warnings for clients with watched keys when updating `client->flags`, we think IO threads read `client->flags`, but never modify it or read the `CLIENT_DIRTY_CAS` bit, main thread just only modifies this bit, so there is no actual data race. ## Others ### IO thread number In the new multi-threaded design, the main thread is primarily focused on command processing to improve performance. Typically, the main thread does not handle regular client I/O operations but is responsible for clients such as replication and tracking clients. To avoid breaking changes, we still consider the main thread as the first IO thread. When the io-threads configuration is set to a low value (e.g., 2), performance does not show a significant improvement compared to a single-threaded setup for simple commands (such as SET or GET), as the main thread does not consume much CPU for these simple operations. This results in underutilized multi-core capacity. However, for more complex commands, having a low number of IO threads may still be beneficial. Therefore, it’s important to adjust the `io-threads` based on your own performance tests. Additionally, you can clearly monitor the CPU utilization of the main thread and IO threads using `top -H -p $redis_pid`. This allows you to easily identify where the bottleneck is. If the IO thread is the bottleneck, increasing the `io-threads` will improve performance. If the main thread is the bottleneck, the overall performance can only be scaled by increasing the number of shards or replicas. --------- Co-authored-by: debing.sun <debing.sun@redis.com> Co-authored-by: oranagra <oran@redislabs.com>	2024-12-23 14:16:40 +08:00
Yuan Wang	b71a610f5c	Clean up .rediscli_history_test temporary file (#13601 ) After running test in local, there will be a file named `.rediscli_history_test`, and it is not in `.gitignore` file, so this is considered to have changed the code base. It is a little annoying, this commit just clean up the temporary file. We should delete `.rediscli_history_test` in the end since the second server tests also write somethings into it, to make it corresponding, i put `set ::env(REDISCLI_HISTFILE) ".rediscli_history_test"` at the beginning. Maybe we also can add this file into `.gitignore`	2024-10-17 09:12:11 +08:00
debing.sun	438cfed70a	Replace wrongly free with zfree in redis-cli (#13560 ) #13258 Incorrect use of free instead of zfree	2024-09-23 09:40:47 +08:00
Filipe Oliveira (Redis)	9146ac050b	Optimize HSCAN/ZSCAN command in case of listpack encoding: avoid the usage of intermediate list (#13531 ) Similar to #13530 , applied to HSCAN and ZSCAN in case of listpack encoding. Preliminary benchmark results showcase an improvement of 108% on the achievable ops/sec for ZSCAN and 65% for HSCAN. --------- Co-authored-by: debing.sun <debing.sun@redis.com>	2024-09-13 20:36:19 +08:00
Filipe Oliveira (Redis)	f2f85ba354	Optimize SSCAN command in case of listpack or intset encoding: avoid the usage of intermediate list. From 2N to N iterations (#13530 ) On SSCAN, in case of listpack and intset encoding we actually reply the entire set, and always reply with the cursor 0. For those cases, we don't need to accumulate the replies in a list and can completely avoid the overhead of list appending and then iterating over the list again -- meaning we do N iterations instead of 2N iterations over the SET and save intermediate memory as well. Preliminary benchmarks, `SSCAN set:100 0`, showcased an improvement of 60% as visible bellow on a SET with 100 string elements (listpack encoded).	2024-09-12 22:36:54 +08:00
debing.sun	74609d44cd	Fix set with invalid length causes smembers to hang (#13515 ) After https://github.com/redis/redis/pull/13499, If the length set by `addReplySetLen()` does not match the actual number of elements in the reply, it will cause protocol broken and result in the client hanging.	2024-09-04 17:35:46 +08:00
Ozan Tezcan	a7afd1d2b2	Reply LOADING on replica while flushing the db (#13495 ) On a full sync, replica starts discarding existing db. If the existing db is huge and flush is happening synchronously, replica may become unresponsive. Adding a change to yield back to event loop while flushing db on a replica. Replica will reply -LOADING in this case. Note that while replica is loading the new rdb, it may get an error and start flushing the partial db. This step may take a long time as well. Similarly, replica will reply -LOADING in this case. To call processEventsWhileBlocked() and reply -LOADING, we need to do: - Set connSetReadHandler() null not to process further data from the master - Set server.loading flag - Call blockingOperationStarts() rdbload() already does these steps and calls processEventsWhileBlocked() while loading the rdb. Added a new call rdbLoadWithEmptyFunc() which accepts callback to flush db before loading rdb or when an error happens while loading. For diskless replication, doing something similar and calling emptyData() after setting required flags. Additional changes: - Allow `appendonly` config change during loading. Config can be changed while loading data on startup or on replication when slave is loading RDB. We allow config change command to update `server.aof_enabled` and then lazily apply config change after loading operation is completed. - Added a test for `replica-lazy-flush` config	2024-09-03 09:48:44 +03:00
Vitah Lin	8038eb3147	Fix wrong dbnum showed in redis-cli after client reconnected (#13411 ) When the server restarts while the CLI is connecting, the reconnection does not automatically select the previous db. This may lead users to believe they are still in the previous db, in fact, they are in db0. This PR will automatically reset the current dbnum and `cliSelect()` again when reconnecting. --------- Co-authored-by: debing.sun <debing.sun@redis.com>	2024-08-03 12:06:02 +08:00
debing.sun	e750c619b2	Fix some test failures caused by key being deleted due to premature expiration (#13453 ) 1. Fix fuzzer test failure when the key was deleted due to expiration before sending random traffic for the key. After HFE, when all fields in a hash are expired, the hash might be deleted due to expiration. If the key was expired in the mid of `RESTORE` command and sending rand trafic, `fuzzer` test will fail in the following code because the 'TYPE key' will return `none` and then throw an exception because it cannot be found in `$commands` `94b9072e44/tests/support/util.tcl (L712-L713)` This PR adds a `None` check for the reply of `KEY TYPE` command and adds a print of `err` to avoid false positives in the future. failed CI: https://github.com/redis/redis/actions/runs/10127334121/job/28004985388 2. Fix the issue where key was deleted due to expiration before the `scan.scan_key` command could execute, caused by premature enabling of `set-active-expire`. failed CI: https://github.com/redis/redis/actions/runs/10153722715/job/28077610552 --------- Co-authored-by: oranagra <oran@redislabs.com>	2024-07-31 08:15:39 +08:00
Oran Agra	e74550dd10	solve races in replication lpop tests (#13445 ) * some tests didn't wait for replication offset sync * tests that used deferring client, didn't wait for it to get blocked. an in some cases, the replication offset sync ended before the deferring client finished, so the digest match failed. * some tests used deferring clients excessively * the tests didn't read the client response * the tests didn't close the client (fd leak)	2024-07-25 14:06:40 +03:00
Oran Agra	447ce11a64	solve race conditions in tests (#13433 ) [exception]: Executing test client: ERR FAILOVER target replica is not online.. ERR FAILOVER target replica is not online. while executing "$node_0 failover to $node_1_host $node_1_port" ("uplevel" body line 16) invoked from within "uplevel 1 $code" (procedure "test" line 58) invoked from within "test {failover command to specific replica works} { [err]: client evicted due to percentage of maxmemory in tests/unit/client-eviction.tcl Expected 33622 >= 220200 && 33622 < 440401 (context: type eval line 17 cmd {assert {$tot_mem >= $n && $tot_mem < $maxmemory_clients_actual}} proc ::test)	2024-07-22 10:11:56 +03:00
Oran Agra	880e147d52	solve redis-cli test failures due to local history file (#13419 ) test failure: ``` [err]: Interactive CLI: should find second search result if user presses ctrl+s in tests/integration/redis-cli.tcl Expected '1' to be equal to '0' (context: type eval line 10 cmd {assert_equal 1 [regexp {$i-search$: \x1B\[0mk\x1B\[1mey\x1B\[0ms one} $result]} proc ::test) ``` this test (introduced in #12543) depends on the local history file, so it can fail if there's some match there. the fix is to use a different history file, and delete it before each run.	2024-07-15 15:24:43 +03:00
debing.sun	69b480cb7a	Hide user data from log (#13400 ) This PR is based on the commits from PR #11747. In the event of an assertion failure, hide command arguments from the operator. In some cases, private client information can be voluntarily exposed when a redis instance crashes due to an assertion failure. This commit prevent וnintentional client info exposure. Operators can still access the hidden data, but they must actively request it. Any of the client info commands remains the unchanged. ### Config Add a new config `hide-user-data-from-log` to turn this feature on and off, default off. --------- Co-authored-by: naglera <anagler123@gmail.com> Co-authored-by: naglera <58042354+naglera@users.noreply.github.com>	2024-07-09 18:54:18 +08:00
Moti Cohen	a84cc20aef	HFE - Fix statistic to count also lazy expired and rename INFO params (#13372 ) * INFO command : rename `hashes_with_expiry_fields` to `subexpiry` * INFO command : rename `expired_hash_fields` to `expired_subkeys` * Fix statistic of `expired_subkeys` to count also lazy expired * Remove TODOs comments leftover in TCL * Fix potential flaky test of rdb load of hash-field-expiration	2024-07-02 18:22:10 +03:00
Moti Cohen	f01fdc3960	Reserve 2 bits out of EB_EXPIRE_TIME_MAX for possible future use (#13331 ) Reserve 2 bits out of hash-field expiration time (`EB_EXPIRE_TIME_MAX`) for possible future lightweight indexing/categorizing of fields. It can be achieved by hacking HFE as follows: ``` HPEXPIREAT key [ 2^47 + USER_INDEX ] FIELDS numfields field [field …] ``` Redis will also need to expose kind of `HEXPIRESCAN` and `HEXPIRECOUNT` for this idea. Yet to be better defined. `HFE_MAX_ABS_TIME_MSEC` constraint must be enforced only at API level. Internally, the expiration time can be up to `EB_EXPIRE_TIME_MAX` for future readiness.	2024-06-10 16:57:26 +03:00
debing.sun	7b9e960690	Hash Field Expiration (#13303 ) ## Background This PR introduces support for field-level expiration in Redis hashes. Previously, Redis supported expiration only at the key level, but this enhancement allows setting expiration times for individual fields within a hash. ## New commands * HEXPIRE * HEXPIREAT * HEXPIRETIME * HPERSIST * HPEXPIRE * HPEXPIREAT * HPEXPIRETIME * HPTTL * HTTL ## Short example from @moticless ```sh 127.0.0.1:6379> hset myhash f1 v1 f2 v2 f3 v3 (integer) 3 127.0.0.1:6379> hpexpire myhash 10000 NX fields 2 f2 f3 1) (integer) 1 2) (integer) 1 127.0.0.1:6379> hpttl myhash fields 3 f1 f2 f3 1) (integer) -1 2) (integer) 9997 3) (integer) 9997 127.0.0.1:6379> hgetall myhash 1) "f3" 2) "v3" 3) "f2" 4) "v2" 5) "f1" 6) "v1" ... after 10 seconds ... 127.0.0.1:6379> hgetall myhash 1) "f1" 2) "v1" 127.0.0.1:6379> ``` ## Expiration strategy 1. Integrate active Redis periodically performs active expiration and deletion of hash keys that contain expired fields, with a maximum attempt limit. 3. Lazy expiration When a client touches fields within a hash, Redis checks if the fields are expired. If a field is expired, it will be deleted. However, we do not delete expired fields during a traversal, we implicitly skip over them. ## RDB changes Add two new rdb type s`RDB_TYPE_HASH_METADATA` and `RDB_TYPE_HASH_LISTPACK_EX`. ## Notification 1. Add `hpersist` notification for `HPERSIST` command. 5. Add `hexpire` notification for `HEXPIRE`, `HEXPIREAT`, `HPEXPIRE` and `HPEXPIREAT` commands. ## Internal 1. Add new data structure `ebuckets`, which is used to store TTL and keys, enabling quick retrieval of keys based on TTL. 2. Add new data structure `mstr` like sds, which is used to store a string with TTL. This work was done by @moticless, @tezc, @ronen-kalish, @sundb, I just release it.	2024-05-30 15:26:19 +08:00
Moti Cohen	33fc0fbfae	HFE to support AOF and replicas (#13285 ) * For replica sake, rewrite commands `HEXPIRE` , `HSETF`, `HGETF` to have absolute unix time in msec. * On active-expiration of field, propagate HDEL to replica (`propagateHashFieldDeletion()`) * On lazy-expiration, propagate HDEL to replica (`hashTypeGetValue()` now calls `hashTypeDelete()`. It also takes care to call `propagateHashFieldDeletion()`). * Fix `HEXPIRE` command such that if it gets flag `LT` and it doesn’t have any expiration on the field then it will considered as valid condition. Note, replicas doesn’t make any active expiration, and should avoid lazy expiration. On `hashTypeGetValue()` it doesn't check expiration (As long as the master didn’t request to delete the field, it is valid) TODO: * Attach `dbid` to HASH metadata. See [here](https://github.com/redis/redis/pull/13209#discussion_r1593385850) --------- Co-authored-by: debing.sun <debing.sun@redis.com>	2024-05-29 19:47:48 +08:00
debing.sun	95cbe879e5	sanitize dump payload for HFE (#13278 ) Add the following validations: 1. Get TTL using the lpGetIntegerValue() method instead of lpGetValue(), Ref https://github.com/redis/redis/pull/13209#discussion_r1602569422 2. The TTL of listpackex is a number in the valid range (0~EB_EXPIRE_TIME_MAX) and ordered. 3. The TTL fields of listpackex are ordered. 4. The TTL of hashtable is within the valid range (0~EB_EXPIRE_TIME_MAX). Other: Fix the missing of handling OBJ_ENCODING_LISTPACK_EX in dismissHashObject(). --------- Co-authored-by: Ozan Tezcan <ozantezcan@gmail.com>	2024-05-22 10:53:30 +08:00
Ozan Tezcan	36c3cec6d1	Fix hfe RDB tests by adding FIELDS keyword to hexpire commands (#13277 ) FIELDS keyword was added as part of [#13270](https://github.com/redis/redis/pull/13270). It was missing in [#13243](https://github.com/redis/redis/pull/13243)	2024-05-18 11:34:13 +03:00
Ronen Kalish	323be4d699	Hfe serialization listpack (#13243 ) Add RDB de/serialization for HFE This PR adds two new RDB types: `RDB_TYPE_HASH_METADATA` and `RDB_TYPE_HASH_LISTPACK_TTL` to save HFE data. When the hash RAM encoding is dict, it will be saved in the former, and when it is listpack it will be saved in the latter. Both formats just add the TTL value for each field after the data that was previously saved, i.e HASH_METADATA will save the number of entries and, for each entry, key, value and TTL, whereas listpack is saved as a blob. On read, the usual dict <--> listpack conversion takes place if required. In addition, when reading a hash that was saved as a dict fields are actively expired if expiry is due. Currently this slao holds for listpack encoding, but it is supposed to be removed. TODO: Remove active expiry on load when loading from listpack format (unless we'll decide to keep it)	2024-05-17 18:27:02 +08:00
ClaytonNorthey92	8a05f0092b	Add reverse history search in redis-cli (linenoise) (#12543 ) added reverse history search to redis-cli, use it with the following: * CTRL+R : enable search backward mode, and search next one when pressing CTRL+R again until reach index 0. ``` 127.0.0.1:6379> keys one 127.0.0.1:6379> keys two (reverse-i-search): # press CTRL+R (reverse-i-search): keys two # input `keys` (reverse-i-search): keys one # press CTRL+R again (reverse-i-search): keys one # press CTRL+R again, still `keys one` due to reaching index 0 (i-search): keys two # press CTRL+S, enable search forward (i-search): keys two # press CTRL+S, still `keys one` due to reaching index 1 ``` * CTRL+S : enable search forward mode, and search next one when pressing CTRL+S again until reach index 0. ``` 127.0.0.1:6379> keys one 127.0.0.1:6379> keys two (i-search): # press CTRL+S (i-search): keys one # input `keys` (i-search): keys two # press CTRL+S again (i-search): keys two # press CTRL+R again, still `keys two` due to reaching index 0 (reverse-i-search): keys one # press CTRL+R, enable search backward (reverse-i-search): keys one # press CTRL+S, still `keys one` due to reaching index 1 ``` * CTRL+G : disable ``` 127.0.0.1:6379> keys one 127.0.0.1:6379> keys two (reverse-i-search): # press CTRL+R (reverse-i-search): keys two # input `keys` 127.0.0.1:6379> # press CTRL+G ``` * CTRL+C : disable ``` 127.0.0.1:6379> keys one 127.0.0.1:6379> keys two (reverse-i-search): # press CTRL+R (reverse-i-search): keys two # input `keys` 127.0.0.1:6379> # press CTRL+G ``` * TAB : use the current search result and exit search mode ``` 127.0.0.1:6379> keys one 127.0.0.1:6379> keys two (reverse-i-search): # press CTRL+R (reverse-i-search): keys two # input `keys` 127.0.0.1:6379> keys two # press TAB ``` * ENTER : use the current search result and execute the command ``` 127.0.0.1:6379> keys one 127.0.0.1:6379> keys two (reverse-i-search): # press CTRL+R (reverse-i-search): keys two # input `keys` 127.0.0.1:6379> keys two # press ENTER (empty array) 127.0.0.1:6379> ``` * any arrow key will disable reverse search your result will have the search match bolded, you can press enter to execute the full result note: I have _only added this for multi-line mode_, as it seems to be forced that way when `repl` is called Closes: https://github.com/redis/redis/issues/8277 --------- Co-authored-by: Clayton Northey <clayton@knowbl.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Co-authored-by: debing.sun <debing.sun@redis.com> Co-authored-by: Bjorn Svensson <bjorn.a.svensson@est.tech> Co-authored-by: Viktor Söderqvist <viktor@zuiderkwast.se>	2024-05-10 11:10:14 +08:00
Binbin	da727ad445	Fix redis-check-aof incorrectly considering data in manifest format as MP-AOF (#12958 ) The check in fileIsManifest misjudged the manifest file. For example, if resp aof contains "file", it will be considered a manifest file and the check will fail: ``` *3 $3 set $4 file $4 file ``` In #12951, if the preamble aof also contains it, it will also fail. Fixes #12951. the bug was happening if the the word "file" is mentioned in the first 1024 lines of the AOF. and now as soon as it finds a non-comment line it'll break (if it contains "file" or doesn't)	2024-03-12 08:47:43 +02:00
Binbin	ca5cac998e	xinfo-stream add minimum to seen-time, skip logreqres in fuzzer (#13056 ) Recently I saw in CI that reply-schemas-validator fails here: ``` Failed validating 'minimum' in schema[1]['properties']['groups']['items']['properties']['consumers']['items']['properties']['active-time']: {'description': 'Last time this consumer was active (successful ' 'reading/claiming).', 'minimum': 0, 'type': 'integer'} On instance['groups'][0]['consumers'][0]['active-time']: -1729380548878722639 ``` The reason is that in fuzzer, we may restore corrupted active-time, which will cause the reply schema CI to fail. The fuzzer can cause corrupt the state in many places, which will bugs that mess up the reply, so we decided to skip logreqres. Also, seen-time is the same type as active-time, adding the minimum. --------- Co-authored-by: Oran Agra <oran@redislabs.com>	2024-02-20 12:21:10 +02:00
Madelyn Olson	8bb9a2895e	Address some failures with new tests for improving debug report (#12915 ) Fix a daily test failure because alpine doesn't support stack traces and add in an extra assertion related to making sure the stack trace was printed twice.	2024-01-08 17:56:06 -08:00
meiravgri	2e854bccc6	Fix async safety in signal handlers (#12658 ) see discussion from after https://github.com/redis/redis/pull/12453 was merged ---- This PR replaces signals that are not considered async-signal-safe (AS-safe) with safe calls. #### 1. serverLog() and serverLogFromHandler() `serverLog` uses unsafe calls. It was decided that we will avoid `serverLog` calls by the signal handlers when: * The signal is not fatal, such as SIGALRM. In these cases, we prefer using `serverLogFromHandler` which is the safe version of `serverLog`. Note they have different prompts: `serverLog`: `62220:M 26 Oct 2023 14:39:04.526 # <msg>` `serverLogFromHandler`: `62220:signal-handler (1698331136) <msg>` * The code was added recently. Calls to `serverLog` by the signal handler have been there ever since Redis exists and it hasn't caused problems so far. To avoid regression, from now we should use `serverLogFromHandler` #### 2. `snprintf` `fgets` and `strtoul`(base = 16) --------> `_safe_snprintf`, `fgets_async_signal_safe`, `string_to_hex` The safe version of `snprintf` was taken from [here](`8cfc4ca5e7/src/mc_util.c (L754)`) #### 3. fopen(), fgets(), fclose() --------> open(), read(), close() #### 4. opendir(), readdir(), closedir() --------> open(), syscall(SYS_getdents64), close() #### 5. Threads_mngr sync mechanisms * waiting for the thread to generate stack trace: semaphore --------> busy-wait * `globals_rw_lock` was removed: as we are not using malloc and the semaphore anymore we don't need to protect `ThreadsManager_cleanups`. #### 6. Stacktraces buffer The initial problem was that we were not able to safely call malloc within the signal handler. To solve that we created a buffer on the stack of `writeStacktraces` and saved it in a global pointer, assuming that under normal circumstances, the function `writeStacktraces` would complete before any thread attempted to write to it. However, if threads lag behind, they might access this global pointer after it no longer belongs to the `writeStacktraces` stack, potentially corrupting memory. To address this, various solutions were discussed [here](https://github.com/redis/redis/pull/12658#discussion_r1390442896) Eventually, we decided to create a pipe at server startup that will remain valid as long as the process is alive. We chose this solution due to its minimal memory usage, and since `write()` and `read()` are atomic operations. It ensures that stack traces from different threads won't mix. The stacktraces collection process is now as follows: * Cleaning the pipe to eliminate writes of late threads from previous runs. * Each thread writes to the pipe its stacktrace * Waiting for all the threads to mark completion or until a timeout (2 sec) is reached * Reading from the pipe to print the stacktraces. #### 7. Changes that were considered and eventually were dropped * replace watchdog timer with a POSIX timer: according to [settimer man](https://linux.die.net/man/2/setitimer) > POSIX.1-2008 marks getitimer() and setitimer() obsolete, recommending the use of the POSIX timers API ([timer_gettime](https://linux.die.net/man/2/timer_gettime)(2), [timer_settime](https://linux.die.net/man/2/timer_settime)(2), etc.) instead. However, although it is supposed to conform to POSIX std, POSIX timers API is not supported on Mac. You can take a look here at the Linux implementation: [here](`c7562ee135`) To avoid messing up the code, and uncertainty regarding compatibility, it was decided to drop it for now. * avoid using sds (uses malloc) in logConfigDebugInfo It was considered to print config info instead of using sds, however apparently, `logConfigDebugInfo` does more than just print the sds, so it was decided this fix is out of this issue scope. #### 8. fix Signal mask check The check `signum & sig_mask` intended to indicate whether the signal is blocked by the thread was incorrect. Actually, the bit position in the signal mask corresponds to the signal number. We fixed this by changing the condition to: `sig_mask & (1L << (sig_num - 1))` #### 9. Unrelated changes both `fork.tcl `and `util.tcl` implemented a function called `count_log_message` expecting different parameters. This caused confusion when trying to run daily tests with additional test parameters to run a specific test. The `count_log_message` in `fork.tcl` was removed and the calls were replaced with calls to `count_log_message` located in `util.tcl` --------- Co-authored-by: Ozan Tezcan <ozantezcan@gmail.com> Co-authored-by: Oran Agra <oran@redislabs.com>	2023-11-23 13:22:20 +02:00
Oran Agra	b810384c62	dump server longs on hang corrupt dump fuzzer test recently there are some incidents of hanged tests in the CI when we try to reproduce them, we get an assertion, not a hang. maybe the server logs will reveal some info.	2023-10-08 16:19:31 +03:00
YaacovHazan	2cf50ddbad	Fix 'load corrupted rdb with no CRC' test (#12629 ) After the change in #12626 (`2e0f6724e`), the is_alive proc gets pid and not server config. This PR aligns it in 'load corrupted rdb with no CRC' test.	2023-10-03 11:09:25 +03:00
meiravgri	4ba9e18ef0	fix crash in crash-report and other improvements (#12623 ) ## Crash fix ### Current behavior We might crash if we fail to collect some of the threads' output. If it exceeds timeout for example. The threads mngr API guarantees that the output array length will be `tids_len`, however, some indices can be NULL, in case it fails to collect some of the threads' outputs. When we use the threads mngr to collect the threads' stacktraces, we rely on this and skip NULL entries. Since the output array was allocated with malloc, instead of NULL, it contained garbage, so we got a segmentation fault when trying to read this garbage. (in debug.c:writeStacktraces() ) ### fix Allocate the global output array with zcalloc. ### To reproduce the bug, you'll have to change the code: in threadsmngr:ThreadsManager_runOnThreads(): make sure the g_output_array allocation is initialized with garbage and not 0s (add `memset(g_output_array, 2, sizeof(void) tids_len);` below the allocation). Force one of the threads to write to the array: add a global var: `static redisAtomic size_t return_now = 0;` add to `invoke_callback()` before writing to the output array: ``` size_t i_return; atomicGetIncr(return_now, i_return, 1); if(i_return == 1) return; ``` compile, start the server with `--enable-debug-command local` and run `redis-cli debug assert` The assertion triggers the the stacktrace collection. Expect to get 2 prints of the stack trace - since we get the segmentation fault after we return from the threads mngr, it can be safely triggered again. ## Added global variables r/w lock in ThreadsManager To avoid a situation where the main thread runs `ThreadsManager_cleanups` while threads are still invoking the signal handler, we use a r/w lock. For cleanups, we will acquire the write lock. The threads will acquire the read lock to enable them to write simultaneously. If we fail to acquire the read lock, it means cleanups are in progress and we return immediately. After acquiring the lock we can safely check that the global output array wasn't nullified and proceed to write to it. This way we ensure the threads are not modifying the global variables/ trying to write to the output array after they were zeroed/nullified/destroyed(the semaphore). ## other minor logging change 1. removed logging if the semaphore times out because the threads can still write to the output array after this check. Instead, we print the total number of printed stacktraces compared to the exacted number (len_tids). 2. use noinline attribute to make sure the uplevel number of ignored stack trace entries stays correct. 3. improve testing Co-authored-by: Oran Agra <oran@redislabs.com>	2023-10-02 20:02:02 +03:00
YaacovHazan	2e0f6724e0	Stabilization and improvements around aof tests (#12626 ) In some tests, the code manually searches for a log message, and it uses tail -1 with a delay of 1 second, which can miss the expected line. Also, because the aof tests use start_server_aof and not start_server, the test name doesn't log into the server log. To fix the above, I made the following changes: - Change the start_server_aof to wrap the start_server. This will add the created aof server to the servers list, and make srv() and wait_for_log_messages() available for the tests. - Introduce a new option for start_server. 'wait_ready' - an option to let the caller start the test code without waiting for the server to be ready. useful for tests on a server that is expected to exit on startup. - Create a new start_server_aof_ex. The new proc also accept options as argument and make use of the new 'short_life' option for tests that are expected to exit on startup because of some error in the aof file(s). Because of the above, I had to change many lines and replace every local srv variable (a server config) usage with the srv().	2023-10-02 08:20:53 +03:00
Binbin	9fe63bdc80	Dump server logs when corrupt fuzzer reports crash (#12612 ) Recently we found some signal crashes, but unable to reproduce them. It is a good idea to dump the server logs when a failure happens.	2023-09-27 09:08:18 +03:00
meiravgri	cc2be63997	Print stack trace from all threads in crash report (#12453 ) In this PR we are adding the functionality to collect all the process's threads' backtraces. ## Changes made in this PR ### introduce threads mngr API The threads mngr API which has 2 abilities: * `ThreadsManager_init() `- register to SIGUSR2. called on the server start-up. * ` ThreadsManager_runOnThreads()` - receives a list of a pid_t and a callback, tells every thread in the list to invoke the callback, and returns the output collected by each invocation. Elaborating atomicvar API * `atomicIncrGet(var,newvalue_var,count) `-- Increment and get the atomic counter new value * `atomicFlagGetSet` -- Get and set the atomic counter value to 1 ### Always set SIGALRM handler SIGALRM handler prints the process's stacktrace to the log file. Up until now, it was set only if the `server.watchdog_period` > 0. This can be also useful if debugging is needed. However, in situations where the server can't get requests, (a deadlock, for example) we weren't able to change the signal handler. To make it available at run time we set SIGALRM handler on server startup. The signal handler name was changed to a more general `sigalrmSignalHandler`. ### Print all the process' threads' stacktraces `logStackTrace()` now calls `writeStacktraces()`, instead of logging the current thread stacktrace. `writeStacktraces()`: * On Linux systems we use the threads manager API to collect the backtraces of all the process' threads. To get the `tids` list (threads ids) we read the `/proc/<redis-server-pid>/tasks` file which includes a list of directories. Each directory name corresponds to one tid (including the main thread). For each thread, we also need to check if it can get the signal from the threads manager (meaning it is not blocking/ignoring that signal). We send the threads manager this tids list and `collect_stacktrace_data()` callback, which collects the thread's backtrace addresses, its name, and tid. * On other systems, the behavior remained as it was (writing only the current thread stacktrace to the log file). ## compatibility notes 1. The threads mngr API is only supported in linux. 2. glibc earlier than 2.3 We use `syscall(SYS_gettid)` and `syscall(SYS_tgkill...)` because their dedicated alternatives (`gettid()` and `tgkill`) were added in glibc 2.3. ## Output example Each thread backtrace will have the following format: `<tid> <thread_name> [additional_info]` * tid: as read from the `/proc/<redis-server-pid>/tasks` file * thread_name: the tread name as it is registered in the os/ * additional_info: Sometimes we want to add specific information about one of the threads. currently. it is only used to mark the thread that handles the backtraces collection by adding "". In case of crash - this also indicates which thread caused the crash. The handling thread in won't necessarily appear first. ``` ------ STACK TRACE ------ EIP: /lib/aarch64-linux-gnu/libc.so.6(epoll_pwait+0x9c)[0xffffb9295ebc] 67089 redis-server linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xffffb9437790] /lib/aarch64-linux-gnu/libc.so.6(epoll_pwait+0x9c)[0xffffb9295ebc] redis-server :6379(+0x75e0c)[0xaaaac2fe5e0c] redis-server :6379(aeProcessEvents+0x18c)[0xaaaac2fe6c00] redis-server :6379(aeMain+0x24)[0xaaaac2fe7038] redis-server :6379(main+0xe0c)[0xaaaac3001afc] /lib/aarch64-linux-gnu/libc.so.6(+0x273fc)[0xffffb91d73fc] /lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0x98)[0xffffb91d74cc] redis-server :6379(_start+0x30)[0xaaaac2fe0370] 67093 bio_lazy_free /lib/aarch64-linux-gnu/libc.so.6(+0x79dfc)[0xffffb9229dfc] /lib/aarch64-linux-gnu/libc.so.6(pthread_cond_wait+0x208)[0xffffb922c8fc] redis-server :6379(bioProcessBackgroundJobs+0x174)[0xaaaac30976e8] /lib/aarch64-linux-gnu/libc.so.6(+0x7d5c8)[0xffffb922d5c8] /lib/aarch64-linux-gnu/libc.so.6(+0xe5d1c)[0xffffb9295d1c] 67091 bio_close_file /lib/aarch64-linux-gnu/libc.so.6(+0x79dfc)[0xffffb9229dfc] /lib/aarch64-linux-gnu/libc.so.6(pthread_cond_wait+0x208)[0xffffb922c8fc] redis-server :6379(bioProcessBackgroundJobs+0x174)[0xaaaac30976e8] /lib/aarch64-linux-gnu/libc.so.6(+0x7d5c8)[0xffffb922d5c8] /lib/aarch64-linux-gnu/libc.so.6(+0xe5d1c)[0xffffb9295d1c] 67092 bio_aof /lib/aarch64-linux-gnu/libc.so.6(+0x79dfc)[0xffffb9229dfc] /lib/aarch64-linux-gnu/libc.so.6(pthread_cond_wait+0x208)[0xffffb922c8fc] redis-server :6379(bioProcessBackgroundJobs+0x174)[0xaaaac30976e8] /lib/aarch64-linux-gnu/libc.so.6(+0x7d5c8)[0xffffb922d5c8] /lib/aarch64-linux-gnu/libc.so.6(+0xe5d1c)[0xffffb9295d1c] 67089:signal-handler (1693824528) -------- ```	2023-09-24 09:47:23 +03:00
alonre24	044e29dd34	redis-benchmark - add the support for binary strings (#9414 ) Recently, the option of sending an argument from stdin using `-x` flag was added to redis-benchmark (this option is available in redis-cli as well). However, using the `-x` option for sending a blobs that contains null-characters doesn't work as expected - the argument is trimmed in the first occurrence of `\X00` (unlike in redis-cli). This PR aims to fix this issue and add the support for every binary string input, by sending arguments length to `redisFormatCommandArgv` when processing redis-benchmark command, so we won't treat the arguments as C-strings. Additionally, we add a simple test coverage for `-x` (without binary strings, and also remove an excessive server started in tests, and make sure to select db 0 so that `r` and the benchmark work on the same db. Co-authored-by: Oran Agra <oran@redislabs.com>	2023-09-02 15:37:04 +03:00
Oran Agra	2b8cde71bb	Update supported version list. (#12488 ) Add 7.2, drop 6.0 as per https://redis.io/docs/about/releases/ Also replace a few concordances of the `’` char, with standard `'`	2023-08-16 08:36:40 +03:00
Oran Agra	f228ec1ea5	flushSlavesOutputBuffers should not write to replicas scheduled to drop (#12242 ) This will increase the size of an already large COB (one already passed the threshold for disconnection) This could also mean that we'll attempt to write that data to the socket and the replica will manage to read it, which will result in an undesired partial sync (undesired for the test)	2023-06-12 14:05:34 +03:00
binfeng-xin	38e284f106	optimize spopwithcount propagation (#12082 ) A single SPOP with command with count argument resulted in many SPOP commands being propagated to the replica. This is inefficient because the key name is repeated many times, and is also being looked-up many times. also it results in high QPS metrics on the replica. To solve that, we flush batches of 1024 fields per SPOP command. Co-authored-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>	2023-05-22 10:27:14 +03:00
zhaozhao.zz	b0dd7b3245	Free backlog only if rsi is invalid when master reboot (#12088 ) When master reboot from RDB, if rsi in RDB is valid we should not free replication backlog, even if master_repl_offset or repl-offset is 0. Since if master doesn't send any data to replicas master_repl_offset is 0, it's a valid number. A clear example: 1. start a master and apply some write commands, the master's master_repl_offset is 0 since it has no replicas. 2. stop write commands on master, and start another instance and replicaof the master, trigger an FULLRESYNC 3. the master's master_repl_offset is still 0 (set a large number for repl-ping-replica-period), do BGSAVE and restart the master 4. master load master_repl_offset from RDB's rsi and it's still 0, and we should make sure replica can partially resync with master.	2023-05-06 11:53:28 +08:00
Binbin	20533cc1d7	Tests: Do not save an RDB by default and add a SIGTERM default AOFRW test (#12064 ) In order to speed up tests, avoid saving an RDB (mostly notable on shutdown), except for tests that explicitly test the RDB mechanism In addition, use `shutdown-on-sigterm force` to prevetn shutdown from failing in case the server is in the middle of the initial AOFRW Also a a test that checks that the `shutdown-on-sigterm default` is to refuse shutdown if there's an initial AOFRW Co-authored-by: Guy Benoish <guy.benoish@redislabs.com>	2023-04-18 16:14:26 +03:00
Oran Agra	997fa41e99	Attempt to solve MacOS CI issues in GH Actions (#12013 ) The MacOS CI in github actions often hangs without any logs. GH argues that it's due to resource utilization, either running out of disk space, memory, or CPU starvation, and thus the runner is terminated. This PR contains multiple attempts to resolve this: 1. introducing pause_process instead of SIGSTOP, which waits for the process to stop before resuming the test, possibly resolving race conditions in some tests, this was a suspect since there was one test that could result in an infinite loop in that case, in practice this didn't help, but still a good idea to keep. 2. disable the `save` config in many tests that don't need it, specifically ones that use heavy writes and could create large files. 3. change the `populate` proc to use short pipeline rather than an infinite one. 4. use `--clients 1` in the macos CI so that we don't risk running multiple resource demanding tests in parallel. 5. enable `--verbose` to be repeated to elevate verbosity and print more info to stdout when a test or a server starts.	2023-04-12 09:19:21 +03:00

1 2 3 4 5 ...

388 Commits