Commit Graph

357 Commits

Author SHA1 Message Date
Yuan Wang 951ec79654
Cluster compatibility check (#13846)
CI / build-macos-latest (push) Waiting to run Details
CI / build-32bit (push) Failing after 31s Details
CI / build-libc-malloc (push) Failing after 31s Details
CI / build-debian-old (push) Failing after 1m32s Details
CI / build-old-chain-jemalloc (push) Failing after 31s Details
Codecov / code-coverage (push) Failing after 31s Details
CI / test-ubuntu-latest (push) Failing after 3m21s Details
Spellcheck / Spellcheck (push) Failing after 31s Details
CI / test-sanitizer-address (push) Failing after 6m36s Details
CI / build-centos-jemalloc (push) Failing after 6m36s Details
External Server Tests / test-external-standalone (push) Failing after 2m10s Details
Coverity Scan / coverity (push) Has been skipped Details
External Server Tests / test-external-nodebug (push) Failing after 2m12s Details
External Server Tests / test-external-cluster (push) Failing after 2m16s Details
### Background
The program runs normally in standalone mode, but migrating to cluster
mode may cause errors, this is because some cross slot commands can not
run in cluster mode. We should provide an approach to detect this issue
when running in standalone mode, and need to expose a metric which
indicates the usage of no incompatible commands.

### Solution
To avoid perf impact, we introduce a new config
`cluster-compatibility-sample-ratio` which define the sampling ratio
(0-100) for checking command compatibility in cluster mode. When a
command is executed, it is sampled at the specified ratio to determine
if it complies with Redis cluster constraints, such as cross-slot
restrictions.

A new metric is exposed: `cluster_incompatible_ops` in `info stats`
output.

The following operations will be considered incompatible operations.

- cross-slot command
   If a command has multiple cross slot keys, it is incompatible
- `swap, copy, move, select` command
These commands involve multi databases in some cases, we don't allow
multiple DB in cluster mode, so there are not compatible
- Module command with `no-cluster` flag
If a module command has `no-cluster` flag, we will encounter an error
when loading module, leading to fail to load module if cluster is
enabled, so this is incompatible.
- Script/function with `no-cluster` flag
Similar with module command, if we declare `no-cluster` in shebang of
script/function, we also can not run it in cluster mode
- `sort` command by/get pattern
When `sort` command has `by/get` pattern option, we must ask that the
pattern slot is equal with the slot of keys, otherwise it is
incompatible in cluster mode.

- The script/function command accesses the keys and declared keys have
different slots
For the script/function command, we not only check the slot of declared
keys, but only check the slot the accessing keys, if they are different,
we think it is incompatible.

**Besides**, commands like `keys, scan, flushall, script/function
flush`, that in standalone mode iterate over all data to perform the
operation, are only valid for the server that executes the command in
cluster mode and are not broadcasted. However, this does not lead to
errors, so we do not consider them as incompatible commands.

### Performance impact test
**cross slot test**
Below are the test commands and results. When using MSET with 8 keys,
performance drops by approximately 3%.

**single key test**
It may be due to the overhead of the sampling function, and single-key
commands could cause a 1-2% performance drop.
2025-03-20 10:35:53 +08:00
Alexander Gorbulya 17eb33e0c3
Fix typo in repl-ping-replica-period comment in redis.conf (#13782)
The comment for the `repl-ping-replica-period` option in `redis.conf`
mistakenly refers to `repl_ping_replica_period` (with underscores).
This PR corrects it to use the proper format with dashes, as per the
actual configuration option.
2025-02-06 10:14:56 +08:00
Ozan Tezcan 73a9b916c9
Rdb channel replication (#13732)
This PR is based on:

https://github.com/redis/redis/pull/12109
https://github.com/valkey-io/valkey/pull/60

Closes: https://github.com/redis/redis/issues/11678

**Motivation**

During a full sync, when master is delivering RDB to the replica,
incoming write commands are kept in a replication buffer in order to be
sent to the replica once RDB delivery is completed. If RDB delivery
takes a long time, it might create memory pressure on master. Also, once
a replica connection accumulates replication data which is larger than
output buffer limits, master will kill replica connection. This may
cause a replication failure.

The main benefit of the rdb channel replication is streaming incoming
commands in parallel to the RDB delivery. This approach shifts
replication stream buffering to the replica and reduces load on master.
We do this by opening another connection for RDB delivery. The main
channel on replica will be receiving replication stream while rdb
channel is receiving the RDB.

This feature also helps to reduce master's main process CPU load. By
opening a dedicated connection for the RDB transfer, the bgsave process
has access to the new connection and it will stream RDB directly to the
replicas. Before this change, due to TLS connection restriction, the
bgsave process was writing RDB bytes to a pipe and the main process was
forwarding
it to the replica. This is no longer necessary, the main process can
avoid these expensive socket read/write syscalls. It also means RDB
delivery to replica will be faster as it avoids this step.

In summary, replication will be faster and master's performance during
full syncs will improve.


**Implementation steps**

1. When replica connects to the master, it sends 'rdb-channel-repl' as
part of capability exchange to let master to know replica supports rdb
channel.
2. When replica lacks sufficient data for PSYNC, master sends
+RDBCHANNELSYNC reply with replica's client id. As the next step, the
replica opens a new connection (rdb-channel) and configures it against
the master with the appropriate capabilities and requirements. It also
sends given client id back to master over rdbchannel, so that master can
associate these channels. (initial replica connection will be referred
as main-channel) Then, replica requests fullsync using the RDB channel.
3. Prior to forking, master attaches the replica's main channel to the
replication backlog to deliver replication stream starting at the
snapshot end offset.
4. The master main process sends replication stream via the main
channel, while the bgsave process sends the RDB directly to the replica
via the rdb-channel. Replica accumulates replication stream in a local
buffer, while the RDB is being loaded into the memory.
5. Once the replica completes loading the rdb, it drops the rdb channel
and streams the accumulated replication stream into the db. Sync is
completed.

**Some details**
- Currently, rdbchannel replication is supported only if
`repl-diskless-sync` is enabled on master. Otherwise, replication will
happen over a single connection as in before.
- On replica, there is a limit to replication stream buffering. Replica
uses a new config `replica-full-sync-buffer-limit` to limit number of
bytes to accumulate. If it is not set, replica inherits
`client-output-buffer-limit <replica>` hard limit config. If we reach
this limit, replica stops accumulating. This is not a failure scenario
though. Further accumulation will happen on master side. Depending on
the configured limits on master, master may kill the replica connection.

**API changes in INFO output:**

1. New replica state: `send_bulk_and_stream`. Indicates full sync is
still in progress for this replica. It is receiving replication stream
and rdb in parallel.
```
slave0:ip=127.0.0.1,port=5002,state=send_bulk_and_stream,offset=0,lag=0
```
Replica state changes in steps:
- First, replica sends psync and receives +RDBCHANNELSYNC
:`state=wait_bgsave`
- After replica connects with rdbchannel and delivery starts:
`state=send_bulk_and_stream`
 - After full sync: `state=online`

2. On replica side, replication stream buffering metrics:
- replica_full_sync_buffer_size: Currently accumulated replication
stream data in bytes.
- replica_full_sync_buffer_peak: Peak number of bytes that this instance
accumulated in the lifetime of the process.

```
replica_full_sync_buffer_size:20485             
replica_full_sync_buffer_peak:1048560
```

**API changes in CLIENT LIST**

In `client list` output, rdbchannel clients will have 'C' flag in
addition to 'S' replica flag:
```
id=11 addr=127.0.0.1:39108 laddr=127.0.0.1:5001 fd=14 name= age=5 idle=5 flags=SC db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1920 events=r cmd=psync user=default redir=-1 resp=2 lib-name= lib-ver= io-thread=0
```

**Config changes:**
- `replica-full-sync-buffer-limit`: Controls how much replication data
replica can accumulate during rdbchannel replication. If it is not set,
a value of 0 means replica will inherit `client-output-buffer-limit
<replica>` hard limit config to limit accumulated data.
- `repl-rdb-channel` config is added as a hidden config. This is mostly
for testing as we need to support both rdbchannel replication and the
older single connection replication (to keep compatibility with older
versions and rdbchannel replication will not be enabled if
repl-diskless-sync is not enabled). it affects both the master (not to
respond to rdb channel requests), and the replica (not to declare
capability)

**Internal API changes:**
Changes that were introduced to Redis replication:
- New replication capability is added to replconf command: `capa
rdb-channel-repl`. Indicates replica is capable of rdb channel
replication. Replica sends it when it connects to master along with
other capabilities.
- If replica needs fullsync, master replies `+RDBCHANNELSYNC
<client-id>` to the replica's PSYNC request.
- When replica opens rdbchannel connection, as part of replconf command,
it sends `rdb-channel 1` to let master know this is rdb channel. Also,
it sends `main-ch-client-id <client-id>` as part of replconf command so
master can associate channels.
  
**Testing:**
As rdbchannel replication is enabled by default, we run whole test suite
with it. Though, as we need to support both rdbchannel and single
connection replication, we'll be running some tests twice with
`repl-rdb-channel yes/no` config.

**Replica state diagram**
```
* * Replica state machine *
 *
 * Main channel state
 * ┌───────────────────┐
 * │RECEIVE_PING_REPLY │
 * └────────┬──────────┘
 *          │ +PONG
 * ┌────────▼──────────┐
 * │SEND_HANDSHAKE     │                     RDB channel state
 * └────────┬──────────┘            ┌───────────────────────────────┐
 *          │+OK                ┌───► RDB_CH_SEND_HANDSHAKE         │
 * ┌────────▼──────────┐        │   └──────────────┬────────────────┘
 * │RECEIVE_AUTH_REPLY │        │    REPLCONF main-ch-client-id <clientid>
 * └────────┬──────────┘        │   ┌──────────────▼────────────────┐
 *          │+OK                │   │ RDB_CH_RECEIVE_AUTH_REPLY     │
 * ┌────────▼──────────┐        │   └──────────────┬────────────────┘
 * │RECEIVE_PORT_REPLY │        │                  │ +OK
 * └────────┬──────────┘        │   ┌──────────────▼────────────────┐
 *          │+OK                │   │  RDB_CH_RECEIVE_REPLCONF_REPLY│
 * ┌────────▼──────────┐        │   └──────────────┬────────────────┘
 * │RECEIVE_IP_REPLY   │        │                  │ +OK
 * └────────┬──────────┘        │   ┌──────────────▼────────────────┐
 *          │+OK                │   │ RDB_CH_RECEIVE_FULLRESYNC     │
 * ┌────────▼──────────┐        │   └──────────────┬────────────────┘
 * │RECEIVE_CAPA_REPLY │        │                  │+FULLRESYNC
 * └────────┬──────────┘        │                  │Rdb delivery
 *          │                   │   ┌──────────────▼────────────────┐
 * ┌────────▼──────────┐        │   │ RDB_CH_RDB_LOADING            │
 * │SEND_PSYNC         │        │   └──────────────┬────────────────┘
 * └─┬─────────────────┘        │                  │ Done loading
 *   │PSYNC (use cached-master) │                  │
 * ┌─▼─────────────────┐        │                  │
 * │RECEIVE_PSYNC_REPLY│        │    ┌────────────►│ Replica streams replication
 * └─┬─────────────────┘        │    │             │ buffer into memory
 *   │                          │    │             │
 *   │+RDBCHANNELSYNC client-id │    │             │
 *   ├──────┬───────────────────┘    │             │
 *   │      │ Main channel           │             │
 *   │      │ accumulates repl data  │             │
 *   │   ┌──▼────────────────┐       │     ┌───────▼───────────┐
 *   │   │ REPL_TRANSFER     ├───────┘     │    CONNECTED      │
 *   │   └───────────────────┘             └────▲───▲──────────┘
 *   │                                          │   │
 *   │                                          │   │
 *   │  +FULLRESYNC    ┌───────────────────┐    │   │
 *   ├────────────────► REPL_TRANSFER      ├────┘   │
 *   │                 └───────────────────┘        │
 *   │  +CONTINUE                                   │
 *   └──────────────────────────────────────────────┘
 */
 ```
 -----
 This PR also contains changes and ideas from: 
https://github.com/valkey-io/valkey/pull/837
https://github.com/valkey-io/valkey/pull/1173
https://github.com/valkey-io/valkey/pull/804
https://github.com/valkey-io/valkey/pull/945
https://github.com/valkey-io/valkey/pull/989
---------

Co-authored-by: Yuan Wang <wangyuancode@163.com>
Co-authored-by: debing.sun <debing.sun@redis.com>
Co-authored-by: Moti Cohen <moticless@gmail.com>
Co-authored-by: naglera <anagler123@gmail.com>
Co-authored-by: Amit Nagler <58042354+naglera@users.noreply.github.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Ping Xie <pingxie@outlook.com>
Co-authored-by: Ran Shidlansik <ranshid@amazon.com>
Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com>
Co-authored-by: xbasel <103044017+xbasel@users.noreply.github.com>
2025-01-13 15:09:52 +03:00
Yuan Wang 64a40b20d9
Async IO Threads (#13695)
## Introduction
Redis introduced IO Thread in 6.0, allowing IO threads to handle client
request reading, command parsing and reply writing, thereby improving
performance. The current IO thread implementation has a few drawbacks.
- The main thread is blocked during IO thread read/write operations and
must wait for all IO threads to complete their current tasks before it
can continue execution. In other words, the entire process is
synchronous. This prevents the efficient utilization of multi-core CPUs
for parallel processing.

- When the number of clients and requests increases moderately, it
causes all IO threads to reach full CPU utilization due to the busy wait
mechanism used by the IO threads. This makes it challenging for us to
determine which part of Redis has reached its bottleneck.

- When IO threads are enabled with TLS and io-threads-do-reads, a
disconnection of a connection with pending data may result in it being
assigned to multiple IO threads simultaneously. This can cause race
conditions and trigger assertion failures. Related issue:
redis#12540

Therefore, we designed an asynchronous IO threads solution. The IO
threads adopt an event-driven model, with the main thread dedicated to
command processing, meanwhile, the IO threads handle client read and
write operations in parallel.

## Implementation
### Overall
As before, we did not change the fact that all client commands must be
executed on the main thread, because Redis was originally designed to be
single-threaded, and processing commands in a multi-threaded manner
would inevitably introduce numerous race and synchronization issues. But
now each IO thread has independent event loop, therefore, IO threads can
use a multiplexing approach to handle client read and write operations,
eliminating the CPU overhead caused by busy-waiting.

the execution process can be briefly described as follows:
the main thread assigns clients to IO threads after accepting
connections, IO threads will notify the main thread when clients
finish reading and parsing queries, then the main thread processes
queries from IO threads and generates replies, IO threads handle
writing reply to clients after receiving clients list from main thread,
and then continue to handle client read and write events.

### Each IO thread has independent event loop
We now assign each IO thread its own event loop. This approach
eliminates the need for the main thread to perform the costly
`epoll_wait` operation for handling connections (except for specific
ones). Instead, the main thread processes requests from the IO threads
and hands them back once completed, fully offloading read and write
events to the IO threads.

Additionally, all TLS operations, including handling pending data, have
been moved entirely to the IO threads. This resolves the issue where
io-threads-do-reads could not be used with TLS.

### Event-notified client queue
To facilitate communication between the IO threads and the main thread,
we designed an event-notified client queue. Each IO thread and the main
thread have two such queues to store clients waiting to be processed.
These queues are also integrated with the event loop to enable handling.
We use pthread_mutex to ensure the safety of queue operations, as well
as data visibility and ordering, and race conditions are minimized, as
each IO thread and the main thread operate on independent queues,
avoiding thread suspension due to lock contention. And we implemented an
event notifier based on `eventfd` or `pipe` to support event-driven
handling.

### Thread safety
Since the main thread and IO threads can execute in parallel, we must
handle data race issues carefully.

**client->flags**
The primary tasks of IO threads are reading and writing, i.e.
`readQueryFromClient` and `writeToClient`. However, IO threads and the
main thread may concurrently modify or access `client->flags`, leading
to potential race conditions. To address this, we introduced an io-flags
variable to record operations performed by IO threads, thereby avoiding
race conditions on `client->flags`.

**Pause IO thread**
In the main thread, we may want to operate data of IO threads, maybe
uninstall event handler, access or operate query/output buffer or resize
event loop, we need a clean and safe context to do that. We pause IO
thread in `IOThreadBeforeSleep`, do some jobs and then resume it. To
avoid thread suspended, we use busy waiting to confirm the target
status. Besides we use atomic variable to make sure memory visibility
and ordering. We introduce these functions to pause/resume IO Threads as
below.
```
pauseIOThread, resumeIOThread
pauseAllIOThreads, resumeAllIOThreads
pauseIOThreadsRange, resumeIOThreadsRange
```
Testing has shown that `pauseIOThread` is highly efficient, allowing the
main thread to execute nearly 200,000 operations per second during
stress tests. Similarly, `pauseAllIOThreads` with 8 IO threads can
handle up to nearly 56,000 operations per second. But operations
performed between pausing and resuming IO threads must be quick;
otherwise, they could cause the IO threads to reach full CPU
utilization.

**freeClient and freeClientAsync**
The main thread may need to terminate a client currently running on an
IO thread, for example, due to ACL rule changes, reaching the output
buffer limit, or evicting a client. In such cases, we need to pause the
IO thread to safely operate on the client.

**maxclients and maxmemory-clients updating**
When adjusting `maxclients`, we need to resize the event loop for all IO
threads. Similarly, when modifying `maxmemory-clients`, we need to
traverse all clients to calculate their memory usage. To ensure safe
operations, we pause all IO threads during these adjustments.

**Client info reading**
The main thread may need to read a client’s fields to generate a
descriptive string, such as for the `CLIENT LIST` command or logging
purposes. In such cases, we need to pause the IO thread handling that
client. If information for all clients needs to be displayed, all IO
threads must be paused.

**Tracking redirect**
Redis supports the tracking feature and can even send invalidation
messages to a connection with a specified ID. But the target client may
be running on IO thread, directly manipulating the client’s output
buffer is not thread-safe, and the IO thread may not be aware that the
client requires a response. In such cases, we pause the IO thread
handling the client, modify the output buffer, and install a write event
handler to ensure proper handling.

**clientsCron**
In the `clientsCron` function, the main thread needs to traverse all
clients to perform operations such as timeout checks, verifying whether
they have reached the soft output buffer limit, resizing the
output/query buffer, or updating memory usage. To safely operate on a
client, the IO thread handling that client must be paused.
If we were to pause the IO thread for each client individually, the
efficiency would be very low. Conversely, pausing all IO threads
simultaneously would be costly, especially when there are many IO
threads, as clientsCron is invoked relatively frequently.
To address this, we adopted a batched approach for pausing IO threads.
At most, 8 IO threads are paused at a time. The operations mentioned
above are only performed on clients running in the paused IO threads,
significantly reducing overhead while maintaining safety.

### Observability
In the current design, the main thread always assigns clients to the IO
thread with the least clients. To clearly observe the number of clients
handled by each IO thread, we added the new section in INFO output. The
`INFO THREADS` section can show the client count for each IO thread.
```
# Threads
io_thread_0:clients=0
io_thread_1:clients=2
io_thread_2:clients=2
```

Additionally, in the `CLIENT LIST` output, we also added a field to
indicate the thread to which each client is assigned.

`id=244 addr=127.0.0.1:41870 laddr=127.0.0.1:6379 ... resp=2 lib-name=
lib-ver= io-thread=1`

## Trade-off
### Special Clients
For certain special types of clients, keeping them running on IO threads
would result in severe race issues that are difficult to resolve.
Therefore, we chose not to offload these clients to the IO threads.

For replica, monitor, subscribe, and tracking clients, main thread may
directly write them a reply when conditions are met. Race issues are
difficult to resolve, so we have them processed in the main thread. This
includes the Lua debug clients as well, since we may operate connection
directly.

For blocking client, after the IO thread reads and parses a command and
hands it over to the main thread, if the client is identified as a
blocking type, it will be remained in the main thread. Once the blocking
operation completes and the reply is generated, the client is
transferred back to the IO thread to send the reply and wait for event
triggers.

### Clients Eviction
To support client eviction, it is necessary to update each client’s
memory usage promptly during operations such as read, write, or command
execution. However, when a client operates on an IO thread, it is not
feasible to update the memory usage immediately due to the risk of data
races. As a result, memory usage can only be updated either in the main
thread while processing commands or in the `ClientsCron` periodically.
The downside of this approach is that updates might experience a delay
of up to one second, which could impact the precision of memory
management for eviction.

To avoid incorrectly evicting clients. We adopted a best-effort
compensation solution, when we decide to eviction a client, we update
its memory usage again before evicting, if the memory used by the client
does not decrease or memory usage bucket is not changed, then we will
evict it, otherwise, not evict it.

However, we have not completely solved this problem. Due to the delay in
memory usage updates, it may lead us to make incorrect decisions about
the need to evict clients.

### Defragment
In the majority of cases we do NOT use the data from argv directly in
the db.
1. key names
We store a copy that we allocate in the main thread, see `sdsdup()` in
`dbAdd()`.
2. hash key and value
We store key as hfield and store value as sds, see `hfieldNew()` and
`sdsdup()` in `hashTypeSet()`.
3. other datatypes
   They don't even use SDS, so there is no reference issues.

But in some cases client the data from argv may be retain by the main
thread.
As a result, during fragmentation cleanup, we need to move allocations
from the IO thread’s arena to the main thread’s arena. We always
allocate new memory in the main thread’s arena, but the memory released
by IO threads may not yet have been reclaimed. This ultimately causes
the fragmentation rate to be higher compared to creating and allocating
entirely within a single thread.
The following cases below will lead to memory allocated by the IO thread
being kept by the main thread.
1. string related command: `append`, `getset`, `mset` and `set`.
If `tryObjectEncoding()` does not change argv, we will keep it directly
in the main thread, see the code in `tryObjectEncoding()`(specifically
`trimStringObjectIfNeeded()`)
2. block related command.
    the key names will be kept in `c->db->blocking_keys`.
3. watch command
    the key names will be kept in `c->db->watched_keys`.
4. [s]subscribe command
    channel name will be kept in `serverPubSubChannels`.
5. script load command
    script will be kept in `server.lua_scripts`.
7. some module API: `RM_RetainString`, `RM_HoldString`

Those issues will be handled in other PRs.

## Testing
### Functional Testing
The commit with enabling IO Threads has passed all TCL tests, but we did
some changes:
**Client query buffer**: In the original code, when using a reusable
query buffer, ownership of the query buffer would be released after the
command was processed. However, with IO threads enabled, the client
transitions from an IO thread to the main thread for processing. This
causes the ownership release to occur earlier than the command
execution. As a result, when IO threads are enabled, the client's
information will never indicate that a shared query buffer is in use.
Therefore, we skip the corresponding query buffer tests in this case.
**Defragment**: Add a new defragmentation test to verify the effect of
io threads on defragmentation.
**Command delay**: For deferred clients in TCL tests, due to clients
being assigned to different threads for execution, delays may occur. To
address this, we introduced conditional waiting: the process proceeds to
the next step only when the `client list` contains the corresponding
commands.

### Sanitizer Testing
The commit passed all TCL tests and reported no errors when compiled
with the `fsanitizer=thread` and `fsanitizer=address` options enabled.
But we made the following modifications: we suppressed the sanitizer
warnings for clients with watched keys when updating `client->flags`, we
think IO threads read `client->flags`, but never modify it or read the
`CLIENT_DIRTY_CAS` bit, main thread just only modifies this bit, so
there is no actual data race.

## Others
### IO thread number
In the new multi-threaded design, the main thread is primarily focused
on command processing to improve performance. Typically, the main thread
does not handle regular client I/O operations but is responsible for
clients such as replication and tracking clients. To avoid breaking
changes, we still consider the main thread as the first IO thread.

When the io-threads configuration is set to a low value (e.g., 2),
performance does not show a significant improvement compared to a
single-threaded setup for simple commands (such as SET or GET), as the
main thread does not consume much CPU for these simple operations. This
results in underutilized multi-core capacity. However, for more complex
commands, having a low number of IO threads may still be beneficial.
Therefore, it’s important to adjust the `io-threads` based on your own
performance tests.

Additionally, you can clearly monitor the CPU utilization of the main
thread and IO threads using `top -H -p $redis_pid`. This allows you to
easily identify where the bottleneck is. If the IO thread is the
bottleneck, increasing the `io-threads` will improve performance. If the
main thread is the bottleneck, the overall performance can only be
scaled by increasing the number of shards or replicas.

---------

Co-authored-by: debing.sun <debing.sun@redis.com>
Co-authored-by: oranagra <oran@redislabs.com>
2024-12-23 14:16:40 +08:00
Ali 05b99c8f4c
Fix typo in redis.conf (#12634)
unnecessarily and repetitive "OR"
2024-11-22 20:29:17 +08:00
David Dougherty 9906daf5c9
Update old links for modules-api-ref.md (#13479)
This PR replaces old .../topics/... links with current links,
specifically for the modules-api-ref.md file and the new automation that
Paolo Lazzari is working on. A few of the topics links have redirects,
but some don't. Best to use updated links.
2024-11-04 18:18:22 +02:00
debing.sun 69b480cb7a
Hide user data from log (#13400)
This PR is based on the commits from PR #11747.

In the event of an assertion failure, hide command arguments from the
operator.

In some cases, private client information can be voluntarily exposed
when a redis instance crashes due to an assertion failure.
This commit prevent וnintentional client info exposure.
Operators can still access the hidden data, but they must actively
request it.
Any of the client info commands remains the unchanged.

### Config
Add a new config `hide-user-data-from-log` to turn this feature on and
off, default off.

---------

Co-authored-by: naglera <anagler123@gmail.com>
Co-authored-by: naglera <58042354+naglera@users.noreply.github.com>
2024-07-09 18:54:18 +08:00
Binbin acd9605223
Fix maxmemory-samples stack overflow crash in evictionPoolPopulate, limit its value to [1,64] (#13000)
We have not limited the value of maxmemory-samples in the past, it can
be set very large. If it is set very large, we will have stack overflow
in evictionPoolPopulate when we trigger the key eviction.

There is no reason for this config to be set too high, so just limit its
range to [1,64].
2024-01-29 10:38:52 +02:00
zhaozhao.zz 85a834bfa2
Revert multi OOM limit and add multi buffer limit (#12961)
Fix #9926 , and introduce an alternative method to prevent abuse of
transactions:

1. revert #5454 (which was blocking read-only transactions in OOM
state), and break the tie of MULTI state memory usage and the server OOM
state. Meaning that we'll limit the total memory a single client can
queue, and do that unconditionally regardless of the server being OOM or
not.
2. to prevent abuse of transactions, we use the
`client-query-buffer-limit` to restrict the size of the transaction.
Because the commands cached in the MULTI/EXEC queue have not been
executed yet, so they are also considered a part of the "query buffer"
in a broader sense. In other words, the commands in the MULTI queue and
the `querybuf` of the client together constitute the "query buffer".
When they exceed the limit, the connection will be disconnected.

The reasoning is that it's sensible to sends a single command with a
huge (1GB) argument, and it's sensible to sends a transaction with many
small commands, but it's probably not common to sends a long transaction
with many huge arguments (will consume a lot of memory before even being
executed).

If anyone runs into that, they can simply increase the
`client-query-buffer-limit` config.

P.S. To prevent DDoS attacks, unauthenticated clients have a separate
hard limit. Their query buffer should not exceed a maximum of 1MB. In
other words, if the query buffer of an unauthenticated client exceeds
1MB or the `client-query-buffer-limit` (if it is set to a value smaller
than 1MB,), the connection will be disconnected.
2024-01-25 11:17:39 +02:00
AshMosh c3f8b542ee
Manage number of new connections per cycle (#12178)
There are situations (especially in TLS) in which the engine gets too occupied managing a large number of new connections. Existing connections may time-out while the server is processing the new connections initial TLS handshakes, which may cause cause new connections to be established, perpetuating the problem. To better manage the tradeoff between new connection rate and other workloads, this change adds a new config to manage maximum number of new connections per event loop cycle, instead of using a predetermined number (currently 1000).

This change introduces two new configurations, max-new-connections-per-cycle and max-new-tls-connections-per-cycle. The default value of the tcp connections is 10 per cycle and the default value of tls connections per cycle is 1.
---------

Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
2024-01-02 15:15:03 -08:00
Binbin c85a9b7896
Fix delKeysInSlot server events are not executed inside an execution unit (#12745)
This is a follow-up fix to #12733. We need to apply the same changes to
delKeysInSlot. Refer to #12733 for more details.

This PR contains some other minor cleanups / improvements to the test
suite and docs.
It uses the postnotifications test module in a cluster mode test which
revealed a leak in the test module (fixed).
2023-12-11 20:15:19 +02:00
zhaozhao.zz 3431b1f156
format cpu config as redis style (#7351)
The following four configurations are renamed to align with Redis style:

1. server_cpulist renamed to server-cpulist
2. bio_cpulist renamed to bio-cpulist
3. aof_rewrite_cpulist renamed to aof-rewrite-cpulist
4. bgsave_cpulist renamed to bgsave-cpulist

The original names are retained as aliases to ensure compatibility with
old configuration files. We recommend users to gradually transition to
using the new configuration names to maintain consistency in style.
2023-11-29 13:40:06 +08:00
Danilo Bargen a6eff389b5
redis.conf: Add data loss warning to "appendonly" (#12506)
warning against editing the config file and restarting the server.
which will attempt to load an AOF file and disregard the RDB.

Co-authored-by: Oran Agra <oran@redislabs.com>
2023-08-22 18:15:47 +03:00
Wen Hui 070453eef3
Cluster human readable nodename feature (#9564)
This PR adds a human readable name to a node in clusters that are visible as part of error logs. This is useful so that admins and operators of Redis cluster have better visibility into failures without having to cross-reference the generated ID with some logical identifier (such as pod-ID or EC2 instance ID). This is mentioned in #8948. Specific nodenames can be set by using the variable cluster-announce-human-nodename. The nodename is gossiped using the clusterbus extension in #9530.

Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
2023-06-17 21:16:51 -07:00
zhaozhao.zz 07ea220419
add a new loglevel 'nothing' to disable logging (#12133)
Users can record logs of different levels by setting the `loglevel`.
However, sometimes there are many logs even at the warning level,
which can affect the performance of Redis.

For example, when a user accesses the tls-port using a non-encrypted link,
Redis will log lots of "# Error accepting a client connection: ...".

We can provide the ability to disable logging so that users can temporarily turn
off logging and turn it back on after the problem is resolved.
2023-05-23 18:30:44 +03:00
Binbin 0b159b34ea
Bump codespell to 2.2.4, fix typos and outupdated comments (#11911)
Fix some seen typos and wrong comments.
2023-03-16 08:50:32 +02:00
Viktor Söderqvist 4e472a1a7f
Listpack encoding for sets (#11290)
Small sets with not only integer elements are listpack encoded, by default
up to 128 elements, max 64 bytes per element, new config `set-max-listpack-entries`
and `set-max-listpack-value`. This saves memory for small sets compared to using a hashtable.

Sets with only integers, even very small sets, are still intset encoded (up to 1G
limit, etc.). Larger sets are hashtable encoded.

This PR increments the RDB version, and has an effect on OBJECT ENCODING

Possible conversions when elements are added:

    intset -> listpack
    listpack -> hashtable
    intset -> hashtable

Note: No conversion happens when elements are deleted. If all elements are
deleted and then added again, the set is deleted and recreated, thus implicitly
converted to a smaller encoding.
2022-11-09 19:50:07 +02:00
Taishi Kasuga f609a4eda7
Fix non-existent directive name at comments in redis.conf, s/cluster-tls/tls-cluster/g (#11364) 2022-10-08 10:16:47 +03:00
Binbin bb6513cbba
ACL default newly created user set USER_FLAG_SANITIZE_PAYLOAD flag (#11279)
Starting from 6.2, after ACL SETUSER user reset, the user
will carry the sanitize-payload flag. It was added in #7807,
and then ACL SETUSER reset is inconsistent with default
newly created user which missing sanitize-payload flag.

Same as `off` and `on` these two bits are mutually exclusive,
the default created user needs to have sanitize-payload flag.
Adds USER_FLAG_SANITIZE_PAYLOAD flag to ACLCreateUser.

Note that the bug don't have any real implications,
since the code in rdb.c (rdbLoadObject) checks for
`USER_FLAG_SANITIZE_PAYLOAD_SKIP`, so the fact that
`USER_FLAG_SANITIZE_PAYLOAD` is missing doesn't really matters.

Added tests to make sure it won't be broken in the future,
and updated the comment in ACLSetUser and redis.conf
2022-09-22 09:13:39 +03:00
Eduardo Semprebon 36abc0fa8f
Improve redis.conf documentation on repl-diskless-load (#11213)
Just noticed that there are some inaccurate, or at least confusing information about `repl-diskless-load` in `redis.conf`
It shouldn't scare away users willing to spend the extra memory.

`may mean that we have to flush the contents of the current database before the full rdb was received.`: this is likely related to the time when there was an option `always`, where content on replica was flushed before loading from master.
2022-09-11 11:22:59 +03:00
yourtree ca6aeadfbe
Support setlocale via CONFIG operation. (#11059)
Till now Redis officially supported tuning it via environment variable see #1074.
But we had other requests to allow changing it at runtime, see #799, and #11041.

Note that `strcoll()` is used as Lua comparison function and also for comparison of
certain string objects in Redis, which leads to a problem that, in different regions,
for some characters, the result may be different. Below is an example.
```
127.0.0.1:6333> SORT test alpha
1) "<"
2) ">"
3) ","
4) "*"
127.0.0.1:6333> CONFIG GET locale-collate
1) "locale-collate"
2) ""
127.0.0.1:6333> CONFIG SET locale-collate 1
(error) ERR CONFIG SET failed (possibly related to argument 'locale')
127.0.0.1:6333> CONFIG SET locale-collate C
OK
127.0.0.1:6333> SORT test alpha
1) "*"
2) ","
3) "<"
4) ">"
```
That will cause accidental code compatibility issues for Lua scripts and some
Redis commands. This commit creates a new config parameter to control the
local environment which only affects `Collate` category. Above shows how it
affects `SORT` command, and below shows the influence on Lua scripts.
```
127.0.0.1:6333> CONFIG GET locale-collate
1) " locale-collate"
2) "C"
127.0.0.1:6333> EVAL "return ',' < '*'" 0
(nil)
127.0.0.1:6333> CONFIG SET locale-collate ""
OK
127.0.0.1:6333> EVAL "return ',' < '*'" 0
(integer) 1
```

Co-authored-by: calvincjli <calvincjli@tencent.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
2022-08-21 17:55:45 +03:00
Binbin 1f600efd01
Fix outdated lfu-decay-time doc in redis.conf (#11108)
The divided by two and less <= 10 logics were changed in 06ca9d6839.
Now we just decrement the counter by num_periods.

The lfu-decay-time special value of 0 's meaning was actually changed in 06ca9d6839.
Now we won't do anything on counter if lfu-decay-time is 0.
2022-08-14 10:50:31 +03:00
Binbin 8203461120
Fix some outdated comments and some typo (#10946)
* Fix some outdated comments and some typo
2022-07-06 20:31:59 -07:00
Eduardo Semprebon 3a1d14259d
Allow configuring signaled shutdown flags (#10594)
The SHUTDOWN command has various flags to change it's default behavior,
but in some cases establishing a connection to redis is complicated and it's easier
for the management software to use signals. however, so far the signals could only
trigger the default shutdown behavior.
Here we introduce the option to control shutdown arguments for SIGTERM and SIGINT.

New config options:
`shutdown-on-sigint [nosave | save] [now] [force]` 
`shutdown-on-sigterm [nosave | save] [now] [force]`

Implementation:
Support MULTI_ARG_CONFIG on createEnumConfig to support multiple enums to be applied as bit flags.

Co-authored-by: Oran Agra <oran@redislabs.com>
2022-04-26 14:34:04 +03:00
Madelyn Olson 6fa8e4f7af
Set replicas to panic on disk errors, and optionally panic on replication errors (#10504)
* Till now, replicas that were unable to persist, would still execute the commands
  they got from the master, now they'll panic by default, and we add a new
  `replica-ignore-disk-errors` config to change that.
* Till now, when a command failed on a replica or AOF-loading, it only logged a
  warning and a stat, we add a new `propagation-error-behavior` config to allow
  panicking in that state (may become the default one day)

Note that commands that fail on the replica can either indicate a bug that could
cause data inconsistency between the replica and the master, or they could be
in some cases (specifically in previous versions), a result of a command (e.g. EVAL)
that failed on the master, but still had to be propagated to fail on the replica as well.
2022-04-26 13:25:33 +03:00
David CARLIER aba2865c86
Add socket-mark-id support for marking sockets. (#10349)
Add a configuration option to attach an operating system-specific identifier to Redis sockets, supporting advanced network configurations using iptables (Linux) or ipfw (FreeBSD).
2022-04-20 09:29:37 +03:00
Luke Palmer bb7891f080
Keyspace event for new keys (#10512)
Add an optional keyspace event when new keys are added to the db.

This is useful for applications where clients need to be aware of the redis keyspace.
Such an application can SCAN once at startup and then listen for "new" events (plus
others associated with DEL, RENAME, etc).
2022-04-13 11:36:38 +03:00
Binbin f25e688e2a
Cleanups in redis.conf (#10452)
Did some cleanups:
1. local local typo
2. replace the only slave word in the file
3. add FUNCTION FLUSH to `lazyfree-lazy-user-flush` description
4. thought it would be better to use these, there are actually "four" options
5. the the typo
6. remove a extra space
7. change comment next to `activedefrag no` to match the default value
2022-03-27 12:03:38 +03:00
Ronald Petty b104f3cabc
Update redis.conf (#10396)
Typo in conf file comment.
2022-03-08 12:52:54 -08:00
Binbin c0ea77f0e1
Show publishshard_sent stat in cluster info (#10314)
publishshard was added in #8621 (7.0 RC1), but the publishshard_sent
stat is not shown in CLUSTER INFO command.

Other changes:
1. Remove useless `needhelp` statements, it was removed in 3dad819.
2. Use `LL_WARNING` log level for some error logs (I/O error, Connection failed).
3. Fix typos that saw by the way.
2022-02-19 21:11:20 -08:00
Yossi Gottlieb 3881f7850f
Fix OpenSSL 3.0.x related issues. (#10291)
* Drop obsolete initialization calls.
* Use decoder API for DH parameters.
* Enable auto DH parameters if not explicitly used, which should be the
  preferred configuration going forward.
2022-02-15 08:37:06 +02:00
Pengcheng Huang 88795f5cba
Fix the misleading description of the option repl-ping-slave-period (#4687)
Co-authored-by: yoav-steinberg <yoav@monfort.co.il>
2022-02-10 09:18:59 +02:00
Harkrishn Patro a43b6922d1
Set default channel permission to resetchannels for 7.0 (#10181)
For backwards compatibility in 6.x, channels default permission was set to `allchannels` however with 7.0,
we should modify it and the default value should be `resetchannels` for better security posture.
Also, with selectors in ACL, a client doesn't have to set channel rules everytime and by default
the value will be `resetchannels`.

Before this change
```
127.0.0.1:6379> acl list
1) "user default on nopass ~* &* +@all"
127.0.0.1:6379>  acl setuser hp on nopass +@all ~*
OK
127.0.0.1:6379> acl list
1) "user default on nopass ~* &* +@all"
2) "user hp on nopass ~* &* +@all"
127.0.0.1:6379>  acl setuser hp1 on nopass -@all (%R~sales*)
OK
127.0.0.1:6379> acl list
1) "user default on nopass ~* &* +@all"
2) "user hp on nopass ~* &* +@all"
3) "user hp1 on nopass &* -@all (%R~sales* &* -@all)"
```

After this change
```
127.0.0.1:6379> acl list
1) "user default on nopass ~* &* +@all"
127.0.0.1:6379> acl setuser hp on nopass +@all ~*
OK
127.0.0.1:6379> acl list
1) "user default on nopass ~* &* +@all"
2) "user hp on nopass ~* resetchannels +@all"
127.0.0.1:6379> acl setuser hp1 on nopass -@all (%R~sales*)
OK
127.0.0.1:6379> acl list
1) "user default on nopass ~* &* +@all"
2) "user hp on nopass ~* resetchannels +@all"
3) "user hp1 on nopass resetchannels -@all (%R~sales* resetchannels -@all)"
```
2022-01-30 12:02:55 +02:00
Oran Agra 75a950cb93
doc improvement about acl first-arg (#10199)
We recently removed capabilities from the first-arg feature of ACL and added a warning.
but we didn't document it.
ref: #10147 and redis/redis-doc#1761
2022-01-29 18:34:59 +02:00
Binbin 495ac8b79a
Fix outdated save key word in redis.conf (#10166)
For some complex data types, server.dirty actually counts
the number of elements that have been changed.
And in FLUSHDB or FLUSHALL, we count the number of keys.

So the word "key" is not strictly correct and is outdated.
Some discussion can be seen at #8140.
2022-01-24 22:02:42 +02:00
Madelyn Olson 55c81f2cd3
ACL V2 - Selectors and key based permissions (#9974)
* Implemented selectors which provide multiple different sets of permissions to users
* Implemented key based permissions 
* Added a new ACL dry-run command to test permissions before execution
* Updated module APIs to support checking key based permissions

Co-authored-by: Oran Agra <oran@redislabs.com>
2022-01-20 13:05:27 -08:00
perryitay c4b788230c
Adding module api for processing commands during busy jobs and allow flagging the commands that should be handled at this status (#9963)
Some modules might perform a long-running logic in different stages of Redis lifetime, for example:
* command execution
* RDB loading
* thread safe context

During this long-running logic Redis is not responsive.

This PR offers 
1. An API to process events while a busy command is running (`RM_Yield`)
2. A new flag (`ALLOW_BUSY`) to mark the commands that should be handled during busy
  jobs which can also be used by modules (`allow-busy`)
3. In slow commands and thread safe contexts, this flag will start rejecting commands with -BUSY only
  after `busy-reply-threshold`
4. During loading (`rdb_load` callback), it'll process events right away (not wait for `busy-reply-threshold`),
  but either way, the processing is throttled to the server hz rate.
5. Allow modules to Yield to redis background tasks, but not to client commands

* rename `script-time-limit` to `busy-reply-threshold` (an alias to the pre-7.0 `lua-time-limit`)

Co-authored-by: Oran Agra <oran@redislabs.com>
2022-01-20 09:05:53 +02:00
Oran Agra ae89958972
Set repl-diskless-sync to yes by default, add repl-diskless-sync-max-replicas (#10092)
1. enable diskless replication by default
2. add a new config named repl-diskless-sync-max-replicas that enables
   replication to start before the full repl-diskless-sync-delay was
   reached.
3. put replica online sooner on the master (see below)
4. test suite uses repl-diskless-sync-delay of 0 to be faster
5. a few tests that use multiple replica on a pre-populated master, are
   now using the new repl-diskless-sync-max-replicas
6. fix possible timing issues in a few cluster tests (see below)

put replica online sooner on the master 
----------------------------------------------------
there were two tests that failed because they needed for the master to
realize that the replica is online, but the test code was actually only
waiting for the replica to realize it's online, and in diskless it could
have been before the master realized it.

changes include two things:
1. the tests wait on the right thing
2. issues in the master, putting the replica online in two steps.

the master used to put the replica as online in 2 steps. the first
step was to mark it as online, and the second step was to enable the
write event (only after getting ACK), but in fact the first step didn't
contains some of the tasks to put it online (like updating good slave
count, and sending the module event). this meant that if a test was
waiting to see that the replica is online form the point of view of the
master, and then confirm that the module got an event, or that the
master has enough good replicas, it could fail due to timing issues.

so now the full effect of putting the replica online, happens at once,
and only the part about enabling the writes is delayed till the ACK.

fix cluster tests 
--------------------
I added some code to wait for the replica to sync and avoid race
conditions.
later realized the sentinel and cluster tests where using the original 5
seconds delay, so changed it to 0.

this means the other changes are probably not needed, but i suppose
they're still better (avoid race conditions)
2022-01-17 14:11:11 +02:00
Madelyn Olson e8e02f900c
Changed latency histogram output to omit trailing 0s and periods (#10075)
Changed latency percentile output to omit trailing 0s and periods
2022-01-09 17:04:18 -08:00
YEONCHEOL JANG b9669829c8
Fixes typo for redis.conf (#10072)
Fixes typo for redis.conf
2022-01-09 08:08:55 +02:00
filipe oliveira 5dd15443ac
Added INFO LATENCYSTATS section: latency by percentile distribution/latency by cumulative distribution of latencies (#9462)
# Short description

The Redis extended latency stats track per command latencies and enables:
- exporting the per-command percentile distribution via the `INFO LATENCYSTATS` command.
  **( percentile distribution is not mergeable between cluster nodes ).**
- exporting the per-command cumulative latency distributions via the `LATENCY HISTOGRAM` command.
  Using the cumulative distribution of latencies we can merge several stats from different cluster nodes
  to calculate aggregate metrics .

By default, the extended latency monitoring is enabled since the overhead of keeping track of the
command latency is very small.
 
If you don't want to track extended latency metrics, you can easily disable it at runtime using the command:
 - `CONFIG SET latency-tracking no`

By default, the exported latency percentiles are the p50, p99, and p999.
You can alter them at runtime using the command:
- `CONFIG SET latency-tracking-info-percentiles "0.0 50.0 100.0"`


## Some details:
- The total size per histogram should sit around 40 KiB. We only allocate those 40KiB when a command
  was called for the first time.
- With regards to the WRITE overhead As seen below, there is no measurable overhead on the achievable
  ops/sec or full latency spectrum on the client. Including also the measured redis-benchmark for unstable
  vs this branch. 
- We track from 1 nanosecond to 1 second ( everything above 1 second is considered +Inf )

## `INFO LATENCYSTATS` exposition format

   - Format: `latency_percentiles_usec_<CMDNAME>:p0=XX,p50....` 

## `LATENCY HISTOGRAM [command ...]` exposition format

Return a cumulative distribution of latencies in the format of a histogram for the specified command names.

The histogram is composed of a map of time buckets:
- Each representing a latency range, between 1 nanosecond and roughly 1 second.
- Each bucket covers twice the previous bucket's range.
- Empty buckets are not printed.
- Everything above 1 sec is considered +Inf.
- At max there will be log2(1000000000)=30 buckets

We reply a map for each command in the format:
`<command name> : { `calls`: <total command calls> , `histogram` : { <bucket 1> : latency , < bucket 2> : latency, ...  } }`

Co-authored-by: Oran Agra <oran@redislabs.com>
2022-01-05 14:01:05 +02:00
Binbin 9538088751
Fix typos in aof.c / redis.conf (#10057) 2022-01-05 12:06:02 +02:00
chenyang8094 87789fae0b
Implement Multi Part AOF mechanism to avoid AOFRW overheads. (#9788)
Implement Multi-Part AOF mechanism to avoid overheads during AOFRW.
Introducing a folder with multiple AOF files tracked by a manifest file.

The main issues with the the original AOFRW mechanism are:
* buffering of commands that are processed during rewrite (consuming a lot of RAM)
* freezes of the main process when the AOFRW completes to drain the remaining part of the buffer and fsync it.
* double disk IO for the data that arrives during AOFRW (had to be written to both the old and new AOF files)

The main modifications of this PR:
1. Remove the AOF rewrite buffer and related code.
2. Divide the AOF into multiple files, they are classified as two types, one is the the `BASE` type,
  it represents the full amount of data (Maybe AOF or RDB format) after each AOFRW, there is only
  one `BASE` file at most. The second is `INCR` type, may have more than one. They represent the
  incremental commands since the last AOFRW.
3. Use a AOF manifest file to record and manage these AOF files mentioned above.
4. The original configuration of `appendfilename` will be the base part of the new file name, for example:
  `appendonly.aof.1.base.rdb` and `appendonly.aof.2.incr.aof`
5. Add manifest-related TCL tests, and modified some existing tests that depend on the `appendfilename`
6. Remove the `aof_rewrite_buffer_length` field in info.
7. Add `aof-disable-auto-gc` configuration. By default we're automatically deleting HISTORY type AOFs.
  It also gives users the opportunity to preserve the history AOFs. just for testing use now.
8. Add AOFRW limiting measure. When the AOFRW failures reaches the threshold (3 times now),
  we will delay the execution of the next AOFRW by 1 minute. If the next AOFRW also fails, it will be
  delayed by 2 minutes. The next is 4, 8, 16, the maximum delay is 60 minutes (1 hour). During the limit
  period, we can still use the 'bgrewriteaof' command to execute AOFRW immediately.
9. Support upgrade (load) data from old version redis.
10. Add `appenddirname` configuration, as the directory name of the append only files. All AOF files and
  manifest file will be placed in this directory.
11. Only the last AOF file (BASE or INCR) can be truncated. Otherwise redis will exit even if
  `aof-load-truncated` is enabled.

Co-authored-by: Oran Agra <oran@redislabs.com>
2022-01-03 19:14:13 +02:00
Madelyn Olson 5460c10047
Implement clusterbus message extensions and cluster hostname support (#9530)
Implement the ability for cluster nodes to advertise their location with extension messages.
2022-01-02 19:48:29 -08:00
Harkrishn Patro 9f8885760b
Sharded pubsub implementation (#8621)
This commit implements a sharded pubsub implementation based off of shard channels.

Co-authored-by: Harkrishn Patro <harkrisp@amazon.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
2022-01-02 16:54:47 -08:00
Viktor Söderqvist 45a155bd0f
Wait for replicas when shutting down (#9872)
To avoid data loss, this commit adds a grace period for lagging replicas to
catch up the replication offset.

Done:

* Wait for replicas when shutdown is triggered by SIGTERM and SIGINT.

* Wait for replicas when shutdown is triggered by the SHUTDOWN command. A new
  blocked client type BLOCKED_SHUTDOWN is introduced, allowing multiple clients
  to call SHUTDOWN in parallel.
  Note that they don't expect a response unless an error happens and shutdown is aborted.

* Log warning for each replica lagging behind when finishing shutdown.

* CLIENT_PAUSE_WRITE while waiting for replicas.

* Configurable grace period 'shutdown-timeout' in seconds (default 10).

* New flags for the SHUTDOWN command:

    - NOW disables the grace period for lagging replicas.

    - FORCE ignores errors writing the RDB or AOF files which would normally
      prevent a shutdown.

    - ABORT cancels ongoing shutdown. Can't be combined with other flags.

* New field in the output of the INFO command: 'shutdown_in_milliseconds'. The
  value is the remaining maximum time to wait for lagging replicas before
  finishing the shutdown. This field is present in the Server section **only**
  during shutdown.

Not directly related:

* When shutting down, if there is an AOF saving child, it is killed **even** if AOF
  is disabled. This can happen if BGREWRITEAOF is used when AOF is off.

* Client pause now has end time and type (WRITE or ALL) per purpose. The
  different pause purposes are *CLIENT PAUSE command*, *failover* and
  *shutdown*. If clients are unpaused for one purpose, it doesn't affect client
  pause for other purposes. For example, the CLIENT UNPAUSE command doesn't
  affect client pause initiated by the failover or shutdown procedures. A completed
  failover or a failed shutdown doesn't unpause clients paused by the CLIENT
  PAUSE command.

Notes:

* DEBUG RESTART doesn't wait for replicas.

* We already have a warning logged when a replica disconnects. This means that
  if any replica connection is lost during the shutdown, it is either logged as
  disconnected or as lagging at the time of exit.

Co-authored-by: Oran Agra <oran@redislabs.com>
2022-01-02 09:50:15 +02:00
YaacovHazan ae2f5b7b2e
Protected configs and sensitive commands (#9920)
Block sensitive configs and commands by default.

* `enable-protected-configs` - block modification of configs with the new `PROTECTED_CONFIG` flag.
   Currently we add this flag to `dbfilename`, and `dir` configs,
   all of which are non-mutable configs that can set a file redis will write to.
* `enable-debug-command` - block the `DEBUG` command
* `enable-module-command` - block the `MODULE` command

These have a default value set to `no`, so that these features are not
exposed by default to client connections, and can only be set by modifying the config file.

Users can change each of these to either `yes` (allow all access), or `local` (allow access from
local TCP connections and unix domain connections)

Note that this is a **breaking change** (specifically the part about MODULE command being disabled by default).
I.e. we don't consider DEBUG command being blocked as an issue (people shouldn't have been using it),
and the few configs we protected are unlikely to have been set at runtime anyway.
On the other hand, it's likely to assume some users who use modules, load them from the config file anyway.
Note that's the whole point of this PR, for redis to be more secure by default and reduce the attack surface on
innocent users, so secure defaults will necessarily mean a breaking change.
2021-12-19 10:46:16 +02:00
ny0312 792afb4432
Introduce memory management on cluster link buffers (#9774)
Introduce memory management on cluster link buffers:
 * Introduce a new `cluster-link-sendbuf-limit` config that caps memory usage of cluster bus link send buffers.
 * Introduce a new `CLUSTER LINKS` command that displays current TCP links to/from peers.
 * Introduce a new `mem_cluster_links` field under `INFO` command output, which displays the overall memory usage by all current cluster links.
 * Introduce a new `total_cluster_links_buffer_limit_exceeded` field under `CLUSTER INFO` command output, which displays the accumulated count of cluster links freed due to `cluster-link-sendbuf-limit`.
2021-12-16 21:56:59 -08:00
Madelyn Olson 36ca545286
Fix spelling of sanitization (#9901) 2021-12-06 10:14:13 +02:00
yoav-steinberg 0e5b813ef9
Multiparam config set (#9748)
We can now do: `config set maxmemory 10m repl-backlog-size 5m`

## Basic algorithm to support "transaction like" config sets:

1. Backup all relevant current values (via get).
2. Run "verify" and "set" on everything, if we fail run "restore".
3. Run "apply" on everything (optional optimization: skip functions already run). If we fail run "restore".
4. Return success.

### restore
1. Run set on everything in backup. If we fail log it and continue (this puts us in an undefined
   state but we decided it's better than the alternative of panicking). This indicates either a bug
   or some unsupported external state.
2. Run apply on everything in backup (optimization: skip functions already run). If we fail log
   it (see comment above).
3. Return error.

## Implementation/design changes:
* Apply function are idempotent (have no effect if they are run more than once for the same config).
* No indication in set functions if we're reading the config or running from the `CONFIG SET` command
   (removed `update` argument).
* Set function should set some config variable and assume an (optional) apply function will use that
   later to apply. If we know this setting can be safely applied immediately and can always be reverted
   and doesn't depend on any other configuration we can apply immediately from within the set function
   (and not store the setting anywhere). This is the case of this `dir` config, for example, which has no
   apply function. No apply function is need also in the case that setting the variable in the `server` struct
   is all that needs to be done to make the configuration take effect. Note that the original concept of `update_fn`,
   which received the old and new values was removed and replaced by the optional apply function.
* Apply functions use settings written to the `server` struct and don't receive any inputs.
* I take care that for the generic (non-special) configs if there's no change I avoid calling the setter (possible
   optimization: avoid calling the apply function as well).
* Passing the same config parameter more than once to `config set` will fail. You can't do `config set my-setting
   value1 my-setting value2`.

Note that getting `save` in the context of the conf file parsing to work here as before was a pain.
The conf file supports an aggregate `save` definition, where each `save` line is added to the server's
save params. This is unlike any other line in the config file where each line overwrites any previous
configuration. Since we now support passing multiple save params in a single line (see top comments
about `save` in https://github.com/redis/redis/pull/9644) we should deprecate the aggregate nature of
this config line and perhaps reduce this ugly code in the future.
2021-12-01 10:15:11 +02:00