Commit Graph

729 Commits

Author SHA1 Message Date
Yuan Wang 70a079db5e
Improve multithreaded performance with memory prefetching (#14017)
This PR is based on: https://github.com/valkey-io/valkey/pull/861

> ### Memory Access Amortization
> (Designed and implemented by [dan
touitou](https://github.com/touitou-dan))
> 
> Memory Access Amortization (MAA) is a technique designed to optimize
the performance of dynamic data structures by reducing the impact of
memory access latency. It is applicable when multiple operations need to
be executed concurrently. The principle behind it is that for certain
dynamic data structures, executing operations in a batch is more
efficient than executing each one separately.
> 
> Rather than executing operations sequentially, this approach
interleaves the execution of all operations. This is done in such a way
that whenever a memory access is required during an operation, the
program prefetches the necessary memory and transitions to another
operation. This ensures that when one operation is blocked awaiting
memory access, other memory accesses are executed in parallel, thereby
reducing the average access latency.
> 
> We applied this method in the development of dictPrefetch, which takes
as parameters a vector of keys and dictionaries. It ensures that all
memory addresses required to execute dictionary operations for these
keys are loaded into the L1-L3 caches when executing commands.
Essentially, dictPrefetch is an interleaved execution of dictFind for
all the keys.

### Implementation in Redis
When the main thread processes clients with ready-to-execute commands
(i.e., clients for which the IO thread has parsed the commands), a batch
of up to 16 commands is created. First, each command's argv, which was
allocated by the IO thread, is prefetched into the main thread's L1
cache. Then, all the dict entries and values required by the commands
are prefetched from the dictionary before the commands are executed.

#### Memory prefetching for main hash table
After https://github.com/redis/redis/pull/13806, the key and value are
unified into a single kv object and the dict uses the no_value
optimization, so the memory prefetching has 4 steps:

1. prefetch the bucket of the hash table
2. prefetch the entry associated with the given key's hash
3. prefetch the kv object of the entry
4. prefetch the value data of the kv object

We also need to handle the case where the dict entry is itself the
pointer to the kv object; in that case, step 3 is skipped.

MAA can improve single-threaded memory access efficiency by
interleaving the execution of multiple independent operations, allowing
memory-level parallelism and better CPU utilization. Its key point is
batch-wise interleaved execution: a batch of independent operations
(such as multiple key lookups) is split into multiple state machines,
whose progress is interleaved within a single thread to hide the memory
access latency of individual requests.

The difference between serial execution and interleaved execution:
**naive serial execution**
```
key1: step1 → wait → step2 → wait → done
key2: step1 → wait → step2 → wait → done
```
**interleaved execution**
```
key1: step1   → step2   → done
key2:   step1 → step2   → done
key3:     step1 → step2 → done
         ↑ While waiting for key1’s memory, progress key2/key3
```
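
The interleaving can be illustrated with a minimal sketch, written in Python for readability. It does not issue real prefetches (the PR does that in C); each `yield` only marks the point where a lookup would prefetch an address and hand control to the next key. The names `lookup_steps` and `run_interleaved` are hypothetical and do not come from the PR.

```python
from collections import deque

def lookup_steps(key, trace):
    """One lookup as a tiny state machine; each yield marks a pending memory access."""
    trace.append((key, "step1"))   # e.g. prefetch the hash-table bucket
    yield
    trace.append((key, "step2"))   # e.g. prefetch the entry in that bucket
    yield
    trace.append((key, "done"))

def run_interleaved(keys):
    """Advance every lookup one step at a time, round-robin, in a single thread."""
    trace = []
    pending = deque(lookup_steps(k, trace) for k in keys)
    while pending:
        op = pending.popleft()
        try:
            next(op)            # run until the next simulated memory access
            pending.append(op)  # not finished: revisit after the other keys
        except StopIteration:
            pass                # this lookup is complete
    return trace

trace = run_interleaved(["key1", "key2", "key3"])
```

In the resulting trace, `step1` of all three keys is recorded before any `step2`, which is the ordering the diagram above describes.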

#### New configuration
This PR introduces a new configuration, `prefetch-batch-max-size`. Since
it is a low-level optimization, the config is hidden. Its description:
When multiple commands are parsed by the I/O threads and ready for
execution, we take advantage of knowing the next set of commands and
prefetch their required dictionary entries in a batch. This reduces
memory access costs. The optimal batch size depends on the specific
workload of the user. The default batch size is 16, which can be
modified using the `prefetch-batch-max-size` config.
When the config is set to 0, prefetching is disabled.
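
As a rough illustration of how such a limit shapes batches, here is a small Python sketch. The function `make_batches` is hypothetical, and treating a size of 0 as "one command at a time, no prefetch" is an assumption about what "disabled" means, not the PR's actual code path.

```python
def make_batches(ready_commands, batch_max_size=16):
    """Group ready commands into prefetch batches of at most batch_max_size."""
    if batch_max_size == 0:
        # Assumption: with prefetching disabled, commands are handled one by one.
        return [[cmd] for cmd in ready_commands]
    return [ready_commands[i:i + batch_max_size]
            for i in range(0, len(ready_commands), batch_max_size)]

batches = make_batches([f"cmd{i}" for i in range(40)])
```

With 40 ready commands and the default size, this yields batches of 16, 16 and 8 commands.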

---------

Co-authored-by: Uri Yagelnik <uriy@amazon.com>
Co-authored-by: Ozan Tezcan <ozantezcan@gmail.com>
2025-06-05 08:57:43 +08:00
Slavomir Kaslev b7c6755b1b
Add thread sanitizer run to daily CI (#13964)
Add thread sanitizer run to daily CI.

A few tests are skipped in tsan runs for two reasons:
* Stack trace producing tests (oom, `unit/moduleapi/crash`, etc) are
tagged `tsan:skip` because redis calls `backtrace()` in a signal handler,
which turns out to be signal-unsafe since it might allocate memory (e.g.
glibc 2.39 does it through a call to `_dl_map_object_deps()`).
* A few tests become flaky with thread sanitizer builds and don't finish
within the expected deadlines because of the additional tsan overhead.
Instead of skipping those tests, this can be improved in the future by
allowing more iterations when waiting for tsan builds.

Deadlock detection is disabled for now because of a tsan limitation where
at most 64 locks can be held at once.

There is one outstanding (false-positive?) race in jemalloc which is
suppressed in `tsan.sup`.

Fix a few races reported by thread sanitizer that have to do with writes
from signal handlers. Since in a multi-threaded setting signal handlers
might be called on any thread (modulo pthread_sigmask) while the main
thread is running, the `volatile sig_atomic_t` type is not sufficient and
atomics are used instead.
2025-06-02 10:13:23 +03:00
Yuan Wang 99d30654e8
Let IO threads free argv and rewrite objects (#13968)
Objects that are allocated in an IO thread should also be freed by that
IO thread, so that we avoid memory arena contention and also reduce the
load of the main thread.

These objects include:
- client argv objects
- rewritten objects, but only `OBJ_ENCODING_RAW` encoded strings,
since usually only this type of object is allocated by IO threads.

For the implementation, if the client is assigned to an IO thread, we
create a `deferred_objects` array of size 32. When the main thread wants
to free the above objects, it puts them into `deferred_objects`, and
they are finally freed by the IO thread.
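
The hand-off can be sketched as follows, in Python for brevity. Only the array name `deferred_objects` and its size of 32 come from the description above; the class and function names are hypothetical, and freeing immediately in the main thread once the array is full is an assumption.

```python
DEFERRED_OBJECTS_MAX = 32  # array size stated in the commit message

class Client:
    def __init__(self):
        self.deferred_objects = []   # objects waiting to be freed by the IO thread

def main_thread_release(client, obj, free_now):
    """Main thread: defer the free if there is room, otherwise free immediately."""
    if len(client.deferred_objects) < DEFERRED_OBJECTS_MAX:
        client.deferred_objects.append(obj)
    else:
        free_now(obj)   # assumption: fall back to freeing in the main thread

def io_thread_drain(client, free_now):
    """IO thread: free everything the main thread handed over."""
    freed = len(client.deferred_objects)
    for obj in client.deferred_objects:
        free_now(obj)
    client.deferred_objects.clear()
    return freed

freed_log = []
client = Client()
for i in range(40):
    main_thread_release(client, f"argv{i}", freed_log.append)
freed_by_main = len(freed_log)
freed_by_io = io_thread_drain(client, freed_log.append)
```

In this run, 32 objects are deferred to the IO thread and the remaining 8 are freed by the main thread.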
2025-05-23 09:09:58 +08:00
Mincho Paskalev 4d9b4d6e51
Input output traffic stats and command process count for each client. (#13944)
2025-05-09 16:55:47 +03:00
Pieter Cailliau d65102861f
Adding AGPLv3 as a license option to Redis! (#13997)
Read more about [the new license option](http://redis.io/blog/agplv3/)
and [the Redis 8 release](http://redis.io/blog/redis-8-ga/).
2025-05-01 14:04:22 +01:00
YaacovHazan de16bee70a
Limiting output buffer for unauthenticated client (CVE-2025-21605) (#13993)
For unauthenticated clients the output buffer is limited to prevent them
from abusing it by not reading the replies
2025-04-30 09:58:51 +03:00
Yuan Wang 951ec79654
Cluster compatibility check (#13846)
### Background
An application may run normally in standalone mode but hit errors after
migrating to cluster mode, because some cross-slot commands cannot run
in cluster mode. We should provide a way to detect this issue while
still running in standalone mode, and expose a metric that indicates
the usage of incompatible commands.

### Solution
To avoid perf impact, we introduce a new config
`cluster-compatibility-sample-ratio` which defines the sampling ratio
(0-100) for checking command compatibility in cluster mode. When a
command is executed, it is sampled at the specified ratio to determine
if it complies with Redis cluster constraints, such as cross-slot
restrictions.
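
A percentage-based sampler of this kind can be sketched as below. This is an illustration in Python, not the function used by the PR; `should_check` is a hypothetical name.

```python
import random

def should_check(sample_ratio, rng=random):
    """Return True for roughly sample_ratio percent of calls (0 = never, 100 = always)."""
    if sample_ratio <= 0:
        return False
    if sample_ratio >= 100:
        return True
    return rng.randrange(100) < sample_ratio

rng = random.Random(42)
checked = sum(should_check(10, rng) for _ in range(10_000))
```

With a ratio of 10, about one command in ten pays for the compatibility check, which is how the config trades detection coverage for throughput.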

A new metric is exposed: `cluster_incompatible_ops` in `info stats`
output.

The following operations are considered incompatible:

- cross-slot command
   If the keys of a command map to multiple slots, it is incompatible.
- `swap, copy, move, select` command
   These commands involve multiple databases in some cases. Multiple DBs
   are not allowed in cluster mode, so they are not compatible.
- Module command with `no-cluster` flag
   If a module command has the `no-cluster` flag, loading the module
   fails with an error when cluster is enabled, so this is incompatible.
- Script/function with `no-cluster` flag
   Similar to module commands, if `no-cluster` is declared in the shebang
   of a script/function, it cannot run in cluster mode.
- `sort` command by/get pattern
   When the `sort` command has a `by/get` pattern option, the slot of the
   pattern must equal the slot of the keys; otherwise it is
   incompatible in cluster mode.
- The script/function command accesses keys whose slots differ from
  those of the declared keys
   For script/function commands, we check not only the slots of the
   declared keys but also the slots of the keys actually accessed; if
   they differ, the command is considered incompatible.

**Besides**, commands like `keys, scan, flushall, script/function
flush`, which in standalone mode iterate over all data to perform the
operation, only apply in cluster mode to the server that executes the
command and are not broadcast. However, this does not lead to
errors, so we do not consider them incompatible commands.
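
For reference, the cross-slot rule can be reproduced outside the server with a short Python sketch. It follows the Redis Cluster key distribution rule (CRC16 of the key, or of its hash tag, modulo 16384); the helper names are hypothetical, and `binascii.crc_hqx` with an initial value of 0 is used here as the CRC16 implementation.

```python
import binascii

CLUSTER_SLOTS = 16384

def key_hash_slot(key: bytes) -> int:
    """Slot of a key: CRC16 (XMODEM) of the key, or of its {hash tag}, mod 16384."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end != -1 and end != start + 1:   # ignore an empty tag "{}"
            key = key[start + 1:end]
    return binascii.crc_hqx(key, 0) % CLUSTER_SLOTS

def is_cross_slot(keys) -> bool:
    """True if the keys of one command map to more than one slot."""
    return len({key_hash_slot(k) for k in keys}) > 1

mset_plain = is_cross_slot([b"foo", b"bar"])               # different slots
mset_tagged = is_cross_slot([b"{user1}:a", b"{user1}:b"])  # same hash tag
```

A multi-key command such as `MSET foo 1 bar 2` would be counted as incompatible under this rule, while the same command on keys sharing a hash tag would not.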

### Performance impact test
**cross slot test**
Below are the test commands and results. When using MSET with 8 keys,
performance drops by approximately 3%.

**single key test**
Single-key commands could see a 1-2% performance drop, possibly due to
the overhead of the sampling function.
2025-03-20 10:35:53 +08:00
Filipe Oliveira (Redis) d7a448f9ae
Avoid redundant calls to sdslen(c->querybuf) in processMultibulkBuffer (#13787)
Optimize processMultibulkBuffer by avoiding redundant calls to
sdslen(c->querybuf).
The cached length is updated only when querybuf is modified.
2025-02-24 20:06:14 +08:00
Raz Monsonego 04589f90d7
Add internal connection and command mechanism (#13740)
# PR: Add Mechanism for Internal Commands and Connections in Redis

This PR introduces a mechanism to handle **internal commands and
connections** in Redis. It includes enhancements for command
registration, internal authentication, and observability.

## Key Features

1. **Internal Command Flag**:
   - Introduced a new **module command registration flag**: `internal`.
   - Commands marked with `internal` can only be executed by **internal
   connections**, AOF loading flows, and master-replica connections.
   - For any other connection, these commands will appear as non-existent.

2. **Support for internal authentication added to `AUTH`**:
   - Used by specifying the special username `internal connection` with the
   right internal password, i.e., `AUTH "internal connection"
   <internal_secret>`.
   - No user-defined ACL username can have this name, since spaces are not
   allowed in the ACL parser.
   - Allows connections to authenticate as **internal connections**.
   - Authenticated internal connections can execute internal commands
   successfully.

3. **Module API for Internal Secret**:
   - Added the `RedisModule_GetInternalSecret()` API, which exposes the
   internal secret that should be used as the password for the new `AUTH
   "internal connection" <password>` command.
   - This API enables the modules to authenticate against other shards as
   local connections.
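
To make the wire format concrete, the sketch below encodes the `AUTH "internal connection" <internal_secret>` request as a RESP array of bulk strings. The encoder is a generic illustration, and the secret value is a placeholder; a real module would obtain it from `RedisModule_GetInternalSecret()`.

```python
def encode_resp_command(*args: str) -> bytes:
    """Encode a command as a RESP array of bulk strings."""
    out = [b"*%d\r\n" % len(args)]
    for arg in args:
        data = arg.encode()
        out.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(out)

internal_secret = "example-secret"   # placeholder, not a real secret
auth_request = encode_resp_command("AUTH", "internal connection", internal_secret)
```

The username travels as a single 19-byte argument containing a space, which is why it cannot collide with any user-defined ACL name.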

## Notes on Behavior

- **ACL validation**:
- Commands dispatched by internal connections bypass ACL validation, to
give the caller full access regardless of the user with which it is
connected.

- **Command Visibility**:
- Internal commands **do not appear** in `COMMAND <subcommand>` and
`MONITOR` for non-internal connections.
- Internal commands **are logged** in the slow log, latency report and
commands' statistics to maintain observability.

- **`RM_Call()` Updates**:
  - **Non-internal connections**:
- Cannot execute internal commands when the command is sent with the `C`
flag (otherwise can).
- Internal connections bypass ACL validations (i.e., run as the
unrestricted user).

- **Internal commands' success**:
- Internal commands succeed upon being sent from either an internal
connection (i.e., authenticated via the new `AUTH "internal connection"
<internal_secret>` API), an AOF loading process, or from a master via
the replication link.
Any other connections that attempt to execute an internal command fail
with the `unknown command` error message raised.

- **`CLIENT LIST` flags**:
  - Added the `I` flag, to indicate that the connection is internal.

- **Lua Scripts**:
   - Prevented internal commands from being executed via Lua scripts.

---------

Co-authored-by: Meir Shpilraien <meir@redis.com>
2025-02-05 11:48:08 +02:00
Yuan Wang 5b8b58e472
Fix incorrect parameter type reports (#13744)
After upgrading to ubuntu 24.04, clang18 can detect the runtime error
"call to function XXX through pointer to incorrect function type", and
our daily CI reports these errors via UndefinedBehaviorSanitizer (UBSan):

https://github.com/redis/redis/actions/runs/12738281720/job/35500380251#step:6:346

Now we add generic versions of some existing `free` functions so that
they can be called through a (void*) pointer. They are just wrapper
functions that cast the data type and call the corresponding functions.
2025-01-14 15:51:05 +08:00
Ozan Tezcan 73a9b916c9
Rdb channel replication (#13732)
This PR is based on:

https://github.com/redis/redis/pull/12109
https://github.com/valkey-io/valkey/pull/60

Closes: https://github.com/redis/redis/issues/11678

**Motivation**

During a full sync, when master is delivering RDB to the replica,
incoming write commands are kept in a replication buffer in order to be
sent to the replica once RDB delivery is completed. If RDB delivery
takes a long time, it might create memory pressure on master. Also, once
a replica connection accumulates replication data which is larger than
output buffer limits, master will kill replica connection. This may
cause a replication failure.

The main benefit of the rdb channel replication is streaming incoming
commands in parallel to the RDB delivery. This approach shifts
replication stream buffering to the replica and reduces load on master.
We do this by opening another connection for RDB delivery. The main
channel on replica will be receiving replication stream while rdb
channel is receiving the RDB.

This feature also helps to reduce master's main process CPU load. By
opening a dedicated connection for the RDB transfer, the bgsave process
has access to the new connection and it will stream RDB directly to the
replicas. Before this change, due to TLS connection restriction, the
bgsave process was writing RDB bytes to a pipe and the main process was
forwarding it to the replica. This is no longer necessary, so the main
process can avoid these expensive socket read/write syscalls. It also
means RDB delivery to replica will be faster as it avoids this step.

In summary, replication will be faster and master's performance during
full syncs will improve.


**Implementation steps**

1. When replica connects to the master, it sends 'rdb-channel-repl' as
part of capability exchange to let master know that the replica supports
rdb channel.
2. When replica lacks sufficient data for PSYNC, master sends
+RDBCHANNELSYNC reply with replica's client id. As the next step, the
replica opens a new connection (rdb-channel) and configures it against
the master with the appropriate capabilities and requirements. It also
sends given client id back to master over rdbchannel, so that master can
associate these channels. (initial replica connection will be referred
as main-channel) Then, replica requests fullsync using the RDB channel.
3. Prior to forking, master attaches the replica's main channel to the
replication backlog to deliver replication stream starting at the
snapshot end offset.
4. The master main process sends replication stream via the main
channel, while the bgsave process sends the RDB directly to the replica
via the rdb-channel. Replica accumulates replication stream in a local
buffer, while the RDB is being loaded into the memory.
5. Once the replica completes loading the rdb, it drops the rdb channel
and streams the accumulated replication stream into the db. Sync is
completed.

**Some details**
- Currently, rdbchannel replication is supported only if
`repl-diskless-sync` is enabled on master. Otherwise, replication will
happen over a single connection as before.
- On replica, there is a limit to replication stream buffering. Replica
uses a new config `replica-full-sync-buffer-limit` to limit number of
bytes to accumulate. If it is not set, replica inherits
`client-output-buffer-limit <replica>` hard limit config. If we reach
this limit, replica stops accumulating. This is not a failure scenario
though. Further accumulation will happen on master side. Depending on
the configured limits on master, master may kill the replica connection.
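
The buffering rule can be sketched as below. This is a simplified Python model, not the replica's actual code: `ReplicaSyncBuffer` is a hypothetical name, the limit is checked before each chunk rather than byte by byte, and the case where the inherited hard limit is itself 0 is not modeled.

```python
class ReplicaSyncBuffer:
    def __init__(self, full_sync_buffer_limit, output_buffer_hard_limit):
        # 0 means "not set": fall back to the client-output-buffer-limit hard limit
        self.limit = full_sync_buffer_limit or output_buffer_hard_limit
        self.size = 0
        self.peak = 0

    def accumulate(self, nbytes):
        """Return how many bytes were buffered; 0 means the replica stopped reading."""
        if self.size >= self.limit:
            return 0   # not an error: the rest stays buffered on the master
        self.size += nbytes
        self.peak = max(self.peak, self.size)
        return nbytes

buf = ReplicaSyncBuffer(full_sync_buffer_limit=0, output_buffer_hard_limit=100)
taken = [buf.accumulate(60), buf.accumulate(60), buf.accumulate(10)]
```

Once the replica stops accumulating, the backlog grows on the master instead, so the master's own output buffer limits decide whether the connection survives.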

**API changes in INFO output:**

1. New replica state: `send_bulk_and_stream`. Indicates full sync is
still in progress for this replica. It is receiving replication stream
and rdb in parallel.
```
slave0:ip=127.0.0.1,port=5002,state=send_bulk_and_stream,offset=0,lag=0
```
Replica state changes in steps:
- First, replica sends psync and receives +RDBCHANNELSYNC
:`state=wait_bgsave`
- After replica connects with rdbchannel and delivery starts:
`state=send_bulk_and_stream`
 - After full sync: `state=online`

2. On replica side, replication stream buffering metrics:
- replica_full_sync_buffer_size: Currently accumulated replication
stream data in bytes.
- replica_full_sync_buffer_peak: Peak number of bytes that this instance
accumulated in the lifetime of the process.

```
replica_full_sync_buffer_size:20485             
replica_full_sync_buffer_peak:1048560
```
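
A monitoring script could read these fields with a few lines of Python. The parsing helpers below are hypothetical and only rely on the `name:value` and `key=value,...` layout shown in the samples above.

```python
def parse_replica_line(line):
    """Split 'slave0:ip=...,state=...' into ('slave0', {field: value})."""
    name, _, rest = line.partition(":")
    fields = dict(item.split("=", 1) for item in rest.split(","))
    return name, fields

def parse_metrics(text):
    """Parse 'name:value' lines such as replica_full_sync_buffer_size:20485."""
    metrics = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(":")
        if sep:
            metrics[name] = int(value)
    return metrics

name, replica = parse_replica_line(
    "slave0:ip=127.0.0.1,port=5002,state=send_bulk_and_stream,offset=0,lag=0")
metrics = parse_metrics(
    "replica_full_sync_buffer_size:20485\nreplica_full_sync_buffer_peak:1048560")
still_syncing = replica["state"] != "online"
```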

**API changes in CLIENT LIST**

In `client list` output, rdbchannel clients will have 'C' flag in
addition to 'S' replica flag:
```
id=11 addr=127.0.0.1:39108 laddr=127.0.0.1:5001 fd=14 name= age=5 idle=5 flags=SC db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1920 events=r cmd=psync user=default redir=-1 resp=2 lib-name= lib-ver= io-thread=0
```

**Config changes:**
- `replica-full-sync-buffer-limit`: Controls how much replication data
replica can accumulate during rdbchannel replication. If it is not set
(a value of 0), replica will inherit `client-output-buffer-limit
<replica>` hard limit config to limit accumulated data.
- `repl-rdb-channel` config is added as a hidden config. This is mostly
for testing, as we need to support both rdbchannel replication and the
older single connection replication (to keep compatibility with older
versions, and because rdbchannel replication will not be enabled if
repl-diskless-sync is not enabled). It affects both the master (not to
respond to rdb channel requests) and the replica (not to declare
capability).

**Internal API changes:**
Changes that were introduced to Redis replication:
- New replication capability is added to replconf command: `capa
rdb-channel-repl`. Indicates replica is capable of rdb channel
replication. Replica sends it when it connects to master along with
other capabilities.
- If replica needs fullsync, master replies `+RDBCHANNELSYNC
<client-id>` to the replica's PSYNC request.
- When replica opens rdbchannel connection, as part of replconf command,
it sends `rdb-channel 1` to let master know this is rdb channel. Also,
it sends `main-ch-client-id <client-id>` as part of replconf command so
master can associate channels.
  
**Testing:**
As rdbchannel replication is enabled by default, we run whole test suite
with it. Though, as we need to support both rdbchannel and single
connection replication, we'll be running some tests twice with
`repl-rdb-channel yes/no` config.

**Replica state diagram**
```
* * Replica state machine *
 *
 * Main channel state
 * ┌───────────────────┐
 * │RECEIVE_PING_REPLY │
 * └────────┬──────────┘
 *          │ +PONG
 * ┌────────▼──────────┐
 * │SEND_HANDSHAKE     │                     RDB channel state
 * └────────┬──────────┘            ┌───────────────────────────────┐
 *          │+OK                ┌───► RDB_CH_SEND_HANDSHAKE         │
 * ┌────────▼──────────┐        │   └──────────────┬────────────────┘
 * │RECEIVE_AUTH_REPLY │        │    REPLCONF main-ch-client-id <clientid>
 * └────────┬──────────┘        │   ┌──────────────▼────────────────┐
 *          │+OK                │   │ RDB_CH_RECEIVE_AUTH_REPLY     │
 * ┌────────▼──────────┐        │   └──────────────┬────────────────┘
 * │RECEIVE_PORT_REPLY │        │                  │ +OK
 * └────────┬──────────┘        │   ┌──────────────▼────────────────┐
 *          │+OK                │   │  RDB_CH_RECEIVE_REPLCONF_REPLY│
 * ┌────────▼──────────┐        │   └──────────────┬────────────────┘
 * │RECEIVE_IP_REPLY   │        │                  │ +OK
 * └────────┬──────────┘        │   ┌──────────────▼────────────────┐
 *          │+OK                │   │ RDB_CH_RECEIVE_FULLRESYNC     │
 * ┌────────▼──────────┐        │   └──────────────┬────────────────┘
 * │RECEIVE_CAPA_REPLY │        │                  │+FULLRESYNC
 * └────────┬──────────┘        │                  │Rdb delivery
 *          │                   │   ┌──────────────▼────────────────┐
 * ┌────────▼──────────┐        │   │ RDB_CH_RDB_LOADING            │
 * │SEND_PSYNC         │        │   └──────────────┬────────────────┘
 * └─┬─────────────────┘        │                  │ Done loading
 *   │PSYNC (use cached-master) │                  │
 * ┌─▼─────────────────┐        │                  │
 * │RECEIVE_PSYNC_REPLY│        │    ┌────────────►│ Replica streams replication
 * └─┬─────────────────┘        │    │             │ buffer into memory
 *   │                          │    │             │
 *   │+RDBCHANNELSYNC client-id │    │             │
 *   ├──────┬───────────────────┘    │             │
 *   │      │ Main channel           │             │
 *   │      │ accumulates repl data  │             │
 *   │   ┌──▼────────────────┐       │     ┌───────▼───────────┐
 *   │   │ REPL_TRANSFER     ├───────┘     │    CONNECTED      │
 *   │   └───────────────────┘             └────▲───▲──────────┘
 *   │                                          │   │
 *   │                                          │   │
 *   │  +FULLRESYNC    ┌───────────────────┐    │   │
 *   ├────────────────► REPL_TRANSFER      ├────┘   │
 *   │                 └───────────────────┘        │
 *   │  +CONTINUE                                   │
 *   └──────────────────────────────────────────────┘
 */
 ```
 -----
 This PR also contains changes and ideas from: 
https://github.com/valkey-io/valkey/pull/837
https://github.com/valkey-io/valkey/pull/1173
https://github.com/valkey-io/valkey/pull/804
https://github.com/valkey-io/valkey/pull/945
https://github.com/valkey-io/valkey/pull/989
---------

Co-authored-by: Yuan Wang <wangyuancode@163.com>
Co-authored-by: debing.sun <debing.sun@redis.com>
Co-authored-by: Moti Cohen <moticless@gmail.com>
Co-authored-by: naglera <anagler123@gmail.com>
Co-authored-by: Amit Nagler <58042354+naglera@users.noreply.github.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Ping Xie <pingxie@outlook.com>
Co-authored-by: Ran Shidlansik <ranshid@amazon.com>
Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com>
Co-authored-by: xbasel <103044017+xbasel@users.noreply.github.com>
2025-01-13 15:09:52 +03:00
Filipe Oliveira (Redis) dc0ee51cb1
Refactor Client Write Preparation and Handling (#13721)
This update refactors prepareClientToWrite by introducing
_prepareClientToWrite for inline checks within network.c file, and
separates replica and non-replica handling for pending replies and
writes (_clientHasPendingRepliesSlave/NonSlave and
_writeToClientSlave/NonSlave).

---------

Co-authored-by: debing.sun <debing.sun@redis.com>
Co-authored-by: Yuan Wang <wangyuancode@163.com>
2025-01-13 15:40:36 +08:00
debing.sun 21aee83abd
Fix issue with argv not being shrunk (#13698)
Found by @ShooterIT 

## Description
If a client first creates a command with a very large number of
parameters, such as 10,000 parameters, the argv will be expanded to
accommodate 10,000. If the subsequent commands have fewer than 10,000
parameters, this argv will continue to be reused and will never be
shrunk.

## Solution
When determining whether it is necessary to rebuild argv, if the length
of the previous argv has already exceeded 1024, we will progressively
create argv regardless.

## Free argv in cron
Add a new condition to determine whether argv needs to be resized in
cron. When the number of parameters exceeds 128, we will resize it
regardless to avoid a single client consuming too much memory. It will
now occupy a maximum of (128 * 8 bytes).

---------

Co-authored-by: Yuan Wang <wangyuancode@163.com>
2025-01-08 16:12:52 +08:00
debing.sun 08d714d0e5
Fix crash due to cron argv release (#13725)
Introduced by https://github.com/redis/redis/issues/13521

If the client argv was released due to a timeout before sending the
complete command, `argv_len` will be reset to 0.
When argv is parsed again and resized, requesting a length of 0 may
result in argv being NULL, which then leads to a crash.

Also fix a bug where `argv_len` is not updated correctly in
`replaceClientCommandVector()`.

---------

Co-authored-by: ShooterIT <wangyuancode@163.com>
Co-authored-by: meiravgri <109056284+meiravgri@users.noreply.github.com>
2025-01-08 09:57:23 +08:00
Yuan Wang 8e9f5146dd
Add reads/writes metrics for IO threads (#13703)
The main job of the IO threads is reading queries and writing replies, so
reads/writes metrics reflect the workload of the IO threads. We now also
report the `io_threaded_reads/writes_processed` metrics in detail for
each IO thread.

To avoid breaking changes, `io_threaded_reads/writes_processed` is
still there. However, before the async IO threads commit it could also
include the IO done by the main thread when IO threads were active,
whereas now it only sums the IO done by IO threads.

Now threads section in `info` command output is as follows:
```
# Threads
io_thread_0:clients=0,reads=0,writes=0
io_thread_1:clients=54,reads=6546940,writes=6546919
io_thread_2:clients=54,reads=6513650,writes=6513625
io_thread_3:clients=54,reads=6396571,writes=6396525
io_thread_4:clients=53,reads=6511120,writes=6511097
io_thread_5:clients=53,reads=6539302,writes=6539280
io_thread_6:clients=53,reads=6502269,writes=6502248
```
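
The per-thread lines are easy to aggregate on the client side. The Python sketch below parses a shortened copy of the sample output; `parse_threads_section` is a hypothetical helper, and summing the per-thread `reads` only illustrates the statement that the aggregate counts IO done by IO threads.

```python
def parse_threads_section(text):
    """Return {thread_name: {"clients": int, "reads": int, "writes": int}}."""
    stats = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("io_thread_"):
            continue   # skip the "# Threads" header and blank lines
        name, _, rest = line.partition(":")
        stats[name] = {k: int(v) for k, v in
                       (item.split("=", 1) for item in rest.split(","))}
    return stats

sample = """# Threads
io_thread_0:clients=0,reads=0,writes=0
io_thread_1:clients=54,reads=6546940,writes=6546919
io_thread_2:clients=54,reads=6513650,writes=6513625
"""
stats = parse_threads_section(sample)
total_reads = sum(t["reads"] for t in stats.values())
```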
2025-01-06 15:59:02 +08:00
raffertyyu 04f63d4af7
Fix index error of CRLF when replying with integer-encoded strings (#13711)
close #13709

Fix the index error of CRLF character for integer-encoded strings
in addReplyBulk function

---------

Co-authored-by: debing.sun <debing.sun@redis.com>
2024-12-31 21:41:10 +08:00
Yuan Wang 7665bdc91a
Offload `lookupCommand` into IO threads when threaded IO is enabled (#13696)
From the flame graph, we can see that `lookupCommand` in the main thread
costs a lot of CPU, so we let IO threads perform `lookupCommand`.

To avoid race conditions among multiple IO threads, we made the following
changes:
- Pause all IO threads when registering or unregistering commands
- Force a full rehashing of the command table dict when resizing
2024-12-25 16:03:22 +08:00
Yuan Wang 64a40b20d9
Async IO Threads (#13695)
## Introduction
Redis introduced IO threads in 6.0, allowing IO threads to handle client
request reading, command parsing and reply writing, thereby improving
performance. The current IO thread implementation has a few drawbacks.
- The main thread is blocked during IO thread read/write operations and
must wait for all IO threads to complete their current tasks before it
can continue execution. In other words, the entire process is
synchronous. This prevents the efficient utilization of multi-core CPUs
for parallel processing.

- When the number of clients and requests increases moderately, it
causes all IO threads to reach full CPU utilization due to the busy wait
mechanism used by the IO threads. This makes it challenging for us to
determine which part of Redis has reached its bottleneck.

- When IO threads are enabled with TLS and io-threads-do-reads, a
disconnection of a connection with pending data may result in it being
assigned to multiple IO threads simultaneously. This can cause race
conditions and trigger assertion failures. Related issue:
redis#12540

Therefore, we designed an asynchronous IO threads solution. The IO
threads adopt an event-driven model, with the main thread dedicated to
command processing, meanwhile, the IO threads handle client read and
write operations in parallel.

## Implementation
### Overall
As before, we did not change the fact that all client commands must be
executed on the main thread, because Redis was originally designed to be
single-threaded, and processing commands in a multi-threaded manner
would inevitably introduce numerous race and synchronization issues. But
now each IO thread has an independent event loop, so IO threads can
use a multiplexing approach to handle client read and write operations,
eliminating the CPU overhead caused by busy-waiting.

The execution process can be briefly described as follows:
the main thread assigns clients to IO threads after accepting
connections; IO threads notify the main thread when clients
finish reading and parsing queries; the main thread then processes
queries from IO threads and generates replies; IO threads handle
writing replies to clients after receiving the client list from the main
thread, and then continue to handle client read and write events.

### Each IO thread has an independent event loop
We now assign each IO thread its own event loop. This approach
eliminates the need for the main thread to perform the costly
`epoll_wait` operation for handling connections (except for specific
ones). Instead, the main thread processes requests from the IO threads
and hands them back once completed, fully offloading read and write
events to the IO threads.

Additionally, all TLS operations, including handling pending data, have
been moved entirely to the IO threads. This resolves the issue where
io-threads-do-reads could not be used with TLS.

### Event-notified client queue
To facilitate communication between the IO threads and the main thread,
we designed an event-notified client queue. Each IO thread and the main
thread have two such queues to store clients waiting to be processed.
These queues are integrated with the event loop, so queued clients are
handled in an event-driven way. We use `pthread_mutex` to ensure the
safety of queue operations as well as data visibility and ordering. Lock
contention is minimal because each IO thread and the main thread operate
on independent queues, which avoids thread suspension. The event
notifier itself is based on `eventfd` or `pipe`.
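
A minimal sketch of such a queue, assuming a single producer and a single consumer per queue and using `pipe` as the notifier. The type and function names (`notifiedQueue`, `queueInit`, `queuePush`, `queueDrain`) and the fixed capacity are hypothetical and only illustrate the mutex-plus-notifier idea; they are not the names used in the Redis source.

```c
#include <pthread.h>
#include <stddef.h>
#include <unistd.h>

#define QUEUE_CAP 64

/* Hypothetical queue type; names are illustrative, not taken from Redis. */
typedef struct {
    pthread_mutex_t lock;
    void *items[QUEUE_CAP];
    size_t count;
    int notify_fd[2]; /* pipe: [0] is watched by the consumer's event loop */
} notifiedQueue;

int queueInit(notifiedQueue *q) {
    q->count = 0;
    if (pipe(q->notify_fd) == -1) return -1;
    return pthread_mutex_init(&q->lock, NULL) == 0 ? 0 : -1;
}

/* Producer: enqueue a client; write one byte only when the queue goes from
 * empty to non-empty, so a single wakeup can cover a whole batch. */
int queuePush(notifiedQueue *q, void *client) {
    int ok = 0, was_empty;
    pthread_mutex_lock(&q->lock);
    was_empty = (q->count == 0);
    if (q->count < QUEUE_CAP) {
        q->items[q->count++] = client;
        ok = 1;
    }
    pthread_mutex_unlock(&q->lock);
    if (!ok) return -1;
    if (was_empty) {
        char byte = 1;
        if (write(q->notify_fd[1], &byte, 1) != 1) return -1;
    }
    return 0;
}

/* Consumer: meant to run from the readable handler of notify_fd[0].
 * Consumes the wakeup byte, then takes every pending client under the lock. */
size_t queueDrain(notifiedQueue *q, void *out[QUEUE_CAP]) {
    char byte;
    size_t n, i;
    if (read(q->notify_fd[0], &byte, 1) != 1) return 0;
    pthread_mutex_lock(&q->lock);
    n = q->count;
    for (i = 0; i < n; i++) out[i] = q->items[i];
    q->count = 0;
    pthread_mutex_unlock(&q->lock);
    return n;
}
```

In an event loop, `notify_fd[0]` would be registered as a readable file event whose handler calls `queueDrain`.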

### Thread safety
Since the main thread and IO threads can execute in parallel, we must
handle data race issues carefully.

**client->flags**
The primary tasks of IO threads are reading and writing, i.e.
`readQueryFromClient` and `writeToClient`. However, IO threads and the
main thread may concurrently modify or access `client->flags`, leading
to potential race conditions. To address this, we introduced an io-flags
variable to record operations performed by IO threads, thereby avoiding
race conditions on `client->flags`.

**Pause IO thread**
The main thread may need to operate on data owned by an IO thread, for
example to uninstall an event handler, access or modify a query/output
buffer, or resize the event loop, and it needs a clean and safe context
to do so. We pause the IO thread in `IOThreadBeforeSleep`, do the work,
and then resume it. To avoid suspending threads, we use busy waiting to
confirm the target status, and we use atomic variables to guarantee
memory visibility and ordering. We introduced the following functions to
pause and resume IO threads.
```
pauseIOThread, resumeIOThread
pauseAllIOThreads, resumeAllIOThreads
pauseIOThreadsRange, resumeIOThreadsRange
```
Testing has shown that `pauseIOThread` is highly efficient, allowing the
main thread to execute nearly 200,000 operations per second during
stress tests. Similarly, `pauseAllIOThreads` with 8 IO threads can
handle up to nearly 56,000 operations per second. But operations
performed between pausing and resuming IO threads must be quick;
otherwise, they could cause the IO threads to reach full CPU
utilization.
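
The pause/resume handshake can be sketched as follows. This is an illustration under assumptions, not the Redis implementation: the state names, the struct, and the `...Sketch` functions are hypothetical, and the IO thread here is a plain loop rather than an event loop (a real event-driven thread would also have to be woken up so that it reaches its before-sleep hook).

```c
#include <pthread.h>
#include <stdatomic.h>

enum { IO_RUNNING, IO_PAUSING, IO_PAUSED, IO_RESUMING };

/* Hypothetical per-thread state, for illustration only. */
typedef struct {
    atomic_int state;
    atomic_int stop;
    long counter; /* owned by the IO thread; main touches it only while paused */
} ioThreadSketch;

void ioThreadSketchInit(ioThreadSketch *t) {
    atomic_init(&t->state, IO_RUNNING);
    atomic_init(&t->stop, 0);
    t->counter = 0;
}

/* IO thread side: called once per loop iteration, like a before-sleep hook. */
void ioThreadCheckPause(ioThreadSketch *t) {
    if (atomic_load(&t->state) != IO_PAUSING) return;
    atomic_store(&t->state, IO_PAUSED);
    while (atomic_load(&t->state) != IO_RESUMING) { /* busy wait */ }
    atomic_store(&t->state, IO_RUNNING);
}

/* Main thread side: request a pause and spin until it is acknowledged. */
void pauseIOThreadSketch(ioThreadSketch *t) {
    atomic_store(&t->state, IO_PAUSING);
    while (atomic_load(&t->state) != IO_PAUSED) { /* busy wait */ }
}

/* Main thread side: release the IO thread and wait until it runs again. */
void resumeIOThreadSketch(ioThreadSketch *t) {
    atomic_store(&t->state, IO_RESUMING);
    while (atomic_load(&t->state) != IO_RUNNING) { /* busy wait */ }
}

/* Stand-in for the IO thread's loop. */
void *ioThreadMain(void *arg) {
    ioThreadSketch *t = arg;
    while (!atomic_load(&t->stop)) {
        t->counter++;
        ioThreadCheckPause(t);
    }
    return NULL;
}
```

Because both sides spin, whatever the main thread does between pausing and resuming should be short, which matches the caveat above.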

**freeClient and freeClientAsync**
The main thread may need to terminate a client currently running on an
IO thread, for example, due to ACL rule changes, reaching the output
buffer limit, or evicting a client. In such cases, we need to pause the
IO thread to safely operate on the client.

**maxclients and maxmemory-clients updating**
When adjusting `maxclients`, we need to resize the event loop for all IO
threads. Similarly, when modifying `maxmemory-clients`, we need to
traverse all clients to calculate their memory usage. To ensure safe
operations, we pause all IO threads during these adjustments.

**Client info reading**
The main thread may need to read a client’s fields to generate a
descriptive string, such as for the `CLIENT LIST` command or logging
purposes. In such cases, we need to pause the IO thread handling that
client. If information for all clients needs to be displayed, all IO
threads must be paused.

**Tracking redirect**
Redis supports the tracking feature and can even send invalidation
messages to a connection with a specified ID. However, the target client
may be running on an IO thread, where directly manipulating the client’s
output buffer is not thread-safe and the IO thread may not be aware that
the client requires a response. In such cases, we pause the IO thread
handling the client, modify the output buffer, and install a write event
handler to ensure proper handling.

**clientsCron**
In the `clientsCron` function, the main thread needs to traverse all
clients to perform operations such as timeout checks, verifying whether
they have reached the soft output buffer limit, resizing the
output/query buffer, or updating memory usage. To safely operate on a
client, the IO thread handling that client must be paused.
If we were to pause the IO thread for each client individually, the
efficiency would be very low. Conversely, pausing all IO threads
simultaneously would be costly, especially when there are many IO
threads, as clientsCron is invoked relatively frequently.
To address this, we adopted a batched approach for pausing IO threads.
At most, 8 IO threads are paused at a time. The operations mentioned
above are only performed on clients running in the paused IO threads,
significantly reducing overhead while maintaining safety.
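
A small sketch of how the IO threads could be split into batches of at most 8. It assumes IO thread ids run from 1 to `io_threads` (with 0 being the main thread); the function name and the batch indexing are hypothetical, and how batches are scheduled across `clientsCron` calls is not specified in the text above.

```c
#define PAUSE_BATCH 8 /* "at most 8 IO threads are paused at a time" */

/* Computes the inclusive id range [*start, *end] of the IO threads in the
 * given batch. Returns 1 on success, 0 if the batch index is out of range. */
int pauseBatchRange(int io_threads, int batch, int *start, int *end) {
    int first, last;
    if (batch < 0) return 0;
    first = 1 + batch * PAUSE_BATCH;
    if (first > io_threads) return 0;
    last = first + PAUSE_BATCH - 1;
    if (last > io_threads) last = io_threads;
    *start = first;
    *end = last;
    return 1;
}
```

With 20 IO threads this yields the ranges 1-8, 9-16, and 17-20, each of which could be passed to a range-based pause such as `pauseIOThreadsRange`.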

### Observability
In the current design, the main thread always assigns clients to the IO
thread with the fewest clients. To clearly observe the number of clients
handled by each IO thread, we added a new section to the INFO output. The
`INFO THREADS` section shows the client count for each IO thread.
```
# Threads
io_thread_0:clients=0
io_thread_1:clients=2
io_thread_2:clients=2
```
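
The assignment rule can be sketched as a simple scan over the per-thread client counts. This is an illustration only: the function name is hypothetical, and it assumes that index 0 is the main thread and is skipped, with ties going to the lowest thread id.

```c
/* Returns the index of the IO thread with the fewest clients.
 * Index 0 is assumed to be the main thread and is only returned
 * when there are no other threads. */
int pickIOThread(const int *clients, int num_threads) {
    int best = 0;
    for (int i = 1; i < num_threads; i++) {
        if (best == 0 || clients[i] < clients[best]) best = i;
    }
    return best;
}
```

For the counts in the sample output above (0, 2, 2), this sketch would pick thread 1 for the next client.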

Additionally, in the `CLIENT LIST` output, we also added a field to
indicate the thread to which each client is assigned.

`id=244 addr=127.0.0.1:41870 laddr=127.0.0.1:6379 ... resp=2 lib-name=
lib-ver= io-thread=1`

## Trade-off
### Special Clients
For certain special types of clients, keeping them running on IO threads
would result in severe race issues that are difficult to resolve.
Therefore, we chose not to offload these clients to the IO threads.

For replica, monitor, subscribe, and tracking clients, the main thread
may write a reply to them directly when certain conditions are met. The
resulting races are difficult to resolve, so these clients are processed
in the main thread. The same applies to Lua debug clients, since we may
operate on their connection directly.

For blocking clients, after the IO thread reads and parses a command and
hands it over to the main thread, a client identified as blocking
remains in the main thread. Once the blocking operation completes and
the reply is generated, the client is transferred back to the IO thread
to send the reply and wait for event triggers.

### Clients Eviction
To support client eviction, it is necessary to update each client’s
memory usage promptly during operations such as read, write, or command
execution. However, when a client operates on an IO thread, it is not
feasible to update the memory usage immediately due to the risk of data
races. As a result, memory usage can only be updated either in the main
thread while processing commands or periodically in `clientsCron`.
The downside of this approach is that updates might experience a delay
of up to one second, which could impact the precision of memory
management for eviction.

To avoid incorrectly evicting clients, we adopted a best-effort
compensation: when we decide to evict a client, we update its memory
usage again before evicting it. If the client's memory usage has not
decreased, or its memory usage bucket has not changed, we evict it;
otherwise we do not.
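
The re-check can be sketched as a small predicate. The names are hypothetical, and the power-of-two bucket function is only an assumption made so the example is self-contained; the text above does not say how buckets are computed.

```c
#include <stddef.h>

/* Hypothetical bucket function: index of the highest set bit (log2 buckets). */
static int usageBucket(size_t bytes) {
    int b = 0;
    while (bytes > 1) {
        bytes >>= 1;
        b++;
    }
    return b;
}

/* recorded: usage seen when the eviction decision was made.
 * refreshed: usage measured again just before evicting.
 * Evict if usage did not decrease, or if it stayed in the same bucket. */
int shouldStillEvict(size_t recorded, size_t refreshed) {
    if (refreshed >= recorded) return 1;
    return usageBucket(refreshed) == usageBucket(recorded);
}
```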

However, we have not completely solved this problem. Due to the delay in
memory usage updates, it may lead us to make incorrect decisions about
the need to evict clients.

### Defragment
In the majority of cases we do NOT use the data from argv directly in
the db.
1. key names
We store a copy that we allocate in the main thread, see `sdsdup()` in
`dbAdd()`.
2. hash key and value
We store key as hfield and store value as sds, see `hfieldNew()` and
`sdsdup()` in `hashTypeSet()`.
3. other datatypes
   They don't even use SDS, so there are no reference issues.

But in some cases the data from argv may be retained by the main
thread.
As a result, during fragmentation cleanup, we need to move allocations
from the IO thread’s arena to the main thread’s arena. We always
allocate new memory in the main thread’s arena, but the memory released
by IO threads may not yet have been reclaimed. This ultimately causes
the fragmentation rate to be higher compared to creating and allocating
entirely within a single thread.
The following cases lead to memory allocated by the IO thread being kept
by the main thread.
1. string related command: `append`, `getset`, `mset` and `set`.
If `tryObjectEncoding()` does not change argv, we will keep it directly
in the main thread, see the code in `tryObjectEncoding()`(specifically
`trimStringObjectIfNeeded()`)
2. block related command.
    the key names will be kept in `c->db->blocking_keys`.
3. watch command
    the key names will be kept in `c->db->watched_keys`.
4. [s]subscribe command
    channel name will be kept in `serverPubSubChannels`.
5. script load command
    script will be kept in `server.lua_scripts`.
6. some module APIs: `RM_RetainString`, `RM_HoldString`

Those issues will be handled in other PRs.

## Testing
### Functional Testing
With IO threads enabled, the commit passes all TCL tests, but we made
some changes:
**Client query buffer**: In the original code, when using a reusable
query buffer, ownership of the query buffer would be released after the
command was processed. However, with IO threads enabled, the client
transitions from an IO thread to the main thread for processing. This
causes the ownership release to occur earlier than the command
execution. As a result, when IO threads are enabled, the client's
information will never indicate that a shared query buffer is in use.
Therefore, we skip the corresponding query buffer tests in this case.
**Defragment**: Add a new defragmentation test to verify the effect of
io threads on defragmentation.
**Command delay**: For deferred clients in TCL tests, due to clients
being assigned to different threads for execution, delays may occur. To
address this, we introduced conditional waiting: the process proceeds to
the next step only when the `client list` contains the corresponding
commands.

### Sanitizer Testing
The commit passed all TCL tests and reported no errors when compiled
with the `-fsanitize=thread` and `-fsanitize=address` options enabled.
We made one modification: we suppressed the sanitizer warnings for
clients with watched keys when updating `client->flags`. IO threads read
`client->flags` but never modify it and never read the
`CLIENT_DIRTY_CAS` bit, while the main thread only modifies that bit, so
we believe there is no actual data race.

## Others
### IO thread number
In the new multi-threaded design, the main thread is primarily focused
on command processing to improve performance. Typically, the main thread
does not handle regular client I/O operations but is responsible for
clients such as replication and tracking clients. To avoid breaking
changes, we still consider the main thread as the first IO thread.

When the io-threads configuration is set to a low value (e.g., 2),
performance does not show a significant improvement compared to a
single-threaded setup for simple commands (such as SET or GET), as the
main thread does not consume much CPU for these simple operations. This
results in underutilized multi-core capacity. However, for more complex
commands, having a low number of IO threads may still be beneficial.
Therefore, it’s important to adjust the `io-threads` based on your own
performance tests.

Additionally, you can clearly monitor the CPU utilization of the main
thread and IO threads using `top -H -p $redis_pid`. This allows you to
easily identify where the bottleneck is. If the IO thread is the
bottleneck, increasing the `io-threads` will improve performance. If the
main thread is the bottleneck, the overall performance can only be
scaled by increasing the number of shards or replicas.

---------

Co-authored-by: debing.sun <debing.sun@redis.com>
Co-authored-by: oranagra <oran@redislabs.com>
2024-12-23 14:16:40 +08:00
Filipe Oliveira (Redis) 59953d2df6
Improve listpack Handling and Decoding Efficiency: 16.3% improvement on LRANGE command (#13652)
This PR focused on refining listpack encoding/decoding functions and
optimizing reply handling mechanisms related to it.
Each commit has the measured improvement up until the last accumulated
improvement of 16.3% on
[memtier_benchmark-1key-list-100-elements-lrange-all-elements-pipeline-10](https://github.com/redis/redis-benchmarks-specification/blob/main/redis_benchmarks_specification/test-suites/memtier_benchmark-1key-list-100-elements-lrange-all-elements-pipeline-10.yml)
benchmark.

Connection mode | CE Baseline (Nov 14th) 701f06657d | CE PR #13652 | CE PR vs CE Unstable
-- | -- | -- | --
TCP | 155696 | 178874 | 14.9%
Unix socket | 169743 | 197428 | 16.3%

To test it we can simply focus on the scan.tcl

```
tclsh tests/test_helper.tcl --single unit/replybufsize
```

### Commit details:
- 2e58d048fd + 29c6c86c6b : Eliminate an indirect memory access on
lpCurrentEncodedSizeBytes and completely avoid passing p* fully to
lpCurrentEncodedSizeBytes + Add lpNextWithBytes helper function and
optimize addListListpackRangeReply

**- Improvement of 3.1%, from 168969.88 ops/sec to 174239.75 ops/sec**
- af52aacff8 Refactor lpDecodeBacklen for
loop-based decoding, improving readability and branch efficiency.
**- NO CHANGE. REVERTED in 09f6680ba0d0b5acabca537c651008f0c8ec061b**
    
- 048bfe4eda + 03e8ff3af7 : reduce condition checks in
_addReplyToBuffer, inline it, avoid entering it when there are already
entries in the reply list, and check whether the reply length exceeds
the available buffer space before calling _addReplyToBuffer
**- accumulated Improvement of 12.4%, from 168969.88 ops/sec to 189726.81 ops/sec**

- 9a63d4d6a9fa946505e31ecce4c7796845fc022c: always update the buf_peak
on _addReplyToBufferOrList
**- accumulated Improvement of 14.2%, from 168969.88 ops/sec to 193887
ops/sec**

- b544ade67628a1feaf714d6cfd114930e0c7670b: Introduce
lpEncodeBacklenBytes to avoid any indirect memory access on previous
usage of lpEncodeBacklen(NULL,...). inline lpEncodeBacklenBytes().
**- accumulated Improvement of 16.3%, from 168969.88 ops/sec to
197427.70 ops/sec**
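
To illustrate the idea behind a size-only helper such as `lpEncodeBacklenBytes`: when only the number of backlen bytes is needed, a function that never touches a buffer can return it directly. The sketch below is hypothetical (`backlenBytesSketch` is not the Redis function) and assumes 7 payload bits per backlen byte, so its thresholds may not match `listpack.c` exactly.

```c
#include <stdint.h>

/* Number of bytes needed to store a length using 7 payload bits per byte.
 * Pure arithmetic: no buffer pointer, so no NULL check or memory access. */
static inline unsigned long backlenBytesSketch(uint64_t l) {
    if (l < (1ULL << 7)) return 1;
    if (l < (1ULL << 14)) return 2;
    if (l < (1ULL << 21)) return 3;
    if (l < (1ULL << 28)) return 4;
    return 5;
}
```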

---------

Co-authored-by: debing.sun <debing.sun@redis.com>
2024-12-04 18:04:37 +08:00
Filipe Oliveira (Redis) a106198878
Optimize addReplyBulk on sds/int encoded strings: 2.2% to 4% reduction of CPU Time on GET high pipeline use-cases (#13644)
### Summary

By profiling the 1KiB 100% GET use case with high pipelining, we can
see that addReplyBulk and its inner calls take 8.30% of the CPU
cycles. This PR reduces the CPU time spent on addReplyBulk by 2.2 to
4.0 percentage points of total CPU time. Specifically for GET use cases,
we saw an improvement of 2.7% to 9.1% in the achievable ops/sec.

### Improvement

By reducing the duplicate work we can improve by around 2.7% on sds
encoded strings, and around 9% on int encoded strings. This PR does the
following:
- Avoid the duplicate sdslen() call in addReplyBulk() for sds encoded
objects
- Avoid the duplicate sdigits10() call in addReplyBulk() for int encoded
objects
- Avoid the final "\r\n" addReplyProto() call for the OBJ_ENCODING_INT
type in addReplyBulk()

Altogether, these improvements result in the following gains in
achievable ops/sec:

Encoding | unstable (commit 9906daf5c9) | this PR | % improvement
-- | -- | -- | --
1KiB Values string SDS encoded | 1478081.88 | 1517635.38 | 2.7%
Values string "1" OBJ_ENCODING_INT | 1521139.36 | 1658876.59 | 9.1%

### CPU Time: Total of addReplyBulk

Encoding | unstable (commit 9906daf5c9) | this PR | reduction of CPU Time: Total
-- | -- | -- | --
1KiB Values string SDS encoded | 8.30% | 6.10% | 2.2%
Values string "1" OBJ_ENCODING_INT | 7.20% | 3.20% | 4.0%

### To reproduce

Run redis with unix socket enabled
```
taskset -c 0 /root/redis/src/redis-server  --unixsocket /tmp/1.socket --save '' --enable-debug-command local
```

#### 1KiB Values string SDS encoded

Load data
```
taskset -c 2-5 memtier_benchmark  --ratio 1:0 -n allkeys --key-pattern P:P --key-maximum 1000000  --hide-histogram  --pipeline 10 -S /tmp/1.socket

```

Benchmark
```
taskset -c 2-6 memtier_benchmark --ratio 0:1 -c 1 -t 5 --test-time 60 --hide-histogram -d 1000 --pipeline 500  -S /tmp/1.socket --key-maximum 1000000 --json-out-file results.json
```

#### Values string "1" OBJ_ENCODING_INT 

Load data
```
$ taskset -c 2-5 memtier_benchmark  --command "SET __key__ 1" -n allkeys --command-key-pattern P --key-maximum 1000000  --hide-histogram -c 1 -t 1  --pipeline 100 -S /tmp/1.socket

# confirm we have the expected reply and format 
$ redis-cli get memtier-1
"1"

$ redis-cli debug object memtier-1
Value at:0x7f14cec57570 refcount:2147483647 encoding:int serializedlength:2 lru:2861503 lru_seconds_idle:8

```

Benchmark
```
taskset -c 2-6 memtier_benchmark --ratio 0:1 -c 1 -t 5 --test-time 60 --hide-histogram -d 1000 --pipeline 500  -S /tmp/1.socket --key-maximum 1000000 --json-out-file results.json
```
2024-11-26 16:11:01 +08:00
debing.sun 701f06657d
Reuse c->argv after command execution to reduce memory allocation overhead (#13521)
Inspired by https://github.com/redis/redis/pull/12730

Before this PR, we allocated new memory to store the arguments of every
user command. However, if the previously allocated `c->argv` is large
enough to hold the current command, we can reuse it and avoid allocating
new memory.
`c->argv` is freed in the client cron once the client has been idle for
2 seconds.
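
A minimal sketch of the reuse rule, under assumptions: the struct, the `argv_len` capacity field, and both function names are hypothetical and only illustrate "reuse when the existing array is big enough, free after 2 idle seconds".

```c
#include <stdlib.h>

/* Hypothetical client fragment, for illustration only. */
typedef struct {
    void **argv;
    int argv_len; /* allocated slots */
    int argc;     /* slots used by the current command */
} clientSketch;

/* Returns 1 if the existing argv was reused, 0 if a new one was allocated,
 * -1 on invalid input or allocation failure. */
int prepareArgv(clientSketch *c, int argc) {
    if (argc <= 0) return -1;
    if (c->argv != NULL && c->argv_len >= argc) {
        c->argc = argc;
        return 1;
    }
    free(c->argv);
    c->argv = malloc(sizeof(void *) * (size_t)argc);
    if (c->argv == NULL) {
        c->argv_len = 0;
        c->argc = 0;
        return -1;
    }
    c->argv_len = argc;
    c->argc = argc;
    return 0;
}

/* Cron-side cleanup: drop the cached argv once the client has been idle
 * for at least 2 seconds. */
void releaseIdleArgv(clientSketch *c, int idle_seconds) {
    if (idle_seconds < 2 || c->argv == NULL) return;
    free(c->argv);
    c->argv = NULL;
    c->argv_len = 0;
    c->argc = 0;
}
```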

---------

Co-authored-by: Ozan Tezcan <ozantezcan@gmail.com>
2024-11-14 20:35:31 +08:00
Filipe Oliveira (Redis) 7b69183a8d
Replace usage of _addReplyLongLongWithPrefix with specific bulk/mbulk functions to reduce condition checks in hotpath. (#13520)
Instead of adding runtime logic to decide which prefix/shared object to
use when doing the reply we can simply use an inline method to avoid runtime
overhead of condition checks, and also keep the code change small.
Preliminary data show improvements on commands that heavily rely on
bulk/mbulk replies (example of LRANGE).

---------

Co-authored-by: debing.sun <debing.sun@redis.com>
2024-09-15 21:40:09 +08:00
Filipe Oliveira (Redis) 31227f4faf
Optimize client type check on reply hot code paths (#13516)
## Proposed improvement

This PR introduces the static inline function `clientTypeIsSlave`, which
performs only 1 condition check versus the 3 checks of `getClientType`,
and uses `unlikely` to tell the compiler that the most common outcome is
for the client not to be a slave.
Preliminary data show a 3% improvement in the achievable ops/sec on the
specific LRANGE benchmark. After running the entire suite, we see up to
5% improvement in 2 tests.
https://github.com/redis/redis/pull/13516#issuecomment-2331326052
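
The shape of such a check can be sketched as below. Everything here is an assumption for illustration: the struct, the flag bit, and the `...Sketch` name are hypothetical, and `unlikely` is defined with the GCC/Clang builtin `__builtin_expect`.

```c
#include <stdint.h>

/* Branch hint for GCC/Clang: the condition is expected to be false. */
#define unlikely(x) __builtin_expect(!!(x), 0)

#define SKETCH_CLIENT_SLAVE (1ULL << 0) /* hypothetical flag bit */

typedef struct {
    uint64_t flags;
} clientSketch;

/* A single flag test, hinted as the uncommon case. */
static inline int clientTypeIsSlaveSketch(const clientSketch *c) {
    if (unlikely(c->flags & SKETCH_CLIENT_SLAVE)) return 1;
    return 0;
}
```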

## Context

This optimization effort comes from analyzing the profile info from the
[memtier_benchmark-1key-list-1K-elements-lrange-all-elements](https://github.com/redis/redis-benchmarks-specification/blob/main/redis_benchmarks_specification/test-suites/memtier_benchmark-1key-list-1K-elements-lrange-all-elements.yml)
benchmark.
 
By going over it, we can see that `getClientType` consumes 2% of the cpu
time, strictly to check if the client is a slave (
https://github.com/redis/redis/blob/unstable/src/networking.c#L397 , and
https://github.com/redis/redis/blob/unstable/src/networking.c#L1254 )


Function | CPU Time: Total | CPU Time: Self | Module | Function (Full)
-- | -- | -- | -- | --
_addReplyToBufferOrList->getClientType | 1.20% | 0.728s | redis-server | getClientType
clientHasPendingReplies->getClientType | 0.80% | 0.482s | redis-server | getClientType

---------

Co-authored-by: debing.sun <debing.sun@redis.com>
2024-09-06 10:24:30 +08:00
debing.sun ea3e8b79a1
Introduce reusable query buffer for client reads (#13488)
This PR is based on the commits from PR
https://github.com/valkey-io/valkey/pull/258,
https://github.com/valkey-io/valkey/pull/593,
https://github.com/valkey-io/valkey/pull/639

This PR optimizes client query buffer handling in Redis by introducing
a reusable query buffer that is used by default for client reads. This
reduces memory usage by ~20KB per client by avoiding allocations for
most clients using short (<16KB) complete commands. For larger or
partial commands, the client still gets its own private buffer.

The primary changes are:

* Adding a reusable query buffer `thread_shared_qb` that clients use by
default.
* Modifying client querybuf initialization and reset logic.
* Freeing idle client query buffers when empty to allow reuse of the
reusable query buffer.
* Master client query buffers are kept private as their contents need to
be preserved for replication stream.
* When nested commands are executed, only the first user uses the
reusable buffer; subsequent users still use a private buffer.

In addition to the memory savings, this change shows a 3% improvement in
latency and throughput when running with 1000 active clients.

The memory reduction may also help reduce the need to evict clients when
reaching max memory limit, as the query buffer is the main memory
consumer per client.

This PR differs from https://github.com/valkey-io/valkey/pull/258 in two
ways:
1. While a client holds the reused buffer (between acquiring it and
returning it), we do not update the reused query buffer, regardless of
whether the query buffer has changed (expanded). Instead, we return the
reused query buffer (expanded or with data remaining) or reset it at the
end.
2. We add a new thread variable `thread_shared_qb_used` to prevent
multiple clients from using the reusable query buffer at the same time.
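
A simplified sketch of the acquire/release pattern. `thread_shared_qb` and `thread_shared_qb_used` are the names given above; the client struct, the function names, the plain `char *` buffer, and the fixed 16KB size are assumptions made for illustration, and allocation-failure handling for the private buffer is omitted.

```c
#include <stdlib.h>

#define REUSABLE_QB_SIZE (16 * 1024) /* assumed size for this sketch */

static _Thread_local char *thread_shared_qb;     /* one buffer per thread */
static _Thread_local int thread_shared_qb_used;  /* 1 while a client holds it */

/* Hypothetical client fragment. */
typedef struct {
    char *querybuf;
    int uses_shared;
} clientSketch;

/* Give the client the shared buffer if it is free, else a private one. */
void acquireQueryBuffer(clientSketch *c) {
    if (!thread_shared_qb_used) {
        if (thread_shared_qb == NULL) thread_shared_qb = malloc(REUSABLE_QB_SIZE);
        if (thread_shared_qb != NULL) {
            c->querybuf = thread_shared_qb;
            c->uses_shared = 1;
            thread_shared_qb_used = 1;
            return;
        }
    }
    c->querybuf = malloc(REUSABLE_QB_SIZE);
    c->uses_shared = 0;
}

/* Return the shared buffer for reuse, or free the private one. */
void releaseQueryBuffer(clientSketch *c) {
    if (c->uses_shared) thread_shared_qb_used = 0;
    else free(c->querybuf);
    c->querybuf = NULL;
    c->uses_shared = 0;
}
```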

---------

Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Signed-off-by: Madelyn Olson <matolson@amazon.com>
Co-authored-by: Uri Yagelnik <uriy@amazon.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: oranagra <oran@redislabs.com>
2024-09-04 19:10:40 +08:00
Filipe Oliveira (Redis) 05aed4cab9
Optimize SET/INCR*/DECR*/SETRANGE/APPEND by reducing duplicate computation (#13505)
- Avoid addReplyLongLong (which converts back to string) the value we
already have as a robj, by using addReplyProto + addReply
- Avoid doing dbFind Twice for the same dictEntry on
INCR*/DECR*/SETRANGE/APPEND commands.
- Avoid multiple sdslen calls with the same input on setrangeCommand and
appendCommand
- Introduce setKeyWithDictEntry, which is like setKey(), but accepts an
optional dictEntry input: Avoids the second dictFind in SET command

---------

Co-authored-by: debing.sun <debing.sun@redis.com>
2024-09-04 14:51:21 +08:00
Filipe Oliveira (Redis) a31b516e25
Optimize the HELLO command reply (#13490)
# Overall improvement

TBD ( current is approximately 6% on the achievable ops/sec), coming
from:

- In case of no module we can skip 1.3% CPU cycles on dict Iterator
creation/deletion
- Use addReplyBulkCBuffer instead of addReplyBulkCString to avoid
runtime strlen overhead within HELLO reply on string constants.

## Optimization 1: In case of no module we can skip 1.3% CPU cycles on
dict Iterator creation/deletion.

## Optimization 2: Use addReplyBulkCBuffer instead of
addReplyBulkCString to avoid runtime strlen overhead within HELLO reply
on string constants.
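
The difference between the two call styles can be illustrated with stand-in functions. The helpers below are hypothetical and are not the Redis reply functions; they only show that a buffer-plus-length call lets a string constant's length come from `sizeof` at compile time, while a C-string call computes it with `strlen`.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a reply function that takes an explicit length. */
static size_t replyBulkBuffer(char *dst, const char *src, size_t len) {
    memcpy(dst, src, len);
    return len;
}

/* Length known at compile time: sizeof on a string literal counts the
 * terminating NUL, hence the -1. */
#define replyBulkLiteral(dst, lit) replyBulkBuffer((dst), (lit), sizeof(lit) - 1)

/* Length computed with strlen(), as a C-string variant would do. */
static size_t replyBulkCString(char *dst, const char *src) {
    return replyBulkBuffer(dst, src, strlen(src));
}
```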
2024-09-03 17:27:35 +08:00
Meir Shpilraien (Spielrein) 3264deb24e
Avoid used_memory contention when update from multiple threads. (#13431)
This PR attempts to avoid contention on the `used_memory` global
variable when memory is allocated or freed from multiple threads at the
same time.

Each time a thread allocates or releases memory, it needs to update the
`used_memory` global variable. This update can cause contention when
done aggressively from multiple threads.

### The solution

Instead of having a single global variable that needs to be updated from
multiple threads, we create an array of used_memory counters. Each entry
in the array is updated by a single thread, and the main thread sums all
the entries to obtain the total memory usage.

Although this solution reduces contention between threads on updating
the `used_memory` global variable, it adds work to the main thread,
which needs to sum all the entries of the `used_memory` array. To avoid
increasing the main thread's work too much, we limit the size of the
array to 16. This means that up to 16 threads can run without any
contention between them. With more than 16 threads, entries in the
array are shared between threads, so some contention remains, but it is
much less significant.

Note that in order to really avoid contention, the entries in the
`used_memory` array must reside on different cache lines. To achieve
that, we create a struct with padding such that its size is exactly the
cache line size. In addition, we make sure the address of the
`used_memory` array is aligned to the cache line size.
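
A minimal C11 sketch of this layout. The 64-byte cache line, the type and function names, and the modulo mapping of thread index to slot are assumptions for illustration, not the code of this PR.

```c
#include <stdatomic.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64 /* assumed; the real value is platform dependent */
#define MAX_THREADS_NUM 16

/* One counter per slot, padded so that each slot fills a whole cache line. */
typedef struct {
    _Atomic long long used_memory;
    char padding[CACHE_LINE_SIZE - sizeof(_Atomic long long)];
} usedMemorySlot;

_Static_assert(sizeof(usedMemorySlot) == CACHE_LINE_SIZE,
               "each slot must occupy exactly one cache line");

/* The array itself is aligned so slots never straddle two cache lines. */
static _Alignas(CACHE_LINE_SIZE) usedMemorySlot used_memory_slots[MAX_THREADS_NUM];

/* Called by the allocating thread; threads beyond 16 share slots. */
void updateUsedMemory(unsigned thread_index, long long delta) {
    usedMemorySlot *slot = &used_memory_slots[thread_index % MAX_THREADS_NUM];
    atomic_fetch_add_explicit(&slot->used_memory, delta, memory_order_relaxed);
}

/* Called by the main thread: sum all slots to get the total. */
long long totalUsedMemory(void) {
    long long total = 0;
    for (size_t i = 0; i < MAX_THREADS_NUM; i++)
        total += atomic_load_explicit(&used_memory_slots[i].used_memory,
                                      memory_order_relaxed);
    return total;
}
```

The sum is only approximately consistent while other threads keep allocating, which is the usual trade-off of per-thread counters.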

### Benchmark

Some benchmarks show improvement (up to 15%):

| Test Case | Baseline unstable (median obs. +- std.dev) | Comparison test_used_memory_per_thread_array (median obs. +- std.dev) | % change (higher-better) | Note |
|---|---|---:|---|---|
| memtier_benchmark-1key-list-100-elements-lrange-all-elements | 92657 +- 2.0% (2 datapoints) | 101445 | 9.5% | IMPROVEMENT |
| memtier_benchmark-1key-list-1K-elements-lrange-all-elements | 14965 +- 1.3% (2 datapoints) | 16296 | 8.9% | IMPROVEMENT |
| memtier_benchmark-1key-set-10-elements-smembers-pipeline-10 | 431019 +- 5.2% (2 datapoints) | 461039 | 7.0% | waterline=5.2%. IMPROVEMENT |
| memtier_benchmark-1key-set-100-elements-smembers | 74367 +- 0.0% (2 datapoints) | 80190 | 7.8% | IMPROVEMENT |
| memtier_benchmark-1key-set-1K-elements-smembers | 11730 +- 0.4% (2 datapoints) | 13519 | 15.3% | IMPROVEMENT |


Full results:

| Test Case | Baseline unstable (median obs. +- std.dev) | Comparison test_used_memory_per_thread_array (median obs. +- std.dev) | % change (higher-better) | Note |
|---|---|---:|---|---|
| memtier_benchmark-10Mkeys-load-hash-5-fields-with-1000B-values | 88613 +- 1.0% (2 datapoints) | 88688 | 0.1% | No Change |
| memtier_benchmark-10Mkeys-load-hash-5-fields-with-1000B-values-pipeline-10 | 124786 +- 1.2% (2 datapoints) | 123671 | -0.9% | No Change |
| memtier_benchmark-10Mkeys-load-hash-5-fields-with-100B-values | 122460 +- 1.4% (2 datapoints) | 122990 | 0.4% | No Change |
| memtier_benchmark-10Mkeys-load-hash-5-fields-with-100B-values-pipeline-10 | 333384 +- 5.1% (2 datapoints) | 319221 | -4.2% | waterline=5.1%. potential REGRESSION |
| memtier_benchmark-10Mkeys-load-hash-5-fields-with-10B-values | 137354 +- 0.3% (2 datapoints) | 138759 | 1.0% | No Change |
| memtier_benchmark-10Mkeys-load-hash-5-fields-with-10B-values-pipeline-10 | 401261 +- 4.3% (2 datapoints) | 398524 | -0.7% | No Change |
| memtier_benchmark-1Mkeys-100B-expire-use-case | 179058 +- 0.4% (2 datapoints) | 180114 | 0.6% | No Change |
| memtier_benchmark-1Mkeys-10B-expire-use-case | 180390 +- 0.2% (2 datapoints) | 180401 | 0.0% | No Change |
| memtier_benchmark-1Mkeys-1KiB-expire-use-case | 175993 +- 0.7% (2 datapoints) | 175147 | -0.5% | No Change |
| memtier_benchmark-1Mkeys-4KiB-expire-use-case | 165771 +- 0.0% (2 datapoints) | 164434 | -0.8% | No Change |
| memtier_benchmark-1Mkeys-bitmap-getbit-pipeline-10 | 931339 +- 2.1% (2 datapoints) | 929487 | -0.2% | No Change |
| memtier_benchmark-1Mkeys-generic-exists-pipeline-10 | 999462 +- 0.4% (2 datapoints) | 963226 | -3.6% | potential REGRESSION |
| memtier_benchmark-1Mkeys-generic-expire-pipeline-10 | 905333 +- 1.4% (2 datapoints) | 896673 | -1.0% | No Change |
| memtier_benchmark-1Mkeys-generic-expireat-pipeline-10 | 885015 +- 1.0% (2 datapoints) | 865010 | -2.3% | No Change |
| memtier_benchmark-1Mkeys-generic-pexpire-pipeline-10 | 897115 +- 1.2% (2 datapoints) | 887544 | -1.1% | No Change |
| memtier_benchmark-1Mkeys-generic-scan-pipeline-10 | 451103 +- 3.2% (2 datapoints) | 465571 | 3.2% | potential IMPROVEMENT |
| memtier_benchmark-1Mkeys-generic-touch-pipeline-10 | 996809 +- 0.6% (2 datapoints) | 984478 | -1.2% | No Change |
| memtier_benchmark-1Mkeys-generic-ttl-pipeline-10 | 979570 +- 1.7% (2 datapoints) | 958752 | -2.1% | No Change |
| memtier_benchmark-1Mkeys-hash-hget-hgetall-hkeys-hvals-with-100B-values | 180888 +- 0.5% (2 datapoints) | 182295 | 0.8% | No Change |
| memtier_benchmark-1Mkeys-hash-hmget-5-fields-with-100B-values-pipeline-10 | 717881 +- 1.0% (2 datapoints) | 724814 | 1.0% | No Change |
| memtier_benchmark-1Mkeys-hash-transactions-multi-exec-pipeline-20 | 1055447 +- 0.4% (2 datapoints) | 1065836 | 1.0% | No Change |
| memtier_benchmark-1Mkeys-lhash-hexists | 164332 +- 0.1% (2 datapoints) | 163636 | -0.4% | No Change |
| memtier_benchmark-1Mkeys-lhash-hincbry | 171674 +- 0.3% (2 datapoints) | 172737 | 0.6% | No Change |
| memtier_benchmark-1Mkeys-list-lpop-rpop-with-100B-values | 180904 +- 1.1% (2 datapoints) | 179467 | -0.8% | No Change |
| memtier_benchmark-1Mkeys-list-lpop-rpop-with-10B-values | 181746 +- 0.8% (2 datapoints) | 182416 | 0.4% | No Change |
| memtier_benchmark-1Mkeys-list-lpop-rpop-with-1KiB-values | 182004 +- 0.7% (2 datapoints) | 180237 | -1.0% | No Change |
| memtier_benchmark-1Mkeys-load-hash-5-fields-with-1000B-values | 105191 +- 0.9% (2 datapoints) | 105058 | -0.1% | No Change |
| memtier_benchmark-1Mkeys-load-hash-5-fields-with-1000B-values-pipeline-10 | 150683 +- 0.9% (2 datapoints) | 153597 | 1.9% | No Change |
| memtier_benchmark-1Mkeys-load-hash-hmset-5-fields-with-1000B-values | 104122 +- 0.7% (2 datapoints) | 105236 | 1.1% | No Change |
| memtier_benchmark-1Mkeys-load-list-with-100B-values | 149770 +- 0.9% (2 datapoints) | 150510 | 0.5% | No Change |
| memtier_benchmark-1Mkeys-load-list-with-10B-values | 165537 +- 1.9% (2 datapoints) | 164329 | -0.7% | No Change |
| memtier_benchmark-1Mkeys-load-list-with-1KiB-values | 113315 +- 0.5% (2 datapoints) | 114110 | 0.7% | No Change |
| memtier_benchmark-1Mkeys-load-stream-1-fields-with-100B-values | 131201 +- 0.7% (2 datapoints) | 129545 | -1.3% | No Change |
| memtier_benchmark-1Mkeys-load-stream-1-fields-with-100B-values-pipeline-10 | 352891 +- 2.8% (2 datapoints) | 348338 | -1.3% | No Change |
| memtier_benchmark-1Mkeys-load-stream-5-fields-with-100B-values | 104386 +- 0.7% (2 datapoints) | 105796 | 1.4% | No Change |
| memtier_benchmark-1Mkeys-load-stream-5-fields-with-100B-values-pipeline-10 | 227593 +- 5.5% (2 datapoints) | 218783 | -3.9% | waterline=5.5%. potential REGRESSION |
| memtier_benchmark-1Mkeys-load-string-with-100B-values | 167552 +- 0.2% (2 datapoints) | 170282 | 1.6% | No Change |
| memtier_benchmark-1Mkeys-load-string-with-100B-values-pipeline-10 | 646888 +- 0.5% (2 datapoints) | 639680 | -1.1% | No Change |
| memtier_benchmark-1Mkeys-load-string-with-10B-values | 174891 +- 0.7% (2 datapoints) | 174382 | -0.3% | No Change |
| memtier_benchmark-1Mkeys-load-string-with-10B-values-pipeline-10 | 749988 +- 5.1% (2 datapoints) | 769986 | 2.7% | waterline=5.1%. No Change |
| memtier_benchmark-1Mkeys-load-string-with-1KiB-values | 155929 +- 0.1% (2 datapoints) | 156387 | 0.3% | No Change |
| memtier_benchmark-1Mkeys-load-zset-with-10-elements-double-score | 92241 +- 0.2% (2 datapoints) | 92189 | -0.1% | No Change |
| memtier_benchmark-1Mkeys-load-zset-with-10-elements-int-score | 114328 +- 1.3% (2 datapoints) | 113154 | -1.0% | No Change |
| memtier_benchmark-1Mkeys-string-get-100B | 180685 +- 0.2% (2 datapoints) | 180359 | -0.2% | No Change |
| memtier_benchmark-1Mkeys-string-get-100B-pipeline-10 | 991291 +- 3.1% (2 datapoints) | 1020086 | 2.9% | No Change |
| memtier_benchmark-1Mkeys-string-get-10B | 181183 +- 0.3% (2 datapoints) | 177868 | -1.8% | No Change |
| memtier_benchmark-1Mkeys-string-get-10B-pipeline-10 | 1032554 +- 0.8% (2 datapoints) | 1023120 | -0.9% | No Change |
| memtier_benchmark-1Mkeys-string-get-1KiB | 180479 +- 0.9% (2 datapoints) | 182215 | 1.0% | No Change |
| memtier_benchmark-1Mkeys-string-get-1KiB-pipeline-10 | 979286 +- 0.9% (2 datapoints) | 989888 | 1.1% | No Change |
| memtier_benchmark-1Mkeys-string-mget-1KiB | 121950 +- 0.4% (2 datapoints) | 120996 | -0.8% | No Change |
| memtier_benchmark-1key-geo-60M-elements-geodist | 179404 +- 1.0% (2 datapoints) | 181232 | 1.0% | No Change |
| memtier_benchmark-1key-geo-60M-elements-geodist-pipeline-10 | 1023797 +- 0.5% (2 datapoints) | 1014980 | -0.9% | No Change |
| memtier_benchmark-1key-geo-60M-elements-geohash | 180808 +- 1.2% (2 datapoints) | 180606 | -0.1% | No Change |
| memtier_benchmark-1key-geo-60M-elements-geohash-pipeline-10 | 1056458 +- 1.6% (2 datapoints) | 1040050 | -1.6% | No Change |
| memtier_benchmark-1key-geo-60M-elements-geopos | 181808 +- 0.2% (2 datapoints) | 175945 | -3.2% | potential REGRESSION |
| memtier_benchmark-1key-geo-60M-elements-geopos-pipeline-10 | 1038180 +- 3.4% (2 datapoints) | 1033005 | -0.5% | No Change |
| memtier_benchmark-1key-geo-60M-elements-geosearch-fromlonlat | 142614 +- 0.3% (2 datapoints) | 144259 | 1.2% | No Change |
| memtier_benchmark-1key-geo-60M-elements-geosearch-fromlonlat-bybox | 141008 +- 0.4% (2 datapoints) | 139602 | -1.0% | No Change |
| memtier_benchmark-1key-geo-60M-elements-geosearch-fromlonlat-pipeline-10 | 560698 +- 0.8% (2 datapoints) | 548806 | -2.1% | No Change |
| memtier_benchmark-1key-list-10-elements-lrange-all-elements | 166132 +- 0.9% (2 datapoints) | 170259 | 2.5% | No Change |
| memtier_benchmark-1key-list-100-elements-lrange-all-elements | 92657 +- 2.0% (2 datapoints) | 101445 | 9.5% | IMPROVEMENT |
|memtier_benchmark-1key-list-1K-elements-lrange-all-elements | 14965 +-
1.3% (2 datapoints) | 16296|8.9% |IMPROVEMENT |
|memtier_benchmark-1key-pfadd-4KB-values-pipeline-10 | 264156 +- 0.2% (2
datapoints) | 262582|-0.6% |No Change |
|memtier_benchmark-1key-set-10-elements-smembers | 138916 +- 1.7% (2
datapoints) | 138016|-0.6% |No Change |
|memtier_benchmark-1key-set-10-elements-smembers-pipeline-10 | 431019 +-
5.2% (2 datapoints) | 461039|7.0% |waterline=5.2%. IMPROVEMENT |
|memtier_benchmark-1key-set-10-elements-smismember | 173545 +- 1.1% (2
datapoints) | 173488|-0.0% |No Change |
|memtier_benchmark-1key-set-100-elements-smembers | 74367 +- 0.0% (2
datapoints) | 80190|7.8% |IMPROVEMENT |
|memtier_benchmark-1key-set-100-elements-smismember | 155682 +- 1.6% (2
datapoints) | 151367|-2.8% |No Change |
|memtier_benchmark-1key-set-1K-elements-smembers | 11730 +- 0.4% (2
datapoints) | 13519|15.3% |IMPROVEMENT |
|memtier_benchmark-1key-set-200K-elements-sadd-constant | 181070 +- 1.1%
(2 datapoints) | 180214|-0.5% |No Change |
|memtier_benchmark-1key-set-2M-elements-sadd-increasing | 166364 +- 0.1%
(2 datapoints) | 166944|0.3% |No Change |
|memtier_benchmark-1key-zincrby-1M-elements-pipeline-1 | 46071 +- 0.6%
(2 datapoints) | 44979|-2.4% |No Change |
|memtier_benchmark-1key-zrank-1M-elements-pipeline-1 | 48429 +- 0.4% (2
datapoints) | 49265|1.7% |No Change |
|memtier_benchmark-1key-zrem-5M-elements-pipeline-1 | 48528 +- 0.4% (2
datapoints) | 48869|0.7% |No Change |
|memtier_benchmark-1key-zrevrangebyscore-256K-elements-pipeline-1 |
100580 +- 1.5% (2 datapoints) | 101782|1.2% |No Change |
|memtier_benchmark-1key-zrevrank-1M-elements-pipeline-1 | 48621 +- 2.0%
(2 datapoints) | 48473|-0.3% |No Change |
|memtier_benchmark-1key-zset-10-elements-zrange-all-elements | 83485 +-
0.6% (2 datapoints) | 83095|-0.5% |No Change |

|memtier_benchmark-1key-zset-10-elements-zrange-all-elements-long-scores
| 118673 +- 0.8% (2 datapoints) | 118006|-0.6% |No Change |
|memtier_benchmark-1key-zset-100-elements-zrange-all-elements | 19009 +-
1.1% (2 datapoints) | 19293|1.5% |No Change |
|memtier_benchmark-1key-zset-100-elements-zrangebyscore-all-elements |
18957 +- 0.5% (2 datapoints) | 19419|2.4% |No Change |

|memtier_benchmark-1key-zset-100-elements-zrangebyscore-all-elements-long-scores|
171693 +- 0.5% (2 datapoints) | 172432|0.4% |No Change |
|memtier_benchmark-1key-zset-1K-elements-zrange-all-elements | 3566 +-
0.6% (2 datapoints) | 3672|3.0% |No Change |
|memtier_benchmark-1key-zset-1M-elements-zcard-pipeline-10 | 1067713 +-
0.4% (2 datapoints) | 1071550|0.4% |No Change |
|memtier_benchmark-1key-zset-1M-elements-zrevrange-5-elements | 169195
+- 0.7% (2 datapoints) | 169620|0.3% |No Change |
|memtier_benchmark-1key-zset-1M-elements-zscore-pipeline-10 | 914338 +-
0.2% (2 datapoints) | 905540|-1.0% |No Change |
|memtier_benchmark-2keys-lua-eval-hset-expire | 88346 +- 1.7% (2
datapoints) | 87259|-1.2% |No Change |
|memtier_benchmark-2keys-lua-evalsha-hset-expire | 103273 +- 1.2% (2
datapoints) | 102393|-0.9% |No Change |
|memtier_benchmark-2keys-set-10-100-elements-sdiff | 15418 +- 10.9%
UNSTABLE (2 datapoints) | 14369|-6.8% |UNSTABLE (very high variance) |
|memtier_benchmark-2keys-set-10-100-elements-sinter | 83601 +- 3.6% (2
datapoints) | 82508|-1.3% |No Change |
|memtier_benchmark-2keys-set-10-100-elements-sunion | 14942 +- 11.2%
UNSTABLE (2 datapoints) | 14001|-6.3% |UNSTABLE (very high variance) |
|memtier_benchmark-2keys-stream-5-entries-xread-all-entries | 75938 +-
0.4% (2 datapoints) | 76565|0.8% |No Change |
|memtier_benchmark-2keys-stream-5-entries-xread-all-entries-pipeline-10
| 120781 +- 1.1% (2 datapoints) | 119142|-1.4% |No Change |
2024-08-19 13:22:16 +03:00
debing.sun 7d3545cb16
Reduce redundant call of prepareClientToWrite when call addReply* continuously (#13412)
This PR is based on the commits from PR
https://github.com/valkey-io/valkey/pull/670.

## Description

While profiling benchmark workloads for hotspots, we noticed that
`prepareClientToWrite` consumes a high share of cycles, about 9% of the
CPU time of the `smembers` and `lrange` commands. After a deep dive into
the code, we concluded that we can gain performance by eliminating the
redundant calls to `prepareClientToWrite` when addReply* functions are
called back to back.

For example, in
https://github.com/valkey-io/valkey/blob/unstable/src/networking.c#L1080-L1082,
`prepareClientToWrite` is called three times in a row.
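
The idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual patch: `client_t`, `prepare_to_write`, `append_raw`, `add_bulk_before`, and `add_bulk_after` are hypothetical stand-ins for the Redis client struct, `prepareClientToWrite`, and the addReply* helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified client: a fixed output buffer plus a counter
 * that records how often the write-readiness check runs. */
typedef struct {
    char buf[256];
    size_t used;
    int can_write;      /* stand-in for the real eligibility checks */
    int prepare_calls;  /* instrumentation for this sketch only */
} client_t;

/* Stand-in for prepareClientToWrite(): returns 1 if replies may be queued. */
static int prepare_to_write(client_t *c) {
    c->prepare_calls++;
    return c->can_write;
}

/* Appends bytes without re-checking; the caller must have checked already. */
static void append_raw(client_t *c, const char *s, size_t len) {
    assert(c->used + len <= sizeof(c->buf));
    memcpy(c->buf + c->used, s, len);
    c->used += len;
}

/* Each piece of the reply goes through its own check. */
static void add_reply_checked(client_t *c, const char *s) {
    if (!prepare_to_write(c)) return;
    append_raw(c, s, strlen(s));
}

/* Before: a bulk reply triggers three checks (header, payload, CRLF). */
static void add_bulk_before(client_t *c, const char *payload) {
    char hdr[32];
    snprintf(hdr, sizeof(hdr), "$%zu\r\n", strlen(payload));
    add_reply_checked(c, hdr);
    add_reply_checked(c, payload);
    add_reply_checked(c, "\r\n");
}

/* After: check once, then append all three pieces unchecked. */
static void add_bulk_after(client_t *c, const char *payload) {
    char hdr[32];
    if (!prepare_to_write(c)) return;
    snprintf(hdr, sizeof(hdr), "$%zu\r\n", strlen(payload));
    append_raw(c, hdr, strlen(hdr));
    append_raw(c, payload, strlen(payload));
    append_raw(c, "\r\n", 2);
}
```

Both variants produce the same bytes; only the number of readiness checks differs.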

---------

Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
Co-authored-by: Lipeng Zhu <lipeng.zhu@intel.com>
Co-authored-by: Wangyang Guo <wangyang.guo@intel.com>
2024-07-15 23:19:19 +08:00
debing.sun 52e12d8bac
Don't keep global replication buffer reference for replicas marked CLIENT_CLOSE_ASAP (#13363)
In certain situations, we might generate a large number of propagations
(e.g., multi/exec, a Lua script, or a single command generating tons of
propagations) within a single event loop iteration.
If a replica is disconnected (marked as CLIENT_CLOSE_ASAP) during
propagation because it exceeded the output buffer limit, we should
remove its reference to the global replication buffer. Otherwise, that
reference prevents the global replication buffer from being trimmed
properly.

---------

Co-authored-by: oranagra <oran@redislabs.com>
2024-06-26 08:26:23 +08:00
Valentino Geron 50569a906c
Free current client asynchronously after user permissions changes (#13274)
The crash happens when the user that triggers the permission changes is
itself affected by them (and should eventually be disconnected).

To handle such a scenario, we should use the
`CLIENT_CLOSE_AFTER_COMMAND` flag.

This commit encapsulates all the places that should be handled in the
same way in `deauthenticateAndCloseClient`.

Also:
* bugfix: during ACL LOAD we ignore clients that are marked as
`CLIENT MASTER`
2024-05-30 22:09:30 +08:00
Moti Cohen 33fc0fbfae
HFE to support AOF and replicas (#13285)
* For the replicas' sake, rewrite the `H*EXPIRE*`, `HSETF`, and `HGETF`
commands to use an absolute Unix time in milliseconds.
* On active expiration of a field, propagate HDEL to the replica
(`propagateHashFieldDeletion()`).
* On lazy expiration, propagate HDEL to the replica (`hashTypeGetValue()`
now calls `hashTypeDelete()`. It also takes care of calling
`propagateHashFieldDeletion()`).
* Fix the `H*EXPIRE*` commands so that when the `LT` flag is given and
the field has no expiration, the condition is considered valid.
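
A minimal sketch of the first and last points, assuming a field without a TTL is treated as expiring at infinity. `NO_EXPIRY`, `to_absolute_ms`, and `lt_condition_holds` are hypothetical names for illustration, not functions from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define NO_EXPIRY UINT64_MAX /* hypothetical marker: the field has no TTL */

/* Converts a relative TTL into an absolute Unix time in milliseconds,
 * the form in which the command is rewritten for replicas and the AOF. */
static uint64_t to_absolute_ms(uint64_t now_ms, uint64_t ttl, int ttl_in_seconds) {
    uint64_t ttl_ms = ttl_in_seconds ? ttl * 1000 : ttl;
    assert(now_ms <= UINT64_MAX - ttl_ms); /* guard against overflow */
    return now_ms + ttl_ms;
}

/* LT: apply the new expiry only if it is earlier than the current one.
 * A field with no expiry counts as "infinitely late", so LT holds. */
static int lt_condition_holds(uint64_t current_expire_ms, uint64_t new_expire_ms) {
    if (current_expire_ms == NO_EXPIRY) return 1;
    return new_expire_ms < current_expire_ms;
}
```

Propagating the absolute time means a replica that applies the command later still ends up with the same expiration instant as the master.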

Note that replicas don’t perform any active expiration and should avoid
lazy expiration. On replicas, `hashTypeGetValue()` doesn't check
expiration (as long as the master didn’t request deletion of the field,
it is valid).

TODO: 
* Attach `dbid` to HASH metadata. See
[here](https://github.com/redis/redis/pull/13209#discussion_r1593385850)

---------

Co-authored-by: debing.sun <debing.sun@redis.com>
2024-05-29 19:47:48 +08:00
Moti Cohen c18ff05665
Hash Field Expiration - Basic support
- Add ebuckets & mstr data structures
- Integrate active & lazy expiration
- Add most of the commands 
- Add support for dict (listpack is missing)
TODOs:  RDB, notification, listpack, HSET, HGETF, defrag, aof
2024-04-18 16:06:30 +03:00
Pieter Cailliau 0b34396924
Change license from BSD-3 to dual RSALv2+SSPLv1 (#13157)
[Read more about the license change
here](https://redis.com/blog/redis-adopts-dual-source-available-licensing/)
Live long and prosper 🖖
2024-03-20 22:38:24 +00:00
Binbin 3c2ea1ea95
Fix watched client test timing issue caused by late close (#13062)
There is a timing issue in the test: the close may arrive late, or
freeClientAsync may free the client asynchronously, which leads to
wrong watching_clients statistics, since we only unwatch all keys when
the client is truly freed in freeClient.

Add a wait here to avoid this problem. Also fixed some outdated
comments I came across. The test was introduced in #12966.
2024-02-20 11:12:19 +02:00
zhaozhao.zz 50d6fe8c4b
Add metrics for WATCH (#12966)
Redis has some special commands that mark the client's state, such as
`subscribe` and `blpop`, which mark the client as `CLIENT_PUBSUB` or
`CLIENT_BLOCKED`, and we have metrics for the special use cases.

However, there are other special commands, like `WATCH`, which do not
have a specific flag but should still be considered as creating stateful
clients. In many scenarios, the connections of stateful clients cannot
be shared through a connection pool. For example, whenever the `WATCH`
command is executed, a new connection is required to put the client into
the "watch state", because the watched keys are stored in the client.

If different business logic requires watching different keys, separate
connections must be used; otherwise, there will be contamination. This
also means that if a user's business heavily relies on the `WATCH`
command, a large number of connections will be required.

Recently we have encountered this situation in our platform, where some
users consume a significant number of connections when using Redis
because of `WATCH`.

I hope we can have a way to observe these special use cases and special
client connections. Here I add a few monitoring metrics:

1. `watching_clients` in `INFO` reply: The number of clients currently
in the "watching" state.
2. `total_watched_keys` in `INFO` reply: The total number of keys being
watched.
3. `watch` in `CLIENT LIST` reply: The number of keys each client is
currently watching.
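
The bookkeeping behind such counters can be sketched as follows. This is an assumption-laden illustration: `client_t`, `stats_t`, `watch_key`, and `unwatch_all` are hypothetical, and the sketch counts each (client, key) pair without deduplicating keys that several clients watch, which may differ from how the server aggregates `total_watched_keys`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-client state: how many keys this client watches. */
typedef struct {
    size_t watched_keys;
} client_t;

/* Hypothetical server-wide counters mirroring the new INFO fields. */
typedef struct {
    size_t watching_clients;
    size_t total_watched_keys;
} stats_t;

/* Called when a client starts watching one more key. */
static void watch_key(stats_t *st, client_t *c) {
    if (c->watched_keys == 0) st->watching_clients++; /* enters watch state */
    c->watched_keys++;
    st->total_watched_keys++;
}

/* Called on UNWATCH, EXEC, or when the client is freed. */
static void unwatch_all(stats_t *st, client_t *c) {
    if (c->watched_keys == 0) return;
    assert(st->watching_clients > 0);
    assert(st->total_watched_keys >= c->watched_keys);
    st->watching_clients--;
    st->total_watched_keys -= c->watched_keys;
    c->watched_keys = 0;
}
```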
2024-02-18 10:36:41 +02:00
Slava Koyfman 24f6d08b3f
Implement `CLIENT KILL MAXAGE <maxage>` (#12299)
Adds the ability to kill clients older than a specified age.

Also fixed the age calculation in `catClientInfoString` to use
`commandTimeSnapshot` instead of the old `server.unixtime`, and added
the missing documentation for `CLIENT KILL ID` to the output of
`CLIENT help`.
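
A minimal sketch of an age filter of this kind, assuming the age is measured in whole seconds against a single timestamp snapshot and that a client whose age equals the limit is selected. `is_older_than` and `count_maxage_matches` are hypothetical helpers, not functions from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 when a client created at created_sec is at least maxage_sec
 * seconds old, measured against one timestamp snapshot. */
static int is_older_than(int64_t snapshot_sec, int64_t created_sec, int64_t maxage_sec) {
    assert(maxage_sec >= 0);
    return (snapshot_sec - created_sec) >= maxage_sec;
}

/* Counts the clients that a MAXAGE-style filter would select. */
static size_t count_maxage_matches(const int64_t *created_sec, size_t n,
                                   int64_t snapshot_sec, int64_t maxage_sec) {
    size_t matches = 0;
    for (size_t i = 0; i < n; i++) {
        if (is_older_than(snapshot_sec, created_sec[i], maxage_sec)) matches++;
    }
    return matches;
}
```

Using one snapshot for the whole pass means every client is compared against the same clock value, so the outcome does not depend on iteration order.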

---------

Co-authored-by: Oran Agra <oran@redislabs.com>
2024-01-30 20:24:36 +02:00
Binbin 4cb5ad85a5
Fix unauthenticated client query buffer 1MB limit (#12989)
Code incorrectly set the limit value to 1024MB.
Introduced in #12961.
2024-01-25 14:56:21 +02:00
zhaozhao.zz 85a834bfa2
Revert multi OOM limit and add multi buffer limit (#12961)
Fix #9926, and introduce an alternative method to prevent abuse of
transactions:

1. revert #5454 (which was blocking read-only transactions in OOM
state), and break the tie of MULTI state memory usage and the server OOM
state. Meaning that we'll limit the total memory a single client can
queue, and do that unconditionally regardless of the server being OOM or
not.
2. to prevent abuse of transactions, we use the
`client-query-buffer-limit` to restrict the size of the transaction.
The commands cached in the MULTI/EXEC queue have not been executed yet,
so they are also considered part of the "query buffer" in a broader
sense. In other words, the commands in the MULTI queue and the
`querybuf` of the client together constitute the "query buffer". When
they exceed the limit, the client is disconnected.

The reasoning is that it's sensible to send a single command with a
huge (1GB) argument, and it's sensible to send a transaction with many
small commands, but it's probably not common to send a long transaction
with many huge arguments (which would consume a lot of memory before
even being executed).

If anyone runs into that, they can simply increase the
`client-query-buffer-limit` config.

P.S. To prevent DDoS attacks, unauthenticated clients have a separate
hard limit: their query buffer must not exceed 1MB. In other words, if
the query buffer of an unauthenticated client exceeds 1MB or the
`client-query-buffer-limit` (if it is set to a value smaller than 1MB),
the client is disconnected.
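
The combined check described above can be sketched like this. The struct and function names are hypothetical; only the 1MB figure and the rule "querybuf plus MULTI queue versus `client-query-buffer-limit`" come from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UNAUTH_QUERYBUF_LIMIT ((size_t)1024 * 1024) /* 1MB hard limit */

/* Hypothetical, simplified view of the per-client input state. */
typedef struct {
    size_t querybuf_len;       /* bytes pending in the query buffer */
    size_t multi_queued_bytes; /* bytes held by queued MULTI commands */
    int authenticated;
} client_t;

/* Returns 1 if the client should be disconnected for exceeding the limit.
 * configured_limit stands in for the client-query-buffer-limit config. */
static int exceeds_query_buffer_limit(const client_t *c, size_t configured_limit) {
    size_t limit = configured_limit;
    /* Unauthenticated clients get the smaller of 1MB and the config. */
    if (!c->authenticated && limit > UNAUTH_QUERYBUF_LIMIT) {
        limit = UNAUTH_QUERYBUF_LIMIT;
    }
    assert(c->querybuf_len <= SIZE_MAX - c->multi_queued_bytes);
    return c->querybuf_len + c->multi_queued_bytes > limit;
}
```

Because the check no longer depends on the server being in an OOM state, it behaves the same for read-only and write transactions.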
2024-01-25 11:17:39 +02:00
debing.sun d0640029dc
Fix race condition issues between the main thread and module threads (#12817)
Fix #12785 and other race condition issues.
See the following isolated comments.

The following report was obtained using SANITIZER thread.
```sh
make SANITIZER=thread
./runtest-moduleapi --config io-threads 4 --config io-threads-do-reads yes --accurate
```

1. Fixed thread-safe issue in RM_UnblockClient()
Related discussion:
https://github.com/redis/redis/pull/12817#issuecomment-1831181220
* When blocking a client in a module using `RM_BlockClientOnKeys()` or
`RM_BlockClientOnKeysWithFlags()`
with a timeout_callback, calling RM_UnblockClient() in module threads
can lead to race conditions
     in `updateStatsOnUnblock()`.

     - Introduced: 
        Version: 6.2
        PR: #7491

     - Touch:
`server.stat_numcommands`, `cmd->latency_histogram`, `server.slowlog`,
and `server.latency_events`
     
     - Harm Level: High
Potentially corrupts the memory data of `cmd->latency_histogram`,
`server.slowlog`, and `server.latency_events`

     - Solution:
Differentiate whether the call to moduleBlockedClientTimedOut() comes
from the module or the main thread.
Since we can't know if RM_UnblockClient() comes from module threads, we
always assume it does and
let `updateStatsOnUnblock()` asynchronously update the unblock status.
     
* When an error reply is issued in timeout_callback(), ctx is not
thread-safe, which eventually leads to race conditions in `afterErrorReply`.

     - Introduced: 
        Version: 6.2
        PR: #8217

     - Touch
       `server.stat_total_error_replies`, `server.errors`, 

     - Harm Level: High
       Potentially corrupts the memory data of `server.errors`
   
      - Solution: 
Make the ctx in `timeout_callback()` with `REDISMODULE_CTX_THREAD_SAFE`,
and asynchronously reply errors to the client.

2. Made RM_Reply*() family API thread-safe
Related discussion:
https://github.com/redis/redis/pull/12817#discussion_r1408707239
Call chain: `RM_Reply*()` -> `_addReplyToBufferOrList()` -> touch
server.current_client

    - Introduced: 
       Version: 7.2.0
       PR: #12326

   - Harm Level: None
Since the module fake client won't have the `CLIENT_PUSHING` flag, even
if we touch server.current_client,
     we can still exit after `c->flags & CLIENT_PUSHING`.

   - Solution
      Checking `c->flags & CLIENT_PUSHING` earlier.

3. Made freeClient() thread-safe
    Fix #12785

    - Introduced: 
       Version: 4.0
Commit:
3fcf959e60

    - Harm Level: Moderate
       * Trigger assertion
It happens when the module thread calls freeClient while the io-thread
is in progress, which just triggers an assertion and doesn't cause any
race conditions.

* Touch `server.current_client`, `server.stat_clients_type_memory`, and
`clientMemUsageBucket->clients`.
It happens between the main thread and the module threads and may cause
data corruption.
1. Erroneously resets `server.current_client` to NULL, but theoretically
this won't happen, because the module has already reset
`server.current_client` to its old value before entering freeClient.
2. Corrupts `clientMemUsageBucket->clients` in
updateClientMemUsageAndBucket().
3. Causes the `server.stat_clients_type_memory` memory statistics to be
inaccurate.
    
    - Solution:
* No longer counts memory usage on fake clients, to avoid updating
`server.stat_clients_type_memory` in freeClient.
* No longer resetting `server.current_client` in unlinkClient, because
the fake client won't be evicted or disconnected in the mid of the
process.
* Assert `io_threads_op == IO_THREADS_OP_IDLE` only if c is not a fake
client.

4. Fixed free client args without GIL
Related discussion:
https://github.com/redis/redis/pull/12817#discussion_r1408706695
When freeing retained strings in the module thread (refcount decr), or
using them in some way (refcount incr), we should do so while holding
the GIL,
otherwise, they might be simultaneously freed while the main thread is
processing the unblock client state.

    - Introduced: 
       Version: 6.2.0
       PR: #8141

   - Harm Level: Low
     Trigger assertion or double free or memory leak. 

   - Solution:
Documenting that module API users need to ensure any access to these
retained strings is done with the GIL locked

5. Fix adding fake client to server.clients_pending_write
    It will incorrectly log the memory usage for the fake client.
Related discussion:
https://github.com/redis/redis/pull/12817#issuecomment-1851899163

    - Introduced: 
       Version: 4.0
Commit:
9b01b64430

    - Harm Level: None
      Only result in NOP

    - Solution:
       * Don't add fake client into server.clients_pending_write
* Add c->conn assertion for updateClientMemUsageAndBucket() and
updateClientMemoryUsage() to avoid same
         issue in the future.
So now it will be the responsibility of the caller of both of them to
avoid passing in fake client.

6. Fix calling RM_BlockedClientMeasureTimeStart() and
RM_BlockedClientMeasureTimeEnd() without GIL
    - Introduced: 
       Version: 6.2
       PR: #7491

   - Harm Level: Low
Causes inaccuracies in command latency histogram and slow logs, but does
not corrupt memory.

   - Solution:
Module API users who know that non-thread-safe APIs will be used from
multiple threads need to take responsibility for protecting them with
their own locks instead of the GIL, as using the GIL is too expensive.

### Other issue
1. RM_Yield is not thread-safe, fixed via #12905.

### Summary
1. Fix thread-safety issues in `RM_UnblockClient()`, `freeClient()` and
`RM_Yield`, potentially preventing memory corruption, data disorder, or
assertion failures.
2. Updated docs and module test to clarify module API users'
responsibility for locking non-thread-safe APIs in multi-threading, such
as RM_BlockedClientMeasureTimeStart/End(), RM_FreeString(),
RM_RetainString(), and RM_HoldString().

### About backport to 7.2
1. The implementation of (1) is not entirely satisfying; more eyes on it
would be welcome.
2. (2) and (3) can be safely backported.
3. (4) and (6) only modify the module tests and update the
documentation, so no backport is needed.
4. (5) is harmless, so no backport is needed.

---------

Co-authored-by: Oran Agra <oran@redislabs.com>
2024-01-19 15:12:49 +02:00
Guillaume Koenig 967fb3c6e8
Extend rax usage by allowing any long long value (#12837)
The raxFind implementation uses a special pointer value (the address of
a static string) as the "not found" value. This works as long as actual
pointers are stored. However, we've seen usages where long long,
non-pointer values are stored. This creates a risk that one of the
long long values happens to equal the address of the special "not found"
value. This commit changes raxFind to return 1 or 0 to indicate
membership, and to take a new void **value parameter that optionally
returns the associated value.

By extension, this also allows the RedisModule_DictSet/Replace operations
to safely insert integers instead of just pointers.
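
The calling convention can be illustrated on a toy container. This sketch does not use the rax API: `entry_t` and `toy_find` are hypothetical, and it assumes the stored integers fit in a pointer-sized slot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical toy map: a flat array of key/value pairs. */
typedef struct {
    const char *key;
    void *value; /* may hold a real pointer or an integer cast to a pointer */
} entry_t;

/* Returns 1 if key is present and 0 otherwise. When found and value is
 * non-NULL, the stored value is written to *value. No stored value can be
 * mistaken for "not found", because membership is reported separately. */
static int toy_find(const entry_t *entries, size_t n, const char *key, void **value) {
    assert(entries != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(entries[i].key, key) == 0) {
            if (value) *value = entries[i].value;
            return 1;
        }
    }
    return 0;
}
```

With a sentinel-based return value, storing an integer that happens to equal the sentinel address would be indistinguishable from a miss; separating the two channels removes that ambiguity.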
2023-12-14 14:50:18 -08:00
Chen Tianjie f9cc25c1dd
Add metric to INFO CLIENTS: pubsub_clients. (#12849)
In the INFO CLIENTS section, we already have blocked_clients and
tracking_clients. We should add a new metric showing the number of
pubsub connections, which helps with performance monitoring and
troubleshooting.
2023-12-13 13:44:13 +08:00
Josh Hershberg d9a0478599 Cluster refactor: Make clusterNode private
Move clusterNode into cluster_legacy.h.
In order to achieve this some accessor methods
were added and also a refactor of how debugCommand
handles cluster related subcommands.

Signed-off-by: Josh Hershberg <yehoshua@redis.com>
2023-11-22 05:50:46 +02:00
Chen Tianjie e9f312e087
Change stat_client_qbuf_limit_disconnections to atomic. (#12711)
In #12476 server.stat_client_qbuf_limit_disconnections was added. It is written in readQueryFromClient, which may be called by multiple threads when io-threads and io-threads-do-reads are turned on. We neglected to make it an atomic variable at the time.
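
A minimal sketch of a counter that is safe to increment from several threads. It uses C11 `<stdatomic.h>` as a stand-in technique rather than the server's own atomic helpers, and `qbuf_limit_disconnections`, `record_disconnection`, and `read_disconnections` are hypothetical names.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the stat; static storage is zero-initialized. */
static atomic_llong qbuf_limit_disconnections;

/* Safe to call from any thread that hits the query buffer limit. */
static void record_disconnection(void) {
    atomic_fetch_add_explicit(&qbuf_limit_disconnections, 1, memory_order_relaxed);
}

/* Reads the current value, e.g. when building an INFO-style report. */
static long long read_disconnections(void) {
    return atomic_load_explicit(&qbuf_limit_disconnections, memory_order_relaxed);
}
```

Relaxed ordering is enough in this sketch because the counter is a pure statistic and no other data is published through it.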
2023-11-01 10:57:24 +08:00
Viktor Söderqvist f924bebd83
Rewrite huge printf calls to smaller ones for readability (#12257)
In a long printf call with many placeholders, it's hard to see which argument
belongs to which placeholder.

The long printf-like calls in the INFO and CLIENT commands are rewritten into
pairs of (format, argument). These pairs are then rewritten to a single call with
a long format string and a long list of arguments, using a macro called FMTARGS.

The file `fmtargs.h` is added to the repo.
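
The pairing idea can be sketched with a fixed-arity macro. `FMT_PAIRS3` and `format_client_info` are hypothetical and far simpler than the macro in `fmtargs.h`; the sketch only shows how (format, argument) pairs can expand into one format string followed by the arguments.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical fixed-arity variant for exactly three pairs. Adjacent
 * string literals are concatenated by the compiler, so the expansion is
 * one format string followed by the three arguments. */
#define FMT_PAIRS3(f1, a1, f2, a2, f3, a3) f1 f2 f3, a1, a2, a3

/* Each placeholder sits on the same line as the argument that fills it. */
static int format_client_info(char *out, size_t outlen,
                              long id, const char *name, int age) {
    return snprintf(out, outlen, FMT_PAIRS3(
        "id=%ld", id,
        " name=%s", name,
        " age=%d", age));
}
```

For example, `format_client_info(buf, sizeof(buf), 7, "cli", 12)` writes `id=7 name=cli age=12` into `buf`.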

Co-authored-by: Madelyn Olson <34459052+madolson@users.noreply.github.com>
2023-09-28 09:21:23 +03:00
Chen Tianjie e3d4b30d09
Add two stats to count client input and output buffer oom. (#12476)
Add these INFO metrics:
* client_query_buffer_limit_disconnections
* client_output_buffer_limit_disconnections

Sometimes it is useful to monitor whether clients reach the size limits
of the query buffer and output buffer, to decide whether we need to
adjust the buffer size limits or reduce the client query payload.
2023-08-30 21:51:14 +03:00
zhaozhao.zz 01eb939a06
update monitor client's memory and evict correctly (#12420)
A bug introduced in #11657 (7.2 RC1), causes client-eviction (#8687)
and INFO to have inaccurate memory usage metrics of MONITOR clients.

Because the type in `c->type` and the type in `getClientType()` are easily confused
(in the latter, `CLIENT_TYPE_NORMAL`, not `CLIENT_TYPE_SLAVE`), the comment
we wrote in `updateClientMemUsageAndBucket` was wrong, and in fact that function
didn't skip monitor clients.
And since it doesn't skip monitor clients, it was wrong to delete the call for it from
`replicationFeedMonitors` (it wasn't a NOP).
That deletion could mean that the monitor client memory usage is not always up to
date (updated less frequently, but still a candidate for client eviction).
2023-07-25 16:10:38 +03:00
Oran Agra 8ad8f0f9d8
Fix broken protocol when PUBLISH emits local push inside MULTI (#12326)
When a connection that's subscribed to a channel emits PUBLISH inside MULTI-EXEC,
the push notification messes up the EXEC response.

e.g. MULTI, PING, PUSH foo bar, PING, EXEC
the EXEC's response will contain: PONG, {message foo bar}, 1. and the second PONG
will be delivered outside the EXEC's response.

Additionally, this PR changes the order of responses in case of a plain PUBLISH (when
the current client also subscribed to it), by delivering the push after the command's
response instead of before it.
This also affects modules calling RM_PublishMessage in a similar way, so that we don't
run the risk of getting that push mixed together with the module command's response.
2023-06-20 20:41:41 +03:00
Binbin b510624978
Optimize PSUBSCRIBE and PUNSUBSCRIBE from O(N*M) to O(N) (#12298)
In the original implementation, the time complexity of the commands
is actually O(N*M), where N is the number of patterns the client is
already subscribed to and M is the number of patterns to subscribe to.
The docs are all wrong about this.

Specifically, the original client->pubsub_patterns is a list, so we
need to call listSearchKey, which is O(N). In this PR, we change it
to a dict, so the search becomes O(1).

At the same time, both pubsub_channels and pubsubshard_channels are dicts.
Changing pubsub_patterns to a dictionary improves the readability and
maintainability of the code.
2023-06-19 16:31:18 +03:00
Oran Agra f228ec1ea5
flushSlavesOutputBuffers should not write to replicas scheduled to drop (#12242)
Writing to such a replica would increase the size of an already large
COB (one that has already passed the threshold for disconnection).

This could also mean that we'll attempt to write that data to the socket
and the replica will manage to read it, which will result in an
undesired partial sync (undesired for the test).
2023-06-12 14:05:34 +03:00
zhenwei pi cb78acb865
Support maxiov per connection type (#12234)
Rather than a fixed iovcnt for connWritev, support maxiov per connection type instead.
A minor change to reduce memory for struct connection.

Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
2023-05-28 08:35:27 +03:00