Commit Graph

12574 Commits

Author SHA1 Message Date
antirez e3243819ef Don't mess with node attributes without protection.
The background VSIMs use the node attributes (via the callback)
so we can't modify them without waiting for the background
operations to terminate.
2025-03-26 23:36:14 +01:00
antirez a6c8a15cad VADD: fix leak on thread creation failure. 2025-03-26 22:50:47 +01:00
antirez 3e2649f1f1 hnsw_insert() should never fail in practice.
We pass our aborting allocation function to the HNSW lib, so the
only other reason for it to fail is a pthread mutex lock failure.
This is also practically impossible AFAIK on modern systems, and if
it happens anyway (because of a kernel resources shortage), aborting
is the best thing to do: otherwise we would have to return that we
could not complete the operation for some reason, which is not
uniform with everything else Redis does. In Redis, under normal
conditions, writes must succeed if they are semantically correct, or
the server crashes on OOM.
2025-03-26 22:46:00 +01:00
Ozan Tezcan a0da8390a2
Fix use-after-free when diskless load config is not swapdb (#13887)
When the diskless load configuration is set to on-empty-db, we retain a
pointer to the function library context. When emptyData() is called, it
frees this function library context pointer, leading to a use-after-free
situation.

I refactored code to ensure that emptyData() is called first, followed
by retrieving the valid pointer to the function library context.
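
The ordering issue described above can be sketched as follows. This is a minimal illustration only: `lib_ctx`, `get_lib_ctx()`, `empty_data()` and `prepare_load()` are hypothetical stand-ins, not the Redis types or functions, and the sketch assumes that emptying the data replaces the context with a fresh one.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins; these are not the Redis types or functions. */
typedef struct lib_ctx { int generation; } lib_ctx;

static lib_ctx *current_ctx = NULL;

lib_ctx *get_lib_ctx(void) { return current_ctx; }

/* Frees the current context and installs a fresh one (an assumption of
 * this sketch), so any pointer fetched before this call is dangling. */
void empty_data(void) {
    int next = current_ctx ? current_ctx->generation + 1 : 0;
    free(current_ctx);
    current_ctx = malloc(sizeof(*current_ctx));
    if (current_ctx == NULL) abort();
    current_ctx->generation = next;
}

/* Safe ordering: empty the data first, then fetch the pointer. */
lib_ctx *prepare_load(void) {
    empty_data();
    lib_ctx *ctx = get_lib_ctx();
    assert(ctx != NULL);
    return ctx;
}
```

Fetching the pointer before calling `empty_data()` and using it afterwards would be the use-after-free pattern the commit describes.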

Refactored code should not introduce any runtime implications.

Bug introduced by https://github.com/redis/redis/pull/13495 (Redis 8.0)

Co-authored-by: Oran Agra <oran@redislabs.com>
2025-03-26 21:50:10 +03:00
antirez 8dfc501fb8 VSIM: fix double free if thread creation fails. 2025-03-26 19:43:59 +01:00
antirez 9d4325ee25 VSIM NOTHREAD, mainly for testing goals. 2025-03-26 16:52:28 +01:00
antirez 707c132392 Count threaded exec time in stats. 2025-03-26 16:48:02 +01:00
antirez 08e3f958fa README: remove no longer valid RP issue.
now the projection matrix is deterministic.
2025-03-26 11:33:32 +01:00
antirez 23b3e21817 README: suggest using FP32 vs VALUES. 2025-03-26 11:28:05 +01:00
Cong Chen 981aa5c12f
Fix timing issue in HEXPIREAT test (#13873)
This fixes an error that occurs in the job
[test-valgrind-no-malloc-usable-size-test](https://github.com/redis/redis/actions/runs/13912357739/job/38929051397)
of the Daily workflow:

```
*** [err]: HEXPIREAT - Set time and then get TTL (listpackex) in tests/unit/type/hash-field-expire.tcl
Expected '999' to be between to '1000' and '2000' (context: type eval line 6 cmd {assert_range [r hpttl myhash FIELDS 1 field1] 1000 2000} proc ::test)
```
2025-03-26 10:00:38 +08:00
antirez 16e3c5a8f9 Locks error checking improved. 2025-03-24 19:10:28 +01:00
antirez adfd2dc7c0 Remove useless OOM checks, but handle mutex creation failure. 2025-03-24 12:54:41 +01:00
antirez 8bf9b8abc1 Use Hadamard-based projection.
Works better and being deterministic (only relative to the projection
size) the replicas will have the same matrix automatically.
2025-03-24 12:48:04 +01:00
Oran Agra 2a189709e0
avoid possible use-after-free with module KSN changes (#13875)
In #13505, we changed the code to use the string value of the key rather
than the integer value on the stack, but we have a test in
unit/moduleapi/keyspace_events that uses a keyspace notification hook to
modify the value with RM_StringDMA, which can cause this value to be
released before it is used. The reason it didn't happen so far is that we
were using shared integers, so releasing the object doesn't free it.
2025-03-24 12:24:52 +02:00
antirez 958ebee091 README: specify how to add REDUCE in VADD. 2025-03-24 09:55:45 +01:00
Yuan Wang 319bbcc1a7
Fix sdscatprintf error in the output of `info stats` (#13871)
CI failed: https://github.com/redis/redis/actions/runs/13981749993/job/39148249096,
since i don't reassign `info` after `sdscatprintf(info, xxx)`
Thanks to @sundb for spotting this
introduced in https://github.com/redis/redis/pull/13846
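
The reassignment requirement can be sketched with a small stand-in. `str_catprintf()` below is a hypothetical helper that only mimics the calling convention of `sdscatprintf()` (append, possibly relocate, return the new pointer); it is not the sds implementation, and the field written in `build_stats()` is used purely for illustration.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for sdscatprintf(): appends formatted text to a
 * heap string and returns the (possibly relocated) result. */
char *str_catprintf(char *s, const char *fmt, ...) {
    va_list ap;
    va_start(ap, fmt);
    int extra = vsnprintf(NULL, 0, fmt, ap); /* bytes needed, without NUL */
    va_end(ap);
    assert(extra >= 0);

    size_t oldlen = s ? strlen(s) : 0;
    char *grown = realloc(s, oldlen + (size_t)extra + 1);
    if (grown == NULL) abort();

    va_start(ap, fmt);
    vsnprintf(grown + oldlen, (size_t)extra + 1, fmt, ap);
    va_end(ap);
    return grown; /* the old pointer `s` must not be used any more */
}

/* Builds a small stats string; note the reassignment on every call. */
char *build_stats(long ops) {
    char *info = str_catprintf(NULL, "# Stats\r\n");
    info = str_catprintf(info, "cluster_incompatible_ops:%ld\r\n", ops);
    return info;
}
```

Dropping the `info =` on the second call would leave `info` pointing at memory that `realloc()` may have moved or freed.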
2025-03-24 09:17:58 +08:00
debing.sun 87b7c3ac1a
Fix rax node defragmentation being skipped (#13847)
First, when we do `raxSeek()` and then call raxNext, we will get the
`RAX_ITER_JUST_SEEKED` flag and return success directly.
We always set the node defrag callback after `raxSeek()`, which means
that when we break from defragmentation, the first node that comes in
again will never be defragged.

In this PR, we save the last as the next node to be processed, not the
last node to be completed.
This way we defrag the next node when we exit to avoid it being skipped
on the next resume.

---------

Co-authored-by: oranagra <oran@redislabs.com>
2025-03-24 08:57:08 +08:00
antirez 8007ccd51b Use RESP3-friendly bool replies. 2025-03-23 20:14:40 +01:00
antirez 9cc750fd66 Test: projection regression test fixed. 2025-03-23 15:04:58 +01:00
antirez aa92b37589 VINFO: use a single field for random projection info. 2025-03-23 14:49:52 +01:00
antirez 8f479b22b9 Tests: replication test. 2025-03-23 14:45:34 +01:00
Salvatore Sanfilippo 854c7fdddb
Merge pull request #6 from rowantrollope/main
Fix possible crash with random projection
2025-03-23 14:44:53 +01:00
Rowan Trollope 31bc07955c Fix possible crash with random projection 2025-03-22 09:11:20 -07:00
antirez f330d6175a Clarify HNSW_MAX_THREADS vs one thread per request. 2025-03-20 15:42:11 +01:00
Benson-li 427c36888e
Fix potential infinite loop of RANDOMKEY during client pause (#13863)
This fixes the bug reported in
[#13862](https://github.com/redis/redis/issues/13862).

---------

Signed-off-by: li-benson <1260437731@qq.com>
Signed-off-by: youngmore1024 <youngmore1024@outlook.com>
Co-authored-by: youngmore1024 <youngmore1024@outlook.com>
2025-03-20 21:32:12 +08:00
debing.sun cb02bd190b
Fix timing issue in module defrag test (#13870)
After #13840, the data we populate becomes more complex and slower, we
always wait for a defragmentation cycle to end before verifying that the
test is okay.
However, in some slow environments, an entire defragmentation cycle can
exceed 5 seconds, and in my local test using 'taskset -c 0' it can reach
6 seconds, so increase the threshold to avoid test failures.
2025-03-20 21:22:47 +08:00
Yuan Wang 951ec79654
Cluster compatibility check (#13846)
### Background
A program may run normally in standalone mode yet fail after migrating
to cluster mode, because some cross-slot commands cannot run in cluster
mode. We should provide a way to detect this issue while running in
standalone mode, and expose a metric that indicates the usage of
incompatible commands.

### Solution
To avoid a performance impact, we introduce a new config
`cluster-compatibility-sample-ratio`, which defines the sampling ratio
(0-100) for checking command compatibility with cluster mode. When a
command is executed, it is sampled at the specified ratio to determine
whether it complies with Redis cluster constraints, such as cross-slot
restrictions.
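
As a rough sketch of percentage-based sampling (not the code from this PR), the decision could look like the function below. `should_sample()` is a hypothetical name, and `rand()` is used only to keep the example self-contained.

```c
#include <assert.h>
#include <stdlib.h>

/* Returns 1 if this command should be checked, given a sampling
 * ratio in percent (0 = never, 100 = always). Hypothetical helper. */
int should_sample(int ratio_percent) {
    assert(ratio_percent >= 0 && ratio_percent <= 100);
    if (ratio_percent == 0) return 0;   /* checking disabled */
    if (ratio_percent == 100) return 1; /* check every command */
    return (rand() % 100) < ratio_percent;
}
```

The two early returns keep the fully disabled and fully enabled cases free of any random number generation.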

A new metric is exposed: `cluster_incompatible_ops` in `info stats`
output.

The following operations are considered incompatible:

- Cross-slot commands
  If the keys of a command map to more than one slot, it is incompatible.
- `swap, copy, move, select` commands
  These commands involve multiple databases in some cases. Cluster mode
  does not allow multiple DBs, so they are not compatible.
- Module commands with the `no-cluster` flag
  If a module command has the `no-cluster` flag, loading the module fails
  with an error when cluster is enabled, so this is incompatible.
- Scripts/functions with the `no-cluster` flag
  Similar to module commands: if `no-cluster` is declared in the shebang
  of a script/function, it cannot run in cluster mode.
- `sort` command with a by/get pattern
  When the `sort` command has a `by/get` pattern option, the slot of the
  pattern must equal the slot of the keys, otherwise it is incompatible
  in cluster mode.
- Script/function commands where the accessed keys and the declared keys
  have different slots
  For script/function commands, we check not only the slots of the
  declared keys but also the slots of the keys actually accessed; if they
  differ, the command is considered incompatible.
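
The cross-slot rule in the list above can be illustrated with a self-contained sketch. It follows the key-to-slot mapping documented in the Redis cluster specification (CRC16/XMODEM modulo 16384, with `{hash tag}` support); the function names `key_hash_slot()` and `is_cross_slot()` are hypothetical and this is not the code added by the PR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CLUSTER_SLOTS 16384

/* CRC16-CCITT (XMODEM): polynomial 0x1021, initial value 0. */
uint16_t crc16(const char *buf, size_t len) {
    uint16_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((unsigned char)buf[i] << 8);
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}

/* Hash slot of a key, honoring {hash tags}: if the key contains a
 * non-empty {...} section, only that section is hashed. */
unsigned key_hash_slot(const char *key, size_t len) {
    size_t s, e;
    for (s = 0; s < len; s++)
        if (key[s] == '{') break;
    if (s == len) return crc16(key, len) % CLUSTER_SLOTS;
    for (e = s + 1; e < len; e++)
        if (key[e] == '}') break;
    if (e == len || e == s + 1) return crc16(key, len) % CLUSTER_SLOTS;
    return crc16(key + s + 1, e - s - 1) % CLUSTER_SLOTS;
}

/* Returns 1 if the keys map to more than one slot, 0 otherwise. */
int is_cross_slot(const char **keys, size_t nkeys) {
    assert(keys != NULL);
    for (size_t i = 1; i < nkeys; i++)
        if (key_hash_slot(keys[i], strlen(keys[i])) !=
            key_hash_slot(keys[0], strlen(keys[0])))
            return 1;
    return 0;
}
```

With this mapping, keys that share a hash tag such as `{user1000}.following` and `{user1000}.followers` land in the same slot, while unrelated keys such as `foo` and `bar` do not.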

**Besides**, commands like `keys, scan, flushall, script/function
flush`, that in standalone mode iterate over all data to perform the
operation, are only valid for the server that executes the command in
cluster mode and are not broadcasted. However, this does not lead to
errors, so we do not consider them as incompatible commands.

### Performance impact test
**cross slot test**
Below are the test commands and results. When using MSET with 8 keys,
performance drops by approximately 3%.

**single key test**
Single-key commands may see a 1-2% performance drop, possibly due to
the overhead of the sampling function.
2025-03-20 10:35:53 +08:00
Filipe Oliveira (Redis) 3e012c9260
Fix string2d usage in case of hexadecimal strings parsing and overflow (#13845)
Since https://github.com/redis/redis/pull/11884, hexadecimal strings
that were accepted as valid input before 8.0 return an error. This PR
addresses it. To avoid performance penalties, it hints to the compiler
that the fallbacks are not likely to happen.
Furthermore, we were ignoring `std::errc::result_out_of_range` results
from fast_float. This PR addresses it as well and includes tests for both
identified scenarios.
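
The two scenarios (hexadecimal input and overflow) can be sketched with the C standard library alone. This is not the fast_float-based code from the PR: `parse_double()` is a hypothetical helper that relies on `strtod()`, which accepts hexadecimal floating-point input and reports out-of-range results through `errno`.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <math.h>
#include <stdlib.h>

/* Parses a NUL-terminated string into a double. Returns 1 on success,
 * 0 on empty input, leading whitespace, trailing garbage, overflow,
 * or NaN. Hypothetical helper, not the Redis string2d(). */
int parse_double(const char *s, double *out) {
    assert(s != NULL && out != NULL);
    if (s[0] == '\0' || isspace((unsigned char)s[0])) return 0;
    char *end;
    errno = 0;
    double v = strtod(s, &end);
    if (end == s || *end != '\0') return 0; /* not fully consumed */
    if (errno == ERANGE && (v == HUGE_VAL || v == -HUGE_VAL))
        return 0;                           /* overflow */
    if (v != v) return 0;                   /* NaN */
    *out = v;
    return 1;
}
```

For example, `"0x10"` parses to 16.0, while `"1e999"` is rejected because the result is out of range.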

---------

Co-authored-by: debing.sun <debing.sun@redis.com>
2025-03-19 20:08:45 +08:00
antirez 758e963a4e VRANDMEMBER documentation. 2025-03-19 09:02:15 +01:00
debing.sun 26dcec4812
Fix messed-up unblocked clients in flush command (#13865)
Fix https://github.com/redis/redis/pull/13853#pullrequestreview-2675227138

This PR ensures that the client's current command is not reset by
`unblockClient()` when it still needs to be handled after `unblockClient()`.
The FLUSH command still requires reprocessing (update the replication
offset) after unblockClient(). Therefore, we mark such blocked clients
with the CLIENT_PENDING_COMMAND flag to prevent the command from being
reset during unblockClient().
2025-03-19 10:22:47 +08:00
antirez 3424757f4d Test: added another threading stress test.
This access pattern triggered the VADD CAS bug
fixed in 70ffa8c.
2025-03-18 23:18:26 +01:00
antirez 70ffa8ce5c Fix VADD_CASReply() NULL reference on ID mismatch.
This bug was fixed thanks to the kind help of Dvir Dukhan
(@DvirDukhan), who found it and provided useful context.
2025-03-18 21:37:06 +01:00
antirez 99176b3e04 Test: VRANDMEMBER test added. 2025-03-18 16:49:27 +01:00
antirez 22ce9f3fad VRANDMEMBER command implemented. 2025-03-17 23:52:15 +01:00
debing.sun a5a3afd923
Fix crash during SLAVEOF when clients are blocked on lazyfree (#13853)
After https://github.com/redis/redis/pull/13167, when a client calls
the `FLUSHDB` command, we still empty the database asynchronously, and
the client is blocked until the lazyfree completes.

1) If another client calls `SLAVEOF` command during this time, the
server will unblock all blocked clients, including those blocked by the
lazyfree. However, when unblocking a lazyfree blocked client, we forgot
to call `updateStatsOnUnblock()`, which ultimately triggered the
following assertion.

2) If a client blocked by Lazyfree is unblocked midway, and at this
point the `bio_comp_list` has already received the completion
notification for the bio, we might end up processing a client that has
already been unblocked in `flushallSyncBgDone()`. Therefore, we need to
filter it out.

---------

Co-authored-by: oranagra <oran@redislabs.com>
2025-03-17 20:27:05 +08:00
kei-nan 752576ce47
Use Search v7.99.5 (#13859)
2025-03-16 10:00:51 +02:00
antirez 706721f8c8 HNSW: random node. 2025-03-16 00:08:43 +01:00
antirez 8a5cf17cb2 HNSW: cursor fixes and thread safety. 2025-03-15 23:31:24 +01:00
antirez a363e5fe6d README: memory usage section. 2025-03-15 23:16:28 +01:00
antirez 6e434bcaaf HNSW: use node max link property.
This is both more correct in formal terms, and in practical
terms as well, as we could over-allocate nodes sometimes.
2025-03-15 10:30:14 +01:00
antirez 68d3067125 w2v test: fix recall EF usage. 2025-03-15 10:24:20 +01:00
antirez d94058fad9 w2v test: recall histograms + configurable M. 2025-03-15 09:46:42 +01:00
antirez c1c7eeaa69 Document VADD M parameter. 2025-03-15 09:28:55 +01:00
antirez 542736ce25 w2v test: proper recall test added. 2025-03-15 00:24:10 +01:00
antirez 13a0a63bef Copyright Sanfilippo -> Redis Ltd. 2025-03-14 23:06:22 +01:00
antirez d996eb82ef VADD: make M configurable at creation time. 2025-03-13 16:58:55 +01:00
antirez 4e57d3f76f README: grammar. 2025-03-13 15:56:05 +01:00
antirez 2fcf389f2a README: troubleshooting and understandability. 2025-03-13 13:25:48 +01:00
antirez 9500539c55 HNSW: implement last resort node reallocation. 2025-03-13 11:30:07 +01:00
antirez 095842a748 README: scaling information. 2025-03-12 22:58:33 +01:00