* man-like consistent long formatting
* Uppercases commands, subcommands and options
* Adds 'HELP' to HELP for all
* Lexicographical order
* Uses value notation and other .md likeness
* Moves const char *help to top
* Keeps it under 80 chars
* Misc help typos, consistent conjuctioning (i.e return and not returns)
* Uses addReplySubcommandSyntaxError(c) all over
Signed-off-by: Itamar Haber <itamar@redislabs.com>
In the distant history there was only the read flag for commands, and whatever
command that didn't have the read flag was a write one.
Then we added the write flag, but some portions of the code still used !read
Also some commands that don't work on the keyspace at all, still have the read
flag.
Changes in this commit:
1. remove the read-only flag from TIME, ECHO, ROLE and LASTSAVE
2. EXEC command used to decides if it should propagate a MULTI by looking at
the command flags (!read & !admin).
When i was about to change it to look at the write flag instead, i realized
that this would cause it not to propagate a MULTI for PUBLISH, EVAL, and
SCRIPT, all 3 are not marked as either a read command or a write one (as
they should), but all 3 are calling forceCommandPropagation.
So instead of introducing a new flag to denote a command that "writes" but
not into the keyspace, and still needs propagation, i decided to rely on
the forceCommandPropagation, and just fix the code to propagate MULTI when
needed rather than depending on the command flags at all.
The implication of my change then is that now it won't decide to propagate
MULTI when it sees one of these: SELECT, PING, INFO, COMMAND, TIME and
other commands which are neither read nor write.
3. Changing getNodeByQuery and clusterRedirectBlockedClientIfNeeded in
cluster.c to look at !write rather than read flag.
This should have no implications, since these code paths are only reachable
for commands which access keys, and these are always marked as either read
or write.
This commit improve MULTI propagation tests, for modules and a bunch of
other special cases, all of which used to pass already before that commit.
the only one that test change that uncovered a change of behavior is the
one that DELs a non-existing key, it used to propagate an empty
multi-exec block, and no longer does.
* Allow runtest-moduleapi use a different 'make', for systems where GNU Make is 'gmake'.
* Fix issue with builds on Solaris re-building everything from scratch due to CFLAGS/LDFLAGS not stored.
* Fix compile failure on Solaris due to atomicvar and a bunch of warnings.
* Fix garbled log timestamps on Solaris.
The test creates keys with various encodings, DUMP them, corrupt the payload
and RESTORES it.
It utilizes the recently added use-exit-on-panic config to distinguish between
asserts and segfaults.
If the restore succeeds, it runs random commands on the key to attempt to
trigger a crash.
It runs in two modes, one with deep sanitation enabled and one without.
In the first one we don't expect any assertions or segfaults, in the second one
we expect assertions, but no segfaults.
We also check for leaks and invalid reads using valgrind, and if we find them
we print the commands that lead to that issue.
Changes in the code (other than the test):
- Replace a few NPD (null pointer deference) flows and division by zero with an
assertion, so that it doesn't fail the test. (since we set the server to use
`exit` rather than `abort` on assertion).
- Fix quite a lot of flows in rdb.c that could have lead to memory leaks in
RESTORE command (since it now responds with an error rather than panic)
- Add a DEBUG flag for SET-SKIP-CHECKSUM-VALIDATION so that the test don't need
to bother with faking a valid checksum
- Remove a pile of code in serverLogObjectDebugInfo which is actually unsafe to
run in the crash report (see comments in the code)
- fix a missing boundary check in lzf_decompress
test suite infra improvements:
- be able to run valgrind checks before the process terminates
- rotate log files when restarting servers
Turns out this was broken since version 4.0 when we added sds size
classes.
The cluster code uses sds for the receive buffer, and then casts it to a
struct and accesses a 64 bit variable.
This commit replaces the use of sds with a simple reallocated buffer.
This commit deals with manual failover as well as non-manual failover.
We did tests with manual failover as follows:
1, Setup redis cluster which holds 16 partions, each having only
1 corresponding replica.
2, Write a batch of data to redis cluster and make sure the redis is doing
a active expire in serverCron.
3, Do a manual failover sequentially to each partions with a time interval
of 3 minutes.
4, Collect logs and do some computaiton work.
The result:
case avgTime maxTime minTime
C1 95.8ms 227ms 25ms
C2 47.9ms 96ms 12ms
C3 12.6ms 27ms 7ms
Explanation
case C1: All nodes use the version before optimization
case C2: Masters use the elder version while replicas use the optimized version
case C3: All nodes use the optimized version
failover time: The time between when replica got a `manual failover request` and
when it `won the failover election`.
avgTime: average failover time
maxTime: maximum failover time
minTime: mimimum failover time
ms: millisecond
Co-authored-by: chendq8 <c.d_q@163.com>
Avoid using a static buffer for short key index responses, and make it
caller's responsibility to stack-allocate a result type. Responses that
don't fit are still allocated on the heap.
I suppose that it was overlooked, since till recently none of the blocked commands were readonly.
other changes:
- add test for the above.
- add better support for additional (and deferring) clients for
cluster tests
- improve a test which left the client in MULTI state.
There was a bug. Although cluster replicas would allow read commands,
they would not allow a MULTI-EXEC that's composed solely of read commands.
Adds tests for coverage.
Co-authored-by: Oran Agra <oran@redislabs.com>
Co-authored-by: Eran Liberty <eranl@amazon.com>
After fork, the child process(redis-aof-rewrite) will get the fd opened
by the parent process(redis), when redis killed by kill -9, it will not
graceful exit(call prepareForShutdown()), so redis-aof-rewrite thread may still
alive, the fd(lock) will still be held by redis-aof-rewrite thread, and
redis restart will fail to get lock, means fail to start.
This issue was causing failures in the cluster tests in github actions.
Co-authored-by: Oran Agra <oran@redislabs.com>
The connection API may create an accepted connection object in an error
state, and callers are expected to check it before attempting to use it.
Co-authored-by: mrpre <mrpre@163.com>
Similarly to EXPIREAT with TTL in the past, which implicitly deletes the
key and return success, RESTORE should not store key that are already
expired into the db.
When used together with REPLACE it should emit a DEL to keyspace
notification and replication stream.
`clusterStartHandshake` will start hand handshake
and eventually send CLUSTER MEET message, which is strictly prohibited
in the REDIS CLUSTER SPEC.
Only system administrator can initiate CLUSTER MEET message.
Futher, according to the SPEC, rather than IP/PORT pairs, only nodeid
can be trusted.
We want to send pings and pongs at specific intervals, since our packets
also contain information about the configuration of the cluster and are
used for gossip. However since our cluster bus is used in a mixed way
for data (such as Pub/Sub or modules cluster messages) and metadata,
sometimes a very busy channel may delay the reception of pong packets.
So after discussing it in #7216, this commit introduces a new field that
is not exposed in the cluster, is only an internal information about
the last time we received any data from a given node: we use this field
in order to avoid detecting failures, claiming data reception of new
data from the node is a proof of liveness.
Reloading of the RDB generated by
DEBUG POPULATE 5000000
SAVE
is now 25% faster.
This commit also prepares the ability to have more flexibility when
loading stuff from the RDB, since we no longer use dbAdd() but can
control exactly how things are added in the database.
- the API name was odd, separated to two apis one for LRU and one for LFU
- the LRU idle time was in 1 second resolution, which might be ok for RDB
and RESTORE, but i think modules may need higher resolution
- adding tests for LFU and for handling maxmemory policy mismatch
* Introduce a connection abstraction layer for all socket operations and
integrate it across the code base.
* Provide an optional TLS connections implementation based on OpenSSL.
* Pull a newer version of hiredis with TLS support.
* Tests, redis-cli updates for TLS support.
cluster.c - stack buffer memory alignment
The pointer 'buf' is cast to a more strictly aligned pointer type
evict.c - lazyfree_lazy_eviction, lazyfree_lazy_eviction always called
defrag.c - bug in dead code
server.c - casting was missing parenthesis
rax.c - indentation / newline suggested an 'else if' was intended
RESTORE now supports:
1. Setting LRU/LFU
2. Absolute-time TTL
Other related changes:
1. RDB loading will not override LRU bits when RDB file
does not contain the LRU opcode.
2. RDB loading will not set LRU/LFU bits if the server's
maxmemory-policy does not match.
This commit, in some parts derived from PR #3041 which is no longer
possible to merge (because the user deleted the original branch),
implements the ability of slaves to have a special configuration
preventing that they try to start a failover when the master is failing.
There are multiple reasons for wanting this, and the feautre was
requested in issue #3021 time ago.
The differences between this patch and the original PR are the
following:
1. The flag is saved/loaded on the nodes configuration.
2. The 'myself' node is now flag-aware, the flag is updated as needed
when the configuration is changed via CONFIG SET.
3. The flag name uses NOFAILOVER instead of NO_FAILOVER to be consistent
with existing NOADDR.
4. The redis.conf documentation was rewritten.
Thanks to @deep011 for the original patch.
Add AE_BARRIER to the writable event loop so that slaves requesting
votes can't be served before we re-enter the event loop in the next
iteration, so clusterBeforeSleep() will fsync to disk in time.
Also add the call to explicitly fsync, given that we modified the last
vote epoch variable.
The main change introduced by this commit is pretending that help
arrays are more text than code, thus indenting them at level 0. This
improves readability, and is an old practice when defining arrays of
C strings describing text.
Additionally a few useless return statements are removed, and the HELP
subcommand capitalized when printed to the user.
There was not enough sanity checking in the code loading the slots of
Redis Cluster from the nodes.conf file, this resulted into the
attacker's ability to write data at random addresses in the process
memory, by manipulating the index of the array. The bug seems
exploitable using the following techique: the config file may be altered so
that one of the nodes gets, as node ID (which is the first field inside the
structure) some data that is actually executable: then by writing this
address in selected places, this node ID part can be executed after a
jump. So it is mostly just a matter of effort in order to exploit the
bug. In practice however the issue is not very critical because the
bug requires an unprivileged user to be able to modify the Redis cluster
nodes configuration, and at the same time this should result in some
gain. However Redis normally is unprivileged as well. Yet much better to
have this fixed indeed.
Fix#4278.
This function failed when an internal-only flag was set as an only flag
in a node: the string was trimmed expecting a final comma before
exiting the function, causing a crash. See issue #4142.
Moreover generation of flags representation only needed at DEBUG log
level was always performed: a waste of CPU time. This is fixed as well
by this commit.
1. brpop last key index, thus checking all keys for slots.
2. Memory leak in clusterRedirectBlockedClientIfNeeded.
3. Remove while loop in clusterRedirectBlockedClientIfNeeded.
However we allow for 500 milliseconds of tolerance, in order to
avoid often discarding semantically valid info (the node is up)
because of natural few milliseconds desync among servers even when
NTP is used.
Note that anyway we should ping the node from time to time regardless and
discover if it's actually down from our point of view, since no update
is accepted while we have an active ping on the node.
Related to #3929.
To rely on the fact that nodes in PFAIL state will be shared around by
randomly adding them in the gossip section is a weak assumption,
especially after changes related to sending less ping/pong packets.
We want to always include gossip entries for all the nodes that are in
PFAIL state, so that the PFAIL -> FAIL state promotion can happen much
faster and reliably.
Related to #3929.
The gossip section times are 32 bit, so cannot store the milliseconds
time but just the seconds approximation, which is good enough for our
uses. At the same time however, when comparing the gossip section times
of other nodes with our node's view, we need to convert back to
milliseconds.
Related to #3929. Without this change the patch to reduce the traffic in
the bus message does not work.
Cluster of bigger sizes tend to have a lot of traffic in the cluster bus
just for failure detection: a node will try to get a ping reply from
another node no longer than when the half the node timeout would elapsed,
in order to avoid a false positive.
However this means that if we have N nodes and the node timeout is set
to, for instance M seconds, we'll have to ping N nodes every M/2
seconds. This N*M/2 pings will receive the same number of pongs, so
a total of N*M packets per node. However given that we have a total of N
nodes doing this, the total number of messages will be N*N*M.
In a 100 nodes cluster with a timeout of 60 seconds, this translates
to a total of 100*100*30 packets per second, summing all the packets
exchanged by all the nodes.
This is, as you can guess, a lot... So this patch changes the
implementation in a very simple way in order to trust the reports of
other nodes: if a node A reports a node B as alive at least up to
a given time, we update our view accordingly.
The problem with this approach is that it could result into a subset of
nodes being able to reach a given node X, and preventing others from
detecting that is actually not reachable from the majority of nodes.
So the above algorithm is refined by trusting other nodes only if we do
not have currently a ping pending for the node X, and if there are no
failure reports for that node.
Since each node, anyway, pings 10 other nodes every second (one node
every 100 milliseconds), anyway eventually even trusting the other nodes
reports, we will detect if a given node is down from our POV.
Now to understand the number of packets that the cluster would exchange
for failure detection with the patch, we can start considering the
random PINGs that the cluster sent anyway as base line:
Each node sends 10 packets per second, so the total traffic if no
additioal packets would be sent, including PONG packets, would be:
Total messages per second = N*10*2
However by trusting other nodes gossip sections will not AWALYS prevent
pinging nodes for the "half timeout reached" rule all the times. The
math involved in computing the actual rate as N and M change is quite
complex and depends also on another parameter, which is the number of
entries in the gossip section of PING and PONG packets. However it is
possible to compare what happens in cluster of different sizes
experimentally. After applying this patch a very important reduction in
the number of packets exchanged is trivial to observe, without apparent
impacts on the failure detection performances.
Actual numbers with different cluster sizes should be published in the
Reids Cluster documentation in the future.
Related to #3929.
After investigating issue #3796, it was discovered that MIGRATE
could call migrateCloseSocket() after the original MIGRATE c->argv
was already rewritten as a DEL operation. As a result the host/port
passed to migrateCloseSocket() could be anything, often a NULL pointer
that gets deferenced crashing the server.
Now the socket is closed at an earlier time when there is a socket
error in a later stage where no retry will be performed, before we
rewrite the argument vector. Moreover a check was added so that later,
in the socket_err label, there is no further attempt at closing the
socket if the argument was rewritten.
This fix should resolve the bug reported in #3796.
After the fix for #3673 the ttl var is always initialized inside the
loop itself, so the early initialization is not needed.
Variables declaration also moved to a more local scope.
BACKGROUND AND USE CASEj
Redis slaves are normally write only, however the supprot a "writable"
mode which is very handy when scaling reads on slaves, that actually
need write operations in order to access data. For instance imagine
having slaves replicating certain Sets keys from the master. When
accessing the data on the slave, we want to peform intersections between
such Sets values. However we don't want to intersect each time: to cache
the intersection for some time often is a good idea.
To do so, it is possible to setup a slave as a writable slave, and
perform the intersection on the slave side, perhaps setting a TTL on the
resulting key so that it will expire after some time.
THE BUG
Problem: in order to have a consistent replication, expiring of keys in
Redis replication is up to the master, that synthesize DEL operations to
send in the replication stream. However slaves logically expire keys
by hiding them from read attempts from clients so that if the master did
not promptly sent a DEL, the client still see logically expired keys
as non existing.
Because slaves don't actively expire keys by actually evicting them but
just masking from the POV of read operations, if a key is created in a
writable slave, and an expire is set, the key will be leaked forever:
1. No DEL will be received from the master, which does not know about
such a key at all.
2. No eviction will be performed by the slave, since it needs to disable
eviction because it's up to masters, otherwise consistency of data is
lost.
THE FIX
In order to fix the problem, the slave should be able to tag keys that
were created in the slave side and have an expire set in some way.
My solution involved using an unique additional dictionary created by
the writable slave only if needed. The dictionary is obviously keyed by
the key name that we need to track: all the keys that are set with an
expire directly by a client writing to the slave are tracked.
The value in the dictionary is a bitmap of all the DBs where such a key
name need to be tracked, so that we can use a single dictionary to track
keys in all the DBs used by the slave (actually this limits the solution
to the first 64 DBs, but the default with Redis is to use 16 DBs).
This solution allows to pay both a small complexity and CPU penalty,
which is zero when the feature is not used, actually. The slave-side
eviction is encapsulated in code which is not coupled with the rest of
the Redis core, if not for the hook to track the keys.
TODO
I'm doing the first smoke tests to see if the feature works as expected:
so far so good. Unit tests should be added before merging into the
4.0 branch.
Before, if a previous key had a TTL set but the current one didn't, the
TTL was reused and thus resulted in wrong expirations set.
This behaviour was experienced, when `MigrateDefaultPipeline` in
redis-trib was set to >1
Fixes#3655
Reference issue #3218.
Checking the code I can't find a reason why the original RESTORE
code was so opinionated about restoring only the current version. The
code in to `rdb.c` appears to be capable as always to restore data from
older versions of Redis, and the only places where it is needed the
current version in order to correctly restore data, is while loading the
opcodes, not the values itself as it happens in the case of RESTORE.
For the above reasons, this commit enables RESTORE to accept older
versions of values payloads.
This fixes a bug introduced by d827dbf, and makes the code consistent
with the logic of always allowing, while the cluster is down, commands
that don't target any key.
As a side effect the code is also simpler now.
This fixes issue #3043.
Before this fix, after a complete resharding of a master slots
to other nodes, the master remains empty and the slaves migrate away
to other masters with non-zero nodes. However the old master now empty,
is no longer considered a target for migration, because the system has
no way to tell it had slaves in the past.
This fix leaves the algorithm used in the past untouched, but adds a
new rule. When a new or old master which is empty and without slaves,
are assigend with their first slot, if other masters in the cluster have
slaves, they are automatically considered to be targets for replicas
migration.
CLUSTER SLOTS now includes IDs in the nodes description associated with
a given slot range. Certain client libraries implementations need a way
to reference a node in an unique way, so they were relying on CLUSTER
NODES, that is not a stable API and may change frequently depending on
Redis Cluster future requirements.
The change covers the case where:
1. There is a node we can't reach (in fail or pfail state).
2. We see a different address for this node, in the gossip section sent
to us by a node that, instead, is able to talk with the node we cannot
talk to.
In this case it's a good bet to switch to the address reported by this
node, since there was an address switch and it is able to talk with the
node and we are not.
However previosuly this was done in a dangerous way, by initiating an
handshake. The handshake, using the MEET packet, forces the receiver to
join our cluster, and this is not a good idea. If the node in question
really just switched address, but is the same node, it already knows about
us, so we just need to perform an address update and a reconnection.
So with this commit instead we just update the address of the node,
release the node link if any, and attempt to reconnect in the next
clusterCron() cycle.
The commit also improves debugging messages printed by Cluster during
address or ID switches.
Another leak was fixed in the case of syntax error by restructuring the
allocation strategy for the two dynamic vectors.
We also make sure to always close the cached socket on I/O errors so that
all the I/O errors are handled the same, even if we had a previously
queued error of a different kind from the destination server.
Thanks to Kevin McGehee. Related to issue #3016.
In issue #3016 Kevin McGehee identified multiple very serious issues in
the new implementation of MIGRATE. This commit attempts to restructure
the code in oder to avoid mistakes, an analysis of the new
implementation is in progress in order to check for possible edge cases.
With this commit we preserve the list of nodes that have .slaveof set
to the node, even when the node is turned into a slave, and make sure to
fix the .slaveof pointers to NULL when a node is freed from memory,
regardless of the fact it's a slave or a master.
Basically we try to remember the logical master in the current
configuration even if the logical master advertised it as a slave
already. However we still remember the associations, so that when a node
is freed we can fix them.
This should fix issue #3002.
Sometimes during "fixes" we have to setup a new configuration and assign
slots to nodes. With BUMPEPOCH we can make sure the new configuration of
the node will win if there are conflicting configurations (for example
another node is *also* claiming the same slot because the cluster is
totally messed up).
Extend the MIGRATE extra freedom to be able to be called in the context
of the local slot, anytime there is a slot open in one or the other
direction (importing or migrating). This is useful for redis-trib to fix
the cluster when it has in an odd state.
Thix fix allows "redis-trib fix" to make its work in certain cases where
previously an error was reported.
For non existing keys, we don't want to send -ASK redirections to
MIGRATE, since when moving slots from the migrating node to the
importing node, we want just to ignore keys that are no longer there.
They may be expired or deleted between the GETKEYSINSLOT call and the
MIGRATE call. Otherwise this causes an error during migrations with
redis-trib (or equivalent cluster management tools).
We need to process replies after errors in order to delete keys
successfully transferred. Also argument rewriting was fixed since
it was broken in several ways. Now a fresh argument vector is created
and set if we are acknowledged of at least one key.
We wait a fixed amount of time (5 seconds currently) much greater than
the usual Cluster node to node communication latency, before migrating.
This way when a failover occurs, before detecting the new master as a
target for migration, we give the time to its natural slaves (the slaves
of the failed over master) to announce they switched to the new master,
preventing an useless migration operation.
Some time ago I broken replicas migration (reported in #2924).
The idea was to prevent masters without replicas from getting replicas
because of replica migration, I remember it to create issues with tests,
but there is no clue in the commit message about why it was so
undesirable.
However my patch as a side effect totally ruined the concept of replicas
migration since we want it to work also for instances that, technically,
never had slaves in the past: promoted slaves.
So now instead the ability to be targeted by replicas migration, is a
new flag "migrate-to". It only applies to masters, and is set in the
following two cases:
1. When a master gets a slave, it is set.
2. When a slave turns into a master because of fail over, it is set.
This way replicas migration targets are only masters that used to have
slaves, and slaves of masters (that used to have slaves... obviously)
and are promoted.
The new flag is only internal, and is never exposed in the output nor
persisted in the nodes configuration, since all the information to
handle it are implicit in the cluster configuration we already have.
There was a bug in Redis Cluster caused by clients blocked in a blocking
list pop operation, for keys no longer handled by the instance, or
in a condition where the cluster became down after the client blocked.
A typical situation is:
1) BLPOP <somekey> 0
2) <somekey> hash slot is resharded to another master.
The client will block forever int this case.
A symmentrical non-cluster-specific bug happens when an instance is
turned from master to slave. In that case it is more serious since this
will desynchronize data between slaves and masters. This other bug was
discovered as a side effect of thinking about the bug explained and
fixed in this commit, but will be fixed in a separated commit.
This commit moves the process of generating a new config epoch without
consensus out of the clusterCommand() implementation, in order to make
it reusable for other reasons (current target is to have a CLUSTER
FAILOVER option forcing the failover when no master majority is
reachable).
Moreover the commit moves other functions which are similarly related to
config epochs in a new logical section of the cluster.c file, just for
clarity.
Before we relied on the global cluster state to make sure all the hash
slots are linked to some node, when getNodeByQuery() is called. So
finding the hash slot unbound was checked with an assertion. However
this is fragile. The cluster state is often updated in the
clusterBeforeSleep() function, and not ASAP on state change, so it may
happen to process clients with a cluster state that is 'ok' but yet
certain hash slots set to NULL.
With this commit the condition is also checked in getNodeByQuery() and
reported with a identical error code of -CLUSTERDOWN but slightly
different error message so that we have more debugging clue in the
future.
Root cause of issue #2288.
1. Remove useless "cs" initialization.
2. Add a "select" var to capture a condition checked multiple times.
3. Avoid duplication of the same if (!copy) conditional.
4. Don't increment dirty if copy is given (no deletion is performed),
otherwise we propagate MIGRATE when not needed.
This improves PFAIL -> FAIL switch. Too late at this point in the RC
releases to add proper PFAIL/FAIL separate dictionary to do this in a
less randomized way. Tested in practice with experiments that this
helps. PFAIL -> FAIL average with 20 nodes and node-timeout set to 5
seconds takes 2.5 seconds without this commit, 1 second with this
commit.
Otherwise it is impossible to receive the majority of failure reports in
the node_timeout*2 window in larger clusters.
Still with a 200 nodes cluster, 20 gossip sections are a very reasonable
amount of bytes to send.
A side effect of this change is also fater cluster nodes joins for large
clusters, because the cluster layout makes less time to propagate.
Otherwise we risk sending not initialized data to other nodes, that may
contain anything. This was actually not possible only because the
initialization of the buffer where the cluster packets header is created
was larger than the 3 gossip sections we use, so the memory was already
all filled with zeroes by the memset().
Fixes valgrind error:
48 bytes in 1 blocks are definitely lost in loss record 196 of 373
at 0x4910D3: je_malloc (jemalloc.c:944)
by 0x42807D: zmalloc (zmalloc.c:125)
by 0x41FA0D: dictGetIterator (dict.c:543)
by 0x41FA48: dictGetSafeIterator (dict.c:555)
by 0x459B73: clusterHandleSlaveMigration (cluster.c:2776)
by 0x45BF27: clusterCron (cluster.c:3123)
by 0x423344: serverCron (redis.c:1239)
by 0x41D6CD: aeProcessEvents (ae.c:311)
by 0x41D8EA: aeMain (ae.c:455)
by 0x41A84B: main (redis.c:3832)
If array has N elements, we can't read +1 if we are already at N.
Also, we need to move elements by their storage size in the array,
not just by individual bytes.
[maybe] Fixes valgrind errors:
32 bytes in 4 blocks are definitely lost in loss record 107 of 228
at 0x80EA447: je_malloc (jemalloc.c:944)
by 0x806E59C: zrealloc (zmalloc.c:125)
by 0x80A9AFC: clusterSetMaster (cluster.c:801)
by 0x80AEDC9: clusterCommand (cluster.c:3994)
by 0x80682A5: call (redis.c:2049)
by 0x8068A20: processCommand (redis.c:2309)
by 0x8076497: processInputBuffer (networking.c:1143)
by 0x8073BAF: readQueryFromClient (networking.c:1208)
by 0x8060E98: aeProcessEvents (ae.c:412)
by 0x806123B: aeMain (ae.c:455)
by 0x806C3DB: main (redis.c:3832)
64 bytes in 8 blocks are definitely lost in loss record 143 of 228
at 0x80EA447: je_malloc (jemalloc.c:944)
by 0x806E59C: zrealloc (zmalloc.c:125)
by 0x80AAB40: clusterProcessPacket (cluster.c:801)
by 0x80A847F: clusterReadHandler (cluster.c:1975)
by 0x30000FF: ???
80 bytes in 10 blocks are definitely lost in loss record 148 of 228
at 0x80EA447: je_malloc (jemalloc.c:944)
by 0x806E59C: zrealloc (zmalloc.c:125)
by 0x80AAB40: clusterProcessPacket (cluster.c:801)
by 0x80A847F: clusterReadHandler (cluster.c:1975)
by 0x2FFFFFF: ???
Fixes valgrind error:
Syscall param write(buf) points to uninitialised byte(s)
at 0x514C35D: ??? (syscall-template.S:81)
by 0x456B81: clusterWriteHandler (cluster.c:1907)
by 0x41D596: aeProcessEvents (ae.c:416)
by 0x41D8EA: aeMain (ae.c:455)
by 0x41A84B: main (redis.c:3832)
Address 0x5f268e2 is 2,274 bytes inside a block of size 8,192 alloc'd
at 0x4932D1: je_realloc (jemalloc.c:1297)
by 0x428185: zrealloc (zmalloc.c:162)
by 0x4269E0: sdsMakeRoomFor.part.0 (sds.c:142)
by 0x426CD7: sdscatlen (sds.c:251)
by 0x4579E7: clusterSendMessage (cluster.c:1995)
by 0x45805A: clusterSendPing (cluster.c:2140)
by 0x45BB03: clusterCron (cluster.c:2944)
by 0x423344: serverCron (redis.c:1239)
by 0x41D6CD: aeProcessEvents (ae.c:311)
by 0x41D8EA: aeMain (ae.c:455)
by 0x41A84B: main (redis.c:3832)
Uninitialised value was created by a stack allocation
at 0x457810: nodeUpdateAddressIfNeeded (cluster.c:1236)
read() and write() return ssize_t (signed long), not int.
For other offsets, we can use the unsigned size_t type instead
of a signed offset (since our replication offsets and buffer
positions are never negative).
In order to avoid that misconfigured cluster nodes at some time may
force an IP update on other nodes, it is required that nodes update
their own address only on MEET messages. However it does not make sense
to do this the first time a node is contacted and yet does not have an
IP, we just risk that myself->ip remains not assigned if there are
messages lost or cluster creation procedures that don't make sure
everybody is targeted by at least one incoming MEET message.
Also fix the logging of the IP switch avoiding the :-1 tail.
Also explicitly set version to 0, add a protocol version define, improve
comments in the gossip structure.
Note that the structure layout is the same after the change, we are just
making the padding explicit with an additional not used 16 bits field.
So this commit is still able to talk with the previous versions of
cluster nodes.
Valgrind checks that the buffers we transfer via syscalls are all
composed of bytes actually initialized. This is useful, it makes we able
to avoid leaking informations in non initialized parts fo messages
transferred to other hosts. This commit fixes one of such issues.
Can't be initialized by resetManualFailover() since it's actual state
the function uses, so we need to initialize it at startup time. Not
really a bug in practical terms, but showed up into valgrind and is not
technically correct anyway.