Commit Graph

777 Commits

Author SHA1 Message Date
Suraj Narkhede f85f36f50d Fix following issues in blocking commands:
1. brpop last key index, thus checking all keys for slots.
2. Memory leak in clusterRedirectBlockedClientIfNeeded.
3. Remove while loop in clusterRedirectBlockedClientIfNeeded.
2017-06-23 00:30:21 -07:00
Suraj Narkhede d303bca587 Fix brpop command table entry and redirect blocked clients. 2017-06-22 23:52:00 -07:00
Antonio Mallia 2d1d57eb47 Removed duplicate 'sys/socket.h' include 2017-06-04 15:26:53 +01:00
antirez 271733f4f8 Cluster: discard pong times in the future.
However we allow for 500 milliseconds of tolerance, in order to
avoid often discarding semantically valid info (the node is up)
because of natural few milliseconds desync among servers even when
NTP is used.

Note that anyway we should ping the node from time to time regardless and
discover if it's actually down from our point of view, since no update
is accepted while we have an active ping on the node.

Related to #3929.
2017-04-15 10:12:08 +02:00
antirez 02777bb252 Cluster: always add PFAIL nodes at end of gossip section.
To rely on the fact that nodes in PFAIL state will be shared around by
randomly adding them in the gossip section is a weak assumption,
especially after changes related to sending less ping/pong packets.

We want to always include gossip entries for all the nodes that are in
PFAIL state, so that the PFAIL -> FAIL state promotion can happen much
faster and reliably.

Related to #3929.
2017-04-14 13:39:49 +02:00
antirez 8c829d9e43 Cluster: fix gossip section ping/pong times encoding.
The gossip section times are 32 bit, so cannot store the milliseconds
time but just the seconds approximation, which is good enough for our
uses. At the same time however, when comparing the gossip section times
of other nodes with our node's view, we need to convert back to
milliseconds.

Related to #3929. Without this change the patch to reduce the traffic in
the bus message does not work.
2017-04-14 11:01:22 +02:00
antirez 6878a3fedd Cluster: add clean-logs command to create-cluster script. 2017-04-14 10:52:00 +02:00
antirez 8f7bf2841a Cluster: decrease ping/pong traffic by trusting other nodes reports.
Cluster of bigger sizes tend to have a lot of traffic in the cluster bus
just for failure detection: a node will try to get a ping reply from
another node no longer than when the half the node timeout would elapsed,
in order to avoid a false positive.

However this means that if we have N nodes and the node timeout is set
to, for instance M seconds, we'll have to ping N nodes every M/2
seconds. This N*M/2 pings will receive the same number of pongs, so
a total of N*M packets per node. However given that we have a total of N
nodes doing this, the total number of messages will be N*N*M.

In a 100 nodes cluster with a timeout of 60 seconds, this translates
to a total of 100*100*30 packets per second, summing all the packets
exchanged by all the nodes.

This is, as you can guess, a lot... So this patch changes the
implementation in a very simple way in order to trust the reports of
other nodes: if a node A reports a node B as alive at least up to
a given time, we update our view accordingly.

The problem with this approach is that it could result into a subset of
nodes being able to reach a given node X, and preventing others from
detecting that is actually not reachable from the majority of nodes.
So the above algorithm is refined by trusting other nodes only if we do
not have currently a ping pending for the node X, and if there are no
failure reports for that node.

Since each node, anyway, pings 10 other nodes every second (one node
every 100 milliseconds), anyway eventually even trusting the other nodes
reports, we will detect if a given node is down from our POV.

Now to understand the number of packets that the cluster would exchange
for failure detection with the patch, we can start considering the
random PINGs that the cluster sent anyway as base line:
Each node sends 10 packets per second, so the total traffic if no
additioal packets would be sent, including PONG packets, would be:

    Total messages per second = N*10*2

However by trusting other nodes gossip sections will not AWALYS prevent
pinging nodes for the "half timeout reached" rule all the times. The
math involved in computing the actual rate as N and M change is quite
complex and depends also on another parameter, which is the number of
entries in the gossip section of PING and PONG packets. However it is
possible to compare what happens in cluster of different sizes
experimentally. After applying this patch a very important reduction in
the number of packets exchanged is trivial to observe, without apparent
impacts on the failure detection performances.

Actual numbers with different cluster sizes should be published in the
Reids Cluster documentation in the future.

Related to #3929.
2017-04-14 10:43:53 +02:00
antirez c5d6f577f0 Cluster: collect more specific bus messages stats.
First step in order to change Cluster in order to use less messages.
Related to issue #3929.
2017-04-13 19:22:35 +02:00
antirez 1409c545da Cluster: hash slots tracking using a radix tree. 2017-03-27 16:37:22 +02:00
antirez f917e0da4c Fix MIGRATE closing of cached socket on error.
After investigating issue #3796, it was discovered that MIGRATE
could call migrateCloseSocket() after the original MIGRATE c->argv
was already rewritten as a DEL operation. As a result the host/port
passed to migrateCloseSocket() could be anything, often a NULL pointer
that gets deferenced crashing the server.

Now the socket is closed at an earlier time when there is a socket
error in a later stage where no retry will be performed, before we
rewrite the argument vector. Moreover a check was added so that later,
in the socket_err label, there is no further attempt at closing the
socket if the argument was rewritten.

This fix should resolve the bug reported in #3796.
2017-02-09 09:58:38 +01:00
Salvatore Sanfilippo 6cf1a325d6 Merge pull request #3643 from andyli028/unstable
Modify MIN->MAX
2016-12-19 08:19:10 +01:00
antirez b53e73e159 MIGRATE: Remove upfront ttl initialization.
After the fix for #3673 the ttl var is always initialized inside the
loop itself, so the early initialization is not needed.

Variables declaration also moved to a more local scope.
2016-12-14 12:43:55 +01:00
Salvatore Sanfilippo c9f0456d81 Merge pull request #3673 from badboy/reset-ttl-on-migrating
Reset the ttl for additional keys
2016-12-14 12:41:00 +01:00
antirez 04542cff92 Replication: fix the infamous key leakage of writable slaves + EXPIRE.
BACKGROUND AND USE CASEj

Redis slaves are normally write only, however the supprot a "writable"
mode which is very handy when scaling reads on slaves, that actually
need write operations in order to access data. For instance imagine
having slaves replicating certain Sets keys from the master. When
accessing the data on the slave, we want to peform intersections between
such Sets values. However we don't want to intersect each time: to cache
the intersection for some time often is a good idea.

To do so, it is possible to setup a slave as a writable slave, and
perform the intersection on the slave side, perhaps setting a TTL on the
resulting key so that it will expire after some time.

THE BUG

Problem: in order to have a consistent replication, expiring of keys in
Redis replication is up to the master, that synthesize DEL operations to
send in the replication stream. However slaves logically expire keys
by hiding them from read attempts from clients so that if the master did
not promptly sent a DEL, the client still see logically expired keys
as non existing.

Because slaves don't actively expire keys by actually evicting them but
just masking from the POV of read operations, if a key is created in a
writable slave, and an expire is set, the key will be leaked forever:

1. No DEL will be received from the master, which does not know about
such a key at all.

2. No eviction will be performed by the slave, since it needs to disable
eviction because it's up to masters, otherwise consistency of data is
lost.

THE FIX

In order to fix the problem, the slave should be able to tag keys that
were created in the slave side and have an expire set in some way.

My solution involved using an unique additional dictionary created by
the writable slave only if needed. The dictionary is obviously keyed by
the key name that we need to track: all the keys that are set with an
expire directly by a client writing to the slave are tracked.

The value in the dictionary is a bitmap of all the DBs where such a key
name need to be tracked, so that we can use a single dictionary to track
keys in all the DBs used by the slave (actually this limits the solution
to the first 64 DBs, but the default with Redis is to use 16 DBs).

This solution allows to pay both a small complexity and CPU penalty,
which is zero when the feature is not used, actually. The slave-side
eviction is encapsulated in code which is not coupled with the rest of
the Redis core, if not for the hook to track the keys.

TODO

I'm doing the first smoke tests to see if the feature works as expected:
so far so good. Unit tests should be added before merging into the
4.0 branch.
2016-12-13 10:59:54 +01:00
Jan-Erik Rediger 2a32f0371e Reset the ttl for additional keys
Before, if a previous key had a TTL set but the current one didn't, the
TTL was reused and thus resulted in wrong expirations set.

This behaviour was experienced, when `MigrateDefaultPipeline` in
redis-trib was set to >1

Fixes #3655
2016-12-08 14:27:21 +01:00
andyli 8abf9729f0 Modify MIN->MAX 2016-11-29 16:34:41 +08:00
antirez cfdb3a2214 Cluster: handle zero bytes at the end of nodes.conf. 2016-11-16 14:13:18 +01:00
antirez a3f893b800 RESTORE: accept RDB dumps with older versions.
Reference issue #3218.

Checking the code I can't find a reason why the original RESTORE
code was so opinionated about restoring only the current version. The
code in to `rdb.c` appears to be capable as always to restore data from
older versions of Redis, and the only places where it is needed the
current version in order to correctly restore data, is while loading the
opcodes, not the values itself as it happens in the case of RESTORE.

For the above reasons, this commit enables RESTORE to accept older
versions of values payloads.
2016-06-16 15:53:57 +02:00
antirez 971e3c51b6 Cluster: make getNodeByQuery() responsible of -CLUSTERDOWN errors.
This fixes a bug introduced by d827dbf, and makes the code consistent
with the logic of always allowing, while the cluster is down, commands
that don't target any key.

As a side effect the code is also simpler now.
2016-05-05 11:33:43 +02:00
Salvatore Sanfilippo 330715afd8 Merge pull request #3039 from itamarhaber/patch-3
Fixes a typo in the comments
2016-05-05 10:15:17 +02:00
antirez 4fdde78c72 New masters with slots are now targets of migration if others are.
This fixes issue #3043.

Before this fix, after a complete resharding of a master slots
to other nodes, the master remains empty and the slaves migrate away
to other masters with non-zero nodes. However the old master now empty,
is no longer considered a target for migration, because the system has
no way to tell it had slaves in the past.

This fix leaves the algorithm used in the past untouched, but adds a
new rule. When a new or old master which is empty and without slaves,
are assigend with their first slot, if other masters in the cluster have
slaves, they are automatically considered to be targets for replicas
migration.
2016-05-02 18:37:30 +02:00
antirez b841f3ad1a Cluster: store busport with different separator in CLUSTER NODES.
We need to be able to correctly parse the node address in the case of
IPv6 addresses.
2016-02-02 08:20:04 +01:00
antirez 92b9de2417 Cluster announce: WIP, allow building again. 2016-02-01 18:16:25 +01:00
antirez e27b9b1cec Merge branch 'cluster-docker' into unstable 2016-02-01 18:01:22 +01:00
antirez c285862621 Cluster: include node IDs in SLOTS output.
CLUSTER SLOTS now includes IDs in the nodes description associated with
a given slot range. Certain client libraries implementations need a way
to reference a node in an unique way, so they were relying on CLUSTER
NODES, that is not a stable API and may change frequently depending on
Redis Cluster future requirements.
2016-01-29 12:00:40 +01:00
antirez d0a8512eda Cluster anounce-ip/port WIP. 2016-01-29 09:06:37 +01:00
antirez 4abf486ca3 Cluster announce port: set port/bport for myself at startup. 2016-01-29 09:06:37 +01:00
antirez 1c038379f7 Cluster: persist bus port in nodes.conf. 2016-01-29 09:06:37 +01:00
antirez dc98907e50 Cluster announce ip: take myself->ip always in sync. 2016-01-29 09:06:37 +01:00
antirez 11436b1449 Cluster announce ip / port initial implementation. 2016-01-29 09:06:37 +01:00
Itamar Haber 9e46bf22ed Fixes a typo 2016-01-28 21:47:18 +02:00
antirez 5bbb09ed2c Cluster: check packets length before accessing far fields. 2016-01-27 16:35:21 +01:00
antirez fe44a7cb60 Cluster: mismatch sender ID log put back at DEBUG level. 2016-01-26 14:21:18 +01:00
antirez d6c5922f75 Cluster: fix missing ntohs() call to access gossip section port. 2016-01-26 14:18:13 +01:00
antirez 592419b4ca Better address udpate strategy when processing gossip sections.
The change covers the case where:

1. There is a node we can't reach (in fail or pfail state).
2. We see a different address for this node, in the gossip section sent
to us by a node that, instead, is able to talk with the node we cannot
talk to.

In this case it's a good bet to switch to the address reported by this
node, since there was an address switch and it is able to talk with the
node and we are not.

However previosuly this was done in a dangerous way, by initiating an
handshake. The handshake, using the MEET packet, forces the receiver to
join our cluster, and this is not a good idea. If the node in question
really just switched address, but is the same node, it already knows about
us, so we just need to perform an address update and a reconnection.

So with this commit instead we just update the address of the node,
release the node link if any, and attempt to reconnect in the next
clusterCron() cycle.

The commit also improves debugging messages printed by Cluster during
address or ID switches.
2016-01-26 12:32:53 +01:00
antirez 83b862a30e Minor MIGRATE refactoring.
Centralize cleanup of newargv in a single place.
Add more comments to help a bit following a complex function.

Related to issue #3016.
2016-01-19 09:53:04 +01:00
antirez f5a1e608cc More variadic MIGRATE fixes.
Another leak was fixed in the case of syntax error by restructuring the
allocation strategy for the two dynamic vectors.

We also make sure to always close the cached socket on I/O errors so that
all the I/O errors are handled the same, even if we had a previously
queued error of a different kind from the destination server.

Thanks to Kevin McGehee. Related to issue #3016.
2016-01-19 09:28:43 +01:00
antirez 00d3a40f82 Various fixes to MIGRATE with multiple keys.
In issue #3016 Kevin McGehee identified multiple very serious issues in
the new implementation of MIGRATE. This commit attempts to restructure
the code in oder to avoid mistakes, an analysis of the new
implementation is in progress in order to check for possible edge cases.
2016-01-18 16:49:21 +01:00
antirez fc3ca8ff87 Cluster: fix setting nodes slaveof pointer to NULL on node release.
With this commit we preserve the list of nodes that have .slaveof set
to the node, even when the node is turned into a slave, and make sure to
fix the .slaveof pointers to NULL when a node is freed from memory,
regardless of the fact it's a slave or a master.

Basically we try to remember the logical master in the current
configuration even if the logical master advertised it as a slave
already. However we still remember the associations, so that when a node
is freed we can fix them.

This should fix issue #3002.
2016-01-14 17:34:49 +01:00
antirez 02c40c9dc2 CLUSTER BUMPEPOCH initial implementation fixed. 2016-01-11 15:39:11 +01:00
antirez b58796f520 Cluster: CLUSTER BUMPEPOCH introduced to help redis-trib fix.
Sometimes during "fixes" we have to setup a new configuration and assign
slots to nodes. With BUMPEPOCH we can make sure the new configuration of
the node will win if there are conflicting configurations (for example
another node is *also* claiming the same slot because the cluster is
totally messed up).
2016-01-11 15:01:14 +01:00
antirez 524be1e465 Cluster: don't allow CLUSTER SETSLOT with slaves. 2016-01-11 15:00:45 +01:00
antirez e15e518a67 Allow MIGRATE to always be called on local keys for open slots.
Extend the MIGRATE extra freedom to be able to be called in the context
of the local slot, anytime there is a slot open in one or the other
direction (importing or migrating). This is useful for redis-trib to fix
the cluster when it has in an odd state.

Thix fix allows "redis-trib fix" to make its work in certain cases where
previously an error was reported.
2016-01-08 15:04:16 +01:00
antirez 36704d653b Fix typos & grammar in clusterBumpConfigEpochWithoutConsensus() comment. 2016-01-08 12:07:54 +01:00
antirez 00d637f2cc Cluster: don't send -ASK to MIGRATE.
For non existing keys, we don't want to send -ASK redirections to
MIGRATE, since when moving slots from the migrating node to the
importing node, we want just to ignore keys that are no longer there.
They may be expired or deleted between the GETKEYSINSLOT call and the
MIGRATE call. Otherwise this causes an error during migrations with
redis-trib (or equivalent cluster management tools).
2016-01-06 12:14:49 +01:00
antirez b9aeb98156 Suppress harmless warnings. 2015-12-16 12:36:32 +01:00
antirez ac0a731057 MIGRATE: Fix new argument rewriting refcount handling. 2015-12-11 14:26:41 +01:00
antirez d85fc1e9cf MIGRATE: fix replies processing and argument rewriting.
We need to process replies after errors in order to delete keys
successfully transferred. Also argument rewriting was fixed since
it was broken in several ways. Now a fresh argument vector is created
and set if we are acknowledged of at least one key.
2015-12-11 14:04:47 +01:00
antirez 9ebf7a6776 Pipelined multiple keys MIGRATE. 2015-12-11 13:38:26 +01:00
antirez adc2fe6993 Cluster: replica migration with delay.
We wait a fixed amount of time (5 seconds currently) much greater than
the usual Cluster node to node communication latency, before migrating.
This way when a failover occurs, before detecting the new master as a
target for migration, we give the time to its natural slaves (the slaves
of the failed over master) to announce they switched to the new master,
preventing an useless migration operation.
2015-12-11 09:19:06 +01:00
antirez 4159055f83 Remove debugging message left there for error. 2015-12-10 08:56:33 +01:00
antirez e0f22df995 Fix replicas migration by adding a new flag.
Some time ago I broken replicas migration (reported in #2924).
The idea was to prevent masters without replicas from getting replicas
because of replica migration, I remember it to create issues with tests,
but there is no clue in the commit message about why it was so
undesirable.

However my patch as a side effect totally ruined the concept of replicas
migration since we want it to work also for instances that, technically,
never had slaves in the past: promoted slaves.

So now instead the ability to be targeted by replicas migration, is a
new flag "migrate-to". It only applies to masters, and is set in the
following two cases:

1. When a master gets a slave, it is set.
2. When a slave turns into a master because of fail over, it is set.

This way replicas migration targets are only masters that used to have
slaves, and slaves of masters (that used to have slaves... obviously)
and are promoted.

The new flag is only internal, and is never exposed in the output nor
persisted in the nodes configuration, since all the information to
handle it are implicit in the cluster configuration we already have.
2015-12-09 23:03:18 +01:00
antirez a0d41e51c2 Redis Cluster: hint about validity factor when slave can't failover. 2015-11-27 08:59:17 +01:00
antirez c69c6c80fb Lazyfree: ability to free whole DBs in background. 2015-10-01 13:02:26 +02:00
antirez a7c5be18a8 Lazyfree: Sorted sets convereted to plain SDS. (several commits squashed) 2015-10-01 13:02:24 +02:00
antirez 02b1d5213d RDMF: use representClusterNodeFlags() generic name. 2015-07-27 15:08:58 +02:00
antirez 3325a9b11f RDMF: more names updated. 2015-07-27 15:03:10 +02:00
antirez 32f80e2f1b RDMF: More consistent define names. 2015-07-27 14:37:58 +02:00
antirez 40eb548a80 RDMF: REDIS_OK REDIS_ERR -> C_OK C_ERR. 2015-07-26 23:17:55 +02:00
antirez 2d9e3eb107 RDMF: redisAssert -> serverAssert. 2015-07-26 15:29:53 +02:00
antirez 14ff572482 RDMF: OBJ_ macros for object related stuff. 2015-07-26 15:28:00 +02:00
antirez 554bd0e7bd RDMF: use client instead of redisClient, like Disque. 2015-07-26 15:20:52 +02:00
antirez 424fe9afd9 RDMF: redisLog -> serverLog. 2015-07-26 15:17:43 +02:00
antirez cef054e868 RDMF (Redis/Disque merge friendlyness) refactoring WIP 1. 2015-07-26 15:17:18 +02:00
Jan-Erik Rediger d28c51d166 Do not attempt to lock on Solaris 2015-06-24 14:57:15 +02:00
antirez a401a84eb2 Don't try to bind the source address for MIGRATE
Related to issues #2609 and #2612.
2015-06-11 14:34:38 +02:00
antirez 9b7f8b1c9b Cluster: redirection refactoring + handling of blocked clients.
There was a bug in Redis Cluster caused by clients blocked in a blocking
list pop operation, for keys no longer handled by the instance, or
in a condition where the cluster became down after the client blocked.

A typical situation is:

1) BLPOP <somekey> 0
2) <somekey> hash slot is resharded to another master.

The client will block forever int this case.

A symmentrical non-cluster-specific bug happens when an instance is
turned from master to slave. In that case it is more serious since this
will desynchronize data between slaves and masters. This other bug was
discovered as a side effect of thinking about the bug explained and
fixed in this commit, but will be fixed in a separated commit.
2015-03-24 11:56:24 +01:00
antirez 94030fa4d7 Two cluster.c comments improved. 2015-03-21 12:12:23 +01:00
antirez 2950824ab6 Cluster: TAKEOVER option for manual failover. 2015-03-21 11:54:32 +01:00
antirez a7010ae208 Cluster: non-conditional steps of slave failover refactored into a function. 2015-03-20 17:56:21 +01:00
antirez 230d141420 Cluster: separate unknown master check from the rest.
In no case we should try to attempt to failover if myself->slaveof is
NULL.
2015-03-20 16:56:59 +01:00
antirez 4f2555aa17 Cluster: refactoring around configEpoch handling.
This commit moves the process of generating a new config epoch without
consensus out of the clusterCommand() implementation, in order to make
it reusable for other reasons (current target is to have a CLUSTER
FAILOVER option forcing the failover when no master majority is
reachable).

Moreover the commit moves other functions which are similarly related to
config epochs in a new logical section of the cluster.c file, just for
clarity.
2015-03-20 16:42:52 +01:00
antirez 25c0f5ac63 Cluster: better cluster state transiction handling.
Before we relied on the global cluster state to make sure all the hash
slots are linked to some node, when getNodeByQuery() is called. So
finding the hash slot unbound was checked with an assertion. However
this is fragile. The cluster state is often updated in the
clusterBeforeSleep() function, and not ASAP on state change, so it may
happen to process clients with a cluster state that is 'ok' but yet
certain hash slots set to NULL.

With this commit the condition is also checked in getNodeByQuery() and
reported with a identical error code of -CLUSTERDOWN but slightly
different error message so that we have more debugging clue in the
future.

Root cause of issue #2288.
2015-03-20 09:59:28 +01:00
antirez 438a1a84e8 Cluster: more robust slave check in CLUSTER REPLICATE.
There are rare conditions where node->slaveof may be NULL even if the
node is a slave. To check by flag is much more robust.
2015-03-18 12:10:14 +01:00
antirez 93b1320fac Cluster: fix CLUSTER NODES optimization error in 'j' increment. 2015-03-13 13:16:35 +01:00
antirez e1b6c9dd18 Cluster: CLUSTER NODES speedup. 2015-03-13 11:26:04 +01:00
Michel Martens 6201eb0c55 Add command CLUSTER MYID 2015-03-10 16:43:19 +00:00
antirez c77081a45a Migrate: replace conditional with pre-computed value. 2015-02-27 22:33:54 +01:00
antirez 832b0c7cce Improvements to PR #2425
1. Remove useless "cs" initialization.
2. Add a "select" var to capture a condition checked multiple times.
3. Avoid duplication of the same if (!copy) conditional.
4. Don't increment dirty if copy is given (no deletion is performed),
   otherwise we propagate MIGRATE when not needed.
2015-02-26 10:27:56 +01:00
Tommy Wang 7fda935ad3 Add last_dbid to migrateCachedSocket to avoid redundant SELECT
Avoid redundant SELECT calls when continuously migrating keys to
the same dbid within a target Redis instance.
2015-02-26 10:18:43 +01:00
Salvatore Sanfilippo d83c810265 Merge pull request #2301 from mattsta/fix/lengths
Improve type correctness
2015-02-24 17:22:53 +01:00
antirez 233729fe7f Cluster: some bias towwards FAIL/PFAIL nodes in gossip sections.
This improves PFAIL -> FAIL switch. Too late at this point in the RC
releases to add proper PFAIL/FAIL separate dictionary to do this in a
less randomized way. Tested in practice with experiments that this
helps. PFAIL -> FAIL average with 20 nodes and node-timeout set to 5
seconds takes 2.5 seconds without this commit, 1 second with this
commit.
2015-01-30 11:55:36 +01:00
antirez 69b4f00d28 More correct wanted / maxiterations values in clusterSendPing(). 2015-01-30 11:23:27 +01:00
antirez e5a22064cc Cluster: magical 10% of nodes explained in comments. 2015-01-29 15:43:35 +01:00
antirez 1efacfe53d CLUSTER count-failure-reports command added. 2015-01-29 15:02:10 +01:00
antirez 3fd43062c8 Cluster: use a number of gossip sections proportional to cluster size.
Otherwise it is impossible to receive the majority of failure reports in
the node_timeout*2 window in larger clusters.

Still with a 200 nodes cluster, 20 gossip sections are a very reasonable
amount of bytes to send.

A side effect of this change is also fater cluster nodes joins for large
clusters, because the cluster layout makes less time to propagate.
2015-01-29 14:20:59 +01:00
antirez 9802ec3c83 Cluster: initialized not used fileds in gossip section.
Otherwise we risk sending not initialized data to other nodes, that may
contain anything. This was actually not possible only because the
initialization of the buffer where the cluster packets header is created
was larger than the 3 gossip sections we use, so the memory was already
all filled with zeroes by the memset().
2015-01-24 07:52:24 +01:00
Matt Stancliff 051a43e03a Fix cluster migrate memory leak
Fixes valgrind error:
48 bytes in 1 blocks are definitely lost in loss record 196 of 373
   at 0x4910D3: je_malloc (jemalloc.c:944)
   by 0x42807D: zmalloc (zmalloc.c:125)
   by 0x41FA0D: dictGetIterator (dict.c:543)
   by 0x41FA48: dictGetSafeIterator (dict.c:555)
   by 0x459B73: clusterHandleSlaveMigration (cluster.c:2776)
   by 0x45BF27: clusterCron (cluster.c:3123)
   by 0x423344: serverCron (redis.c:1239)
   by 0x41D6CD: aeProcessEvents (ae.c:311)
   by 0x41D8EA: aeMain (ae.c:455)
   by 0x41A84B: main (redis.c:3832)
2015-01-21 18:47:16 +01:00
Matt Stancliff 29049507ec Fix potential invalid read past end of array
If array has N elements, we can't read +1 if we are already at N.

Also, we need to move elements by their storage size in the array,
not just by individual bytes.
2015-01-21 18:01:03 +01:00
Matt Stancliff 30152554ea Fix cluster reset memory leak
[maybe] Fixes valgrind errors:
32 bytes in 4 blocks are definitely lost in loss record 107 of 228
   at 0x80EA447: je_malloc (jemalloc.c:944)
   by 0x806E59C: zrealloc (zmalloc.c:125)
   by 0x80A9AFC: clusterSetMaster (cluster.c:801)
   by 0x80AEDC9: clusterCommand (cluster.c:3994)
   by 0x80682A5: call (redis.c:2049)
   by 0x8068A20: processCommand (redis.c:2309)
   by 0x8076497: processInputBuffer (networking.c:1143)
   by 0x8073BAF: readQueryFromClient (networking.c:1208)
   by 0x8060E98: aeProcessEvents (ae.c:412)
   by 0x806123B: aeMain (ae.c:455)
   by 0x806C3DB: main (redis.c:3832)

64 bytes in 8 blocks are definitely lost in loss record 143 of 228
   at 0x80EA447: je_malloc (jemalloc.c:944)
   by 0x806E59C: zrealloc (zmalloc.c:125)
   by 0x80AAB40: clusterProcessPacket (cluster.c:801)
   by 0x80A847F: clusterReadHandler (cluster.c:1975)
   by 0x30000FF: ???

80 bytes in 10 blocks are definitely lost in loss record 148 of 228
   at 0x80EA447: je_malloc (jemalloc.c:944)
   by 0x806E59C: zrealloc (zmalloc.c:125)
   by 0x80AAB40: clusterProcessPacket (cluster.c:801)
   by 0x80A847F: clusterReadHandler (cluster.c:1975)
   by 0x2FFFFFF: ???
2015-01-21 17:51:57 +01:00
Matt Stancliff 72b8574cca Fix sending uninitialized bytes
Fixes valgrind error:
Syscall param write(buf) points to uninitialised byte(s)
   at 0x514C35D: ??? (syscall-template.S:81)
   by 0x456B81: clusterWriteHandler (cluster.c:1907)
   by 0x41D596: aeProcessEvents (ae.c:416)
   by 0x41D8EA: aeMain (ae.c:455)
   by 0x41A84B: main (redis.c:3832)
 Address 0x5f268e2 is 2,274 bytes inside a block of size 8,192 alloc'd
   at 0x4932D1: je_realloc (jemalloc.c:1297)
   by 0x428185: zrealloc (zmalloc.c:162)
   by 0x4269E0: sdsMakeRoomFor.part.0 (sds.c:142)
   by 0x426CD7: sdscatlen (sds.c:251)
   by 0x4579E7: clusterSendMessage (cluster.c:1995)
   by 0x45805A: clusterSendPing (cluster.c:2140)
   by 0x45BB03: clusterCron (cluster.c:2944)
   by 0x423344: serverCron (redis.c:1239)
   by 0x41D6CD: aeProcessEvents (ae.c:311)
   by 0x41D8EA: aeMain (ae.c:455)
   by 0x41A84B: main (redis.c:3832)
 Uninitialised value was created by a stack allocation
   at 0x457810: nodeUpdateAddressIfNeeded (cluster.c:1236)
2015-01-21 17:50:17 +01:00
antirez 2601e3e461 Cluster: node deletion cleanup / centralization. 2015-01-21 16:03:43 +01:00
antirez 59ad6ac5fe Cluster: set the slaves->slaveof filed to NULL when master is freed.
Related to issue #2289.
2015-01-21 15:55:53 +01:00
Matt Stancliff 53c082ec39 Improve networking type correctness
read() and write() return ssize_t (signed long), not int.

For other offsets, we can use the unsigned size_t type instead
of a signed offset (since our replication offsets and buffer
positions are never negative).
2015-01-19 14:10:12 -05:00
antirez cf76af6b9f Cluster: fetch my IP even if msg is not MEET for the first time.
In order to avoid that misconfigured cluster nodes at some time may
force an IP update on other nodes, it is required that nodes update
their own address only on MEET messages. However it does not make sense
to do this the first time a node is contacted and yet does not have an
IP, we just risk that myself->ip remains not assigned if there are
messages lost or cluster creation procedures that don't make sure
everybody is targeted by at least one incoming MEET message.

Also fix the logging of the IP switch avoiding the :-1 tail.
2015-01-13 10:50:34 +01:00
antirez 5b0f4a83ac Cluster: clusterMsgDataGossip structure, explict padding + minor stuff.
Also explicitly set version to 0, add a protocol version define, improve
comments in the gossip structure.

Note that the structure layout is the same after the change, we are just
making the padding explicit with an additional not used 16 bits field.
So this commit is still able to talk with the previous versions of
cluster nodes.
2015-01-13 10:40:09 +01:00
antirez 237ab727b9 Suppress valgrind error about write sending uninitialized data.
Valgrind checks that the buffers we transfer via syscalls are all
composed of bytes actually initialized. This is useful, it makes we able
to avoid leaking informations in non initialized parts fo messages
transferred to other hosts. This commit fixes one of such issues.
2015-01-13 09:31:37 +01:00
antirez 6274a6789d Cluster: initialize mf_end.
Can't be initialized by resetManualFailover() since it's actual state
the function uses, so we need to initialize it at startup time. Not
really a bug in practical terms, but showed up into valgrind and is not
technically correct anyway.
2015-01-12 15:55:00 +01:00
Matt Stancliff ad41a7c404 Add addReplyBulkSds() function
Refactor a common pattern into one function so we don't
end up with copy/paste programming.
2014-12-23 09:31:02 -05:00
Matt Stancliff a772747ffc Cluster: Notify user on accept error
If we woke up to accept a connection, but we can't
accept it, inform the user of the error going on
with their networking.

(The previous message was the same for success or error!)
2014-12-17 10:49:32 -05:00
antirez 1aef29e079 Fix comment in clusterHandleSlaveFailover(). 2014-12-16 15:03:12 +01:00
antirez 90c7d8cfa1 Make sure buffer is enough in clusterSendPing(). 2014-12-15 10:18:22 +01:00
antirez ce269ad3c5 AnetFormatIP(): renamed, commented, now sticks to IP:port format.
A few code style changes + consistent format: not nice for humans but
better for parsers.
2014-12-11 18:20:30 +01:00
Matt Stancliff 491881e13b Cleanup all IP formatting code
Instead of manually checking for strchr(n,':') everywhere,
we can use our new centralized IP formatting functions.
2014-12-11 10:12:18 -05:00
antirez 06e76bc3e2 Better read-only behavior for expired keys in slaves.
Slaves key expire is orchestrated by the master. Sometimes the master
will send the synthesized DEL to expire keys on the slave with a non
trivial delay (when the key is not accessed, only the incremental expiry
algorithm will expire it in background).

During that time, a key is logically expired, but slaves still return
the key if you GET (or whatever) it. This is a bad behavior.

However we can't simply trust the slave view of the key, since we need
the master to be able to send write commands to update the slave data
set, and DELs should only happen when the key is expired in the master
in order to ensure consistency.

However 99.99% of the issues with this behavior is when a client which
is not a master sends a read only command. In this case we are safe and
can consider the key as non existing.

This commit does a few changes in order to make this sane:

1. lookupKeyRead() is modified in order to return NULL if the above
conditions are met.
2. Calls to lookupKeyRead() in commands actually writing to the data set
are repliaced with calls to lookupKeyWrite().

There are redundand checks, so for example, if in "2" something was
overlooked, we should be still safe, since anyway, when the master
writes the behavior is to don't care about what expireIfneeded()
returns.

This commit is related to  #1768, #1770, #2131.
2014-12-10 16:10:21 +01:00
antirez 669aa2a210 Cluster PUBLISH message: fix totlen count.
bulk_data field size was not removed from the count. It is not possible
to declare it simply as 'char bulk_data[]' since the structure is nested
into another structure.
2014-11-28 10:21:47 +01:00
Salvatore Sanfilippo 5a526c22cc Merge pull request #2096 from mattsta/cluster-ipv6
Enable Cluster IPv6 Support
2014-10-31 10:38:22 +01:00
Matt Stancliff 0014966c1e Networking: add more outbound IP binding fixes
Same as the original bind fixes (we just missed these the
first time around).

This helps Redis not automatically send
connections from the first IP on an interface if we are bound
to a specific IP address (e.g. with multiple IP aliases on one
interface, you want to send from _your_ IP, not from the first IP
on the interface).
2014-10-29 15:09:09 -04:00
Matt Stancliff daca1edb6e Parse cluster state file in IPv6 compatible way
We need to pick the port based on the _last_ colon, not the first one.
2014-10-29 15:08:35 -04:00
antirez 5f6950caa8 Cluster: process gossip section only for known nodes.
With the exception of nodes sending MEET packets: we have to trust them
since they can send us MEET packets only when the cluster is initially
created or because sysadmin manual action.
2014-10-08 16:58:12 +02:00
antirez 36e34a656a Cluster: fix logic to detect we are among a minority.
In the cluster evaluation function we are supposed to set the cluster
state as "fail" if we are among a minority, however the code was not
detecting to be into a minority partition if exactly half the masters
were reachable, which is a minority.
2014-10-08 16:27:07 +02:00
antirez edb3987a06 Cluster: more chatty slaves when failover is stalled. 2014-10-07 09:51:55 +02:00
Matt Stancliff 12d0195b30 Clean up text throughout project
- Remove trailing newlines from redis.conf
  - Fix comment misspelling
  - Clarifies zipEncodeLength usage and a C API mention (#1243, #1242)
  - Fix cluster typos (inspired by @papanikge #1507)
  - Fix rewite -> rewrite in a few places (inspired by #682)

Closes #1243, #1242, #1507
2014-09-29 06:49:07 -04:00
antirez 2374496799 Cluster: claim ping_sent time even if we can't connect.
This fixes a potential bug that was never observed in practice since
what happens is that the asynchronous connect returns ok (to fail later,
calling the handler) every time, so a ping is queued, and sent_ping
happens to always be populated.

Howver technically connect(2) with a non blocking socket may return an
error synchronously, so before this fix the code was not correct.
2014-09-17 16:39:41 +02:00
antirez c89afc8e5d Cluster: new option to work with partial slots coverage. 2014-09-17 11:10:09 +02:00
Matt Stancliff 60c448b584 Cluster: Fix segfault if cluster config corrupt
This commit adds a size check after initial config
line parsing to make sure we have *at least* 8 arguments
per line.

Also, instead of asserting for cluster->myself, we just test
and error out normally (since the error does a hard exit anyway).

Closes #1597
2014-08-25 10:11:38 +02:00
Matt Stancliff 879e18b7ec Fix memory leak in cluster config parsing
The continue stop us from triggering the
free after the long line for loop, so add it
earlier.
2014-08-18 11:27:19 +02:00
Matt Stancliff 6a7a32a806 Clarify existing slot wording on cluster start 2014-08-18 10:58:00 +02:00
antirez edca2b14d2 Remove warnings and improve integer sign correctness. 2014-08-13 11:44:38 +02:00
antirez ded57795ff representRedisNodeFlags() moved into right code section.
The funciton was also modified in order to be more standalone and
produce an output without trailing spaces, making the reuse simpler.
The global variable was renamed in cammel case as most other Redis
globals, except the main ones we refer too many times, like 'server'.
2014-08-08 15:53:42 +02:00
charsyam de5465baf7 Refactor cluster flag printing
Less copy/paste code duplication.

Closes #952
2014-08-08 15:39:44 +02:00
SungBin_Hong dec58464d8 Free memory in clusterLoadConfig error handler
Closes #1327
2014-08-08 14:40:32 +02:00
antirez 0d9bcb1c12 Cluster: don't migrate to a master that never had slaves.
Replica migration algorithm modified so that slaves never try to migrate
to masters that were never configured to have slaves in the past.
We want the algorithm to take care of masters that remained without
*working* slaves, but that used to have slaves according to the cluster
configuration.
2014-07-25 11:02:09 +02:00
antirez 89af463124 CLUSTER RESET: Flush dataset if node is a slave.
For non-empty masters, CLUSTER RESET is denied, and the user requires to
start to reset a node by explicitly clearing it with FLUSHALL.
However CLUSTER RESET when executed with slaves don't have this
restrictions since data is just a replica of the master, and with
read-only slaves it is also not possible to remove the data set. However
the node was turned from slave to master after a reset, without touching
the old slave data. This is 99.99% of times not appropriate and forces
full resets to follow this path to work with both slave and master
nodes:

    FLUSHALL
    CLUSTER RESET HARD
    FLUSHALL

Since we need the first flushall for masters, and the second for slaves.

This commit changes the behavior so that CLUSTER RESET removes the data set
of a slave node during a reset, in the moment it gets turned into a master,
so the new pattern is simply:

    FLUSHALL (that may fail for slaves)
    CLUSTER RESET
2014-07-22 15:29:57 +02:00
antirez 95b1979c32 No more trailing spaces in Redis source code. 2014-06-26 18:48:40 +02:00
antirez 75c57d53ea CLUSTER SLOTS: don't output failing slaves.
While we have to output failing masters in order to provide an accurate
map (that may be the one of a Redis Cluster in down state because not
all slots are served by a working master), to provide slaves in FAIL
state is not a good idea since those are not necesarely needed, and the
client will likely incur into a latency penalty trying to connect with a
slave which is down.

Note that this means that CLUSTER SLOTS does not provide a *complete*
map of slaves, however this would not be of any help since slaves may be
added later, and a client that needs to scale reads and requires to
stay updated with the list of slaves, need to do a refresh of the map
from time to time, anyway.
2014-06-25 15:19:35 +02:00
antirez a6fe4ca321 CLUSTER SLOTS: main loop should skip only slaves and zero slot masters. 2014-06-25 15:08:33 +02:00
Matt Stancliff e14829de30 Cluster: Add CLUSTER SLOTS command
CLUSTER SLOTS returns a Redis-formatted mapping from
slot ranges to IP/Port pairs serving that slot range.

The outer return elements group return values by slot ranges.

The first two entires in each result are the min and max slots for the range.

The third entry in each result is guaranteed to be either
an IP/Port of the master for that slot range - OR - null
if that slot range, for some reason, has no master

The 4th and higher entries in each result are replica instances
for the slot range.

Output comparison:
127.0.0.1:7001> cluster nodes
f853501ec8ae1618df0e0f0e86fd7abcfca36207 127.0.0.1:7001 myself,master - 0 0 2 connected 4096-8191
5a2caa782042187277647661ffc5da739b3e0805 127.0.0.1:7005 slave f853501ec8ae1618df0e0f0e86fd7abcfca36207 0 1402622415859 6 connected
6c70b49813e2ffc9dd4b8ec1e108276566fcf59f 127.0.0.1:7007 slave 26f4729ca0a5a992822667fc16b5220b13368f32 0 1402622415357 8 connected
2bd5a0e3bb7afb2b56a2120d3fef2f2e4333de1d 127.0.0.1:7006 slave 32adf4b8474fdc938189dba00dc8ed60ce635b0f 0 1402622419373 7 connected
5a9450e8279df36ff8e6bb1c139ce4d5268d1390 127.0.0.1:7000 master - 0 1402622418872 1 connected 0-4095
32adf4b8474fdc938189dba00dc8ed60ce635b0f 127.0.0.1:7002 master - 0 1402622419874 3 connected 8192-12287
5db7d05c245267afdfe48c83e7de899348d2bdb6 127.0.0.1:7004 slave 5a9450e8279df36ff8e6bb1c139ce4d5268d1390 0 1402622417867 5 connected
26f4729ca0a5a992822667fc16b5220b13368f32 127.0.0.1:7003 master - 0 1402622420877 4 connected 12288-16383

127.0.0.1:7001> cluster slots
1) 1) (integer) 0
   2) (integer) 4095
   3) 1) "127.0.0.1"
      2) (integer) 7000
   4) 1) "127.0.0.1"
      2) (integer) 7004
2) 1) (integer) 12288
   2) (integer) 16383
   3) 1) "127.0.0.1"
      2) (integer) 7003
   4) 1) "127.0.0.1"
      2) (integer) 7007
3) 1) (integer) 4096
   2) (integer) 8191
   3) 1) "127.0.0.1"
      2) (integer) 7001
   4) 1) "127.0.0.1"
      2) (integer) 7005
4) 1) (integer) 8192
   2) (integer) 12287
   3) 1) "127.0.0.1"
      2) (integer) 7002
   4) 1) "127.0.0.1"
      2) (integer) 7006
2014-06-25 15:03:41 +02:00
antirez f29b12d0bf Cluster: myself->ip autodiscovery.
Instead of having an hardcoded IP address in the node configuration, we
autodiscover it via MEET messages for automatic update when the node is
restarted with a different IP address.

This mechanism was discussed in the context of PR #1782.
2014-06-25 11:28:57 +02:00
Matt Stancliff d830dcb12d Add REDIS_BIND_ADDR access macro
We need to access (bindaddr[0] || NULL) in a few places, so centralize
access with a nice macro.
2014-06-23 11:44:34 +02:00
antirez 22d17bc14f Cluster: clear NOADDR flag when updating node address. 2014-06-20 09:32:47 +02:00
antirez 8ef79e72ac Cluster: fix an error message when logging failover auth denied. 2014-06-10 17:39:42 +02:00
antirez 58799718be Cluster: better comment for clusterSendFailoverAuthIfNeeded() epoch test. 2014-06-10 17:20:21 +02:00
antirez 61eb0eae83 Cluster: log granted failover authorizations. 2014-06-10 16:56:08 +02:00
antirez d5d92deb6c Cluster: log configEpoch updates to myself. 2014-06-10 16:38:36 +02:00
antirez 8204ab0098 Cluster: log when a master denies a failover auth. 2014-06-10 16:07:26 +02:00
antirez 9b3bc82c1a Cluster: cluster_my_epoch added to CLUSTER INFO output. 2014-06-10 11:35:40 +02:00
antirez 32d0a79f78 Cluster: check that configEpoch never goes back.
Since there are ways to alter the configEpoch outside of the failover
procedure (for exampel CLUSTER SET-CONFIG-EPOCH and via the configEpoch
collision resolution algorithm), make always sure, before replacing our
configEpoch with a new one, that it is greater than the current one.
2014-06-07 14:37:09 +02:00
antirez a2c2ef7de5 Cluster: SET-CONFIG-EPOCH should update currentEpoch.
SET-CONFIG-EPOCH, used by redis-trib at cluster creation time, failed to
update the currentEpoch, making it possible after a failover for a
server to set its configEpoch to a value smaller than the current one
(since configEpochs are obtained using currentEpoch).

The bug totally break the Redis Cluster algorithms and protocols
allowing for permanent split brain conditions about the slots
configuration as shown in issue #1799.
2014-06-07 14:25:47 +02:00
antirez 88c2307535 Cluster: always allow ok -> fail switch in clusterUpdateState().
There is a time defined by REDIS_CLUSTER_WRITABLE_DELAY where fail -> ok
switch is not possible after startup as a master for some time, however
the contrary (ok -> fail) should always be possible.
2014-05-26 16:24:12 +02:00
antirez 39603a7e31 Cluster: slave validity factor is now user configurable.
Check the commit changes in the example redis.conf for more information.
2014-05-22 16:57:54 +02:00
antirez 67133d2f48 Cluster: use clusterSetNodeAsMaster() during slave failover.
clusterHandleSlaveFailover() was reimplementing what
clusterSetNodeAsMaster() without any good reason.
2014-05-15 17:03:28 +02:00
antirez 8c6e92c3bc Cluster: clear todo_before_sleep flags when executing actions.
Thanks to this change, when there is some code like:

    clusterDoBeforeSleep(CLUSTER_TODO_UPDATE_STATE|...);
    ... and later before returning to the event loop ...
    clusterUpdateState();

The clusterUpdateState() function will clar the flag and will not be
repeated in the clusterBeforeSleep() function. This especially important
for config save/fsync flags which are slow to execute and not a good
idea to repeat without a good reason.

This is implemented for all the CLUSTER_TODO flags.
2014-05-15 16:33:13 +02:00
antirez 7b87cda70e Fixed typo in CLUSTER RESET implementation. 2014-05-15 12:33:57 +02:00
antirez 796f4ae9f7 CLUSTER RESET implemented.
The new command is able to reset a cluster node so that it starts again
as a fresh node. By default the command performs a soft reset (the same
as calling it as CLUSTER RESET SOFT), and the following steps are
performed:

1) All slots are set as unassigned.
2) The list of known nodes is flushed.
3) Node is set as master if it is a slave.

When an hard reset is performed with CLUSTER RESET HARD the following
additional operations are performed:

4) A new Node ID is created at random.
5) Epochs are set to 0.

CLUSTER RESET is useful both when the sysadmin wants to reconfigure a
node with a different role (for example turning a slave into a master)
and for testing purposes.

It also may play a role in automatically provisioned Redis Clusters,
since it allows to reset a node back to the initial state in order to be
reconfigured.
2014-05-15 11:43:06 +02:00
antirez 8b9d5ecbd1 Remove trailing spaces from cluster.c file. 2014-05-15 10:18:36 +02:00
antirez 60e5d1724c Cluster: don't accept cluster bus connections during startup. 2014-05-14 12:05:00 +02:00
antirez 6baac558d8 Cluster: better handling of stolen slots.
The previous code handling a lost slot (by another master with an higher
configuration for the slot) was defensive, considering it an error and
putting the cluster in an odd state requiring redis-cli fix.

This was changed, because actually this only happens either in a
legitimate way, with failovers, or when the admin messed with the config
in order to reconfigure the cluster. So the new code instead will try to
make sure that the keys stored match the new slots map, by removing all
the keys in the slots we lost ownership from.

The function that deletes the keys from the lost slots is called only
if the node does not lose all its slots (resulting in a reconfiguration
as a slave of the node that got ownership). This is an optimization
since the replication code will anyway flush all the instance data in
a faster way.
2014-05-14 10:46:37 +02:00
antirez 832a298005 Cluster: fixed data_age computation / check integer overflow. 2014-05-12 17:46:15 +02:00
antirez 2692339138 Cluster: forced failover implemented.
Using CLUSTER FAILOVER FORCE it is now possible to failover a master in
a forced way, which means:

1) No check to understand if the master is up is performed.
2) No data age of the slave is checked. Evan a slave with very old data
   can manually failover a master in this way.
3) No chat with the master is attempted to reach its replication offset:
   the master can just be down.
2014-05-12 16:34:20 +02:00
antirez 005f564eb3 Cluster: bypass data_age check for manual failovers.
Automatic failovers only happen in Redis Cluster if the slave trying to
be elected was disconnected from its master for no more than 10 times
the node-timeout value. However there should be no such a check for
manual failovers, since these are initiated by the sysadmin that, in
theory, knows what she is doing when a slave is selected to be promoted.
2014-05-12 16:12:12 +02:00
antirez 5c78f87666 RESTORE: reply with -BUSYKEY special error code.
The error when the target key is busy was a generic one, while it makes
sense to be able to distinguish between the target key busy error and
the others easily.
2014-05-12 10:01:59 +02:00
antirez 71d0e7e0ea CLUSTER MEET: better error messages when address is invalid.
Fixes issue #1734.
2014-05-09 16:36:59 +02:00
antirez 8a170c817d Cluster: bulk-accept new nodes connections.
The same change was operated for normal client connections. This is
important for Cluster as well, since when a node rejoins the cluster,
when a partition heals or after a restart, it gets flooded with new
connection attempts by all the other nodes trying to form a full
mesh again.
2014-05-09 11:52:59 +02:00
antirez 3625b52791 Cluster: clusterAcceptHandler() comments updated to match the code. 2014-05-09 11:44:46 +02:00
antirez 11d9ecb71d CLUSTER SET-CONFIG-EPOCH implemented.
Initially Redis Cluster accepted that after cluster creation all the
nodes were at configEpoch 0, evolving from zero as failovers happen.

However later the semantic was made more strict in order to make sure a
cluster has always all the master nodes with a different configEpoch,
which is more robust in some corner case (especially resulting from
errors by the system administrator).

To assign different configEpochs to different nodes at startup was a
task performed naturally by the config conflicts resolution algorithm
(see the Cluster specification). However this works well only for small
clusters or when there are actually just a few collisions, since it is
designed for exceptional cases.

When a large cluster is created hundred of nodes can be at epoch 0, so
the conflict resolution code is slow to provide an unique config to each
node. For this reason this new command was introduced. It can be called
only when a node is totally fresh: no other nodes known, and configEpoch
set to zero, so it is safe even against misuses.

redis-trib will use the new command in order to start the cluster
already setting an incremental unique config to every node.
2014-04-29 19:15:16 +02:00
antirez e3cf812c9e clusterLoadConfig() REDIS_ERR retval semantics refined.
We should return REDIS_ERR to signal we can't read the configuration
because there is no config file only after checking errno, othewise
we risk to rewrite an existing file that was not accessible for some
other reason.
2014-04-24 16:23:03 +02:00
antirez db06108bc1 Lock nodes.conf to avoid multiple processes using the same file.
This was a common source of problems among users.
The solution adopted is not bullet-proof as if the user deletes the
nodes.conf file manually, and starts a new instance with the same
nodes.conf file path, two instances will use the same file. However
following this reasoning the user may drop a nuclear bomb into the
datacenter as well.
2014-04-24 16:04:10 +02:00
kingsumos a69178fdd2 fix cluster node description showing wrong slot allocation 2014-04-22 11:44:53 -04:00
antirez 67bb2c46b2 Add casting to match printf format.
adjustOpenFilesLimit() and clusterUpdateSlotsWithConfig() that were
assuming uint64_t is the same as unsigned long long, which is true
probably for all the systems out there that we target, but still GCC
emitted a warning since technically they are two different types.
2014-04-07 08:58:06 +02:00
antirez 8f52173b2c Cluster: last_vote_epoch -> lastVoteEpoch.
Use cammel case for epochs that are persisted on disk.
2014-03-27 15:01:24 +01:00
antirez 7fb14b73ba Cluster: save/restore vars that must persist after recovery.
This fixes issue #1479.
2014-03-27 14:56:29 +01:00
antirez 6dd2dbbd36 Cluster: handshake "already known" error logged to VERBOSE.
This is not really an error but something that always happens for
example when creating a new cluster, or if the sysadmin rejoins manually
a node that is already known.

Since useless logs don't help, moved to VERBOSE level.
2014-03-26 16:35:38 +01:00
antirez 3cf6f1f54f Cluster: clusterHandleConfigEpochCollision() fixed.
New config epochs must always be obtained incrementing the currentEpoch,
that is itself guaranteed to be >= the max configEpoch currently known
to the node.
2014-03-26 12:31:28 +01:00
antirez 80d4c52cdf Cluster: better logging for clusterUpdateSlotsConfigWith(). 2014-03-26 12:09:38 +01:00
antirez eb746ec408 Cluster: CLUSTER SETSLOT implementation comment updated.
Update the comment since the implementation details changed.
2014-03-25 17:50:46 +01:00
antirez 6c527a89a0 Cluster: configEpoch collisions resolution.
The slave election in Redis Cluster guarantees that slaves promoted to
masters always end with unique config epochs, however failures during
manual reshardings, software bugs and operational errors may in theory
cause two nodes to have the same configEpoch.

This commit introduces a mechanism to eventually always end with different
configEpochs if a collision ever happens.

As a (wanted) side effect, this also ensures that after a new cluster
is created, all nodes will end with a different configEpoch automatically.
2014-03-25 17:19:58 +01:00
antirez c1041c570f Cluster: stay within 80 cols. 2014-03-25 16:07:14 +01:00
antirez 82b53c650c struct dictEntry -> dictEntry. 2014-03-20 16:20:37 +01:00
antirez e26f4486b0 Cluster: update node configEpoch on UPDATE messages.
The UPDATE message contains the configEpoch of the node configuration
advertised in the packet. Update it if needed.
2014-03-11 11:53:09 +01:00
antirez a2ff90919f Cluster: set slot error if we receive an update for a busy slot.
By manually modifying nodes configurations in random ways, it is possible
to create the following scenario:

A is serving keys for slot 10
B is manually configured to serve keys for slot 10

A receives an update from B (or another node) where it is informed that
the slot 10 is now claimed by B with a greater configuration epoch,
however A still has keys from slot 10.

With this commit A will put the slot in error setting it in IMPORTING
state, so that redis-trib can detect the issue.
2014-03-11 11:49:47 +01:00
antirez 1ed0ad77f0 Cluster: clarified a comment in clusterUpdateSlotsConfigWith(). 2014-03-11 11:32:40 +01:00
antirez 8287945ff8 Cluster: flush importing/migrating state when master is turned into slave. 2014-03-11 11:22:06 +01:00
antirez 2e8e0ad44e Cluster: clusterCloseAllSlots() added. 2014-03-11 11:16:18 +01:00
antirez 787b297046 Cluster: getKeysFromCommand() API cleaned up.
This API originated from the "diskstore" experiment, not for Redis
Cluster itself, so there were legacy/useless things trying to
differentiate between keys that are going to be overwritten and keys
that need to be fetched from disk (preloaded).

All useless with Cluster, so removed with the result of code
simplification.
2014-03-10 13:18:41 +01:00
antirez c1a7d3e61f Cluster: abort on port too high error.
It also fixes multi-line comment style to be consistent with the rest of
the code base.

Related to #1555.
2014-03-10 10:41:27 +01:00
Salvatore Sanfilippo 442b06db54 Merge pull request #1555 from mattsta/cluster-port-error-out
Cluster port error out
2014-03-10 10:37:50 +01:00
antirez ed8c55237b Cluster: be explicit about passing NULL as bind addr for connect.
The code was already correct but it was using that bindaddr[0] is set to
NULL as a side effect of current implementation if no bind address is
configured. This is not guarnteed to hold true in the future.
2014-03-10 10:33:53 +01:00
antirez 3e8a92ef8d Cluster: log error when anetTcpNonBlockBindConnect() fails. 2014-03-10 10:32:28 +01:00
Salvatore Sanfilippo 3b0edb80ec Merge pull request #1567 from mattsta/fix-cluster-join
Bind source address for cluster communication
2014-03-10 10:28:32 +01:00
antirez 0f1f25784f Cluster: better timeout and retry time for failover.
When node-timeout is too small, in the order of a few milliseconds,
there is no way the voting process can terminate during that time, so we
set a lower limit for the failover timeout of two seconds.

The retry time is set to two times the failover timeout time, so it is
at least 4 seconds.
2014-03-10 09:57:52 +01:00
antirez 6984692060 Cluster: fix conditional generating TRYAGAIN error. 2014-03-07 16:18:00 +01:00
antirez 36676c2318 Redis Cluster: support for multi-key operations. 2014-03-07 13:19:09 +01:00
Matt Stancliff 385c25f70f Remove redundant IP length definition
REDIS_CLUSTER_IPLEN had the same value as
REDIS_IP_STR_LEN.  They were both #define'd
to the same INET6_ADDRSTRLEN.
2014-03-06 17:55:43 +01:00
Matt Stancliff d2040ab9b1 Remove some redundant code
Function nodeIp2String in cluster.c is exactly
anetPeerToString with a pre-extracted fd.
2014-03-06 17:55:39 +01:00
Matt Stancliff 59cf0b1902 Fix return value check for anetTcpAccept
anetTcpAccept returns ANET_ERR, not AE_ERR.

This isn't a physical error since both ANET_ERR
and AE_ERR are -1, but better to be consistent.
2014-03-06 17:55:31 +01:00
Matt Stancliff e5b1e7be64 Bind source address for cluster communication
The first address specified as a bind parameter
(server.bindaddr[0]) gets used as the source IP
for cluster communication.

If no bind address is specified by the user, the
behavior is unchanged.

This patch allows multiple Redis Cluster instances
to communicate when running on the same interface
of the same host.
2014-03-04 17:36:45 -05:00
antirez 8dea2029a4 Fix configEpoch assignment when a cluster slot gets "closed".
This is still code to rework in order to use agreement to obtain a new
configEpoch when a slot is migrated, however this commit handles the
special case that happens when the nodes are just started and everybody
has a configEpoch of 0. In this special condition to have the maximum
configEpoch is not enough as the special epoch 0 is not unique (all the
others are).

This does not fixes the intrinsic race condition of a failover happening
while we are resharding, that will be addressed later.
2014-03-03 11:12:11 +01:00
Matt Stancliff ce68caea37 Cluster: error out quicker if port is unusable
The default cluster control port is 10,000 ports higher than
the base Redis port.  If Redis is started on a too-high port,
Cluster can't start and everything will exit later anyway.
2014-02-19 17:30:07 -05:00
antirez db6d628c3e Cluster: clusterDelNode(): remove node from master's slaves. 2014-02-11 10:34:25 +01:00
antirez 5e0e03be41 Cluster: UPDATE messages are the norm and verbose.
Logging them at WARNING level was of little utility and of sure disturb.
2014-02-11 10:18:24 +01:00
antirez 4a64286c36 Cluster: configEpoch assignment in SETNODE improved.
Avoid to trash a configEpoch for every slot migrated if this node has
already the max configEpoch across the cluster.

Still work to do in this area but this avoids both ending with a very
high configEpoch without any reason and to flood the system with fsyncs.
2014-02-11 10:09:17 +01:00
antirez 72f7abf6a2 Cluster: clusterSetStartupEpoch() made more generally useful.
The actual goal of the function was to get the max configEpoch found in
the cluster, so make it general by removing the assignment of the max
epoch to currentEpoch that is useful only at startup.
2014-02-11 10:00:14 +01:00
antirez 44f7afe28a Cluster: always increment the configEpoch in SETNODE after import.
Removed a stale conditional preventing the configEpoch from incrementing
after the import in certain conditions. Since the master got a new slot
it should always claim a new configuration.
2014-02-11 09:50:37 +01:00
antirez a1349728ea Cluster: on resharding upgrade version of receiving node.
The node receiving the hash slot needs to have a version that wins over
the other versions in order to force the ownership of the slot.

However the current code is far from perfect since a failover can happen
during the manual resharding. The fix is a work in progress but the
bottom line is that the new version must either be voted as usually,
set by redis-trib manually after it makes sure can't be used by other
nodes, or reserved configEpochs could be used for manual operations (for
example odd versions could be never used by slaves and are always used
by CLUSTER SETSLOT NODE).
2014-02-11 00:36:05 +01:00
antirez 6dc26795aa Cluster: fsync at every SETSLOT command puts too pressure on disks.
During slots migration redis-trib can send a number of SETSLOT commands.
Fsyncing every time is a bit too much in production as verified
empirically.

To make sure configs are fsynced on all nodes after a resharding
redis-trib may send something like CLUSTER CONFSYNC.

In this case fsyncs were not providing too much value since anyway
processes can crash in the middle of the resharding of an hash slot, and
redis-trib should be able to recover from this condition anyway.
2014-02-10 23:54:08 +01:00
antirez 218358bbbd Cluster: conditions to clear "migrating" on slot for SETSLOT ... NODE changed.
If the slot is manually assigned to another node, clear the migrating
status regardless of the fact it was previously assigned to us or not,
as long as we no longer have keys for this slot.

This avoid a race during slots migration that may leave the slot in
migrating status in the source node, since it received an update message
from the destination node that is already claiming the slot.

This way we are sure that redis-trib at the end of the slot migration is
always able to close the slot correctly.
2014-02-10 23:51:47 +01:00
antirez bf670e0745 Cluster: don't update slave's master if we don't know it.
There is no way we can update the slave's node->slaveof pointer if we
don't know the master (no node with such an ID in our tables).
2014-02-10 18:33:34 +01:00
antirez a3755ae9ee Cluster: ignore slot config changes if we are importing it. 2014-02-10 18:04:43 +01:00
antirez 6fc53e16ad Cluster: update configEpoch after manually messing with slots. 2014-02-10 18:01:58 +01:00
antirez 1a73c992a3 Cluster: fixed inverted arguments in logging function call. 2014-02-10 17:21:10 +01:00
antirez 32563b4a5f Cluster: clear the FAIL status for masters without slots.
Masters without slots don't participate to the cluster but just do
redirections, no need to take them in FAIL state if they are back
reachable.
2014-02-10 17:18:27 +01:00
antirez 5b2082ead3 Cluster: replica migration should only work for masters serving slots. 2014-02-10 17:08:37 +01:00
antirez f885fa8bac Cluster: clusterReadHandler() fixed to work with new message header. 2014-02-10 16:27:37 +01:00
antirez 7bf7b7350c Cluster: signature changed to "RCmb" (Redis Cluster message bus).
Sounds better after all.
2014-02-10 15:55:21 +01:00
antirez dced9c0619 Cluster: discard bus messages with version != 0. 2014-02-10 15:54:22 +01:00
antirez 007e1c7cb2 Cluster: added signature + version in bus packets. 2014-02-10 15:53:09 +01:00
antirez 142281dc79 Cluster: keys slot computation now supports hash tags.
Currently this is marginally useful, only to make sure two keys are in
the same hash slot when the cluster is stable (no rehashing in
progress).

In the future it is possible that support will be added to run
mutli-keys operations with keys in the same hash slot.
2014-02-07 17:39:01 +01:00
antirez 04fe000bf8 Cluster: fixed MF condition in clusterHandleSlaveFailover().
For manual failover we need a manual failover in progress, and that
mf_can_start is true (master offset received and matched).
2014-02-05 16:01:56 +01:00
antirez c6f02fd67a Cluster: CLUSTER FAILOVER replies with OK and logs the event. 2014-02-05 15:52:38 +01:00
antirez c72449af30 Cluster: check that a MF is in progress in manualFailoverCheckTimeout().
Otherwise it is always detected as a manual failover timed out.
2014-02-05 15:45:24 +01:00
antirez b7402bcad5 Cluster: force AUTH ACK on manual failover.
When a slave requests masters vote for a manual failover, the
REQUEST_AUTH message is flagged in a special way in order to force the
masters to give the authorization even if the master is not marked as
failing.
2014-02-05 13:10:03 +01:00
antirez 4cf0cd5719 Cluster: manual failover initial implementation. 2014-02-05 13:01:24 +01:00
antirez a7d30681c9 Cluster: configurable replicas migration barrier.
It is possible to configure the min number of additional working slaves
a master should be left with, for a slave to migrate to an orphaned
master.
2014-01-31 11:26:36 +01:00
antirez 6c9359add1 Cluster: perform orphaned masters check before continue statements.
The check was placed in a way that conflicted with the continue
statements used by the node hearth beat code later that needs to skip
the current node sometimes. Moved at the start of the function so that's
always executed.
2014-01-30 18:23:31 +01:00
antirez c2507b0ff6 Cluster: replica migration implementation.
This feature allows slaves to migrate to orphaned masters (masters
without working slaves), as long as a set of conditions are met,
including the fact that the migrating slave needs to be in a
master-slaves ring with at least another slave working.
2014-01-30 18:05:11 +01:00
antirez 5b4020fb42 Cluster: swap two code blocks to have a more obvious flow. 2014-01-30 16:34:23 +01:00
antirez 4beaaff8ea Cluster: remove not needed return statement breaking failover. 2014-01-29 17:28:46 +01:00
antirez 3582054982 Cluster: broadcast pong to other slaves in the same ring.
When we schedule a failover, broadcast a PONG to the slaves.
The other slaves that plan to get elected will do the same too, this way
it is likely that every slave will have a good picture of its own rank.

Note that this is N*N messages where N is the number of slaves for the
failing master, however usually even large clusters have many master
nodes but a limited number of replicas per node, so this is harmless.
2014-01-29 17:19:55 +01:00
antirez e2b59621a8 Cluster: log offset when announcing the failover election delay. 2014-01-29 17:16:10 +01:00
antirez 940531e9b7 Cluster: added progressive election delay according to slave rank.
Note that when we compute the initial delay, there are probably still
more up to date information to receive from slaves with new offsets, so
the delay is recomputed when new data is available.
2014-01-29 16:53:45 +01:00
antirez 6f54032080 Cluster: function clusterGetSlaveRank() added.
Return the number of slaves for the same master having a better
replication offset of the current slave, that is, the slave "rank" used
to pick a delay before the request for election.
2014-01-29 16:39:04 +01:00
antirez 40cd38f0c4 Cluster: update node replication offset from bus packets headers. 2014-01-29 16:01:00 +01:00
antirez 9d4ded7ec6 Cluster: refactoring: new macros to check node flags. 2014-01-29 12:17:16 +01:00
antirez 099bd336db Cluster: use myself instead of server->cluster.myself. 2014-01-29 11:38:14 +01:00
antirez e36bd8b43e Cluster: added a global myself pointer in cluster.c.
Accessing to the 'myself' node, the node representing the currently
running instance, is handy without the need to type
server.cluster->myself every time.
2014-01-29 11:22:22 +01:00
antirez f1e09d8c41 Cluster: clusterBroadcastPong() improved with target selection.
Now we can broadcast a pong to all the instances or just the local
slaves (that is useful for replication offset propagation).
2014-01-29 11:08:52 +01:00
antirez befcf6259e Cluster: broadcast master/slave replication offset in bus header. 2014-01-28 16:51:50 +01:00
antirez 0b1b25c51c Cluster: introduced repl_offset fields in clusterNode.
The two fields are used in order to remember the latest known
replication offset and the time we received it from other slave nodes.

This will be used by slaves in order to start the election procedure
with a delay that is proportional to the rank of the slave among the
other slaves for this master, when sorted for replication offset.

Usually this allows the slave with the most updated offset to win the
election and replace the failing master in the cluster.
2014-01-28 16:28:07 +01:00
antirez 0f9422d575 Cluster: update slaves lists in clusterSetMaster(). 2014-01-22 18:46:53 +01:00
antirez 5383ab0bc6 Cluster: CLUSTER SLAVES subcommand added. 2014-01-22 18:38:42 +01:00
antirez 603e480fd5 Cluster: clusterGenNodesDescription() refactored into two functions. 2014-01-22 18:36:12 +01:00
antirez 80e80668f4 Cluster: master nodes wait before rejoining the cluster after reboot.
One of the simple heuristics used by Redis Cluster in order to avoid
losing data in the typical failure modes created by the asynchronous
replication with the slaves (a master is unable, when accepting a
write, to immediately tell if it should be really accepted or refused
because of a configuration change), is to wait some time before to
rejoin the cluster after being partitioned away from the majority of
instances.

A similar condition happens when a master is restarted. It does not know
if it was already failed over, nor if all the clients have already an
updated configuration about the cluster map, so it is possible that
clients will try to write to stale masters that were restarted.

In a similar way this commit changes masters behavior so they wait
2000 milliseconds before accepting writes after a reboot. There is
nothing special about 2 seconds if not to be a value supposedly larger
a few orders of magnitude compared to the cluster bus communication
latencies.
2014-01-20 11:52:52 +01:00
antirez e6970e204f Cluster: debug printf statemets removed.
These were committed for error after being inserted in order to fix an
issue.
2014-01-20 11:19:04 +01:00
antirez ac3850cabd Cluster: allow CLUSTER REPLICATE to switch master.
The code was doing checks for slaves that should be done only when the
instance is currently a master. Switching a slave from a master to
another one should just work.
2014-01-17 18:22:35 +01:00
antirez 3d455393a6 Cluster: don't let a node forget its own master.
redis-trib should make sure to reconfigure slaves of a node to remove
from the cluster to replicate with other nodes before sending CLUSTER
FORGET.
2014-01-16 17:49:35 +01:00
antirez 0c373207fa Cluster: don't forget yourself with CLUSTER FORGET. 2014-01-16 09:46:23 +01:00
antirez 3e948970fe Cluster: use the node blacklist in CLUSTER FORGET.
CLUSTER FORGET is not useful if we can't remove a node from all the
nodes of our cluster because of the Gossip protocol that keeps adding
a given node to nodes where we already tried to remove it.

So now CLUSTER FORGET implements a nodes blacklist that is set and
checked by the Gossip section processing function. This way before a
node is re-added at least 60 seconds must elapse since the FORGET
execution.

This means that redis-trib has some time to remove a node from a whole
cluster. It is possible that in the future it will be uesful to raise
the 60 sec figure to something bigger.
2014-01-15 16:50:45 +01:00
antirez ccf268fa17 Cluster: fix clusterBlacklistAddNode() by setting right expire time.
The hash table value should be set to now + 60 seconds otherwise it
expires immediately.
2014-01-15 16:49:31 +01:00
antirez 4e1861155f Cluster: clusterBlacklistAddNode() key lookup fixed.
We can't lookup by node->name that's not an SDS string but a plain C
array in the node structure.
2014-01-15 16:45:07 +01:00
antirez b51be7b34f Cluster: clusterBlacklistExists() requires blacklist cleanup before lookup. 2014-01-15 16:06:54 +01:00
antirez a81340abaf Cluster: set a minimum rejoin delay if node_timeout is too small.
The rejoin delay usually is the node timeout. However if the node
timeout is too small, we set it to 500 milliseconds, that is a value
chosen to be greater than most setups RTT / instances latency figures
so that likely communication with other nodes happen before rejoining.
2014-01-15 12:34:33 +01:00
antirez a687cbc19c Cluster: periodically call clusterUpdateState() when cluster is down.
Usually we update the cluster state (to understand if we should accept
queries or reply with an error) only when there is a change in the state
of the nodes. However for the "delayed rejoin" feature to work, that is,
for a master to wait some time before accepting queries again after it
rejoins the majority, we need to periodically update the last time when
the node was partitioned away from the majority.

With this commit if the cluster is down we update the state ten times
per second.
2014-01-15 12:26:12 +01:00
antirez 25ddefdea3 Cluster: range checking in getSlotOrReply() fixed.
See issue #1426 on Github.
2014-01-15 11:33:46 +01:00
antirez fb659cd334 Cluster: ignore empty lines in nodes.conf.
Even without the user messing manually with the file, it is still
possible to have blank lines (just a single "\n" per line) because of
how the nodes.conf update/write process works.
2014-01-15 11:23:41 +01:00
antirez 6c63df3031 Cluster: atomic update of nodes.conf file.
The way the file was generated was unsafe and leaded to nodes.conf file
corruption (zero length file) on server stop/crash during the creation
of the file.

The previous file update method was as simple as open with O_TRUNC
followed by the write call. While the write call was a single one with
the full payload, ensuring no half-written files for POSIX semantics,
stopping the server just after the open call resulted into a zero-length
file (all the nodes information lost!).
2014-01-15 10:31:20 +01:00
antirez 28273394cb Cluster: support to read from slave nodes.
A client can enter a special cluster read-only mode using the READONLY
command: if the client read from a slave instance after this command,
for slots that are actually served by the instance's master, the queries
will be processed without redirection, allowing clients to read from
slaves (but without any kind fo read-after-write guarantee).

The READWRITE command can be used in order to exit the readonly state.
2014-01-14 16:33:16 +01:00
antirez 58c8a071a5 Fix RESTORE ttl handling in 32 bit archs.
long was used instead of long long in order to handle a 64 bit
resolution millisecond timestamp.

This fixes issue #1483.
2014-01-09 11:09:23 +01:00
antirez f510549044 Cluster: clusterProcessPacket() was not 80 cols friendly.
The function actually needs to be split into sub-functions at some
point in the future.
2013-12-25 17:57:36 +01:00
antirez 66ec1412fe Redis Cluster: add repl_ping_slave_period to slave data validity time.
When the configured node timeout is very small, the data validity time
(maximum data age for a slave to try a failover) is too little (ten
times the configured node timeout) when the replication link with the
master is mostly idle. In this case we'll receive some data from the
master only every server.repl_ping_slave_period to refresh the last
interaction with the master.

This commit adds to the max data validity time the slave ping period to
avoid this problem of slaves sensing too old data without a good reason.
However this max data validity time is likely a setting that should be
configurable by the Redis Cluster user in a way completely independent
from the node timeout.
2013-12-22 10:05:16 +01:00
antirez 658aff9d29 Redis Cluster: move node failure reports logging from VERBOSE to NOTICE level. 2013-12-21 00:04:53 +01:00
antirez 5a404c87c1 Redis Cluster: remove no longer relevant comment. 2013-12-20 14:40:11 +01:00
antirez fda4cba912 Redis Cluster: reconfigure replication when master changes address. 2013-12-20 12:47:22 +01:00
antirez d7374032c0 Redis Cluster: handshake code refactoring + Gossip IP switch detection.
This commit makes it simple to start an handshake with a specific node
address, and uses this in order to detect a node IP change and start a
new handshake in order to fix the IP if possible.
2013-12-20 12:38:03 +01:00
antirez a2c938c834 Redis Cluster: delay state change when in the majority again.
As specified in the Redis Cluster specification, when a node can reach
the majority again after a period in which it was partitioend away with
the minorty of masters, wait some time before accepting queries, to
provide a reasonable amount of time for other nodes to upgrade its
configuration.

This lowers the probabilities of both a client and a master with not
updated configuration to rejoin the cluster at the same time, with a
stale master accepting writes.
2013-12-20 09:56:18 +01:00
antirez 7a666ac419 Cluster: set n->slaves to NULL in clusterNodeResetSlaves().
The value was otherwise undefined, so next time the node was promoted
again from slave to master, adding a slave to the list of slaves
would likely crash the server or result into undefined behavior.
2013-12-17 14:50:24 +01:00
antirez fda91dbde3 Cluster: check link is valid before sending UPDATE. 2013-12-17 12:28:37 +01:00
antirez f57bb36ce7 Cluster: initialize todo_before_sleep flags to 0. 2013-12-17 12:22:02 +01:00
antirez c70c0c6db7 Cluster: use proper type mstime_t for ping delay var. 2013-12-17 10:27:36 +01:00
antirez 47815d38e0 Fixed clearNodeFailureIfNeeded() time type to mstime_t.
This prevented 32bit cluster instances from clearing the FAIL flag when
needed.
2013-12-17 09:45:52 +01:00
antirez e88e6a6334 Cluster: use long long for timestamps in clusterGenNodesDescription().
Ping sent and pong received fields need to be casted to long long to be
printed correctly into 32 bit systems.
2013-12-17 09:38:11 +01:00
antirez 11e81a1e9a Fixed grammar: before H the article is a, not an. 2013-12-05 16:35:32 +01:00
antirez 6fa42b7507 Cluster: nodes re-addition blacklist API. 2013-12-02 11:12:23 +01:00
antirez 8f18345ef0 Cluster: basic data structures for nodes black list. 2013-11-29 17:37:06 +01:00
antirez 3db825fde4 Cluster: some code about clusterHandleSlaveFailover() marginally improved.
80 cols friendly, some minor change to the code to make it simpler.
2013-11-29 16:17:05 +01:00
antirez a5e7358a12 Cluster: removed not needed newline at end of redisLog() msg. 2013-11-08 17:28:02 +01:00
antirez 28071caf38 Cluster: send a single UPDATE packet for now. 2013-11-08 17:25:49 +01:00
antirez d289c628b1 Cluster: replace hardcoded 4096 for bus msg len with sizeof(). 2013-11-08 17:19:19 +01:00
antirez 94a07d5901 Cluster: slots update refactored + UPDATE msg processing.
Now there is a function that handles the update of the local slot
configuration every time we have some new info about a node and its set
of served slots and configEpoch.

Moreoever the UPDATE packets are now processed when received (it was a
work in progress in the previous commit).
2013-11-08 17:02:10 +01:00
antirez dc43f66eac Cluster: UPDATE msg data structure and sending function. 2013-11-08 16:26:50 +01:00
antirez 6c6572be95 Cluster: refactoring of slots update code and more.
The commit also introduces detection of nodes publishing not updated
configuration. More work in progress to send an UPDATE packet to inform
of the config change.
2013-11-08 10:32:16 +01:00
antirez 1a0cea33a0 Cluster: initialize senderConfigEpoch and senderCurrentEpoch for warnings suppression. 2013-11-05 12:01:07 +01:00
antirez 0c9f60a628 Cluster: there is a lower limit for the handshake timeout. 2013-10-11 10:34:32 +02:00
antirez 1447d28c0f Cluster: data_age conversion to milliseconds fixed. 2013-10-09 16:36:06 +02:00
antirez 573c2fea91 Cluster: clusterCron() freq is now 10h. Still ping 1 node every sec.
After the change in clusterCron() frequency of call, we still want to
ping just one random node every second.
2013-10-09 16:29:17 +02:00
antirez ba42428633 Cluster: time switched from seconds to milliseconds.
All the internal state of cluster involving time is now using mstime_t
and mstime() in order to use milliseconds resolution.

Also the clusterCron() function is called with a 10 hz frequency instead
of 1 hz.

The cluster node_timeout must be also configured in milliseconds by the
user in redis.conf.
2013-10-09 16:19:26 +02:00
antirez 929b6a4480 Cluster: cluster stuff moved from redis.h to cluster.h. 2013-10-09 15:38:05 +02:00
antirez ae2763f564 Cluster: masters don't vote for a slave with stale config.
When a slave requests our vote, the configEpoch he claims for its master
and the set of served slots must be greater or equal to the configEpoch
of the nodes serving these slots in the current configuraiton of the
master granting its vote.

In other terms, masters don't vote for slaves having a stale
configuration for the slots they want to serve.
2013-10-08 12:45:35 +02:00
antirez f7d6ad4366 Cluster: fix slave data age computation when master is still connected. 2013-10-07 16:07:13 +02:00
antirez 2c3301b9f5 Cluster: log message improved when FAIL is cleared from a slave node. 2013-10-07 15:44:58 +02:00
antirez 72f38cd70f Cluster: slave nodes advertise master slots bitmap and configEpoch. 2013-10-07 11:31:12 +02:00
antirez 7afc0dd59a Cluster: new clusterDoBeforeSleep() API.
The new API is able to remember operations to perform before returning
to the event loop, such as checking if there is the failover quorum for
a slave, save and fsync the configuraiton file, and so forth.

Because this operations are performed before returning on the event
loop we are sure that messages that are sent in the same event loop run
will be delivered *after* the configuration is already saved, that is a
requirement sometimes. For instance we want to publish a new epoch only
when it is already stored in nodes.conf in order to avoid returning back
in the logical clock when a node is restarted.

This new API provides a big performance advantage compared to saving and
possibly fsyncing the configuration file multiple times in the same
event loop run, especially in the case of big clusters with tens or
hundreds of nodes.
2013-10-03 09:58:06 +02:00
antirez 211dcbe339 Cluster: update cluster config when slave changes master. 2013-10-02 12:27:12 +02:00
antirez 6c4d904baf Cluster: bus messages stats in CLUSTER info. 2013-10-02 10:10:08 +02:00
antirez abe81781ae Cluster: FAIL messages from unknown senders are handled better.
Previously the event was not logged but instead the node reported an
unknown packet type received.
2013-10-02 09:42:45 +02:00
antirez 7970ebd80a Cluster: senderCurrentEpoch == node currentEpoch was too strict.
We can accept a vote as long as its epoch is >= the epoch at which we
started the voting process. There is no need for it to be exactly the
same.
2013-10-01 17:21:28 +02:00
antirez f1bfd8233b Cluster: fix typo in clusterProcessPacket() comment. 2013-10-01 15:40:20 +02:00
antirez 1dedf9aa36 Cluster: time field removed from cluster messages header.
The new algorithm does not check replies time as checking for the
currentEpoch in the reply ensures that the reply is about the current
election process.
2013-09-30 16:19:44 +02:00
antirez 2d0844ee37 Cluster: log message shortened. 2013-09-30 11:51:58 +02:00
antirez 4dc247eb31 Cluster: detect cluster reconfiguration when master slots drop to 0.
The old algorithm used a PROMOTED flag and explicitly checks about
slave->master convertions. Wit the new cluster meta-data propagation
algorithm we just look at the configEpoch to check if we need to
reconfigure slots, then:

1) If a node is a master but it reaches zero served slots becuase of
reconfiguration.
2) If a node is a slave but the master reaches zero served slots because
of a reconfiguration.

We switch as a replica of the new slots owner.
2013-09-30 11:45:26 +02:00
antirez 62b1591439 Cluster: re-order failover operations to make it safer.
We need to:

1) Increment the configEpoch.
2) Save it to disk and fsync the file.
3) Broadcast the PONG with the new configuration.

If other nodes will receive the updated configuration we need to be sure
to restart with this new config in the event of a crash.
2013-09-30 10:16:48 +02:00
antirez b187517719 Cluster: when upading the configEpoch for a node, save config on disk ASAP. 2013-09-30 10:16:25 +02:00
antirez 03ca903983 Cluster: fsync data when saving the cluster config. 2013-09-30 10:13:07 +02:00
antirez 026e63392e Cluster: update the node configEpoch when newer is detected. 2013-09-27 09:55:41 +02:00
antirez 7c4b8f29e7 Cluster: react faster when a slave wins an election. 2013-09-26 16:54:43 +02:00
antirez 42fa46e49a Cluster: removed an old source of delay to start the slave failover. 2013-09-26 13:28:19 +02:00
antirez a445aa30a0 Cluster: master node now uses new protocol to vote. 2013-09-26 13:00:41 +02:00
antirez fb9b76fe14 Cluster: slave node now uses the new protocol to get elected. 2013-09-26 11:13:17 +02:00
antirez 32b5410af9 Cluster: add currentEpoch to CLUSTER INFO. 2013-09-25 12:38:36 +02:00