Commit Graph

2323 Commits

Author SHA1 Message Date
antirez bf670e0745 Cluster: don't update slave's master if we don't know it.
There is no way we can update the slave's node->slaveof pointer if we
don't know the master (no node with such an ID in our tables).
2014-02-10 18:33:34 +01:00
antirez a3755ae9ee Cluster: ignore slot config changes if we are importing it. 2014-02-10 18:04:43 +01:00
antirez 6fc53e16ad Cluster: update configEpoch after manually messing with slots. 2014-02-10 18:01:58 +01:00
antirez be0bb19fd3 Cluster: redis-trib, more info about open slots error. 2014-02-10 17:44:16 +01:00
antirez 1a73c992a3 Cluster: fixed inverted arguments in logging function call. 2014-02-10 17:21:10 +01:00
antirez 32563b4a5f Cluster: clear the FAIL status for masters without slots.
Masters without slots don't participate in the cluster but just do
redirections, so there is no need to keep them in the FAIL state once they
are reachable again.
2014-02-10 17:18:27 +01:00
Matt Stancliff 21648473aa Auto-enter slaveMode when SYNC from redis-cli
If someone asks for SYNC or PSYNC from redis-cli,
automatically enter slaveMode (as if they ran
redis-cli --slave) and continue printing the replication
stream until either they Ctrl-C or the master gets disconnected.
2014-02-10 11:10:31 -05:00
antirez 5b2082ead3 Cluster: replica migration should only work for masters serving slots. 2014-02-10 17:08:37 +01:00
antirez f106a79309 Cluster: redis-trib del-node variable typo fixed. 2014-02-10 16:59:09 +01:00
antirez f885fa8bac Cluster: clusterReadHandler() fixed to work with new message header. 2014-02-10 16:27:37 +01:00
antirez 344a065d51 Cluster: don't propagate PUBLISH two times.
PUBLISH propagated messages via both the Cluster bus and replication when
cluster was enabled, resulting in duplicated messages in the slave.
2014-02-10 16:00:27 +01:00
antirez 7bf7b7350c Cluster: signature changed to "RCmb" (Redis Cluster message bus).
Sounds better after all.
2014-02-10 15:55:21 +01:00
antirez dced9c0619 Cluster: discard bus messages with version != 0. 2014-02-10 15:54:22 +01:00
antirez 007e1c7cb2 Cluster: added signature + version in bus packets. 2014-02-10 15:53:09 +01:00
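The signature and version checks from the three commits above can be sketched as follows. This is a minimal illustration, not the actual cluster bus code: the layout assumed here (a 4-byte signature immediately followed by a 16-bit version in network byte order) and the name `accept_bus_packet` are assumptions, and the real message header carries more fields.

```python
import struct

SIGNATURE = b"RCmb"      # signature named in the commit message
SUPPORTED_VERSION = 0    # packets with any other version are discarded

def accept_bus_packet(packet: bytes) -> bool:
    """Return True only for packets with the expected signature and version.

    Assumed (hypothetical) layout: 4-byte signature, then uint16 version.
    """
    if len(packet) < 6:
        return False
    sig, version = struct.unpack_from("!4sH", packet, 0)
    return sig == SIGNATURE and version == SUPPORTED_VERSION
```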
antirez dca95f241c Cluster: redis-trib: options table entry for add-node fixed. 2014-02-10 12:34:21 +01:00
antirez 6df4ffe639 Don't count time to feed MONITORs in SLOWLOG. 2014-02-07 18:29:20 +01:00
antirez 142281dc79 Cluster: keys slot computation now supports hash tags.
Currently this is marginally useful, only to make sure two keys are in
the same hash slot when the cluster is stable (no rehashing in
progress).

In the future it is possible that support will be added to run
multi-key operations with keys in the same hash slot.
2014-02-07 17:39:01 +01:00
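A minimal sketch of slot computation with hash tags. The details are taken from the published Redis Cluster specification rather than from this commit message, so treat them as assumptions here: 16384 slots, the CRC16 XMODEM variant, and hashing only the text between the first `{` and the next `}` when that text is not empty. The function names are illustrative.

```python
def crc16(data: bytes) -> int:
    """CRC16, XMODEM variant (polynomial 0x1021, initial value 0)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def key_hash_slot(key: bytes) -> int:
    """Map a key to one of 16384 slots, honoring a non-empty {hash tag}."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        # Only a non-empty tag replaces the key for hashing purposes.
        if end != -1 and end != start + 1:
            key = key[start + 1:end]
    return crc16(key) % 16384
```

With this rule, `{user1000}.following` and `{user1000}.followers` land in the same slot because only `user1000` is hashed.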
antirez 2d6eb68993 Sentinel: allow SHUTDOWN command in Sentinel mode. 2014-02-07 11:22:24 +01:00
antirez 970de3e9c0 Check for EAGAIN in sendBulkToSlave().
Sometimes an OS X master with a Linux server over a slow link caused
a strange error where OS X called the writable function for
the socket but apparently there was no room in the socket
buffer to accept the write: the write(2) call returned an EAGAIN error
that was not checked, so we always considered write(2) == 0 as a connection
reset, which was unfortunate since the bulk transfer had to start again.

Also more errors are logged with the WARNING level in the same code path
now.
2014-02-05 16:38:10 +01:00
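The distinction this fix relies on can be sketched as follows: on a non-blocking socket, EAGAIN means "retry later", not "connection lost". This is an illustration of the general technique, not the code of sendBulkToSlave(); the name `send_chunk` is hypothetical.

```python
import socket

def send_chunk(sock: socket.socket, data: bytes):
    """Try to send data on a non-blocking socket.

    Returns the number of bytes sent, or None when the kernel buffer is
    full (EAGAIN/EWOULDBLOCK), in which case the caller should simply wait
    for the next writable event. Any other OSError propagates and can be
    treated as a real failure.
    """
    try:
        return sock.send(data)
    except BlockingIOError:
        return None
```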
antirez 04fe000bf8 Cluster: fixed MF condition in clusterHandleSlaveFailover().
For manual failover we need a manual failover in progress, and that
mf_can_start is true (master offset received and matched).
2014-02-05 16:01:56 +01:00
antirez c6f02fd67a Cluster: CLUSTER FAILOVER replies with OK and logs the event. 2014-02-05 15:52:38 +01:00
antirez c72449af30 Cluster: check that a MF is in progress in manualFailoverCheckTimeout().
Otherwise it is always detected as a manual failover timed out.
2014-02-05 15:45:24 +01:00
antirez b7402bcad5 Cluster: force AUTH ACK on manual failover.
When a slave requests masters vote for a manual failover, the
REQUEST_AUTH message is flagged in a special way in order to force the
masters to give the authorization even if the master is not marked as
failing.
2014-02-05 13:10:03 +01:00
antirez 4cf0cd5719 Cluster: manual failover initial implementation. 2014-02-05 13:01:24 +01:00
antirez 4919a13f50 CLIENT PAUSE and related API implemented.
The API is one of the building blocks of the CLUSTER FAILOVER command that
executes a manual failover in Redis Cluster. However, exposed as a
command that the user can call directly, it also makes it much simpler to
upgrade a standalone Redis instance using a slave in a safer way.

The command works like this:

    CLIENT PAUSE <milliseconds>

All the clients that are not slaves and not in MONITOR state are paused
for the specified number of milliseconds. This means that slaves are
normally served in the meantime.

At the end of the specified amount of time all the clients are unblocked
and will continue operations normally. This command has no effect on
the population of the slow log, since clients are not blocked in the
middle of operations but only when there is new data to process.

Note that while the clients are paused, new commands are still
accepted and queued in the client buffer, so clients will likely not
block while writing to the server while the pause is active.
2014-02-04 16:16:09 +01:00
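The pause rule described above can be sketched as a small predicate. This is only a model of the stated behavior, with hypothetical names (`Client`, `client_is_paused`); it is not the server's internal API.

```python
from dataclasses import dataclass

@dataclass
class Client:
    is_slave: bool = False
    is_monitor: bool = False

def client_is_paused(client: Client, now_ms: int, pause_end_ms: int) -> bool:
    """Return True if new data from this client should not be processed yet."""
    # Slaves and MONITOR clients keep being served during the pause.
    if client.is_slave or client.is_monitor:
        return False
    return now_ms < pause_end_ms
```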
antirez b089ba98cc Scripting: expire keys in scripts only at first access.
Keys expiring in the middle of the execution of Lua scripts can
create inconsistencies in masters and / or AOF files. See the following
example:

    if redis.call("exists",KEYS[1]) == 1
    then
        redis.call("incr","mycounter")
    end

    if redis.call("exists",KEYS[1]) == 1
    then
        return redis.call("incr","mycounter")
    end

The script executes the same *if key exists then increment counter*
logic twice. However the two executions will work differently in the master
and the slaves, provided some unlucky timing happens.

In the master the first time the key may still exist, while the second time
the key may no longer exist. This will result in the counter being incremented
just one time. However as a side effect the master will generate a synthetic
`DEL` command in the replication channel in order to force the slaves to
expire the key (given that key expiration is master-driven).

When the same script runs in the slave, the key will no longer be
there, so the script will not increment the counter at all.

The key idea used to implement the expire-at-first-lookup semantics was
provided by Marc Gravell.
2014-02-03 16:15:53 +01:00
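One way to obtain expire-at-first-lookup semantics is to judge expiry against the time the script started, so every lookup inside the script sees the same answer. The sketch below models that idea under this assumption; the `KeySpace` class and its fields are hypothetical and are not the server's data structures.

```python
class KeySpace:
    def __init__(self):
        self.data = {}               # key -> value
        self.expires = {}            # key -> absolute expire time in ms
        self.script_start_ms = None  # set only while a script is running

    def set(self, key, value, expire_at_ms=None):
        self.data[key] = value
        if expire_at_ms is not None:
            self.expires[key] = expire_at_ms

    def exists(self, key, now_ms):
        # Inside a script the clock is frozen at the script start time.
        if self.script_start_ms is None:
            effective_now = now_ms
        else:
            effective_now = self.script_start_ms
        expire_at = self.expires.get(key)
        if expire_at is not None and effective_now >= expire_at:
            self.data.pop(key, None)
            self.expires.pop(key, None)
            return False
        return key in self.data
```

With the clock frozen, both `exists` checks in the example script agree, so the counter is incremented either twice or not at all.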
antirez b770079f2c Allow CONFIG and SHUTDOWN while in stale-slave state. 2014-02-03 15:51:03 +01:00
antirez 89884e8f6e Scripting: use mstime() and mstime_t for lua_time_start.
server.lua_time_start is expressed in milliseconds. Use mstime_t instead
of long long, and populate it with mstime() instead of ustime()/1000.

Functionally identical but more natural.
2014-02-03 15:45:40 +01:00
antirez 7be946fde2 Option "backlog" renamed "tcp-backlog".
This is especially important since we already have a concept of backlog
(the replication backlog).
2014-01-31 14:56:10 +01:00
Nenad Merdanovic d76aa96d1a Add support for listen(2) backlog definition
In high RPS environments, the default listen backlog is not sufficient, so
giving users the power to configure it is the right approach, especially
since it requires only minor modifications to the code.
2014-01-31 14:52:10 +01:00
antirez a7d30681c9 Cluster: configurable replicas migration barrier.
It is possible to configure the min number of additional working slaves
a master should be left with, for a slave to migrate to an orphaned
master.
2014-01-31 11:26:36 +01:00
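The barrier condition above can be sketched as a single check. Parameter and function names are hypothetical, and the other conditions required for replica migration are left out.

```python
def can_migrate(working_slaves_of_my_master: int, migration_barrier: int) -> bool:
    """Return True if this slave may leave its master for an orphaned one.

    After the migration the master keeps one working slave fewer, and that
    remaining count must not drop below the configured barrier.
    """
    return working_slaves_of_my_master - 1 >= migration_barrier
```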
antirez 3ff1bb4b2e Sentinel: check arity for SENTINEL MASTER command.
This fixes issue #1530.
2014-01-31 10:13:38 +01:00
antirez 6c9359add1 Cluster: perform orphaned masters check before continue statements.
The check was placed in a way that conflicted with the continue
statements used by the node heartbeat code later, which sometimes needs
to skip the current node. Moved to the start of the function so that it is
always executed.
2014-01-30 18:23:31 +01:00
antirez c2507b0ff6 Cluster: replica migration implementation.
This feature allows slaves to migrate to orphaned masters (masters
without working slaves), as long as a set of conditions are met,
including the fact that the migrating slave needs to be in a
master-slaves ring with at least another slave working.
2014-01-30 18:05:11 +01:00
antirez 5b4020fb42 Cluster: swap two code blocks to have a more obvious flow. 2014-01-30 16:34:23 +01:00
antirez 4beaaff8ea Cluster: remove not needed return statement breaking failover. 2014-01-29 17:28:46 +01:00
antirez 3582054982 Cluster: broadcast pong to other slaves in the same ring.
When we schedule a failover, broadcast a PONG to the slaves.
The other slaves that plan to get elected will do the same too, this way
it is likely that every slave will have a good picture of its own rank.

Note that this is N*N messages where N is the number of slaves for the
failing master, however usually even large clusters have many master
nodes but a limited number of replicas per node, so this is harmless.
2014-01-29 17:19:55 +01:00
antirez e2b59621a8 Cluster: log offset when announcing the failover election delay. 2014-01-29 17:16:10 +01:00
antirez 940531e9b7 Cluster: added progressive election delay according to slave rank.
Note that when we compute the initial delay, there is probably still
more up-to-date information to receive from slaves with new offsets, so
the delay is recomputed when new data is available.
2014-01-29 16:53:45 +01:00
antirez 6f54032080 Cluster: function clusterGetSlaveRank() added.
Return the number of slaves for the same master having a better
replication offset than the current slave, that is, the slave "rank" used
to pick a delay before the request for election.
2014-01-29 16:39:04 +01:00
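A minimal sketch of the rank computation and of a delay that grows with the rank. The rank definition follows the commit message; the delay constants (`base_ms`, `per_rank_ms`) and the function names are assumptions chosen only for illustration.

```python
def slave_rank(my_offset: int, sibling_offsets: list) -> int:
    """Number of sibling slaves with a better (greater) replication offset."""
    return sum(1 for offset in sibling_offsets if offset > my_offset)

def election_delay_ms(rank: int, base_ms: int = 500, per_rank_ms: int = 1000) -> int:
    """Hypothetical progressive delay: slaves with a worse rank wait longer."""
    return base_ms + rank * per_rank_ms
```

The slave with the most up-to-date offset gets rank 0 and therefore the shortest delay, which makes it the most likely to win the election.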
antirez 40cd38f0c4 Cluster: update node replication offset from bus packets headers. 2014-01-29 16:01:00 +01:00
antirez 9d4ded7ec6 Cluster: refactoring: new macros to check node flags. 2014-01-29 12:17:16 +01:00
antirez 099bd336db Cluster: use myself instead of server->cluster.myself. 2014-01-29 11:38:14 +01:00
antirez e36bd8b43e Cluster: added a global myself pointer in cluster.c.
Accessing the 'myself' node, the node representing the currently
running instance, is handy without the need to type
server.cluster->myself every time.
2014-01-29 11:22:22 +01:00
antirez f1e09d8c41 Cluster: clusterBroadcastPong() improved with target selection.
Now we can broadcast a pong to all the instances or just the local
slaves (that is useful for replication offset propagation).
2014-01-29 11:08:52 +01:00
antirez befcf6259e Cluster: broadcast master/slave replication offset in bus header. 2014-01-28 16:51:50 +01:00
antirez 8b32bd483a Cluster: limit cluster.h to 80 cols. 2014-01-28 16:34:23 +01:00
antirez 0b1b25c51c Cluster: introduced repl_offset fields in clusterNode.
The two fields are used in order to remember the latest known
replication offset and the time we received it from other slave nodes.

This will be used by slaves in order to start the election procedure
with a delay that is proportional to the rank of the slave among the
other slaves for this master, when sorted for replication offset.

Usually this allows the slave with the most updated offset to win the
election and replace the failing master in the cluster.
2014-01-28 16:28:07 +01:00
antirez 72f1715e45 Fixed inverted if condition in MISCONF error code path. 2014-01-28 10:11:12 +01:00
antirez 23f4e9f0d9 Don't log MONITOR clients as disconnecting slaves. 2014-01-25 11:53:53 +01:00
antirez 40377fa522 Cluster: redis-trib set-timeout implemented. 2014-01-24 15:06:01 +01:00
antirez 0f9422d575 Cluster: update slaves lists in clusterSetMaster(). 2014-01-22 18:46:53 +01:00
antirez 5383ab0bc6 Cluster: CLUSTER SLAVES subcommand added. 2014-01-22 18:38:42 +01:00
antirez 603e480fd5 Cluster: clusterGenNodesDescription() refactored into two functions. 2014-01-22 18:36:12 +01:00
antirez 1cf532dc37 redis-cli --help output improved with --scan and periods. 2014-01-22 12:07:42 +01:00
antirez 994c5b26dd redis-cli: support for --scan option. 2014-01-22 12:04:08 +01:00
antirez 172f14d48c Use fflush() before fsync() in rio.c.
Incremental flushing in rio.c is only used to avoid huge kernel buffers
synched to slow disks creating big latency spikes, so this fix has no
durability implications, however it is certainly more correct to make
sure that the FILE buffers are flushed to the kernel before calling
fsync on the file descriptor.

Thanks to Li Shao Kai for reporting this issue in the Redis mailing
list.
2014-01-22 09:54:55 +01:00
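The ordering described above, sketched outside of rio.c: flush the user-space buffer to the kernel first, then ask the kernel to sync to disk. The function name `write_durably` is hypothetical.

```python
import os

def write_durably(path: str, payload: bytes) -> None:
    """Write payload and make sure it reaches the disk before returning."""
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()             # user-space buffer -> kernel
        os.fsync(f.fileno())  # kernel buffers -> disk
```

Calling `os.fsync` without the preceding `flush` would sync only what the kernel already has, leaving buffered bytes behind.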
antirez 80e80668f4 Cluster: master nodes wait before rejoining the cluster after reboot.
One of the simple heuristics used by Redis Cluster in order to avoid
losing data in the typical failure modes created by the asynchronous
replication with the slaves (a master is unable, when accepting a
write, to immediately tell if it should be really accepted or refused
because of a configuration change), is to wait some time before
rejoining the cluster after being partitioned away from the majority of
instances.

A similar condition happens when a master is restarted. It does not know
if it was already failed over, nor if all the clients already have an
updated configuration about the cluster map, so it is possible that
clients will try to write to stale masters that were restarted.

In a similar way this commit changes masters behavior so they wait
2000 milliseconds before accepting writes after a reboot. There is
nothing special about 2 seconds, other than being a value supposedly
a few orders of magnitude larger than the cluster bus communication
latencies.
2014-01-20 11:52:52 +01:00
antirez e6970e204f Cluster: debug printf statements removed.
These were committed by mistake after being inserted in order to fix an
issue.
2014-01-20 11:19:04 +01:00
antirez e4a5605c9a Cluster: don't rewrite slaveof config directive in cluster mode. 2014-01-20 11:10:42 +01:00
antirez 437fc2cb56 Cluster: fix error reporting when slaveof is found in config. 2014-01-20 11:08:14 +01:00
antirez ac3850cabd Cluster: allow CLUSTER REPLICATE to switch master.
The code was doing checks for slaves that should be done only when the
instance is currently a master. Switching a slave from a master to
another one should just work.
2014-01-17 18:22:35 +01:00
antirez abd6308d27 Set server.repl_down_since to 0 when changing master.
When an instance is potentially set to replicate with another master, it
is conceptually disconnected forever, since we have no old copy of the
dataset for this master in memory.
2014-01-17 18:20:31 +01:00
antirez 36c24bcca0 Cluster: redis-trib shows number of replicas of masters. 2014-01-17 17:56:45 +01:00
antirez 27ed9da383 Cluster: redis-trib help output format modified. 2014-01-17 12:32:49 +01:00
antirez a68c9ba97e Cluster: redis-trib shows what a slave replicates + fixes.
Also the :replicates info field in the node object is now correctly
populated. This also fixes the :replicas field computation.
2014-01-17 12:06:18 +01:00
antirez b451176734 Cluster: redis-trib addnode is now able to add replicas. 2014-01-17 11:48:42 +01:00
antirez 30d9c1dc32 Cluster: fix redis-trib help subcommand. 2014-01-17 10:29:40 +01:00
antirez 17d0c3e85a Cluster: redis-trib delnode implementation. 2014-01-16 18:22:03 +01:00
antirez 3d455393a6 Cluster: don't let a node forget its own master.
redis-trib should make sure to reconfigure the slaves of a node that is
about to be removed from the cluster, so that they replicate with other
nodes, before sending CLUSTER FORGET.
2014-01-16 17:49:35 +01:00
antirez 9531c84807 Cluster: redis-trib help output improved.
Show options if any. Clarify that for some commands any node address is
ok.
2014-01-16 16:23:33 +01:00
antirez 0c373207fa Cluster: don't forget yourself with CLUSTER FORGET. 2014-01-16 09:46:23 +01:00
antirez 3e948970fe Cluster: use the node blacklist in CLUSTER FORGET.
CLUSTER FORGET is not useful if we can't remove a node from all the
nodes of our cluster because of the Gossip protocol that keeps adding
a given node to nodes where we already tried to remove it.

So now CLUSTER FORGET implements a nodes blacklist that is set and
checked by the Gossip section processing function. This way before a
node is re-added at least 60 seconds must elapse since the FORGET
execution.

This means that redis-trib has some time to remove a node from a whole
cluster. It is possible that in the future it will be useful to raise
the 60 sec figure to something bigger.
2014-01-15 16:50:45 +01:00
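A minimal sketch of a node blacklist with a 60 second expiry, as described above. The class and method names are hypothetical; time is passed in explicitly to keep the example deterministic.

```python
BLACKLIST_TTL_SECONDS = 60

class NodeBlacklist:
    def __init__(self):
        self._expires = {}  # node id -> absolute expire time in seconds

    def _cleanup(self, now: float) -> None:
        # Drop entries whose ban period is over.
        expired = [n for n, t in self._expires.items() if t <= now]
        for node_id in expired:
            del self._expires[node_id]

    def add(self, node_id: str, now: float) -> None:
        self._cleanup(now)
        # Store now + TTL, not the TTL alone, or the entry expires at once.
        self._expires[node_id] = now + BLACKLIST_TTL_SECONDS

    def exists(self, node_id: str, now: float) -> bool:
        self._cleanup(now)
        return node_id in self._expires
```

A gossip handler would call `exists` before re-adding a node it hears about, so a forgotten node stays out for the whole ban period.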
antirez ccf268fa17 Cluster: fix clusterBlacklistAddNode() by setting right expire time.
The hash table value should be set to now + 60 seconds otherwise it
expires immediately.
2014-01-15 16:49:31 +01:00
antirez 4e1861155f Cluster: clusterBlacklistAddNode() key lookup fixed.
We can't look up by node->name, which is not an SDS string but a plain C
array in the node structure.
2014-01-15 16:45:07 +01:00
antirez b51be7b34f Cluster: clusterBlacklistExists() requires blacklist cleanup before lookup. 2014-01-15 16:06:54 +01:00
antirez a81340abaf Cluster: set a minimum rejoin delay if node_timeout is too small.
The rejoin delay is usually the node timeout. However if the node
timeout is too small, we set it to 500 milliseconds, a value
chosen to be greater than the RTT / instance latency figures of most
setups, so that communication with other nodes likely happens before
rejoining.
2014-01-15 12:34:33 +01:00
antirez a687cbc19c Cluster: periodically call clusterUpdateState() when cluster is down.
Usually we update the cluster state (to understand if we should accept
queries or reply with an error) only when there is a change in the state
of the nodes. However for the "delayed rejoin" feature to work, that is,
for a master to wait some time before accepting queries again after it
rejoins the majority, we need to periodically update the last time when
the node was partitioned away from the majority.

With this commit if the cluster is down we update the state ten times
per second.
2014-01-15 12:26:12 +01:00
antirez 25ddefdea3 Cluster: range checking in getSlotOrReply() fixed.
See issue #1426 on Github.
2014-01-15 11:33:46 +01:00
antirez fb659cd334 Cluster: ignore empty lines in nodes.conf.
Even without the user messing manually with the file, it is still
possible to have blank lines (just a single "\n" per line) because of
how the nodes.conf update/write process works.
2014-01-15 11:23:41 +01:00
antirez 6c63df3031 Cluster: atomic update of nodes.conf file.
The way the file was generated was unsafe and led to nodes.conf file
corruption (zero length file) on server stop/crash during the creation
of the file.

The previous file update method was as simple as open with O_TRUNC
followed by the write call. While the write call was a single one with
the full payload, ensuring no half-written files under POSIX semantics,
stopping the server just after the open call resulted in a zero-length
file (all the nodes information lost!).
2014-01-15 10:31:20 +01:00
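The commit message does not say how the update was made atomic. One commonly used technique, shown here only as an illustration and not necessarily what this commit does, is to write a temporary file in the same directory and rename it over the target. The function name `atomic_write` is hypothetical.

```python
import os
import tempfile

def atomic_write(path: str, payload: bytes) -> None:
    """Replace the file at path so readers see the old or the new content."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
        # The rename is the commit point: the target is never truncated.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A crash before `os.replace` leaves the previous file untouched, which avoids the zero-length file described above.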
antirez 28273394cb Cluster: support to read from slave nodes.
A client can enter a special cluster read-only mode using the READONLY
command: if the client reads from a slave instance after this command,
for slots that are actually served by the instance's master, the queries
will be processed without redirection, allowing clients to read from
slaves (but without any kind of read-after-write guarantee).

The READWRITE command can be used in order to exit the readonly state.
2014-01-14 16:33:16 +01:00
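The routing decision described above can be sketched as follows. All names are hypothetical and nodes are represented by plain strings; this models the stated rule, not the server's redirection code.

```python
def route_query(slot_owner: str, myself: str, my_master,
                client_readonly: bool, is_write: bool) -> str:
    """Return "serve" or "redirect" for a query hitting this node."""
    if slot_owner == myself:
        return "serve"
    # A slave may serve reads for slots owned by its own master, but only
    # for clients that entered read-only mode with READONLY.
    if (client_readonly and not is_write
            and my_master is not None and slot_owner == my_master):
        return "serve"
    return "redirect"
```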
antirez aacbba2607 Fix typo in aofRewriteBufferAppend() comment. 2014-01-14 15:37:49 +01:00
antirez 5189485625 Set REDIS_AOF_REWRITE_MIN_SIZE to 64mb.
64mb is the default value in redis.conf. For some reason instead the
hard-coded default was 1mb that is too small.
2014-01-14 11:27:28 +01:00
antirez d5763dceaf SENTINEL SET master quorum implemented. 2014-01-14 09:23:26 +01:00
antirez fe86f890b0 SENTINEL SET: error on bad option name + flush config on error. 2014-01-13 11:55:57 +01:00
antirez f822516e43 SENTINEL SET implemented.
The new command allows changing master-specific configurations
at runtime. All the settable parameters can be retrieved via the
SENTINEL MASTER command, so there is no equivalent "GET" command.
2014-01-13 11:53:29 +01:00
antirez 3cdcaff069 Sentinel: fix wrong arity error message. 2014-01-13 11:05:13 +01:00
antirez 964f6b17e9 Sentinel: SENTINEL REMOVE command added.
The command totally removes a monitored master.
2014-01-10 15:39:36 +01:00
antirez cf2835519e Sentinel: releaseSentinelRedisInstance() top comment fixed.
The claim about unlinking the instance from the connected hash tables
was the opposite of the reality. Also the current actual behavior is
safer in most cases, so it is better to manually unlink when needed.
2014-01-10 15:33:42 +01:00
antirez 9d0f46c6f5 Sentinel: flush config on disk when new master is added. 2014-01-10 15:22:06 +01:00
antirez d4f296bc1d anetResolveIP() prototype added to anet.h. 2014-01-10 15:18:41 +01:00
antirez 39f9f449b0 Sentinel: SENTINEL MONITOR command implemented.
It allows adding new masters to monitor at runtime.
2014-01-10 15:18:24 +01:00
antirez 774f0bd45e anetResolveIP() added to anet.c.
The new function is used when we want to normalize an IP address without
performing a DNS lookup if the string to resolve is not a valid IP.

This is useful every time only IPs are valid inputs or when we want to
skip DNS resolution that is slow during runtime operations if we are
required to block.
2014-01-10 15:02:39 +01:00
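The same idea, accepting only numeric addresses and never touching DNS, can be sketched with the standard resolver flag `AI_NUMERICHOST`. This illustrates the technique and is not the code of anetResolveIP(); the function name `resolve_ip` is hypothetical.

```python
import socket

def resolve_ip(text: str):
    """Return the normalized IP string, or None if text is not an IP.

    AI_NUMERICHOST makes getaddrinfo fail instead of performing a DNS
    lookup when the input is a hostname.
    """
    try:
        infos = socket.getaddrinfo(text, None, flags=socket.AI_NUMERICHOST)
    except socket.gaierror:
        return None
    return infos[0][4][0]
```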
antirez c42e4bd0b6 Sentinel: added SENTINEL MASTER <name> command.
With SENTINEL MASTERS it was already possible to list all the configured
masters, but not a specific one.
2014-01-10 14:41:52 +01:00
antirez 2bb9cd464e Add all the configurable fields to addReplySentinelRedisInstance().
Note: the auth password with the master is voluntarily not exposed.
2014-01-10 14:31:41 +01:00
antirez 5a7d04ee7b Trim comment to 80 cols in SentinelCommand(). 2014-01-10 14:13:04 +01:00
antirez 58c8a071a5 Fix RESTORE ttl handling in 32 bit archs.
long was used instead of long long, which is needed in order to handle
a 64 bit resolution millisecond timestamp.

This fixes issue #1483.
2014-01-09 11:09:23 +01:00
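Why a 32 bit `long` is not enough can be shown by forcing a millisecond timestamp through a 32 bit signed integer. This is only an illustration of the truncation; the function name and the sample timestamp are arbitrary.

```python
import ctypes

def store_in_32bit_long(value: int) -> int:
    """Simulate storing value in a 32 bit signed integer (bits are dropped)."""
    return ctypes.c_int32(value).value

timestamp_ms = 1389262163000             # a millisecond Unix time from 2014
truncated = store_in_32bit_long(timestamp_ms)
fits = truncated == timestamp_ms          # False: the value needs more than 32 bits
```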
antirez e1ab2991c3 Fix keyspace events flags-to-string conversion.
Fixes issue #1491 on Github.
2014-01-08 17:18:34 +01:00
antirez 90a81b4ebb Don't send REPLCONF ACK to old masters.
Masters not understanding REPLCONF ACK will reply with errors to our
requests causing a number of possible issues.

This commit detects a global replication offset set to -1 at the end of
the replication, and marks the client representing the master with the
REDIS_PRE_PSYNC flag.

Note that this flag was called REDIS_PRE_PSYNC_SLAVE but now it is just
REDIS_PRE_PSYNC as it is used for both slaves and masters starting with
this commit.

This commit fixes issue #1488.
2014-01-08 14:28:16 +01:00