Codership Oy
http://www.codership.com
<info@codership.com>
DISCLAIMER
THIS SOFTWARE PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER
EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
IN NO EVENT SHALL CODERSHIP OY BE HELD LIABLE TO ANY PARTY FOR ANY DAMAGES
RESULTING DIRECTLY OR INDIRECTLY FROM THE USE OF THIS SOFTWARE.
Trademark Information.
All trademarks are the property of their respective owners.
Licensing Information.
Galera is copyright (c) 2007-2013 Codership Oy
Please see COPYING file that came with this distribution.
This product uses asio C++ library which is
Copyright (c) 2003-2011 Christopher M. Kohlhoff (chris at kohlhoff dot com)
licensed under Boost Software License.
This product uses aligned buffer implementation from Chromium Project which is
copyright Chromium Authors and licensed under Chromium License
Source code can be found at https://launchpad.net/galera
GALERA v3.x
CONTENTS:
=========
1. WHAT IS GALERA
2. GALERA USE CASES
3. GALERA CONFIGURATION PARAMETERS
4. GALERA ARBITRATOR
5. WRITESET CACHE
6. INCREMENTAL STATE TRANSFER
7. SPECIAL NOTES
1. WHAT IS GALERA
Galera is a synchronous multi-master replication engine that provides its
service through wsrep API (https://github.com/codership/wsrep-API). It features optimistic
transaction execution and commit time replication and certification of
writesets.
Since it replicates only final changes to the database it is transparent to
triggers, stored procedures and non-deterministic functions.
Galera nodes are connected to each other in a N-to-N fashion through a group
communication backend which provides automatic reconfiguration in the event of
a node failure or a new node added to cluster:
,-------. ,-------. ,--------.
| node1 |-----| node2 |<---| client |
`-------' G `-------' `--------'
\ /
,-------. ,--------.
| node3 |<---| client |
`-------' `--------'
Node states are synchronized by replicating transaction changes at commit time.
The cluster is virtually synchronous: this means that each node commits
transactions in exactly the same order, although not necessarily at the same
physical moment. (The latter is not that important as it may seem, since in most
cases DBMS gives no guarantee on when the transaction is actually processed.)
Built-in flow control keeps nodes within fraction of a second from each other,
this is more than enough for most practical purposes.
Main features of a Galera database cluster:
* Truly highly available: no committed transaction is ever lost in case of a
node crash. All nodes always have consistent state.
* True multi-master: all cluster nodes can handle WRITE load concurrently.
* Highly transparent. (See SPECIAL NOTES below)
* Scalable even with WRITE-intensive applications.
* Automatic synchronization of new nodes.
2. GALERA USE CASES
There is a number of ways how Galera replication can be utilized. They can be
categorized in three groups:
1) Seeking High Availability only. In this case client application connects to
only one node, the rest serving as hot backups:
,-------------.
| application |
`-------------'
| | | DB backups
,-------. ,-------. ,-------.
| node1 | | node2 | | node3 |
`-------' `-------' `-------'
<===== cluster nodes =====>
In the case of primary node failure or maintenance shutdown application can
instantly switch to another node without any special failover procedure.
2) Seeking High Availability and improved performance through uniform load
distribution. If there are several client connections to the database, they
can be uniformly distributed between cluster nodes resulting in better
performance. The exact degree of performance improvement depends on
application's load profile. Note, that transaction rollback rate may also
increase.
,-------------.
| clients |
`-------------'
| | | |
,-------------.
| application |
`-------------'
/ | \
,-------. ,-------. ,-------.
| node1 | | node2 | | node3 |
`-------' `-------' `-------'
<===== cluster nodes =====>
In the case of a node failure application can keep on using the remaining
healthy nodes.
In this setup application can also be clustered with a dedicated application
instance per database node, thus achieving HA not only for the database,
but for the whole application stack:
,-------------.
| clients |
`-------------'
// || \\
,------. ,------. ,------.
| app1 | | app2 | | app3 |
`------' `------' `------'
| | |
,-------. ,-------. ,-------.
| node1 | | node2 | | node3 |
`-------' `-------' `-------'
<====== cluster nodes ======>
3) Seeking High Availability and improved performance through smart load
distribution. Uniform load distribution can cause undesirably high rollback
rate. Directing transactions which access the same set of tables to the
same node can considerably improve performance by reducing the number of
rollbacks. Also, if your application can distinguish between read/write and
read-only transactions, the following configuration may be quite efficient:
,---------------------.
| application |
`---------------------'
writes / | reads \ reads
,-------. ,-------. ,-------.
| node1 | | node2 | | node3 |
`-------' `-------' `-------'
<========= cluster nodes =========>
3. GALERA PARAMETERS
3.1 Cluster URL.
Galera can use URL (RFC 3986) syntax for addressing with optional parameters
passed in the URL query part. Galera cluster address looks as follows:
<backend schema>://<cluster address>[?option1=value1[&option2=value2]]
e.g.: gcomm://192.168.0.1:4567?gmcast.listen_addr=0.0.0.0:5678
Currently Galera supports the following backends:
'dummy' - is a bypass backend for debugging/profiling purposes. It does not
connect to or replicate anything and the rest of the URL address string is
ignored.
'gcomm' - is a Codership's own Group Communication backend that provides
Virtual Synchrony quality of service. It uses TCP for membership service and
TCP (and UDP multicast as of version 0.8) for data replication.
Normally one would use just the simplest form of the address URL:
gcomm:// - if one wants to start a new cluster.
gcomm://<address> - if one wants to join an existing cluster. In that case
<address> is the address of one of the cluster members
3.2 Galera parameters.
There is quite a few galera configuration parameters which affect its behavior
and performance. Of particular interest to an end-user are the following:
To configure gcomm listen address:
gmcast.listen_addr
To configure how fast cluster reacts on node failure or connection loss:
evs.suspect_timeout
evs.inactive_timeout
evs.inactive_check_period
evs.keepalive_period
To fine-tune performance (especially in high latency networks):
evs.user_send_window
evs.send_window
To relax or tighten replication flow control:
gcs.fc_limit
gcs.fc_factor
For a full parameter list please see
http://www.codership.com/wiki/doku.php?id=galera_parameters
3.2.1 GMCast parameters group.
All parameters in this group are prefixed by 'gmcast.' (see example above).
group
String denoting group name. Max length of string is 16. Peer nodes
accept GMCast connection only if the group names match. It is set
automatically from wsrep options.
listen_addr
Listening address for GMCast. Address is currently passed in URI format
(for example tcp://192.168.3.1:4567) and it should be passed as the last
configuration parameter in order to avoid confusion. If parameter value
is undefined, GMCast starts listening all interfaces at default port
4567
mcast_addr
Multicast address in dotted decimal notation, enables using multicast to
transmit group communication messages. Defaults to none. Must have
the same value on all nodes.
mcast_port
Port used for UDP multicast messages. Defaults to listen_addr port.
Must have same value on all of the nodes.
mcast_ttl
Time to live for multicast packets. Defaults to 1.
3.2.2 EVS parameter group.
All parameters in this group are prefixed by 'evs.'.
All values for the timeout options below should follow ISO 8601 standard for
the time interval representation
(e.g. 02:01:37.2 -> PT2H1M37.2S == PT121M37.2S == PT7297.2S)
suspect_timeout
This timeout controls how long node can remain silent until it is put
under suspicion. If majority of the current group agree that the node
is under suspicion, it is discarded from group and new group view is
formed immediately. If majority of the group does not agree about
suspicion, <inactive_timeout> is waited until forming of new group will
be attempted. Default value is 5 seconds.
inactive_timeout
This timeout control how long node can remain completely silent until it
is discarded from the group. This is hard limit, unlike <suspect_timeout>,
and the node is discarded even if it becomes live during the formation of
the new group (so it is inclusive of <suspect_timeout>).
Default value is 15 seconds.
inactive_check_period
This period controls how often node liveness is checked. Default is
1 second and there is no need to change this unless <suspect_timeout> or
<inactive_timeout> is adjusted to smaller value. Minimum is 0.1 seconds
and maximum is <suspect_timeout>/2.
keepalive_period
This timeout controls how often keepalive messages are sent into network.
Node liveness is determined with these keepalives, so the value sould be
considerably smaller than <suspect_timeout>. Default value is 1 second,
minimum is 0.1 seconds and maximum is <suspect_timeout>/3.
consensus_timeout
This timeout defines how long forming of new group is attempted.
If there is no consensus after this time has passed since starting of
consensus protocol, every node discards all other nodes from the group
and forming of new group is attempted through singleton groups.
Default value is 30 seconds, minimum is <inactive_timeout> and maximum
is <inactive_timeout>*5.
join_retrans_period
This parameter controls how often join messages are retransmitted
during group formation. There is usually no need to adjust this value.
Default value is 0.3 seconds, minimum is 0.1 seconds and maximum is
<suspect_timeout>/3.
view_forget_timeout
This timeout controls how long information about known group views is
maintained. This information is needed to filter out delayed messages
from previous views that are not live anymore. Default value is
5 minutes and there is usually no need to change it.
debug_log_mask
This mask controls what debug information is printed in the logs if
debug logging is turned on. Mask value is bitwise-OR from values
gcomm::evs::Proto::DebugFlags. By default only state information is
printed.
info_log_mask
This mask controls what info log is printed in the logs.
Mask value is bitwise-or from values gcomm::evs::Proto::InfoFlags.
stats_report_period
This parameter controls how often statistics information is printed in
the log. This parameter has effect only if statistics reporting is
enabled via Conf::EvsInfoLogMask. Default value is 1 minute.
send_window
This parameter controls how many messages protocol layer is allowed
to send without getting all acknowledgements for any of them.
Default value is 32.
user_send_window
Like <send_window>, but for messages which sending is initiated by a
call from the upper layer. Default value is 16.
3.2.3 GCS parameter group
All parameters in this group are prefixed by 'gcs.'.
fc_debug
Post debug statistics about SST flow control every that many writesets.
Default: 0.
fc_factor
Resume replication after recv queue drops below that fraction of
gcs.fc_limit. For fc_master_slave = NO this is limit is also scaled.
Default: 1.0.
fc_limit
Pause replication if recv queue exceeds that many writesets.
Default: 16. For master-slave setups this number can be increased considerably.
fc_master_slave
When this is NO then the effective gcs.fc_limit is multipled by
sqrt( number of cluster members ). Default: NO.
sync_donor
Should we enable flow control in DONOR state the same way as in SYNCED
state. Useful for non-blocking state transfers. Default: NO.
max_packet_size
All writesets exceeding that size will be fragmented. Default: 32616.
max_throttle
How much we can throttle replication rate during state transfer (to avoid
running out of memory). Set it to 0.0 if stopping replication is acceptable
for the sake of completing state transfer. Default: 0.25.
recv_q_hard_limit
Maximum allowed size of recv queue. This should normally be half of
(RAM + swap). If this limit is exceeded, Galera will abort the server.
Default: LLONG_MAX.
recv_q_soft_limit
A fraction of gcs.recv_q_hard_limit after which replication rate will be
throttled. Default: 0.25.
The degree of throttling is a linear function of recv queue size and goes
from 1.0 (“full rate”) at gcs.recv_q_soft_limit to gcs.max_throttle at
gcs.recv_q_hard_limit. Note that “full rate”, as estimated between 0 and
gcs.recv_q_soft_limit is a very approximate estimate of a regular
replication rate.
3.2.4 Replicator parameter group
All parameters in this group are prefixed by 'replicator.'.
commit_order
Whether we should allow Out-Of-Order committing (improves parallel
applying performance). Possible settings:
0 – BYPASS: all commit order monitoring is turned off (useful for
measuring performance penalty)
1 – OOOC: allow out of order committing for all transactions
2 – LOCAL_OOOC: allow out of order committing only for local transactions
3 – NO_OOOC: no out of order committing is allowed (strict total order
committing)
Default: 3.
3.2.5 GCache parameter group
All parameters in this group are prefixed by 'gcache.'.
dir
Directory where GCache should place its files. Default: working directory.
name
Name of the main store file (ring buffer). Default: “galera.cache”.
size
Size of the main store file (ring buffer). This will be preallocated on
startup. Default: 128Mb.
page_size
Size of a page in the page store. The limit on overall page store is free
disk space. Pages are prefixed by “gcache.page”. Default: 128Mb.
keep_pages_size
Total size of the page store pages to keep for caching purposes. If only
page storage is enabled, one page is always present. Default: 0.
mem_size
Size of the malloc() store (read: RAM). For configurations with spare RAM.
Default: 0.
3.2.6 SSL parameters
All parameters in this group are prefixed by 'socket.'.
ssl_cert
Certificate file in PEM format.
ssl_key
A private key for the certificate above, unencrypted, in PEM format.
ssl
A boolean value to disable SSL even if certificate and key are configured.
Default: yes (SSL is enabled if ssl_cert and ssl_key are set)
To generate private key/certificate pair the following command may be used:
$ openssl req -new -x509 -days 365000 -nodes -keyout key.pem -out cert.pem
Using short-living (in most web examples - 1 year) certificates is not
advised as it will lead to complete cluster shutdown when certificate expires.
3.2.7 Incremental State Transfer parameters
All parameters in this group are prefixed by 'ist.'.
recv_addr
Address to receive incremental state transfer at. Setting this parameter
turns on incremental state transfer. IST will use SSL if SSL is configured
as described above. No default.
recv_bind
Address to which the incremental state transfer is bound. This
configuration is optional. When it is not set it will take its value from
recv_addr. It can be useful if the node is running behind a NAT, where the
public address and the internal address differ.
4. GALERA ARBITRATOR
Galera arbitrator found in this package is a small stateless daemon that can
serve as a lightweight cluster member to
* avoid split brain condition in a cluster with an otherwise even membership.
* request consistent state snapshots for backup purposes.
Example usage:
,---------.
| garbd |
`---------'
,---------. | ,---------.
| clients | | | clients |
`---------' | `---------'
\ | /
\ ,---. /
(' `)
( WAN )
(. ,)
/ `---' \
/ \
,---------. ,---------.
| node1 | | node2 |
`---------' `---------'
Data Center 1 Data Center 2
In this example, if one of the data centers loses WAN connection, the node
that sees arbitrator (and therefore sees clients) will continue the operation.
garbd accepts the same Galera options as the regular Galera node. Note that
at the moment garbd needs to see all replication traffic (although it does not
store it anywhere), so placing it in a location with poor connectivity to the
rest of the cluster may lead to cluster performance degradation.
Arbitrator failure does not affect cluster operation and a new instance can
be reattached to cluster at any time. There can be several arbitrators in the
cluster, although practicality of it is questionable.
5. WRITESET CACHE
Starting with version 1.0 Galera stores writesets in a special cache. It's
purpose is to improve control of Galera memory usage and offload writeset
storage to disk. Galera cache has 3 types of stores:
1. A permanent in-memory store, where writesets are allocated by a default
OS memory allocator. It can be useful for systems with spare RAM. It has
a hard size limit. By default it is disabled (size set to 0).
2. A permanent ring-buffer file which is preallocated on disk during cache
initialization. It is intended as the main writeset store. By default
its size is 128Mb.
3. An on-demand page store, which allocates memory-mapped page files during
runtime as necessary. Default page size is 128Mb, but it can be bigger
if it needs to store a bigger writeset. The size of page store is
limited by the free disk space. By default page files are deleted when
not in use, but a limit can be set on the total size of the page files
to keep. When all other stores are disabled, at least one page file is
always present on disk.
Allocation algorithm is as follows: all stores are tried in the above order.
If a store does not have enough space to allocate the writeset, then the next
store is tried. Page store should always succeed unless the writeset is bigger
than the available disk space.
By default Galera cache allocates files in the working directory of the
process, but a dedicated location can be specified. For configuration
parameters see GCache group above (p. 3.2.5).
NOTE: Since all cache files are memory-mapped, the process may appear to use
more memory than it actually does.
6. INCREMENTAL STATE TRANSFER (IST)
Galera 2.x introduces a long awaited functionality: incremental state transfer.
The idea is that if
a) the joining node state UUID is the same as that of the group and
b) all of the writesets that it missed can be found in the donor's Gcache
then instead of whole state snapshot it will receive the missing writesets and
catch up with the group by replaying them.
For example:
- local node state is 5a76ef62-30ec-11e1-0800-dba504cf2aab:197222
- group state is 5a76ef62-30ec-11e1-0800-dba504cf2aab:201913
- if writeset number 197223 is still in the donor's GCache, it will send
writests 197223-201913 to joiner instead of the whole state.
IST can dramatically speed up remerging node into cluster. It also non-blocking
on the donor.
The most important parameter for IST (besides 'ist.recv_addr') is GCache size
on donor. The bigger it is, the more writesets can be stored in it and the
bigger seqno gaps can be closed with IST. On the other hand, if GCache is much
bigger than the state size, serving IST may be less efficient than sending
state snapshot.
7. SPECIAL NOTES
7.1 DEADLOCK ON COMMIT
In multi-master mode transaction commit operation may return a deadlock error.
This is a consequence of writeset certification and is a fundamental property
of Galera. If deadlock on commit cannot be tolerated by application, Galera can
still be used on a condition that all write operations to a given table are
performed on the same node. This still has an advantage over the "traditional"
master-slave replication: write load can still be distributed between nodes and
since replication is synchronous, failover is trivial.
7.2 "SPLIT-BRAIN" CONDITION
Galera cluster is fully distributed and does not use any sort of centralized
arbitrator, thus having no single point of failure. However, like any cluster
of that kind it may fall to a dreaded "split-brain" condition where half of
the cluster nodes suddenly disappear (e.g. due to network failure).
In general case, having no information about the fate of disappeared nodes
remaining nodes cannot continue to process requests and modify their states.
While such situation is generally considered negligibly probable in a
multi-node cluster (normally nodes fail one at a time), in 2-node cluster
a single node failure can lead to this, thus making 3 nodes a minimum
requirement for a highly-available cluster.
Galera arbitrator (see above) can serve as an odd stateless cluster node
to help avoid the possibility of an even cluster split.
|