Fault-tolerant disk arrays are not new. As of this writing, Dell will happily sell you a no-single-point-of-failure MD3000 (SAS) or MD3000i (iSCSI) array with a pair of 146 GB 15K rpm SAS drives for about $4,500. Not bad, eh? Then again, if you are on first-name terms with Linux and have a few spare machines, you can set up a shared-nothing disk cluster for next to nothing.
How can that be? The good people at LINBIT have kindly offered their Distributed Replicated Block Device (DRBD) under the GPL. DRBD is a disk clustering suite that can, in their own words, be seen as a "network-based RAID-1".
About DRBD
DRBD works by inserting a thin layer between the file system (and the buffer cache) and the disk driver. The DRBD kernel module intercepts all requests from the file system and splits them two ways: one to the actual local disk and another to a mirrored disk on a peer node. If the former fails, the file system can be mounted on the peer node and the data remains available.
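The write-splitting idea described above can be pictured with a toy model. This is only a sketch under stated assumptions: the Disk and MirroredDevice classes are hypothetical names invented for illustration, the "disks" are in-memory dictionaries, and nothing here reflects how the DRBD kernel module is actually implemented.

```python
class Disk:
    """A toy block store that maps block numbers to bytes (illustrative only)."""

    def __init__(self):
        self.blocks = {}
        self.failed = False

    def write(self, block_no, data):
        if self.failed:
            raise IOError("disk has failed")
        self.blocks[block_no] = data

    def read(self, block_no):
        if self.failed:
            raise IOError("disk has failed")
        return self.blocks[block_no]


class MirroredDevice:
    """Hypothetical stand-in for the layer between file system and disk driver."""

    def __init__(self, local, peer):
        self.local = local
        self.peer = peer

    def write(self, block_no, data):
        # Every write is split two ways: the local disk and the peer's mirror.
        self.local.write(block_no, data)
        self.peer.write(block_no, data)

    def read(self, block_no):
        # In this toy model, reads are served from the local disk only.
        return self.local.read(block_no)


local, peer = Disk(), Disk()
device = MirroredDevice(local, peer)
device.write(0, b"hello")

# Simulate losing the local disk: the peer still holds an identical copy.
local.failed = True
survivor = peer.read(0)
```

In the real system the peer copy lives on another machine and is reached over the network, which is where the replication protocols come in.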
DRBD works on two nodes at a time: one is given the role of the primary node, the other the role of the secondary. Reads and writes can occur only on the primary node. The secondary node must not mount the file system, not even in read-only mode. This last point requires some clarification. While the secondary node receives all updates made on the primary node, it cannot expose these updates to the file system, because DRBD is completely file system agnostic. That is, DRBD has no explicit knowledge of the file system and as such has no way of communicating the changes upstream to the file system driver. The two-at-a-time rule does not limit DRBD to just two nodes, however. DRBD also supports "stacking", where the higher-level DRBD module appears as a block device to the operating system and forks to a pair of lower-level block devices that are themselves DRBD modules (and so on).
Replication takes place using one of three protocols:
Protocol A queues the data written on the primary node for the secondary node, but does not wait for the secondary node to confirm receipt of the data before acknowledging to its own host that the data has been committed. Those who use NFS may draw a parallel with "async" mode, for this is indeed asynchronous replication. Being asynchronous, it is the fastest of all the replication protocols, but it suffers from a serious drawback: failure of the primary node does not guarantee that all data is available on the secondary device. However, the data on the secondary device is always consistent, that is, it accurately represents the data on the primary disk as of the last synchronization.
Protocol B waits for a response from the secondary host before acknowledging the successful transfer of data to its own host. But the secondary host is not required to commit the replicated changes to stable storage immediately; it can do so some time after confirming receipt of the changes from the primary host. This ensures that if the primary node fails, the data on the secondary node is not only consistent but also fully up to date with the primary node. In the authors' words, this protocol is regarded as "semi-synchronous" replication. It is slightly slower than protocol A, as it exercises the network on every write.
Protocol C not only waits for a response from the secondary host, but also requires the secondary host to commit the updates to stable storage before the primary responds. Because of the I/O overhead, protocol C is significantly slower than protocol B. To return to our NFS analogy, this protocol is fully synchronous replication.
The protocols differ in the integrity of the data replication process and trade speed for safety. Protocol A is the fastest of all, but is not particularly safe. Protocol C provides resilience to failure, but at the cost of latency. LINBIT say that most customers should use protocol C. This is debatable: protocol B is as safe while incurring much less overhead. Protocol B will only fail if both nodes lose power, or go down, at exactly the same time. This scenario should be guarded against by using a UPS and/or redundant power lines. If redundant power is not available, then protocol C is indeed more appropriate.
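The difference between the three protocols comes down to which events must have happened before the primary acknowledges a write to its own host. The sketch below restates that rule as a lookup table. The event names and the can_acknowledge function are made up for illustration and are not DRBD identifiers.

```python
# Events that must be complete before the primary may acknowledge a write
# to its own host. Event names are illustrative, not DRBD identifiers.
REQUIRED_BEFORE_ACK = {
    "A": {"queued_for_peer"},
    "B": {"queued_for_peer", "peer_received"},
    "C": {"queued_for_peer", "peer_received", "peer_on_stable_storage"},
}


def can_acknowledge(protocol, completed_events):
    """Return True if every event the protocol requires has completed."""
    return REQUIRED_BEFORE_ACK[protocol] <= set(completed_events)


# The peer has confirmed receipt but has not yet flushed to disk.
so_far = {"queued_for_peer", "peer_received"}
ack_a = can_acknowledge("A", so_far)
ack_b = can_acknowledge("B", so_far)
ack_c = can_acknowledge("C", so_far)
```

With the peer having received the data but not yet written it to disk, protocols A and B would already acknowledge the write while protocol C would still be waiting, which is exactly the latency the text attributes to it.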
Setting up DRBD
Getting DRBD
DRBD has been built into the Linux kernel since 2.6.33. If you are blessed with an older kernel but are a paying LINBIT customer, you will be provided with a ready-made package for your distribution. Otherwise, the "stock" thing to do is to download a distribution tarball from the DRBD website (or get an updated kernel). The following instructions apply to DRBD version 8.3.8.1.
Building DRBD
You're probably familiar with the famous Linux trio: configure, make, install. This is no different, although you need to specify an additional switch or two to get the build artifacts to go where you want them.
$ ./configure --with-km --sysconfdir=/etc
$ make
# make install
NB: Throughout the DRBD documentation it is stated that the configuration files are searched in the order /etc/drbd-83.conf, /etc/drbd-08.conf, followed by /etc/drbd.conf. But the header file (user/config.h) generated by the configure script points to /usr/local/etc, contrary to all the documentation and the generated man pages. The --sysconfdir switch overrides this behavior. Moreover, according to the source for version 8.3.8.1 (user/drbdadm_main.c), there is another configuration file, drbd-82.conf, that is tried after drbd-83.conf but is omitted from the documentation. Our recommendation to the people at LINBIT: either change this default to point to /etc, or update the documentation accordingly.
Checking the Build
After the build, load the module to check that it was built correctly:
# modprobe drbd
If modprobe fails to load the module, it is possibly because the DRBD module was installed into the wrong directory, one that does not correspond to the running kernel release (not the first thing DRBD gets confused about). You can try searching for the module, like so:
# find /lib/modules -name drbd.ko
If you find the module, copy it to the /lib/modules/`uname -r`/kernel/drivers/block directory. Having done so, register the module:
# depmod -a
Alternatively, go into the drbd subdirectory of the DRBD sources and rebuild (this time forcing the kernel version):
$ make clean
$ make KDIR=/lib/modules/`uname -r`/build
# make install
Then try modprobe drbd again.
Configure DRBD
The layout of the DRBD disk cluster is described in a single configuration file, /etc/drbd.conf. In our example, we will replicate between two virtual machines connected by a single private link. The machines are called 'spark' and 'flare'. On both hosts, the device being replicated is /dev/sda3. The corresponding configuration file is as follows:
global {
  usage-count yes;
}
common {
  protocol C;
}
resource r0 {
  device    /dev/drbd1;
  disk      /dev/sda3;
  meta-disk internal;
  on spark {
    address 192.168.100.10:7789;
  }
  on flare {
    address 192.168.100.20:7789;
  }
}
According to this configuration, protocol C is used across the board. The resource section lists the details of a single resource named r0. (DRBD can have multiple resources configured and running.) The two "on" sections are host-specific configurations for the nodes 'spark' and 'flare'. The device, disk and meta-disk entries are common to both nodes. However, if any of these items were to differ between the two nodes, they would be expected to move down into the "on" sections. The address entries will inevitably differ between the two nodes. It goes without saying that the two addresses must be routable to each other, and appropriate precautions must be taken to ensure that DRBD traffic is allowed through any firewall on the ports named in the address entries.
Configuring Metadata
DRBD needs a dedicated storage area on each node for storing metadata: information about the current state of synchronization between the DRBD nodes.
Metadata can be external, in which case you need to set aside disk space outside the partition that you want to replicate. External metadata can offer the best performance, since you can use a second hard disk on each node to parallelize the I/O.
Metadata can also be internal, that is, inline with the replicated partition. This mode offers lower I/O performance compared to external metadata. It is a little simpler, however, and has the advantage of keeping the metadata close to the real data, should you ever physically relocate the disk. Internal metadata sits at the end of the partition or logical volume (LV) that holds the target file system. To keep the metadata from overwriting the end of the file system, the file system must first be shrunk to make room for the metadata.
In our example, we will work with internal metadata. In either case, the metadata takes up a bit of space on the device; how much depends on the size of the replicated file system. Before determining the size of the metadata, we need to accurately measure the size of the file system to be replicated. When we talk about size, we mean the raw size of the file system, that is, the amount of space it occupies on disk, not the amount of space the file system makes available to applications. The best way to determine the size of the file system is to examine the size of the underlying partition or LV, since file systems tend to use the entire partition/LV. We use the parted utility in our example, replicating /dev/sda3, a 4 GB ext3 partition.
# parted /dev/sda3 unit s print
Model: Unknown (unknown)
Disk /dev/sda3: 8193150s
Sector size (logical/physical): 512B/512B
Partition Table: loop

Number  Start  End       Size      File system  Flags
 1      0s     8193149s  8193150s  ext3
Determine the size of the metadata, in sectors:
given by: ceiling(Size / 2^18) x 8 + 72 = ceiling(8193150 / 262144) x 8 + 72 = 32 x 8 + 72 = 328 (where the ceiling function rounds up to the next whole number)
Note: The observant among you will notice that the actual requirement for the internal metadata block will be less than the number calculated, because shrinking the file system reduces the metadata requirement. However, the difference is negligible, and it is simpler to calculate the size of the metadata block from the original size.
Check the file system for errors (ext2/ext3 file systems):
# e2fsck -f /dev/sda3
Calculate the new size of the file system, making room for the DRBD metadata:
given by: Size - 328 = 8193150 - 328 = 8192822
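The two calculations above are easy to get wrong by hand, so here is a small sketch that reproduces them. It assumes that sizes are in 512-byte sectors and that the divisor in the metadata formula is 2^18; the function names are invented for this example.

```python
FORMULA_DIVISOR = 2 ** 18  # assumed divisor in the metadata formula, in sectors


def internal_metadata_sectors(device_sectors):
    """Metadata size in sectors: ceiling(size / 2^18) * 8 + 72."""
    units = -(-device_sectors // FORMULA_DIVISOR)  # integer ceiling division
    return units * 8 + 72


def shrunk_fs_sectors(device_sectors):
    """File system size that leaves room for the metadata at the end."""
    return device_sectors - internal_metadata_sectors(device_sectors)


partition_sectors = 8193150  # size reported by parted for /dev/sda3
metadata = internal_metadata_sectors(partition_sectors)
new_fs_size = shrunk_fs_sectors(partition_sectors)
print(metadata, new_fs_size)  # 328 8192822
```

The second number is the value to pass to resize2fs with the "s" (sectors) suffix.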
Resize the file system:
# resize2fs /dev/sda3 8192822s
Finally, create the metadata block:
# drbdadm create-md r0
Loading DRBD at Startup
In most cases it is desirable to load the drbd kernel module and enable DRBD replication at startup. DRBD is distributed with a daemon script for just this purpose. (Replace {DRBD_DIR} with the directory where DRBD was unpacked.)
# cp {DRBD_DIR}/scripts/drbd /etc/rc.d/init.d
# chkconfig --add drbd
Enable DRBD
Start the daemon:
# service drbd start
Observe the status of the disks:
# cat /proc/drbd
version: 8.3.8.1 (api:88/proto:86-94)
GIT-hash: 0d8589fcc32c874df57c930ca1691399b55ec893 build by emil@flare, 2010-08-04 20:45:00
 1: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:4096408
The Inconsistent/Inconsistent disk state is expected at this point. It means that the disks have never been synchronized.
Initial Synchronization
The next step is the initial synchronization, which involves completely overwriting the data on one peer's disk with the data from the other peer's disk. You must choose which of the peers holds the correct data, and issue the following command on that peer:
# drbdadm -- --overwrite-data-of-peer primary r0
Once done, on either of the peer nodes:
$ watch "cat /proc/drbd"
You will see a progress bar, similar to the one below:
version: 8.3.8.1 (api:88/proto:86-94)
GIT-hash: 0d8589fcc32c874df57c930ca1691399b55ec893 build by emil@flare, 2010-08-04 20:45:00
 1: cs:SyncTarget ro:Secondary/Primary ds:Inconsistent/UpToDate C r----
    ns:0 nr:24064 dw:24064 dr:0 al:0 bm:1 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:4072344
    [>....................] sync'ed:  0.7% (4072344/4096408)K
    finish: 2:49:40 speed: 324 (320) K/sec
Depending on the size of the file system and the speed of the network, this process may take some time. With a pair of virtual machines and a virtual internal network, it took approximately 3.5 hours to synchronize a 4 GB ext3 file system. That said, you should be able to start using your primary disk right away, without waiting for the synchronization to complete. But do refrain from performing any mission-critical operations on the primary file system until the initial synchronization is complete (even with protocol C).
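The 3.5 hour figure can be sanity-checked from the numbers in the status output: the amount of out-of-sync data and the reported speed. The sketch below does that arithmetic. It assumes both figures use the same "K" unit as printed by /proc/drbd and that the speed stays constant; the function names are made up for this example.

```python
def estimate_sync_seconds(out_of_sync_k, rate_k_per_s):
    """Time to transfer the out-of-sync data at a constant rate."""
    if rate_k_per_s <= 0:
        raise ValueError("rate must be positive")
    return out_of_sync_k / rate_k_per_s


def format_hms(seconds):
    """Render a duration in seconds as h:mm:ss."""
    whole = int(round(seconds))
    hours, remainder = divmod(whole, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


# Figures from the example: 4096408 K to transfer at roughly 324 K/sec.
total_seconds = estimate_sync_seconds(4096408, 324)
print(format_hms(total_seconds))  # 3:30:43
```

The same function can be used to see what a higher synchronization rate would buy you before changing the configuration.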
Mounting the file system
We can now mount the disk on the primary node. But first we must make sure that the node is designated as primary. On the primary node, run the following:
# drbdadm primary r0
Observe the output of cat /proc/drbd, having made a node primary:
version: 8.3.8.1 (api:88/proto:86-94)
GIT-hash: 0d8589fcc32c874df57c930ca1691399b55ec893 build by emil@spark, 2010-08-06 08:01:01
 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
    ns:32768 nr:0 dw:0 dr:32984 al:0 bm:4 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
The output of cat /proc/drbd on the secondary node should be very similar, only the Primary/Secondary roles will appear reversed.
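If you want to check the state from a script rather than by eye, the cs, ro and ds fields of the resource line can be pulled out with a regular expression. The following is a minimal sketch that assumes the DRBD 8.3 style of status line shown in this article; parse_status and is_in_sync are hypothetical helper names, not part of any DRBD tooling.

```python
import re


def parse_status(line):
    """Extract connection state, roles and disk states from a resource line."""
    fields = dict(re.findall(r"(\w+):(\S+)", line))
    local_role, peer_role = fields["ro"].split("/")
    local_disk, peer_disk = fields["ds"].split("/")
    return {
        "connection": fields["cs"],
        "roles": (local_role, peer_role),
        "disks": (local_disk, peer_disk),
    }


def is_in_sync(status):
    """True when the peers are connected and both disks are up to date."""
    return (status["connection"] == "Connected"
            and status["disks"] == ("UpToDate", "UpToDate"))


line = " 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----"
status = parse_status(line)
print(status["roles"], is_in_sync(status))
```

On a live node the line would come from reading /proc/drbd; the sample string here simply stands in for it.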
The order of the HA disk stack (lowest layer first) is as follows:
physical disk partition, LVM (if applicable), DRBD, File System
When mounting the disk, refer to the DRBD block device rather than the actual device (e.g. /dev/sda3). Like real partitions, DRBD devices carry a numeric suffix; ours is /dev/drbd1, as named in the configuration file. For convenience, it is worthwhile adding the following entry to the end of /etc/fstab:
/dev/drbd1 /mnt/drbd1 ext3 noauto 0 0
The noauto option in the mount options column tells the operating system not to mount the device at startup. Otherwise, one of the nodes would always fail when attempting the mount, because only one node can mount the file system at any given time.
Now mount the block device:
# mount /dev/drbd1
NB: Thanks to the entry in /etc/fstab, you do not need to specify a mount point for the mount command.
So there you have it: a highly available disk stack with no single point of failure, for the price of a couple of Linux machines. And all in the time it takes you to drink 17 cups of coffee.
Insights
Gridlock
DRBD integrates fully with Gridlock, the best high availability cluster in the world. Whether you're after a high-performance architecture, shared-nothing high availability, or off-site replication and disaster recovery, Gridlock is up to the challenge.
The problem with using Linux-based (or any operating-system-specific) clustering software is that it is tied to the operating system.
Gridlock, on the other hand, works at the application level and is not tied to the operating system. I think this is the right approach, seeing that many organizations run a mix of Windows and Linux servers; being able to cluster Windows and Linux machines together can be a real asset. It also makes installation and configuration easier, because you do not have specific instructions for a dozen different operating systems and hardware configurations.
The other nice thing about Gridlock is that it lets you achieve quorum without relying on NIC bonding/teaming or multipath configurations. Instead it combines redundant networks at the application level, which works with any network adapter and does not require special network infrastructure.
Split Brain
When running in an active-standby configuration, only one DRBD node may be primary at any given time. Two (or more) nodes simultaneously in the primary state can lead to diverging data sets. In other words, one node would not see the changes made by its peer, and vice versa. This condition is known as split brain. When the drbd daemon starts, it checks for a split-brain condition and, if it finds one, aborts synchronization and writes an error message to /var/log/messages.
The first step in recovering from a split-brain condition is to identify the changes made on both nodes after the split-brain event. If both nodes hold important information that must be merged, it is best to back up one of the nodes (which we will call node A, or the trailing node) and re-synchronize it with the data from the other node (node B, or the leading node). When the re-synchronization is complete, both nodes hold the data set of node B, which is the primary node. Subsequently, demote node B to secondary status and promote node A to primary status. Then hand-merge the changes from the backup into the data set on node A; the changes will propagate to node B.
On the trailing node, back up your data and run the following commands:
# drbdadm secondary r0
# drbdadm -- --discard-my-data connect r0
On the leading node:
# drbdadm connect r0
# drbdadm primary r0
Watch /proc/drbd; the nodes should now be synchronizing.
After synchronization of the nodes, switch roles and manually merge the changes to the new primary node.
Startup Barrier
By default, when a DRBD node starts, it waits for its peer node to start. This prevents a scenario where the cluster starts with only one node and mission-critical data is written without being replicated to the peer's disk. The default timeout is 'unlimited', i.e. a node will wait indefinitely for its peer before proceeding with its startup sequence. However, DRBD does give you the option of skipping the wait. To control the timeout, add a startup section inside the resource section, as follows:
startup {
  wfc-timeout 10;
  degr-wfc-timeout 10;
  outdated-wfc-timeout 10;
}
In this example, we explicitly set the timeout to 10 seconds. The node will then wait that long for its peer, but the absence of the peer will not prevent the node from starting.
Synchronization Options
DRBD's synchronization mechanism was designed for slow computers with slow network connections. This is unfortunate, because the out-of-the-box configuration requires some work to get the most out of even basic hardware. By default, synchronization is limited to around 250 KB/s, which is about 2% of a 100 Mbit/s LAN. While the presence of a throttling feature is good, the default setting is too conservative. In addition, DRBD by default transfers all blocks that it considers to be out of sync. Compare this with the rolling checksums and compression used by tools such as rsync. Even if compression is not an option, DRBD can compare the digest of each block with that of the primary copy, and only transmit the block if the digests differ. Remember, though, that using checksums trades CPU cycles for bandwidth. You can raise the throttle cap and use MD5 checksums for faster synchronization by adding a syncer section to the common section, as follows:
common {
  ...
  syncer {
    rate 5M;
    csum-alg md5;
  }
  ...
}
In the example above, the synchronization rate is capped at 5 MB/s, about 50% of the capacity of a 100Base-T Ethernet fabric once TCP/IP framing overheads have been taken into account. This configuration uses the MD5 algorithm to compute digests of the replicated blocks; the algorithm must be supported by the kernel (most are). The two settings are completely independent: you can set a new throttle without specifying a checksum algorithm, and vice versa.
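To put a throttle setting into perspective, it helps to express it as a share of the link's raw line rate. The sketch below does this under stated assumptions: decimal units (1 KB = 1000 bytes, 1 Mbit = 1,000,000 bits) and no allowance for protocol overheads. The function name is invented for this example.

```python
def share_of_link_percent(rate_kb_per_s, link_mbit_per_s):
    """Sync rate as a percentage of the raw link rate (decimal units assumed)."""
    rate_kbit_per_s = rate_kb_per_s * 8
    link_kbit_per_s = link_mbit_per_s * 1000
    return rate_kbit_per_s * 100 / link_kbit_per_s


default_share = share_of_link_percent(250, 100)   # default throttle
tuned_share = share_of_link_percent(5000, 100)    # 5 MB/s throttle
print(default_share, tuned_share)  # 2.0 40.0
```

Because the calculation ignores framing and protocol overheads, the share of the throughput that is actually usable comes out somewhat higher than the raw figure.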