GlusterFS with Multipath Devices
Open-source software-defined storage systems are shaping up to be the next big thing. Instead of buying an expensive storage appliance, you can put together a few bare-metal servers, stack them with disks, and voila: you have a working enterprise-grade storage unit that you can scale as you wish.
Beyond the freedom to scale, these solutions offer plenty of features. With a storage appliance, even the most trivial things like NFS require extra licenses. With open-source software-defined storage, all the features are unlocked.
Topology
I'm using two physical servers and a single virtual server for my cluster. Each physical server has 70 terabytes connected to it over multipath.
Is there any reason to use GlusterFS over a storage appliance? Usually not, but this was an exception. I needed storage I could read from and write to simultaneously from multiple clients, and the storage appliance required extra licenses for that. Instead of buying licenses, I disabled high availability on the appliance and configured it to present raw disks (like JBOD). Multipath provides high availability at the cable level, and GlusterFS covers disk and controller failures. This design ended up being very unorthodox, yet it has served me perfectly for the past two years.
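The mpathN device names you will see below come from the multipath layer. As a point of reference, a minimal /etc/multipath.conf that produces such friendly names might look like the following; treat it as a sketch rather than my exact configuration, since multipath settings depend heavily on the storage backend:
defaults {
    user_friendly_names yes   # name devices mpatha, mpathb, ... instead of WWIDs
    find_multipaths     yes   # only build multipath devices for disks that have multiple paths
}
Running multipath -ll lists the resulting devices and the paths behind each of them.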
Requirements for installation
To create a healthy GlusterFS cluster, there are some recommendations and requirements.
Disk and filesystems
The official documentation advises LVM and XFS for bricks. These are not mandatory, yet they make management much easier.
LVM offers better control over how storage devices are used. Having LVM under GlusterFS means:
- online resizing and scaling of logical volumes
- RAID-like parallel operation capability
- a single logical volume can span multiple disks
GlusterFS can run on any filesystem that supports extended attributes (EA), but the official documentation points toward XFS. XFS is preferred for multiple reasons:
- XFS journals metadata, which means faster crash recovery
- the filesystem can be defragmented and expanded while online
- advanced metadata read-ahead algorithms
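In my case the multipath devices were formatted with XFS directly (as the mount listing below shows), but if you follow the LVM recommendation, preparing one brick might look roughly like this; the device and volume group names are only illustrative:
pvcreate /dev/mapper/mpatha                         # turn the multipath device into an LVM physical volume
vgcreate vg_brick1 /dev/mapper/mpatha               # one volume group per brick keeps failures isolated
lvcreate -l 100%FREE -n lv_brick1 vg_brick1         # use the whole device for the brick
mkfs.xfs -i size=512 /dev/vg_brick1/lv_brick1       # larger inodes leave room for Gluster's extended attributes
mkdir -p /bricks/brick1
mount /dev/vg_brick1/lv_brick1 /bricks/brick1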
After preparing your partitions, you need to mount them in an orderly fashion. I have chosen the following hierarchy for my disks.
/dev/mapper/mpathf on /bricks/brick6 type xfs (rw,relatime,attr2,inode64,noquota)
/dev/mapper/mpathb on /bricks/brick2 type xfs (rw,relatime,attr2,inode64,noquota)
/dev/mapper/mpatha on /bricks/brick1 type xfs (rw,relatime,attr2,inode64,noquota)
/dev/mapper/mpathe on /bricks/brick5 type xfs (rw,relatime,attr2,inode64,noquota)
/dev/mapper/mpathc on /bricks/brick3 type xfs (rw,relatime,attr2,inode64,noquota)
/dev/mapper/mpathd on /bricks/brick4 type xfs (rw,relatime,attr2,inode64,noquota)
/dev/mapper/mpathg on /bricks/brick7 type xfs (rw,relatime,attr2,inode64,noquota)
After loading some test data:
/dev/mapper/mpathf 10T 8.7G 10T 1% /bricks/brick6
/dev/mapper/mpathb 10T 19G 10T 1% /bricks/brick2
/dev/mapper/mpatha 10T 15G 10T 1% /bricks/brick1
/dev/mapper/mpathe 10T 12G 10T 1% /bricks/brick5
/dev/mapper/mpathc 10T 9.3G 10T 1% /bricks/brick3
/dev/mapper/mpathd 10T 19G 10T 1% /bricks/brick4
/dev/mapper/mpathg 10T 16G 10T 1% /bricks/brick7
GlusterFS works with mounted directories (bricks) instead of raw disks, so your disks need to be formatted and mounted before you create a GlusterFS volume.
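To make sure the brick mounts come back after a reboot, they should also go into /etc/fstab. A minimal sketch matching the layout above (in production you would normally reference the filesystems by UUID rather than by device name):
/dev/mapper/mpatha  /bricks/brick1  xfs  defaults,inode64  0 0
/dev/mapper/mpathb  /bricks/brick2  xfs  defaults,inode64  0 0
and so on for mpathc through mpathg.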
Network Time Protocol
As with almost every clustered solution, strict time synchronization is essential, so you need to set up your NTP servers correctly. After configuring your NTP servers, add the following to your ntp.conf:
restrict default kod nomodify notrap nopeer noquery
restrict -6 default kod nomodify notrap nopeer noquery
restrict 127.0.0.1
restrict -6 ::1
After applying the configuration, restart and enable ntpd. More information about NTP will follow in a later post.
service ntpd restart
chkconfig ntpd on
or
systemctl restart ntpd
systemctl enable ntpd
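To verify that the daemon is actually syncing, ntpq can list the configured peers; the currently selected time source is marked with an asterisk:
ntpq -p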
Hostnames
Add all your servers to /etc/hosts on every node:
192.168.51.1 gluster01p
192.168.51.2 gluster02p
192.168.51.3 glusterA1v
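Besides name resolution, the nodes also need to reach each other on the Gluster ports. As a sketch for CentOS 7 with firewalld: glusterd listens on 24007/tcp (24008/tcp for RDMA management) and the bricks get ports starting at 49152, as you can see in the volume status output later on. Adjust the brick port range to the number of bricks you plan to run:
firewall-cmd --permanent --add-port=24007-24008/tcp
firewall-cmd --permanent --add-port=49152-49251/tcp
firewall-cmd --reload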
Installation
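This guide assumes the GlusterFS server packages are already installed and glusterd is running on every node. On CentOS 7 that would look roughly like the following; the centos-release-gluster package pulls in the community Gluster repository:
yum -y install centos-release-gluster
yum -y install glusterfs-server
systemctl enable glusterd
systemctl start glusterd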
Creating the cluster
To form the cluster (the trusted storage pool), run the following on the first server:
gluster peer probe gluster02p
gluster peer probe glusterA1v
gluster peer status
Number of Peers: 2
Hostname: gluster02p
Uuid: d9d055d2-3080-4311-8016-c64091111204
State: Peer in Cluster (Connected)
Hostname: glusterA1v
Uuid: 376d9abc-3d00-11c3-b5f1-fe8a961112ac
State: Peer in Cluster (Connected)
If you would like to dissolve the cluster:
gluster peer detach gluster02p
gluster peer detach glusterA1v
Creating the volume
I want my data to be stored on the physical nodes, and I want it replicated between them. In this configuration the system survives the loss of a single physical node. When storing data this way, we need a third node to decide which copy holds the latest data in case of a failure; this is called the arbiter node, and I used a virtual machine for it. To create the replicated volume I used:
gluster volume create volume01 replica 2 arbiter 1 \
    gluster01p:/bricks/brick1/datafolder gluster02p:/bricks/brick1/datafolder glusterA1v:/bricks/brick1/datafolder \
    gluster01p:/bricks/brick2/datafolder gluster02p:/bricks/brick2/datafolder glusterA1v:/bricks/brick2/datafolder \
    gluster01p:/bricks/brick3/datafolder gluster02p:/bricks/brick3/datafolder glusterA1v:/bricks/brick3/datafolder \
    gluster01p:/bricks/brick4/datafolder gluster02p:/bricks/brick4/datafolder glusterA1v:/bricks/brick4/datafolder \
    gluster01p:/bricks/brick5/datafolder gluster02p:/bricks/brick5/datafolder glusterA1v:/bricks/brick5/datafolder \
    gluster01p:/bricks/brick6/datafolder gluster02p:/bricks/brick6/datafolder glusterA1v:/bricks/brick6/datafolder \
    gluster01p:/bricks/brick7/datafolder gluster02p:/bricks/brick7/datafolder glusterA1v:/bricks/brick7/datafolder
This command creates replica groups like:
- gluster01p:/bricks/brickN/datafolder
- gluster02p:/bricks/brickN/datafolder
- glusterA1v:/bricks/brickN/datafolder
In this configuration, data is replicated between gluster01p and gluster02p, and glusterA1v is used as the arbiter.
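A newly created volume is not active until it is started; on this setup that is a single command:
gluster volume start volume01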
After creating and starting the volume, you can check it with:
[root@gluster01p ~]# gluster volume status volume01
Status of volume: volume01
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick gluster01p:/bricks/brick1/datafolder 49152 0 Y 12094
Brick gluster02p:/bricks/brick1/datafolder 49152 0 Y 21344
Brick glusterA1v:/bricks2/brick1/datafolder 49152 0 Y 114644
Brick gluster01p:/bricks/brick2/datafolder 49153 0 Y 12097
Brick gluster02p:/bricks/brick2/datafolder 49153 0 Y 21363
Brick glusterA1v:/bricks2/brick2/datafolder 49153 0 Y 114653
Brick gluster01p:/bricks/brick3/datafolder 49154 0 Y 12078
Brick gluster02p:/bricks/brick3/datafolder 49154 0 Y 21362
Brick glusterA1v:/bricks2/brick3/datafolder 49154 0 Y 57340
Brick gluster01p:/bricks/brick4/datafolder 49155 0 Y 12079
Brick gluster02p:/bricks/brick4/datafolder 49155 0 Y 21355
Brick glusterA1v:/bricks2/brick4/datafolder 49155 0 Y 57413
Brick gluster01p:/bricks/brick5/datafolder 49156 0 Y 12134
Brick gluster02p:/bricks/brick5/datafolder 49156 0 Y 21398
Brick glusterA1v:/bricks2/brick5/datafolder 49156 0 Y 114662
Brick gluster01p:/bricks/brick6/datafolder 49157 0 Y 12110
Brick gluster02p:/bricks/brick6/datafolder 49157 0 Y 21391
Brick glusterA1v:/bricks2/brick6/datafolder 49157 0 Y 114669
Brick gluster01p:/bricks/brick7/datafolder 49158 0 Y 12121
Brick gluster02p:/bricks/brick7/datafolder 49158 0 Y 21384
Brick glusterA1v:/bricks2/brick7/datafolder 49158 0 Y 57656
Self-heal Daemon on localhost N/A N/A Y 12148
Self-heal Daemon on gluster02p N/A N/A Y 21414
Self-heal Daemon on glusterA1v N/A N/A Y 114729
Task Status of Volume volume01
------------------------------------------------------------------------------
There are no active volume tasks
You can get much more information about your volume with:
[root@gluster01p ~]# gluster volume status volume01 detail
Status of volume: volume01
------------------------------------------------------------------------------
Brick : Brick gluster01p:/bricks/brick1/datafolder
TCP Port : 49152
RDMA Port : 0
Online : Y
Pid : 12094
File System : xfs
Device : /dev/mapper/mpatha
Mount Options : rw,relatime,attr2,inode64,noquota
Inode Size : 512
Disk Space Free : 10.0TB
Total Disk Space : 10.0TB
Inode Count : 1073741760
Free Inodes : 1073657372
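Since this is a replicated volume with an arbiter, it is also worth knowing how to check replication health. The self-heal daemons shown in the status output repair out-of-sync copies, and any entries still pending heal can be listed with:
gluster volume heal volume01 info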
Now we have a working volume. Let's use it.
Connecting the clients to the storage
I was working with CentOS 7 clients; the following packages were needed:
sudo yum -y install openssh-server wget fuse fuse-libs openib libibverbs glusterfs glusterfs-fuse glusterfs-rdma
Assuming you have already set up your hosts file, test the mount with the following command:
mount -t glusterfs -o backupvolfile-server=gluster02p,use-readdirp=no,volfile-max-fetch-attempts=2 gluster01p:/volume01 /data/
Add the following to /etc/fstab to make the mount permanent:
gluster01p:/volume01 /mnt/glusterstorage glusterfs defaults,backupvolfile-server=gluster02p,use-readdirp=no,_netdev 0 0
df | grep mnt
Filesystem Size Used Avail Use% Mounted on
gluster01p:/volume01 70T 813G 70T 2% /mnt/glusterstorage