
GlusterFS with Multipath Devices

Open-source software-defined storage systems are becoming the next big thing. Instead of buying an expensive storage appliance, you can put together a few bare-metal servers, stack them with disks, and voila: you have a working enterprise-grade storage unit that you can scale as you wish.

Beyond the freedom to scale, these solutions offer lots of features. With a storage appliance, even the most trivial things like NFS require extra licenses. With open-source software-defined storage systems, all the features are unlocked.

Topology

I'm using two physical servers and a single virtual server for my cluster. Each physical server has 70 terabytes of storage connected to it over multipath.

Is there any reason to use GlusterFS over a storage appliance? Usually not, but this was an exception. I needed storage that multiple clients could read from and write to simultaneously, and the storage appliance required extra licenses for that. Instead of buying licenses, I disabled high availability on the appliance and configured it to present plain disks (like JBOD). Multipath provides high availability at the cable level, and GlusterFS covers disk and controller failures. This design ended up being very unorthodox, yet it has served me perfectly for the past two years.

[Diagram: clients read and write data to gluster01p and gluster02p, which replicate data to each other and replicate metadata to glusterA1v.]

Requirements for installation

To create a healthy GlusterFS cluster, there are some recommendations and requirements.

Disk and filesystems

The official documentation advises LVM and XFS for bricks. Neither is mandatory, but both make management easier.

LVM offers better control over how storage devices are used. Having LVM under GlusterFS means:

  • online partition resizing and scaling
  • RAID-like parallel operation capability
  • a single logical volume can span multiple disks
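If you opt for LVM under your bricks, the setup boils down to a few commands. This is a minimal sketch; the multipath device and the volume group and logical volume names are illustrative, not from my setup (my bricks sit directly on the multipath devices).

```shell
# Register the multipath device as a physical volume
pvcreate /dev/mapper/mpatha
# Create a volume group on it
vgcreate vg_brick1 /dev/mapper/mpatha
# Use all free space now; lvextend can grow the LV online later
lvcreate -l 100%FREE -n lv_brick1 vg_brick1
```

The resulting device would then be /dev/vg_brick1/lv_brick1 instead of the raw multipath path.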

GlusterFS can run on any filesystem that supports extended attributes (EA), but the official documentation points toward XFS. XFS is preferred for multiple reasons:

  • XFS journals metadata, which means faster crash recovery
  • the filesystem can be defragmented and expanded while online
  • advanced metadata read-ahead algorithms

https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.1/html/administration_guide/brick_configuration
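Following that guide, formatting a brick is a one-liner. The 512-byte inode size matches what the Red Hat guide recommends for bricks (and what the volume detail output below reports); the device name is illustrative.

```shell
# Format the brick with 512-byte inodes to leave room for
# GlusterFS extended attributes in the inode itself
mkfs.xfs -f -i size=512 /dev/mapper/mpatha
```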

After preparing your partitions, mount them in an orderly fashion. I chose the following hierarchy for my disks:

/dev/mapper/mpathf on /bricks/brick6 type xfs (rw,relatime,attr2,inode64,noquota)
/dev/mapper/mpathb on /bricks/brick2 type xfs (rw,relatime,attr2,inode64,noquota)
/dev/mapper/mpatha on /bricks/brick1 type xfs (rw,relatime,attr2,inode64,noquota)
/dev/mapper/mpathe on /bricks/brick5 type xfs (rw,relatime,attr2,inode64,noquota)
/dev/mapper/mpathc on /bricks/brick3 type xfs (rw,relatime,attr2,inode64,noquota)
/dev/mapper/mpathd on /bricks/brick4 type xfs (rw,relatime,attr2,inode64,noquota)
/dev/mapper/mpathg on /bricks/brick7 type xfs (rw,relatime,attr2,inode64,noquota)

After putting in some test data:

/dev/mapper/mpathf                    10T  8.7G   10T   1% /bricks/brick6
/dev/mapper/mpathb                    10T   19G   10T   1% /bricks/brick2
/dev/mapper/mpatha                    10T   15G   10T   1% /bricks/brick1
/dev/mapper/mpathe                    10T   12G   10T   1% /bricks/brick5
/dev/mapper/mpathc                    10T  9.3G   10T   1% /bricks/brick3
/dev/mapper/mpathd                    10T   19G   10T   1% /bricks/brick4
/dev/mapper/mpathg                    10T   16G   10T   1% /bricks/brick7

GlusterFS works with mounted directories instead of raw disks, so your disks need to be formatted and mounted before you create a GlusterFS volume.
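To make the brick mounts survive a reboot, add them to /etc/fstab. A minimal sketch for one brick (device and mount point follow the hierarchy above):

```shell
# Create the mount point and persist the mount
mkdir -p /bricks/brick1
echo '/dev/mapper/mpatha  /bricks/brick1  xfs  defaults,inode64  0 0' >> /etc/fstab
mount /bricks/brick1
```

Repeat for each multipath device, then verify with `mount | grep bricks`.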

Network Time Protocol

As with almost every cluster-based solution, strict time synchronization is essential, so you need to set up your NTP servers correctly. After configuring your NTP servers, add the following to your ntp.conf:

restrict default kod nomodify notrap nopeer noquery
restrict -6 default kod nomodify notrap nopeer noquery
restrict 127.0.0.1
restrict -6 ::1

After setting the configuration, restart and enable ntpd. More information about NTP is coming soon.

service ntpd restart
chkconfig ntpd on

or

systemctl restart ntpd
systemctl enable ntpd

Hostnames

Add all your servers to your /etc/hosts:

192.168.51.1   gluster01p
192.168.51.2   gluster02p
192.168.51.3   glusterA1v

Installation

Creating the cluster

To form the cluster, run the following on the first server:

gluster peer probe gluster02p
gluster peer probe glusterA1v
gluster peer status
Number of Peers: 2

Hostname: gluster02p
Uuid: d9d055d2-3080-4311-8016-c64091111204
State: Peer in Cluster (Connected)

Hostname: glusterA1v
Uuid: 376d9abc-3d00-11c3-b5f1-fe8a961112ac
State: Peer in Cluster (Connected)

If you would like to dissolve the cluster:

gluster peer detach gluster02p
gluster peer detach glusterA1v

Creating volume

I want my data to be stored on the physical nodes, and I want it replicated; in this configuration the system survives the loss of a single physical device. When storing data this way, we need a third node to decide which node has the latest data in case of failure. This node is called the arbiter node, and it stores only metadata rather than a full copy of the data. I used a virtual machine for this. To create the replicated volume, I ran:

gluster volume create volume01 replica 2 arbiter 1 \
  gluster01p:/bricks/brick1/datafolder gluster02p:/bricks/brick1/datafolder glusterA1v:/bricks/brick1/datafolder \
  gluster01p:/bricks/brick2/datafolder gluster02p:/bricks/brick2/datafolder glusterA1v:/bricks/brick2/datafolder \
  gluster01p:/bricks/brick3/datafolder gluster02p:/bricks/brick3/datafolder glusterA1v:/bricks/brick3/datafolder \
  gluster01p:/bricks/brick4/datafolder gluster02p:/bricks/brick4/datafolder glusterA1v:/bricks/brick4/datafolder \
  gluster01p:/bricks/brick5/datafolder gluster02p:/bricks/brick5/datafolder glusterA1v:/bricks/brick5/datafolder \
  gluster01p:/bricks/brick6/datafolder gluster02p:/bricks/brick6/datafolder glusterA1v:/bricks/brick6/datafolder \
  gluster01p:/bricks/brick7/datafolder gluster02p:/bricks/brick7/datafolder glusterA1v:/bricks/brick7/datafolder

This command creates groups like:

  • gluster01p:/bricks/brickN/datafolder
  • gluster02p:/bricks/brickN/datafolder
  • glusterA1v:/bricks/brickN/datafolder

In this configuration, data is replicated between gluster01p and gluster02p, and glusterA1v serves as the arbiter.

After creating the volume, you can check it with:

[root@gluster01p ~]# gluster volume status volume01   
Status of volume: volume01
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gluster01p:/bricks/brick1/datafolder  49152     0          Y       12094
Brick gluster02p:/bricks/brick1/datafolder  49152     0          Y       21344
Brick glusterA1v:/bricks2/brick1/datafolder 49152     0          Y       114644
Brick gluster01p:/bricks/brick2/datafolder  49153     0          Y       12097
Brick gluster02p:/bricks/brick2/datafolder  49153     0          Y       21363
Brick glusterA1v:/bricks2/brick2/datafolder 49153     0          Y       114653
Brick gluster01p:/bricks/brick3/datafolder  49154     0          Y       12078
Brick gluster02p:/bricks/brick3/datafolder  49154     0          Y       21362
Brick glusterA1v:/bricks2/brick3/datafolder 49154     0          Y       57340
Brick gluster01p:/bricks/brick4/datafolder  49155     0          Y       12079
Brick gluster02p:/bricks/brick4/datafolder  49155     0          Y       21355
Brick glusterA1v:/bricks2/brick4/datafolder 49155     0          Y       57413
Brick gluster01p:/bricks/brick5/datafolder  49156     0          Y       12134
Brick gluster02p:/bricks/brick5/datafolder  49156     0          Y       21398
Brick glusterA1v:/bricks2/brick5/datafolder 49156     0          Y       114662
Brick gluster01p:/bricks/brick6/datafolder  49157     0          Y       12110
Brick gluster02p:/bricks/brick6/datafolder  49157     0          Y       21391
Brick glusterA1v:/bricks2/brick6/datafolder 49157     0          Y       114669
Brick gluster01p:/bricks/brick7/datafolder  49158     0          Y       12121
Brick gluster02p:/bricks/brick7/datafolder  49158     0          Y       21384
Brick glusterA1v:/bricks2/brick7/datafolder 49158     0          Y       57656
Self-heal Daemon on localhost               N/A       N/A        Y       12148
Self-heal Daemon on gluster02p              N/A       N/A        Y       21414
Self-heal Daemon on glusterA1v              N/A       N/A        Y       114729

Task Status of Volume volume01
------------------------------------------------------------------------------
There are no active volume tasks

You can get much more information about your volume with:

[root@gluster01p ~]# gluster volume status volume01 detail
Status of volume: volume01
------------------------------------------------------------------------------
Brick                : Brick gluster01p:/bricks/brick1/datafolder
TCP Port             : 49152               
RDMA Port            : 0                   
Online               : Y                   
Pid                  : 12094               
File System          : xfs                 
Device               : /dev/mapper/mpatha  
Mount Options        : rw,relatime,attr2,inode64,noquota
Inode Size           : 512                 
Disk Space Free      : 10.0TB              
Total Disk Space     : 10.0TB              
Inode Count          : 1073741760          
Free Inodes          : 1073657372        
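For a replicated volume like this, it is also worth checking replication health. These are standard GlusterFS commands (run on any cluster node); files listed in the output are pending self-heal after a node or disk failure.

```shell
# List entries that still need to be healed, per brick
gluster volume heal volume01 info

# Summary counts of pending heals per brick
gluster volume heal volume01 statistics heal-count
```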

Now we have a working volume. Let's use it.

Connecting the clients to the storage

I was working with CentOS 7 clients; the following packages were needed.

sudo yum -y install openssh-server wget fuse fuse-libs openib libibverbs glusterfs glusterfs-fuse glusterfs-rdma

Assuming you have already set up your hosts file, test with the following:

mount -t glusterfs -o backupvolfile-server=gluster02p,use-readdirp=no,volfile-max-fetch-attempts=2 gluster01p:/volume01 /data/

Add the following to the fstab to make the volume permanent:

gluster01p:/volume01 /mnt/glusterstorage glusterfs defaults,backupvolfile-server=gluster02p,use-readdirp=no,_netdev 0 0
df | grep mnt
Filesystem			   Size  Used Avail Use% Mounted on
gluster01p:/volume01   70T  813G   70T   2% /mnt/glusterstorage
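Since simultaneous access from multiple clients was the whole point of this setup, a quick smoke test is to write from one client and read from another. The file name is illustrative:

```shell
# On client A: write through the gluster mount
echo "hello from clientA" > /mnt/glusterstorage/testfile

# On client B: the file should appear with the same content
cat /mnt/glusterstorage/testfile
```

If client B sees the file immediately, the volume is serving both clients as intended.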
