File Systems

Introduction

This page details the current Distributed File System (GlusterFS), as well as the previous system (Ceph). Ceph is resource-intensive (many server and client daemons have to run on the same machines) and performs automatic replica rebalancing when a node goes down; GlusterFS simply waits for the node to come back up and then heals the missing replicas.

GlusterFS

Installation and configuration are straightforward (though it is easier to work as root, since every command otherwise requires sudo). On each machine, each of the 6 drives was given a single GPT partition, formatted with XFS and an inode size of 512 bytes. The Gluster volume (gv0) was created to replicate data over all 3 machines and distribute it over the disks, hence:

sudo gluster volume create gv0 replica 3 transport tcp \
    bg-angel:/export/sda1/brick bg-beast:/export/sda1/brick bg-cyclops:/export/sda1/brick \
    bg-angel:/export/sdb1/brick bg-beast:/export/sdb1/brick bg-cyclops:/export/sdb1/brick \
    bg-angel:/export/sdd1/brick bg-beast:/export/sdd1/brick bg-cyclops:/export/sdd1/brick \
    bg-angel:/export/sde1/brick bg-beast:/export/sde1/brick bg-cyclops:/export/sde1/brick \
    bg-angel:/export/sdf1/brick bg-beast:/export/sdf1/brick bg-cyclops:/export/sdf1/brick \
    bg-angel:/export/sdg1/brick bg-beast:/export/sdg1/brick bg-cyclops:/export/sdg1/brick
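
For reference, the per-drive preparation described above corresponds roughly to the following sketch, shown for a single hypothetical drive (/dev/sda) on one machine; the peer probe commands, run once from any one of the machines (e.g. bg-angel), join the three machines into the trusted pool before the volume is created:

# Partition, format and mount one drive (repeated for each data drive)
sudo parted --script /dev/sda mklabel gpt mkpart primary xfs 0% 100%
sudo mkfs.xfs -i size=512 /dev/sda1
sudo mkdir -p /export/sda1
echo '/dev/sda1 /export/sda1 xfs defaults 0 0' | sudo tee -a /etc/fstab
sudo mount /export/sda1
sudo mkdir -p /export/sda1/brick
# Join the machines into the trusted pool (run once, from one machine)
sudo gluster peer probe bg-beast
sudo gluster peer probe bg-cyclops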

The following was manually set for extra safety/maintenance:

sudo gluster volume bitrot gv0 enable
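
If desired, the bitrot scrubber's progress can then be checked from time to time with (an optional check, not part of the setup):

sudo gluster volume bitrot gv0 scrub status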

To access the volume, the Gluster Native Client is used. Specifically, each machine mounts the volume from itself, using its own hostname as the server in /etc/fstab:

<hostname>:/gv0 /mnt/glusterfs glusterfs defaults,_netdev 0 0
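
With that entry in place, the volume can be mounted straight away (without a reboot) and checked, e.g.:

sudo mount /mnt/glusterfs
df -h /mnt/glusterfs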

After a network outage, it may be necessary to heal the volume on the machines so that the replicas are brought back in sync.
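
Healing is triggered and monitored with the standard Gluster commands, e.g.:

sudo gluster volume heal gv0
sudo gluster volume heal gv0 info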

Note that if one machine loses a connection to the others, it becomes read-only to prevent split-brain when the connection is restored.
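
Connectivity between the machines can be checked with, e.g.:

sudo gluster peer status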

To make sure Gluster is mounted properly, and that the Samba container only starts after the mount, the following command was added to the end of /etc/rc.local (before “exit 0”):

( sleep 60; mount -a; docker start samba ) &

Ceph

Setup

For a good overview of Ceph, see Ceph Intro and Architectural Overview (1/2 hour video). We were running the Hammer release of Ceph.

The standard instructions for ceph-deploy were followed, with bg-cyclops chosen as the “admin” node. First was the preflight checklist. The ceph user was created with BICV’s default password. Before setting up passwordless SSH, the user was switched using “su - ceph”.
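
A rough sketch of those preflight steps, following the ceph-deploy preflight documentation (the actual password is omitted here):

# On every node:
sudo useradd -d /home/ceph -m ceph
sudo passwd ceph
echo "ceph ALL = (root) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/ceph
sudo chmod 0440 /etc/sudoers.d/ceph
# On bg-cyclops, as the ceph user:
ssh-keygen
ssh-copy-id ceph@bg-angel
ssh-copy-id ceph@bg-beast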

Second was the storage cluster. The config files and keys are stored in /home/ceph/ceph; make sure that all ceph-deploy commands are run from within this folder! Note that when /etc/hosts contained the private network IPs for the machines, Ceph failed to initialise (it expects addresses from the public network). The final step, checking Ceph’s health, fails due to the low number of placement groups, so the default pool (rbd) was adjusted to have an appropriate number of PGs:

ceph osd pool set rbd pg_num 2048
ceph osd pool set rbd pgp_num 2048
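
For completeness, the storage-cluster steps themselves (run from /home/ceph/ceph on bg-cyclops) roughly correspond to the sketch below; release flags for the install step are omitted, and the OSD line is only an example, repeated for each data disk on each machine:

ceph-deploy new bg-angel bg-beast bg-cyclops
ceph-deploy install bg-angel bg-beast bg-cyclops
ceph-deploy mon create-initial
ceph-deploy osd create bg-angel:sdb   # example disk; repeated per disk and per machine
ceph-deploy admin bg-angel bg-beast bg-cyclops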

Beyond “Create A Cluster”, a metadata server was created on each server. According to the early adopters article, this is the most stable setup, with 2 of the 3 metadata servers remaining on standby and taking over as necessary. As per the CephFS setup instructions, the cephfs_data and cephfs_metadata pools were created; the file system is called “cephfs”. The admin.secret file was created and stored in /home/ceph on all servers. Despite the use of ceph-deploy, the mount command does not support CephFS by default; this is solved by installing the “ceph-fs-common” package via apt-get. CephFS is mounted on each server at /mnt/ceph.
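
A sketch of those CephFS steps (the PG counts and the monitor used in the mount are illustrative; any of the three monitors works):

ceph-deploy mds create bg-angel bg-beast bg-cyclops
ceph osd pool create cephfs_data 2048
ceph osd pool create cephfs_metadata 2048
ceph fs new cephfs cephfs_metadata cephfs_data
sudo apt-get install ceph-fs-common
sudo mount -t ceph bg-cyclops:6789:/ /mnt/ceph -o name=admin,secretfile=/home/ceph/admin.secret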

Due to the high number of OSDs per node, a failure of the cluster network can cause the OSDs on one node to erroneously report the OSDs on the other nodes as down. The solution (requiring down reports from more than one node) has been added to the ceph.conf shown below.

ceph.conf

[global]

## Initial settings

# Cluster ID
fsid = b0cc4b3a-5423-4cad-8472-a8a2aec137e7

# Initial members
mon_initial_members = bg-angel, bg-beast, bg-cyclops
mon_host = 155.198.98.61,155.198.98.205,155.198.98.108

# Authentication
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx

# XFS Settings
filestore_xattr_use_omap = true

## Custom settings

# Network settings
public network = 155.198.96.0/21
cluster network = 10.0.0.0/24

# Replication settings
osd pool default size = 2 # 2-way replication
osd pool default min size = 1 # Minimum number of replicas in degraded mode

# Placement group settings
osd pool default pg num = 2048
osd pool default pgp num = 2048
mon pg warn max per osd = 0 # Suppress "too many PGs per OSD" warning

[mon]
mon osd min down reports = 7 # OSD down reports must involve more than one node
mon osd min down reporters = 7

Management

To check the health of Ceph, from any node run:

ceph -s

To check the OSD config, run:

ceph osd tree

Further details on adding/removing OSDs can be found in the documentation.
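
As a quick reference, removing a failed OSD (id N, hypothetical) broadly follows this sequence once its daemon has been stopped on its host; see the documentation for the full procedure:

ceph osd out N
ceph osd crush remove osd.N
ceph auth del osd.N
ceph osd rm N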

To deploy a new configuration, from /home/ceph/ceph on bg-cyclops, run:

ceph-deploy --overwrite-conf config push bg-angel bg-beast bg-cyclops
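
The pushed settings only take effect once the daemons re-read the configuration, so the Ceph services on each node may need restarting afterwards, e.g. on Upstart-based systems:

sudo restart ceph-all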