This page details the current Distributed File System (GlusterFS), as well as the previous system (Ceph). Ceph is resource-intensive when many server and client daemons must run on the same machine, and it performs automatic replica rebalancing when a node goes down; GlusterFS simply waits for the node to come back up to restore replication.
Installation and configuration are straightforward (though it is easier to run as root, as every command otherwise requires sudo). On each machine, each of the 6 drives was given a single GPT partition, formatted with XFS and an inode size of 512 bytes. The Gluster volume (gv0) was created to replicate data across all 3 machines and distribute across the disks, hence:
sudo gluster volume create gv0 replica 3 transport tcp bg-angel:
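The command above is truncated; a full invocation would look something like the following sketch, where the brick paths are assumptions and should be replaced with the actual mount points of the XFS partitions:

```shell
# Hypothetical full form (brick paths are assumptions): 3 servers x 6 disks = 18 bricks.
# With "replica 3", each consecutive group of 3 bricks forms one replica set,
# so bricks are ordered such that each set spans all three hosts.
sudo gluster volume create gv0 replica 3 transport tcp \
  bg-angel:/data/brick1/gv0 bg-beast:/data/brick1/gv0 bg-cyclops:/data/brick1/gv0 \
  bg-angel:/data/brick2/gv0 bg-beast:/data/brick2/gv0 bg-cyclops:/data/brick2/gv0
  # ... and so on for the remaining four disks, then:
sudo gluster volume start gv0
```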
The following was manually set for extra safety/maintenance:
sudo gluster volume bitrot gv0 enable
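Once bit-rot detection is enabled, the scrubber's progress can be checked from any of the servers:

```shell
# Show bit-rot scrub status for the volume (scrubbed files, errors, etc.)
sudo gluster volume bitrot gv0 scrub status
```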
To access the volume, the Gluster Native Client is used; each machine mounts the volume from itself via /etc/fstab:
<hostname>:/gv0 /mnt/glusterfs glusterfs defaults,_netdev 0 0
Due to network outages, it may be necessary to heal the volumes on the machines.
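Healing can be triggered and monitored with the standard Gluster self-heal commands:

```shell
# Trigger a self-heal of files that need it
sudo gluster volume heal gv0
# List files still pending heal (empty output per brick means healthy)
sudo gluster volume heal gv0 info
# Check for split-brain entries
sudo gluster volume heal gv0 info split-brain
```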
Note that if one machine loses a connection to the others, it becomes read-only to prevent split-brain when the connection is restored.
To ensure that Gluster is mounted properly and that the Samba container starts only afterwards, the following command was added to the end of /etc/rc.local (before “exit 0”):
( sleep 60; mount -a; docker start samba ) &
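The fixed 60-second sleep is a heuristic; an untested variant that polls for the mount point instead (same placement in /etc/rc.local) would be:

```shell
# Retry the mount until /mnt/glusterfs is actually a mount point,
# then start the Samba container; give up after ~5 minutes.
(
  for i in $(seq 1 30); do
    mountpoint -q /mnt/glusterfs && break
    mount -a
    sleep 10
  done
  docker start samba
) &
```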
For a good overview of Ceph, see Ceph Intro and Architectural Overview (half-hour video). We were running the Hammer release of Ceph.
The standard instructions for ceph-deploy were followed, with bg-cyclops chosen as the “admin” node. First was the preflight checklist. The ceph user was created with BICV’s default password. Before setting up passwordless SSH, the user was switched with “su - ceph”.
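Passwordless SSH from the admin node can be set up as in the ceph-deploy preflight guide (hostnames as used elsewhere on this page):

```shell
# Run as the ceph user on the admin node (bg-cyclops):
# generate a key without a passphrase, then copy it to every node.
ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa
for host in bg-angel bg-beast bg-cyclops; do
  ssh-copy-id ceph@"$host"
done
```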
Second was the storage cluster. The config files and keys are stored in /home/ceph/ceph. Make sure that all ceph-deploy commands are run within this folder! Note that when /etc/hosts had the private network IPs for the machines, Ceph failed to initialise (as it expects addresses to be from the public network). The final step to check Ceph’s health fails due to the low number of placement groups, so the default pool (rbd) was adjusted to have the right number of pgs:
ceph osd pool set rbd pg_num 2048
ceph osd pool set rbd pgp_num 2048
Beyond “Create A Cluster”, a metadata server was created on each server. According to the early adopters article, this is the most stable setup, with 2 of the 3 metadata servers in standby, taking over as necessary. As per the setup for CephFS, the cephfs_data and cephfs_metadata pools were created. The file system is called “cephfs”. The admin.secret file was created and stored in /home/ceph on all servers. Despite the usage of ceph-deploy, the mount command does not support CephFS out of the box – this is solved by installing the “ceph-fs-common” package via apt-get. The CephFS is to be mounted on each server at /mnt/ceph.
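Assuming the admin key has been extracted into admin.secret as above, the mount on each server would look something like the following (the monitor host and port are illustrative; any monitor address from ceph.conf works):

```shell
# Mount CephFS with the kernel client (mount.ceph comes from ceph-fs-common)
sudo mkdir -p /mnt/ceph
sudo mount -t ceph bg-cyclops:6789:/ /mnt/ceph \
  -o name=admin,secretfile=/home/ceph/admin.secret
```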
Due to the high number of OSDs per node, if the cluster network fails then Ceph gets confused. The solution has been added to the ceph.conf.
[global]
## Initial settings
# Cluster ID
fsid = b0cc4b3a-5423-4cad-8472-a8a2aec137e7
# Initial members
mon_initial_members = bg-angel, bg-beast, bg-cyclops
mon_host = 220.127.116.11,18.104.22.168,22.214.171.124
# Authentication
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
# XFS Settings
filestore_xattr_use_omap = true

## Custom settings
# Network settings
public network = 126.96.36.199/21
cluster network = 10.0.0.0/24
# Replication settings
osd pool default size = 2      # 2-way replication
osd pool default min size = 1  # Minimum number of replicas in degraded mode
# Placement group settings
osd pool default pg num = 2048
osd pool default pgp num = 2048
mon pg warn max per osd = 0    # Suppress "too many PGs per OSD" warning

[mon]
osd min down reports = 7    # OSD down reports must involve more than one node
osd min down reporters = 7
To check the health of Ceph, from any node run:
ceph health
To check the OSD config, run:
ceph osd tree
Further details on adding/removing OSDs can be found in the documentation.
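As a sketch of the documented removal procedure, taking a failed OSD out of the cluster (osd.5 is a hypothetical ID) involves:

```shell
# Mark the OSD out so data rebalances off it, then remove it fully
ceph osd out 5
sudo stop ceph-osd id=5   # on the host holding the OSD (Upstart, Hammer-era Ubuntu)
ceph osd crush remove osd.5
ceph auth del osd.5
ceph osd rm 5
```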
To deploy a new configuration, from /home/ceph/ceph in bg-cyclops, run:
ceph-deploy --overwrite-conf config push bg-angel bg-beast bg-cyclops