At the beginning of 2016, our Operations team started investigating other storage options. We had a long list of requirements: a solution that is stable, reliable, fast and predictable, scales both horizontally and vertically, doesn't hog system resources, is easy to manage, integrates with our metrics and monitoring systems, and is extensible and transparent (so no black-box dependency on a single vendor). That is the point where DRBD came into play as a possible underlying technology.
DRBD is a distributed replicated storage system for Linux, implemented as a kernel driver with some tooling around it. It has more than 15 years of development history, hundreds of thousands of production installations worldwide, and a stable company behind it (LINBIT) to continue development and offer support. DRBD was originally written for HA scenarios (2-node clusters basically doing RAID1 over the network), but since version 9 it supports up to 32 replicas per volume, along with other useful features. What DRBD basically does is replicate your block devices over the network and expose a single block device on the output, with very little overhead, so it is extremely quick. Just the rock-solid underlying technology you need for any SDS solution you'd like to build. It is only one small piece of what a cloud SDS solution needs – but the most critical one.
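To give a feel for how thin this replication layer is, here is a minimal Python sketch (not BlackStor code) of the kind of plumbing an orchestrator performs on top of DRBD's standard drbdadm tooling. It assumes a resource named r0 is already defined in /etc/drbd.d/ on the participating nodes.

```python
# A minimal sketch, assuming a DRBD resource "r0" is already configured
# in /etc/drbd.d/ on this node and its peers. Not BlackStor code.
import subprocess

def run(*cmd):
    """Run a command and return its stdout, raising on failure."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

# Initialise the resource's metadata and bring it up on this node.
run("drbdadm", "create-md", "r0")
run("drbdadm", "up", "r0")

# Promote this node to Primary; the backing disk is now exposed as a single
# block device (e.g. /dev/drbd0) while DRBD replicates every write to the
# peers over the network.
run("drbdadm", "primary", "--force", "r0")   # --force is only for the initial sync

print(run("drbdadm", "status", "r0"))
```

Everything above the resulting /dev/drbdX device – placement, policies, cloud integration, monitoring – is what a full SDS solution still has to provide.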
At that time, DRBD also came with an orchestrator called drbdmanage and a Cinder driver providing the link between the orchestrator and OpenStack. Together with LINBIT, we tested the stack and put together a plan to improve all components and implement the new features needed to make DRBD production-ready for OpenStack installations, over the course of three months.
Long story short – after those three months, a huge amount of work had been done and we had verified that the solution is super-fast and extremely reliable (we were not able to break it in a way that would lose data, and even in very unlikely scenarios we were always able to recover the data easily). It was usable in some production scenarios, but not exactly what we were looking for in a highly scaled cloud environment.
And because we were determined to bring to market a new, enterprise-grade SDS for the cloud that administrators would actually like to use, trust and understand, and that wouldn't break the bank, we decided to double down on our investment – more resources, time, money and know-how – and created BlackStor, a true cloud-native Software-Defined Storage solution.
BlackStor is a comprehensive software package that turns any regular pool of servers into a fully functional storage stack for the cloud (currently OpenStack; drivers for other cloud systems are in the works). It is policy driven: you define policies determining data placement, replication strategy, QoS, RPO and so on, and assign them to storage and consumer objects. It is multi-tenant and multi-cloud, so it can serve multiple clouds, each with its own authentication. It scales both horizontally and vertically and is fully cloud-aware, with real bi-directional integrations in which the storage provides detailed feedback to the connected cloud stacks. It comes with the tooling you need, including both a CLI and JSON-based APIs, detailed low-level metrics and monitoring tools, and very soon a complete web-based GUI for management. And it uses DRBD as the underlying replication technology (BlackStor is like a car, DRBD its engine).
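To make the policy-driven idea more concrete, here is a rough sketch of what defining such a policy through a JSON-based API could look like. The endpoint, field names and values below are invented purely for illustration and are not BlackStor's actual API.

```python
# A purely illustrative sketch of a policy definition; the endpoint and
# field names are hypothetical, not BlackStor's actual API.
import json
import urllib.request

policy = {
    "name": "gold",
    "replicas": 3,               # replication strategy: three synchronous copies
    "placement": "rack-aware",   # spread replicas across failure domains
    "qos": {"iops_limit": 10000, "bandwidth_mbps": 500},
    "rpo_seconds": 0,            # synchronous replication, no data-loss window
}

req = urllib.request.Request(
    "https://blackstor.example/api/v1/policies",   # hypothetical endpoint
    data=json.dumps(policy).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode())
```

A policy defined this way is then assigned to storage and consumer objects, so placement, replication and QoS decisions follow the policy instead of per-volume manual tuning.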
It has been about a year and a half since we decided to replace Ceph and invested thousands of man-days into this new solution, and it was one of the best decisions we've ever made. BlackStor has been a standard part of all our cloud deployments since the beginning of 2017, and one of the first users was a client running their complete online business on it, with turnover in the high tens of millions of dollars.
We also offer BlackStor as a standalone product (software kit or pre-installed appliance), so if you're interested, just let us know – you can take it for a spin (or, on SSDs, for a flash).
Also be sure to stay tuned to our blog, where we'll soon publish articles with more detailed technical insights into the system, performance test results, the new roadmap, and a few war stories about challenges we've overcome and lessons learned.