We’ve been overfed with projections about the long-term growth of the IaaS, PaaS and SaaS markets and about how enterprises are moving their infrastructures to the cloud and building cloud-ready applications. IDC, 451 Research, TBR and other providers of market intelligence and analysis constantly forecast double-digit growth of public and private clouds at the expense of traditional IT spending – and they’re right. From what we’ve seen over the last couple of years, companies of all sizes – from SMEs to global enterprises – are moving their technology to various forms of cloud infrastructure, and this trend is not going to stop any time soon. In fact, it is only a matter of time before the majority of applications sit in the cloud.
Logically, in line with this trend, the Software-Defined Storage (SDS) market – the core technology for storing data in cloud environments – is forecast to grow from $4.72 billion last year to $22.56 billion by 2021, a CAGR of 36.7%. And calling SDS critical is no exaggeration. One only has to Google the statistics from insurance companies showing that some 25–80% of businesses go under after a data disaster. Some of those stats are based on hard numbers, others on loose “researchers found” estimates, but at the end of the day it is clear that data loss or data corruption can be a serious headache for company executives, with a significant financial and business impact on the corporation itself – in some cases an unrecoverable one.
That said, any company moving its data to the cloud should carefully consider the reliability of the providers and of the underlying technology used to store that data. Of course, there are many more angles to take into account – backup and disaster-recovery (DR) procedures, data safety, etc. – and we will cover these in follow-up blog articles.
When we started building our cloud infrastructure four years ago, based mainly on OpenStack technology, there were essentially two options considered reliable enough for the underlying storage stack in a production environment: Ceph, and enterprise-grade storage solutions (both SDS and storage appliances) from vendors such as EMC, IBM, etc. And that has remained the case until today (or has it? We’ll come to that…).
So we began our OpenStack journey with Ceph, mainly for the following reasons:
- Ceph had – and still has – the largest market share among OpenStack storage installations
- It was open-source, which is very important for a company like ours that provides end-to-end solutions – it is not a complete black box, so we can dig deep when designing and debugging instead of relying 100% on other vendors’ support and closed-source code
- It seemed reliable
- It seemed to be much more cost-effective than other options (haha, what a mistake!), which made it the only solution that would fit certain business cases
The truth is that Ceph is reliable and integrates well with OpenStack. You can do a lot to it – remove resources, reboot servers, etc. – and it heals itself (at a significant cost in performance, of course). Until something non-standard happens.

In our case, it was the moment after one incident when recovery became so CPU- and memory-hungry that it crashed the OSD daemons and left behind broken objects – which in turn crashed any OSD that touched them. The result was a storage cluster that was very difficult to recover. And since in Ceph the metadata lives in RADOS, with objects scattered everywhere, and Red Hat wasn’t able to offer anything beyond standard support response times, the only option was tons of scripting and low-level programming, and days to recover the data (a trove too large to fit on standard backup). And of course broken SLAs… (So if you find yourself in a similar situation and don’t know what to do, feel free to get in touch with us – we have a lot of experience earned the hard way 🙂 ) And now, with the additional layer introduced by BlueStore, such a scenario would most probably be even worse.
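If you run into a similar runaway recovery, one mitigation we would try first is throttling recovery and backfill so the OSD daemons are not starved of CPU and memory. A minimal sketch using standard Ceph OSD options – the values below are illustrative starting points from our own tuning, not recommendations:

```ini
# ceph.conf – illustrative recovery throttles (example values only)
[osd]
osd_max_backfills = 1          ; at most one backfill operation per OSD at a time
osd_recovery_max_active = 1    ; at most one active recovery op per OSD
osd_recovery_sleep = 0.5       ; pause between recovery ops, in seconds
```

The same options can also be changed at runtime with `ceph tell 'osd.*' injectargs`, which is usually what you want in the middle of an incident rather than restarting daemons.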
There is also the issue of Ceph’s overall performance. It starts with high latencies and poor write performance (don’t even try to run anything bigger on spindle drives) for some of the applications we needed to support (e.g. huge database systems running in the cloud), and it ends with the fact that Ceph is extremely CPU-hungry, which makes certain hyper-converged scenarios practically impossible and sends the final TCO skyrocketing because of the extra hardware required. At the time we would also have appreciated common features such as compression, deduplication, etc. Red Hat has invested a lot into Ceph over the last couple of years, with great improvements, but from our perspective not enough to make it a system we’d trust to underpin our infrastructure for very-high-performance applications and bigger cloud stacks.
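When we say “high latencies”, what actually hurts database workloads is tail latency, not the average. A minimal, Ceph-agnostic sketch of how we summarize per-operation latency samples collected from a benchmark run, using the nearest-rank percentile method (the sample numbers are made up for illustration):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples (milliseconds)."""
    s = sorted(samples)
    # nearest-rank index, clamped to the valid range
    k = min(len(s) - 1, max(0, math.ceil(p / 100 * len(s)) - 1))
    return s[k]

# Hypothetical per-op write latencies from one benchmark run (ms)
latencies_ms = [0.7, 0.8, 0.9, 0.9, 1.0, 1.0, 1.1, 1.2, 35.0, 40.0]
print("p50:", percentile(latencies_ms, 50))  # the median looks perfectly fine
print("p99:", percentile(latencies_ms, 99))  # the tail tells the real story
```

A cluster whose median write latency is around a millisecond can still stall a large database if the 99th percentile sits in the tens of milliseconds – which is exactly the pattern we kept seeing.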
To be continued.