Should you move your Big Data and Data Lake to the Cloud?
Should you follow the trend and migrate your data, workflows and infrastructure to GCP, AWS or Azure? During the Strata Data Conference in New York, a general focus was put on moving customers’ Big Data solutions to the Cloud. Let’s see what was addressed in each vendor’s talks, and what to consider before making the move.
Before diving into cloud vendors’ promises and their pitfalls, I’ll share some context on why this article was written.
Since Hortonworks and Cloudera’s merger, maybe even a little earlier, traditional on-premise Big Data has been on a downhill trend. Companies started their lakes and analytics platforms and, around 2015, the hype was at its highest. Most actors then realized the operational costs that come with distributed technologies, and started searching for ways to alleviate them to focus on their core business instead.
Cloud services started their hype cycle as early as Big Data did, but were at first primarily built for on-demand computing and simple highly available applications (e.g. websites). While on-premise Big Data, especially Hadoop and Data Lakes, went through the “trough of disillusionment” stage mainly because of the aforementioned operational costs, Cloud providers started adding managed Big Data services to their catalog.
These two factors, the on-premise downfall due to infrastructure complexity and the appearance of managed cloud services, led to the current trend. The Strata Data Conference overflowed with on-premise-to-cloud migration talks and other cloud-based feature presentations, while traditional Big Data was hardly covered at all. In fact, the only Apache Hadoop talk I attended opened with a slide titled “is Hadoop dead”.
So, what can Cloud services bring to your organisation? Let me spoil some of your local cloud sales rep’s key bullet points.
Infrastructure elasticity: scale your infrastructure up and down to fit your needs. A new data source feeds your lake and requires additional storage? No problem, S3/GC Storage/ADLS has you covered. A new compute-intensive use case appeared? Spawn some Compute Engine instances/EC2/Azure VMs in a matter of seconds. For even more flexibility, use a fully managed Kubernetes cluster. Focus only on what matters, and leave the rest to your Cloud provider.
Maintenance and upgrades: always work with the latest, most efficient infrastructure and the newest up-to-date software. No downtime required for hardware maintenance or software updates. Security patches are applied as soon as they are released, and new features are regularly added to the catalog. Focus only on what matters, and leave the rest to your Cloud provider.
Expert support: every service is offered ready-to-use, but if you need help getting the most out of them, call on specifically trained experts to support you. Just like with the platforms, Cloud providers have an on-demand team ready to assist you at any moment of the day, night or weekend. And as always, pay only for what you use. Focus only on what matters, and leave the rest to your Cloud provider.
Third party support: most specialized tech product leaders are already integrated with the platforms, and the others are on their way. Looking for a unified analytics platform? All your appliances unified on one platform. Focus only on what matters, and leave the third-party support handling to your third-party solution provider.
Operational tools: deploy, coordinate, monitor, and automate your infrastructure and applications through a single CLI. For disaster recovery, choose a multi-region plan. To schedule tasks and jobs, use Azure Scheduler/Google Cloud Scheduler/AWS Batch. Whatever operation is not a feature yet can be achieved through aws2 scripts. Focus only on what matters, and make what isn’t available as easy as possible.
While I may have been a bit overdramatic in the delivery, these are some of the reasons you will be given to migrate from your historical on-premise solution to the cloud.
During my professional life, I have been asked to deploy or migrate Big Data infrastructures to the Cloud. Some of the benefits listed above held true, but others did not go as planned. While I mainly worked with Microsoft Azure, I believe the following statements hold true for other cloud providers as well.
Infrastructure elasticity: sure, if you use managed services. Managed services are by definition managed by the cloud provider, and they are expected to be able to scale your virtual infrastructure up or down based on your current load. But not every use case is fit for managed services, and other offerings often do not handle automatic scaling, or even scaling at all. For example, first Hadoop migrations tend to use a lift-and-shift approach, using the cloud provider purely as an IaaS. Storage and compute are directly attached to VMs, and while scaling up might be easier since there are no hardware concerns, scaling down is as complex as on-premise. Another example: most Big Data platforms I have encountered have a fairly linear compute workload, and reserved infrastructure tends to cost less than on-demand. It is thus often more cost-effective to use dedicated reserved infrastructure than its managed counterpart.
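As a rough illustration of that last point, the break-even between on-demand and reserved pricing for a steady workload can be sketched as below. The hourly rates are made up for the example, not actual cloud prices:

```python
# Hypothetical hourly rates -- NOT real AWS/Azure/GCP pricing.
ON_DEMAND_RATE = 0.40   # $/hour, pay-as-you-go VM
RESERVED_RATE = 0.25    # $/hour equivalent, 1-year reserved commitment

HOURS_PER_MONTH = 730

def monthly_cost(utilization: float):
    """Return (on_demand, reserved) monthly cost for a given utilization.

    On-demand is billed only for the hours actually used; a reservation
    is paid for the whole month, busy or idle.
    """
    on_demand = ON_DEMAND_RATE * HOURS_PER_MONTH * utilization
    reserved = RESERVED_RATE * HOURS_PER_MONTH  # paid even when idle
    return on_demand, reserved

# A fairly linear Big Data workload keeps the cluster busy most of the time:
on_demand, reserved = monthly_cost(utilization=0.90)
print(f"on-demand: ${on_demand:.2f}, reserved: ${reserved:.2f}")
# → on-demand: $262.80, reserved: $182.50
```

With these rates the reservation wins as soon as utilization exceeds 0.25/0.40 = 62.5%, which a linear batch workload easily does; elasticity only pays off for spiky loads.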
Maintenance and upgrades: this is mostly true for infrastructure. Most organisations I have worked with have dedicated teams for these tasks, and they often trail behind the hardware and software versions available in the cloud. Keep in mind, however, that cloud-powered software is locked to a few versions, and its end of support is sometimes more restrictive than the native vendor’s. One of my customers had infrastructure support rejected because of some custom-built images.
Expert support: vendor support is nothing more than it was on-premise. You still need an internal team of experts to administer whatever cloud service you use. On complex multi-purpose platforms like these, there is no such thing as a single all-capable expert. Even in the cloud.
Third party support: the specialised vendors that have already integrated their offers with the major cloud providers are still few, and those available on all three of them are even fewer. But the trend is definitely there, and the statement may become true in the near future. For now, however, check your appliances and their integrations before switching to the cloud, as they may be a deciding factor.
Operational tools: the most advanced part of the cloud stack. From their very beginning, when DevOps was the primary usage of cloud services, vendors have been perfecting their operational catalog. Now that more complex platforms and distributed architectures are looking to migrate to the cloud, these tools have been adapted to fit their needs too. While everything is not ready yet, most administrative tasks can be done through one of their services. The unified CLI, for Azure at least, is however not as complete as it should be, and you will often have to fall back to each service’s individual CLI.
In hindsight, the cloud is actually a good prospect for your IT hosting, but not for the most advertised reasons. Infrastructure cost and ease of maintenance and provisioning, coupled with a one-tool-for-every-usage vibe, are the most common reasons I have heard for organizations looking to the cloud. These are only achieved with careful preparation and a good cloud-admin team, for some use cases only, and will probably end up just as good as the legacy on-premise solution. The stack of operational tools, however, is an instant benefit, almost fully ready, and a feature that cannot be found on heterogeneous on-premise architectures.
There are some other aspects of cloud to take into account when considering to move a whole Big Data platform to one of the big three solutions.
One misconception is that tools supported by cloud providers are designed to work with each other, unlike individual open-source projects such as the ones from the Apache Software Foundation. Know that a lot of cloud services are actually implementations of these open-source projects. Google’s Dataproc Component Gateway is in fact Apache Knox, and many AWS services are based on Apache open-source projects. Apache YARN and Kubernetes are the main orchestration tools for your distributed tasks, Apache Spark is the primary compute engine, and you may find other common on-premise tools like Apache Kafka behind your cloud features. All in all, cloud services are not guaranteed to work with each other. Plan what features you are going to use before migrating, and check their compatibility with each other.
Cloud services are not always compatible among themselves, let alone with external components. Cross-cloud tasks are by design an anti-pattern. To leverage multi-provider services, it is recommended to have a governance solution on top and to respect Cloud locality. Be also careful when integrating on-premise services with cloud ones. One real-world example I can share is Azure’s AD sync tool between a local AD and its managed counterpart in Azure: it is for now missing features and restrictive in its usage, despite being a Microsoft product on both ends. Internal security is particularly affected by cloud and on-premise compatibility. Most providers now have bring-your-own-key features, but make sure your security requirements are actually supported by your cloud vendor.
While on-premise comes with several restrictions, cloud services also introduce new issues. For example, S3’s consistency model means that a read request might not always return the latest state of the data. Additional design has to be implemented for read consistency to be achieved, such as writing a transaction log file with locked write privileges. As S3 is the primary storage solution on AWS, third-party tools are sometimes necessary to cover some use cases.
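The transaction-log idea can be sketched in a few lines. This is a toy model, not a production implementation: the object store is simulated with an in-memory dict whose writes only become visible after an explicit sync, and the commit log stands in for a strongly consistent service (or a locked log file) that tells readers which version is authoritative, so a stale read is detected instead of silently returned:

```python
class EventualStore:
    """Toy object store: writes become visible to readers only after sync()."""
    def __init__(self):
        self._visible = {}
        self._pending = {}
    def put(self, key, value):
        self._pending[key] = value       # written, but not yet replicated
    def get(self, key):
        return self._visible.get(key)    # may lag behind recent writes
    def sync(self):                      # simulates replication catching up
        self._visible.update(self._pending)
        self._pending.clear()

class TxLogStore:
    """Wraps the store with a strongly consistent commit log of versions."""
    def __init__(self, store):
        self.store = store
        self.log = {}                    # key -> last committed version
        self._versions = {}
    def put(self, key, value):
        version = self._versions.get(key, 0) + 1
        self._versions[key] = version
        self.store.put(f"{key}@{version}", value)
        self.log[key] = version          # commit only after the write is issued
    def get(self, key):
        version = self.log.get(key)
        if version is None:
            return None
        value = self.store.get(f"{key}@{version}")
        if value is None:
            # The log says this version exists but the store hasn't caught up:
            # the caller retries instead of reading stale data.
            raise RuntimeError("committed object not yet visible; retry")
        return value
```

Reading through the log turns a silent stale read into an explicit “retry” signal, which is exactly what tools layering a commit log over S3 rely on.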
As illustrated by the previous point, using cloud services implies that you are locked into a vendor’s solutions. Let’s say, for example, you chose AWS for S3 and EMR. You want a messaging system to feed your EMR jobs? Use either SQS/SNS, Kinesis (AWS’s answer to Apache Kafka), or maybe even ElastiCache (managed Redis). There is no flexibility to use RabbitMQ or ZeroMQ natively. And AWS still has the most complete stack, so you might run into these restrictions even more often on Azure and Google Cloud.
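The usual way to soften this lock-in is to hide the managed service behind a minimal interface of your own, so that moving from, say, SQS to Kafka means writing one new adapter instead of touching every job. A sketch, where the in-memory queue stands in for a real vendor adapter:

```python
from abc import ABC, abstractmethod
from collections import deque
from typing import Optional

class MessageQueue(ABC):
    """The only messaging API your jobs are allowed to see."""
    @abstractmethod
    def send(self, message: str) -> None: ...
    @abstractmethod
    def receive(self) -> Optional[str]: ...

class InMemoryQueue(MessageQueue):
    """Test double; a real deployment would implement these same two
    methods with the vendor SDK (e.g. boto3 for SQS) or a Kafka client."""
    def __init__(self):
        self._messages = deque()
    def send(self, message: str) -> None:
        self._messages.append(message)
    def receive(self) -> Optional[str]:
        return self._messages.popleft() if self._messages else None

def feed_job(queue: MessageQueue, records: list) -> None:
    """Job code depends only on the interface, never on the vendor SDK."""
    for record in records:
        queue.send(record)
```

The trade-off is that you are limited to the lowest common denominator of the services you might target, but for simple produce/consume pipelines that is usually enough.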
The Strata Data Conference in New York might have been more business-oriented, but it provided an interesting view of the ongoing trend in the Big Data landscape. So, should you migrate your Big Data to the cloud?
First, it is a change that will happen eventually. Just like Big Data, cloud technologies are quite complex, and it will take time to adapt your technical teams and end users to them. It is best to start early with low-importance projects to get the hang of it before more business-critical ones appear.
Second, heterogeneous Big Data platforms with a mix of various vendor solutions have created a complex ecosystem. Giving some order to this environment by placing it on top of a single operational entity is a good thing.
Lastly, some Big Data use cases are better served in the cloud. The separation of compute and storage offers a more flexible way to handle one-time tasks, like initial loads. There are other unique designs that the cloud enables.
Cloud services are, however, not suited to every use case. Security-critical ones, for example, should be kept on-premise. Also, cloud features are not all ready yet, and their compatibility with each other and with external components is still extremely lacking.
All in all, I would suggest trying out some Cloud solutions with a few selected use cases to learn the technologies, and keeping your Data Lake on-premise for the time being.