Cloud Databases and Storage

Steve Sarsfield
3 min read · Mar 22, 2021

How data location in the cloud has a big bearing on how much you’ll pay


Cloud databases vary in the way they manage data storage. Before deciding on Snowflake, Redshift, Vertica, or one of the many other cloud databases on the market, it's essential to understand the solution's capability to use data outside the database itself — data stored in separate S3 buckets in formats like JSON, Parquet, and CSV.

Traditionally, data warehouses are used when ACID compliance, predictability, and reproducibility are required. Data warehouses offer carefully managed metadata, which is held to standards set by the data governance team. They are often augmented by data lakes, which give up some of those ACID characteristics for scalability and low cost. Data lakes offer the opportunity to use low-cost storage, like Hadoop or object stores like Amazon S3, to perform analytics at very large scale. Metadata is more loosely managed in a data lake, and the data within might be less structured than data in the data warehouse.

A closed cloud system

The above description of data lakes and data warehouses is accurate, except when your cloud-only database supports only one of these methods. Some cloud analytics solutions require that all data be loaded into a specific structured format in the data warehouse and offer no data lakes. They lack polyglot persistence — a term popularized by Martin Fowler to describe using more than one storage technology and format side by side.

While most solutions let you load data from, say, Parquet into the database, few will let you analyze it in place. This has a couple of negative consequences for your budget and analytical timeline. First, you'll have to spend time loading the data into the database, and in analytics, time is money. Second, once the data gets there, you'll spend your IT budget as it takes up an extended holiday in the database's premium storage.
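The difference between the two approaches can be sketched in miniature with the Python standard library. This is an illustrative toy, not any vendor's API: the "external" file is a small in-memory CSV, the in-place path streams it directly, and the load-then-query path copies it into a SQLite table first — the extra step the article says costs time and money.

```python
import csv
import io
import sqlite3

# A tiny "external" dataset, as it might sit in an S3 bucket as CSV.
raw = "region,sales\neast,100\nwest,250\neast,50\n"

# Approach 1: analyze in place -- stream the file, no load step.
in_place_total = sum(
    int(row["sales"]) for row in csv.DictReader(io.StringIO(raw))
)

# Approach 2: load into a database first, then query.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, sales INTEGER)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(r["region"], int(r["sales"])) for r in csv.DictReader(io.StringIO(raw))],
)
loaded_total = con.execute("SELECT SUM(sales) FROM sales").fetchone()[0]

print(in_place_total, loaded_total)  # same answer, 400, either way
```

Both paths produce the same total; the second simply pays for an extra copy of the data along the way, which is exactly the cost that in-place query engines avoid.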

Since all corporate analytical workloads are unique, you may find that the process of storing all data in one database type is flawed. The data science team likely has different expectations for performance than, say, users of an executive dashboard. IT teams often want to set up a tiering architecture, which means paying a premium when the service level agreement (SLA) requires fast response and getting a discount when the workload has a less critical SLA.

Data can exist in many file systems and many formats. Commonly, they include:

· File Systems: Apache Hadoop HDFS, Cloud or On-premises Object Store, Unix NFS

· File Formats: Text, CSV, JSON, ORC, Parquet, and so on.

Choosing a solution that can leverage these file systems and formats is crucial for aligning costs with your analytical needs.
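To make the format differences concrete, here is a small sketch that serializes the same two records in two of the text formats listed above, using only the Python standard library. (Columnar formats like Parquet and ORC require third-party libraries such as pyarrow, so they are omitted here.)

```python
import csv
import io
import json

# The same two records, serialized in two common data lake formats.
records = [
    {"id": 1, "city": "Boston"},
    {"id": 2, "city": "Austin"},
]

# JSON Lines: one self-describing object per row.
json_lines = "\n".join(json.dumps(r) for r in records)

# CSV: a header row, then positional values.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "city"])
writer.writeheader()
writer.writerows(records)
csv_text = buf.getvalue()

print(json_lines)
print(csv_text)
```

Note that the JSON rows repeat the field names on every record while the CSV carries them once in a header — one reason the right format, like the right storage tier, depends on the workload.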

Choose Well

When choosing your cloud database vendor, explore how it can bring together a data lake’s scale and economics with the predictability and reproducibility of a data warehouse. Consider how well your solution understands the difference between workloads. Can it keep your hot data hot in optimized database-native formats and your cold data cold in other formats like JSON and Parquet, or is the only option to store all data in the database format?

I write more about the other technical gotchas with cloud databases in my whitepaper entitled Avoiding Cloud Database Pitfalls and Lock-ins: The critical capabilities of your next cloud analytics solution.


Independent voice on database and analytics. Formerly from AnzoGraph, IBM, Talend, and Vertica. Author of the book The Data Governance Imperative.