Analytics Lakehouse on GCP — Principles and Building Blocks

Samet Karadag
Google Cloud - Community
6 min read · Jul 13, 2023


An analytics lakehouse is a unified data platform that combines the best of data warehouses and data lakes. It provides a single, scalable, and secure repository for all your data and makes it easy to analyze and visualize that data, regardless of its format or structure. Data lakehouses offer the scalability and flexibility of data lakes with the performance and security of data warehouses.

(Disclaimer: The views expressed in this blog post are solely my own and do not represent the opinions, beliefs, or positions of my employer.)

Evolution of an Analytics Platform

Before getting into the key benefits, principles, and building blocks of an analytics lakehouse, let’s talk a bit about history.

Twenty years ago, business intelligence was established by building centralized repositories for structured data in enterprise data warehouses (EDWs). EDWs offered great analytical query response times, but they had limited capabilities for processing unstructured and semi-structured data, and they were quite expensive. Enterprises then built two-tier architectures with data lakes alongside EDWs to process large amounts of unstructured and semi-structured data and to optimize storage costs. Data lakes are great for dealing with large amounts of data in different formats, but it is difficult to maintain large clusters, run interactive queries, and provide proper data governance on top of them. Fortunately, cloud storage and BigQuery have now made us forget about most of the scalability, maintenance, high availability, and disaster recovery hassles. With mass migrations to the cloud, we are now building and analyzing dozens of times more data, but optimization remains one of our biggest challenges.

Example — Evolution of an analytics platform

Key Benefits of Analytics Lakehouse

An analytics lakehouse removes the border between data warehouses and data lakes, enabling synergy between them and providing a number of benefits:

  • Improved data quality and consistency
  • Reduced data silos
  • Improved data governance
  • Reduced costs
  • Reduced operational challenges
  • Accelerated innovation with faster time to insights
  • Increased data agility
  • Data democratization
  • Better collaboration on data
  • Improved decision-making by providing access to a unified view of data

Key Principles of Analytics Lakehouse

Here are some of the key principles of a modern analytics lakehouse:

  • Flexible — analyze any data with any processing engine
  • Intelligent — a modern analytics lakehouse should provide modern AI development capabilities, making it easy to train and call traditional, custom, and generative AI models
  • Fast — it is a no-brainer that a modern analytics lakehouse should provide scalability and performance for processing large amounts of data, and it needs to do so for interactive, batch, and streaming analytics use cases alike
  • Agile — by making data more accessible and easier to analyze, an analytics lakehouse should help organizations get to insights faster
  • Secure — by providing holistic data governance capabilities, an analytics lakehouse on GCP should help organizations improve data governance, security, and quality
  • Efficient — a modern analytics lakehouse should be cost-effective, offering both on-demand and predictable, reserved capacity with autoscaling
  • Reliable — a modern analytics lakehouse should be reliable and highly available

To achieve these principles, an analytics lakehouse should provide the following key capabilities:

  • Support for different data formats: structured, semi-structured, and unstructured data in a variety of file formats
  • Support for both batch and streaming data processing
  • Support for different data processing engines such as SQL, Spark, Beam, Python, and notebooks
  • Security and data governance features such as fine-grained and role-based access controls, identity and access management, dynamic data masking, lineage, data discovery, profiling, data quality, and encryption (see the sketch after this list)
  • Autoscaling to meet capacity demand when needed
  • The ability to run AI models for classification, regression, prediction, and generation
  • Federation and integration with different data stores
  • Data sharing capabilities to share data with internal and external customers and partners without copying it
  • Multi-regional data storage and automatic failover
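As a minimal sketch of what fine-grained access control can look like in BigQuery SQL (the dataset, table, column, and group below are hypothetical placeholders), a row access policy restricts which rows a principal can see:

    -- Hypothetical names: sales.orders, region, and the group are placeholders.
    CREATE ROW ACCESS POLICY emea_only
    ON sales.orders
    GRANT TO ('group:emea-analysts@example.com')
    FILTER USING (region = 'EMEA');

Dynamic data masking works along similar lines, via policy tags applied to columns, so the same table can serve different audiences without making copies of the data.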

Building Blocks of Analytics Lakehouse

At a high level, an analytics lakehouse has the following essential building blocks:

  • Data lake: A cloud-storage-based data lake is a centralized repository for all your data, both structured and unstructured. It provides a common ingestion and raw layer for your data, which can be used for analytics, machine learning, and other purposes.
  • Data warehouse: A data warehouse is a repository for structured data that is used for reporting and analysis, optimized for query performance. On GCP this is BigQuery, a serverless, highly scalable, and cost-effective cloud data warehouse.
  • Analytics engine: An analytics engine is a software platform that provides the tools and capabilities for analyzing data. It can be used to build and run queries, reports, and visualizations.
  • Machine learning platform: A machine learning platform is a software platform that provides the tools and capabilities for building and deploying machine learning models. It can be used to train and deploy models on data from the data lake or data warehouse.
  • Governance and security: Governance and security are essential for any analytics lakehouse. They ensure that data is accessible to authorized users only and that it is used in a compliant manner.

There are many other components that can be added to a lakehouse, such as data integration and quality tools. The specific components that are needed will vary depending on the specific needs of the organization.

How all these building blocks and capabilities come together within GCP:

Lakehouse Components in GCP

A modern analytics lakehouse is a comprehensive platform; it requires a village of capabilities working in harmony.

GCP offers a wide range of products, seamlessly integrated with each other, to meet these key principles. Key products include:

  • BigQuery: A serverless, highly scalable, and cost-effective cloud data warehouse
  • Google Cloud Storage (GCS): Seamlessly scalable, resilient, and secure storage for files.
  • BigLake: A storage engine that unifies data warehouses and lakes by enabling BigQuery and open source frameworks like Spark to access data with fine-grained access control (see the sketch after this list).
  • Dataplex: A unified lakehouse management platform that helps you manage your data lake in a secure and compliant way, with a centralized place to manage, discover, and govern your data and control access to it.
  • Vertex AI: A unified platform for machine learning that makes it easy to build, train, and deploy machine learning models
  • BigQuery Omni: allows you to analyze data stored in multiple public clouds without having to move the data
  • Cloud Data Catalog: A unified metadata management service
  • Cloud Dataproc: A fully managed and highly scalable service for running Apache Hadoop, Apache Spark, Apache Flink, Presto, and 30+ open source tools and frameworks
  • Spark on BigQuery: You can build Spark stored procedures in BigQuery in Python, Java and Scala.
  • Cloud Dataflow: Unified stream and batch data processing that’s serverless, fast, and cost-effective.
  • Cloud Data Fusion: A fully managed data integration service
  • Cloud Dataproc Metastore: A fully managed Hive metastore, a critical component of data lakes built on open source processing frameworks like Apache Hadoop, Apache Spark, Apache Hive, Trino, Presto, and many others.
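As a hedged sketch of the BigLake idea (the project, connection, dataset, and bucket below are placeholders, and a Cloud resource connection is assumed to already exist), a BigLake table over Parquet files in GCS is created with ordinary BigQuery DDL:

    -- Hypothetical names; assumes a pre-created Cloud resource connection.
    CREATE EXTERNAL TABLE lake.events
    WITH CONNECTION `my-project.us.my-connection`
    OPTIONS (
      format = 'PARQUET',
      uris = ['gs://my-bucket/events/*.parquet']
    );

The same table can then be queried from BigQuery SQL or from Spark, with fine-grained access control enforced on both paths.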

The following capabilities of these products are also worth mentioning; minimal, hedged sketches of several of them follow the list:

  • BigQuery Metadata Cache: Improves query performance for BigLake and object tables
  • BigQuery BI Engine: Accelerates query performance for visualization tools
  • BigQuery ML: Create and execute machine learning models using SQL
  • Serverless Dataproc: Run Spark batch workloads without provisioning and managing your own cluster, and process data stored in BigQuery or GCS with direct access via the Storage API and Spark connectors
  • Spark Procedures: Create Spark stored procedures in BigQuery
  • BigQuery Federated Queries: Query Cloud SQL and Spanner tables from BigQuery
  • BigQuery Search: Search indexes let you use GoogleSQL to easily find unique data elements buried in unstructured text and semi-structured JSON data, without having to know the table schemas in advance
  • Analytics Hub: A data exchange that allows you to efficiently and securely exchange data assets across organizations, addressing challenges of data reliability and cost
  • BigQuery JSON functions: Retrieve and transform JSON data
  • BigQuery ML.GENERATE_TEXT function: Call an LLM (generative AI) directly from BigQuery
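To make a few of these concrete, here are minimal sketches; every dataset, table, column, and connection name in them is a hypothetical placeholder. First, BigQuery ML trains and scores a model entirely in SQL:

    -- Train a logistic regression model on a hypothetical orders table.
    CREATE OR REPLACE MODEL mydata.churn_model
    OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
    SELECT churned, tenure_days, total_spend
    FROM mydata.orders;

    -- Score rows with the trained model.
    SELECT *
    FROM ML.PREDICT(MODEL mydata.churn_model,
      (SELECT tenure_days, total_spend FROM mydata.orders));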
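A Spark stored procedure keeps PySpark code inside BigQuery; this sketch assumes a pre-created Spark connection and a current runtime version:

    CREATE OR REPLACE PROCEDURE mydata.spark_linecount()
    WITH CONNECTION `my-project.us.my-spark-connection`
    OPTIONS (engine = 'SPARK', runtime_version = '1.1')
    LANGUAGE PYTHON AS R"""
    from pyspark.sql import SparkSession

    # Count lines in hypothetical text files on GCS.
    spark = SparkSession.builder.appName("linecount").getOrCreate()
    df = spark.read.text("gs://my-bucket/input/*.txt")
    print(df.count())
    """;

    CALL mydata.spark_linecount();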
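A federated query joins live Cloud SQL data with a BigQuery table through the EXTERNAL_QUERY function (the connection ID and remote schema are placeholders):

    -- The inner SQL runs in Cloud SQL; the join runs in BigQuery.
    SELECT b.customer_id, b.total_spend, c.segment
    FROM mydata.customer_totals AS b
    JOIN EXTERNAL_QUERY(
      'my-project.us.my-cloudsql-connection',
      'SELECT customer_id, segment FROM crm.customers;') AS c
    USING (customer_id);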
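A search index lets the SEARCH function find tokens without schema knowledge:

    -- Build a search index over all columns of a hypothetical events table.
    CREATE SEARCH INDEX events_index ON mydata.events (ALL COLUMNS);

    -- Find rows containing the token anywhere in the row.
    SELECT *
    FROM mydata.events AS t
    WHERE SEARCH(t, 'error-42');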
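JSON functions pull typed values out of semi-structured payloads:

    -- payload is assumed to be a JSON (or JSON string) column.
    SELECT
      JSON_VALUE(payload, '$.user.name') AS user_name,
      SAFE_CAST(JSON_VALUE(payload, '$.attempts') AS INT64) AS attempts
    FROM mydata.raw_events;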
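And ML.GENERATE_TEXT calls a generative model from SQL, assuming a remote model over a Vertex AI endpoint has already been created (model name, table, and parameters below are placeholders):

    SELECT ml_generate_text_result
    FROM ML.GENERATE_TEXT(
      MODEL mydata.text_model,
      (SELECT CONCAT('Summarize this review: ', review_text) AS prompt
       FROM mydata.reviews
       LIMIT 10),
      STRUCT(0.2 AS temperature, 256 AS max_output_tokens));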

Overall, an analytics lakehouse will take your organization’s analytical capabilities to the next level, unlock new possibilities, and get you ahead in understanding and decision-making. And GCP is the perfect place to take this journey. If you want to read more, here are some articles published by great GCP PMs:

Open data lakehouse on Google Cloud

Unify your data assets with an open analytics lakehouse

Complimentary White Paper: Learn how to build an analytics lakehouse on Google Cloud
