As businesses apply machine learning across more areas of their operations to make informed decisions, we see them facing four significant challenges:

  1. Increasing agility to launch machine learning capabilities more quickly to meet a range of business needs, from inventory optimization to fraud detection.
  2. Minimizing the cost of delivering business intelligence and machine learning in a hybrid cloud or multi-cloud environment.
  3. Quickly and easily scaling data management and analytic capabilities as workloads change.
  4. Seamlessly switching from one hyperscaler to another to take advantage of the best balance between cost and performance, while meeting business requirements.

To address these challenges, businesses need to combine the best of data warehouses and data lakes into a new data architecture. The “lakehouse architecture” is gaining support from leading hyperscalers, working in partnership with their technology and service providers.

Lakehouses evolved to overcome the limits of data delivery platforms such as data warehouses and data lakes, which are often too expensive to maintain and cannot support the types and amount of data required by today’s machine learning systems.

More + Different Data = Storage and Management Challenges 

Enterprises deployed data marts in the 1970s and 1980s to provide easier, cost-effective access to data that had previously been siloed in individual applications. A data mart might, for example, be used to combine information from a customer relationship management (CRM) system with product purchase records from an accounting system in order to track a customer’s lifetime value to the enterprise.

Over time, businesses sought to aggregate data from multiple data marts into data warehouses, to gain a more complete view of their customers and the business. Such a warehouse might combine CRM and purchase data with manufacturing and supply chain data to better understand not only sales trends but also manufacturing quality and supply chain efficiency.

In the last 10 years, businesses have created even more comprehensive data stores called data lakes. Unlike data marts and data warehouses – which typically store only structured data such as SQL databases – data lakes store almost any information, in any format, including video, images and social media posts. The scalability of data lakes made it possible to hold the massive amounts of data needed to build and train machine learning models.

But like data warehouses before them, data lakes can be too expensive to create and maintain because of the new data management and delivery pipelines they require. Adding to the cost burden, many businesses maintain both data warehouses (to meet existing analytics needs) and data lakes (to support new machine learning applications). Doing so doubles their data storage and management costs.

Enter the Lakehouse 

A new approach to meeting these needs is a data architecture known as a lakehouse, which aims to combine the best features of data warehouses and data lakes by providing:

  • Support for both traditional SQL-based structured data and for other, more modern, data formats.
  • A robust storage and compute model that handles any data format, at any scale, so machine learning models can be built more quickly and easily.
  • A unified view of data engineering and consumption that reduces costs and speeds up innovation for users, analysts and data scientists.
  • Unlimited scalability and lower costs than on-premises infrastructure.

Databricks on Google Cloud is the latest example of leading hyperscalers offering the lakehouse architecture. Delta Lake on Databricks enables data engineering, cloud data processing, data science and analytics workloads on a unified data platform. This makes it easier, faster and more cost-effective for any user – from business analyst to data scientist – to discover and deliver insights to the enterprise and put machine learning into production.
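To make the unified-platform idea concrete, here is a minimal sketch, assuming a Spark environment with Delta Lake available (such as Databricks); the table names (sales.orders, ml.customer_features) and columns are hypothetical. The same Delta table serves a SQL-style business query and a DataFrame-based machine learning feature-preparation step.

```python
# Minimal sketch: one Delta table serving both analytics and ML feature preparation.
# Assumes Spark with Delta Lake (e.g., Databricks); table and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lakehouse-sketch").getOrCreate()

# Analyst view: a plain SQL aggregate run directly against the lakehouse table.
monthly_revenue = spark.sql("""
    SELECT date_trunc('month', order_date) AS month,
           SUM(amount)                     AS revenue
    FROM sales.orders
    GROUP BY date_trunc('month', order_date)
""")
monthly_revenue.show()

# Data science view: the same table as a DataFrame, aggregated into ML features.
features = (
    spark.table("sales.orders")
         .groupBy("customer_id")
         .agg(F.sum("amount").alias("lifetime_value"),
              F.count("*").alias("order_count"))
)

# Persist the features as another Delta table for model training downstream.
features.write.format("delta").mode("overwrite").saveAsTable("ml.customer_features")
```

Both consumers read the same copy of the data, which is the point the four needs below build on.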

Here is how a lakehouse architecture can help meet the four critical enterprise ML needs. 

  1. Agility: With some lakehouse architectures, businesses can query the data lake directly to answer any business question (whether using traditional business analytics or new machine learning applications). This reduces the development time for both queries and reports, as well as machine learning models. Reusable data engineering processes and self-service data preparation and analytics reduce the time required to find and act on new insights.

  2. Cost control: Some users have discovered that low-cost cloud storage and compressed, open data formats, such as the open-source Delta file format, can cut storage, compute and networking costs by as much as 80% (see the sketch after this list). Pay-as-you-go billing for storage, compute and networks, and the use of open-source container and serverless technologies, also minimize infrastructure costs and improve portability.

  3. Scalability: By tapping into the hyperscaler’s infrastructure, enterprises can quickly increase their capacity as business needs change.

  4. Openness: Because many technologies that enable the lakehouse architecture, such as the Databricks Unified Data Platform and Google Cloud, are open source and/or supported by multiple hyperscalers, businesses will find it relatively easy to move their machine learning platforms from one cloud provider to another, or among hybrid cloud platforms. Common pipelines for orchestrating, triggering and managing data and machine learning workflows, along with a single unified runtime, also ease workload shifting among hyperscalers.
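Tied to the cost-control point above, here is a minimal sketch of landing raw data in the open Delta format on low-cost cloud object storage. The gs:// paths, source data and partition column are hypothetical; compressed columnar Parquet files plus partition pruning are what shrink storage and scan costs.

```python
# Minimal sketch: writing raw events to the open Delta format on cloud object storage.
# Paths and columns are hypothetical; assumes Spark with Delta Lake and a GCS connector.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-landing-sketch").getOrCreate()

# Optionally pick a stronger codec for the underlying Parquet files (default is snappy).
spark.conf.set("spark.sql.parquet.compression.codec", "zstd")

# Read semi-structured source data and land it as a partitioned Delta table.
raw_events = spark.read.json("gs://example-bucket/raw/events/")

(raw_events.write
    .format("delta")
    .mode("append")
    .partitionBy("event_date")   # partition pruning reduces the data scanned per query
    .save("gs://example-bucket/delta/events/"))
```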

The Lakehouse Effect

As the market evolves, look for hyperscalers and their partners to deliver improved self-service capabilities for data engineering and access; higher-performance lakehouse-based platforms that match the performance of data warehouses; improved ACID (atomicity, consistency, isolation, durability) capabilities; and simpler containerization and deployment of production-ready ML models.

Remember to make proper data governance and data management the foundation of your lakehouse strategy, as the “garbage-in/garbage-out” rule is more important than ever when it comes to the data required to build machine learning models.

Anil Nagaraj

Anil Nagaraj is a Vice President of Analytics & AI in Cognizant’s Digital Business practice. He has spent over 20 years guiding...