Data Access Layer — What? Why? How?

Bharath Kumar
4 min read · Mar 26, 2021

What is a data access layer? Why do you need it? How does it work?
Read on for everything you need to get started with building a data access layer.

Running out of storage?

With data workflows generating new data on a regular basis, the load on the storage layer keeps increasing, and costs shoot up proportionally. There have been multiple situations where the system admin sends out storage-utilization alerts asking the team to delete data or migrate it to low-cost storage.

The problem is more pronounced if your setup uses a traditional block storage system like HDFS. Every block of data is also replicated internally (three times by default) to ensure availability, so utilization hits the limits quickly even after a round of cleanup.

Why is object storage not the best solution?

Migrating to object storage systems like S3 or Azure Blob might look like the obvious solution, but they come with their own set of challenges. The immediate observation is that data workloads become slower; you can find multiple benchmarks comparing S3 and HDFS that report the same.

Apache Spark powers most modern big data workloads. Spark writes each partition of a DataFrame as a separate part file (up to 200 by default, since spark.sql.shuffle.partitions defaults to 200), so reading an entire DataFrame back from S3/Azure Blob becomes much slower: the object storage system has to list and look up every one of those part files.
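To see the effect in code, here is a minimal PySpark sketch (the bucket and paths below are hypothetical) showing the default part-file count, with coalescing before the write as one common way to cap it:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("part-file-demo").getOrCreate()

# After a shuffle (groupBy, join, ...), a DataFrame has
# spark.sql.shuffle.partitions partitions (200 by default),
# and every partition becomes one part file on write.
df = spark.range(1_000_000).groupBy(col("id") % 1000).count()
print(df.rdd.getNumPartitions())  # typically 200; adaptive execution may coalesce fewer

# Writing as-is creates one object per partition under the prefix.
df.write.mode("overwrite").parquet("s3a://my-bucket/counts/")  # hypothetical path

# Coalescing first caps the object count, cutting the listing and
# lookup overhead that makes object-store reads slow.
df.coalesce(8).write.mode("overwrite").parquet("s3a://my-bucket/counts_compact/")
```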

This is where building an intelligent data access layer helps you strike a balance between cost and performance.

How does the data access layer fit in?

The data access layer comes in-between the data workloads and the data storage layer.

Comparing the change in data flow (image by Author)

The data processing jobs now talk to the data access layer instead of the storage layer directly. The data access layer becomes responsible for writing data and for fetching it from primary or secondary storage, wherever it is available.

Primary Storage: The original storage system that was already in use. The role of the primary storage is below —

  1. This is where new data is written by the data processing jobs
  2. It holds recent data and data that is accessed frequently
  3. Data that is old or not accessed frequently is moved to the secondary storage by the data access layer
  4. High read/write performance with strong consistency is required, so a block storage system like HDFS is preferred here

Secondary Storage: This is the new storage system with low data storage costs. The role of the secondary storage is below (a rough policy sketch follows the list) —

  1. Data that is old, will not be accessed by the regular jobs, and is required only for ad-hoc analyses is moved here from the primary storage
  2. The focus of secondary storage is lowering cost; you can afford lower performance since it mostly serves ad-hoc analyses
  3. Object storage like S3/Azure Blob fits well here
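To make the tier roles concrete, here is a rough sketch of what a declarative tiering policy for the data access layer could look like; every field name below is a hypothetical illustration, not part of any real product:

```python
# Hypothetical tiering policy consumed by the data access layer.
STORAGE_TIERS = {
    "primary": {
        "backend": "hdfs://namenode:8020/warehouse",  # fast, strongly consistent
        "holds": "new writes and frequently accessed data",
    },
    "secondary": {
        "backend": "s3a://archive-bucket/warehouse",  # cheap, slower
        "holds": "old or rarely accessed data, ad-hoc analyses",
    },
    # The archiver demotes data from primary to secondary once it
    # goes this long without being read.
    "demote_after_days": 30,
}
```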

Let’s look into this in detail and understand how it can solve the storage utilization problem mentioned earlier.

How does the data access layer work?

Let’s look in detail at how the data access layer works. Below is the component-level breakdown; a toy code sketch of all four components follows the list —

Components making up the Data Access Layer (image by Author)
  1. DataStream Writer: Any new data that needs to be written goes through the DataStream Writer. It writes the data to the primary storage and adds details of the written data to the DataStream Repo.
  2. DataStream Repo: Hosts the definitions of all available DataStreams — archiving strategy, expiry period, etc. — along with the details, supplied by the DataStream Writer, of every DataStream instance stored in the primary storage.
  3. DataStream Archiver: Interacts with the DataStream Repo and archives data as per the defined archival strategy and expiry period. Intermediate and temporary data is deleted; older versions of the rest are moved to the secondary storage. The new location of each moved DataStream instance is also updated in the DataStream Repo.
  4. DataStream Reader: Works like a read-through system. It reads a DataStream from the primary storage; if not found there, it reads the same DataStream from the secondary storage; if still not found, it fetches information from the DataStream Repo to raise a suitable error.
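Here is the promised toy, self-contained Python sketch of the four components. Every class name, method, and the in-memory Store stand-in are hypothetical: a minimal sketch of the flow, not a real implementation.

```python
import time

class Store:
    """Toy in-memory backend standing in for HDFS (primary) or S3 (secondary)."""
    def __init__(self):
        self.blobs = {}
    def put(self, key, data):
        self.blobs[key] = data
    def get(self, key):
        return self.blobs.get(key)
    def delete(self, key):
        self.blobs.pop(key, None)

class DataStreamRepo:
    """Hosts DataStream definitions and tracks where each instance lives."""
    def __init__(self):
        self.defs = {}       # name -> {"expiry_days": int, "archival": "move" | "delete"}
        self.instances = {}  # name -> {"tier": str, "written_at": float}
    def register(self, name, expiry_days, archival="move"):
        self.defs[name] = {"expiry_days": expiry_days, "archival": archival}

class DataStreamWriter:
    """Writes new data to primary storage and records the instance in the repo."""
    def __init__(self, repo, primary):
        self.repo, self.primary = repo, primary
    def write(self, name, data):
        self.primary.put(name, data)
        self.repo.instances[name] = {"tier": "primary", "written_at": time.time()}

class DataStreamArchiver:
    """Applies each definition's archival strategy and expiry period."""
    def __init__(self, repo, primary, secondary):
        self.repo, self.primary, self.secondary = repo, primary, secondary
    def run(self, now=None):
        now = now if now is not None else time.time()
        for name, inst in self.repo.instances.items():
            cfg = self.repo.defs[name]
            expired = now - inst["written_at"] > cfg["expiry_days"] * 86400
            if inst["tier"] != "primary" or not expired:
                continue
            if cfg["archival"] == "delete":   # intermediate/temporary data
                self.primary.delete(name)
                inst["tier"] = "deleted"
            else:                             # demote older data to the cheap tier
                self.secondary.put(name, self.primary.get(name))
                self.primary.delete(name)
                inst["tier"] = "secondary"    # new location updated in the repo

class DataStreamReader:
    """Read-through: try primary, then secondary, else raise a useful error."""
    def __init__(self, repo, primary, secondary):
        self.repo, self.primary, self.secondary = repo, primary, secondary
    def read(self, name):
        for store in (self.primary, self.secondary):
            data = store.get(name)
            if data is not None:
                return data
        raise KeyError(f"DataStream {name!r} not found (defined: {name in self.repo.defs})")

# Walkthrough: write a stream, archive it after expiry, and the read
# transparently falls through to the secondary storage.
repo, primary, secondary = DataStreamRepo(), Store(), Store()
repo.register("clicks", expiry_days=30)
DataStreamWriter(repo, primary).write("clicks", b"click events")
DataStreamArchiver(repo, primary, secondary).run(now=time.time() + 31 * 86400)
assert DataStreamReader(repo, primary, secondary).read("clicks") == b"click events"
```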

The data access layer works in the background to manage data across primary and secondary storage. The logic for managing data across the tiers can be improved in several ways, and any improvement leads to a better balance between cost and performance.

Other applications of the data access layer can be —

  1. Changes to the schema of the same DataStream across versions are a common problem. The DataStream Reader can accommodate these cases and pre-process the data to present the latest schema (a small sketch follows this list). Similar functionality is available in Databricks Delta Lake
  2. Testing of data workloads can be automated. The DataStream Reader can generate dummy data as per provided rules to exercise the workloads against different data scenarios
  3. Migration of data sources. Migrating any DataStream to a new data source requires only a definition update in the DataStream Repo; both the DataStream Writer and DataStream Reader then start communicating with the new source
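As an illustration of the first point, a hypothetical upgrade hook in the DataStream Reader could upcast old records to the latest schema before returning them; the field names and defaults below are invented for the example:

```python
# Hypothetical schema-evolution hook in the DataStream Reader:
# v1 records lack "currency", so a default is filled at read time
# instead of breaking downstream jobs.
def upgrade_to_latest(record: dict) -> dict:
    out = dict(record)
    out.setdefault("currency", "USD")  # field added in schema v2
    return out

print(upgrade_to_latest({"user_id": 1, "amount": 9.99}))
# {'user_id': 1, 'amount': 9.99, 'currency': 'USD'}
print(upgrade_to_latest({"user_id": 2, "amount": 5.0, "currency": "EUR"}))
# {'user_id': 2, 'amount': 5.0, 'currency': 'EUR'}
```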

Data Quality Engine

When data is transferred between systems, there is a high probability that data quality degrades. With ML models prone to garbage-in, garbage-out, and reports driving important business decisions, maintaining high data quality becomes the next obvious problem to solve.

A data quality engine lets you monitor the quality of incoming data, identify anomalies, and raise issues to the relevant stakeholders for resolution. A detailed post on the Data Quality Engine will be out soon. Follow the page to stay updated.
