When you build data infrastructure that spans everything from data collection to analytics, getting started is the relatively easy part. Things become more challenging when you have to scale, and there is no better place to experience this than data lake management. Unlike traditional database systems, where data is indexed and stored in tables, a data lake stores its data as files, which makes indexing and partitioning part of everyday DataOps tasks. Done poorly, this leads to inefficiencies in both cost and performance. In this article, we discuss the DataOps challenges in data lakes and the technologies we can consider to overcome them.
Difficulties in DataOps with the current data lake architecture
Much of the current data lake architecture is built on Presto and Trino (PrestoSQL was renamed Trino). Organizations adopt data lakes because they can run analysis on data in various formats, sources, and locations without additional processing, which dramatically reduces dev-to-market time. However, this comes at a cost: data teams still have to watch their data budget and optimize query performance. There are three main areas where teams run into difficulties in data operations.
No visibility into infrastructure for consistent and improved performance
Managed data services are convenient because they minimize the need to manage data infrastructure yourself. This works well while your use cases remain simple. However, as your data grows and you need faster, more consistent performance, meeting your criteria becomes harder. For example, when you run a query in AWS Athena, the request goes into a shared queue that takes all query requests across your region. Depending on demand at the time of your request, query performance can vary. Since Athena is a managed service, you cannot change its underlying infrastructure to guarantee consistent performance.
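To see why a shared queue makes latency unpredictable, consider the toy model below. It is not Athena's actual scheduler (which is not public); it is a minimal sketch assuming only that other tenants' queries may be queued ahead of yours, so the same query submitted at different times waits a different amount of time.

```python
import random

def simulate_shared_queue(num_tenants, my_query_cost, seed):
    """Toy model of a shared query queue: total latency depends on how many
    other tenants' queries happen to be queued ahead of yours."""
    random.seed(seed)
    # Queries queued ahead of ours, each with its own runtime in seconds.
    num_ahead = random.randint(0, num_tenants)
    queued_ahead = [random.uniform(1, 30) for _ in range(num_ahead)]
    return sum(queued_ahead) + my_query_cost

# The same 5-second query, submitted at five different moments,
# observes five different end-to-end latencies.
latencies = [
    simulate_shared_queue(num_tenants=20, my_query_cost=5.0, seed=s)
    for s in range(5)
]
```

The point of the sketch: your query's own cost is the only part you control; the rest is a function of regional demand you cannot see or tune.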
Difficulties in partitioning data
Although you can run queries without partitioning your data, partitioning is recommended when you want to optimize query performance on big data. Partitioning is a manual job that requires you to select a base column, and query performance for the table varies depending on the column you choose and its type. There are also several considerations and limitations to keep in mind. Moreover, once you choose a partitioning column and want to change it later, you may have to rebuild the table in S3 with the new partition structure.
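To make this concrete, here is a small sketch of the Hive-style layout that Athena and Presto understand: data files are grouped under `column=value` prefixes in S3, so a filter on the partition column only scans matching prefixes. The bucket prefix and file names are made up for illustration.

```python
from datetime import date, timedelta

def partitioned_key(table_prefix, partition_col, partition_value, filename):
    """Build a Hive-style partitioned object key: prefix/col=value/file."""
    return f"{table_prefix}/{partition_col}={partition_value}/{filename}"

# Daily files written under dt=YYYY-MM-DD; a query filtering on dt
# can prune every prefix that doesn't match.
start = date(2023, 1, 1)
keys = [
    partitioned_key(
        "logs/events", "dt",
        (start + timedelta(days=i)).isoformat(),
        f"part-{i:05d}.parquet",
    )
    for i in range(3)
]
```

This layout is also why changing the partition column later is painful: every object lives under a path that encodes the old column, so repartitioning means rewriting the data under new prefixes.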
Query failures due to the wrong join order
When you run join queries on AWS Athena, one recommended practice is to put the larger table on the left and the smaller one on the right. This is due to Presto's allocation behavior: when a join runs, Presto distributes the right-hand table to the worker nodes, then uses the left-hand table to perform the join. When the right-hand table is smaller, Presto consumes less memory and the query runs faster. According to an AWS example, the ordering alone produces a roughly 53% difference in query speed.
When users run big join queries without keeping this in mind, the query can fail with a “timeout” error. The timeout error does not show the underlying cause, so people unfamiliar with Presto's behavior can take a long time to find and fix the problem.
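The asymmetry comes from how a hash join works. The sketch below is a simplified, single-node version of the idea (Presto's distributed implementation is more involved): the right-hand "build" side is held in a hash table in memory, while the left-hand "probe" side is streamed past it, so memory use tracks the size of the right table.

```python
from collections import defaultdict

def hash_join(left, right, key):
    """Simplified hash join: build an in-memory hash table from the RIGHT
    (build) side, then stream the LEFT (probe) side through it.
    Memory use is proportional to the right table."""
    build = defaultdict(list)
    for row in right:                       # build phase: held in memory
        build[row[key]].append(row)
    joined = []
    for row in left:                        # probe phase: streamed row by row
        for match in build.get(row[key], []):
            joined.append({**row, **match})
    return joined

# Larger fact table on the left, smaller dimension table on the right.
orders = [{"user_id": i % 3, "order_id": i} for i in range(6)]
users = [{"user_id": 0, "name": "a"}, {"user_id": 1, "name": "b"}]
result = hash_join(orders, users, "user_id")
```

Swap the arguments and the join still returns the same rows, but now the larger table must fit in memory on the build side, which is exactly how a big join runs out of memory and times out.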
Because of the aspects discussed above, maintaining query performance in a data lake can consume a lot of resources. If this is creating operational overhead, you can consider a third-party solution to reduce the burden. One example is Varada, which developed dynamic indexing for data lakes. Varada breaks a large dataset down into what it calls nano blocks of 64K rows each. Its technology inspects each nano block (and thus the original dataset) and automatically chooses an index for it, while its monitoring system watches user queries and continually evaluates performance.
Their indexing logic then dynamically introduces indexes for filtering, joining, and aggregating queries. Also, unlike other data lake solutions that restrict which columns you can partition on, theirs lets you use any column. They claim their dynamic indexing can make Presto queries 10 to 100 times faster with a 40–60% cost reduction.
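Varada's implementation is proprietary, but the core idea can be sketched: split the data into fixed-size blocks and pick an index type per block based on what that block's data looks like. The block size matches the 64K rows mentioned above; the cardinality heuristic and the index names here are illustrative assumptions, not Varada's actual logic.

```python
def choose_index(block, column):
    """Illustrative heuristic: pick an index type for one block based on
    the column's cardinality within that block (not Varada's real logic)."""
    values = [row[column] for row in block]
    distinct = len(set(values))
    if distinct <= 2:
        return "bitmap"        # near-constant column: bitmap index
    if distinct < len(values) // 10:
        return "dictionary"    # low cardinality: dictionary encoding
    return "btree"             # high cardinality: tree/range index

def index_nano_blocks(rows, column, block_size=65536):
    """Split a dataset into fixed-size 'nano blocks' and choose an index
    per block, so skewed regions of the data get different indexes."""
    return [
        (i, choose_index(rows[i:i + block_size], column))
        for i in range(0, len(rows), block_size)
    ]

# A skewed column: the first ~100K rows share one value, the rest are unique.
rows = [{"status": "ok" if i < 100_000 else f"err-{i}"} for i in range(131_072)]
plan = index_nano_blocks(rows, "status")
```

The payoff of per-block decisions is visible even in this toy: the uniform region gets a cheap bitmap index while the high-cardinality region gets a tree index, something a single table-wide index choice cannot do.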
Numerous companies use cloud services like AWS, GCP, and Azure for their managed offerings, which bring significant cost benefits and convenience. Looked at closely, though, each service has strengths and weaknesses, and another group of cloud companies builds solutions to fill those gaps. The same story applies to data lake products: if your team is suffering from operational complexity around partitioning and indexing, a third-party technology like dynamic indexing can mitigate the pain.
Also Read: Data Versioning- How to Version your Data