aster.cloud aster.cloud
  • /
  • Platforms
    • Public Cloud
    • On-Premise
    • Hybrid Cloud
    • Data
  • Architecture
    • Design
    • Solutions
    • Enterprise
  • Engineering
    • Automation
    • Software Engineering
    • Project Management
    • DevOps
  • Programming
    • Learning
  • Tools
  • About
  • Big Data
  • Engineering
  • Platforms

Are NoSQL Databases Relevant For Data Engineering?

  • root
  • March 18, 2021
  • 6 minute read

SQL is great, but sometimes you may need something else.

By and large, the prevalent type of data that data engineers deal with on a regular basis is relational. Tables in a data warehouse and transactional data in Online Transaction Processing (OLTP) databases can all be queried and accessed using SQL. But does that mean NoSQL is irrelevant for data engineering? In this article, we’ll investigate use cases for which data engineers may need to interact with NoSQL data stores.



The reasons behind NoSQL

These days, data arrives at such a velocity, volume, and variety (in short: Big Data) that many relational database systems can’t keep up. Historically, this was the main reason why large tech companies developed their own NoSQL solutions. In 2006, Google published its Bigtable paper, which laid the foundations for the open-source HBase NoSQL data store and GCP’s Cloud Bigtable. In 2007, Amazon presented its alternative in the Dynamo paper.

From that time on, other distributed NoSQL database systems kept emerging. All of them have mainly tried to solve the problem of scale (data volume), which is hard to address with a traditional, vertically scalable RDBMS. Instead of scaling vertically (buying more RAM and CPU for a single server), they run on horizontally scalable distributed clusters, so capacity can be increased by simply adding more nodes to the cluster.
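To make the idea of horizontal scaling concrete, here is a minimal sketch of how a distributed store might decide which node owns a key. It uses plain hash-modulo placement, which is an assumption made for brevity; the node names and keys are hypothetical, and no particular database is being reproduced here.

```python
import hashlib
from typing import Dict, List


def node_for_key(key: str, nodes: List[str]) -> str:
    """Pick the node that owns a key by hashing the key (simple modulo placement)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


def distribution(keys: List[str], nodes: List[str]) -> Dict[str, int]:
    """Count how many keys land on each node."""
    counts = {node: 0 for node in nodes}
    for key in keys:
        counts[node_for_key(key, nodes)] += 1
    return counts


keys = [f"user-{i}" for i in range(10_000)]
cluster_of_3 = ["node-a", "node-b", "node-c"]
cluster_of_4 = cluster_of_3 + ["node-d"]

three_nodes = distribution(keys, cluster_of_3)
four_nodes = distribution(keys, cluster_of_4)

# Number of keys whose owner changes when the fourth node joins.
moved = sum(node_for_key(k, cluster_of_3) != node_for_key(k, cluster_of_4) for k in keys)

print(three_nodes)
print(four_nodes)
print("keys that changed owner:", moved)
```

Adding a node lowers the load per node, but with modulo placement most keys change owner. That is one reason systems such as Dynamo use consistent hashing instead.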

Apart from scalability and addressing the sheer data volume, many distributed NoSQL database systems address the challenge of data coming in many different formats (highly nested JSON-like structure, XML, time-series, images, videos, audio…), i.e., variety, which is difficult to handle in relational databases. I’ve seen many creative solutions in my career where people tried to serialize non-relational data and store it as blobs in a relational database. Still, such self-built systems tend to be hard to operate and manage at scale. Document databases with schema-on-read such as MongoDB allow much more flexibility in that regard.
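Schema-on-read can be illustrated with a small sketch: documents are stored as they arrive, and the column layout is only applied when reading. The documents, field names, and helper functions below are hypothetical and use only the standard library; this is not MongoDB's API.

```python
import json
from typing import Any, Dict, List

# Hypothetical raw documents, as a document store might return them.
RAW_DOCS = [
    '{"_id": 1, "user": {"name": "Ada", "address": {"city": "London"}}, "tags": ["a", "b"]}',
    '{"_id": 2, "user": {"name": "Linus"}, "device": "mobile"}',
]


def flatten(doc: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into dotted column names; lists are kept as JSON strings."""
    row: Dict[str, Any] = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=f"{name}."))
        elif isinstance(value, list):
            row[name] = json.dumps(value)
        else:
            row[name] = value
    return row


def read_with_schema(raw_docs: List[str], columns: List[str]) -> List[Dict[str, Any]]:
    """Apply the schema at read time: fields a document lacks become None."""
    rows = []
    for raw in raw_docs:
        flat = flatten(json.loads(raw))
        rows.append({column: flat.get(column) for column in columns})
    return rows


rows = read_with_schema(RAW_DOCS, ["_id", "user.name", "user.address.city", "device"])
for row in rows:
    print(row)
```

The two documents have different shapes, yet both can be stored and read. The reader decides which columns matter, and missing fields simply come back empty.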

Finally, many NoSQL databases address velocity, that is, data streaming in continuously in real time. For instance, Amazon Timestream is designed to handle data streaming from a variety of IoT devices.


Why do I need to know NoSQL data stores as a data engineer?

One of a data engineer’s responsibilities is integrating and consolidating data from various sources and providing it to data consumers consistently and reliably so that this data can be used for analytics and building data products. This means building data pipelines that pull data from (among others) NoSQL databases and store it in a data lake or data warehouse. Due to the above-mentioned data variety, data lakes are particularly useful for storing data from NoSQL data stores.

NoSQL use cases for data engineering

Now that we know why NoSQL is important for data engineering, let’s look at the typical use cases that we may encounter.

1. Analysis of log streams

These days, many applications send their logs to Elasticsearch, where they can be analyzed and visualized using Kibana. Thanks to its dynamic schema and indexing, Elasticsearch is a handy NoSQL data store in your data engineering toolbox, especially for container monitoring and log analytics.
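As an illustration of what log analysis can look like, the sketch below builds a search body in Elasticsearch's query DSL that counts ERROR logs of one service in time buckets. The field names (service, level, @timestamp) are assumptions, as is the .keyword sub-field that default dynamic mapping typically creates for strings. The body would be sent to an index's _search endpoint. Here it is only constructed and printed.

```python
import json
from typing import Any, Dict


def error_count_query(service: str, since: str = "now-1h", interval: str = "5m") -> Dict[str, Any]:
    """Build a search body: ERROR logs of one service, bucketed over time."""
    return {
        "size": 0,  # only the aggregation is needed, not the matching documents
        "query": {
            "bool": {
                "filter": [
                    {"term": {"service.keyword": service}},     # assumed field name
                    {"term": {"level.keyword": "ERROR"}},       # assumed field name
                    {"range": {"@timestamp": {"gte": since}}},  # assumed timestamp field
                ]
            }
        },
        "aggs": {
            "errors_over_time": {
                "date_histogram": {"field": "@timestamp", "fixed_interval": interval}
            }
        },
    }


body = error_count_query("checkout")
print(json.dumps(body, indent=2))
```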

 

2. Extracting time-series data from IoT devices and real-time applications

Relational database systems are usually not intended to be used with thousands of simultaneously open connections. Imagine that you manage a large fleet of servers or IoT devices that keep streaming metrics such as CPU utilization to some centralized data store so that this information can be used for a real-time dashboard (showing the health of the system) and for anomaly detection. If you used a relational database for that, you would likely encounter connection issues (leading to some data loss), since an RDBMS is not designed for thousands of short-lived connections. In contrast, distributed NoSQL data stores such as Amazon Timestream can handle that easily.
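A rough sketch of the ingestion side: metric readings are shaped into the record format used by Timestream's WriteRecords API and split into batches. The host names, metric name, and timestamps are made up, and the batch size of 100 reflects the per-request limit as I understand it, so check the current quotas before relying on it.

```python
from typing import Any, Dict, Iterator, List

MAX_RECORDS_PER_WRITE = 100  # assumed per-request limit; verify against current quotas


def to_record(host: str, metric: str, value: float, epoch_ms: int) -> Dict[str, Any]:
    """Shape one reading as a Timestream record (values are passed as strings)."""
    return {
        "Dimensions": [{"Name": "host", "Value": host}],
        "MeasureName": metric,
        "MeasureValue": str(value),
        "MeasureValueType": "DOUBLE",
        "Time": str(epoch_ms),
        "TimeUnit": "MILLISECONDS",
    }


def batches(records: List[Dict[str, Any]], size: int = MAX_RECORDS_PER_WRITE) -> Iterator[List[Dict[str, Any]]]:
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


# 250 hypothetical CPU readings from five servers, one second apart.
readings = [
    to_record(f"server-{i % 5}", "cpu_utilization", 10.0 + i % 90, 1_616_025_600_000 + i * 1000)
    for i in range(250)
]
payloads = list(batches(readings))
print([len(payload) for payload in payloads])

# With AWS credentials configured, each payload could be sent like this
# (database and table names are hypothetical):
#   import boto3
#   client = boto3.client("timestream-write")
#   client.write_records(DatabaseName="monitoring", TableName="metrics", Records=payload)
```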

 

3. Extracting data from NoSQL backend systems

NoSQL databases are used as a backend data store for many applications: e-commerce and shopping carts, content management platforms (blogs), mobile apps, websites, web analytics, clickstream tracking, and many more. All of this data can be used for analytics once you have established data pipelines that extract or stream it into your data lake or data warehouse.

 

4. Caching data for fast retrieval

If you’ve ever built a dashboard with a BI tool such as Tableau, you probably preferred working with extracts rather than with live data, provided that the dataset was fairly large. Connecting to a relational database and waiting until it processes and retrieves the data is often too slow for really performant dashboards (ones that don’t hang when you apply a filter). For this purpose, in-memory NoSQL data stores such as Redis are great. If you use them to cache the recent data that your dashboards need, you can greatly reduce lag and provide a good user experience.
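The caching idea can be sketched as a cache-aside pattern: look in the cache first, and only run the slow query on a miss. To keep the example runnable without a Redis server, it uses a small in-memory stand-in that exposes get and setex, the same method names redis-py provides. The key name, the TTL, and the query function are hypothetical.

```python
import json
import time
from typing import Any, Callable, Dict, Optional, Tuple


class InMemoryCache:
    """Stand-in for a Redis client, exposing only `get` and `setex`."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None or entry[0] < time.monotonic():
            return None  # missing or expired
        return entry[1]

    def setex(self, key: str, ttl_seconds: int, value: str) -> None:
        self._data[key] = (time.monotonic() + ttl_seconds, value)


def cached_query(cache: Any, key: str, run_query: Callable[[], Any], ttl_seconds: int = 300) -> Any:
    """Return the cached result if present; otherwise run the query and cache it."""
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)
    result = run_query()
    cache.setex(key, ttl_seconds, json.dumps(result))
    return result


calls = []


def slow_sales_query() -> Any:
    """Hypothetical expensive warehouse query."""
    calls.append(1)
    return [{"region": "EU", "revenue": 1200}, {"region": "US", "revenue": 3400}]


cache = InMemoryCache()
first = cached_query(cache, "dashboard:sales:last_7d", slow_sales_query)
second = cached_query(cache, "dashboard:sales:last_7d", slow_sales_query)
print(first == second, "query executions:", len(calls))
```

With a real Redis instance, a redis-py client could be passed in place of InMemoryCache, since the function only relies on get and setex.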


 

Demo: extracting data from NoSQL

Let’s demonstrate how we can extract data from DynamoDB, one of the most popular (serverless) NoSQL data stores on AWS. I created a table called demo with a partition key named index. This key corresponds to the index of my Pandas dataframe.

Now we can use a combination of boto3 and awswrangler to load a sample dataset. Then, starting from line 34, we build a simple ETL to retrieve data that has been inserted since yesterday.

Additionally, we could use the PartiQL query editor in the management console to validate that our ETL returns correct data.

DynamoDB PartiQL query editor — image by author
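For readers who want to try a similar extraction, here is a minimal sketch written independently of the listing referenced above. It scans a table for items inserted since yesterday and follows DynamoDB's pagination. The attribute name inserted_at (stored as an ISO-8601 string) is an assumption, and the scan function is passed in so the logic can be exercised without AWS access.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Callable, Dict, List, Optional


def scan_since(scan: Callable[..., Dict[str, Any]], table: str, since_iso: str) -> List[Dict[str, Any]]:
    """Collect all items whose `inserted_at` is >= since_iso, page by page."""
    kwargs: Dict[str, Any] = {
        "TableName": table,
        "FilterExpression": "#ts >= :since",
        "ExpressionAttributeNames": {"#ts": "inserted_at"},  # hypothetical attribute
        "ExpressionAttributeValues": {":since": {"S": since_iso}},
    }
    items: List[Dict[str, Any]] = []
    while True:
        page = scan(**kwargs)
        items.extend(page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            return items
        # More pages remain: continue after the last evaluated key.
        kwargs["ExclusiveStartKey"] = last_key


def yesterday_iso(now: Optional[datetime] = None) -> str:
    """ISO-8601 timestamp for 24 hours before `now` (UTC by default)."""
    now = now or datetime.now(timezone.utc)
    return (now - timedelta(days=1)).isoformat()


# With AWS credentials configured, the call could look like this:
#   import boto3
#   items = scan_since(boto3.client("dynamodb").scan, "demo", yesterday_iso())
```

Note that a scan with a filter still reads the whole table. That is fine for a small demo table, but for larger tables a key or index designed around the timestamp would be the more economical choice.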

 

How to keep an eye on your DynamoDB resources?

To monitor the read and write capacity units used by our table, it’s helpful to use an observability platform that gives you more detail about the health of your infrastructure and applications. For example, with Dashbird you can track any errors that occurred when reading or writing your data, the average latency of your read and write operations, whether continuous backups are enabled for quick point-in-time recovery, and how many read capacity units (RCU) and write capacity units (WCU) are consumed by your resources.
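To put RCU and WCU numbers in context, the following sketch applies DynamoDB's published sizing rules: one RCU covers one strongly consistent read per second of an item up to 4 KB (or two eventually consistent reads), and one WCU covers one write per second of an item up to 1 KB. Treating 1 KB as 1024 bytes is an assumption here, and the workload figures are invented.

```python
import math


def read_capacity_units(item_size_bytes: int, reads_per_second: int, strongly_consistent: bool = True) -> int:
    """RCUs needed: reads are billed in 4 KB steps; eventually consistent reads cost half."""
    units_per_read = math.ceil(item_size_bytes / 4096)
    total = units_per_read * reads_per_second
    if not strongly_consistent:
        total = total / 2
    return math.ceil(total)


def write_capacity_units(item_size_bytes: int, writes_per_second: int) -> int:
    """WCUs needed: writes are billed in 1 KB steps."""
    return math.ceil(item_size_bytes / 1024) * writes_per_second


# Hypothetical workload: 6 KB items read 10 times per second, 1.5 KB items written 10 times per second.
print(read_capacity_units(6144, 10))
print(read_capacity_units(6144, 10, strongly_consistent=False))
print(write_capacity_units(1536, 10))
```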

The “demo” DynamoDB table in Dashbird — image by author

 

The average latency in your read and write operations observed using Dashbird — image by author

 

Drawbacks of NoSQL databases

The previous sections demonstrated the use cases where NoSQL data stores shine in data engineering. But now to the drawbacks.

The major drawback of NoSQL data stores is that, without a full-fledged SQL interface, developers need to learn (again) some proprietary vocabulary, API, or interface to access data. TimescaleDB put it nicely on their blog:

“SQL is back. Not just because writing glue code to kludge together NoSQL tools is annoying. Not just because retraining workforces to learn a myriad of new languages is hard. Not just because standards can be a good thing. But also because the world is filled with data. […] Either we can live in a world of brittle systems and a million interfaces. Or we can continue to embrace SQL.” — Source

Even though many NoSQL data stores offer SQL interfaces on top, they will never be as powerful as a full-fledged relational database where you can create complex queries (not that complex queries are always a good thing).


So purely from a data engineering perspective, it’s best to work with those NoSQL systems rather than against them. Use them when there is a use case that clearly justifies the need for NoSQL’s scale and flexibility coming from schema-on-read.

If data comes in a format that is not suitable for a data warehouse, we can always extract it, store it in a data lake (the ELT approach), and provide it for analytics in its native raw format. These days, many tools allow us to make sense of data stored in a data lake. With SQL query engines such as Presto (or Amazon Athena, the serverless AWS service built on it) or BigQuery, you can analyze even semi-structured nested data.
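The extract-and-load step of such an ELT flow might look like the sketch below: records are written unchanged as newline-delimited JSON under a date-partitioned path, a layout that SQL engines over data lakes commonly read. The directory structure, file name, source name, and sample records are assumptions, and a temporary local directory stands in for object storage.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, Iterable


def land_raw(records: Iterable[Dict[str, Any]], lake_root: Path, source: str, load_date: str) -> Path:
    """Write records as-is, one JSON document per line, under a dt=YYYY-MM-DD partition."""
    target_dir = lake_root / source / f"dt={load_date}"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "part-0000.json"
    with target.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
    return target


# Hypothetical nested records extracted from a NoSQL backend.
records = [
    {"order_id": 1, "customer": {"id": 7, "country": "DE"}, "items": [{"sku": "A1", "qty": 2}]},
    {"order_id": 2, "customer": {"id": 9}, "items": []},
]

lake_root = Path(tempfile.mkdtemp())  # stand-in for a bucket or lake location
path = land_raw(records, lake_root, "orders", "2021-03-18")
print(path.relative_to(lake_root))
```

Because nothing is flattened or dropped on the way in, the nested structure stays available, and the decision about which fields to expose can be made later at query time.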

 

Conclusion

To answer the question from the title: yes, NoSQL data stores are important for data engineering. Even though they often lack a friendly SQL-like interface to retrieve the information you need, they are still very beneficial in various use cases. Given that more and more vendors realize the need for a SQL interface for data retrieval (such as PartiQL from AWS, the “good old” Hive, Spark SQL, CQL in Cassandra, SQL queries in Timestream), we should expect that the trend will continue. It’s possible that one day, we may be able to write SQL queries to retrieve data the same way regardless of the backend data store under the hood. In fact, if we look at federated queries with Presto, this future may not be that far away.

This article is republished from hackernoon.com



Related Topics
  • Data
  • Data Engineering
  • Database
  • IoT
  • NoSQL