MongoDB to Redshift ETL: 2 Easy Methods

2024-08-14 08:28:56

If you are looking to move data from MongoDB to Redshift, I reckon that you are trying to upgrade your analytics setup to a modern data stack. Great move!

Kudos to you for taking up this mammoth of a task! In this blog, I have tried to share my two cents on how to make the data migration from MongoDB to Redshift easier for you.

Before we jump to the details, I feel it is important to understand a little about how MongoDB and Redshift operate. This will help you anticipate the technical nuances involved in MongoDB to Redshift ETL. If you are already an expert at this, feel free to skim through these sections or skip them entirely.

What is MongoDB?

MongoDB distinguishes itself as a NoSQL database program. It uses JSON-like documents along with optional schemas. MongoDB is written in C++. MongoDB allows you to address a diverse set of data sets, accelerate development, and adapt quickly to change with key functionalities like horizontal scaling and automatic failover.

MongoDB is a strong choice when you have huge volumes of structured and unstructured data. Its features make scaling and flexibility smooth, and they cover data integration, load balancing, ad-hoc queries, sharding, indexing, and more.

Another advantage is that MongoDB supports all common operating systems (Linux, macOS, and Windows). It also provides drivers for C, C++, Go, Node.js, Python, and PHP.

What is Amazon Redshift?

Amazon Redshift is a cloud data warehouse that lets companies store petabytes of data across easily accessible "clusters" that you can query in parallel. Every Amazon Redshift data warehouse is fully managed, which means that administrative tasks such as maintenance, backups, configuration, and security are largely automated.

Suppose you are a data practitioner who wants to use Amazon Redshift to work with big data. Its modular node design makes your work easy to scale. It also lets you gain more granular insight into datasets, because the compute nodes in an Amazon Redshift cluster are further divided into slices. Amazon Redshift's multi-layered architecture allows multiple queries to be processed simultaneously, which cuts down on waiting times. Apart from these, there are a few more benefits of Amazon Redshift you can unlock with the best practices in place.

Main Features of Amazon Redshift

When you submit a query, Redshift checks the result cache for a valid, cached copy of the query result. When it finds a match, it does not execute the query again. Instead, it returns the cached result, which reduces the query's runtime.
You can use Massively Parallel Processing (MPP) to run even the most complicated queries over large volumes of data.
Your data is stored in columnar format in Redshift tables. This reduces the number of disk I/O requests and improves analytical query performance.

Why perform MongoDB to Redshift ETL?

It is necessary to bring MongoDB’s data to a relational format data warehouse like AWS Redshift to perform analytical queries. It is simple and cost-effective to efficiently analyze all your data by using a real-time data pipeline. MongoDB is document-oriented and uses JSON-like documents to store data.

MongoDB doesn’t enforce schema restrictions while storing data. Application developers can quickly change the schema, add new fields, and forget about older ones that are no longer used, all without worrying about tedious schema migrations. Owing to the schema-less nature of a MongoDB collection, converting its data into a relational format is a non-trivial problem.

In my experience in helping customers set up their modern data stack, I have seen MongoDB be a particularly tricky database to run analytics on. Hence, I have also suggested an easier / alternative approach that can help make your journey simpler.

In this blog, I will talk about the two different methods you can use to set up a connection from MongoDB to Redshift in a seamless fashion: Using Custom ETL Scripts and with the help of a third-party tool, LIKE.TG .

What Are the Methods to Move Data from MongoDB to Redshift?

These are the methods we can use to move data from MongoDB to Redshift in a seamless fashion:

Method 1: Using Custom Scripts to Move Data from MongoDB to Redshift
Method 2: Using an Automated Data Pipeline Platform to Move Data from MongoDB to Redshift

Method 1: Using Custom Scripts to Move Data from MongoDB to Redshift

Following are the steps we can use to move data from MongoDB to Redshift using Custom Script:

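Step 1: Use mongoexport to export data. To get CSV output, pass --type=csv and list the fields to export with --fields (the field names below are placeholders).

mongoexport --db=db_name --collection=collection_name --type=csv --fields=field1,field2,field3 --out=outputfile.csv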

Step 2: Upload the .csv file to the S3 bucket.

2.1: Since MongoDB allows for varied schema, it might be challenging to comprehend a collection and produce an Amazon Redshift table that works with it. For this reason, before uploading the file to the S3 bucket, you need to create a table structure.

2.2: The AWS CLI makes it simple to upload files from your local computer to the S3 bucket. If you have already installed the AWS CLI, use the command below to upload .csv files. After transferring the .csv files into the S3 bucket, you can use the command prompt to generate a table schema.

aws s3 cp D:\outputfile.csv s3://S3bucket01/outputfile.csv

Step 3: Create a Table schema before loading the data into Redshift.
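
As a rough illustration of Step 3, here is a minimal Python sketch that builds a CREATE TABLE statement from a mapping of column names to Redshift types. The function name build_create_table, the table name users, and the column types are all assumptions made for this example; in practice you would choose the types after inspecting your own collection.

```python
def build_create_table(table_name, columns):
    """Return a CREATE TABLE statement for a dict of column name -> Redshift type."""
    if not columns:
        raise ValueError("at least one column is required")

    def quote(identifier):
        # Double-quote identifiers so names such as "group" remain valid,
        # and escape any embedded double quotes.
        return '"' + identifier.replace('"', '""') + '"'

    column_lines = [f"    {quote(name)} {sql_type}" for name, sql_type in columns.items()]
    return f"CREATE TABLE {quote(table_name)} (\n" + ",\n".join(column_lines) + "\n);"


# Hypothetical columns for the sample "John Doe" documents used in this article.
ddl = build_create_table(
    "users",
    {"name": "VARCHAR(256)", "age": "BIGINT", "gender": "VARCHAR(16)"},
)
print(ddl)
```

You would then run the generated statement against Redshift with the SQL client of your choice before issuing the COPY command.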
Step 4: Use the COPY command to load the data from S3 to Redshift.

Use the following COPY command to transfer files from the S3 bucket to Redshift if you’re following Step 2 (2.1).

COPY table_name
FROM 's3://S3bucket_name/table_name-csv.tbl'
IAM_ROLE 'arn:aws:iam::<aws-account-id>:role/<role-name>'
CSV;

Use the COPY command to transfer files from the S3 bucket to Redshift if you’re following Step 2 (2.2). Add csv to the end of your COPY command in order to load files in CSV format.

COPY db_name.table_name
FROM 's3://S3bucket_name/outputfile.csv'
IAM_ROLE 'arn:aws:iam::<aws-account-id>:role/<role-name>'
CSV;

We have successfully completed MongoDB Redshift integration.

For the scope of this article, we have highlighted the challenges faced while migrating data from MongoDB to Amazon Redshift. Towards the end of the article, a detailed list of the advantages of using Method 2 is also given. You can check out Method 1 on our other blog for the detailed steps to migrate MongoDB to Amazon Redshift.

Limitations of using Custom Scripts to Move Data from MongoDB to Redshift

Here is a list of limitations of using the manual method of moving data from MongoDB to Redshift:

Schema Detection Cannot be Done Upfront: Unlike a relational database, a MongoDB collection doesn’t have a predefined schema. Hence, you cannot simply look at a collection and create a compatible table in Redshift upfront.
Different Documents in a Single Collection: Different documents in a single collection can have different sets of fields, as in the two documents below.

{
  "name": "John Doe",
  "age": 32,
  "gender": "Male"
}

{
  "first_name": "John",
  "last_name": "Doe",
  "age": 32,
  "gender": "Male"
}
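
One way to cope with differing field sets is to scan every document, take the union of the field names, and write each document out against that full column list. The sketch below does this with Python's standard csv module. The helper names collect_fields and to_csv are made up for this example, and it only looks at top-level fields.

```python
import csv
import io


def collect_fields(documents):
    """Return every top-level field name seen, in first-seen order."""
    fields = []
    for doc in documents:
        for key in doc:
            if key not in fields:
                fields.append(key)
    return fields


def to_csv(documents):
    """Write all documents as CSV rows, leaving missing fields empty."""
    fields = collect_fields(documents)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(documents)
    return buffer.getvalue()


docs = [
    {"name": "John Doe", "age": 32, "gender": "Male"},
    {"first_name": "John", "last_name": "Doe", "age": 32, "gender": "Male"},
]
csv_text = to_csv(docs)
print(csv_text)
```

Note that this requires a full pass over the collection before the first row is written, which is exactly why the schema cannot be fixed upfront.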

Different documents in a single collection can have incompatible field data types. Hence, the schema of the collection cannot be determined by reading one or a few documents.

For example, two documents in a single MongoDB collection can hold values of different types in the same field.

{
  "name": "John Doe",
  "age": 32,
  "gender": "Male",
  "mobile": "(424) 226-6998"
}

{
"name": "John Doe",
"age": 32,
"gender": "Male",
"mobile": 4242266998
}

The field mobile is a string in the first document and a number in the second. This is a completely valid state in MongoDB. In Redshift, however, both values have to be converted to a single type, either a string or a number, before being persisted.
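
A simple way to resolve such conflicts is to record every type seen for a field and fall back to a string column when the types disagree. The following Python sketch shows the idea. The function names and the specific type mapping (BIGINT, DOUBLE PRECISION, BOOLEAN, VARCHAR) are assumptions chosen for illustration, not a complete mapping.

```python
def redshift_type_for(value):
    """Map a Python value to a candidate Redshift type (illustrative only)."""
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "BOOLEAN"
    if isinstance(value, int):
        return "BIGINT"
    if isinstance(value, float):
        return "DOUBLE PRECISION"
    # Strings, and anything else such as nested values, fall back to VARCHAR.
    return "VARCHAR"


def unify(types):
    """Pick one column type that can hold every observed type."""
    types = set(types)
    if len(types) == 1:
        return next(iter(types))
    if types == {"BIGINT", "DOUBLE PRECISION"}:
        return "DOUBLE PRECISION"
    return "VARCHAR"


def infer_column_types(documents):
    """Scan all documents and return a dict of field name -> unified type."""
    seen = {}
    for doc in documents:
        for field, value in doc.items():
            if value is None:
                continue  # nulls carry no type information
            seen.setdefault(field, set()).add(redshift_type_for(value))
    return {field: unify(types) for field, types in seen.items()}


docs = [
    {"name": "John Doe", "age": 32, "gender": "Male", "mobile": "(424) 226-6998"},
    {"name": "John Doe", "age": 32, "gender": "Male", "mobile": 4242266998},
]
column_types = infer_column_types(docs)
print(column_types)
```

Here mobile resolves to VARCHAR because it was seen as both a string and a number, while age stays BIGINT.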

New Fields can be Added to a Document at Any Point in Time: It is possible to add fields to a document in MongoDB by running a simple update on the document. In Redshift, however, the process is harder, as you have to construct and run ALTER statements each time a new field is detected.
Character Lengths of String Columns: MongoDB doesn’t put a limit on the length of the string columns. It has a 16MB limit on the size of the entire document. However, in Redshift, it is a common practice to restrict string columns to a certain maximum length for better space utilization. Hence, each time you encounter a longer value than expected, you will have to resize the column.
Nested Objects and Arrays in a Document: A document can have nested objects and arrays with a dynamic structure. The most complex of MongoDB ETL problems is handling nested objects and arrays.

{
"name": "John Doe",
"age": 32,
"gender": "Male",
"address": {
"street": "1390 Market St",
"city": "San Francisco",
"state": "CA"
},
"groups": ["Sports", "Technology"]
}

MongoDB allows nesting objects and arrays to several levels. In a complex real-life scenario, it may become a nightmare trying to flatten such documents into rows for a Redshift table.
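
To make the flattening problem concrete, here is a minimal Python sketch of one common approach: nested objects become prefixed columns (address_street, address_city, and so on), while arrays are split out into rows for a separate child table. The function names, the underscore separator, and the parent_id/position/value layout of the child rows are assumptions for this example. Arrays of objects would need further handling that is not shown.

```python
def flatten(document, parent_key="", sep="_"):
    """Flatten nested objects into one level.

    Returns (row, arrays): row holds scalar columns, arrays holds the
    list-valued fields keyed by their flattened column name.
    """
    row = {}
    arrays = {}
    for key, value in document.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            child_row, child_arrays = flatten(value, column, sep)
            row.update(child_row)
            arrays.update(child_arrays)
        elif isinstance(value, list):
            arrays[column] = value
        else:
            row[column] = value
    return row, arrays


def child_rows(parent_id, arrays):
    """Turn each array into rows for a child table keyed by the parent id."""
    return {
        name: [
            {"parent_id": parent_id, "position": index, "value": item}
            for index, item in enumerate(items)
        ]
        for name, items in arrays.items()
    }


doc = {
    "name": "John Doe",
    "age": 32,
    "gender": "Male",
    "address": {"street": "1390 Market St", "city": "San Francisco", "state": "CA"},
    "groups": ["Sports", "Technology"],
}
row, arrays = flatten(doc)
# "u1" is a made-up identifier standing in for the document's key.
children = child_rows("u1", arrays)
print(row)
print(children)
```

The parent row would go into the main table, and the groups rows into a second table that joins back on parent_id.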

Data Type Incompatibility between MongoDB and Redshift: Not all MongoDB data types are compatible with Redshift. ObjectId, Regular Expression, and JavaScript have no direct equivalent in Redshift. While building an ETL solution from scratch to migrate data from MongoDB to Redshift, you will have to write custom code to handle these data types.
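
Two of the limitations above, new fields and growing string lengths, both come down to generating ALTER statements as data arrives. The Python sketch below compares an incoming row against the columns a table is known to have and returns the statements that would be needed. The function name, the default type for new columns, and the way existing columns are described (a dict of name to type string) are assumptions for illustration. Redshift measures VARCHAR length in bytes, so the sketch sizes columns from the UTF-8 encoded length.

```python
import re

VARCHAR_PATTERN = re.compile(r"VARCHAR\((\d+)\)", re.IGNORECASE)


def plan_alter_statements(table_name, existing_columns, row, new_column_type="VARCHAR(256)"):
    """Return the ALTER statements needed before `row` can be loaded.

    existing_columns maps column name -> declared type, e.g. "VARCHAR(5)".
    """
    statements = []
    for field, value in row.items():
        declared = existing_columns.get(field)
        if declared is None:
            # Field not seen before: add a column (type is a placeholder default).
            statements.append(
                f'ALTER TABLE "{table_name}" ADD COLUMN "{field}" {new_column_type};'
            )
            continue
        match = VARCHAR_PATTERN.fullmatch(declared)
        if match and isinstance(value, str):
            needed = len(value.encode("utf-8"))  # byte length, not character count
            if needed > int(match.group(1)):
                statements.append(
                    f'ALTER TABLE "{table_name}" ALTER COLUMN "{field}" TYPE VARCHAR({needed});'
                )
    return statements


existing = {"name": "VARCHAR(5)", "age": "BIGINT"}
incoming = {"name": "John Doe", "age": 32, "mobile": "(424) 226-6998"}
alters = plan_alter_statements("users", existing, incoming)
for statement in alters:
    print(statement)
```

A real pipeline would likely round the new length up rather than resize on every slightly longer value, and would need to check any restrictions Redshift places on altering a given column.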

Method 2: Using Third-Party ETL Tools to Move Data from MongoDB to Redshift

While the manual approach works, using an automated data pipeline tool like LIKE.TG can save you time, resources, and costs. LIKE.TG Data is a no-code data pipeline platform that can help load data from any data source, such as databases, SaaS applications, cloud storage, SDKs, and streaming services, to a destination of your choice. Here’s how LIKE.TG overcomes the challenges faced in the manual approach for MongoDB to Redshift ETL:

Dynamic expansion for Varchar Columns: LIKE.TG expands the existing varchar columns in Redshift dynamically as and when it encounters longer string values. This ensures that your Redshift space is used wisely without you breaking a sweat.
Splitting Nested Documents with Transformations: LIKE.TG lets you split the nested MongoDB documents into multiple rows in Redshift by writing simple Python transformations. This makes MongoDB file flattening a cakewalk for users.
Automatic Conversion to Redshift Data Types: LIKE.TG converts all MongoDB data types to the closest compatible data type in Redshift. This eliminates the need to write custom scripts to maintain each data type, in turn, making the migration of data from MongoDB to Redshift seamless.

Here are the steps involved in the process for you:

Step 1: Configure Your Source

Connect LIKE.TG to MongoDB by entering details like Database Port, Database Host, Database User, Database Password, Pipeline Name, Connection URI, and the connection settings.

Step 2: Integrate Data

Load data from MongoDB to Redshift by providing your Redshift database credentials like Database Port, Username, Password, Name, Schema, and Cluster Identifier, along with the Destination Name.

LIKE.TG supports 150+ data sources, including MongoDB, and destinations like Redshift, Snowflake, BigQuery, and many more. LIKE.TG’s fault-tolerant and scalable architecture ensures that the data is handled in a secure, consistent manner with zero data loss.

Give LIKE.TG a try and you can seamlessly export MongoDB to Redshift in minutes.

For detailed information on how you can use the LIKE.TG connectors for MongoDB to Redshift ETL, check out:

MongoDB Source Connector
Redshift Destination Connector

Additional Resources for MongoDB Integrations and Migrations

Stream data from MongoDB Atlas to BigQuery
Move Data from MongoDB to MySQL
Connect MongoDB to Snowflake
Connect MongoDB to Tableau

Conclusion

In this blog, I have talked about the two different methods you can use to set up a connection from MongoDB to Redshift in a seamless fashion: using custom ETL scripts, and with the help of a third-party tool, LIKE.TG.

Beyond the benefits described above, you can use LIKE.TG to migrate data from an array of different sources: databases, cloud applications, SDKs, and more. This gives you the flexibility to replicate data from a source like MongoDB to Redshift.