DynamoDB to BigQuery ETL: 3 Easy Steps to Move Data
LIKE.TG 成立于2020年,总部位于马来西亚,是首家汇集全球互联网产品,提供一站式软件产品解决方案的综合性品牌。唯一官方网站:www.like.tg
If you wish to move your data from DynamoDB to BigQuery, then you are on the right page. This post aims to help you understand the methods to move data from DynamoDB to BigQuery. But, before we get there, it is important to briefly understand the features of DynamoDB and BigQuery.
Introduction to DynamoDB and Google BigQuery
DynamoDB and BigQuery are popular, fully managed cloud databases provided by the two biggest names in Tech. Having launched for business in 2012 and 2010 respectively, these come as part of a host of services offered by their respective suite of services. This makes the typical user wanting to stick to just one, a decision that solidifies as one looks into the cumbersome process of setting up and maximizing the potential of having both these up and running parallelly. That being said, businesses still end up doing this for a variety of reasons, and therein lies the relevance of discussing this topic.
As mentioned before, because these services are offered by two different companies that want everything to be done within their tool suite, it is a non-trivial task to move data seamlessly from one to the other. Here are the two ways to move data from DynamoDB to BigQuery:
1) Using LIKE.TG Data: An easy-to-use integration platform that gets the job done with minimal effort.
2) Using Custom Scripts: You can custom build your ETL pipeline by hand-coding scripts.
This article aims to guide the ones that have opted to move data on their own from DynamoDB to BigQuery. The blog would be able to guide you with a step-by-step process, make you aware of the pitfalls and provide suggestions to overcome them.
Steps to Move Data from DynamoDB to Bigquery using Custom Code Method
Below are the broad steps that you would need to take to migrate your data from DynamoDB to BigQuery. Each of these steps is further detailed in the rest of the article.
- Step 1: Export the DynamoDB Data onto Amazon S3
- Step 2: Setting Up Google Cloud Storage and Copy Data from Amazon S3
- Step 3: Import the Google Cloud Storage File into the BigQuery Table
Step 1: Export the DynamoDB Data onto Amazon S3
The very first step is to transfer the source DynamoDB data to Amazon S3. Both S3 and GCS(Google Cloud Storage) support CSV as well as JSON files but for demonstration purposes, let’s take the CSV example. The actual export from DynamoDB to S3 can be done using the Command Line or via the AWS Console.
Method 1
The command-line method is a two-step process. First, you export the table data into a CSV file:
$aws dynamodb scan --table-name LIKE.TG _dynamo --output > LIKE.TG .txt
The above would produce a tab-separated output file which can then be easily converted to a CSV file. This CSV file (LIKE.TG .csv, let’s say) could then be uploaded to an S3 bucket using the following command:
$aws s3 cp LIKE.TG .csv s3://LIKE.TG bucket/LIKE.TG .csv
Method 2
If you prefer to use the console, sign in to your Amazon Console here.
The steps to be followed on the console are mentioned in detail in the AWS documentation here.
Step 2: Setting Up Google Cloud Storage and Copy Data from Amazon S3
The next step is to move the S3 data file onto Google Cloud Storage. As before, there is a command-line path as well as the GUI method to get this done. Let’s go through the former first.
- Using gsutil
gsutil is a command-line service to access and do a number of things on Google Cloud; primarily it is used to work with the GCS buckets.
To create a new bucket the following command could be used:
$gsutil mb gs://LIKE.TG _gc/LIKE.TG
You could mention a bunch of parameters in the above command to specify the cloud location, retention, etc. (full list here under ‘Options’) per your requirements. An interesting thing about BigQuery is that it generally loads uncompressed CSV files faster than compressed ones. Hence, unless you are sure of what you are doing, you probably shouldn’t run a compression utility like gzip on the CSV file for the next step. Another thing to keep in mind with GCS and your buckets is setting up access control. Here are all the details you will need on that.
The next step is to copy the S3 file onto this newly created GCS bucket. The following copy command gets that job done:
$gsutil cp s3://LIKE.TG _s3/LIKE.TG .csv/ gs://LIKE.TG _gc/LIKE.TG .csv
- BigQuery Data Transfer Service
This is a relatively new and faster way to get the same thing done. Both CSV and JSON files are supported by this service however there are limitations that could be found here and here. Further documentation and the detailed steps on how to go about this can be found here.
Step 3: Import the Google Cloud Storage File into the BigQuery Table
Every BigQuery table lies in a specific data set of a specific project. Hence, the following steps are to be executed in the same order:
- Create a new project.
- Create a data set.
- Run the bq load command to load the data into a table.
- The first step is to create a project. Sign in on the BigQuery Web UI. Click on the hamburger button ( ) and select APIs & Services. Click Create Project and provide a project name (Let’s say ‘LIKE.TG _project’). Now you need to enable BigQuery for which search for the same and click on Enable. Your project is now created with BigQuery enabled.
- The next step is to create a data set. This can be quickly done using the bq command-line tool and the command is called mk. Create a new data set using the following command:
$bq mk LIKE.TG _dataset
- At this point, you are ready to import the GCS file into a table in this data set. The load command of bq lets you do the same. It’s slightly more complicated than the mk command so let’s go through the basic syntax first.
Bq load command syntax -
$bq load project:dataset.table --autodetect --source_format
- autodetect is a parameter used to automatically detect the schema from the source file and is generally recommended. Hence, the following command should do the job for you:
$bq load LIKE.TG _project:LIKE.TG _dataset.LIKE.TG _table --autodetect
--source_format=CSV gs://LIKE.TG _gc/LIKE.TG .csv
The GCS file gets loaded into the table LIKE.TG _table.
- If no table exists under the name ‘LIKE.TG _table’ the above load command creates a new table.
- If LIKE.TG _table is an existing table there are two types of load available to bring the source data into this table – Overwrite or Table Append.
Here’s the command to overwrite or replace:
$bq load LIKE.TG _project:LIKE.TG _dataset.LIKE.TG _table --autodetect --replace
--source_format=CSV gs://LIKE.TG _gc/LIKE.TG .csv
Here’s the command to append data:
$bq load LIKE.TG _project:LIKE.TG _dataset.LIKE.TG _table --autodetect --noreplace
--source_format=CSV gs://LIKE.TG _gc/LIKE.TG .csv
You should be careful with the append in terms of unique key constraints as BigQuery doesn’t enforce it on its tables.
Incremental load – Type 1/ Upsert
In this type of incremental load, a new record from the source is either inserted as a new record in the target table or replaces an existing record in the target table.
Let’s say the source (LIKE.TG .csv) looks like this:
And the target table (LIKE.TG _table) looks like this:
Post incremental load, LIKE.TG _table will look like this:
The way to do this would be to load the LIKE.TG .csv into a separate table (staging table) first, let’s call it, LIKE.TG _intermediate. This staging table is then compared with the target table to perform the upsert as follows:
INSERT LIKE.TG _dataset.LIKE.TG _table (id, name, salary, date)
SELECT id, name, salary, date
FROM LIKE.TG _dataset.LIKE.TG _intermediate
WHERE NOT id IN (SELECT id FROM LIKE.TG _dataset.LIKE.TG _intermediate);
UPDATE LIKE.TG _dataset.LIKE.TG _table h
SET h.name = i.name,
h.salary = i.salary,
h.date = i.date
FROM LIKE.TG _dataset.LIKE.TG _intermediate i
WHERE h.id = i.id;
Incremental load – Type 2/ Append Only
In this type of incremental load, a new record from the source is always inserted into the target table if at least one of the fields has a different value from the target. This is quite useful to understand the history of data changes for a particular field and helps drive business decisions.
Let’s take the same example as before. The target table in this scenario would look like the following:
To write the code for this scenario, you first insert all the records from the source to the target table as below:
INSERT LIKE.TG _dataset.LIKE.TG _table (id, name, salary, date)
SELECT id, name, salary, date
FROM LIKE.TG _dataset.LIKE.TG _intermediate;
Next, you delete the duplicate records (all fields have the same value) using the window function like this:
DELETE FROM (SELECT id, name, salary, date, ROW_NUMBER() OVER(PARTITION BY id, name, salary, date) rn
FROM LIKE.TG _dataset.LIKE.TG _table)
WHERE rn <> 1;
Hurray! You have successfully migrated your data from DynamoDB to BigQuery.
Limitations of Moving Data from DynamoDB to BigQuery using Custom Code Method
As you have seen now, Data Replication from DynamoDB to BigQuery is a lengthy and time-consuming process. Furthermore, you have to take care of the following situations:
- The example discussed in this article is to demonstrate copying over a single file from DynamoDB to BigQuery. In reality, hundreds of tables would have to be synced periodically or close to real-time; to manage that and not be vulnerable to data loss and data inconsistencies is quite the task.
- There are sometimes subtle, characteristic variations between services, especially when the vendors are different. It could happen in file Size Limits, Encoding, Date Format, etc. These things may go unnoticed while setting up the process and if not taken care of before kicking off Data Migration, it could lead to loss of data.
So, to overcome these limitations to migrate your data from DynamoDB to BigQuery, let’s discuss an easier alternative – LIKE.TG .
The tedious task of setting this up as well as the points of concern mentioned above does not make the ‘custom method’ endeavor a suggestible one. You can save a lot of time and effort by implementing an integration service like LIKE.TG and focus more on looking at the data and generating insights from it. Here is how you can migrate your data from DynamoDB to BigQuery using LIKE.TG :
- Connect and configure your DynamoDB Data Source.
- Select the Replication mode: (i) Full dump (ii) Incremental load for append-only data (iii) Incremental load for mutable data.
- Configure your Google BigQuery Data Warehouse where you want to move data.
Conclusion
In this article, you got a detailed understanding of how to export DynamoDB to BigQuery using Custom code. You also learned some of the limitations associated with this method. Hence, you were introduced to an easier alternative- LIKE.TG to migrate your data from DynamoDB to BigQuery seamlessly.
With LIKE.TG , you can move data in real-time from DynamoDb to BigQuery in a reliable, secure, and hassle-free fashion. In addition to this, LIKE.TG has 150+ native data source integrations that work out of the box. You could explore the integrations here.
VISIT OUR WEBSITE TO EXPLORE LIKE.TGBefore you go ahead and take a call on the right approach to move data from DynamoDB to BigQuery, you should try LIKE.TG for once.
SIGN UP to experience LIKE.TG ’s hassle-free Data Pipeline platform.
Share your experience of moving data from DynamoDB to BigQuery in the comments section below!
LIKE.TG 专注全球社交流量推广,致力于为全球出海企业提供有关的私域营销获客、国际电商、全球客服、金融支持等最新资讯和实用工具。免费领取【WhatsApp、LINE、Telegram、Twitter、ZALO】等云控系统试用;点击【联系客服】 ,或关注【LIKE.TG出海指南频道】、【LIKE.TG生态链-全球资源互联社区】了解更多最新资讯
本文由LIKE.TG编辑部转载自互联网并编辑,如有侵权影响,请联系官方客服,将为您妥善处理。
This article is republished from public internet and edited by the LIKE.TG editorial department. If there is any infringement, please contact our official customer service for proper handling.