Integrate TiDB Cloud with dbt
Data build tool (dbt) is a popular open-source data transformation tool that helps analytics engineers transform data in their warehouses through SQL statements. Through the dbt-tidb plugin, analytics engineers working with TiDB Cloud can directly create forms and match data through SQL without having to think about the process of creating tables or views.
This document introduces how to use dbt with TiDB Cloud, taking a dbt project as an example.
Step 1: Install dbt and dbt-tidb
You can install dbt and dbt-tidb using only one command. In the following command, dbt is installed as a dependency when you install dbt-tidb.
pip install dbt-tidb
You can also install dbt separately. See How to install dbt in the dbt documentation.
Step 2: Create a demo project
To try out the dbt function, you can use jaffle_shop, a demo project provided by dbt-lab. You can clone the project directly from GitHub:
git clone https://github.com/dbt-labs/jaffle_shop && \
cd jaffle_shop
All files in the jaffle_shop
directory are structured as follows:
.
├── LICENSE
├── README.md
├── dbt_project.yml
├── etc
│ ├── dbdiagram_definition.txt
│ └── jaffle_shop_erd.png
├── models
│ ├── customers.sql
│ ├── docs.md
│ ├── orders.sql
│ ├── overview.md
│ ├── schema.yml
│ └── staging
│ ├── schema.yml
│ ├── stg_customers.sql
│ ├── stg_orders.sql
│ └── stg_payments.sql
└── seeds
├── raw_customers.csv
├── raw_orders.csv
└── raw_payments.csv
In this directory:
dbt_project.yml
is the dbt project configuration file, which holds the project name and database configuration file information.The
models
directory contains the project’s SQL models and table schemas. Note that the data analyst writes this section. For more information about models, see SQL models.The
seeds
directory stores the CSV files that are dumped by the database export tools. For example, you can export the TiDB Cloud data into CSV files through Dumpling. In thejaffle_shop
project, these CSV files are used as raw data to be processed.
Step 3: Configure the project
To configure the project, take the following steps:
Complete the global configuration.
You can refer to Description of profile fields and edit the default global profile,
~/.dbt/profiles.yml
, to configure the connection with TiDB Cloud:sudo vi ~/.dbt/profiles.yml
In the editor, add the following configuration:
jaffle_shop_tidb: # Project name target: dev # Target outputs: dev: type: tidb # The specific adapter to use server: gateway01.ap-southeast-1.prod.aws.tidbcloud.com # The TiDB Cloud clusters' endpoint to connect to port: 4000 # The port to use schema: analytics # Specify the schema (database) to normalize data into username: xxxxxxxxxxx.root # The username to use to connect to the TiDB Cloud clusters password: "your_password" # The password to use for authenticating to the TiDB Cloud clusters
You can get the values of
server
,port
, andusername
from the Connect dialog in TiDB Cloud console.Complete the project configuration.
In the jaffle_shop project directory, edit the project configuration file
dbt_project.yml
and change theprofile
field tojaffle_shop_tidb
. This configuration allows the project to query from the database as specified in the~/.dbt/profiles.yml
file.vi dbt_project.yml
In the editor, update the configuration as follows:
name: 'jaffle_shop' config-version: 2 version: '0.1' profile: 'jaffle_shop_tidb' # note the modification here model-paths: ["models"] # model path seed-paths: ["seeds"] # seed path test-paths: ["tests"] analysis-paths: ["analysis"] macro-paths: ["macros"] target-path: "target" clean-targets: - "target" - "dbt_modules" - "logs" require-dbt-version: [">=1.0.0", "<2.0.0"] models: jaffle_shop: materialized: table # *.sql which in models/ would be materialized to table staging: materialized: view # *.sql which in models/staging/ would bt materialized to view
Verify the configuration.
Run the following command to check whether the database and project configuration is correct.
dbt debug
Step 4: (optional) Load CSV files
Now that you have successfully created and configured the project, it’s time to load the CSV data and materialize the CSV as a table in the target database.
Load the CSV data and materialize the CSV as a table in the target database.
dbt seed
The following is an example output:
Running with dbt=1.0.1 Partial parse save file not found. Starting full parse. Found 5 models, 20 tests, 0 snapshots, 0 analyses, 172 macros, 0 operations, 3 seed files, 0 sources, 0 exposures, 0 metrics Concurrency: 1 threads (target='dev') 1 of 3 START seed file analytics.raw_customers.................................. [RUN] 1 of 3 OK loaded seed file analytics.raw_customers.............................. [INSERT 100 in 0.19s] 2 of 3 START seed file analytics.raw_orders..................................... [RUN] 2 of 3 OK loaded seed file analytics.raw_orders................................. [INSERT 99 in 0.14s] 3 of 3 START seed file analytics.raw_payments................................... [RUN] 3 of 3 OK loaded seed file analytics.raw_payments............................... [INSERT 113 in 0.24s]
As you can see in the results, the seed file was started and loaded into three tables:
analytics.raw_customers
,analytics.raw_orders
, andanalytics.raw_payments
.Verify the results in TiDB Cloud.
The
show databases
command lists the newanalytics
database that dbt has created. Theshow tables
command indicates that there are three tables in theanalytics
database, corresponding to the ones you have created.mysql> SHOW DATABASES; +--------------------+ | Database | +--------------------+ | INFORMATION_SCHEMA | | METRICS_SCHEMA | | PERFORMANCE_SCHEMA | | analytics | | io_replicate | | mysql | | test | +--------------------+ 7 rows in set (0.00 sec) mysql> USE ANALYTICS; mysql> SHOW TABLES; +---------------------+ | Tables_in_analytics | +---------------------+ | raw_customers | | raw_orders | | raw_payments | +---------------------+ 3 rows in set (0.00 sec) mysql> SELECT * FROM raw_customers LIMIT 10; +------+------------+-----------+ | id | first_name | last_name | +------+------------+-----------+ | 1 | Michael | P. | | 2 | Shawn | M. | | 3 | Kathleen | P. | | 4 | Jimmy | C. | | 5 | Katherine | R. | | 6 | Sarah | R. | | 7 | Martin | M. | | 8 | Frank | R. | | 9 | Jennifer | F. | | 10 | Henry | W. | +------+------------+-----------+ 10 rows in set (0.10 sec)
Step 5: Transform data
Now you are ready to run the configured project and finish the data transformation.
Run the dbt project to finish the data transformation:
dbt run
The following is an example output:
Running with dbt=1.0.1 Found 5 models, 20 tests, 0 snapshots, 0 analyses, 170 macros, 0 operations, 3 seed files, 0 sources, 0 exposures, 0 metrics Concurrency: 1 threads (target='dev') 1 of 5 START view model analytics.stg_customers................................. [RUN] 1 of 5 OK created view model analytics.stg_customers............................ [SUCCESS 0 in 0.31s] 2 of 5 START view model analytics.stg_orders.................................... [RUN] 2 of 5 OK created view model analytics.stg_orders............................... [SUCCESS 0 in 0.23s] 3 of 5 START view model analytics.stg_payments.................................. [RUN] 3 of 5 OK created view model analytics.stg_payments............................. [SUCCESS 0 in 0.29s] 4 of 5 START table model analytics.customers.................................... [RUN] 4 of 5 OK created table model analytics.customers............................... [SUCCESS 0 in 0.76s] 5 of 5 START table model analytics.orders....................................... [RUN] 5 of 5 OK created table model analytics.orders.................................. [SUCCESS 0 in 0.63s] Finished running 3 view models, 2 table models in 2.27s. Completed successfully Done. PASS=5 WARN=0 ERROR=0 SKIP=0 TOTAL=5
The result shows that two tables (
analytics.customers
andanalytics.orders
), and three views (analytics.stg_customers
,analytics.stg_orders
, andanalytics.stg_payments
) are created successfully.Go to TiDB Cloud to verify that the transformation is successful.
mysql> USE ANALYTICS; mysql> SHOW TABLES; +---------------------+ | Tables_in_analytics | +---------------------+ | customers | | orders | | raw_customers | | raw_orders | | raw_payments | | stg_customers | | stg_orders | | stg_payments | +---------------------+ 8 rows in set (0.00 sec) mysql> SELECT * FROM customers LIMIT 10; +-------------+------------+-----------+-------------+-------------------+------------------+-------------------------+ | customer_id | first_name | last_name | first_order | most_recent_order | number_of_orders | customer_lifetime_value | +-------------+------------+-----------+-------------+-------------------+------------------+-------------------------+ | 1 | Michael | P. | 2018-01-01 | 2018-02-10 | 2 | 33.0000 | | 2 | Shawn | M. | 2018-01-11 | 2018-01-11 | 1 | 23.0000 | | 3 | Kathleen | P. | 2018-01-02 | 2018-03-11 | 3 | 65.0000 | | 4 | Jimmy | C. | NULL | NULL | NULL | NULL | | 5 | Katherine | R. | NULL | NULL | NULL | NULL | | 6 | Sarah | R. | 2018-02-19 | 2018-02-19 | 1 | 8.0000 | | 7 | Martin | M. | 2018-01-14 | 2018-01-14 | 1 | 26.0000 | | 8 | Frank | R. | 2018-01-29 | 2018-03-12 | 2 | 45.0000 | | 9 | Jennifer | F. | 2018-03-17 | 2018-03-17 | 1 | 30.0000 | | 10 | Henry | W. | NULL | NULL | NULL | NULL | +-------------+------------+-----------+-------------+-------------------+------------------+-------------------------+ 10 rows in set (0.00 sec)
The output shows that five more tables or views have been added, and the data in the tables or views has been transformed. Only part of the data from the customer table is shown in this example.
Step 6: Generate the document
dbt lets you generate visual documents that display the overall structure of the project and describe all the tables and views.
To generate visual documents, take the following steps:
Generate the document:
dbt docs generate
Start the server:
dbt docs serve
To access the document from your browser, go to http://localhost:8080.
Description of profile fields
Option | Description | Required? | Example |
---|---|---|---|
type | The specific adapter to use | Required | tidb |
server | The TiDB Cloud clusters' endpoint to connect to | Required | gateway01.ap-southeast-1.prod.aws.tidbcloud.com |
port | The port to use | Required | 4000 |
schema | The schema (database) to normalize data into | Required | analytics |
username | The username to use to connect to the TiDB Cloud clusters | Required | xxxxxxxxxxx.root |
password | The password to use for authenticating to the TiDB Cloud clusters | Required | "your_password" |
retries | The retry times for connection to TiDB Cloud clusters (1 by default) | Optional | 2 |
Supported functions
You can use the following functions directly in dbt-tidb. For information about how to use them, see dbt-util.
The following functions are supported:
bool_or
cast_bool_to_text
dateadd
datediff
. Note thatdatediff
is a little different from dbt-util. It rounds down rather than rounds up.date_trunc
hash
safe_cast
split_part
last_day
cast_bool_to_text
concat
escape_single_quotes
except
intersect
length
position
replace
right