Sign InTry Free

Integrate TiDB Cloud with dbt

Data build tool (dbt) is a popular open-source data transformation tool that helps analytics engineers transform data in their warehouses through SQL statements. Through the dbt-tidb plugin, analytics engineers working with TiDB Cloud can directly create forms and match data through SQL without having to think about the process of creating tables or views.

This document introduces how to use dbt with TiDB Cloud, taking a dbt project as an example.

Step 1: Install dbt and dbt-tidb

You can install dbt and dbt-tidb using only one command. In the following command, dbt is installed as a dependency when you install dbt-tidb.

pip install dbt-tidb

You can also install dbt separately. See How to install dbt in the dbt documentation.

Step 2: Create a demo project

To try out the dbt function, you can use jaffle_shop, a demo project provided by dbt-lab. You can clone the project directly from GitHub:

git clone https://github.com/dbt-labs/jaffle_shop && \
cd jaffle_shop

All files in the jaffle_shop directory are structured as follows:

.
├── LICENSE
├── README.md
├── dbt_project.yml
├── etc
│    ├── dbdiagram_definition.txt
│    └── jaffle_shop_erd.png
├── models
│    ├── customers.sql
│    ├── docs.md
│    ├── orders.sql
│    ├── overview.md
│    ├── schema.yml
│    └── staging
│        ├── schema.yml
│        ├── stg_customers.sql
│        ├── stg_orders.sql
│        └── stg_payments.sql
└── seeds
    ├── raw_customers.csv
    ├── raw_orders.csv
    └── raw_payments.csv

In this directory:

  • dbt_project.yml is the dbt project configuration file, which holds the project name and database configuration file information.

  • The models directory contains the project’s SQL models and table schemas. Note that the data analyst writes this section. For more information about models, see SQL models.

  • The seeds directory stores the CSV files that are dumped by the database export tools. For example, you can export the TiDB Cloud data into CSV files through Dumpling. In the jaffle_shop project, these CSV files are used as raw data to be processed.

Step 3: Configure the project

To configure the project, take the following steps:

  1. Complete the global configuration.

    You can refer to Description of profile fields and edit the default global profile, ~/.dbt/profiles.yml, to configure the connection with TiDB Cloud:

    sudo vi ~/.dbt/profiles.yml
    

    In the editor, add the following configuration:

     jaffle_shop_tidb:                                                 # Project name
       target: dev                                                     # Target
       outputs:
         dev:
           type: tidb                                                  # The specific adapter to use
           server: gateway01.ap-southeast-1.prod.aws.tidbcloud.com     # The TiDB Cloud clusters' endpoint to connect to
           port: 4000                                                  # The port to use
           schema: analytics                                           # Specify the schema (database) to normalize data into
           username: xxxxxxxxxxx.root                                  # The username to use to connect to the TiDB Cloud clusters
           password: "your_password"                                   # The password to use for authenticating to the TiDB Cloud clusters
    

    You can get the values of server, port, and username from the Connect dialog in TiDB Cloud console.

  2. Complete the project configuration.

    In the jaffle_shop project directory, edit the project configuration file dbt_project.yml and change the profile field to jaffle_shop_tidb. This configuration allows the project to query from the database as specified in the ~/.dbt/profiles.yml file.

    vi dbt_project.yml
    

    In the editor, update the configuration as follows:

    name: 'jaffle_shop'
    
    config-version: 2
    version: '0.1'
    
    profile: 'jaffle_shop_tidb'                   # note the modification here
    
    model-paths: ["models"]                       # model path
    seed-paths: ["seeds"]                         # seed path
    test-paths: ["tests"]
    analysis-paths: ["analysis"]
    macro-paths: ["macros"]
    
    target-path: "target"
    clean-targets:
        - "target"
        - "dbt_modules"
        - "logs"
    
    require-dbt-version: [">=1.0.0", "<2.0.0"]
    
    models:
      jaffle_shop:
          materialized: table            # *.sql which in models/ would be materialized to table
          staging:
            materialized: view           # *.sql which in models/staging/ would bt materialized to view
    
  3. Verify the configuration.

    Run the following command to check whether the database and project configuration is correct.

    dbt debug
    

Step 4: (optional) Load CSV files

Now that you have successfully created and configured the project, it’s time to load the CSV data and materialize the CSV as a table in the target database.

  1. Load the CSV data and materialize the CSV as a table in the target database.

    dbt seed
    

    The following is an example output:

    Running with dbt=1.0.1
    Partial parse save file not found. Starting full parse.
    Found 5 models, 20 tests, 0 snapshots, 0 analyses, 172 macros, 0 operations, 3 seed files, 0 sources, 0 exposures, 0 metrics
    
    Concurrency: 1 threads (target='dev')
    
    1 of 3 START seed file analytics.raw_customers.................................. [RUN]
    1 of 3 OK loaded seed file analytics.raw_customers.............................. [INSERT 100 in 0.19s]
    2 of 3 START seed file analytics.raw_orders..................................... [RUN]
    2 of 3 OK loaded seed file analytics.raw_orders................................. [INSERT 99 in 0.14s]
    3 of 3 START seed file analytics.raw_payments................................... [RUN]
    3 of 3 OK loaded seed file analytics.raw_payments............................... [INSERT 113 in 0.24s]
    

    As you can see in the results, the seed file was started and loaded into three tables: analytics.raw_customers, analytics.raw_orders, and analytics.raw_payments.

  2. Verify the results in TiDB Cloud.

    The show databases command lists the new analytics database that dbt has created. The show tables command indicates that there are three tables in the analytics database, corresponding to the ones you have created.

    mysql> SHOW DATABASES;
    +--------------------+
    | Database           |
    +--------------------+
    | INFORMATION_SCHEMA |
    | METRICS_SCHEMA     |
    | PERFORMANCE_SCHEMA |
    | analytics          |
    | io_replicate       |
    | mysql              |
    | test               |
    +--------------------+
    7 rows in set (0.00 sec)
    
    mysql> USE ANALYTICS;
    mysql> SHOW TABLES;
    +---------------------+
    | Tables_in_analytics |
    +---------------------+
    | raw_customers       |
    | raw_orders          |
    | raw_payments        |
    +---------------------+
    3 rows in set (0.00 sec)
    
    mysql> SELECT * FROM raw_customers LIMIT 10;
    +------+------------+-----------+
    | id   | first_name | last_name |
    +------+------------+-----------+
    |    1 | Michael    | P.        |
    |    2 | Shawn      | M.        |
    |    3 | Kathleen   | P.        |
    |    4 | Jimmy      | C.        |
    |    5 | Katherine  | R.        |
    |    6 | Sarah      | R.        |
    |    7 | Martin     | M.        |
    |    8 | Frank      | R.        |
    |    9 | Jennifer   | F.        |
    |   10 | Henry      | W.        |
    +------+------------+-----------+
    10 rows in set (0.10 sec)
    

Step 5: Transform data

Now you are ready to run the configured project and finish the data transformation.

  1. Run the dbt project to finish the data transformation:

    dbt run
    

    The following is an example output:

    Running with dbt=1.0.1
    Found 5 models, 20 tests, 0 snapshots, 0 analyses, 170 macros, 0 operations, 3 seed files, 0 sources, 0 exposures, 0 metrics
    
    Concurrency: 1 threads (target='dev')
    
    1 of 5 START view model analytics.stg_customers................................. [RUN]
    1 of 5 OK created view model analytics.stg_customers............................ [SUCCESS 0 in 0.31s]
    2 of 5 START view model analytics.stg_orders.................................... [RUN]
    2 of 5 OK created view model analytics.stg_orders............................... [SUCCESS 0 in 0.23s]
    3 of 5 START view model analytics.stg_payments.................................. [RUN]
    3 of 5 OK created view model analytics.stg_payments............................. [SUCCESS 0 in 0.29s]
    4 of 5 START table model analytics.customers.................................... [RUN]
    4 of 5 OK created table model analytics.customers............................... [SUCCESS 0 in 0.76s]
    5 of 5 START table model analytics.orders....................................... [RUN]
    5 of 5 OK created table model analytics.orders.................................. [SUCCESS 0 in 0.63s]
    
    Finished running 3 view models, 2 table models in 2.27s.
    
    Completed successfully
    
    Done. PASS=5 WARN=0 ERROR=0 SKIP=0 TOTAL=5
    

    The result shows that two tables (analytics.customers and analytics.orders), and three views (analytics.stg_customers, analytics.stg_orders, and analytics.stg_payments) are created successfully.

  2. Go to TiDB Cloud to verify that the transformation is successful.

    mysql> USE ANALYTICS;
    mysql> SHOW TABLES;
    +---------------------+
    | Tables_in_analytics |
    +---------------------+
    | customers           |
    | orders              |
    | raw_customers       |
    | raw_orders          |
    | raw_payments        |
    | stg_customers       |
    | stg_orders          |
    | stg_payments        |
    +---------------------+
    8 rows in set (0.00 sec)
    
    mysql> SELECT * FROM customers LIMIT 10;
    +-------------+------------+-----------+-------------+-------------------+------------------+-------------------------+
    | customer_id | first_name | last_name | first_order | most_recent_order | number_of_orders | customer_lifetime_value |
    +-------------+------------+-----------+-------------+-------------------+------------------+-------------------------+
    |           1 | Michael    | P.        | 2018-01-01  | 2018-02-10        |                2 |                 33.0000 |
    |           2 | Shawn      | M.        | 2018-01-11  | 2018-01-11        |                1 |                 23.0000 |
    |           3 | Kathleen   | P.        | 2018-01-02  | 2018-03-11        |                3 |                 65.0000 |
    |           4 | Jimmy      | C.        | NULL        | NULL              |             NULL |                    NULL |
    |           5 | Katherine  | R.        | NULL        | NULL              |             NULL |                    NULL |
    |           6 | Sarah      | R.        | 2018-02-19  | 2018-02-19        |                1 |                  8.0000 |
    |           7 | Martin     | M.        | 2018-01-14  | 2018-01-14        |                1 |                 26.0000 |
    |           8 | Frank      | R.        | 2018-01-29  | 2018-03-12        |                2 |                 45.0000 |
    |           9 | Jennifer   | F.        | 2018-03-17  | 2018-03-17        |                1 |                 30.0000 |
    |          10 | Henry      | W.        | NULL        | NULL              |             NULL |                    NULL |
    +-------------+------------+-----------+-------------+-------------------+------------------+-------------------------+
    10 rows in set (0.00 sec)
    

    The output shows that five more tables or views have been added, and the data in the tables or views has been transformed. Only part of the data from the customer table is shown in this example.

Step 6: Generate the document

dbt lets you generate visual documents that display the overall structure of the project and describe all the tables and views.

To generate visual documents, take the following steps:

  1. Generate the document:

    dbt docs generate
    
  2. Start the server:

    dbt docs serve
    
  3. To access the document from your browser, go to http://localhost:8080.

Description of profile fields

OptionDescriptionRequired?Example
typeThe specific adapter to useRequiredtidb
serverThe TiDB Cloud clusters' endpoint to connect toRequiredgateway01.ap-southeast-1.prod.aws.tidbcloud.com
portThe port to useRequired4000
schemaThe schema (database) to normalize data intoRequiredanalytics
usernameThe username to use to connect to the TiDB Cloud clustersRequiredxxxxxxxxxxx.root
passwordThe password to use for authenticating to the TiDB Cloud clustersRequired"your_password"
retriesThe retry times for connection to TiDB Cloud clusters (1 by default)Optional2

Supported functions

You can use the following functions directly in dbt-tidb. For information about how to use them, see dbt-util.

The following functions are supported:

  • bool_or
  • cast_bool_to_text
  • dateadd
  • datediff. Note that datediff is a little different from dbt-util. It rounds down rather than rounds up.
  • date_trunc
  • hash
  • safe_cast
  • split_part
  • last_day
  • cast_bool_to_text
  • concat
  • escape_single_quotes
  • except
  • intersect
  • length
  • position
  • replace
  • right
Download PDFRequest docs changesAsk questions on TiDB Forum
Was this page helpful?
Open Source Ecosystem
TiDB
TiKV
TiSpark
Chaos Mesh
© 2023 PingCAP. All Rights Reserved.