Sign InTry Free

Migrate and Merge MySQL Shards of Large Datasets to TiDB Cloud

This document describes how to migrate and merge a large MySQL dataset (for example, more than 1 TiB) from different partitions into TiDB Cloud. After full data migration, you can use TiDB Data Migration (DM) to perform incremental migration according to your business needs.

The example in this document uses a complex shard migration task across multiple MySQL instances, and involves handling conflicts in auto-increment primary keys. The scenario in this example is also applicable to merging data from different sharded tables within a single MySQL instance.

Environment information in the example

This section describes the basic information of the upstream cluster, DM, and downstream cluster used in the example.

Upstream cluster

The environment information of the upstream cluster is as follows:

  • MySQL version: MySQL v5.7.18

  • MySQL instance1:

    • schema store_01 and table [sale_01, sale_02]
    • schema store_02 and table [sale_01, sale_02]
  • MySQL instance 2:

    • schema store_01and table [sale_01, sale_02]
    • schema store_02and table [sale_01, sale_02]
  • Table structure:

    CREATE TABLE sale_01 (
    id bigint(20) NOT NULL auto_increment,
    uid varchar(40) NOT NULL,
    sale_num bigint DEFAULT NULL,
    PRIMARY KEY (id),
    UNIQUE KEY ind_uid (uid)
    );
    

DM

The version of DM is v5.3.0. You need to deploy TiDB DM manually. For detailed steps, see Deploy a DM Cluster Using TiUP.

External storage

This document uses the Amazon S3 as an example.

Downstream cluster

The sharded schemas and tables are merged into the table store.sales.

Perform full data migration from MySQL to TiDB Cloud

The following is the procedure to migrate and merge full data of MySQL shards to TiDB Cloud.

In the following example, you only need to export the data in tables to CSV format.

Step 1. Create directories in the Amazon S3 bucket

Create a first-level directory store (corresponding to the level of databases) and a second-level directory sales (corresponding to the level of tables) in the Amazon S3 bucket. In sales, create a third-level directory for each MySQL instance (corresponding to the level of MySQL instances). For example:

  • Migrate the data in MySQL instance1 to s3://dumpling-s3/store/sales/instance01/
  • Migrate the data in MySQL instance2 to s3://dumpling-s3/store/sales/instance02/

If there are shards across multiple instances, you can create one first-level directory for each database and create one second-level directory for each sharded table. Then create a third-level directory for each MySQL instance for easy management. For example, if you want to migrate and merge tables stock_N.product_N from MySQL instance1 and MySQL instance2 into the table stock.products in TiDB Cloud, you can create the following directories:

  • s3://dumpling-s3/stock/products/instance01/
  • s3://dumpling-s3/stock/products/instance02/

Step 2. Use Dumpling to export data to Amazon S3

For information about how to install Dumpling, see Dumpling Introduction.

When you use Dumpling to export data to Amazon S3, note the following:

  • Enable binlog for upstream clusters.
  • Choose the correct Amazon S3 directory and region.
  • Choose the appropriate concurrency by configuring the -t option to minimize the impact on the upstream cluster, or export directly from the backup database. For more information about how to use this parameter, see Option list of Dumpling.
  • Set appropriate values for --filetype csv and --no-schemas. For more information about how to use these parameters, see Option list of Dumpling.

Name the CSV files as follows:

  • If the data of one table is separated into multiple CSV files, append a numeric suffix to these CSV files. For example, ${db_name}.${table_name}.000001.csv and ${db_name}.${table_name}.000002.csv. The numeric suffixes can be inconsecutive but must be in ascending order. You also need to add extra zeros before the number to ensure all the suffixes are in the same length.

To export data to Amazon S3, do the following:

  1. Get the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY of the Amazon S3 bucket.

    [root@localhost ~]# export AWS_ACCESS_KEY_ID={your_aws_access_key_id}
    [root@localhost ~]# export AWS_SECRET_ACCESS_KEY= {your_aws_secret_access_key}
    
  2. Export data from MySQL instance1 to the s3://dumpling-s3/store/sales/instance01/ directory in the Amazon S3 bucket.

    [root@localhost ~]# tiup dumpling -u {username} -p {password} -P {port} -h {mysql01-ip} -B store_01,store_02 -r 20000 --filetype csv --no-schemas -o "s3://dumpling-s3/store/sales/instance01/" --s3.region "ap-northeast-1"
    

    For more information about the parameters, see Option list of Dumpling.

  3. Export data from MySQL instance2 to the s3://dumpling-s3/store/sales/instance02/ directory in the Amazon S3 bucket.

    [root@localhost ~]# tiup dumpling -u {username} -p {password} -P {port} -h {mysql02-ip} -B store_01,store_02 -r 20000 --filetype csv --no-schemas -o "s3://dumpling-s3/store/sales/instance02/" --s3.region "ap-northeast-1"
    

For detailed steps, see Export data to Amazon S3 cloud storage.

Step 3. Create schemas in TiDB Cloud cluster

Create schemas in the TiDB Cloud cluster as follows:

mysql> CREATE DATABASE store;
Query OK, 0 rows affected (0.16 sec)
mysql> use store;
Database changed

In this example, the column IDs of the upstream tables sale_01 and sale_02 are auto-increment primary keys. Conflicts might occur when you merge sharded tables in the downstream database. Execute the following SQL statement to set the ID column as a normal index instead of a primary key:

mysql> CREATE TABLE `sales` (
   ->   `id` bigint(20) NOT NULL ,
   ->   `uid` varchar(40) NOT NULL,
   ->   `sale_num` bigint DEFAULT NULL,
   ->   INDEX (`id`),
   ->   UNIQUE KEY `ind_uid` (`uid`)
   -> );
Query OK, 0 rows affected (0.17 sec)

For more information about the solutions to solve such conflicts, see Remove the PRIMARY KEY attribute from the column.

Step 4. Configure Amazon S3 access

Follow the instructions in Configure Amazon S3 access to get the role ARN to access the source data.

The following example only lists key policy configurations. Replace the Amazon S3 path with your own values.

{
   "Version": "2012-10-17",
   "Statement": [
       {
           "Sid": "VisualEditor0",
           "Effect": "Allow",
           "Action": [
               "s3:GetObject",
               "s3:GetObjectVersion"
           ],
           "Resource": [
               "arn:aws:s3:::dumpling-s3/*"
           ]
       },
       {
           "Sid": "VisualEditor1",
           "Effect": "Allow",
           "Action": [
               "s3:ListBucket",
               "s3:GetBucketLocation"
           ],

           "Resource": "arn:aws:s3:::dumpling-s3"
       }
   ]
}

Step 5. Perform the data import task

After configuring the Amazon S3 access, you can perform the data import task in the TiDB Cloud console as follows:

  1. Open the Import page for your target cluster.

    1. Log in to the TiDB Cloud console and navigate to the Clusters page of your project.

    2. Click the name of your target cluster to go to its overview page, and then click Import in the left navigation pane.

  2. On the Import page, click Import Data in the upper-right corner, and then select From S3.

  3. On the Import from S3 page, fill in the following information:

    • Data format: select CSV.
    • Bucket URI: fill in the bucket URI of your source data. You can use the second-level directory corresponding to tables, s3://dumpling-s3/store/sales in this example, so that TiDB Cloud can import and merge the data in all MySQL instances into store.sales in one go.
    • Role ARN: enter the Role-ARN you obtained.

    If the location of the bucket is different from your cluster, confirm the compliance of cross region. Click Next.

    TiDB Cloud starts validating whether it can access your data in the specified bucket URI. After validation, TiDB Cloud tries to scan all the files in the data source using the default file naming pattern, and returns a scan summary result on the left side of the next page. If you get the AccessDenied error, see Troubleshoot Access Denied Errors during Data Import from S3.

  4. Modify the file patterns and add the table filter rules if needed.

    • File Pattern: modify the file pattern if you want to import CSV files whose filenames match a certain pattern to a single target table.

      To modify the file pattern, click Modify, specify a custom mapping rule between CSV files and a single target table in the following fields, and then click Scan.

      • Source file name: enter a pattern that matches the names of the CSV files to be imported. If you have one CSV file only, enter the file name here directly. Note that the names of the CSV files must include the suffix ".csv".

        For example:

        • my-data?.csv: all CSV files starting with my-data and one character (such as my-data1.csv and my-data2.csv) will be imported into the same target table.
        • my-data*.csv: all CSV files starting with my-data will be imported into the same target table.
      • Target table name: enter the name of the target table in TiDB Cloud, which must be in the ${db_name}.${table_name} format. For example, mydb.mytable. Note that this field only accepts one specific table name, so wildcards are not supported.

    • Table Filter: If you want to filter which tables to be imported, you can specify one or more table filter rules in this area.

  5. Click Next.

  6. On the Preview page, you can have a preview of the data. If the previewed data is not what you expect, click the Click here to edit csv configuration link to update the CSV-specific configurations, including separator, delimiter, header, backslash escape, and trim last separator.

  7. Click Start Import.

  8. When the import progress shows Finished, check the imported tables.

After the data is imported, if you want to remove the Amazon S3 access of TiDB Cloud, simply delete the policy that you added.

Perform incremental data replication from MySQL to TiDB Cloud

To replicate the data changes based on binlog from a specified position in the upstream cluster to TiDB Cloud, you can use TiDB Data Migration (DM) to perform incremental replication.

Before you begin

The TiDB Cloud console does not provide any feature about incremental data replication yet. You need to deploy TiDB DM to migrate incremental data. For detailed steps, see Deploy a DM Cluster Using TiUP.

Step 1. Add the data source

  1. Create a new data source file dm-source1.yaml to configure an upstream data source into DM. Add the following content:

    # MySQL Configuration.
    source-id: "mysql-replica-01"
    # Specifies whether DM-worker pulls binlogs with GTID (Global Transaction Identifier).
    # The prerequisite is that you have already enabled GTID in the upstream MySQL.
    # If you have configured the upstream database service to switch master between different nodes automatically, you must enable GTID.
    enable-gtid: true
    from:
     host: "${host}"           # For example: 192.168.10.101
     user: "user01"
     password: "${password}"   # Plaintext passwords are supported but not recommended. It is recommended that you use dmctl encrypt to encrypt plaintext passwords.
     port: ${port}             # For example: 3307
    
  2. Create another new data source file dm-source2.yaml, and add the following content:

    # MySQL Configuration.
    source-id: "mysql-replica-02"
    # Specifies whether DM-worker pulls binlogs with GTID (Global Transaction Identifier).
    # The prerequisite is that you have already enabled GTID in the upstream MySQL.
    # If you have configured the upstream database service to switch master between different nodes automatically, you must enable GTID.
    enable-gtid: true
    from:
     host: "192.168.10.102"
     user: "user02"
     password: "${password}"
     port: 3308
    
  3. Run the following command in a terminal. Use tiup dmctl to load the first data source configuration into the DM cluster:

    [root@localhost ~]# tiup dmctl --master-addr ${advertise-addr} operate-source create dm-source1.yaml
    

    The parameters used in the command above are described as follows:

    ParameterDescription
    --master-addrThe {advertise-addr} of any DM-master node in the cluster where dmctl is to be connected. For example: 192.168.11.110:9261
    operate-source createLoads the data source to the DM cluster.

    The following is an example output:

    tiup is checking updates for component dmctl ...
    
    Starting component `dmctl`: /root/.tiup/components/dmctl/${tidb_version}/dmctl/dmctl /root/.tiup/components/dmctl/${tidb_version}/dmctl/dmctl --master-addr 192.168.11.110:9261 operate-source create dm-source1.yaml
    
    {
       "result": true,
       "msg": "",
       "sources": [
           {
               "result": true,
               "msg": "",
               "source": "mysql-replica-01",
               "worker": "dm-192.168.11.111-9262"
           }
       ]
    }
    
    
  4. Run the following command in a terminal. Use tiup dmctl to load the second data source configuration into the DM cluster:

    [root@localhost ~]# tiup dmctl --master-addr 192.168.11.110:9261 operate-source create dm-source2.yaml
    

    The following is an example output:

    tiup is checking updates for component dmctl ...
    
    Starting component `dmctl`: /root/.tiup/components/dmctl/${tidb_version}/dmctl/dmctl /root/.tiup/components/dmctl/${tidb_version}/dmctl/dmctl --master-addr 192.168.11.110:9261 operate-source create dm-source2.yaml
    
    {
       "result": true,
       "msg": "",
       "sources": [
           {
               "result": true,
               "msg": "",
               "source": "mysql-replica-02",
               "worker": "dm-192.168.11.112-9262"
           }
       ]
    }
    

Step 2. Create a replication task

  1. Create a test-task1.yaml file for the replication task.

  2. Find the starting point in the metadata file of MySQL instance1 exported by Dumpling. For example:

    Started dump at: 2022-05-25 10:16:26
    SHOW MASTER STATUS:
           Log: mysql-bin.000002
           Pos: 246546174
           GTID:b631bcad-bb10-11ec-9eee-fec83cf2b903:1-194801
    Finished dump at: 2022-05-25 10:16:27
    
  3. Find the starting point in the metadata file of MySQL instance2 exported by Dumpling. For example:

    Started dump at: 2022-05-25 10:20:32
    SHOW MASTER STATUS:
           Log: mysql-bin.000001
           Pos: 1312659
           GTID:cd21245e-bb10-11ec-ae16-fec83cf2b903:1-4036
    Finished dump at: 2022-05-25 10:20:32
    
  4. Edit the task configuration file test-task1, to configure the incremental replication mode and replication starting point for each data source.

    ## ********* Task Configuration *********
    name: test-task1
    shard-mode: "pessimistic"
    # Task mode. The "incremental" mode only performs incremental data migration.
    task-mode: incremental
    # timezone: "UTC"
    
    ## ******** Data Source Configuration **********
    ## (Optional) If you need to incrementally replicate data that has already been migrated in the full data migration, you need to enable the safe mode to avoid the incremental data migration error.
    ##  This scenario is common in the following case: the full migration data does not belong to the data source's consistency snapshot, and after that, DM starts to replicate incremental data from a position earlier than the full migration.
    syncers:           # The running configurations of the sync processing unit.
     global:           # Configuration name.
       safe-mode: false # # If this field is set to true, DM changes INSERT of the data source to REPLACE for the target database,
                        # # and changes UPDATE of the data source to DELETE and REPLACE for the target database.
                        # # This is to ensure that when the table schema contains a primary key or unique index, DML statements can be imported repeatedly.
                        # # In the first minute of starting or resuming an incremental migration task, DM automatically enables the safe mode.
    mysql-instances:
    - source-id: "mysql-replica-01"
       block-allow-list:  "bw-rule-1"
       route-rules: ["store-route-rule", "sale-route-rule"]
       filter-rules: ["store-filter-rule", "sale-filter-rule"]
       syncer-config-name: "global"
       meta:
         binlog-name: "mysql-bin.000002"
         binlog-pos: 246546174
         binlog-gtid: "b631bcad-bb10-11ec-9eee-fec83cf2b903:1-194801"
    - source-id: "mysql-replica-02"
       block-allow-list:  "bw-rule-1"
       route-rules: ["store-route-rule", "sale-route-rule"]
       filter-rules: ["store-filter-rule", "sale-filter-rule"]
       syncer-config-name: "global"
       meta:
         binlog-name: "mysql-bin.000001"
         binlog-pos: 1312659
         binlog-gtid: "cd21245e-bb10-11ec-ae16-fec83cf2b903:1-4036"
    
    ## ******** Configuration of the target TiDB cluster on TiDB Cloud **********
    target-database:       # The target TiDB cluster on TiDB Cloud
     host: "tidb.xxxxxxx.xxxxxxxxx.ap-northeast-1.prod.aws.tidbcloud.com"
     port: 4000
     user: "root"
     password: "${password}"  # If the password is not empty, it is recommended to use a dmctl-encrypted cipher. 
    
    ## ******** Function Configuration **********
    routes:
     store-route-rule:
       schema-pattern: "store_*"
       target-schema: "store"
     sale-route-rule:
       schema-pattern: "store_*"
       table-pattern: "sale_*"
       target-schema: "store"
       target-table:  "sales"
    filters:
     sale-filter-rule:
       schema-pattern: "store_*"
       table-pattern: "sale_*"
       events: ["truncate table", "drop table", "delete"]
       action: Ignore
     store-filter-rule:
       schema-pattern: "store_*"
       events: ["drop database"]
       action: Ignore
    block-allow-list:
     bw-rule-1:
       do-dbs: ["store_*"]
    
    ## ******** Ignore check items **********
    ignore-checking-items: ["table_schema","auto_increment_ID"]
    

For detailed task configurations, see DM Task Configurations.

To run a data replication task smoothly, DM triggers a precheck automatically at the start of the task and returns the check results. DM starts the replication only after the precheck is passed. To trigger a precheck manually, run the check-task command:

[root@localhost ~]# tiup dmctl --master-addr 192.168.11.110:9261 check-task dm-task.yaml

The following is an example output:

tiup is checking updates for component dmctl ...

Starting component `dmctl`: /root/.tiup/components/dmctl/${tidb_version}/dmctl/dmctl /root/.tiup/components/dmctl/${tidb_version}/dmctl/dmctl --master-addr 192.168.11.110:9261 check-task dm-task.yaml

{
   "result": true,
   "msg": "check pass!!!"
}

Step 3. Start the replication task

Use tiup dmctl to run the following command to start the data replication task:

[root@localhost ~]# tiup dmctl --master-addr ${advertise-addr}  start-task dm-task.yaml

The parameters used in the command above are described as follows:

ParameterDescription
--master-addrThe {advertise-addr} of any DM-master node in the cluster where dmctl is to be connected. For example: 192.168.11.110:9261
start-taskStarts the migration task.

The following is an example output:

tiup is checking updates for component dmctl ...

Starting component `dmctl`: /root/.tiup/components/dmctl/${tidb_version}/dmctl/dmctl /root/.tiup/components/dmctl/${tidb_version}/dmctl/dmctl --master-addr 192.168.11.110:9261 start-task dm-task.yaml

{
   "result": true,
   "msg": "",
   "sources": [
       {
           "result": true,
           "msg": "",
           "source": "mysql-replica-01",
           "worker": "dm-192.168.11.111-9262"
       },

       {
           "result": true,
           "msg": "",
           "source": "mysql-replica-02",
           "worker": "dm-192.168.11.112-9262"
       }
   ],
   "checkResult": ""
}

If the task fails to start, check the prompt message and fix the configuration. After that, you can re-run the command above to start the task.

If you encounter any problem, refer to DM error handling and DM FAQ.

Step 4. Check the replication task status

To learn whether the DM cluster has an ongoing replication task and view the task status, run the query-status command using tiup dmctl:

[root@localhost ~]# tiup dmctl --master-addr 192.168.11.110:9261 query-status test-task1

The following is an example output:

{
   "result": true,
   "msg": "",
   "sources": [
       {
           "result": true,
           "msg": "",
           "sourceStatus": {
               "source": "mysql-replica-01",
               "worker": "dm-192.168.11.111-9262",
               "result": null,
               "relayStatus": null
           },

           "subTaskStatus": [
               {
                   "name": "test-task1",
                   "stage": "Running",
                   "unit": "Sync",
                   "result": null,
                   "unresolvedDDLLockID": "",
                   "sync": {
                       "totalEvents": "4048",
                       "totalTps": "3",
                       "recentTps": "3",
                       "masterBinlog": "(mysql-bin.000002, 246550002)",
                       "masterBinlogGtid": "b631bcad-bb10-11ec-9eee-fec83cf2b903:1-194813",
                       "syncerBinlog": "(mysql-bin.000002, 246550002)",
                       "syncerBinlogGtid": "b631bcad-bb10-11ec-9eee-fec83cf2b903:1-194813",
                       "blockingDDLs": [
                       ],
                       "unresolvedGroups": [
                       ],
                       "synced": true,
                       "binlogType": "remote",
                       "secondsBehindMaster": "0",
                       "blockDDLOwner": "",
                       "conflictMsg": ""
                   }
               }
           ]
       },
       {
           "result": true,
           "msg": "",
           "sourceStatus": {
               "source": "mysql-replica-02",
               "worker": "dm-192.168.11.112-9262",
               "result": null,
               "relayStatus": null
           },
           "subTaskStatus": [
               {
                   "name": "test-task1",
                   "stage": "Running",
                   "unit": "Sync",
                   "result": null,
                   "unresolvedDDLLockID": "",
                   "sync": {
                       "totalEvents": "33",
                       "totalTps": "0",
                       "recentTps": "0",
                       "masterBinlog": "(mysql-bin.000001, 1316487)",
                       "masterBinlogGtid": "cd21245e-bb10-11ec-ae16-fec83cf2b903:1-4048",
                       "syncerBinlog": "(mysql-bin.000001, 1316487)",
                       "syncerBinlogGtid": "cd21245e-bb10-11ec-ae16-fec83cf2b903:1-4048",
                       "blockingDDLs": [
                       ],
                       "unresolvedGroups": [
                       ],
                       "synced": true,
                       "binlogType": "remote",
                       "secondsBehindMaster": "0",
                       "blockDDLOwner": "",
                       "conflictMsg": ""
                   }
               }
           ]
       }
   ]
}

For a detailed interpretation of the results, see Query Status.

Download PDFRequest docs changesAsk questions on TiDB Forum
Was this page helpful?
Open Source Ecosystem
TiDB
TiKV
TiSpark
Chaos Mesh
© 2023 PingCAP. All Rights Reserved.