
          Sqoop

          Sqoop Introduction

          Example scenario: Sqoop imports data from RDS into Hive, where the Hive table's location is a BOS path and the table is partitioned by dt (string). Data imported each day is kept apart by the date given in dt.

          Sqoop is a tool for transferring data between Hadoop and relational databases. It uses Hadoop's MapReduce to move data in both directions between relational databases (MySQL, Oracle, PostgreSQL, etc.) and HDFS. An import works as follows:

          1. Sqoop reads the structure of the table to import, generates a record class (QueryResult by default), packages it into a jar, and submits the jar to Hadoop.
          2. Hadoop runs the import as a MapReduce job:

            1. Split the data into ranges (DataSplit).
            2. Save each split's read range into the job configuration.
            3. Read the saved ranges back in the map tasks.
            4. Create a RecordReader that reads data from the database.
            5. Create the Map task.
            6. The RecordReader reads rows one by one from the relational database, sets each row as the Map's key and value, and hands it to the Map.
            7. Run the map; the final output key is the row data.
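
          As a concrete illustration of the split step, Sqoop divides the table on a split column (the table's primary key by default) and runs one map task per range. A minimal sketch, assuming the example table test has an integer primary key id; the --split-by and -m values here are illustrative, not from the original example:

            # Import with 2 parallel map tasks, splitting on the id column.
            # Sqoop first issues a boundary query such as
            #   SELECT MIN(id), MAX(id) FROM test
            # and then each map task reads one id range.
            sqoop import \
              --connect jdbc:mysql://address:port/db_name \
              --table test --username XX --password XX \
              --split-by id -m 2 \
              --target-dir /user/hdfs/sqooptest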

          Data Import by Sqoop

          You can use Sqoop to import data from RDS to BOS, HDFS, HBase, or Hive. The specific operation steps are as follows:

          Import Data from Relational Database Service (RDS) to BOS

          1. Use SSH to connect to the master node. For more information, please see SSH Connection to Cluster.
          2. Enter the command su hdfs to switch to the hdfs user.
          3. Execute the command in the following format: sqoop import --connect jdbc:mysql://address:port/db_name --table table_name --username XX --password XX --target-dir XX

            Example: sqoop import --connect jdbc:mysql://mysql56.rdsmiy5q77jfaqp.rds.bj.baidubce.com:3306/sqoop --table test --username sqoop --password sqoop_123 --target-dir bos://abc/sqooptest

          4. You can view the execution outcome in the BOS console.

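
          Alternatively, you can inspect the output from the master node's shell. A minimal sketch, assuming the cluster's Hadoop is configured to read BOS directly and using the example path bos://abc/sqooptest from step 3:

            # List the files Sqoop wrote (one part-m-* file per map task)
            hadoop fs -ls bos://abc/sqooptest
            # Print the first rows of the first output file
            hadoop fs -cat bos://abc/sqooptest/part-m-00000 | head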

          Import Data from Relational Database Service (RDS) to HDFS

          1. Perform steps 1 and 2 in "Import Data from Relational Database Service (RDS) to BOS".
          2. Execute the command in the following format: sqoop import --connect jdbc:mysql://address:port/db_name --table table_name --username XX --password XX --target-dir XX

            Example: sqoop import --connect jdbc:mysql://mysql56.rdsmiy5q77jfaqp.rds.bj.baidubce.com:3306/sqoop --table test --username sqoop --password sqoop_123 --target-dir /user/hdfs/sqooptest

          3. You can view the execution outcome in HDFS.
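
          For example, as the hdfs user you can list and print the imported files. A sketch using the example path /user/hdfs/sqooptest from step 2:

            # List the imported files under the target directory
            hdfs dfs -ls /user/hdfs/sqooptest
            # Show the imported rows (fields are comma-separated by default)
            hdfs dfs -cat /user/hdfs/sqooptest/part-m-00000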

          | Parameter | Description |
          | --- | --- |
          | address and port | Address and port number of the RDS database instance. Obtain them from the basic information of the RDS instance. For more information, please see Connect RDS Instance. |
          | db_name | Name of the database that contains the data to import. For how to create an RDS instance, please see Create Database. |
          | table_name | Name of the data table that contains the data to import. If you need to create a data table, please log in to the RDS instance. For more information, please see Connect RDS Instance. |
          | --username and --password | Account and password of the database that contains the data to import. Obtain them from the RDS instance. For more information, please see Administer Database Accounts. |
          | --target-dir | Target address of the data import, which is a BOS or HDFS path. |

          Note: If a BOS path is specified, the last directory in the path must not already exist on BOS. For example, for the import path bos://test/sqooptest, the sqooptest directory must not exist on BOS.

          Import Data from Relational Database Service (RDS) to HBase

          1. Use SSH to connect to the master node. For more information, please see SSH Connection to Cluster.
          2. Enter the command su hdfs to switch to the hdfs user.
          3. Sqoop can import data into HBase in two ways:

            • If you import data into an existing HBase table, execute the command in the following format: sqoop import --connect jdbc:mysql://address:port/db_name --table table_name --username XX --password XX --hbase-table hbase_table_name --column-family hbase_column_family --hbase-row-key row_key. Example: sqoop import --connect jdbc:mysql://mysql56.rdsmiy5q77jfaqp.rds.bj.baidubce.com:3306/sqoop --table test --username sqoop --password sqoop_123 --hbase-table sqoop_hbase --column-family message1 --hbase-row-key id. (For creating the table and column family in advance, see the HBase shell sketch after step 4.)
            • If you import data into an HBase table that does not exist yet, add --hbase-create-table and execute the command in the following format: sqoop import --connect jdbc:mysql://address:port/db_name --table table_name --username XX --password XX --hbase-table hbase_table_name --column-family hbase_column_family --hbase-row-key row_key --hbase-create-table. Example: sqoop import --connect jdbc:mysql://mysql56.rdsmiy5q77jfaqp.rds.bj.baidubce.com:3306/sqoop --table test --username sqoop --password sqoop_123 --hbase-table sqoop_hbase --column-family message1 --hbase-row-key id --hbase-create-table.
          4. You can view the execution outcome in HBase. An example of the execution outcome is as follows:

            hbase(main):003:0> scan "sqoop_hbase"
            ROW    COLUMN+CELL
            1      column=message1:address, timestamp=1439794756361, value=shanghai
            1      column=message1:favor, timestamp=1439794756361, value=lanqiu
            1      column=message1:name, timestamp=1439794756361, value=xiaoming
            1      column=message1:sex, timestamp=1439794756361, value=1
            2      column=message1:address, timestamp=1439794756361, value=beijing
            2      column=message1:favor, timestamp=1439794756361, value=pingpong
            2      column=message1:name, timestamp=1439794756361, value=xiaohong
            2      column=message1:sex, timestamp=1439794756361, value=2
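
          For the first variant in step 3, the target table and column family must already exist. A minimal HBase shell sketch that creates them, using the example names sqoop_hbase and message1:

            # Open the HBase shell on the master node
            hbase shell
            # Create the target table with the message1 column family
            hbase(main):001:0> create 'sqoop_hbase', 'message1'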

          | Parameter | Description |
          | --- | --- |
          | address and port | Address and port number of the RDS database instance. Obtain them from the basic information of the RDS instance. For more information, please see Connect RDS Instance. |
          | db_name | Name of the database containing the source data. For how to create an RDS instance, please see Create Database. |
          | table_name | Name of the data table containing the source data. If you need to create a data table, please log in to the RDS instance. For more information, please see Connect RDS Instance. |
          | --username and --password | Account and password of the database containing the source data. Obtain them from the RDS instance. For more information, please see Administer Database Accounts. |
          | --hbase-table | Name of the target data table in HBase. |
          | --column-family | Column family of the target data table in HBase. Note: Sqoop can specify only one column family per import. |
          | --hbase-row-key | Field in the source RDS table to use as the HBase row key. |

          Import Data from Relational Database Service (RDS) to Hive

          1. Use SSH to connect to the master node. For more information, please see SSH Connection to Cluster.
          2. Enter the command su hdfs to switch to the hdfs user.
          3. Sqoop can import data into Hive in three ways:

            • If Hive has no data table with the same name as the RDS table, execute the command in the following format: sqoop import --connect jdbc:mysql://address:port/db_name --table table_name --username XX --password XX --hive-import. Example: sqoop import --connect jdbc:mysql://mysql56.rdsmiy5q77jfaqp.rds.bj.baidubce.com:3306/sqoop --table test --username sqoop --password sqoop_123 --hive-import.
            • By default, Sqoop names the Hive table after the source data table. If you do not want the default name, use --hive-table to specify a new table name and execute the command in the following format: sqoop import --connect jdbc:mysql://address:port/db_name --table table_name --username XX --password XX --hive-import --hive-table XX. Example: sqoop import --connect jdbc:mysql://mysql56.rdsmiy5q77jfaqp.rds.bj.baidubce.com:3306/sqoop --table test --username sqoop --password sqoop_123 --hive-import --hive-table sqoop_hive.
            • If Hive already has a data table with the same name as the RDS table, use --hive-table to specify the target table and add the --hive-overwrite parameter, then execute the command in the following format: sqoop import --connect jdbc:mysql://address:port/db_name --table table_name --username XX --password XX --hive-import --hive-table XX --hive-overwrite. You can also import from a different RDS table (such as test_export) into the existing Hive table sqoop_hive. Example: sqoop import --connect jdbc:mysql://mysql56.rdsmiy5q77jfaqp.rds.bj.baidubce.com:3306/sqoop --table test --username sqoop --password sqoop_123 --hive-import --hive-table sqoop_hive --hive-overwrite.
          4. You can view the execution outcome in Hive. An example of the execution outcome is as follows:

            hive> select * from test;
            OK
            1    xiaoming    1    shanghai    lanqiu
            2    xiaohong    2    beijing     pingpong
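
          To implement the example scenario from the introduction (a Hive table partitioned by dt), Sqoop's Hive partition options can be added to the import. A hedged sketch; the partition value below is illustrative, and each daily run would pass that day's date:

            # Import one day's data into the dt=<date> partition of the Hive table
            sqoop import \
              --connect jdbc:mysql://address:port/db_name \
              --table test --username XX --password XX \
              --hive-import --hive-table sqoop_hive \
              --hive-partition-key dt \
              --hive-partition-value 2015-08-17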

          | Parameter | Description |
          | --- | --- |
          | address and port | Address and port number of the RDS database instance. Obtain them from the basic information of the RDS instance. For more information, please see Connect RDS Instance. |
          | db_name | Name of the database containing the source data. For how to create an RDS instance, please see Create Database. |
          | table_name | Name of the data table containing the source data. If you need to create a data table, please log in to the RDS instance. For more information, please see Connect RDS Instance. |
          | --username and --password | Account and password of the database containing the source data. Obtain them from the RDS instance. For more information, please see Administer Database Accounts. |
          | --hive-import | Import the data into Hive. |
          | --hive-table | Name of the target data table in Hive. |
          | --hive-overwrite | Overwrite the Hive data table that has the same name as the RDS table. Note: The import can produce incorrect data when a non-INT value is loaded into an INT column. |

          Data Export by Sqoop

          You can use Sqoop to export data from BOS or HDFS to Relational Database Service (RDS). The specific operation steps are as follows:

          Export Data from BOS to Relational Database Service (RDS)

          1. Create a data table in Relational Database Service (RDS). Note that the table's field types must match the data to export, or an exception can occur during the export. Connect to the RDS database before creating the table. For more information, please see Connect RDS Instance.
          2. Use SSH to connect to the master node. For more information, please see SSH Connection to Cluster.
          3. Enter the command su hdfs to switch to the hdfs user.
          4. Execute the command: sqoop export --connect jdbc:mysql://address:port/db_name --table export_table_name --username XX --password XX --export-dir XX

          Example: sqoop export --connect jdbc:mysql://mysql56.rdsmiy5q77jfaqp.rds.bj.baidubce.com:3306/sqoop --table test --username sqoop --password sqoop_123 --export-dir bos://abc/sqooptest

          5. You can view the execution outcome in the phpMyAdmin interface of RDS.
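
          If you prefer the command line to the web interface, a sketch for verifying the export with the MySQL client, using the connection values from the example above:

            # Connect to the RDS instance and check the exported rows
            mysql -h mysql56.rdsmiy5q77jfaqp.rds.bj.baidubce.com -P 3306 -u sqoop -p
            mysql> SELECT * FROM test;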

          Export Data from HDFS to Relational Database Service (RDS)

          1. Perform steps 1 to 3 in "Export Data from BOS to Relational Database Service (RDS)".
          2. Execute the command: sqoop export --connect jdbc:mysql://address:port/db_name --table export_table_name --username XX --password XX --export-dir XX

          Example: sqoop export --connect jdbc:mysql://mysql56.rdsmiy5q77jfaqp.rds.bj.baidubce.com:3306/sqoop --table test --username sqoop --password sqoop_123 --export-dir /user/hdfs/sqooptest

          3. Perform step 5 in "Export Data from BOS to Relational Database Service (RDS)".
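
          Note that sqoop export parses the files under --export-dir as text. If the data does not use Sqoop's default comma delimiter, specify the delimiter explicitly. A sketch under the assumption that the fields are tab-separated:

            # Tell Sqoop how the fields in the exported files are delimited
            sqoop export \
              --connect jdbc:mysql://address:port/db_name \
              --table test --username XX --password XX \
              --export-dir /user/hdfs/sqooptest \
              --input-fields-terminated-by '\t'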

          | Parameter | Description |
          | --- | --- |
          | address and port | Address and port number of the RDS database instance. Obtain them from the basic information of the RDS instance. For more information, please see Connect RDS Instance. |
          | db_name | Name of the target database for the export. For how to create an RDS instance, please see Create Database. |
          | table_name | Name of the RDS data table that receives the exported data. If you need to create a data table, please log in to the RDS instance. For more information, please see Connect RDS Instance. |
          | --username and --password | Account and password of the target database. Obtain them from the RDS instance. For more information, please see Administer Database Accounts. |
          | --export-dir | BOS or HDFS path containing the data to export. |