Apache SeaTunnel 2.1.0 deployment and pitfalls

2022-04-23 13:42:00 【Ruo Xiaoyu】

Introduction

SeaTunnel was originally named Waterdrop; it was renamed SeaTunnel on October 12, 2021.
SeaTunnel is an easy-to-use, high-performance distributed data integration platform that supports real-time synchronization of massive data. It can synchronize tens of billions of records per day stably and efficiently, and it is used in production at nearly 100 companies.

Features

Easy to use, flexible configuration, low-code development
Real-time streaming
Offline multi-source data analysis
High performance and massive data processing capability
Modular, plugin-based design that is easy to extend
Supports data processing and aggregation with SQL
Supports Spark Structured Streaming
Supports Spark 2.x
- This is where we hit our first pitfall: our test Spark environment had already been upgraded to 3.x, but SeaTunnel currently supports only 2.x, so we had to deploy a separate Spark 2.x.
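One way to catch this version mismatch before submitting a job is to check the Spark version string first. The helper below is a minimal sketch and not part of SeaTunnel: the function name is_spark_2x is made up for illustration, and it assumes you pass in a version string such as the one reported by spark-submit --version.

```shell
# Hypothetical helper: succeeds (exit status 0) only for Spark 2.x version strings.
is_spark_2x() {
  case "$1" in
    2.*) return 0 ;;  # e.g. 2.4.8
    *)   return 1 ;;  # anything else, including 3.x
  esac
}
```

For example, is_spark_2x "2.4.8" succeeds while is_spark_2x "3.1.2" fails, so a deployment script could stop early instead of failing at job submission.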

Workflow

[Figure: SeaTunnel workflow diagram]

Installation

Installation documentation

https://seatunnel.incubator.apache.org/docs/2.1.0/spark/installation

Environment preparation: install the JDK and Spark.
Download the installation package:
https://www.apache.org/dyn/closer.lua/incubator/seatunnel/2.1.0/apache-seatunnel-incubating-2.1.0-bin.tar.gz
Unpack it and edit config/seatunnel-env.sh.
Set the required environment variables there, for example SPARK_HOME (the directory where Spark was downloaded and unpacked).
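As a sketch, the SPARK_HOME line in config/seatunnel-env.sh could look like the following. The path /opt/spark-2.4.8-bin-hadoop2.7 is an assumption for illustration only; use the directory where your own Spark 2.x distribution was unpacked.

```shell
# config/seatunnel-env.sh (sketch)
# Assumed path, not taken from the article: point this at your Spark 2.x directory.
# The ${VAR:-default} form keeps an already exported SPARK_HOME and
# falls back to the default only when the variable is unset or empty.
SPARK_HOME=${SPARK_HOME:-/opt/spark-2.4.8-bin-hadoop2.7}
```

Writing it this way lets you override the location per shell session (export SPARK_HOME=... before launching) without editing the file again.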

1. Test jdbc-to-jdbc

Create a new file, config/spark.batch.jdbc.to.jdbc.conf:

env {
  # Spark application settings for this batch job
  spark.app.name = "SeaTunnel"
  spark.executor.instances = 1
  spark.executor.cores = 1
  spark.executor.memory = "1g"
}

source {
  jdbc {
    driver = "com.mysql.jdbc.Driver"
    url = "jdbc:mysql://0.0.0.0:3306/database?useUnicode=true&characterEncoding=utf8&useSSL=false"
    table = "table_name"
    result_table_name = "result_table_name"
    user = "root"
    password = "password"
  }

}

transform {
  # split data by specific delimiter

  # you can also use other filter plugins, such as sql
  # sql {
  #   sql = "select * from accesslog where request_time > 1000"
  # }

  # If you would like to get more information about how to configure seatunnel and see full list of filter plugins,
  # please go to https://seatunnel.apache.org/docs/spark/configuration/transform-plugins/Sql
}

sink {
  # choose stdout output plugin to output data to console
  # Console {}
  jdbc {
    # The driver parameter must be set here; otherwise the data transfer will not succeed
    driver = "com.mysql.jdbc.Driver"
    saveMode = "update"
    url = "jdbc:mysql://ip:3306/database?useUnicode=true&characterEncoding=utf8&useSSL=false"
    user = "userName"
    password = "***********"
    dbTable = "tableName"
    customUpdateStmt = "INSERT INTO table (column1, column2, created, modified, yn) values(?, ?, now(), now(), 1) ON DUPLICATE KEY UPDATE column1 = IFNULL(VALUES (column1), column1), column2 = IFNULL(VALUES (column2), column2)"
  }
}

Start command on YARN

./bin/start-seatunnel-spark.sh --master 'yarn' --deploy-mode client --config ./config/spark.batch.jdbc.to.jdbc.conf

Pitfall: the run fails with the error "please specify [driver] as non-empty". Tracing it showed that the driver parameter must be set in the sink configuration.

ERROR Seatunnel:121 - Plugin[org.apache.seatunnel.spark.sink.Jdbc] contains invalid config, error: please specify [driver] as non-empty
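
To catch this before submitting to YARN, a rough pre-flight check can look for a driver key inside the sink block of the config file. This is a sketch under stated assumptions: sink_has_driver is a hypothetical helper, not a SeaTunnel feature, and it relies on the layout used in the config above, where "sink {" starts at the beginning of a line and its closing brace is unindented. It does not parse HOCON in general.

```shell
# Hypothetical helper: succeeds if a "driver = ..." line appears inside
# the top-level sink { ... } block of the given config file.
sink_has_driver() {
  awk '
    /^sink[[:space:]]*[{]/ { in_sink = 1; next }   # entering the sink block
    in_sink && /^[}]/      { in_sink = 0 }         # unindented brace closes it
    in_sink && /^[[:space:]]*driver[[:space:]]*=/ { found = 1 }
    END { exit !found }                            # status 0 only when found
  ' "$1"
}
```

A possible use is sink_has_driver ./config/spark.batch.jdbc.to.jdbc.conf || echo "sink block has no driver setting", run before the start command. Commented-out driver lines are ignored because the pattern requires the key at the start of the line.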