Extracting JSON data from a StringType column in Spark SQL


I have a Hive table with a single column of type string.

hive> desc logical_control.test1;
OK
test_field_1          string                  test field 1
val df2 = spark.sql("select * from logical_control.test1")
df2.printSchema()
root
|-- test_field_1: string (nullable = true)
df2.show(false)
+------------------------+
|test_field_1            |
+------------------------+
|[[str0], [str1], [str2]]|
+------------------------+

How can I convert it into a structured column like the one below?

root
|-- A: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- S: string (nullable = true)

I tried to recover it with the original schema, the one the data was structured with before it was written to HDFS, but json_data comes back as null.

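import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

val schema = StructType(
  Seq(
    StructField("A", ArrayType(
      StructType(
        Seq(
          StructField("S", StringType, nullable = true))
      )
    ), nullable = true)
  )
)
val df3 = df2.withColumn("json_data", from_json(col("test_field_1"), schema))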
df3.printSchema()
root
|-- test_field_1: string (nullable = true)
|-- json_data: struct (nullable = true)
|    |-- A: array (nullable = true)
|    |    |-- element: struct (containsNull = true)
|    |    |    |-- S: string (nullable = true)
df3.show(false)
+------------------------+---------+
|test_field_1            |json_data|
+------------------------+---------+
|[[str0], [str1], [str2]]|null     |
+------------------------+---------+

1 Answer

If the structure of test_field_1 is fixed and you don't mind "parsing" the field yourself, you can do the conversion with a UDF:

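import org.apache.spark.sql.functions.{col, udf}

// One array element; the field name S becomes the struct field name
case class S(S: String)
// Strip all square brackets, split on commas, and wrap each trimmed token in S
def toArray: String => Array[S] = _.replaceAll("[\\[\\]]", "").split(",").map(s => S(s.trim))
val toArrayUdf = udf(toArray)
val df3 = df2.withColumn("json_data", toArrayUdf(col("test_field_1")))
df3.printSchema()
df3.show(false)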

This prints:

root
 |-- test_field_1: string (nullable = true)
 |-- json_data: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- S: string (nullable = true)
+------------------------+------------------------+
|test_field_1            |json_data               |
+------------------------+------------------------+
|[[str0], [str1], [str2]]|[[str0], [str1], [str2]]|
+------------------------+------------------------+

The tricky part is creating the second level of the structure (element: struct). I used the case class S to create it.
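
To see what the parsing step does on its own, here is a minimal sketch in plain Scala that runs without Spark (for example in a REPL or spark-shell). The names `Elem` and `parseStructs` are hypothetical stand-ins for `S` and `toArray`, and the handling of null or empty input is an added assumption that is not part of the answer above.

```scala
// Hypothetical stand-in for the case class S; the field name S matches the target schema.
final case class Elem(S: String)

// Strips every '[' and ']', splits on commas, and trims each piece.
// Returns an empty Seq for null or blank input (an assumption added here).
def parseStructs(raw: String): Seq[Elem] =
  Option(raw)                                  // Option(null) is None
    .map(_.replaceAll("[\\[\\]]", "").trim)    // "[[str0], [str1]]" becomes "str0, str1"
    .filter(_.nonEmpty)                        // "[]" leaves nothing to split
    .map(_.split(",").toSeq.map(s => Elem(s.trim)))
    .getOrElse(Seq.empty)

val parsed = parseStructs("[[str0], [str1], [str2]]")
println(parsed.map(_.S).mkString(", "))
```

Note that this approach only works while the values themselves contain no commas or square brackets, since both are treated as delimiters.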

