I want to transform several columns in a Spark table.
I only found this solution for two columns, and I would like to know how to handle the zip function with three columns varA, varB and varC:
import org.apache.spark.sql.functions.{udf, explode}

val zip = udf((xs: Seq[Long], ys: Seq[Long]) => xs.zip(ys))

df.withColumn("vars", explode(zip($"varA", $"varB"))).select(
  $"userId", $"someString",
  $"vars._1".alias("varA"), $"vars._2".alias("varB")).show

This is my dataframe schema:
root
 |-- owningcustomerid: string (nullable = true)
 |-- event_stoptime: string (nullable = true)
 |-- balancename: string (nullable = false)
 |-- chargedvalue: string (nullable = false)
 |-- newbalance: string (nullable = false)

I tried this code:
val zip = udf((xs: Seq[String], ys: Seq[String], zs: Seq[String]) => (xs, ys, zs).zipped.toSeq)

df.printSchema

val df4 = df.withColumn("vars", explode(zip($"balancename", $"chargedvalue", $"newbalance"))).select(
  $"owningcustomerid", $"event_stoptime",
  $"vars._1".alias("balancename"), $"vars._2".alias("chargedvalue"), $"vars._3".alias("newbalance"))

And I got this error:
cannot resolve 'UDF(balancename, chargedvalue, newbalance)' due to data type mismatch: argument 1 requires array<string> type, however, '`balancename`' is of string type. argument 2 requires array<string> type, however, '`chargedvalue`' is of string type. argument 3 requires array<string> type, however, '`newbalance`' is of string type.
Posted on 2019-02-14 11:30:22
In plain Scala you can generally use Tuple3.zipped:

val zip = udf((xs: Seq[Long], ys: Seq[Long], zs: Seq[Long]) =>
  (xs, ys, zs).zipped.toSeq)

zip($"varA", $"varB", $"varC")

In Spark specifically (>= 2.4), you can use the arrays_zip function:

import org.apache.spark.sql.functions.arrays_zip

arrays_zip($"varA", $"varB", $"varC")

However, note that your data does not contain array<string> but plain strings, so neither the udf nor arrays_zip can be applied directly; you should parse your data first.
Posted on 2021-11-11 13:26:52
val zip = udf((a: Seq[String], b: Seq[String], c: Seq[String], d: Seq[String]) =>
  a.indices.map(i => (a(i), b(i), c(i), d(i))))

https://stackoverflow.com/questions/54689310
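The index-based zip in that udf body can be illustrated on plain Scala collections, outside of Spark (the sample values below are hypothetical):

```scala
// Hypothetical sample sequences of equal length
val a = Seq("a1", "a2")
val b = Seq("b1", "b2")
val c = Seq("c1", "c2")
val d = Seq("d1", "d2")

// Same logic as the udf body: walk the indices and build one
// four-element tuple per position.
val zipped = a.indices.map(i => (a(i), b(i), c(i), d(i)))
// zipped == Vector(("a1","b1","c1","d1"), ("a2","b2","c2","d2"))
```

Unlike Tuple3.zipped, this pattern extends to any number of columns, but it assumes all four sequences have the same length; otherwise the indexing throws.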