Q: Pivot a PySpark DataFrame based on a condition

Stack Overflow user

Asked 2018-05-10 00:06:45

Answers: 1 | Views: 1.4K | Followers: 0 | Votes: 0

I have a DataFrame in PySpark that looks like this.

df.show()

+---+-------+----+
| id|   type|s_id|
+---+-------+----+
|  1|    ios|  11|
|  1|    ios|  12|
|  1|    ios|  13|
|  1|    ios|  14|
|  1|android|  15|
|  1|android|  16|
|  1|android|  17|
|  2|    ios|  21|
|  2|android|  18|
+---+-------+----+

Now, from this DataFrame, I want to create another DataFrame by pivoting it.

df1.show()
+---+-----+-----+-----+---------+---------+---------+
| id| ios1| ios2| ios3| android1| android2| android3|
+---+-----+-----+-----+---------+---------+---------+
|  1|   11|   12|   13|       15|       16|       17|
|  2|   21| Null| Null|       18|     Null|     Null|
+---+-----+-----+-----+---------+---------+---------+

There is one condition to consider: for each id, even if a type has more than 3 entries, I want to keep only 3 of them (or fewer, when fewer exist).

How can I do that?

Edit

new_df.show()

+---+-------+----+
| id|   type|s_id|
+---+-------+----+
|  1|    ios|  11|
|  1|    ios|  12|
|  1|       |  13|
|  1|       |  14|
|  1|andriod|  15|
|  1|       |  16|
|  1|       |  17|
|  2|andriod|  18|
|  2|    ios|  21|
+---+-------+----+

The result I get is as follows

+---+----+----+----+--------+----+----+
| id|   1|   2|   3|andriod1|ios1|ios2|
+---+----+----+----+--------+----+----+
|  1|  13|  14|  16|      15|  11|  12|
|  2|null|null|null|      18|  21|null|
+---+----+----+----+--------+----+----+

What I want is

+---+--------+--------+--------+----+----+----+
|id |android1|android2|android3|ios1|ios2|ios3|
+---+--------+--------+--------+----+----+----+
|1  |15      |    null|    null|  11|  12|null|
|2  |18      |    null|    null|  21|null|null|
+---+--------+--------+--------+----+----+----+

pyspark

apache-spark

1 Answer

Stack Overflow user

Accepted answer

Posted 2018-05-10 03:16:40

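The logic below gives you the result you want.

A Window function generates a row number within each group of id and type, ordered by s_id. The generated row number is used first to filter the rows and then as a suffix concatenated to type. Finally, grouping by id and pivoting on type gives the desired output.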

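from pyspark.sql import Window
from pyspark.sql import functions as f

# Number the rows within each (id, type) group, ordered by s_id
windowSpec = Window.partitionBy("id", "type").orderBy("s_id")

# Keep ranks 1 to 3, append the rank to type (ios -> ios1, ios2, ...),
# then pivot so that each suffixed type becomes its own column
df.withColumn("ranks", f.row_number().over(windowSpec))\
    .filter(f.col("ranks") < 4)\
    .withColumn("type", f.concat(f.col("type"), f.col("ranks")))\
    .drop("ranks")\
    .groupBy("id")\
    .pivot("type")\
    .agg(f.first("s_id"))\
    .show(truncate=False)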

This should give you

+---+--------+--------+--------+----+----+----+
|id |android1|android2|android3|ios1|ios2|ios3|
+---+--------+--------+--------+----+----+----+
|1  |15      |16      |17      |11  |12  |13  |
|2  |18      |null    |null    |21  |null|null|
+---+--------+--------+--------+----+----+----+
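
For readers who want to trace the logic without a Spark session, here is a minimal plain-Python sketch of the same steps (rank within each id/type group, keep the first three, suffix the type with the rank, pivot by id). The function name `pivot_first_n` and the parameter `n` are illustrative assumptions and are not part of the original answer; the rows are copied from the first DataFrame in the question.

```python
from collections import defaultdict

# Rows from the question's first DataFrame: (id, type, s_id)
rows = [
    (1, "ios", 11), (1, "ios", 12), (1, "ios", 13), (1, "ios", 14),
    (1, "android", 15), (1, "android", 16), (1, "android", 17),
    (2, "ios", 21), (2, "android", 18),
]


def pivot_first_n(rows, n=3):
    """Plain-Python trace of row_number + filter + concat + pivot (hypothetical helper)."""
    # Collect s_id values per (id, type), like Window.partitionBy("id", "type")
    groups = defaultdict(list)
    for id_, type_, s_id in rows:
        groups[(id_, type_)].append(s_id)

    pivoted = defaultdict(dict)
    for (id_, type_), s_ids in groups.items():
        # sorted() plays the role of orderBy("s_id"); enumerate supplies the rank
        for rank, s_id in enumerate(sorted(s_ids), start=1):
            if rank <= n:  # equivalent to filter(ranks < 4) when n == 3
                # type + rank becomes the pivoted column name, e.g. "ios1"
                pivoted[id_][f"{type_}{rank}"] = s_id
    return dict(pivoted)


result = pivot_first_n(rows)
print(result[1])
print(result[2])
```

A key that is absent from an id's dictionary (for example `ios2` for id 2) corresponds to a null cell in the Spark output above.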

Answer to the edited part

You just need one additional filter, which drops the rows whose type is empty

df.withColumn("ranks", f.row_number().over(windowSpec)) \
    .filter(f.col("ranks") < 4) \
    .filter(f.col("type") != "") \
    .withColumn("type", f.concat(f.col("type"), f.col("ranks"))) \
    .drop("ranks") \
    .groupBy("id") \
    .pivot("type") \
    .agg(f.first("s_id")) \
    .show(truncate=False)

This will give you

+---+--------+----+----+
|id |andriod1|ios1|ios2|
+---+--------+----+----+
|1  |15      |11  |12  |
|2  |18      |21  |null|
+---+--------+----+----+

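Now the android2, android3 and ios3 columns are missing, because they are not present in the updated input data. You can add them with the withColumn API and fill them with null values.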
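
The withColumn suggestion could look roughly like the sketch below. The helpers `expected_columns` and `add_missing_columns` are hypothetical names introduced here, and the `"int"` cast is an assumption about the type of s_id; neither comes from the original answer.

```python
def expected_columns(types, n=3):
    """Column names the pivot would produce if every type had n rows."""
    return [f"{t}{i}" for t in types for i in range(1, n + 1)]


def add_missing_columns(df, expected, dtype="int"):
    """Add every expected column that df lacks as an all-null column.

    Assumes df is the pivoted PySpark DataFrame and that it has an "id" column.
    """
    # Imported here so expected_columns() stays usable without PySpark installed
    from pyspark.sql import functions as f

    for name in expected:
        if name not in df.columns:
            # lit(None) has no type of its own, so cast it (assumed: int)
            df = df.withColumn(name, f.lit(None).cast(dtype))
    # id first, then the pivoted columns in a fixed order;
    # columns not listed in expected are dropped by this select
    return df.select("id", *expected)
```

Assuming the pivoted result is kept in a variable (say `pivoted_df`) instead of being shown directly, it could be called as `add_missing_columns(pivoted_df, expected_columns(["android", "ios"]))`. The names must match the spelling in the data: the edited sample spells the type "andriod", so with that input the existing `andriod1` column would not match `android1` and would be dropped by the final select unless the spelling is fixed first.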

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/50263695
