I have a simple JSON file:
{"adas":{"parkAssist":{"rear":{"alarm":false,"muted":false},"front":{"alarm":false,"muted":false}},"lane":{"keepAssist":{"right":false,"left":false}}}}
But when I read it like this:
spark.read.option("inferSchema", "true") \
    .option("multiline", "true") \
    .json("///myfile.json") \
    .first() \
    .asDict()
I get:
{"adas":{"parkAssist":{"rear":{"alarm":false,"muted":false},"front":{"alarm":false,"muted":false}},"lane":{"keepAssist":{"alarm":false,"muted":false}}}}
This is wrong, because the adas_lane_keepAssist parameters are incorrect.
If I change one of the adas_lane_keepAssist parameters to "true" in the source JSON, the mapping is correct.
I also thought inferSchema might be the root of the problem, so I wrote a custom_schema:
from pyspark.sql.types import StructType, StructField, BooleanType

custom_schema = StructType([
    StructField("adas", StructType([
        StructField("parkAssist", StructType([
            StructField("rear", StructType([
                StructField("alarm", BooleanType(), True),
                StructField("muted", BooleanType(), True)
            ])),
            StructField("front", StructType([
                StructField("alarm", BooleanType(), True),
                StructField("muted", BooleanType(), True)
            ]))
        ])),
        StructField("lane", StructType([
            StructField("keepAssist", StructType([
                StructField("right", BooleanType(), True),
                StructField("left", BooleanType(), True)
            ]))
        ]))
    ]))
])
Then I read the file like this:
spark.read.schema(custom_schema) \
    .option("multiline", "true") \
    .json("///myfile.json") \
    .first() \
    .asDict()
I got the same wrong result:
{"adas":{"parkAssist":{"rear":{"alarm":false,"muted":false},"front":{"alarm":false,"muted":false}},"lane":{"keepAssist":{"alarm":false,"muted":false}}}}
Interestingly, if I change the order in my custom_schema like this:
custom_schema = StructType([
    StructField("adas", StructType([
        StructField("lane", StructType([
            StructField("keepAssist", StructType([
                StructField("right", BooleanType(), True),
                StructField("left", BooleanType(), True)
            ]))
        ])),
        StructField("parkAssist", StructType([
            StructField("rear", StructType([
                StructField("alarm", BooleanType(), True),
                StructField("muted", BooleanType(), True)
            ])),
            StructField("front", StructType([
                StructField("alarm", BooleanType(), True),
                StructField("muted", BooleanType(), True)
            ]))
        ]))
    ]))
])
Now every parameter under adas_parkAssist_front/rear is wrong:
{"adas":{"lane":{"keepAssist":{"right":false,"left":false}}, "parkAssist":{"rear":{"right":false,"left":false},"front":{"right":false,"left":false}}}}
Is this a limitation of PySpark?
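Posted on 2022-06-08 14:38:33
This looks strange to me too. I tried first, head and collect, but they all returned the same distorted structure. If I print the schema before those calls, it is correct. So the problem seems to be that first, head and collect do not handle this nested structure correctly.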
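One way to make the distortion visible is to compare the leaf key paths of the source JSON with those of the dict that came back. The sketch below is only an illustration: key_paths is a hypothetical helper, not part of PySpark, and the two literals are copied from the question so that it runs without Spark. With a real Row, the dict would presumably come from row.asDict(recursive=True).

```python
import json


def key_paths(obj, prefix=""):
    """Return the set of dotted paths to every leaf of a nested dict."""
    paths = set()
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Descend into nested objects and collect their leaf paths.
            paths |= key_paths(value, path)
        else:
            paths.add(path)
    return paths


# Source file content, as given in the question.
source = json.loads(
    '{"adas":{"parkAssist":{"rear":{"alarm":false,"muted":false},'
    '"front":{"alarm":false,"muted":false}},'
    '"lane":{"keepAssist":{"right":false,"left":false}}}}'
)

# What first().asDict() was reported to return in the question.
returned = json.loads(
    '{"adas":{"parkAssist":{"rear":{"alarm":false,"muted":false},'
    '"front":{"alarm":false,"muted":false}},'
    '"lane":{"keepAssist":{"alarm":false,"muted":false}}}}'
)

missing = sorted(key_paths(source) - key_paths(returned))
unexpected = sorted(key_paths(returned) - key_paths(source))

print("missing:   ", missing)
print("unexpected:", unexpected)
```

Any path listed as missing or unexpected points at a struct whose field names were swapped for those of another struct.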
As a workaround, I converted the whole schema (which is correct right after reading the JSON file) to map type.
from pyspark.sql import functions as F

df = spark.read.json(r"path\test_file.json")
# Rebuild 'adas' as nested maps instead of nested structs.
# create_map needs all values of one map to share a type; that holds here
# because every leaf is a boolean.
df = df.withColumn('adas', F.create_map(
    F.lit('lane'), F.create_map(
        F.lit('keepAssist'), F.create_map(
            F.lit('left'), F.col('adas.lane.keepAssist.left'),
            F.lit('right'), F.col('adas.lane.keepAssist.right')
        )
    ),
    F.lit('parkAssist'), F.create_map(
        F.lit('front'), F.create_map(
            F.lit('alarm'), F.col('adas.parkAssist.front.alarm'),
            F.lit('muted'), F.col('adas.parkAssist.front.muted')
        ),
        F.lit('rear'), F.create_map(
            F.lit('alarm'), F.col('adas.parkAssist.rear.alarm'),
            F.lit('muted'), F.col('adas.parkAssist.rear.muted')
        )
    )
))
print(df.head().asDict())
# {'adas': {'lane': {'keepAssist': {'left': False, 'right': False}}, 'parkAssist': {'rear': {'alarm': False, 'muted': False}, 'front': {'alarm': False, 'muted': False}}}}
Posted on 2022-06-08 16:43:06
My Spark version was 3.1.1.
After updating it to 3.2.0, the custom nested schema is read as expected.
Awesome!
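For anyone checking whether their installation predates the version where this was reported to work, a minimal sketch follows. The 3.2.0 threshold comes only from this thread (3.1.1 showed the problem, 3.2.0 did not); nothing here establishes which release in between changed the behaviour. parse_version and predates_reported_fix are hypothetical helper names, not PySpark functions.

```python
def parse_version(version):
    """Turn a string such as '3.1.1' into (3, 1, 1).

    Parsing stops at the first non-numeric piece (for example 'dev0'),
    and the result is padded with zeros to three components.
    """
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def predates_reported_fix(version, fixed_in="3.2.0"):
    """True if `version` is older than the release reported to work.

    The default for `fixed_in` is taken from this thread only.
    """
    return parse_version(version) < parse_version(fixed_in)


print(predates_reported_fix("3.1.1"))
print(predates_reported_fix("3.2.0"))
```

In a live session the argument would be pyspark.__version__ rather than a literal string.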
https://stackoverflow.com/questions/72544480