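Is it possible to read Parquet files from one folder in S3 and write them to another folder using pyarrow, without converting to pandas along the way?
Here is my code: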
import pyarrow.parquet as pq
import pyarrow as pa
import s3fs
s3 = s3fs.S3FileSystem()
bucket = 'demo-s3'
pd = pq.ParquetDataset('s3://{0}/old'.format(bucket), filesystem=s3).read(nthreads=4).to_pandas()
table = pa.Table.from_pandas(pd)
pq.write_to_dataset(table, 's3://{0}/new'.format(bucket), filesystem=s3, use_dictionary=True, compression='snappy')

Posted on 2018-06-27 00:56:13
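If you don't want to copy the files directly, it looks like you can indeed avoid pandas. ParquetDataset.read() already returns a pyarrow Table, so it can be passed straight to write_to_dataset:
# nthreads is the older spelling; recent pyarrow releases expect use_threads=True instead.
table = pq.ParquetDataset('s3://{0}/old'.format(bucket),
                          filesystem=s3).read(nthreads=4)
pq.write_to_dataset(table, 's3://{0}/new'.format(bucket),
                    filesystem=s3, use_dictionary=True, compression='snappy')

Posted on 2020-01-10 20:41:42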
Why not just copy the objects directly (S3 -> S3) and save the memory and I/O of reading them into the client?
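import awswrangler as wr

SOURCE_PATH = "s3://..."
TARGET_PATH = "s3://..."

wr.s3.copy_objects(
    # Recent awswrangler releases also require the list of objects to copy.
    paths=wr.s3.list_objects(SOURCE_PATH),
    source_path=SOURCE_PATH,
    target_path=TARGET_PATH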
)

https://stackoverflow.com/questions/49513152