  • From the column 大数据-BigData

    Hudi Key Generation

    Key-generation configs covered: hoodie.datasource.write.recordkey.field, hoodie.datasource.write.partitionpath.field, hoodie.datasource.write.keygenerator.class, hoodie.deltastreamer.keygen.timebased.input.timezone (e.g. "UTC"), hoodie.deltastreamer.keygen.timebased.output.dateformat (e.g. "yyyyMMddHH"), hoodie.deltastreamer.keygen.timebased.output.timezone.
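
    A minimal Spark (Scala) sketch of how these key-generation options fit together, assuming a DataFrame df in scope (e.g. in spark-shell with the Hudi bundle); the table name, column names and base path are illustrative, not from the article:

        import org.apache.spark.sql.SaveMode

        df.write.format("hudi").
          option("hoodie.table.name", "demo_tbl").                            // assumed table name
          option("hoodie.datasource.write.recordkey.field", "uuid").          // record key column
          option("hoodie.datasource.write.partitionpath.field", "ts").        // column driving the partition path
          option("hoodie.datasource.write.keygenerator.class",
                 "org.apache.hudi.keygen.TimestampBasedKeyGenerator").
          option("hoodie.deltastreamer.keygen.timebased.timestamp.type", "EPOCHMILLISECONDS"). // required by this keygen
          option("hoodie.deltastreamer.keygen.timebased.input.timezone", "UTC").
          option("hoodie.deltastreamer.keygen.timebased.output.dateformat", "yyyyMMddHH").
          mode(SaveMode.Append).
          save("/tmp/hudi/demo_tbl")                                          // assumed base path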

    Edited on 2022-06-12
  • From the column ApacheHudi

    Administering and Operating Hudi

    The desc command prints a Hudi table's description, e.g. hoodie:hoodie_table_1->desc shows hoodie.table.type = COPY_ON_WRITE, and help lists the available CLI commands. At the start of every write, Hudi also writes an .inflight commit into the .hoodie folder. Per-record metadata fields: _hoodie_record_key - the primary key within each DFS partition and the basis for all updates/inserts; _hoodie_commit_time - the record's last commit; _hoodie_file_name - the actual file name containing the record.
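
    These metadata columns can also be inspected straight from Spark; a minimal sketch, with the base path assumed:

        // Load the table and project Hudi's per-record metadata columns.
        val df = spark.read.format("hudi").load("/tmp/hudi/hoodie_table_1")   // assumed base path
        df.select("_hoodie_commit_time", "_hoodie_record_key", "_hoodie_file_name").show(false)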

    Published on 2021-04-13
  • From the column ApacheHudi

    Hands-On: Configuring Datadog to Monitor Apache Hudi Application Metrics

    hoodie.metrics.on=true and hoodie.metrics.reporter.type=DATADOG turn the reporter on. hoodie.metrics.datadog.api.site=EU (or US) selects the Datadog API site. hoodie.metrics.datadog.api.key=<your API key> sets the API key directly; for better security, implement a key supplier instead and set its fully qualified class name via hoodie.metrics.datadog.api.key.supplier=<your API key supplier> - and since hoodie.metrics.datadog.api.key takes higher priority, make sure it is left unset in that case. A further property configures a metric prefix to distinguish metrics from different jobs.
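
    A sketch of wiring these Datadog properties into a Spark write; reading the key from an environment variable is an assumption for illustration, not the article's approach:

        import org.apache.spark.sql.SaveMode

        df.write.format("hudi").
          option("hoodie.table.name", "demo_tbl").                         // assumed table name
          option("hoodie.metrics.on", "true").
          option("hoodie.metrics.reporter.type", "DATADOG").
          option("hoodie.metrics.datadog.api.site", "US").                 // or "EU"
          option("hoodie.metrics.datadog.api.key", sys.env("DD_API_KEY")). // placeholder; keeps the key out of code
          mode(SaveMode.Append).
          save("/tmp/hudi/demo_tbl")                                       // assumed base path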

    Published on 2021-04-13
  • From the column ApacheHudi

    How to Concurrently Write Data to Apache Hudi Without Locks

    option("hoodie.write.concurrency.mode","OPTIMISTIC_CONCURRENCY_CONTROL"). option("hoodie.cleaner.policy.failed.writes "). option("hoodie.metadata.enable","false"). option("hoodie.metadata.enable","false"). option("hoodie.table.services.enabled","false"). option("hoodie.write.concurrency.mode","OPTIMISTIC_CONCURRENCY_CONTROL"). option("hoodie.cleaner.policy.failed.writes "). option("hoodie.metadata.enable","false"). option("hoodie.clustering.inline","true"). option("hoodie.clustering.inline.max.commits

    Edited on 2023-09-04
  • From the column 大数据-BigData

    Monitoring Hudi Metrics with Prometheus

    …' = 'true',
    'hoodie.metrics.executor.enable' = 'true',
    'hoodie.metrics.reporter.type' = 'PROMETHEUS_PUSHGATEWAY',
    'hoodie.metrics.pushgateway.job.name' = 'hudi-metrics',
    'hoodie.metrics.pushgateway.host' = 'hadoop1',
    'hoodie.metrics.pushgateway.report.period.seconds' = '10',
    'hoodie.metrics.pushgateway.delete.on.shutdown' = 'false',
    'hoodie.metrics.pushgateway.random.job.name.suffix' = 'false'
    );
    Compared with the original article, this post adds the hoodie.metrics.* options when creating the Hudi target table.
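
    The same metrics keys can be set from a Spark writer as well; a hedged sketch, where the port option and its 9091 value (the Pushgateway default) are assumptions not shown in the excerpt:

        import org.apache.spark.sql.SaveMode

        df.write.format("hudi").
          option("hoodie.table.name", "demo_tbl").                           // assumed table name
          option("hoodie.metrics.on", "true").
          option("hoodie.metrics.reporter.type", "PROMETHEUS_PUSHGATEWAY").
          option("hoodie.metrics.pushgateway.host", "hadoop1").
          option("hoodie.metrics.pushgateway.port", "9091").                 // assumed default port
          option("hoodie.metrics.pushgateway.job.name", "hudi-metrics").
          option("hoodie.metrics.pushgateway.random.job.name.suffix", "false").
          mode(SaveMode.Append).
          save("/tmp/hudi/demo_tbl")                                         // assumed base path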

    Edited on 2022-01-19
  • From the column ApacheHudi

    A Guide to Running Apache Hudi + Flink Jobs

    …=uuid
    hoodie.datasource.write.partitionpath.field=ts
    bootstrap.servers=xxx:9092
    hoodie.deltastreamer.keygen.timebased.timestamp.type=EPOCHMILLISECONDS
    hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd
    hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator
    hoodie.embed.timeline.server=false
    hoodie.deltastreamer.schemaprovider.source.schema.file=hdfs://olap/hudi/test/config/flink/schema.avsc
    hoodie.deltastreamer.schemaprovider.target.schema.file=…
    The query output includes the metadata columns _hoodie_commit_seqno, _hoodie_record_key, _hoodie_partition_path and _hoodie_file_name.

    Published on 2021-04-13
  • From the column 大数据-BigData

    Hudi Schema Evolution

    | option("hoodie.index.type","SIMPLE"). | option(TABLE_NAME.key, tableName). | string| null| |_hoodie_commit_seqno| string| null| | _hoodie_record_key| string| null| |_hoodie_partition...| string| null| | _hoodie_file_name| string| null| | string| null| |_hoodie_commit_seqno| string| null| | _hoodie_record_key| string| null| |_hoodie_partition...| string| null| | _hoodie_file_name| string| null|

    Edited on 2022-01-19
  • From the column ApacheHudi

    A Detailed Guide to Configuring Apache Hudi's Partition Types

    Each generated table DDL begins with the same metadata columns: `_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string, …
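
    As a concrete instance of one partition type, a hedged sketch of a multi-field (complex) partition setup; the field names and paths are assumptions:

        import org.apache.spark.sql.SaveMode

        df.write.format("hudi").
          option("hoodie.table.name", "demo_tbl").                              // assumed table name
          option("hoodie.datasource.write.recordkey.field", "uuid").
          option("hoodie.datasource.write.partitionpath.field", "region,dt").   // assumed fields
          option("hoodie.datasource.write.keygenerator.class",
                 "org.apache.hudi.keygen.ComplexKeyGenerator").                 // multi-field partitions
          option("hoodie.datasource.write.hive_style_partitioning", "true").    // region=.../dt=... layout
          mode(SaveMode.Append).
          save("/tmp/hudi/demo_tbl")                                            // assumed base path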

    Published on 2021-04-13
  • From the column 大数据成神之路

    Personal Notes on Hudi Small-File Handling and Production Tuning

    hoodie.parquet.max.file.size: Hudi tries to keep file sizes at this configured value; hoodie.parquet.small.file.limit: any file smaller than this value is treated as a small file; hoodie.copyonwrite.insert.split.size can be tuned from hoodie.parquet.max.file.size and the size of a single record. Suppose hoodie.parquet.max.file.size is configured as 120MB and hoodie.parquet.small.file.limit as 100MB. The number of records in each new data file is determined by hoodie.copyonwrite.insert.split.size. We also recommend setting the shuffle parallelism, via the hoodie.… config.
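
    A hedged sketch applying the sizes from the example above (120MB max file size, 100MB small-file limit); both limits are in bytes, and the split size value is an assumption:

        import org.apache.spark.sql.SaveMode

        df.write.format("hudi").
          option("hoodie.table.name", "demo_tbl").                                  // assumed table name
          option("hoodie.datasource.write.recordkey.field", "uuid").
          option("hoodie.parquet.max.file.size", (120 * 1024 * 1024).toString).     // 120MB target size
          option("hoodie.parquet.small.file.limit", (100 * 1024 * 1024).toString).  // files under 100MB are "small"
          option("hoodie.copyonwrite.insert.split.size", "500000").                 // assumed records per new file
          mode(SaveMode.Append).
          save("/tmp/hudi/demo_tbl")                                                // assumed base path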

    Edited on 2022-04-13
  • From the column ApacheHudi

    An Analysis of How Spark Reads the Schema of a Changed Hudi Dataset

    option("hoodie.insert.shuffle.parallelism", "10"). option("hoodie.upsert.shuffle.parallelism", "10"). option("hoodie.delete.shuffle.parallelism", "10"). option("hoodie.keep.max.commits", "5"). option("hoodie.keep.min.commits", "4"). option("hoodie.cleaner.commits.retained", "3").

    Published on 2021-04-13
  • From the column ApacheHudi

    OnZoom's Unified Stream/Batch Architecture Practice Based on Apache Hudi

    If you only care about the final state of the data, you can filter on _hoodie_commit_time to fetch incremental data (the corresponding parameter is hoodie.…). hoodie.combine.before.insert decides whether records with the same recordKey within a batch are merged, and defaults to false; hoodie.parquet.small.file.limit and hoodie.merge.allow.duplicate.on.inserts control the small-file threshold and how small files are merged. If hoodie.parquet.small.file.limit > 0 and hoodie.merge.allow.duplicate.on.inserts is false, then during small-file merging …
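
    A minimal sketch of the incremental read described here; the begin instant, path and column names are placeholders:

        // Incremental query: only records committed after the given instant.
        val incr = spark.read.format("hudi").
          option("hoodie.datasource.query.type", "incremental").
          option("hoodie.datasource.read.begin.instanttime", "20211201000000"). // placeholder instant
          load("/tmp/hudi/demo_tbl")                                            // assumed base path
        incr.select("_hoodie_commit_time", "uuid").show(false)                  // "uuid" is assumed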

    Edited on 2021-12-21
  • From the column ApacheHudi

    Apache Hudi 0.6.0 Released

    A new property, hoodie.table.version, is added to the properties file; whenever a table is used with a new Hudi table version, e.g. 1 (migrating from pre-0.6.0 to 0.6.0), the table is upgraded automatically, and only once; afterwards hoodie.table.version …. A downgrade command-line tool (-downgrade) is likewise provided: if a user wants to roll back from 0.6.0 to an earlier version, hoodie.table.version changes from 1 back to 0. Cleaning can now run concurrently with writing; enable hoodie.clean.async=true to shorten the commit path. Spark Streaming writes support asynchronous compaction via hoodie.datasource.compaction.async.enable …. Rollback can use marker files instead of listing the whole table; set hoodie.rollback.using.markers=true to enable it. A new index type, hoodie.index.type=SIMPLE, is faster than BLOOM_INDEX when updates/deletes cover most of the table.
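
    A hedged sketch of opting into the 0.6.0 features called out above, combined into one writer for brevity (names and path are assumptions):

        import org.apache.spark.sql.SaveMode

        df.write.format("hudi").
          option("hoodie.table.name", "demo_tbl").                   // assumed table name
          option("hoodie.datasource.write.recordkey.field", "uuid").
          option("hoodie.clean.async", "true").                      // clean concurrently with writing
          option("hoodie.rollback.using.markers", "true").           // marker-based rollback, no full listing
          option("hoodie.index.type", "SIMPLE").                     // for update/delete-heavy workloads
          mode(SaveMode.Append).
          save("/tmp/hudi/demo_tbl")                                 // assumed base path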

    Published on 2021-04-13
  • From the column 大数据成神之路

    Using Apache Hudi's Clustering Feature to Solve the Problem of Too Many Small Files

    cat /home/hadoop/hudi_clustering/clusteringjob.properties
    hoodie.clustering.inline=true
    hoodie.clustering.inline.max.commits=2
    hoodie.clustering.plan.strategy.max.num.groups=40
    hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824
    hoodie.clustering.plan.strategy.max.bytes.per.group=2147483648
    hoodie.clustering.plan.strategy.small.file.limit=…
    hoodie.commits.archival.batch=60
    hoodie.archive.merge.small.file.limit.bytes=104857600
    # When set to =false
    hoodie.parquet.small.file.limit=124857600
    hoodie.cleaner.parallelism=800
    hoodie.cleaner.incremental.mode=…
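
    The same clustering switches can be set inline from a Spark writer instead of a properties file; a brief sketch with assumed names:

        import org.apache.spark.sql.SaveMode

        df.write.format("hudi").
          option("hoodie.table.name", "demo_tbl").                                       // assumed table name
          option("hoodie.datasource.write.recordkey.field", "uuid").
          option("hoodie.clustering.inline", "true").
          option("hoodie.clustering.inline.max.commits", "2").                           // cluster every 2 commits
          option("hoodie.clustering.plan.strategy.target.file.max.bytes", "1073741824"). // 1GB target files
          mode(SaveMode.Append).
          save("/tmp/hudi/demo_tbl")                                                     // assumed base path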

    Edited on 2022-11-11
  • From the column 大数据成神之路

    Real-Time Data Lake: Streaming Flink CDC Writes into Hudi

    The Hive table DDL begins with Hudi's metadata columns: `_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string, …

    Published on 2021-07-12
  • From the column ApacheHudi

    Mastering Apache Hudi's Primary Key and Partition Configuration in One Article

    hoodie.datasource.write.recordkey.field - specifies the record key field; hoodie.datasource.write.partitionpath.field - specifies the partition field; hoodie.datasource.write.keygenerator.class - specifies the key generator class; hoodie.deltastreamer.keygen.timebased.timezone - the timezone of the data format; hoodie.deltastreamer.keygen.timebased.input.dateformat and hoodie.deltastreamer.keygen.timebased.input.timezone (default "UTC") - the input date format and timezone; hoodie.deltastreamer.keygen.timebased.output.dateformat (e.g. "yyyyMMddHH") and hoodie.deltastreamer.keygen.timebased.output.timezone - the output date format and timezone.

    Published on 2021-04-13
  • From the column ApacheHudi

    Hands-On: Integrating PySpark with Apache Hudi

    …': tableName,
    'hoodie.datasource.write.recordkey.field': 'uuid',
    'hoodie.datasource.write.partitionpath.field': 'partitionpath',
    'hoodie.datasource.write.table.name': tableName,
    'hoodie.datasource.write.operation': 'insert',
    'hoodie.datasource.write.precombine.field': 'ts',
    'hoodie.upsert.shuffle.parallelism': …
    spark.sql("select …, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare from hudi_trips_snapshot").show()
    For deletes, the same options with 'hoodie.datasource.write.operation': 'delete', 'hoodie.datasource.write.precombine.field': 'ts', 'hoodie.upsert.shuffle.parallelism': …

    Published on 2021-04-13
  • From the column devops

    [Architecture in Practice] Data Lake Architecture Design and Practice

    ("spark.sql.extensions","org.apache.spark.sql.hudi.HoodieSparkSessionExtension")\.getOrCreate()#写入数据hoodie_options ={'hoodie.table.name':'events','hoodie.datasource.write.recordkey.field':'id','hoodie.datasource.write.partitionpath.field ':'date','hoodie.datasource.write.table.type':'COPY_ON_WRITE','hoodie.datasource.write.operation':'bulk_insert ','hoodie.datasource.write.precombine.field':'ts','hoodie.upsert.shuffle.parallelism':'200','hoodie.insert.shuffle.parallelism ':'200'}df.write\.format("hudi")\.options(**hoodie_options)\.mode("append")\.save("hdfs://namenode:8020

    Edited on 2026-04-05
  • From the column 陈猿解码

    Understanding Hudi's Core Concepts from Its Persisted Files

    Contents of hoodie.properties:
    hoodie.table.precombine.field=ts
    hoodie.table.name=t1
    hoodie.archivelog.folder=archived
    hoodie.table.type=MERGE_ON_READ
    hoodie.table.version=2
    hoodie.table.partition.fields=partition
    hoodie.timeline.layout.version=…
    Here hoodie.table.name is the table name, hoodie.archivelog.folder the table's archive path, hoodie.table.type the table type (MOR or COW), and hoodie.table.version …. Each record carries "_hoodie_record_key", "_hoodie_partition_path" and "_hoodie_file_name". A data file cannot be viewed directly; decoded via code, its content prints like: "…": "20211230092036", "_hoodie_commit_seqno": "20211230092036_1_1", "_hoodie_record_key": "id4", "_hoodie_partition_path": …
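
    A small sketch of peeking at a table's hoodie.properties with plain Java I/O, in the spirit of the article; the path is an assumption (the file lives under the table's .hoodie directory):

        import java.io.FileInputStream
        import java.util.Properties

        val props = new Properties()
        props.load(new FileInputStream("/tmp/hudi/t1/.hoodie/hoodie.properties"))  // assumed local path
        println(props.getProperty("hoodie.table.type"))     // MERGE_ON_READ or COPY_ON_WRITE
        println(props.getProperty("hoodie.table.version"))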

    Edited on 2023-02-28
  • From the column ApacheHudi

    Hands-On | Ingesting Streaming Kafka Data into Hudi

    …=2
    hoodie.insert.shuffle.parallelism=2
    hoodie.bulkinsert.shuffle.parallelism=2
    hoodie.datasource.write.recordkey.field=uuid
    hoodie.datasource.write.partitionpath.field=create_time
    hoodie.datasource.write.precombine.field=…
    …=yyyy/MM/dd
    hoodie.datasource.hive_sync.database=dwd
    hoodie.datasource.hive_sync.table=test
    hoodie.datasource.hive_sync.username=<username>
    hoodie.datasource.hive_sync.password=<password>
    hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://.....
    hoodie.datasource.hive_sync.partition_fields=…
    For incremental queries from Hive:
    set hoodie.test.consume.mode=INCREMENTAL;
    set hoodie.test.consume.max.commits=3;
    set hoodie.test.consume.start.timestamp=20200427114546

    Published on 2021-04-13
  • From the column 大数据技术架构

    Apache Hudi 0.5.1 Released

    To enable the new Hudi timeline layout, which avoids renames, set the write config hoodie.timeline.layout.version=1. You can also add hoodie.timeline.layout.version=1 to the hoodie.properties file using the CLI command repair overwrite-hoodie-props. The CLI supports repair overwrite-hoodie-props to rewrite a table's hoodie.properties file from a given file; it can be used to update the table name or switch to the new timeline layout. Note that while the hoodie.properties file is being written (milliseconds), some queries may fail temporarily; simply rerun them. DynamicBloomFilter is now supported; it is off by default and can be enabled with the index config hoodie.bloom.index.filter.type=DYNAMIC_V0.
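
    A short sketch of enabling the dynamic bloom filter mentioned above from a Spark writer; table name, key field and path are assumptions:

        import org.apache.spark.sql.SaveMode

        df.write.format("hudi").
          option("hoodie.table.name", "demo_tbl").                   // assumed table name
          option("hoodie.datasource.write.recordkey.field", "uuid").
          option("hoodie.index.type", "BLOOM").
          option("hoodie.bloom.index.filter.type", "DYNAMIC_V0").    // off by default
          mode(SaveMode.Append).
          save("/tmp/hudi/demo_tbl")                                 // assumed base path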

    Published on 2020-03-11