
Assuming you want to do this programmatically in Java, the answer is different.

Are both files sorted? If so, you don't need to read either file into memory in full. Start at the beginning of both files and repeat the following until both are exhausted:

<ol>
<li>If the entries match, advance the "current" line in both files.</li>
<li>If the entries don't match, determine which file's line would come first, display that line, and advance the current line in that file.</li>
</ol>
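The two steps above can be sketched as follows. This is a minimal illustration, not a definitive implementation: it assumes both inputs are sorted in `String.compareTo` order, and the class name `SortedDiff` and the `< ` / `> ` output prefixes are made up for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;

/** Streams two line-sorted inputs and reports the lines present in only one of them. */
class SortedDiff {

    /**
     * Both readers must deliver lines in String.compareTo order (an assumption of this sketch).
     * Lines found only in the left input are written as "< line", only in the right as "> line".
     */
    static void diff(Reader left, Reader right, Writer out) throws IOException {
        BufferedReader a = new BufferedReader(left);
        BufferedReader b = new BufferedReader(right);
        String x = a.readLine();
        String y = b.readLine();
        while (x != null || y != null) {
            // An exhausted input behaves as if its next line sorted after everything else.
            int cmp = (x == null) ? 1 : (y == null) ? -1 : x.compareTo(y);
            if (cmp == 0) {            // entries match: advance the current line in both files
                x = a.readLine();
                y = b.readLine();
            } else if (cmp < 0) {      // the left line sorts first: it is missing on the right
                out.write("< " + x + "\n");
                x = a.readLine();
            } else {                   // the right line sorts first: it is missing on the left
                out.write("> " + y + "\n");
                y = b.readLine();
            }
        }
        out.flush();
    }
}
```

For real files you could pass `Files.newBufferedReader(path)` for each input and `Files.newBufferedWriter(path)` for the output. Only one line per file is held in memory at any time.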

If the files are not sorted, you could sort them before the diff. Again, since you need a low-memory solution, don't read the entire file in to sort it. Split the file into manageable chunks, sort each chunk, and then merge the sorted chunks (an external merge sort).
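A rough sketch of the chunk-and-combine idea, under stated assumptions: the chunk size is counted in lines (`maxLines`), the sorted chunks are written to temporary files, and they are combined with a k-way merge through a `PriorityQueue`, which is the usual way to combine already-sorted runs. The class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

class ExternalSort {

    /** Sorts the lines of input into output, holding at most maxLines lines in memory. */
    static void sort(Path input, Path output, int maxLines) throws IOException {
        List<Path> runs = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            List<String> chunk = new ArrayList<>();
            String line;
            while ((line = in.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() == maxLines) {   // chunk is full: sort it and spill to disk
                    runs.add(writeRun(chunk));
                    chunk.clear();
                }
            }
            if (!chunk.isEmpty()) runs.add(writeRun(chunk));
        }
        merge(runs, output);
        for (Path run : runs) Files.deleteIfExists(run);
    }

    /** Sorts one chunk in memory and writes it to a temporary file (a sorted "run"). */
    private static Path writeRun(List<String> chunk) throws IOException {
        Collections.sort(chunk);
        Path run = Files.createTempFile("run", ".txt");
        Files.write(run, chunk, StandardCharsets.UTF_8);
        return run;
    }

    /** One open run together with the line currently at its head. */
    private static final class Head implements Comparable<Head> {
        final BufferedReader reader;
        String line;
        Head(BufferedReader reader) throws IOException {
            this.reader = reader;
            this.line = reader.readLine();
        }
        public int compareTo(Head other) { return line.compareTo(other.line); }
    }

    /** K-way merge: repeatedly emit the smallest head line among all runs. */
    private static void merge(List<Path> runs, Path output) throws IOException {
        PriorityQueue<Head> heap = new PriorityQueue<>();
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            for (Path run : runs) {
                Head h = new Head(Files.newBufferedReader(run, StandardCharsets.UTF_8));
                if (h.line != null) heap.add(h); else h.reader.close();
            }
            while (!heap.isEmpty()) {
                Head h = heap.poll();
                out.write(h.line);
                out.write('\n');
                h.line = h.reader.readLine();   // advance this run, re-queue if not exhausted
                if (h.line != null) heap.add(h); else h.reader.close();
            }
        }
    }
}
```

During the merge only one line per run is in memory, so the peak footprint is set by `maxLines`. The sketch does not close the run readers if an exception interrupts the merge; a production version would.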


The <a href="http://linux.about.com/library/cmd/blcmdl1_diff.htm" rel="nofollow">Unix command diff</a> works for exact matches.

You can also run it with the <code>-b</code> flag to ignore differences that are only in the amount of whitespace.


Use <a href="https://github.com/uniVocity/univocity-parsers" rel="nofollow">uniVocity-parsers</a>, as it comes with the fastest CSV parser for Java. It can process files as large as 100 GB quickly and without any issue.

To compare large CSV files, I suggest writing your own implementation of <a href="https://github.com/uniVocity/univocity-parsers/blob/master/src/main/java/com/univocity/parsers/common/processor/RowProcessor.java" rel="nofollow">RowProcessor</a> and wrapping it in a <a href="https://github.com/uniVocity/univocity-parsers/blob/master/src/main/java/com/univocity/parsers/common/processor/ConcurrentRowProcessor.java" rel="nofollow">ConcurrentRowProcessor</a>.

Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).


I suggest you compare the files line by line rather than loading an entire file into memory. Alternatively, load only a group of lines at a time.


There is a Java library, <a href="http://opencsv.sourceforge.net/" rel="nofollow">OpenCSV</a>, for parsing CSV files. You can build lazy loading of the file on top of it; see <a href="http://java.dzone.com/articles/incrementally-readstream-csv" rel="nofollow">this article</a>. Hope it helps.


Here is a similar post on Stack Overflow in which I outlined a solution that requires only the smaller of the two files to be held in memory:

<a href="https://stackoverflow.com/questions/38120211/how-to-compare-two-large-csv-files-and-get-the-difference-file">How to compare two large CSV files and get the difference file</a>

This is a general solution that doesn't require the files to be sorted, since you state in the question that the order of the lines may differ.

Even that memory cost can be avoided. I don't want to repeat the solution here, but the idea is to index one file and then walk through the other. You can avoid storing the entire smaller file in memory by keeping only a hash table that records the hash of each row and the row's location in the file. That way you will have to touch the file on disk many times, but you won't have to keep it in memory.

Running time of the algorithm is O(N + M). Memory consumption is O(min(N, M)).
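The indexing idea can be sketched as below. This is an illustration under assumptions, not the code from the linked post: it indexes file `a` by `String.hashCode()` of each line, stores byte offsets from `RandomAccessFile`, treats duplicate lines as a multiset, and uses made-up names (`OffsetIndexDiff`, `consumeMatch`) and `< ` / `> ` prefixes.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

class OffsetIndexDiff {

    /** Returns lines found in only one file, prefixed "< " (only in a) or "> " (only in b). */
    static List<String> diff(Path a, Path b) throws IOException {
        List<String> result = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(a.toFile(), "r")) {
            // Index file a: line hash -> byte offsets of the lines with that hash.
            Map<Integer, List<Long>> index = new HashMap<>();
            long pos = raf.getFilePointer();
            String line;
            while ((line = raf.readLine()) != null) {
                index.computeIfAbsent(line.hashCode(), k -> new ArrayList<>()).add(pos);
                pos = raf.getFilePointer();
            }
            // Walk file b. ISO_8859_1 maps bytes to chars the same way
            // RandomAccessFile.readLine does, so equal lines compare equal.
            try (BufferedReader in = Files.newBufferedReader(b, StandardCharsets.ISO_8859_1)) {
                while ((line = in.readLine()) != null) {
                    if (!consumeMatch(raf, index.get(line.hashCode()), line)) {
                        result.add("> " + line);
                    }
                }
            }
            // Offsets still in the index were never matched by a line of b.
            for (List<Long> offsets : index.values()) {
                for (long offset : offsets) {
                    raf.seek(offset);
                    result.add("< " + raf.readLine());
                }
            }
        }
        return result;
    }

    /** Re-reads candidate lines from disk; removes the first offset whose line equals line. */
    private static boolean consumeMatch(RandomAccessFile raf, List<Long> offsets, String line)
            throws IOException {
        if (offsets == null) return false;
        for (Iterator<Long> it = offsets.iterator(); it.hasNext();) {
            raf.seek(it.next());
            if (line.equals(raf.readLine())) {   // guards against hash collisions
                it.remove();
                return true;
            }
        }
        return false;
    }
}
```

Only hashes and offsets stay in memory; the row text is re-read from disk on each candidate match. `RandomAccessFile.readLine` is unbuffered, and boxed `Integer`/`Long` entries are memory-hungry, so for files of several gigabytes you would likely want buffered indexing and primitive collections.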

I have to compare two CSV files of 2-3 GB each on a Windows platform.

I tried loading the first one into a HashMap to compare it with the second, but the result (as expected) is very high memory consumption.

The goal is to write the differences to another file.

The lines may appear in a different order, and some may be missing.

Any suggestions?

How to compare differences in very large csv files
