Q: Skipping in a large file in C++ / fast random access

Stack Overflow user

Asked 2016-12-21 22:33:24

Answers: 3, Views: 1.5K, Followers: 0, Votes: 5

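I have large files containing a small number of large datasets. Each dataset contains a name and the dataset size in bytes, which makes it possible to skip over it and go to the next dataset.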

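I want to build an index of the dataset names very quickly. An example file is about 21 GB in size and contains 88 datasets. Skipping between the datasets to read the 88 names with std::ifstream and seekg() takes about 1300 ms, which I would like to reduce.

In effect, I am reading 88 chunks of about 30 bytes each, at given positions in a 21 GB file, and it takes 1300 ms.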

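Is there a way to improve this, or is it a limitation of the operating system and file system? I am running the test under Windows 7 64-bit.

I know that having a complete index at the beginning of the file would be better, but the file format does not have one, and we cannot change it.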
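For reference, the scan described above might look roughly like the following sketch. The question does not give the real file format, so the record layout (a 32-bit name length, the name, a 64-bit payload size, then the payload) and the names `IndexEntry` and `buildIndex` are assumptions made purely for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical record layout (NOT taken from the question):
//   uint32_t nameLength | name bytes | uint64_t payloadSize | payload bytes
struct IndexEntry {
    std::string   name;
    std::uint64_t payloadOffset; // absolute offset of the payload in the file
    std::uint64_t payloadSize;   // payload size in bytes
};

// Walks the file header by header, skipping each payload with seekg().
std::vector<IndexEntry> buildIndex(const std::string& path) {
    std::vector<IndexEntry> index;
    std::ifstream in(path, std::ios::binary);
    std::uint32_t nameLength = 0;
    while (in.read(reinterpret_cast<char*>(&nameLength), sizeof nameLength)) {
        IndexEntry entry;
        entry.name.resize(nameLength);
        in.read(&entry.name[0], static_cast<std::streamsize>(nameLength));
        in.read(reinterpret_cast<char*>(&entry.payloadSize),
                sizeof entry.payloadSize);
        if (!in) break; // truncated header: stop scanning
        const std::streamoff pos = in.tellg();
        assert(pos >= 0); // the stream is good, so the position is valid
        entry.payloadOffset = static_cast<std::uint64_t>(pos);
        index.push_back(entry);
        // Relative seek over the payload to the next record header.
        in.seekg(static_cast<std::streamoff>(entry.payloadSize), std::ios::cur);
    }
    return index;
}
```

Calling `buildIndex("data.bin")` (a placeholder path) returns one entry per dataset; this is the baseline that the answers below try to speed up.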

c++

file

random-access

3 Answers

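Stack Overflow user

Accepted answer

Posted 2016-12-21 23:12:51

You can scan the file once and build your own header, with keys and offsets, in a separate file. Depending on your use case, you can do this once at program start and again every time the file changes. Before accessing the large data, a lookup in the smaller file gives you the index you need.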
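A minimal sketch of such a separate index file, under stated assumptions: the line-based sidecar format ("offset size name") and the names `IndexEntry`, `saveIndex` and `loadIndex` are invented here for illustration and are not part of the original answer.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

struct IndexEntry {
    std::string   name;
    std::uint64_t payloadOffset; // where the dataset starts in the big file
    std::uint64_t payloadSize;   // dataset size in bytes
};

// Hypothetical sidecar format: one line per dataset, "<offset> <size> <name>".
void saveIndex(const std::string& sidecarPath,
               const std::vector<IndexEntry>& index) {
    std::ofstream out(sidecarPath);
    for (const IndexEntry& e : index) {
        // The line-based format cannot represent names containing newlines.
        assert(e.name.find('\n') == std::string::npos);
        out << e.payloadOffset << ' ' << e.payloadSize << ' ' << e.name << '\n';
    }
}

std::vector<IndexEntry> loadIndex(const std::string& sidecarPath) {
    std::vector<IndexEntry> index;
    std::ifstream in(sidecarPath);
    IndexEntry e;
    while (in >> e.payloadOffset >> e.payloadSize) {
        in.get();                 // consume the single space before the name
        std::getline(in, e.name); // the name may itself contain spaces
        index.push_back(e);
    }
    return index;
}
```

To notice that the large file has changed, one option would be to also store its size or modification time in the sidecar and rebuild the index when they no longer match.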

Votes: 2

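Stack Overflow user

Posted 2016-12-21 22:46:51

You can use a memory-mapped file interface (I recommend Boost's implementation).

This maps the file into virtual memory pages, so that your application can seek faster without having to go back to disk.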
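To show the idea without depending on Boost, here is a hedged sketch that uses the POSIX `mmap` call directly; it is a stand-in for the Boost wrapper the answer recommends, and on Windows the corresponding native calls are `CreateFileMapping` and `MapViewOfFile`. The record layout (32-bit name length, name, 64-bit payload size, payload) and the names `IndexEntry` and `buildIndexMapped` are assumptions, since the real format is not given. Mapping a 21 GB file in one piece requires a 64-bit process.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct IndexEntry {
    std::string   name;
    std::uint64_t payloadOffset;
    std::uint64_t payloadSize;
};

// Hypothetical layout: uint32_t nameLength | name | uint64_t payloadSize | payload
std::vector<IndexEntry> buildIndexMapped(const std::string& path) {
    std::vector<IndexEntry> index;
    const int fd = ::open(path.c_str(), O_RDONLY);
    if (fd < 0) return index;
    struct stat st;
    if (::fstat(fd, &st) != 0 || st.st_size <= 0) { ::close(fd); return index; }
    const std::uint64_t fileSize = static_cast<std::uint64_t>(st.st_size);
    void* base = ::mmap(nullptr, static_cast<std::size_t>(fileSize),
                        PROT_READ, MAP_PRIVATE, fd, 0);
    ::close(fd); // the mapping remains valid after the descriptor is closed
    if (base == MAP_FAILED) return index;
    const char* bytes = static_cast<const char*>(base);

    std::uint64_t pos = 0;
    while (pos + sizeof(std::uint32_t) <= fileSize) {
        std::uint32_t nameLength = 0;
        std::memcpy(&nameLength, bytes + pos, sizeof nameLength);
        pos += sizeof nameLength;
        if (pos + nameLength + sizeof(std::uint64_t) > fileSize) break; // truncated
        IndexEntry e;
        e.name.assign(bytes + pos, nameLength);
        pos += nameLength;
        std::memcpy(&e.payloadSize, bytes + pos, sizeof e.payloadSize);
        pos += sizeof e.payloadSize;
        e.payloadOffset = pos;
        assert(e.payloadOffset <= fileSize); // guaranteed by the check above
        index.push_back(e);
        pos += e.payloadSize; // plain arithmetic: payload pages are never touched
    }
    ::munmap(base, static_cast<std::size_t>(fileSize));
    return index;
}
```

Only the pages that hold the headers are actually read in; whether this beats `seekg()` in practice depends on the operating system and would have to be measured.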

Votes: 5

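Stack Overflow user

Posted 2016-12-21 23:15:36

You could use multithreading with a queue of buffers. You could create a custom structure that can hold varying amounts of data.

You said:

Each dataset contains a name and the dataset size in bytes, which makes it possible to skip over it and go to the next dataset.

So, since opening and closing the file over and over is slow, you could read the whole file in one go and store it in a single full buffer object, then parse it or store it in batches. This also depends on whether you are reading the file in text or binary mode, and on how easy the file is to parse. I will demonstrate the latter approach: filling multiple batches while reading a buffer-sized amount of data from the file at a time.

Pseudocode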

#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Assumption: the answer never defines DataType; raw bytes are used here.
using DataType = char;

struct Batch {
    std::string name;              // Name of the dataset
    std::uint64_t size = 0;        // Size of the dataset in bytes (64-bit, since files exceed 4 GB)
    std::uint64_t indexOffset = 0; // Offset of the next read location in the file
    bool empty = true;             // True while this batch holds no data
    std::vector<DataType> dataset; // Container for the data
};

std::vector<Batch> finishedBatches;

// Independent of the dataset size; this only controls how much of the file
// is read into memory at a time. Set to your preference, e.g. 1 MB to 4 MB.
const std::size_t bufferSize = 1024 * 1024;

void loadDataFromFile( const std::string& filename, std::size_t bufferSize, std::vector<Batch>& batches ) {

    // Set the ifstream's buffer size.

    // Open the file for reading and read up to bufferSize bytes.

    // Spawn a second thread to populate the batches; while one batch is being
    // filled, read the next bufferSize bytes. You will need a couple of local
    // batches to work with, so that when one batch is complete and you have
    // reached the next index location in the file, you can fill another.

    // When a batch is complete, push it into the vector to store it. Then reset
    // its flag and clear its vector so the empty batch can be reused.

    // Continue until the end of the file is reached.

}

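This would be a two-thread system. The main thread opens the file and reads from it, while a worker thread fills the batches, pushes completed batches into the container, and swaps to the next batch.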
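One possible shape for that two-thread arrangement is sketched below. It is not the answerer's code: the `ChunkQueue` class and `processFileInChunks` function are hypothetical names, the worker merely counts bytes where real parsing would go, and the queue is left unbounded for brevity, so a real pipeline would need to cap it to avoid buffering a very large file in memory.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Minimal hand-off queue between the reading thread and the worker thread.
class ChunkQueue {
public:
    void push(std::vector<char> chunk) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            chunks_.push(std::move(chunk));
        }
        ready_.notify_one();
    }
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }
    // Blocks until a chunk is available; returns false once closed and drained.
    bool pop(std::vector<char>& chunk) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return closed_ || !chunks_.empty(); });
        if (chunks_.empty()) return false;
        chunk = std::move(chunks_.front());
        chunks_.pop();
        return true;
    }
private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::vector<char>> chunks_;
    bool closed_ = false;
};

// The calling thread reads bufferSize-byte chunks; a worker consumes them.
// Returns the number of bytes the worker processed.
std::uint64_t processFileInChunks(const std::string& path, std::size_t bufferSize) {
    assert(bufferSize > 0);
    ChunkQueue queue;
    std::uint64_t bytesProcessed = 0;
    std::thread worker([&] {
        std::vector<char> chunk;
        while (queue.pop(chunk))
            bytesProcessed += chunk.size(); // placeholder for filling batches
    });
    std::ifstream in(path, std::ios::binary);
    while (in) {
        std::vector<char> chunk(bufferSize);
        in.read(chunk.data(), static_cast<std::streamsize>(chunk.size()));
        chunk.resize(static_cast<std::size_t>(in.gcount()));
        if (!chunk.empty()) queue.push(std::move(chunk));
    }
    queue.close();
    worker.join(); // after join, reading bytesProcessed is safe
    return bytesProcessed;
}
```

Note that this reads every byte of the file, so it addresses throughput when the data itself is needed rather than the cost of skipping between dataset headers.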

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/41273039
