首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >从64字节数组中查找字节的最快方法

从64字节数组中查找字节的最快方法
EN

Stack Overflow用户
提问于 2016-09-12 01:17:02
回答 4查看 126关注 0票数 0

我在代码中有一个关键的地方:我需要从64字节的数组中查找大约1'000'000次。

最小代码:

代码语言:javascript
复制
#include <iostream>
#include <stdint.h>
#include <random>
#include <chrono>
#include <ctime>
#define TYPE uint8_t
#define n_lookup 64

int main(){
    const int n_indices = 1000000;
    TYPE lookup[n_lookup];
    TYPE indices[n_indices];
    TYPE result[n_indices];

    // preparations
    std::default_random_engine generator;
    std::uniform_int_distribution<int> distribution(0, n_lookup);
    for (int i=0; i < n_indices; i++) indices[i] = distribution(generator);
    for (int i=0; i < n_lookup; i++) lookup[i] = distribution(generator);

    std::chrono::time_point<std::chrono::system_clock> start = std::chrono::system_clock::now();
    // main loop:
    for (int i=0; i < n_indices; i++) {
        result[i] = lookup[indices[i]];
    }
    std::chrono::time_point<std::chrono::system_clock> end = std::chrono::system_clock::now();

    std::chrono::duration<double> elapsed_seconds = end - start;
    std::cout << "computation took " << elapsed_seconds.count() * 1e9 / n_indices  <<  " ns per element"<< std::endl;
    // printing random numbers to avoid code elimination
    std::cout << result[12] << result[45]; 
    return 0;
}

在使用g++ lookup.cpp -std=gnu++11 -O3 -funroll-loops进行编译后,我在现代CPU上得到的每个元素略低于1 1ns。

我需要这个操作的工作速度快2-3倍(没有线程)。我该怎么做呢?

另外,我也在研究AVX512 (512bit正好是查找表的大小!)指令集,但它缺少8位聚集操作!

EN

回答 4

Stack Overflow用户

发布于 2016-09-12 01:46:45

indicesresult向量在内存中的不同位置,但同时被访问。这会导致缓存未命中。我建议你将结果和索引合并到一个向量中。代码如下:

代码语言:javascript
复制
#include <iostream>
#include <stdint.h>
#include <random>
#include <chrono>
#include <ctime>
#define TYPE uint8_t
#define n_lookup 64

int main(){
    const int n_indices = 2000000;
    TYPE lookup[n_lookup];
    // Merge indices and result
    // If i is index, then i+1 is result
    TYPE ind_res[n_indices];

    // preparations
    std::default_random_engine generator;
    std::uniform_int_distribution<int> distribution(0, n_lookup);
    for (int i=0; i < n_indices; i += 2) ind_res[i] = distribution(generator);
    for (int i=0; i < n_lookup; i++) lookup[i] = distribution(generator);

    std::chrono::time_point<std::chrono::system_clock> start = std::chrono::system_clock::now();
    // main loop:
    for (int i=0; i < n_indices; i += 2) {
        ind_res[i+1] = lookup[ind_res[i]]; // more dense access here, no cache-miss
    }
    std::chrono::time_point<std::chrono::system_clock> end = std::chrono::system_clock::now();

    std::chrono::duration<double> elapsed_seconds = end - start;
    std::cout << "computation took " << elapsed_seconds.count() * 1e9 / n_indices  <<  " ns per element"<< std::endl;
    // printing random numbers to avoid code elimination
    std::cout << ind_res[24] << ind_res[90]; 
    return 0;
}

我的测试表明,这段代码运行得更快。

票数 3
EN

Stack Overflow用户

发布于 2016-09-12 01:34:04

使用-march=native,这是你的循环编译成的:

代码语言:javascript
复制
        movq    %rax, %rbx
        xorl    %eax, %eax
.L145:
        movzbl  128(%rsp,%rax), %edx
        movzbl  64(%rsp,%rdx), %edx
        movb    %dl, 1000128(%rsp,%rax)
        addq    $1, %rax
        cmpq    $1000000, %rax
        jne     .L145

我很难理解在没有并行化的情况下,它会变得更快。

通过将类型更改为int32_t,它将被矢量化:

代码语言:javascript
复制
    vpcmpeqd        %ymm2, %ymm2, %ymm2
    movq    %rax, %rbx
    xorl    %eax, %eax
.L145:
        vmovdqa -8000048(%rbp,%rax), %ymm1
        vmovdqa %ymm2, %ymm3
        vpgatherdd      %ymm3, -8000304(%rbp,%ymm1,4), %ymm0
        vmovdqa %ymm0, -4000048(%rbp,%rax)
        addq    $32, %rax
        cmpq    $4000000, %rax
        jne     .L145
        vzeroupper

这会有帮助吗?

票数 2
EN

Stack Overflow用户

发布于 2016-09-12 04:28:38

首先,有一个bug,分布(0,64)产生数字0到64,64不能放入数组中。

通过一次查找两个值,可以将查找速度提高2倍:

代码语言:javascript
复制
#include <iostream>
#include <stdint.h>
#include <random>
#include <chrono>
#include <ctime>
#define TYPE uint8_t
#define TYPE2 uint16_t
#define n_lookup 64

void tst() {
    const int n_indices = 1000000;// has to be multiple of 2
    TYPE lookup[n_lookup];
    TYPE indices[n_indices];
    TYPE result[n_indices];
    TYPE2 lookup2[n_lookup * 256];

    // preparations
    std::default_random_engine generator;
    std::uniform_int_distribution<int> distribution(0, n_lookup-1);
    for (int i = 0; i < n_indices; i++) indices[i] = distribution(generator);
    for (int i = 0; i < n_lookup; i++) lookup[i] = distribution(generator);

    for (int i = 0; i < n_lookup; ++i) {
        for (int j = 0; j < n_lookup; ++j) {
            lookup2[(i << 8) | j] = (lookup[i] << 8) | lookup[j];
        }
    }

    std::chrono::time_point<std::chrono::system_clock> start = std::chrono::system_clock::now();

    TYPE2* indices2 = (TYPE2*)indices;
    TYPE2* result2 = (TYPE2*)result;

    // main loop:
    for (int i = 0; i < n_indices / 2; ++i) {
        *result2++ = lookup2[*indices2++];
    }

    std::chrono::time_point<std::chrono::system_clock> end = std::chrono::system_clock::now();

    for (int i = 0; i < n_indices; i++) {
        if (result[i] != lookup[indices[i]]) {
            std::cout << "!!!!!!!!!!!!!ERROR!!!!!!!!!!!!!";
        }
    }

    std::chrono::duration<double> elapsed_seconds = end - start;
    std::cout << "computation took " << elapsed_seconds.count() * 1e9 / n_indices << " ns per element" << std::endl;
    // printing random numbers to avoid code elimination
    std::cout << result[12] << result[45];
}

int main() {
    tst();
    std::cin.get();
    return 0;
}
票数 2
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/39438675

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档