Faulty benchmark, puzzling assembly

Stack Overflow user
Asked 2022-03-25 13:53:06
Answers 1 · Views 75 · Followers 0 · Votes 2

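Assembly newbie here. I wrote a benchmark to measure the floating-point performance of a machine when computing a transposed matrix-tensor product.

Given that my machine has 32 GiB of RAM (bandwidth ~37 GiB/s) and an Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz (Turbo 4.0GHz) processor, I estimated the maximum performance (with pipelining and data in registers) as 6 cores x 4.0GHz = 24 GFLOP/s. However, when I run the benchmark I measure 127 GFLOP/s, which is obviously wrong.

Note: to measure FP performance, I compute the operation count as n*n*n*n*6 (n^3 for the matrix-matrix multiplication, performed on n slices of complex data points, i.e. assuming 6 FLOPs per complex-complex multiplication) and divide it by the average time taken for each run.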

Code snippet from the main function

// benchmark runs
auto avg_dur = 0.0;
for (auto counter = std::size_t{}; counter < experiment_count; ++counter)
{
    #pragma noinline
    do_timed_run(n, avg_dur);
}
avg_dur /= static_cast<double>(experiment_count);

Code snippet: do_timed_run

void do_timed_run(const std::size_t& n, double& avg_dur)
{
    // create the data and lay first touch
    auto operand0 = matrix<double>(n, n);
    auto operand1 = tensor<double>(n, n, n);
    auto result = tensor<double>(n, n, n);
    
    // first touch
    #pragma omp parallel
    {
        set_first_touch(operand1);
        set_first_touch(result);
    }
    
    // do the experiment
    const auto dur1 = omp_get_wtime() * 1E+6;
    #pragma omp parallel firstprivate(operand0)
    {
        #pragma noinline
        transp_matrix_tensor_mult(operand0, operand1, result);
    }
    const auto dur2 = omp_get_wtime() * 1E+6;
    avg_dur += dur2 - dur1;
}

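Notes:

  • For now, I am not providing the code for the function transp_matrix_tensor_mult because I don't think it is relevant.
  • The #pragma noinline is a debugging construct that I use to better understand the output of the disassembler.

Now for the disassembly of the function do_timed_run.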

0000000000403a20 <_Z12do_timed_runRKmRd>:
  403a20:   48 81 ec d8 00 00 00    sub    $0xd8,%rsp
  403a27:   48 89 ac 24 c8 00 00    mov    %rbp,0xc8(%rsp)
  403a2e:   00 
  403a2f:   48 89 fd                mov    %rdi,%rbp
  403a32:   48 89 9c 24 c0 00 00    mov    %rbx,0xc0(%rsp)
  403a39:   00 
  403a3a:   48 89 f3                mov    %rsi,%rbx
  403a3d:   48 89 ee                mov    %rbp,%rsi
  403a40:   48 8d 7c 24 78          lea    0x78(%rsp),%rdi
  403a45:   48 89 ea                mov    %rbp,%rdx
  403a48:   4c 89 bc 24 a0 00 00    mov    %r15,0xa0(%rsp)
  403a4f:   00 
  403a50:   4c 89 b4 24 a8 00 00    mov    %r14,0xa8(%rsp)
  403a57:   00 
  403a58:   4c 89 ac 24 b0 00 00    mov    %r13,0xb0(%rsp)
  403a5f:   00 
  403a60:   4c 89 a4 24 b8 00 00    mov    %r12,0xb8(%rsp)
  403a67:   00 
  403a68:   e8 03 f8 ff ff          callq  403270 <_ZN5s3dft6matrixIdEC1ERKmS3_@plt>
  403a6d:   48 89 ee                mov    %rbp,%rsi
  403a70:   48 8d 7c 24 08          lea    0x8(%rsp),%rdi
  403a75:   48 89 ea                mov    %rbp,%rdx
  403a78:   48 89 e9                mov    %rbp,%rcx
  403a7b:   e8 80 f8 ff ff          callq  403300 <_ZN5s3dft6tensorIdEC1ERKmS3_S3_@plt>
  403a80:   48 89 ee                mov    %rbp,%rsi
  403a83:   48 8d 7c 24 40          lea    0x40(%rsp),%rdi
  403a88:   48 89 ea                mov    %rbp,%rdx
  403a8b:   48 89 e9                mov    %rbp,%rcx
  403a8e:   e8 6d f8 ff ff          callq  403300 <_ZN5s3dft6tensorIdEC1ERKmS3_S3_@plt>
  403a93:   bf 88 f3 44 00          mov    $0x44f388,%edi
  403a98:   e8 53 f7 ff ff          callq  4031f0 <__kmpc_global_thread_num@plt>
  403a9d:   89 84 24 d0 00 00 00    mov    %eax,0xd0(%rsp)
  403aa4:   bf c0 f3 44 00          mov    $0x44f3c0,%edi
  403aa9:   33 c0                   xor    %eax,%eax
  403aab:   e8 20 f6 ff ff          callq  4030d0 <__kmpc_ok_to_fork@plt>
  403ab0:   85 c0                   test   %eax,%eax
  403ab2:   74 21                   je     403ad5 <_Z12do_timed_runRKmRd+0xb5>
  403ab4:   ba a5 3c 40 00          mov    $0x403ca5,%edx
  403ab9:   bf c0 f3 44 00          mov    $0x44f3c0,%edi
  403abe:   be 02 00 00 00          mov    $0x2,%esi
  403ac3:   48 8d 4c 24 08          lea    0x8(%rsp),%rcx
  403ac8:   33 c0                   xor    %eax,%eax
  403aca:   4c 8d 41 38             lea    0x38(%rcx),%r8
  403ace:   e8 cd f5 ff ff          callq  4030a0 <__kmpc_fork_call@plt>
  403ad3:   eb 41                   jmp    403b16 <_Z12do_timed_runRKmRd+0xf6>
  403ad5:   bf c0 f3 44 00          mov    $0x44f3c0,%edi
  403ada:   33 c0                   xor    %eax,%eax
  403adc:   8b b4 24 d0 00 00 00    mov    0xd0(%rsp),%esi
  403ae3:   e8 58 f7 ff ff          callq  403240 <__kmpc_serialized_parallel@plt>
  403ae8:   be 9c 13 47 00          mov    $0x47139c,%esi
  403aed:   48 8d bc 24 d0 00 00    lea    0xd0(%rsp),%rdi
  403af4:   00 
  403af5:   48 8d 54 24 08          lea    0x8(%rsp),%rdx
  403afa:   48 8d 4a 38             lea    0x38(%rdx),%rcx
  403afe:   e8 a2 01 00 00          callq  403ca5 <_Z12do_timed_runRKmRd+0x285>
  403b03:   bf c0 f3 44 00          mov    $0x44f3c0,%edi
  403b08:   33 c0                   xor    %eax,%eax
  403b0a:   8b b4 24 d0 00 00 00    mov    0xd0(%rsp),%esi
  403b11:   e8 aa f7 ff ff          callq  4032c0 <__kmpc_end_serialized_parallel@plt>
  403b16:   e8 85 f6 ff ff          callq  4031a0 <omp_get_wtime@plt>
  403b1b:   c5 fb 11 04 24          vmovsd %xmm0,(%rsp)
  403b20:   bf f8 f3 44 00          mov    $0x44f3f8,%edi
  403b25:   33 c0                   xor    %eax,%eax
  403b27:   e8 a4 f5 ff ff          callq  4030d0 <__kmpc_ok_to_fork@plt>
  403b2c:   85 c0                   test   %eax,%eax
  403b2e:   74 25                   je     403b55 <_Z12do_timed_runRKmRd+0x135>
  403b30:   ba 0b 3c 40 00          mov    $0x403c0b,%edx
  403b35:   bf f8 f3 44 00          mov    $0x44f3f8,%edi
  403b3a:   be 03 00 00 00          mov    $0x3,%esi
  403b3f:   48 8d 4c 24 08          lea    0x8(%rsp),%rcx
  403b44:   33 c0                   xor    %eax,%eax
  403b46:   4c 8d 41 38             lea    0x38(%rcx),%r8
  403b4a:   4c 8d 49 70             lea    0x70(%rcx),%r9
  403b4e:   e8 4d f5 ff ff          callq  4030a0 <__kmpc_fork_call@plt>
  403b53:   eb 45                   jmp    403b9a <_Z12do_timed_runRKmRd+0x17a>
  403b55:   bf f8 f3 44 00          mov    $0x44f3f8,%edi
  403b5a:   33 c0                   xor    %eax,%eax
  403b5c:   8b b4 24 d0 00 00 00    mov    0xd0(%rsp),%esi
  403b63:   e8 d8 f6 ff ff          callq  403240 <__kmpc_serialized_parallel@plt>
  403b68:   be a0 13 47 00          mov    $0x4713a0,%esi
  403b6d:   48 8d bc 24 d0 00 00    lea    0xd0(%rsp),%rdi
  403b74:   00 
  403b75:   48 8d 54 24 08          lea    0x8(%rsp),%rdx
  403b7a:   48 8d 4a 38             lea    0x38(%rdx),%rcx
  403b7e:   4c 8d 42 70             lea    0x70(%rdx),%r8
  403b82:   e8 84 00 00 00          callq  403c0b <_Z12do_timed_runRKmRd+0x1eb>
  403b87:   bf f8 f3 44 00          mov    $0x44f3f8,%edi
  403b8c:   33 c0                   xor    %eax,%eax
  403b8e:   8b b4 24 d0 00 00 00    mov    0xd0(%rsp),%esi
  403b95:   e8 26 f7 ff ff          callq  4032c0 <__kmpc_end_serialized_parallel@plt>
  403b9a:   e8 01 f6 ff ff          callq  4031a0 <omp_get_wtime@plt>
  403b9f:   c5 fb 5c 0c 24          vsubsd (%rsp),%xmm0,%xmm1
  403ba4:   c5 fb 10 05 cc c4 01    vmovsd 0x1c4cc(%rip),%xmm0        # 420078 <alpha_beta.61562.0.0.28+0x28>
  403bab:   00 
  403bac:   48 8d 7c 24 40          lea    0x40(%rsp),%rdi
  403bb1:   c4 e2 f9 a9 0b          vfmadd213sd (%rbx),%xmm0,%xmm1
  403bb6:   c5 fb 11 0b             vmovsd %xmm1,(%rbx)
  403bba:   e8 71 f5 ff ff          callq  403130 <_ZN5s3dft9data_packIdED1Ev@plt>
  403bbf:   48 8d 7c 24 08          lea    0x8(%rsp),%rdi
  403bc4:   e8 67 f5 ff ff          callq  403130 <_ZN5s3dft9data_packIdED1Ev@plt>
  403bc9:   48 8d 7c 24 78          lea    0x78(%rsp),%rdi
  403bce:   e8 5d f5 ff ff          callq  403130 <_ZN5s3dft9data_packIdED1Ev@plt>
  403bd3:   4c 8b bc 24 a0 00 00    mov    0xa0(%rsp),%r15
  403bda:   00 
  403bdb:   4c 8b b4 24 a8 00 00    mov    0xa8(%rsp),%r14
  403be2:   00 
  403be3:   4c 8b ac 24 b0 00 00    mov    0xb0(%rsp),%r13
  403bea:   00 
  403beb:   4c 8b a4 24 b8 00 00    mov    0xb8(%rsp),%r12
  403bf2:   00 
  403bf3:   48 8b 9c 24 c0 00 00    mov    0xc0(%rsp),%rbx
  403bfa:   00 
  403bfb:   48 8b ac 24 c8 00 00    mov    0xc8(%rsp),%rbp
  403c02:   00 
  403c03:   48 81 c4 d8 00 00 00    add    $0xd8,%rsp
  403c0a:   c3                      retq   
  403c0b:   48 81 ec d8 00 00 00    sub    $0xd8,%rsp
  403c12:   4c 89 c6                mov    %r8,%rsi
  403c15:   4c 89 a4 24 b8 00 00    mov    %r12,0xb8(%rsp)
  403c1c:   00 
  403c1d:   4c 8d 24 24             lea    (%rsp),%r12
  403c21:   4c 89 e7                mov    %r12,%rdi
  403c24:   48 89 ac 24 c8 00 00    mov    %rbp,0xc8(%rsp)
  403c2b:   00 
  403c2c:   48 89 cd                mov    %rcx,%rbp
  403c2f:   48 89 9c 24 c0 00 00    mov    %rbx,0xc0(%rsp)
  403c36:   00 
  403c37:   48 89 d3                mov    %rdx,%rbx
  403c3a:   4c 89 bc 24 a0 00 00    mov    %r15,0xa0(%rsp)
  403c41:   00 
  403c42:   4c 89 b4 24 a8 00 00    mov    %r14,0xa8(%rsp)
  403c49:   00 
  403c4a:   4c 89 ac 24 b0 00 00    mov    %r13,0xb0(%rsp)
  403c51:   00 
  403c52:   e8 49 03 00 00          callq  403fa0 <_ZN5s3dft6matrixIdEC1ERKS1_> # <--- Here starts the part with the function call...
  403c57:   4c 89 e7                mov    %r12,%rdi
  403c5a:   48 89 de                mov    %rbx,%rsi
  403c5d:   48 89 ea                mov    %rbp,%rdx
  403c60:   e8 8b 01 00 00          callq  403df0 <_Z25transp_matrix_tensor_multIdEvRKN5s3dft6matrixIT_EERKNS0_6tensorIS2_EERS7_>
  403c65:   4c 89 e7                mov    %r12,%rdi
  403c68:   e8 63 01 00 00          callq  403dd0 <_ZN5s3dft6matrixIdED1Ev>     # <--- ...and here it ends
  403c6d:   4c 8b bc 24 a0 00 00    mov    0xa0(%rsp),%r15
  403c74:   00 
  403c75:   4c 8b b4 24 a8 00 00    mov    0xa8(%rsp),%r14
  403c7c:   00 
  403c7d:   4c 8b ac 24 b0 00 00    mov    0xb0(%rsp),%r13
  403c84:   00 
  403c85:   4c 8b a4 24 b8 00 00    mov    0xb8(%rsp),%r12
  403c8c:   00 
  403c8d:   48 8b 9c 24 c0 00 00    mov    0xc0(%rsp),%rbx
  403c94:   00 
  403c95:   48 8b ac 24 c8 00 00    mov    0xc8(%rsp),%rbp
  403c9c:   00 
  403c9d:   48 81 c4 d8 00 00 00    add    $0xd8,%rsp
  403ca4:   c3                      retq   
  403ca5:   48 81 ec d8 00 00 00    sub    $0xd8,%rsp
  403cac:   48 89 d7                mov    %rdx,%rdi
  403caf:   48 89 ac 24 c8 00 00    mov    %rbp,0xc8(%rsp)
  403cb6:   00 
  403cb7:   48 89 9c 24 c0 00 00    mov    %rbx,0xc0(%rsp)
  403cbe:   00 
  403cbf:   48 89 cb                mov    %rcx,%rbx
  403cc2:   4c 89 bc 24 a0 00 00    mov    %r15,0xa0(%rsp)
  403cc9:   00 
  403cca:   4c 89 b4 24 a8 00 00    mov    %r14,0xa8(%rsp)
  403cd1:   00 
  403cd2:   4c 89 ac 24 b0 00 00    mov    %r13,0xb0(%rsp)
  403cd9:   00 
  403cda:   4c 89 a4 24 b8 00 00    mov    %r12,0xb8(%rsp)
  403ce1:   00 
  403ce2:   e8 99 f4 ff ff          callq  403180 <_Z15set_first_touchIdEvRN5s3dft6tensorIT_EE@plt> # <--- here are the calls to set-first-touch
  403ce7:   48 89 df                mov    %rbx,%rdi
  403cea:   e8 91 f4 ff ff          callq  403180 <_Z15set_first_touchIdEvRN5s3dft6tensorIT_EE@plt>
  403cef:   4c 8b bc 24 a0 00 00    mov    0xa0(%rsp),%r15
  403cf6:   00 
  403cf7:   4c 8b b4 24 a8 00 00    mov    0xa8(%rsp),%r14
  403cfe:   00 
  403cff:   4c 8b ac 24 b0 00 00    mov    0xb0(%rsp),%r13
  403d06:   00 
  403d07:   4c 8b a4 24 b8 00 00    mov    0xb8(%rsp),%r12
  403d0e:   00 
  403d0f:   48 8b 9c 24 c0 00 00    mov    0xc0(%rsp),%rbx
  403d16:   00 
  403d17:   48 8b ac 24 c8 00 00    mov    0xc8(%rsp),%rbp
  403d1e:   00 
  403d1f:   48 81 c4 d8 00 00 00    add    $0xd8,%rsp
  403d26:   c3                      retq   
  403d27:   48 89 04 24             mov    %rax,(%rsp)
  403d2b:   bf 30 f4 44 00          mov    $0x44f430,%edi
  403d30:   e8 bb f4 ff ff          callq  4031f0 <__kmpc_global_thread_num@plt>
  403d35:   89 84 24 d0 00 00 00    mov    %eax,0xd0(%rsp)
  403d3c:   48 8d 7c 24 40          lea    0x40(%rsp),%rdi
  403d41:   e8 9a 00 00 00          callq  403de0 <_ZN5s3dft6tensorIdED1Ev>
  403d46:   48 8d 7c 24 08          lea    0x8(%rsp),%rdi
  403d4b:   e8 90 00 00 00          callq  403de0 <_ZN5s3dft6tensorIdED1Ev>
  403d50:   48 8d 7c 24 78          lea    0x78(%rsp),%rdi
  403d55:   e8 76 00 00 00          callq  403dd0 <_ZN5s3dft6matrixIdED1Ev>
  403d5a:   48 8b 3c 24             mov    (%rsp),%rdi
  403d5e:   e8 5d f3 ff ff          callq  4030c0 <_Unwind_Resume@plt>
  403d63:   48 89 04 24             mov    %rax,(%rsp)
  403d67:   bf 68 f4 44 00          mov    $0x44f468,%edi
  403d6c:   e8 7f f4 ff ff          callq  4031f0 <__kmpc_global_thread_num@plt>
  403d71:   89 84 24 d0 00 00 00    mov    %eax,0xd0(%rsp)
  403d78:   eb cc                   jmp    403d46 <_Z12do_timed_runRKmRd+0x326>
  403d7a:   48 89 04 24             mov    %rax,(%rsp)
  403d7e:   bf a0 f4 44 00          mov    $0x44f4a0,%edi
  403d83:   e8 68 f4 ff ff          callq  4031f0 <__kmpc_global_thread_num@plt>
  403d88:   89 84 24 d0 00 00 00    mov    %eax,0xd0(%rsp)
  403d8f:   eb bf                   jmp    403d50 <_Z12do_timed_runRKmRd+0x330>
  403d91:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
  403d98:   00 
  403d99:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)

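Primary question

Given the disassembly above, is it correct to assume that the function is called outside the timed region?

  • If the above is correct, why is that happening?
  • If the above is not correct, how can I find out why my benchmark is faulty?

Secondary questions

  1. Why are there unconditional jumps in the code (at 403ad3, 403b53, 403d78 and 403d8f)?
  2. Why are there 3 retq instances in the same function when there is only one return path (at 403c0a, 403ca4 and 403d26)?

Please bear in mind that I have only provided the information I consider relevant. I will happily provide more on request. Thanks in advance for your time.

Edit

@PeterCordes, I did build with debug symbols enabled. The assembly posted above was obtained with objdump, which somehow did not retrieve the desired symbols. Below is a snippet of the assembly obtained using icpc: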

#       omp_get_wtime()
        call      omp_get_wtime                                 #122.23
..___tag_value__Z12do_timed_runRKmRd.267:
..LN419:
                                # LOE rbx xmm0
..B4.12:                        # Preds ..B4.11
                                # Execution count [1.00e+00]
..LN420:
        vmovsd    %xmm0, (%rsp)                                 #122.23[spill]
..LN421:
                                # LOE rbx
..B4.13:                        # Preds ..B4.12
                                # Execution count [1.00e+00]
..LN422:
    .loc    1  123  is_stmt 1
        movl      $.2.40_2_kmpc_loc_struct_pack.65, %edi        #123.5
..LN423:
        xorl      %eax, %eax                                    #123.5
..___tag_value__Z12do_timed_runRKmRd.269:
..LN424:
        call      __kmpc_ok_to_fork                             #123.5
..___tag_value__Z12do_timed_runRKmRd.270:
..LN425:
                                # LOE rbx eax
..B4.14:                        # Preds ..B4.13
                                # Execution count [1.00e+00]
..LN426:
        testl     %eax, %eax                                    #123.5
..LN427:
        je        ..B4.17       # Prob 50%                      #123.5
..LN428:
                                # LOE rbx
..B4.15:                        # Preds ..B4.14
                                # Execution count [0.00e+00]
..LN429:
        movl      $.2.40_2_kmpc_loc_struct_pack.65, %edi        #123.5
..LN430:
        xorl      %edx, %edx                                    #123.5
..LN431:
        incq      %rdx                                          #123.5
..LN432:
        xorl      %eax, %eax                                    #123.5
..LN433:
        movl      208(%rsp), %esi                               #123.5
..___tag_value__Z12do_timed_runRKmRd.271:
..LN434:
        call      __kmpc_push_num_threads                       #123.5
..___tag_value__Z12do_timed_runRKmRd.272:
..LN435:
                                # LOE rbx
..B4.16:                        # Preds ..B4.15
                                # Execution count [0.00e+00]
..LN436:
        movl      $L__Z12do_timed_runRKmRd_123__par_region1_2.5, %edx #123.5
..LN437:
        movl      $.2.40_2_kmpc_loc_struct_pack.65, %edi        #123.5
..LN438:
        movl      $3, %esi                                      #123.5
..LN439:
        lea       8(%rsp), %rcx                                 #123.5
..LN440:
        xorl      %eax, %eax                                    #123.5
..LN441:
        lea       56(%rcx), %r8                                 #123.5
..LN442:
        lea       112(%rcx), %r9                                #123.5
..___tag_value__Z12do_timed_runRKmRd.273:
..LN443:
        call      __kmpc_fork_call                              #123.5
..___tag_value__Z12do_timed_runRKmRd.274:
..LN444:
        jmp       ..B4.20       # Prob 100%                     #123.5
..LN445:
                                # LOE rbx
..B4.17:                        # Preds ..B4.14
                                # Execution count [0.00e+00]
..LN446:
        movl      $.2.40_2_kmpc_loc_struct_pack.65, %edi        #123.5
..LN447:
        xorl      %eax, %eax                                    #123.5
..LN448:
        movl      208(%rsp), %esi                               #123.5
..___tag_value__Z12do_timed_runRKmRd.275:
..LN449:
        call      __kmpc_serialized_parallel                    #123.5
..___tag_value__Z12do_timed_runRKmRd.276:
..LN450:
                                # LOE rbx
..B4.18:                        # Preds ..B4.17
                                # Execution count [0.00e+00]
..LN451:
        movl      $___kmpv_zero_Z12do_timed_runRKmRd_1, %esi    #123.5
..LN452:
        lea       208(%rsp), %rdi                               #123.5
..LN453:
        lea       8(%rsp), %rdx                                 #123.5
..LN454:
        lea       56(%rdx), %rcx                                #123.5
..LN455:
        lea       112(%rdx), %r8                                #123.5
..___tag_value__Z12do_timed_runRKmRd.277:
..LN456:
        call      L__Z12do_timed_runRKmRd_123__par_region1_2.5  #123.5
..___tag_value__Z12do_timed_runRKmRd.278:
..LN457:
                                # LOE rbx
..B4.19:                        # Preds ..B4.18
                                # Execution count [0.00e+00]
..LN458:
        movl      $.2.40_2_kmpc_loc_struct_pack.65, %edi        #123.5
..LN459:
        xorl      %eax, %eax                                    #123.5
..LN460:
        movl      208(%rsp), %esi                               #123.5
..___tag_value__Z12do_timed_runRKmRd.279:
..LN461:
        call      __kmpc_end_serialized_parallel                #123.5
..___tag_value__Z12do_timed_runRKmRd.280:
..LN462:
                                # LOE rbx
..B4.20:                        # Preds ..B4.16 ..B4.19
                                # Execution count [1.00e+00]
..___tag_value__Z12do_timed_runRKmRd.281:
..LN463:
    .loc    1  128  is_stmt 1
#       omp_get_wtime()
        call      omp_get_wtime                                 #128.23

As you can see, the output is very verbose and hard to read.


1 Answer

Stack Overflow user

Accepted answer

Answered 2022-03-25 19:33:27

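One FP operation per core clock cycle would be pathetic for a modern superscalar CPU. Your Skylake-derived CPU can actually do 2x 4-wide SIMD double-precision FMA operations per core per clock, and each FMA counts as two FLOPs, so the theoretical maximum is 16 double-precision FLOPs per core clock, giving 24 * 16 = 384 GFLOP/s (using vectors of 4 doubles, i.e. 256-bit wide AVX). See FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2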

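There is a function call inside the timed region, callq 403c0b <_Z12do_timed_runRKmRd+0x1eb> (as well as __kmpc_end_serialized_parallel).

There is no symbol associated with that call target, so I guess you did not compile with debug info enabled. (That is separate from the optimization level: for example, gcc -g -O3 -march=native -fopenmp should produce the same asm, just with more debug metadata.) Even a function invented by OpenMP should have a symbol name associated with it at some point.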

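As for benchmark validity, a good litmus test is whether the run time scales reasonably with problem size. Apart from the change when a smaller or larger problem crosses the L3 cache size, the time should change in some reasonable way. If it does not, you should worry about the work being optimized away, or about clock-speed warm-up effects (see Idiomatic way of performance evaluation? for that and more, such as page faults).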

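  1. Why are there unconditional jumps in the code (at 403ad3, 403b53, 403d78 and 403d8f)?

Once you are already inside an if block, you know unconditionally that the else block should not run, so you jmp over it instead of using a jcc (even if FLAGS were still set so that you would not have to test the condition again). Alternatively, the compiler places one block or the other out of line (for example at the end of the function, or before the entry point) and uses a jcc to reach it; that block then jmps back to just after the other side. That way, the fast path is contiguous and has no taken branches.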

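  2. Why are there 3 retq instances in the same function when there is only one return path (at 403c0a, 403ca4 and 403d26)?

The duplicated ret comes from the "tail duplication" optimization, where multiple paths of execution that all return can each get their own ret instead of jumping to a shared ret. (The same goes for copies of any necessary cleanup, such as restoring registers and the stack pointer.)

Votes 3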
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/71618068
