Higher simulation time in NEURON 9 compared to NEURON 8.2.4 #2787

Open · SteMasoli opened this issue Mar 19, 2024 · 15 comments

@SteMasoli

Context

Testing NEURON 9 on a recent CPU shows longer simulation time compared to NEURON 8.2.4

Overview of the issue

Brand new installation of Ubuntu 23.10 on a 16-core Zen 4 AMD processor.
With the same model, same code, same number of cores, etc., NEURON 9 runs the simulation in 90 s whereas NEURON 8.2.4 runs it in 37 s.

Expected result/behavior

Similar or better performance changing from NEURON 8 to NEURON 9.

NEURON setup

  • Version: VERSION 9.0a-176-g17be061e0 master (17be061) 2024-03-13 and NEURON 8.2.4

  • Installation method: CMake from source in both cases with:
    -DCMAKE_BUILD_TYPE=RelWithDebInfo
    -DNRN_ENABLE_INTERVIEWS=ON
    -DNRN_ENABLE_MPI=ON
    -DNRN_ENABLE_PYTHON=ON
    -DNRN_ENABLE_RX3D:BOOL=ON
    -DCORENRN_ENABLE_NMODL=OFF
    -DNRN_ENABLE_CORENEURON=OFF

  • OS + Version: Ubuntu 23.10 latest upgrades

  • Compiler + Version: gcc version 13.2.0 (Ubuntu 13.2.0-4ubuntu3)

Minimal working example - MWE

https://modeldb.science/266806?tab=1
Use the first model "Morphology_1".
In protocols/01_SS.py, set p.change_nthread(12, 1) and h.tstop = 5000, then:
nrnivmodl mod_files
time nrniv -python protocols/01_SS.py
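
A consolidated run of those steps, as a sketch (it assumes the ModelDB archive above has been unpacked and protocols/01_SS.py already contains the edits described):

cd Morphology_1
# build the mechanisms against the NEURON version under test
nrnivmodl mod_files
# time the protocol (p.change_nthread(12, 1), h.tstop = 5000)
time nrniv -python protocols/01_SS.py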


@SteMasoli SteMasoli added the bug label Mar 19, 2024
@pramodk pramodk added this to the Release v9.0 milestone Mar 19, 2024
@nrnhines nrnhines self-assigned this Mar 19, 2024
@nrnhines
Member

Thanks for the notice. I'm seeing pretty much the same thing. It will need to be looked into in greater detail and resolved before the v9 release. On my machine (also 16 cores), but with tstop = 1000:

nthread      v8.2.4        master
  1             46s          87s
  2             21s          47s
  4             17s          29s
  8             12s          24s
 16             10s          21s

@mgeplf
Collaborator

mgeplf commented Mar 27, 2024

If I did it right, a bisect seems to point to 6c529a4:

2c3a21d89: 49.67user 2.12system 0:11.58elapsed 447%CPU (0avgtext+0avgdata 67344maxresident)k
a3bbaab7c: 50.71user 2.00system 0:12.08elapsed 436%CPU (0avgtext+0avgdata 93484maxresident)k
936c7d3ae: 85.93user 1.98system 0:20.43elapsed 430%CPU (0avgtext+0avgdata 80288maxresident)k
522c86681: 46.77user 1.94system 0:10.98elapsed 443%CPU (0avgtext+0avgdata 93448maxresident)k
67a672a2c: 46.29user 2.03system 0:11.16elapsed 432%CPU (0avgtext+0avgdata 93764maxresident)k
baf3b145a: 85.80user 1.98system 0:20.13elapsed 435%CPU (0avgtext+0avgdata 80360maxresident)k
fb1712738: 85.58user 2.00system 0:20.18elapsed 434%CPU (0avgtext+0avgdata 80544maxresident)k
dfdee8442: 46.50user 2.15system 0:11.24elapsed 432%CPU (0avgtext+0avgdata 93196maxresident)k
d2d811040: 46.34user 2.15system 0:11.41elapsed 424%CPU (0avgtext+0avgdata 93560maxresident)k
6c529a415: 86.48user 2.06system 0:20.35elapsed 434%CPU (0avgtext+0avgdata 80244maxresident)k
b4ade5509: 46.42user 2.06system 0:11.39elapsed 425%CPU (0avgtext+0avgdata 93228maxresident)k

The good ones are < 50 s of user time, and the ones above that I marked bad.
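
For reference, a bisect like this can be automated with git bisect run; the helper script below is only a sketch (the script name, build directory, and the 50 s user-time threshold are illustrative, and it assumes nrniv/nrnivmodl from the freshly installed build are on PATH):

#!/bin/sh
# hypothetical bisect-timing.sh: exit 0 = good (fast), 1 = bad (slow), 125 = skip this commit
cmake --build build -j && cmake --install build || exit 125
cd Morphology_1 || exit 125
nrnivmodl mod_files || exit 125
# GNU time writes the user time (%U) to stderr; capture its last line
user_time=$( { /usr/bin/time -f "%U" nrniv -python protocols/01_SS.py > /dev/null; } 2>&1 | tail -n 1 )
# anything under ~50 s of user time counts as "good"
awk -v t="$user_time" 'BEGIN { exit !(t + 0 < 50) }'

Then git bisect start <bad> <good> followed by git bisect run ./bisect-timing.sh; an exit code of 125 makes git bisect skip commits that fail to build.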

@ramcdougal
Member

Given that (#2027) was a mega-commit altering 208 files, I wonder if it's possible to bisect on the (presumably deleted) branch that created it?

@nrnhines
Member

nrnhines commented Mar 27, 2024

Edited after fixing the Caliper region for state-update in version 8.2

Thanks! That is a great hint. Meanwhile, I tried Caliper to see where the performance differs most. Here is what seems to me to be the relevant selection:

master
fadvance                0.242032 78.523456   0.304608  98.825198 
  timestep              0.476875 78.281423   0.600168  98.520590 
    deliver-events      0.210294  0.285367   0.264664   0.359147 
      check-threshold   0.075073  0.075073   0.094483   0.094483 
    setup-tree-matrix   2.034998  5.770918   2.561134   7.262953 
...
    matrix-solver       0.126477  0.126477   0.159177   0.159177 
    second-order-cur    0.069821  0.069821   0.087873   0.087873 
    update              0.083985  0.083985   0.105698   0.105698 
    state-update        0.917186 71.467980   1.154318  89.945574 
      state-cdp5StCmod 49.011724 49.011724  61.683394  61.683394 
      state-Cav2_3      0.225994  0.225994   0.284424   0.284424 
      state-Cav3_1      0.353412  0.353412   0.444785   0.444785 
      state-GRC_CA      0.997766  0.997766   1.255732   1.255732 
      state-GRC_KM      0.106439  0.106439   0.133958   0.133958 
      state-HCN1        0.080375  0.080375   0.101155   0.101155 
      state-HCN2        0.078690  0.078690   0.099035   0.099035 
      state-Kca1_1      4.351285  4.351285   5.476282   5.476282 
      state-Kca2_2      1.352222  1.352222   1.701831   1.701831 
      state-Kca3_1      0.074793  0.074793   0.094131   0.094131 
      state-Kv1_1       0.074700  0.074700   0.094013   0.094013 
      state-Kv3_4       0.242558  0.242558   0.305269   0.305269 
      state-Kv4_3       0.089487  0.089487   0.112623   0.112623 
      state-Leak        0.070016  0.070016   0.088118   0.088118 
      state-Nav1_6     13.441334 13.441334  16.916505  16.916505 

real	1m21.576s
8.2
fadvance                0.132501 50.456262   0.259746  98.911012 
  timestep              0.470033 50.323761   0.921421  98.651267 
    deliver-events      0.208092  0.278673   0.407930   0.546292 
      check-threshold   0.070581  0.070581   0.138362   0.138362 
    setup-tree-matrix   2.143723  5.734394   4.202409  11.241314 
...
    matrix-solver       0.169878  0.169878   0.333018   0.333018 
    second-order-cur    0.070763  0.070763   0.138719   0.138719 
    update              0.109914  0.109914   0.215469   0.215469 
    state-update        0.906806 43.490105   1.777640  85.255034 
      state-cdp5StCmod 24.232049 24.232049  47.502855  47.502855 
      state-Cav2_3      0.229292  0.229292   0.449488   0.449488 
      state-Cav3_1      0.348932  0.348932   0.684023   0.684023 
      state-GRC_CA      0.518979  0.518979   1.017370   1.017370 
      state-GRC_KM      0.093657  0.093657   0.183599   0.183599 
      state-HCN1        0.078251  0.078251   0.153397   0.153397 
      state-HCN2        0.077570  0.077570   0.152064   0.152064 
      state-Kca1_1      4.129362  4.129362   8.094920   8.094920 
      state-Kca2_2      1.515642  1.515642   2.971161   2.971161 
      state-Kca3_1      0.072181  0.072181   0.141499   0.141499 
      state-Kv1_1       0.072002  0.072002   0.141148   0.141148 
      state-Kv3_4       0.235797  0.235797   0.462241   0.462241 
      state-Kv4_3       0.078983  0.078983   0.154832   0.154832 
      state-Leak        0.068475  0.068475   0.134233   0.134233 
      state-Nav1_6     10.832127 10.832127  21.234563  21.234563 

real	0m52.622s

I interpret, for example, fadvance 0.242032 78.523456 ... to mean that the exclusive time within fadvance was 0.24 s, while the time between entry and exit was 78.5 s. The last two numbers on each line are supposed to be percentages.
From this perspective, the largest master lines

    state-update        0.917186 71.467980   1.154318  89.945574 
      state-cdp5StCmod 49.011724 49.011724  61.683394  61.683394 
      state-Kca1_1      4.351285  4.351285   5.476282   5.476282
      state-Nav1_6     13.441334 13.441334  16.916505  16.916505

suggest for example that state integration is dominated by cdp5StCmod.
The equivalent lines for 8.2 are

    state-update        0.906806 43.490105   1.777640  85.255034 
      state-cdp5StCmod 24.232049 24.232049  47.502855  47.502855 
      state-Kca1_1      4.129362  4.129362   8.094920   8.094920
      state-Nav1_6     10.832127 10.832127  21.234563  21.234563

I conclude that the largest part of the master performance problem is with the KINETIC block of cdp5StCmod and I'll continue by focusing my attention there.
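
For anyone wanting to reproduce the Caliper regions above, a minimal sketch, assuming a profiling-enabled build along the lines of the options described in the NEURON profiling docs (the exact CMake option names may differ between versions, and Caliper must be available at configure time):

# configure NEURON with Caliper instrumentation
cmake .. -DCMAKE_BUILD_TYPE=RelWithDebInfo \
         -DNRN_ENABLE_PROFILING=ON \
         -DNRN_PROFILE_PROJECT=caliper

# ask Caliper for a hierarchical runtime report with inclusive times
CALI_CONFIG=runtime-report,calc.inclusive nrniv -python protocols/01_SS.py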

@nrnhines
Member

@ramcdougal > Given that (#2027) was a mega-commit altering 208 files, I wonder if it's possible to bisect on the (presumably deleted) branch that created it?

I hadn't realized that yet when I wrote my above comment. That being the introduction of SoA and data handles, it may be the case that many of the intermediate commits will not result in a working system.

@nrnhines
Member

Now might be a good time to mention that I'm using a slightly modified version of git@github.com:ModelDBRepository/266806
for 8.2, master, and CoreNEURON. Those mods are:

~/models/modeldb/266806/Morphology_1$ git diff --staged
diff --git a/Morphology_1/mod_files/Hcn1.mod b/Morphology_1/mod_files/Hcn1.mod
index 1df9e10..f985669 100755
--- a/Morphology_1/mod_files/Hcn1.mod
+++ b/Morphology_1/mod_files/Hcn1.mod
@@ -15,7 +15,7 @@ NEURON {
        NONSPECIFIC_CURRENT ih
        RANGE Q10_diff,Q10_channel,gbar_Q10, ic
        RANGE o_fast_inf, o_slow_inf, tau_f, tau_s, Erev
-       RANGE gbar,r,g, o
+       RANGE gbar,g, o
 }       
         
 UNITS {
diff --git a/Morphology_1/mod_files/cdp5StCmod.mod b/Morphology_1/mod_files/cdp5StCmod.mod
index bbdd811..9be5b05 100755
--- a/Morphology_1/mod_files/cdp5StCmod.mod
+++ b/Morphology_1/mod_files/cdp5StCmod.mod
@@ -123,9 +123,10 @@ ASSIGNED {
        mgi     (mM)
        vrat    (1)     
         icazz (nA)
+        cao (mM)
 }
 
-CONSTANT { cao = 2     (mM) }
+:CONSTANT { cao = 2    (mM) }
 
 STATE {
        : ca[0] is equivalent to cai
@@ -246,10 +247,11 @@ PROCEDURE factors() {
 }
 
 
-LOCAL dsq, dsqvol  : can't define local variable in KINETIC block
+: LOCAL dsq, dsqvol  : can't define local variable in KINETIC block
                    :   or use in COMPARTMENT statement
 
 KINETIC state {
+  LOCAL dsq, dsqvol
   COMPARTMENT diam*diam*vrat {ca mg Buff1 Buff1_ca Buff2 Buff2_ca BTC BTC_ca DMNPE DMNPE_ca PV PV_ca PV_mg}
   COMPARTMENT (1e10)*parea {pump pumpca}

and the temporary changes (there is currently an array bug in NMODL)

~/models/modeldb/266806/Morphology_1$ git diff
diff --git a/Morphology_1/mod_files/Cav2_3.mod b/Morphology_1/mod_files/Cav2_3.mod
index 0d10ca1..db53b82 100755
--- a/Morphology_1/mod_files/Cav2_3.mod
+++ b/Morphology_1/mod_files/Cav2_3.mod
@@ -10,7 +10,7 @@ NEURON {
          THREADSAFE
          USEION ca READ eca WRITE ica
          RANGE gcabar, m, h, g, gmax, ica
-         RANGE inf, tau
+         RANGE inf0, inf1, tau0, tau1
 }
 
 UNITS {
@@ -29,8 +29,10 @@ STATE {      m h }            : unknown activation and inactivation parameters to be
 
 ASSIGNED {               : parameters needed to solve DE
          ica    (mA/cm2)
-    inf[2]
-         tau[2] (ms)
+    inf0
+         tau0 (ms)
+    inf1
+         tau1 (ms)
     g      (mho/cm2)
     gmax   (mho/cm2)
 }
@@ -46,8 +48,8 @@ BREAKPOINT {
 
 INITIAL {
     mhn(v)
-    m = inf[0]
-    h = inf[1]
+    m = inf0
+    h = inf1
     g = gcabar*m*m*m*h
     ica = g*(v - eca) : initial Ca++ current value
     gmax = g
@@ -55,8 +57,8 @@ INITIAL {
 
 DERIVATIVE states {
          mhn(v)
-         m' =  (inf[0] - m)/tau[0]
-         h' =  (inf[1] - h)/tau[1]
+         m' =  (inf0 - m)/tau0
+         h' =  (inf1 - h)/tau1
 }      
 
 FUNCTION varss(v (mV), i) {
@@ -79,11 +81,11 @@ FUNCTION vartau(v (mV), i) (ms) {
 }      
 
 PROCEDURE mhn(v (mV)) {LOCAL a, b :rest = -70
-    TABLE inf, tau DEPEND celsius FROM -100 TO 100 WITH 200
-       FROM i=0 TO 1 {
-             tau[i] = vartau(v,i)
-                   inf[i] = varss(v,i)
-         }
+:    TABLE inf, tau DEPEND celsius FROM -100 TO 100 WITH 200
+             tau0 = vartau(v,0)
+                   inf0 = varss(v,0)
+             tau1 = vartau(v,1)
+                   inf1 = varss(v,1)
 }
 
 
diff --git a/Morphology_1/protocols/01_SS.py b/Morphology_1/protocols/01_SS.py
index 4e5cc4b..ff0e622 100755
--- a/Morphology_1/protocols/01_SS.py
+++ b/Morphology_1/protocols/01_SS.py
@@ -21,9 +21,10 @@ Fixed_step.active(0)
 cpu = multiprocessing.cpu_count()
 h.load_file("parcom.hoc")
 p = h.ParallelComputeTool()
-p.change_nthread(cpu,1)
-p.multisplit(1)
-print(cpu)
+if h.mycpu > 1.0:
+    p.change_nthread(h.mycpu,1)
+    p.multisplit(1)
+print(h.mycpu)
 
 #Voltage graph
 h('load_file("vm.ses")')
@@ -37,13 +38,23 @@ h.celsius = 32
 h.tstop = 1000
 h.v_init = -65
 
+pc = h.ParallelContext()
+pc.set_gid2node(0, 0)
+pc.cell(0, cell.nc_spike)
+pc.set_maxstep(10)
+from neuron import coreneuron
+coreneuron.enable = 0
+
 #Initialization 
 def initialize():
     h.finitialize()
     h.run()
+#    h.stdinit()
+#    pc.psolve(h.tstop)
 
 initialize()
 
+'''
 #Save the results into an image
 fig, ax = plt.subplots()
 ax.plot(np.array(cell.time_vector), np.array(cell.vm), 'b', label='spikes')
@@ -59,4 +70,4 @@ plt.ylabel("membrane voltage (mv) ")
 
 plt.savefig('01_SS_trace.eps')
 plt.close()
-
+'''

@mgeplf
Collaborator

mgeplf commented Mar 28, 2024

@ramcdougal > Given that (#2027) was a mega-commit altering 208 files, I wonder if it's possible to bisect on the (presumably deleted) branch that created it?

I gave it a quick shot, but many of the commits don't even build; one of the only ones I got to build was 1bb5bf1, and it was slow, so that at least partly brackets which of the many commits it was. However, out of the 19 other commits I randomly tried, none of them built without error.

@nrnhines > Those mods are:
I did something similar; I haven't compared the output between the fast commit and the regression, though.

@mgeplf
Collaborator

mgeplf commented Mar 28, 2024

A couple other short notes:

perf stat for:
b4ade5509: (fast commit)

     Performance counter stats for 'python 01_SS.py':

             48,928.26 msec task-clock                #    4.263 CPUs utilized
               631,731      context-switches          #    0.013 M/sec
                    29      cpu-migrations            #    0.001 K/sec
                28,020      page-faults               #    0.573 K/sec
       234,176,288,198      cycles                    #    4.786 GHz                      (31.06%)
       555,016,050,075      instructions              #    2.37  insn per cycle           (38.92%)
        70,189,788,913      branches                  # 1434.545 M/sec                    (38.80%)
            99,889,950      branch-misses             #    0.14% of all branches          (38.56%)
       224,807,810,920      L1-dcache-loads           # 4594.642 M/sec                    (38.56%)
         1,585,023,985      L1-dcache-load-misses     #    0.71% of all L1-dcache hits    (38.43%)
           329,502,640      LLC-loads                 #    6.734 M/sec                    (30.57%)
             1,118,627      LLC-load-misses           #    0.34% of all LL-cache hits     (30.47%)
       <not supported>      L1-icache-loads
           625,499,048      L1-icache-load-misses                                         (30.37%)
       225,842,474,219      dTLB-loads                # 4615.788 M/sec                    (30.51%)
               784,591      dTLB-load-misses          #    0.00% of all dTLB cache hits   (30.62%)
            45,054,175      iTLB-loads                #    0.921 M/sec                    (30.96%)
            28,229,053      iTLB-load-misses          #   62.66% of all iTLB cache hits   (31.07%)
       <not supported>      L1-dcache-prefetches
       <not supported>      L1-dcache-prefetch-misses

          11.476448534 seconds time elapsed

          46.780273000 seconds user
           3.108108000 seconds sys

6c529a415: Regression:

     Performance counter stats for 'python 01_SS.py':

             88,420.11 msec task-clock                #    4.289 CPUs utilized
               700,525      context-switches          #    0.008 M/sec
                    38      cpu-migrations            #    0.000 K/sec
                23,665      page-faults               #    0.268 K/sec
       424,431,007,847      cycles                    #    4.800 GHz                      (30.92%)
     1,157,209,550,388      instructions              #    2.73  insn per cycle           (38.58%)
       168,992,981,295      branches                  # 1911.251 M/sec                    (38.30%)
           112,577,199      branch-misses             #    0.07% of all branches          (38.41%)
       474,884,904,576      L1-dcache-loads           # 5370.779 M/sec                    (38.33%)
         2,880,849,265      L1-dcache-load-misses     #    0.61% of all L1-dcache hits    (38.44%)
           464,556,719      LLC-loads                 #    5.254 M/sec                    (30.96%)
               943,476      LLC-load-misses           #    0.20% of all LL-cache hits     (30.84%)
       <not supported>      L1-icache-loads
           884,842,762      L1-icache-load-misses                                         (30.84%)
       476,144,529,914      dTLB-loads                # 5385.025 M/sec                    (30.63%)
             1,151,626      dTLB-load-misses          #    0.00% of all dTLB cache hits   (30.74%)
            62,166,895      iTLB-loads                #    0.703 M/sec                    (30.76%)
            31,708,346      iTLB-load-misses          #   51.01% of all iTLB cache hits   (30.83%)
       <not supported>      L1-dcache-prefetches
       <not supported>      L1-dcache-prefetch-misses

          20.613380152 seconds time elapsed

          85.971388000 seconds user
           3.464232000 seconds sys

It looks like many more instructions are being executed, rather than more cache misses or branch misses.

perf record / perf report:

b4ade5509: (fast commit)

Overhead  Command  Shared Object                                     Symbol
  23.57%  python   libnrnmech.so                                     [.] _nrn_state__cdp5StCmod
  17.57%  python   libnrnmech.so                                     [.] state__cdp5StCmod
   7.86%  python   libnrnmech.so                                     [.] _nrn_state__Nav1_6
   6.57%  python   libnrnmech.so                                     [.] neuron::scopmath::_nrn_thread_getelm
   6.31%  python   libnrnmech.so                                     [.] activation__Nav1_6
   5.81%  python   libnrnmech.so                                     [.] neuron::scopmath::sparse_thread<double*, int (*)(void*, double*, double*, Datum*, Datum*, NrnThread*), int*, double*&, Datum*&, Datum*&, NrnThread*&>
   5.43%  python   libm-2.31.so                                      [.] __ieee754_pow_fma
   3.41%  python   libm-2.31.so                                      [.] __ieee754_exp_fma
   3.14%  python   libnrnmech.so                                     [.] activation__Kca1_1

6c529a415: Regression:

Overhead  Command  Shared Object                                     Symbol
  18.32%  python   libnrnmech.so                                     [.] state__cdp5StCmod
  15.51%  python   libnrnmech.so                                     [.] _nrn_state__cdp5StCmod
  14.35%  python   libnrnmech.so                                     [.] neuron::scopmath::_nrn_thread_getelm
  11.44%  python   libnrnmech.so                                     [.] _ZNK6neuron9container19generic_data_handlecvNS0_11data_handleIT_EEIdEEv
   4.42%  python   libnrnmech.so                                     [.] _nrn_state__Nav1_6
   3.43%  python   libnrnmech.so                                     [.] activation__Nav1_6
   3.20%  python   libnrnmech.so                                     [.] neuron::container::data_handle<double>::get_ptr_helper<neuron::container::data_handle<double> >
   2.86%  python   libm-2.31.so                                      [.] __ieee754_pow_fma
   2.21%  python   libnrnmech.so                                     [.] _nrn_state__Kca1_1

Since it's a statistical profiler, we'll miss things, but it seems that _ZNK6neuron9container19generic_data_handlecvNS0_11data_handleIT_EEIdEEv (which demangles to a conversion operator from neuron::container::generic_data_handle to data_handle<double>) might be something to look into.
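
For reference, the numbers above came from perf stat and perf record/report; roughly like this (a sketch; the exact event selection and output depend on the CPU):

# hardware-counter summary (cycles, instructions, cache and branch behaviour)
perf stat -d -d python 01_SS.py

# sampled profile, then a per-symbol breakdown
perf record python 01_SS.py
perf report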

@nrnhines
Member

@mgeplf I like that tool (perf?) you are using but I'm not familiar with it. (I was going to use gprof to delve more into the mod file functions, but was wondering whether that itself would change the profile considerably.) Please send a link for me to get started. Your fragments above report % rather than actual time, which seems more indirect. Does one just multiply % by total time?

I see that _nrn_thread_getelm went from 7% to 14%, but that seems too small to account for half the time. But I'm likely misinterpreting.

@mgeplf
Collaborator

mgeplf commented Mar 28, 2024

@nrnhines > @mgeplf I like that tool (perf?)

It's this one: https://en.wikipedia.org/wiki/Perf_(Linux)

It's like a console version of VTune (or whatever it's called now), but is nowhere near as slick. Whenever I turn to it, I use this to remember some useful examples:
https://www.brendangregg.com/perf.html

Unfortunately, it seems the DWARF debug info isn't propagated fully into libnrnmech.so, so the (very useful) --call-graph dwarf option can't be used with perf record; I haven't spent time trying to figure out how to get it to show up.

Does one just multiply % by total time?

That may be a loose heuristic, but the nature of sampling profiling means that it won't be accurate.

I'd probably try valgrind's callgrind if I wanted accurate timings, which would probably illuminate things better. I mainly used perf to check that the hardware counters weren't showing red flags like cache misses or branch mispredictions.

I see that _nrn_thread_getelm went from 7% to 14%

That seems suspicious, too.

@nrnhines
Member

Here is an interesting observation. The gap between 8.2 and master narrows considerably (when running under perf):

57.909s master
47.478s 8.2

merely by introducing an ASSIGNED variable d, setting d = diam in the INITIAL block, and replacing all uses of diam with d in the KINETIC state block of cdp5StCmod.mod.

This change was suggested by a comparison of the 8.2 and master perf results

8.2
# Samples: 195K of event 'cycles:Pu'
# Event count (approx.): 178697138584
#
# Overhead       Samples  Command  Shared Object                                      Symbol                                          >
# ........  ............  .......  .................................................  ................................................>
#
    26.40%         49954  nrniv    libnrniv.so                                        [.] matsol
    17.62%         33393  nrniv    libnrniv.so                                        [.] _nrn_thread_getelm
    14.13%         26805  nrniv    libnrnmech.so                                      [.] state__cdp5StCmod
    11.72%         22198  nrniv    libnrniv.so                                        [.] sparse_thread
     5.55%         10506  nrniv    libnrnmech.so                                      [.] _nrn_thread_getelm@plt
     3.62%          6861  nrniv    libnrnmech.so                                      [.] activation__Nav1_6
     3.07%          5820  nrniv    libm.so.6                                          [.] __ieee754_pow_fma
     2.83%          5357  nrniv    libm.so.6                                          [.] __ieee754_exp_fma
     1.90%          3596  nrniv    libm.so.6                                          [.] exp@@GLIBC_2.29
     1.77%          3368  nrniv    libnrnmech.so                                      [.] activation__Kca1_1
     1.07%          2022  nrniv    libnrnmech.so                                      [.] rates__Nav1_6.constprop.0.isra.0
     1.01%          1915  nrniv    libnrniv.so                                        [.] hoc_Exp
     0.81%          1539  nrniv    libnrnmech.so                                      [.] kin__Kca2_2
     0.62%          1180  nrniv    libm.so.6                                          [.] pow@@GLIBC_2.29

with the master perf result using diam:

master with diam
# Samples: 310K of event 'cycles:Pu'
# Event count (approx.): 305901402865
#
# Overhead       Samples  Command  Shared Object                                      Symbol                                                                                                  >
# ........  ............  .......  .................................................  ........................................................................................................>
#
    31.96%         97514  nrniv    libnrniv.so                                        [.] double* neuron::container::generic_data_handle::get<double*>() const
    15.67%         47797  nrniv    libnrnmech.so                                      [.] neuron::scopmath::detail::sparse_thread::matsol(neuron::scopmath::SparseObj*)
    12.22%         37287  nrniv    libnrnmech.so                                      [.] state__cdp5StCmod(void*, double*, neuron::cache::MechanismRange<65ul, 8ul>*, unsigned long, neuron::>
     9.18%         27980  nrniv    libnrnmech.so                                      [.] neuron::scopmath::_nrn_thread_getelm(neuron::scopmath::SparseObj*, int, int)
     4.95%         15139  nrniv    libnrnmech.so                                      [.] _nrn_state__cdp5StCmod(neuron::model_sorted_token const&, NrnThread*, Memb_list*, int)
     3.39%         10338  nrniv    libnrnmech.so                                      [.] activation__Nav1_6(void*, double*, neuron::cache::MechanismRange<69ul, 3ul>*, unsigned long, neuron:>
     3.03%          9227  nrniv    libnrnmech.so                                      [.] neuron::scopmath::_nrn_thread_getelm(neuron::scopmath::SparseObj*, int, int)@plt
     2.19%          6667  nrniv    libm.so.6                                          [.] __ieee754_pow_fma
     1.66%          5073  nrniv    libm.so.6                                          [.] __ieee754_exp_fma
     1.54%          4699  nrniv    libnrnmech.so                                      [.] _nrn_state__Nav1_6(neuron::model_sorted_token const&, NrnThread*, Memb_list*, int)
     1.32%          4038  nrniv    libnrnmech.so                                      [.] activation__Kca1_1(void*, double*, neuron::cache::MechanismRange<53ul, 5ul>*, unsigned long, neuron:>
     1.15%          3502  nrniv    libnrnmech.so                                      [.] rates__Nav1_6(neuron::cache::MechanismRange<69ul, 3ul>*, unsigned long, neuron::container::generic_d>
     1.11%          3393  nrniv    libnrnmech.so                                      [.] _nrn_state__Kca1_1(neuron::model_sorted_token const&, NrnThread*, Memb_list*, int)
     0.99%          3026  nrniv    libm.so.6                                          [.] exp@@GLIBC_2.29
     0.66%          2025  nrniv    libnrnmech.so                                      [.] double* neuron::container::generic_data_handle::get<double*>() const@plt
     0.57%          1724  nrniv    libnrnmech.so                                      [.] states__GRC_CA(neuron::cache::MechanismRange<30ul, 3ul>*, unsigned long, neuron::container::generic_>
     0.56%          1711  nrniv    libnrniv.so                                        [.] hoc_Exp(double)

Note the top entry, generic_data_handle::get<double*>.
Avoiding that with d = diam in the INITIAL block, the perf results become:

master with d=diam
# Samples: 215K of event 'cycles:Pu'
# Event count (approx.): 209229806356
#
# Overhead       Samples  Command  Shared Object                                      Symbol                                                                                                         >
# ........  ............  .......  .................................................  ...............................................................................................................>
#
    23.10%         48383  nrniv    libnrnmech.so                                      [.] neuron::scopmath::detail::sparse_thread::matsol(neuron::scopmath::SparseObj*)
    18.91%         39595  nrniv    libnrnmech.so                                      [.] state__cdp5StCmod(void*, double*, neuron::cache::MechanismRange<66ul, 8ul>*, unsigned long, neuron::contain>
    13.94%         29208  nrniv    libnrnmech.so                                      [.] neuron::scopmath::_nrn_thread_getelm(neuron::scopmath::SparseObj*, int, int)
     6.72%         14087  nrniv    libnrnmech.so                                      [.] _nrn_state__cdp5StCmod(neuron::model_sorted_token const&, NrnThread*, Memb_list*, int)
     5.02%         10527  nrniv    libnrnmech.so                                      [.] activation__Nav1_6(void*, double*, neuron::cache::MechanismRange<69ul, 3ul>*, unsigned long, neuron::contai>
     4.52%          9464  nrniv    libnrnmech.so                                      [.] neuron::scopmath::_nrn_thread_getelm(neuron::scopmath::SparseObj*, int, int)@plt
     3.31%          6940  nrniv    libm.so.6                                          [.] __ieee754_pow_fma
     2.43%          5103  nrniv    libm.so.6                                          [.] __ieee754_exp_fma
     2.32%          4873  nrniv    libnrnmech.so                                      [.] _nrn_state__Nav1_6(neuron::model_sorted_token const&, NrnThread*, Memb_list*, int)
     1.91%          4000  nrniv    libnrnmech.so                                      [.] activation__Kca1_1(void*, double*, neuron::cache::MechanismRange<53ul, 5ul>*, unsigned long, neuron::contai>
     1.71%          3579  nrniv    libnrnmech.so                                      [.] rates__Nav1_6(neuron::cache::MechanismRange<69ul, 3ul>*, unsigned long, neuron::container::generic_data_han>
     1.45%          3038  nrniv    libm.so.6                                          [.] exp@@GLIBC_2.29

@nrnhines
Member

Just for my reference.
Install perf

sudo apt install linux-tools-common linux-tools-generic linux-tools-`uname -r`

Avoid the error

$ perf record nrniv -c 'quit()'
Error:
Access to performance monitoring and observability operations is limited.
...

with

sudo sysctl kernel.perf_event_paranoid=2

I find the sample counts helpful, so am using

perf report --stdio -n

It would be helpful to collect useful idioms in https://nrn.readthedocs.io/en/latest/install/debug.html#profiling-and-performance-benchmarking

@mgeplf
Collaborator

mgeplf commented Apr 17, 2024

I ran this under valgrind --tool=callgrind and kcachegrind seems to point to neuron::container::generic_data_handle:
[kcachegrind screenshot: 2787-callgrind]
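
For reference, a sketch of that workflow (callgrind names its output callgrind.out.<pid> by default):

# run the protocol under callgrind; expect a large slowdown versus a native run
valgrind --tool=callgrind nrniv -python protocols/01_SS.py

# browse the resulting profile interactively
kcachegrind callgrind.out.*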

@nrnhines
Member

I forked git@github.com:ModelDBRepository/266806 and my changes are in git@github.com:nrnhines/266806 in branch hines/v9

@nrnhines
Member

Another performance data point... I installed Intel oneAPI in the hope of using VTune (I have not yet figured out how to focus on the data handle performance), but here are the results with CC=icx CXX=icpx for master and 8.2. For the model, I'm using the hines/v9 branch mentioned above.

      icx compiler (times in seconds)
nthread   master     v8.2
  1       32.9031    38.4446
  2       19.5026    22.4375
  4       13.5151    14.2496
  8       11.7145     8.35472
 16       11.6525     6.88371
