next up previous contents
Next: Empirical Performance Analysis Up: Faster Integral Calculation Previous: General Algorithm

Theoretical Performance Analysis

A common way to measure the performance of integral techniques is to count the number of floating point operations (flops) required. Table 5.1 lists the flop count for each part of the new algorithm. Only those flops in the O(N2) parts of the code are counted; for any large molecule the rest are insignificant. It is also assumed that ${\bf A}\ne {\bf B}$ and that the bra and www-theor are far apart (again this is true for the majority of shell-quartets in any large system).


  
Table 5.1: Flop costs for each part of the (0)(m) construction
Task When Called Cost n=10 cost
Geometric constants Once 33 33
Form ${\bf H}$ Once per m value 5n2-9n-4+2m 406+2m
Generate ${\bf Hv}$ For each unique ${\bf Hv}$ n2 100
Generate ${\bf u}^{t}{\bf Hv}$ For each unique (0)(m) 2n+2 22
Copy duplicate (0)(m)s For each duplicate 1 1

Note that most of the routines above have a cost depending on the length of the expansion, n. Decreasing this length will mean fewer shell-quartets pass the convergence tests, forcing more of the work to be done the traditional way. Thus there is a trade-off here and n=10 has proven to be a good compromise.

Using the above table, and how often each routine is called, a cost per class can be calculated. This cost is usually described in terms of the primitive, half-contracted and contracted contributions,
\begin{align}\mathrm{Cost} = xK^{4} + yK^{2} + z,
\end{align}
where K is the degree of contraction of each of the four shells. Table 5.2 compares the contraction-first method (CO) described above with the fastest operator-first method (OC) of the COLD PRISM for the low momentum integrals.


  
Table 5.2: Comparison of CO and OC flop counts
  CO   OC  
Class x y z   x y z Ktot CO win
(ss|ss) 0 0 561   8 0 0 12
(ps|ss) 0 0 1235   14 8 0 21
(ds|ss) 0 0 2121   22 24 0 31
(pp|ss) 0 0 2189   22 30 0 31
(ps|ps) 0 0 2368   26 28 0 32
(dp|ss) 0 0 3514   32 76 0 40
(ds|ps) 0 0 4019   42 76 0 41
(pp|ps) 0 0 4228   42 94 0 42
(dd|ss) 0 0 5554   44 176 0 48
(ds|ds) 0 0 7729   82 198 0 50
(dp|ps) 0 0 7035   62 224 0 50
(ds|pp) 0 0 8245   82 242 0 51
(pp|pp) 0 0 9493   94 296 0 52
(dp|dp) 0 0 32578   254 1482 0 63
(dd|dd) 0 0 109201   586 5962 0 67

In addition to the numbers presented in Table 5.2, the OC calculations require 1 square root and 2 divides as x-type work and CO needs 1 square root and 1 divide as z-type work. By making reasonable assumptions for the flop cost of these functions, the point at which CO becomes faster than OC can be estimated. These cross-over points are listed in the final column. The momentumless class shows the lowest CO/OC crossover. Here CO is faster whenever $K \ge 2$, making the new path extremely useful. The CO costs do grow rather fast with momentum though, indicating that it will only rarely be required for classes with a total momentum greater than 4. An exception may be calculations with transition metals and heavy main-group elements, where highly contracted d functions are used.

While flop-counts are a good guide to the performance of an algorithm, they are not the whole story. The number of memory operations, for example, is also an important consideration. Use of the computer's architecture is also relevant. For these reasons it is always important to perform a timings analysis.


next up previous contents
Next: Empirical Performance Analysis Up: Faster Integral Calculation Previous: General Algorithm
Ross D. Adamson
1999-01-27