@@ -56,39 +56,48 @@ julia> sum(big.(x))
# But the standard summation algorithm computes this sum very inaccurately
# (not even the sign is correct)
julia> sum(x)
- -136.0
+ -8.0

# Compensated summation algorithms should compute this more accurately
julia> using AccurateArithmetic

# Algorithm by Ogita, Rump and Oishi
julia> sum_oro(x)
- 0.9999999999999716
+ 1.0000000000000084

# Algorithm by Kahan, Babuska and Neumaier
julia> sum_kbn(x)
- 0.9999999999999716
+ 1.0000000000000084
```

- ![](test/figs/qual.svg)
+ ![](test/figs/sum_accuracy.svg)
+ ![](test/figs/dot_accuracy.svg)

- In the graph above, we see the relative error vary as a function of the
+ In the graphs above, we see the relative error vary as a function of the
condition number, in a log-log scale. Errors lower than ϵ are arbitrarily set to
ϵ; conversely, when the relative error is more than 100% (i.e. no digit is
correctly computed anymore), the error is capped there in order to avoid
- affecting the scale of the graph too much. What we see is that the pairwise
+ affecting the scale of the graph too much. What we see on the left is that the pairwise
summation algorithm (as implemented in Base.sum) starts losing accuracy as soon
as the condition number increases, computing only noise when the condition
- number exceeds 1/ϵ≃10¹⁶. In contrast, both compensated algorithms
+ number exceeds 1/ϵ≃10¹⁶. The same goes for the naive summation algorithm.
+ In contrast, both compensated algorithms
(Kahan-Babuska-Neumaier and Ogita-Rump-Oishi) still accurately compute the
result at this point, and start losing accuracy there, computing meaningless
results when the condition number reaches 1/ϵ²≃10³². In effect, these (simply)
compensated algorithms produce the same results as if a naive summation had been
performed with twice the working precision (128 bits in this case), and then
rounded to 64-bit floats.

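+ To give an idea of how this works, here is a minimal scalar sketch of a
+ Neumaier-style compensated summation (the name `sum_kbn_sketch` is made up for
+ this illustration; the package's actual `sum_kbn` is vectorized and unrolled,
+ but rests on the same idea of carrying an explicit error term):
+
+ ```julia
+ # Illustrative sketch, not the package's implementation: alongside the
+ # running sum `s`, accumulate in `c` the rounding error committed by
+ # each addition, and fold it back in at the end.
+ function sum_kbn_sketch(x)
+     s = zero(eltype(x))
+     c = zero(eltype(x))
+     for xi in x
+         t = s + xi
+         if abs(s) >= abs(xi)
+             c += (s - t) + xi  # exact error of computing s + xi
+         else
+             c += (xi - t) + s
+         end
+         s = t
+     end
+     return s + c
+ end
+ ```
+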
+ The same comments can be made for the dot product implementations shown on the
+ right. Uncompensated algorithms, as implemented in
+ `AccurateArithmetic.dot_naive` or `Base.dot` (which internally calls BLAS in
+ this case), exhibit typical loss of accuracy. In contrast, the implementation
+ of Ogita, Rump & Oishi's compensated algorithm effectively doubles the working
+ precision.
+
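+ For the dot product, compensation additionally needs the exact rounding error
+ of each product, which an FMA provides. Below is a minimal sketch in the spirit
+ of the Ogita-Rump-Oishi algorithm (again, `dot_oro_sketch` is a made-up name,
+ not the package's actual vectorized implementation):
+
+ ```julia
+ # Illustrative sketch: split each product into its rounded value `p` and
+ # exact error `e` (via fma), then feed both into a compensated summation.
+ function dot_oro_sketch(x, y)
+     s = zero(eltype(x))
+     c = zero(eltype(x))
+     for (xi, yi) in zip(x, y)
+         p = xi * yi
+         e = fma(xi, yi, -p)    # exact error of the product
+         t = s + p
+         if abs(s) >= abs(p)
+             c += (s - t) + p   # exact error of the sum
+         else
+             c += (p - t) + s
+         end
+         c += e
+         s = t
+     end
+     return s + c
+ end
+ ```
+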
<br />

Performance-wise, compensated algorithms perform a lot better than alternatives
@@ -97,24 +106,31 @@ such as arbitrary precision (`BigFloat`) or rational arithmetic (`Rational`) :
```julia
julia> using BenchmarkTools

+ julia> length(x)
+ 10001
+
julia> @btime sum($x)
- 1.305 μs (0 allocations: 0 bytes)
- -136.0
+ 1.320 μs (0 allocations: 0 bytes)
+ -8.0
+
+ julia> @btime sum_naive($x)
+ 1.026 μs (0 allocations: 0 bytes)
+ -1.121325337906356

julia> @btime sum_oro($x)
- 3.421 μs (0 allocations: 0 bytes)
- 0.9999999999999716
+ 3.348 μs (0 allocations: 0 bytes)
+ 1.0000000000000084

julia> @btime sum_kbn($x)
- 3.792 μs (0 allocations: 0 bytes)
- 0.9999999999999716
+ 3.870 μs (0 allocations: 0 bytes)
+ 1.0000000000000084

- julia> @btime sum(big.($x))
- 874.203 μs (20006 allocations: 1.14 MiB)
+ julia> @btime sum($(big.(x)))
+ 437.495 μs (2 allocations: 112 bytes)
1.0

- julia> @btime sum(Rational{BigInt}.(x))
- 22.702 ms (582591 allocations: 10.87 MiB)
+ julia> @btime sum($(Rational{BigInt}.(x)))
+ 10.894 ms (259917 allocations: 4.76 MiB)
1//1
```

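+ A side note on methodology, assuming the usual BenchmarkTools semantics: the
+ interpolation `$(big.(x))` evaluates the conversion once, before the benchmark
+ runs, so only the summation itself is timed; writing `big.($x)` instead would
+ re-convert the array (and time that conversion) on every sample:
+
+ ```julia
+ using BenchmarkTools
+ @btime sum($(big.(x)))  # times the sum only
+ @btime sum(big.($x))    # times conversion + sum on every sample
+ ```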
@@ -124,32 +140,37 @@ than their naive floating-point counterparts. As such, they usually perform
worse. However, leveraging the power of modern architectures via vectorization,
the slow-down can be kept to a small value.

- ![](test/figs/perf.svg)
-
- In the graph above, the time spent in the summation (renormalized per element)
- is plotted against the vector size (the units in the y-axis label should be
- “ns/elem”). What we see with the standard summation is that, once vectors start
- having significant sizes (say, more than 1000 elements), the implementation is
- memory bound (as expected of a typical BLAS1 operation). Which is why we see
- significant decreases in the performance when the vector can’t fit into the L2
- cache (around 30k elements, or 256kB on my machine) or the L3 cache (around 400k
- elements, or 3MB on y machine).
-
- The Ogita-Rump-Oishi algorithm, when implemented with a suitable unrolling level
- (ushift=2, i.e 2²=4 unrolled iterations), is CPU-bound when vectors fit inside
- the cache. However, when vectors are to large to fit into the L3 cache, the
- implementation becomes memory-bound again (on my system), which means we get the
- same performance as the standard summation.
+ ![](test/figs/sum_performance.svg)
+ ![](test/figs/dot_performance.svg)
+
+ Benchmarks presented in the above graphs were obtained on an Intel® Xeon® Gold
+ 6128 CPU @ 3.40GHz. The time spent in the summation (renormalized per element)
+ is plotted against the vector size. What we see with the standard summation is
+ that, once vectors start having significant sizes (say, more than a few
+ thousand elements), the implementation is memory bound (as expected of a
+ typical BLAS1 operation), which is why we see significant decreases in the
+ performance when the vector can’t fit into the L1, L2 or L3 cache.
+
+ On this AVX512-enabled system, the Kahan-Babuska-Neumaier implementation tends
+ to be a little more efficient than the Ogita-Rump-Oishi algorithm (it would
+ generally be the opposite on AVX2 systems). When implemented with a suitable
+ unrolling level and cache prefetching, these implementations are CPU-bound when
+ vectors fit inside the L1 or L2 cache. However, when vectors are too large to
+ fit into the L2 cache, the implementation becomes memory-bound again (on this
+ system), which means we get the same performance as the standard
+ summation. The same can be said of dot product calculations
+ (graph on the right), where the implementations from `AccurateArithmetic.jl`
+ compete against MKL's dot product.
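+
+ To see why unrolling helps, consider the following illustrative (and entirely
+ hypothetical, i.e. not the package's actual implementation) 4-way unrolled
+ naive sum: independent accumulators break the loop-carried dependency on the
+ floating-point adder, letting the CPU overlap several additions in flight:
+
+ ```julia
+ # Illustrative sketch: 4 independent accumulators hide the latency of `+`.
+ # (Note that this changes the association order, hence possibly the result.)
+ function sum_unrolled4(x::AbstractVector{Float64})
+     n = length(x)
+     s1 = s2 = s3 = s4 = 0.0
+     i = 1
+     @inbounds while i + 3 <= n
+         s1 += x[i]; s2 += x[i+1]; s3 += x[i+2]; s4 += x[i+3]
+         i += 4
+     end
+     s = (s1 + s2) + (s3 + s4)
+     @inbounds while i <= n   # remainder loop
+         s += x[i]; i += 1
+     end
+     return s
+ end
+ ```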

In other words, the improved accuracy is free for sufficiently large
- vectors. For smaller vectors, the accuracy comes with a slow-down that can reach
- values slightly above 3 for vectors which fit in the L2 cache.
+ vectors. For smaller vectors, the accuracy comes with a slow-down by a factor of
+ approximately 3 for vectors that fit in the L2 cache.


### Tests

The graphs above can be reproduced using the `test/perftests.jl` script in this
- repository. Before running them, be aware that it takes around one hour to
+ repository. Before running them, be aware that it takes around two hours to
generate the performance graph, during which the benchmark machine should be as
lightly loaded as possible in order to avoid perturbing performance measurements.
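+
+ One plausible way to run the script from the repository root (this invocation
+ is a guess based on standard Julia conventions, not documented here):
+
+ ```julia
+ # Run from a Julia session started in the repository root; assumes the
+ # test environment declares the script's dependencies.
+ using Pkg
+ Pkg.activate("test")
+ Pkg.instantiate()
+ include("test/perftests.jl")
+ ```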