
Commit 1fa3f88

Merge pull request #19 from JuliaMath/ff/prefetch

@explicit/@fusible + cache prefetching

2 parents 974593c + a284b4c

19 files changed: +15621 −1449 lines

.travis.yml

Lines changed: 1 addition & 0 deletions

```diff
@@ -5,6 +5,7 @@ os:
 julia:
   - 1.3
   - 1.4
+  - 1.5
   - nightly
 matrix:
   fast_finish: true
```

Project.toml

Lines changed: 1 addition & 3 deletions

```diff
@@ -2,7 +2,7 @@ name = "AccurateArithmetic"
 uuid = "22286c92-06ac-501d-9306-4abd417d9753"
 license = "MIT"
 repo = "https://github.com/JuliaMath/AccurateArithmetic.jl.git"
-version = "0.3.3"
+version = "0.3.4"

 [deps]
 JSON = "682c06a0-de6a-54ab-a142-c8b1cf79cde6"
@@ -12,9 +12,7 @@ SIMDPirates = "21efa798-c60a-11e8-04d3-e1a92915a26a"
 VectorizationBase = "3d5dd08c-fd9d-11e8-17fa-ed2836048c2f"

 [compat]
-BenchmarkTools = "0"
 JSON = "0"
-Plots = "0"
 SIMDPirates = "0.2, 0.3, 0.5, 0.6.3, 0.7, 0.8"
 VectorizationBase = "0"
 julia = "1"
```

README.md

Lines changed: 57 additions & 36 deletions

````diff
@@ -56,39 +56,48 @@ julia> sum(big.(x))
 # But the standard summation algorithm computes this sum very inaccurately
 # (not even the sign is correct)
 julia> sum(x)
--136.0
+-8.0


 # Compensated summation algorithms should compute this more accurately
 julia> using AccurateArithmetic

 # Algorithm by Ogita, Rump and Oishi
 julia> sum_oro(x)
-0.9999999999999716
+1.0000000000000084

 # Algorithm by Kahan, Babuska and Neumaier
 julia> sum_kbn(x)
-0.9999999999999716
+1.0000000000000084
 ```


-![](test/figs/qual.svg)
+![](test/figs/sum_accuracy.svg)
+![](test/figs/dot_accuracy.svg)

-In the graph above, we see the relative error vary as a function of the
+In the graphs above, we see the relative error vary as a function of the
 condition number, in a log-log scale. Errors lower than ϵ are arbitrarily set to
 ϵ; conversely, when the relative error is more than 100% (i.e no digit is
 correctly computed anymore), the error is capped there in order to avoid
-affecting the scale of the graph too much. What we see is that the pairwise
+affecting the scale of the graph too much. What we see on the left is that the pairwise
 summation algorithm (as implemented in Base.sum) starts losing accuracy as soon
 as the condition number increases, computing only noise when the condition
-number exceeds 1/ϵ≃10¹⁶. In contrast, both compensated algorithms
+number exceeds 1/ϵ≃10¹⁶. The same goes for the naive summation algorithm.
+In contrast, both compensated algorithms
 (Kahan-Babuska-Neumaier and Ogita-Rump-Oishi) still accurately compute the
 result at this point, and start losing accuracy there, computing meaningless
 results when the condition number reaches 1/ϵ²≃10³². In effect these (simply)
 compensated algorithms produce the same results as if a naive summation had been
 performed with twice the working precision (128 bits in this case), and then
 rounded to 64-bit floats.

+The same comments can be made for the dot product implementations shown on the
+right. Uncompensated algorithms, as implemented in
+`AccurateArithmetic.dot_naive` or `Base.dot` (which internally calls BLAS in
+this case), exhibit a typical loss of accuracy. In contrast, the implementation
+of Ogita, Rump & Oishi's compensated algorithm effectively doubles the working
+precision.
+
 <br/>

 Performancewise, compensated algorithms perform a lot better than alternatives
@@ -97,24 +106,31 @@ such as arbitrary precision (`BigFloat`) or rational arithmetic (`Rational`) :
 ```julia
 julia> using BenchmarkTools

+julia> length(x)
+10001
+
 julia> @btime sum($x)
-  1.305 μs (0 allocations: 0 bytes)
--136.0
+  1.320 μs (0 allocations: 0 bytes)
+-8.0
+
+julia> @btime sum_naive($x)
+  1.026 μs (0 allocations: 0 bytes)
+-1.121325337906356

 julia> @btime sum_oro($x)
-  3.421 μs (0 allocations: 0 bytes)
-0.9999999999999716
+  3.348 μs (0 allocations: 0 bytes)
+1.0000000000000084

 julia> @btime sum_kbn($x)
-  3.792 μs (0 allocations: 0 bytes)
-0.9999999999999716
+  3.870 μs (0 allocations: 0 bytes)
+1.0000000000000084

-julia> @btime sum(big.($x))
-  874.203 μs (20006 allocations: 1.14 MiB)
+julia> @btime sum($(big.(x)))
+  437.495 μs (2 allocations: 112 bytes)
 1.0

-julia> @btime sum(Rational{BigInt}.(x))
-  22.702 ms (582591 allocations: 10.87 MiB)
+julia> @btime sum($(Rational{BigInt}.(x)))
+  10.894 ms (259917 allocations: 4.76 MiB)
 1//1
 ```

@@ -124,32 +140,37 @@ than their naive floating-point counterparts. As such, they usually perform
 worse. However, leveraging the power of modern architectures via vectorization,
 the slow down can be kept to a small value.

-![](test/figs/perf.svg)
-
-In the graph above, the time spent in the summation (renormalized per element)
-is plotted against the vector size (the units in the y-axis label should be
-“ns/elem”). What we see with the standard summation is that, once vectors start
-having significant sizes (say, more than 1000 elements), the implementation is
-memory bound (as expected of a typical BLAS1 operation). Which is why we see
-significant decreases in the performance when the vector can’t fit into the L2
-cache (around 30k elements, or 256kB on my machine) or the L3 cache (around 400k
-elements, or 3MB on my machine).
-
-The Ogita-Rump-Oishi algorithm, when implemented with a suitable unrolling level
-(ushift=2, i.e 2²=4 unrolled iterations), is CPU-bound when vectors fit inside
-the cache. However, when vectors are too large to fit into the L3 cache, the
-implementation becomes memory-bound again (on my system), which means we get the
-same performance as the standard summation.
+![](test/figs/sum_performance.svg)
+![](test/figs/dot_performance.svg)
+
+Benchmarks presented in the above graphs were obtained on an Intel® Xeon® Gold
+6128 CPU @ 3.40GHz. The time spent in the summation (renormalized per element)
+is plotted against the vector size. What we see with the standard summation is
+that, once vectors start having significant sizes (say, more than a few
+thousand elements), the implementation is memory bound (as expected of a
+typical BLAS1 operation). This is why we see significant decreases in
+performance when the vector can’t fit into the L1, L2 or L3 cache.
+
+On this AVX512-enabled system, the Kahan-Babuska-Neumaier implementation tends
+to be a little more efficient than the Ogita-Rump-Oishi algorithm (it would
+generally be the opposite on AVX2 systems). When implemented with a suitable
+unrolling level and cache prefetching, these implementations are CPU-bound when
+vectors fit inside the L1 or L2 cache. However, when vectors are too large to
+fit into the L2 cache, the implementation becomes memory-bound again (on this
+system), and we get the same performance as the standard summation. The same
+can be said for dot product calculations (graph on the right), where the
+implementations from `AccurateArithmetic.jl` compete against MKL's dot product.

 In other words, the improved accuracy is free for sufficiently large
-vectors. For smaller vectors, the accuracy comes with a slow-down that can reach
-values slightly above 3 for vectors which fit in the L2 cache.
+vectors. For smaller vectors, the accuracy comes with a slow-down by a factor of
+approximately 3 in the L2 cache.


 ### Tests

 The graphs above can be reproduced using the `test/perftests.jl` script in this
-repository. Before running them, be aware that it takes around one hour to
+repository. Before running them, be aware that it takes around two hours to
 generate the performance graph, during which the benchmark machine should be as
 low-loaded as possible in order to avoid perturbing performance measurements.
````
src/SIMDops.jl

Lines changed: 83 additions & 12 deletions

```diff
@@ -1,6 +1,50 @@
 module SIMDops
 import SIMDPirates
-using SIMDPirates: Vec, vload, vsum, vabs, vfma, vifelse, vless
+using SIMDPirates: Vec, vload, vsum, vabs, vfma, vifelse, vless, vzero
+
+
+# * Generic operations on SIMD packs
+
+fma(x...) = Base.fma(x...)
+fma(x::T, y::T, z::T) where {T<:NTuple} = SIMDPirates.vfma(x, y, z)
+
+fptype(::Type{Vec{W, T}}) where {W, T} = T
+
+
+# * Macros rewriting mathematical operations
+
+# ** Helper functions
+
+replace_simd_ops(x, _, _) = x
+
+function replace_simd_ops(sym::Symbol, ops, updates)
+    # Replace e.g:  +
+    #           =>  eadd
+    for (op, replacement) in ops
+        sym === op && return replacement
+    end
+    sym
+end
+
+function replace_simd_ops(expr::Expr, ops, updates)
+    # Replace e.g:  lhs += rhs
+    #           =>  lhs = eadd(lhs, rhs)
+    for (op, replacement) in updates
+        if expr.head === op
+            lhs = expr.args[1]
+            rhs = replace_simd_ops(expr.args[2], ops, updates)
+            return :($lhs = $replacement($lhs, $rhs))
+        end
+    end
+
+    newexpr = Expr(replace_simd_ops(expr.head, ops, updates))
+    for arg in expr.args
+        push!(newexpr.args, replace_simd_ops(arg, ops, updates))
+    end
+    newexpr
+end
+
+# ** Explicit/exact operations (non-fusible)

 eadd(x...) = Base.:+(x...)
 eadd(x::T, y::T) where {T<:NTuple} = SIMDPirates.evadd(x, y)
@@ -13,20 +57,47 @@ esub(x::T, y::T) where {T<:NTuple} = SIMDPirates.evsub(x, y)
 emul(x...) = Base.:*(x...)
 emul(x::T, y::T) where {T<:NTuple} = SIMDPirates.evmul(x, y)

-fma(x...) = Base.fma(x...)
-fma(x::T, y::T, z::T) where {T<:NTuple} = SIMDPirates.vfma(x, y, z)
+function explicit(expr::Expr)
+    replace_simd_ops(expr,
+                     (:+ => eadd,
+                      :- => esub,
+                      :* => emul,
+                      :fma => fma),
+                     (:+= => eadd,
+                      :-= => esub,
+                      :*= => emul))
+end

-macro explicit()
-    quote
-        $(esc(:+)) = eadd
-        $(esc(:-)) = esub
-        $(esc(:*)) = emul
-    end
+macro explicit(expr)
+    explicit(expr) |> esc
 end

-vzero(x) = zero(x)
-vzero(::Type{Vec{W, T}}) where {W, T} = SIMDPirates.vbroadcast(Vec{W, T}, 0)

-fptype(::Type{Vec{W, T}}) where {W, T} = T
+# * Fast/fusible operations
+
+fadd(x...) = Base.:+(x...)
+fadd(x::T, y::T) where {T<:NTuple} = SIMDPirates.vadd(x, y)
+
+fsub(x...) = Base.:-(x...)
+fsub(x::NTuple) = broadcast(esub, x)
+fsub(x::T, y::T) where {T<:NTuple} = SIMDPirates.vsub(x, y)
+
+fmul(x...) = Base.:*(x...)
+fmul(x::T, y::T) where {T<:NTuple} = SIMDPirates.vmul(x, y)
+
+function fusible(expr::Expr)
+    replace_simd_ops(expr,
+                     (:+ => fadd,
+                      :- => fsub,
+                      :* => fmul,
+                      :fma => fma),
+                     (:+= => fadd,
+                      :-= => fsub,
+                      :*= => fmul))
+end
+
+macro fusible(expr)
+    fusible(expr) |> esc
+end

 end
```
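For context, the new macros take an expression and rewrite its operators, instead of shadowing `+`, `-` and `*` in the caller's scope as the old `@explicit` did. The sketch below illustrates the expected effect, based on the rewriting rules in this diff; the expansion in the comment is schematic (module qualifiers omitted), and `accumulate_sketch` is a hypothetical name.

```julia
using AccurateArithmetic.SIMDops: @explicit, @fusible

# Scalar illustration; on SIMDPirates packs the rewritten calls dispatch
# to evadd/evmul (exact, non-fusible) or vadd/vmul (fusible) instead.
function accumulate_sketch(a, b)
    s = zero(eltype(a))
    for i in eachindex(a, b)
        @explicit s += a[i] * b[i]  # => s = eadd(s, emul(a[i], b[i]))
    end
    s
end
```

`@fusible` performs the same rewriting with `fadd`/`fsub`/`fmul`, whose pack methods allow the compiler to contract multiply-add sequences into FMAs; the compensated algorithms rely on `@explicit` because error-free transformations need each intermediate rounding to happen exactly as written.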

src/Summation.jl

Lines changed: 54 additions & 9 deletions

```diff
@@ -6,6 +6,7 @@ import VectorizationBase

 import ..SIMDops
 using ..SIMDops: Vec, vload, vsum, vzero, fptype
+using SIMDPirates

 using ..EFT: two_sum, fast_two_sum, two_prod

@@ -19,10 +20,11 @@ include("accumulators/compDot.jl")
 #     SIAM Journal on Scientific Computing, 6(26), 2005.
 #     DOI: 10.1137/030601818
 @generated function accumulate(x::NTuple{A, AbstractArray{T}},
-                                accType::F,
-                                rem_handling = Val(:scalar),
-                                ::Val{Ushift} = Val(2)
-                                ) where {F, A, T <: Union{Float32,Float64}, Ushift}
+                               accType::F,
+                               rem_handling = Val(:scalar),
+                               ::Val{Ushift} = Val(2),
+                               ::Val{Prefetch} = Val(0),
+                               ) where {F, A, T <: Union{Float32,Float64}, Ushift, Prefetch}
     @assert 0 ≤ Ushift < 6
     U = 1 << Ushift

@@ -43,6 +45,10 @@ include("accumulators/compDot.jl")
         Nshift = N >> $(shift + Ushift)
         offset = 0
         for n in 1:Nshift
+            if $Prefetch > 0
+                SIMDPirates.prefetch.(px.+offset.+$(Prefetch*WT), Val(3), Val(0))
+            end
+
             Base.Cartesian.@nexprs $U u -> begin
                 xi = vload.($V, px.+offset)
                 add!(acc_u, xi...)
@@ -85,11 +91,50 @@ include("accumulators/compDot.jl")
     end
 end

-sum_naive(x) = accumulate((x,), sumAcc, Val(:scalar), Val(3))
-sum_kbn(x) = accumulate((x,), compSumAcc(fast_two_sum), Val(:scalar), Val(2))
-sum_oro(x) = accumulate((x,), compSumAcc(two_sum), Val(:scalar), Val(2))
+# Default values for unrolling
+@inline default_ushift(::SumAcc)     = Val(3)
+@inline default_ushift(::CompSumAcc) = Val(2)
+@inline default_ushift(::DotAcc)     = Val(3)
+@inline default_ushift(::CompDotAcc) = Val(2)
+# dispatch
+#   either default_ushift(x, acc)
+#   or     default_ushift((x,), acc)
+@inline default_ushift(x::AbstractArray, acc) = default_ushift(acc(eltype(x)))
+@inline default_ushift(x::NTuple, acc) = default_ushift(first(x), acc)
+
+
+# Default values for cache prefetching
+@inline default_prefetch(::SumAcc)     = Val(0)
+@inline default_prefetch(::CompSumAcc) = Val(35)
+@inline default_prefetch(::DotAcc)     = Val(0)
+@inline default_prefetch(::CompDotAcc) = Val(20)
+# dispatch
+#   either default_prefetch(x, acc)
+#   or     default_prefetch((x,), acc)
+@inline default_prefetch(x::AbstractArray, acc) = default_prefetch(acc(eltype(x)))
+@inline default_prefetch(x::NTuple, acc) = default_prefetch(first(x), acc)
+
+
+@inline _sum(x, acc) = if length(x) < 500
+    # no cache prefetching for small vectors
+    accumulate((x,), acc, Val(:scalar), default_ushift(x, acc), Val(0))
+else
+    accumulate((x,), acc, Val(:scalar), default_ushift(x, acc), default_prefetch(x, acc))
+end
+
+sum_naive(x) = _sum(x, sumAcc)
+sum_kbn(x)   = _sum(x, compSumAcc(fast_two_sum))
+sum_oro(x)   = _sum(x, compSumAcc(two_sum))
+
+
+@inline _dot(x, y, acc) = if length(x) < 500
+    # no cache prefetching for small vectors
+    accumulate((x,y), acc, Val(:scalar), default_ushift(x, acc), Val(0))
+else
+    accumulate((x,y), acc, Val(:scalar), default_ushift(x, acc), default_prefetch(x, acc))
+end

-dot_naive(x, y) = accumulate((x,y), dotAcc, Val(:scalar), Val(3))
-dot_oro(x, y) = accumulate((x,y), compDotAcc, Val(:scalar), Val(2))
+dot_naive(x, y) = _dot(x, y, dotAcc)
+dot_oro(x, y)   = _dot(x, y, compDotAcc)

 end
```
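With the trailing `Val{Prefetch}` parameter, `accumulate` now prefetches data `Prefetch` blocks ahead of the current offset, and the `_sum`/`_dot` wrappers enable this only for vectors of at least 500 elements. Below is a sketch of how the public entry points relate to direct `accumulate` calls, assuming the module layout shown in this diff (`Summation.accumulate` and `compSumAcc` are internal names, not public API):

```julia
using AccurateArithmetic
using AccurateArithmetic.Summation: accumulate, compSumAcc
using AccurateArithmetic.EFT: two_sum

x = rand(100_000)

# Public entry point: picks unrolling and prefetch defaults itself
# (Ushift = 2 and Prefetch = 35 for compensated sums on large vectors).
sum_oro(x)

# Roughly equivalent direct call: Ogita-Rump-Oishi accumulator, scalar
# remainder handling, 2^2 = 4-way unrolling, prefetch 35 blocks ahead.
accumulate((x,), compSumAcc(two_sum), Val(:scalar), Val(2), Val(35))
```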
