% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/Quality_checking.R
\name{despikeLF}
\alias{despikeLF}
\title{Low Frequency Data Despiking}
\usage{
despikeLF(
x,
var,
qc_flag,
name_out = "-",
var_thr = NULL,
iter = 10,
plot = FALSE,
light = c("PAR", "GR"),
night_thr = 10,
nVals = 50,
z = 7,
c = 4.4478
)
}
\arguments{
\item{x}{A data frame with column names representing required variables. See
'Details' below.}
\item{var}{A character string. Specifies the variable name in \code{x} with
values to be despiked.}
\item{qc_flag}{A character string. Specifies the column name in \code{x} with
\code{var} related quality control flag.}
\item{name_out}{A character string providing \code{varnames} attribute value
of the output.}
\item{var_thr}{A numeric vector with 2 non-missing values. Specifies fixed
thresholds for \code{var} values. Values outside this range will be flagged
as spikes (flag 2). If \code{var_thr = NULL}, thresholds are not applied.}
\item{iter}{An integer value. Defines number of despiking iterations.}
\item{plot}{A logical value. If \code{TRUE}, a list of \code{\link{ggplot}}
objects visualizing the spikes is also produced.}
\item{light}{A character string. Selects preferred variable for incoming
light intensity. \code{"PAR"} or \code{"GR"} is allowed. Can be
abbreviated. If \code{light = NULL}, \code{var} values are not separated to
night/day subsets and \code{night_thr} is not used.}
\item{night_thr}{A numeric value that defines the threshold between night
(\code{light} values equal to or lower than \code{night_thr}) and day
(\code{light} values higher than \code{night_thr}) for incoming light.}
\item{nVals}{A numeric value. Number of values within 13-day blocks required
to obtain robust statistics.}
\item{z}{A numeric value. \eqn{MAD} scale factor.}
\item{c}{A numeric value. \code{\link{mad}} scale factor. Default is
\code{3 * \link{mad} constant} (i.e. \code{3 * 1.4826 = 4.4478}).}
}
\value{
If \code{plot = FALSE}, an integer vector with attributes
\code{"varnames"} and \code{"units"}. If \code{plot = TRUE}, a list with
elements \code{SD} and \code{plots}. \code{SD} contains the same output as
produced when \code{plot = FALSE}; \code{plots} contains a list of
\code{ggplot} objects for each iteration, \code{light} subset and
13-day period.

Side effect: the counts of spikes detected in each iteration are printed to
the console.
}
\description{
Scaled median absolute deviation from the median is applied to
double-differenced time series to identify outliers.
}
\details{
Low Frequency Data Despiking is not an additive quality control (QC) test.
\code{despikeLF} follows the QC scheme using QC flag range 0 - 2.
\code{varnames} attribute of returned vector should be chosen to follow the
'Naming Strategy' described in \code{\link{extract_QC}}, i.e. to be
distinguished by suffix \code{"_spikesLF"}.

The data frame \code{x} is expected to have certain properties. It must
contain a column named \code{"timestamp"} of class \code{"POSIXt"} with a
regular sequence of date-time values, typically with a (half-)hourly time
interval. Missing values in \code{"timestamp"} are not allowed. Thus, if no
records exist for a given date-time value, that date-time value still has to
be included. \code{x} also has to contain the columns required by the
argument values. If QC flags are not available for \code{var}, \code{qc_flag}
still has to be included in \code{x} as a named column with all values set to
\code{0} (i.e. all values will be checked for outliers).
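
The requirements on \code{x} can be illustrated with a minimal sketch. The column names \code{NEE} and \code{qc_NEE}, the dates and the synthetic values are hypothetical and serve only as an example of a conforming input.

```r
# Regular half-hourly timestamps covering two days (no gaps, no NAs)
timestamp <- seq(
  from = as.POSIXct("2020-06-01 00:15:00", tz = "GMT"),
  by = "30 min",
  length.out = 96
)

# Hypothetical variable "NEE"; one record is missing but its row is kept
x <- data.frame(timestamp = timestamp, NEE = sin(seq_len(96) / 8))
x$NEE[10] <- NA

# No QC flags available: supply a flag column filled with zeros
x$qc_NEE <- 0L

# Checks mirroring the stated requirements on the timestamp column
stopifnot(inherits(x$timestamp, "POSIXt"), !anyNA(x$timestamp))
stopifnot(length(unique(diff(as.numeric(x$timestamp)))) == 1)
```

With such a data frame, \code{var = "NEE"} and \code{qc_flag = "qc_NEE"} would name the relevant columns.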

Only non-missing \code{var} values with corresponding \code{qc_flag} values
below \code{2} are used to detect the outliers. Missing \code{var} values and
those with assigned flag \code{2} or \code{NA} are not checked and are marked
by an \code{NA} flag in the output. Thus \code{NA} values returned by
\code{despikeLF} should be considered unchecked records and therefore
interpreted as flag \code{0} within the \code{0 - 2} quality control scheme.

\code{var_thr} is intended for the exclusion of data clearly outside of the
theoretically acceptable range for the whole dataset. If \code{var_thr} is
specified, \code{var} values below \code{var_thr[1]} and above
\code{var_thr[2]} are marked as spikes (flag 2) in the output. Such values
are then excluded when computing statistics on the double-differenced time
series.
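
The fixed-threshold step can be sketched as follows. This is an illustration of the described rule, not the internal code of \code{despikeLF}; the values and thresholds are hypothetical.

```r
v <- c(-3.2, 55.0, 1.4, NA, -80.1, 0.7)  # hypothetical var values
var_thr <- c(-50, 50)                    # hypothetical fixed thresholds

# Values outside the range get flag 2, values inside get flag 0.
# Missing values cannot be compared, so their flag stays NA.
out_of_range <- v < var_thr[1] | v > var_thr[2]
flag <- ifelse(out_of_range, 2L, 0L)

# Flagged values are dropped before any further statistics are computed
v_used <- v
v_used[which(out_of_range)] <- NA
```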

\code{light} and \code{night_thr} are intended to separate the data into
night and day subsets with different statistical properties. \code{NA}s in
\code{x[light]} are thus not allowed due to the subsetting. Despiking is then
applied to the individual subsets and the combined QC flags are returned.
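
A minimal sketch of the night/day separation, assuming hypothetical \code{PAR} and \code{var} values. It only demonstrates the comparison against \code{night_thr}, not the actual subsetting code of \code{despikeLF}.

```r
PAR <- c(0, 4, 10, 11, 350, 900)            # hypothetical light values
v   <- c(1.1, 0.9, 1.3, -2.0, -8.5, -12.4)  # hypothetical var values
night_thr <- 10

# Missing light values would make the subsetting ambiguous
stopifnot(!anyNA(PAR))

# Night: light equal to or lower than the threshold; day: higher
night <- PAR <= night_thr
subsets <- split(v, ifelse(night, "night", "day"))
```

Here \code{subsets$night} holds the first three values and \code{subsets$day} the remaining three.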

Despiking is done within blocks of 13 consecutive days to account for
the seasonality of the measured variable. Within each block, each record is
compared with its neighbours and \eqn{d[i]} scores are produced. This is
achieved by double-differencing:
\deqn{d[i] = (var[i] - var[i-1]) - (var[i+1] - var[i])}
To obtain the maximum number of \eqn{d[i]} scores, all missing
\code{var} values are removed from the block before the \eqn{d[i]} scores are
produced. \code{var} values are marked as spikes if \eqn{d[i]} is higher
(lower) than the median of \eqn{d[i]} scores (\eqn{M[d]}) + (-) the scaled
median absolute deviation:
\deqn{d[i] > M[d] + (z * MAD / 0.6745)}
\deqn{d[i] < M[d] - (z * MAD / 0.6745)}
\eqn{MAD} is defined as:
\deqn{MAD = median(abs(d[i] - M[d]))}
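
The double-differencing and the scaled MAD thresholds can be sketched for a single block as follows. The helper \code{double_diff_flags} and the example values are hypothetical; the sketch follows the formulas above and is not the package implementation.

```r
# Hypothetical helper: d[i] scores and spike candidates for one block
double_diff_flags <- function(v, z = 7) {
  v <- v[!is.na(v)]               # drop missing values first
  n <- length(v)
  d <- rep(NA_real_, n)           # first and last record have no d[i]
  i <- 2:(n - 1)
  d[i] <- (v[i] - v[i - 1]) - (v[i + 1] - v[i])
  M_d <- median(d, na.rm = TRUE)
  MAD <- median(abs(d - M_d), na.rm = TRUE)
  half_width <- z * MAD / 0.6745
  list(d = d, spike = d > M_d + half_width | d < M_d - half_width)
}

v <- c(1.0, 1.2, 0.9, 1.1, 9.0, 1.0, 1.3, 0.8, 1.1, 1.0)
res <- double_diff_flags(v)
which(res$spike)  # position of the candidate spike
```

In this example only the fifth value (9.0) exceeds the thresholds.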

The algorithm also tends to flag values that are neighbours of spikes. To
prevent false flagging, the \code{\link{median}} and \code{\link{mad}} of
\code{var} values within the given block (\eqn{M[var]} and \eqn{mad[var]},
respectively) are computed. Values can be marked as spikes only if
\deqn{var[i] > M[var] + (c * mad[var] / 1.4826)}
or
\deqn{var[i] < M[var] - (c * mad[var] / 1.4826)}
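
The additional check on the \code{var} values themselves can be sketched in the same way. The variable names and values are hypothetical; \code{stats::mad} is used with its default constant of 1.4826.

```r
v <- c(1.0, 1.2, 0.9, 1.1, 9.0, 1.0, 1.3, 0.8, 1.1, 1.0)
c_factor <- 4.4478                    # default value of argument c

M_var   <- median(v, na.rm = TRUE)
mad_var <- mad(v, na.rm = TRUE)       # already scaled by 1.4826
half_width <- c_factor * mad_var / 1.4826

# Only values outside this band may be marked as spikes
can_be_spike <- v > M_var + half_width | v < M_var - half_width
which(can_be_spike)
```

A value would then be flagged only if it is both a candidate according to its \eqn{d[i]} score and outside this band.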

The number of available double-differenced \code{var} values is checked
within each block. If it is equal to or below \code{nVals}, \eqn{d[i]} or
\eqn{var[i]} values are checked against the statistics computed using the
entire dataset to ensure robustness.

The whole process is repeated if \code{iter > 1}. In each iteration, new
statistics are computed after excluding the outliers already detected, so
that additional spikes can be identified.
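
The iterative scheme can be sketched as a loop that excludes detected values before the next pass. The helper \code{detect_once} is hypothetical and covers only the \eqn{d[i]} test. Stopping early when no spikes are found is an assumption of this sketch, not documented behaviour.

```r
# Hypothetical helper: indices of d[i]-based spike candidates in v
detect_once <- function(v, z = 7) {
  ok <- which(!is.na(v))
  if (length(ok) < 3) return(integer(0))
  w <- v[ok]
  i <- 2:(length(w) - 1)
  d <- (w[i] - w[i - 1]) - (w[i + 1] - w[i])
  M_d <- median(d)
  half_width <- z * median(abs(d - M_d)) / 0.6745
  ok[i[d > M_d + half_width | d < M_d - half_width]]
}

v <- c(1.0, 1.2, 0.9, 1.1, 9.0, 1.0, 1.3, 0.8, 1.1, 1.0)
flag <- integer(length(v))
iter <- 10

for (k in seq_len(iter)) {
  found <- detect_once(v)
  message("iteration ", k, ": ", length(found), " spike(s)")
  if (length(found) == 0) break   # assumption: stop when nothing is found
  flag[found] <- 2L
  v[found] <- NA                  # exclude from the next iteration
}
```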
}
\section{Plotting}{
Plots are produced as a list of \code{ggplot} objects.
Thus they can be assigned to an object and modified as needed before actual
plotting. Each plot consists of two panels. The upper panel shows the
double-differenced time series, the bottom panel the actual \code{var}
values. Grey bands mark the respective intervals in which a \code{var} value
cannot be considered an outlier. The red points in the upper panel show all
points that would be marked as spikes if \code{c = 0}. Only the points
marked in blue (bottom panel) will be considered spikes. The spike
detection tolerance (width of the grey bands) can be modified by the scale
factors \code{z} (upper panel) and \code{c} (bottom panel).
}
\section{Abbreviations}{
\itemize{
\item QC: Quality Control
\item PAR: Photosynthetically Active Radiation [umol m-2 s-1]
\item GR: Global Radiation [W m-2]
}
}
\section{References}{
Mauder, M., Cuntz, M., Drue, C., Graf, A., Rebmann, C.,
Schmid, H.P., Schmidt, M., Steinbrecher, R., 2013. A strategy for quality
and uncertainty assessment of long-term eddy-covariance measurements.
Agric. For. Meteorol. 169, 122-135.
\url{https://doi.org/10.1016/j.agrformet.2012.09.006}

Papale, D., Reichstein, M., Canfora, E., Aubinet, M., Bernhofer, C.,
Longdoz, B., Kutsch, W., Rambal, S., Valentini, R., Vesala, T., Yakir, D.,
2006. Towards a more harmonized processing of eddy covariance CO2 fluxes:
algorithms and uncertainty estimation. Biogeosciences Discuss. 3, 961-992.
\url{https://doi.org/10.5194/bgd-3-961-2006}

Sachs, L., 1996. Angewandte Statistik: Anwendung Statistischer Methoden,
Springer, Berlin.
}
\seealso{
\code{\link{combn_QC}}, \code{\link{extract_QC}},
\code{\link{median}} and \code{\link{mad}}.
}