% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/data_handling.R
\name{merge_eddy}
\alias{merge_eddy}
\title{Merge Regular Date-Time Sequence and Data Frames}
\usage{
merge_eddy(
  x,
  start = NULL,
  end = NULL,
  check_dupl = TRUE,
  interval = NULL,
  format = "\%Y-\%m-\%d \%H:\%M",
  tz = "GMT",
  storage.mode = "integer"
)
}
\arguments{
\item{x}{List of data frames, each with \code{"timestamp"} column of class
\code{"POSIXt"}. Optionally with attributes \code{varnames} and
\code{units} for each column.}

\item{start, end}{A value specifying the first (last) value of the generated
date-time sequence. If \code{NULL}, \code{\link[=min]{min()}} (\code{\link[=max]{max()}})
is taken across the values in \code{"timestamp"} columns across \code{x}
elements. If numeric, the value specifies the year for which the first
(last) date-time value will be generated, considering the given time
\code{interval} and the convention of assigning measured records to the end
of the time interval. Otherwise, a character representation of a specific
half hour is expected in the given \code{format} and \code{tz}.}

\item{check_dupl}{A logical value specifying whether rows with date-time
values duplicated across \code{x} elements should be excluded before
merging.}

\item{interval}{A numeric value specifying the time interval (in seconds) of
the generated date-time sequence.}

\item{format}{A character string. Format of \code{start} (\code{end}) if
provided as a character string. The default \link[=strptime]{format}
is \code{"\%Y-\%m-\%d \%H:\%M"}.}

\item{tz}{A time zone (see \link{time zones}) specification to be used
for the conversion of \code{start} (\code{end}) if provided as a character
string.}

\item{storage.mode}{A character string. Either \code{"integer"} (default) or
\code{"double"} (see Details).}
}
\value{
A data frame with attributes \code{varnames} and \code{units} for
each column, containing date-time information in column \code{"timestamp"}.
}
\description{
Merge a generated regular date-time sequence (timestamp) with one or
more data frames containing a timestamp.
}
\details{
The primary purpose of \code{merge_eddy()} is to combine chunks of data vertically
along their column \code{"timestamp"} with date-time information. This timestamp
is expected to be regular with the given time \code{interval}. The resulting data
frame contains added rows for expected date-time values that are missing in
the timestamp, with \code{NA}s in the other columns of these rows. If
\code{check_dupl = TRUE} and timestamp values across \code{x} elements overlap,
detected duplicated rows are removed (the order in which duplicates are
evaluated depends on the order of \code{x} elements). In the special case when
\code{x} has only one element, missing date-time values are filled in the
\code{"timestamp"} column of the given data frame.

The list of data frames, each with column \code{"timestamp"}, is sequentially
\code{\link[=merge]{merge()}}d using \code{\link[=Reduce]{Reduce()}}. A \emph{(full) outer join},
i.e. \code{merge(..., all = TRUE)}, is performed to keep all columns of
\code{x} elements. The order of \code{x} elements can affect the result.
Duplicated column names within \code{x} elements are corrected using
\code{\link[=make.unique]{make.unique()}}. The merged data frame is then merged on the
validated \code{"timestamp"} column that can be either automatically
extracted from \code{x} or manually specified.

For horizontal merging (adding columns instead of rows), \code{check_dupl = FALSE}
must be set, but a simple \code{\link[=merge]{merge()}} could be preferred.
Combining vertical and horizontal merging should be avoided as the result
depends on the order of \code{x} elements and can lead to row duplication.
Instead, data chunks from different data sources should first be merged
vertically for each source separately and then merged horizontally in a
following step.

If \code{interval = NULL}, automated recognition of \code{interval} is applied. This is
preferred to setting the \code{interval} value manually. The inferred
interval is the shortest time interval present among \code{x} records. Only in
rare cases, when the original time interval is not present in \code{x} due to
gaps, can it not be inferred from the timestamps. The expected interval is
then shorter than the inferred one and needs to be set manually.

The default \code{\link{storage.mode}} of the \code{"timestamp"} column is
\code{"integer"} instead of \code{"double"}. This simplifies the application of
\code{\link[=round_df]{round_df()}} (it avoids rounding) but could lead to unexpected behavior
if the date-time information is expected to resolve fractional seconds (it
\code{\link[=trunc]{trunc()}}ates decimals).
}
\examples{
str(ex(eddy_data, 1:6, 1:6))

# fill gaps in timestamp of single data frame
a <- ex(eddy_data, c(1:3, 6), 1:6)
merge_eddy(list(a))
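
# A base-R sketch of the gap filling described in Details, assuming a
# half-hourly timestamp. The names t0, obs, full and filled are hypothetical
# and this is not the internal implementation of merge_eddy().
t0 <- ISOdatetime(2020, 1, 1, 0, 30, 0, tz = "GMT")
obs <- data.frame(timestamp = t0 + 1800 * c(0, 1, 2, 5),
                  H = c(10, 12, 11, 15))
# regular date-time sequence spanning the observed range
full <- data.frame(timestamp = seq(min(obs$timestamp), max(obs$timestamp),
                                   by = 1800))
# left join keeps every expected timestamp, missing records become NA
(filled <- merge(full, obs, by = "timestamp", all.x = TRUE))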

# merge overlapping data frames
(b1 <- ex(eddy_data, 1:5, 1:6))
(b2 <- ex(eddy_data, 4:6, 1:6))
(b <- merge_eddy(list(b1, b2)))
str(b)
attributes(b$timestamp)
typeof(b$timestamp)
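
# A base-R sketch of what integer storage implies for fractional seconds.
# The names whole and frac are hypothetical; only the storage mode of a
# POSIXct vector is changed here, merge_eddy() itself is not called.
whole <- ISOdatetime(2020, 1, 1, 0, 30, 0, tz = "GMT")
frac <- whole + 0.7
typeof(frac)
storage.mode(frac) <- "integer"
typeof(frac)
# the fractional part was dropped, so both values are now equal
frac == whole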

# merge data frames with different number of columns
(c1 <- ex(eddy_data, 8:20, 1:3))
(c <- merge_eddy(list(b1, c1)))

# horizontal merging
(d1 <- ex(eddy_data, 1:6, 1:5))
(d2 <- ex(eddy_data, c(2:4, 6:8), c(1, 4:7)))
(d <- merge_eddy(list(d1, d2), check_dupl = FALSE))
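
# A base-R sketch of the sequential full outer join described in Details.
# The data frames p and q are hypothetical and share only the timestamp
# column, so merge() joins on it by default. This is an illustration under
# these assumptions, not the internal implementation of merge_eddy().
t0 <- ISOdatetime(2020, 1, 1, 0, 30, 0, tz = "GMT")
p <- data.frame(timestamp = t0 + 1800 * 0:2, H = c(10, 12, 11))
q <- data.frame(timestamp = t0 + 1800 * 1:3, LE = c(50, 55, 60))
# Reduce() applies merge() pairwise from left to right across the list
(pq <- Reduce(function(u, v) merge(u, v, all = TRUE), list(p, q)))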

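# A base-R sketch of interval inference as described in Details: the
# shortest spacing between consecutive timestamps. The names stamps and
# inferred are hypothetical and the actual detection in merge_eddy() may
# differ.
stamps <- ISOdatetime(2020, 1, 1, 0, 30, 0, tz = "GMT") +
  1800 * c(0, 1, 4, 5)
(inferred <- min(diff(as.numeric(stamps))))
# If gaps separated every pair of records, e.g. offsets 1800 * c(0, 2, 4),
# the same computation would give 3600 and interval should be set manually.
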
}
\seealso{
\code{\link[=merge]{merge()}}, \code{\link[=Reduce]{Reduce()}}, \code{\link[=strptime]{strptime()}},
\link{time zones}, \code{\link[=make.unique]{make.unique()}}
}