diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md
index 7f7634c..8341a22 100644
--- a/.github/copilot-instructions.md
+++ b/.github/copilot-instructions.md
@@ -4,13 +4,46 @@
 - Follow the workflow: https://lbm364dl.github.io/follow-the-workflow/
 - Follow tidyverse style guide: https://style.tidyverse.org/
-- In the documentation section of functions, ensure these tags exist: `@description`, `@param` (for each function parameter), `@return`, `@export`, `@examples`
 - Maximum line width is 80 characters
-- There must be one space after `#'`
-- If you start the text for an annotation in that same line, then the next lines should be indented two spaces to easily know it's for that same section
-- Finish all doc sentences with full stop
-- Use `snake_case` for column namings
+- For code formatting, use `air format` if it is available on the user's computer
+- Functions should be short, ideally no more than 25 lines. Exceptions are fine if the code is easy to follow, but keep functions short and modular. If necessary, split large functions into smaller ones with meaningful names.
+- Main exported functions should come first in the file. Private helpers should come at the end, after all exported functions.
+- Use `snake_case` for column names in tibbles
 - Always make sure that all rules in this document and all lintrs have passed after a change in the code
+- Don't use imported functions without a namespace prefix (e.g. use `dplyr::filter()` instead of just `filter()`). This also means you must not use `@importFrom` in the roxygen2 documentation of package functions.
+- This is a tidy data project. Don't use `data.frame` or `data.table`. Always use `tibble` from the tidyverse.
+- For argument validation, prefer `rlang` functions over base R. For example, instead of `time_col %in% names(data)`, use `rlang::has_name(data, time_col)`.
+- For error messages, use `cli::cli_abort()` instead of `stop()`. For example, instead of `stop("Time column '", time_col, "' not found in data")`, use `cli::cli_abort("Time column '{time_col}' not found in data")`. The same applies to warnings (`cli::cli_warn()`), info messages, etc. Prefer well-formatted cli messages over base R messages.
+- If the code uses regex, remember that escaped characters must be double-escaped in R strings. For example, to match a literal dot (`.`) you must write `\\.` in the R string; `\.` is not enough.
+- When defining functions that expect column names as arguments, expect symbolic names (unquoted) instead of strings (quoted). For example, use `function(data, time_col)` instead of `function(data, time_col_name)`. Inside the function, use `{{ time_col }}` to refer to the column (see the sketch below).
+- There must not be functions inside functions. All functions must be defined at the top level. If you need a helper function, define it as a private function (with a name starting with a dot) at the end of the script instead of nesting its definition inside the function where it is used.
+- Always use pipes, specifically the native R pipe (`|>`). Use piped expressions as much as possible to improve readability. Ideally, functions should look like this:
+```
+result <- data |>
+  dplyr::filter(...) |>
+  dplyr::mutate(...) |>
+  some_other_function(...) |>
+  dplyr::summarise(...) |>
+  some_final_function(...)
+```
+- Make as many functions as necessary to make code look like the above example.
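As a purely illustrative sketch of several of the conventions above — namespace prefixes, `rlang` validation, `cli` errors, unquoted column arguments with `{{ }}`, and the native pipe — a hypothetical helper might look like this (`summarise_total()` is not a package function):

```
summarise_total <- function(data, value_col, time_col = year) {
  time_name <- rlang::as_name(rlang::enquo(time_col))
  if (!rlang::has_name(data, time_name)) {
    cli::cli_abort("Time column '{time_name}' not found in data")
  }
  data |>
    dplyr::group_by({{ time_col }}) |>
    dplyr::summarise(
      total = sum({{ value_col }}, na.rm = TRUE),
      .groups = "drop"
    )
}
```

A call such as `summarise_total(data, value, year)` would then return one `total` per year.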
Avoid long intermediate expressions that assign to variables, unless necessary for readability. +- For the same purpose, avoid things like for loops. Try to use vectorised operations, use `purrr` if necessary or just plain functions from `dplyr` or `tidyr` to operate on entire columns or datasets at once. Explicit loops should be the last resort. + + +## Documentation + +- Use roxygen2 for function documentation +- It's enough to document only exported functions. The private ones can remain undocumented. The private functions' names should start with a dot (`.`). +- Finish all doc sentences with full stop +- The first line must be the documentation title. No need to use `@title` tag. Keep it short and start with a verb in imperative form. +- Next part should be the description, starting with `@description` tag. This part can be multiple lines. It can become long if necessary, but don't fill it with useless sentences that say nothing. +- Next part should be the parameters, each starting with `@param` tag. Each parameter should have its own `@param` tag. +- Next part should be the return value, starting with `@return` tag. Describe what the function returns. +- Next part should be the `@export` tag. +- Last part should be the examples, starting with `@examples` tag. Provide meaningful examples that show how to use the function. Avoid unnecessary examples. +- In examples, the last line should show the output of the function, so that when the user runs the example, they can see what the function returns. +- There must be one space after `#'` for each roxygen2 line +- If you start the text for one tag in that same line, then the next lines should be indented two spaces to easily know it's for that same section. For example, if each parameter description starts in the same line as `@param`, then the next lines should be indented two spaces. ## Tests diff --git a/DESCRIPTION b/DESCRIPTION index 69061c6..0fe0ab0 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -15,6 +15,7 @@ Description: A set of tools for processing and analyzing data developed in the License: MIT + file LICENSE Imports: cli, + data.table, dplyr, fs, FAOSTAT, @@ -26,6 +27,7 @@ Imports: readr, rlang, stringr, + tibble, tidyr, withr, yaml, @@ -40,8 +42,7 @@ Suggests: knitr, pointblank, rmarkdown, - testthat (>= 3.0.0), - tibble + testthat (>= 3.0.0) Config/testthat/edition: 3 VignetteBuilder: knitr URL: https://eduaguilera.github.io/whep/, https://github.com/eduaguilera/whep diff --git a/NAMESPACE b/NAMESPACE index f75e7ea..119ec7b 100644 --- a/NAMESPACE +++ b/NAMESPACE @@ -8,7 +8,11 @@ export(add_item_cbs_name) export(add_item_prod_code) export(add_item_prod_name) export(build_supply_use) +export(calculate_lmdi) export(expand_trade_sources) +export(fill_growth) +export(fill_linear) +export(fill_sum) export(get_bilateral_trade) export(get_faostat_data) export(get_feed_intake) @@ -16,11 +20,8 @@ export(get_primary_production) export(get_primary_residues) export(get_processing_coefs) export(get_wide_cbs) -export(linear_fill) -export(proxy_fill) -export(sum_fill) export(whep_list_file_versions) export(whep_read_file) +importFrom(data.table,":=") importFrom(pins,pin_fetch) -importFrom(rlang,":=") importFrom(stats,ave) diff --git a/NEWS.md b/NEWS.md index d29f82d..1a62f1d 100644 --- a/NEWS.md +++ b/NEWS.md @@ -2,7 +2,7 @@ # whep 0.2.0 -* Add gapfilling functions `linear_fill()`, `proxy_fill()`, `sum_fill()` (@eduaguilera, #11). +* Add gapfilling functions `fill_linear()`, `fill_sum()` (@eduaguilera, #11). 
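For quick reference, the roxygen2 layout required by the contributor guidelines earlier in this patch (title line without `@title`, then `@description`, `@param`, `@return`, `@export`, and `@examples` ending with a line that prints the result) looks roughly like this hypothetical skeleton — `total_per_year()` and `crops` are placeholders, not package objects:

```
#' Compute total production per year
#'
#' @description
#' Aggregates production values by year. Placeholder function, shown only to
#' illustrate the documentation layout.
#'
#' @param data A tibble with one observation per row.
#' @param value_col Column containing the values to aggregate.
#'
#' @return A tibble with one row per year and a `total` column.
#'
#' @export
#'
#' @examples
#' totals <- total_per_year(crops, value)
#' totals
```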
* Now examples can't fail because of unavailable Internet resources (#58). # whep 0.1.0 diff --git a/R/calculate_lmdi.R b/R/calculate_lmdi.R new file mode 100644 index 0000000..7beeea3 --- /dev/null +++ b/R/calculate_lmdi.R @@ -0,0 +1,821 @@ +#' @title Calculate LMDI decomposition +#' +#' @description +#' Performs LMDI (Log Mean Divisia Index) decomposition analysis with +#' flexible identity parsing, automatic factor detection, and support for +#' multiple periods and groupings. Supports sectoral decomposition using +#' bracket notation for both summing and grouping operations. +#' +#' @details +#' The LMDI method decomposes changes in a target variable into contributions +#' from multiple factors using logarithmic mean weights. This implementation +#' supports: +#' +#' **Flexible identity specification:** +#' - Automatic factor detection from identity string. +#' - Support for ratio calculations (implicit division). +#' - Sectoral aggregation with `[]` notation. +#' - Sectoral grouping with `{}` notation. +#' +#' **Period analysis:** +#' The function can decompose changes over single or multiple periods. +#' Periods are defined by consecutive pairs in the `periods` vector. +#' +#' **Grouping capabilities:** +#' Use `analysis_by` to perform separate decompositions for different +#' groups (e.g., countries, regions) while maintaining consistent factor +#' structure. +#' +#' @section Identity Syntax: +#' The identity parameter uses a special syntax to define decomposition: +#' +#' **Basic format:** `"target:factor1*factor2*factor3"` +#' +#' **Simple decomposition (no sectors):** +#' - Basic: `"emissions:gdp*(emissions/gdp)"` +#' - Complete: `"emissions:(emissions/gdp)*(gdp/population)*population"` +#' +#' **Understanding bracket notation:** +#' +#' Square brackets `[]` specify variables to sum across categories, enabling +#' structural decomposition. The bracket aggregates values BEFORE calculating +#' ratios. +#' +#' **Single-level structural decomposition:** +#' - `"emissions:activity*(activity[sector]/activity)*(emissions[sector]/activity[sector])"` +#' - Creates 3 factors: Activity level, Sectoral structure, Sectoral +#' intensity. +#' +#' **Multi-level structural decomposition:** +#' - Two levels: `"emissions:activity*(activity[sector]/activity)*(activity[sector+fuel]/activity[sector])*(emissions[sector+fuel]/activity[sector+fuel])"` +#' - Creates 4 factors: Activity level, Sector structure, Fuel structure, +#' Sectoral-fuel intensity. +#' +#' @section Data Requirements: +#' The input data frame must contain: +#' - All variables mentioned in the identity. +#' - The time variable (default: "year"). +#' - Grouping variables if using `analysis_by`. +#' - No missing values in key variables for decomposition periods. +#' +#' @param data A data frame containing the variables for decomposition. Must +#' include all variables specified in the identity, time variable, and any +#' grouping variables. +#' @param identity Character. Decomposition identity in format +#' `"target:factor1*factor2*..."`. The target appears before the colon, +#' factors after, separated by asterisks. Supports explicit ratios with +#' `/` and structural decomposition with `[]`. +#' @param identity_labels Named character vector. Custom labels for factors +#' to use in output instead of variable names. Default: NULL uses variable +#' names as-is. +#' @param time_var Character. Name of the time variable column in the data. +#' Default: "year". Must be numeric or coercible to numeric. +#' @param periods Numeric vector. 
Years defining analysis periods. Each +#' consecutive pair defines one period. Default: NULL uses all available +#' years. +#' @param periods_2 Numeric vector. Additional period specification for +#' complex multi-period analyses. Default: NULL. +#' @param analysis_by Character vector. Grouping variables for performing +#' separate decompositions. Default: NULL (single decomposition for all +#' data). +#' @param rolling_mean Numeric. Window size for rolling mean smoothing +#' applied before decomposition. Default: NULL (no smoothing). +#' @param output_format Character. Format of output data frame. Options: +#' `"clean"` (default) or `"detailed"`. +#' @param verbose Logical. If TRUE (default), prints progress messages during +#' decomposition. +#' +#' @return +#' A data frame with LMDI decomposition results containing: +#' - Time variables and grouping variables (if specified). +#' - `additive`: Additive contributions (sum equals total change in target). +#' - `multiplicative`: Multiplicative indices (product equals target ratio). +#' - `multiplicative_log`: Log of multiplicative indices. +#' - Period identifiers and metadata. +#' +#' @export +#' +#' @examples +#' # Simple LMDI decomposition +#' data <- data.frame( +#' year = rep(2010:2015, 2), +#' country = rep(c("ESP", "FRA"), each = 6), +#' emissions = c(100, 105, 110, 115, 120, 125, 200, 210, 220, 230, 240, 250), +#' gdp = c( +#' 1000, 1050, 1100, 1150, 1200, 1250, +#' 2000, 2100, 2200, 2300, 2400, 2500 +#' ), +#' population = c(46, 46.5, 47, 47.5, 48, 48.5, 65, 65.5, 66, 66.5, 67, 67.5) +#' ) +#' +#' # Example 1: Simple decomposition by country +#' lmdi_basic <- calculate_lmdi( +#' data, +#' identity = "emissions:gdp*(emissions/gdp)", +#' time_var = "year", +#' analysis_by = "country" +#' ) +#' +#' # Example 2: Complete form with multiple factors +#' lmdi_complete <- calculate_lmdi( +#' data, +#' identity = "emissions:(emissions/gdp)*(gdp/population)*population", +#' time_var = "year", +#' analysis_by = "country" +#' ) +#' +#' @family calculations +calculate_lmdi <- function( + data, + identity, + identity_labels = NULL, + time_var = "year", + periods = NULL, + periods_2 = NULL, + analysis_by = NULL, + rolling_mean = NULL, + output_format = "clean", + verbose = TRUE +) { + id <- .parse_identity(identity) + target_var <- id$target + factors <- id$factors + + data <- .lmdi_prepare_rolling_mean( + data, + identity, + target_var, + time_var, + rolling_mean, + verbose + ) + + labels <- .lmdi_handle_identity_labels(identity_labels, factors, target_var) + target_label_final <- labels$target_label_final + factor_labels <- labels$factor_labels + + selectors_detected <- .extract_selectors(identity) + group_vars <- NULL + if (length(selectors_detected) > 0) { + group_vars <- selectors_detected + if (verbose) { + message(sprintf( + "Auto-detected selectors: %s", + paste(selectors_detected, collapse = ", ") + )) + } + } + + periods <- .lmdi_setup_periods(data, time_var, periods, periods_2) + + if (!is.null(group_vars)) { + if (!all(group_vars %in% names(data))) { + cli::cli_abort("All group_vars must exist in the data") + } + } + if (!is.null(analysis_by)) { + analysis_by <- unique(analysis_by) + if (!all(analysis_by %in% names(data))) { + cli::cli_abort("All columns in analysis_by must exist in the data") + } + } + + output_format <- match.arg(tolower(output_format), c("clean", "total")) + analysis_cols <- if (is.null(analysis_by)) character(0) else analysis_by + analysis_groups <- if (length(analysis_cols) == 0) { + tibble::tibble(.analysis_id = 
1) + } else { + data |> dplyr::distinct(dplyr::across(dplyr::all_of(analysis_cols))) + } + + tolerance_add <- 1e-6 + tolerance_mult <- 1e-6 + results_all <- list() + period_targets_all <- list() + + for (idx in seq_len(nrow(analysis_groups))) { + if (length(analysis_cols) == 0) { + subset_data <- data + analysis_values <- NULL + } else { + analysis_values <- as.list(analysis_groups[idx, , drop = FALSE]) + subset_data <- data + for (nm in names(analysis_values)) { + subset_data <- subset_data |> + dplyr::filter(.data[[nm]] == analysis_values[[nm]]) + } + if (nrow(subset_data) == 0) { + next + } + } + period_results <- list() + period_targets <- list() + for (i in seq_len(nrow(periods))) { + period_out <- .lmdi_calc_period( + subset_data, + periods, + i, + time_var, + group_vars, + target_var, + factors, + factor_labels, + target_label_final, + identity, + analysis_cols, + analysis_values + ) + if (is.null(period_out)) { + next + } + period_results[[length(period_results) + 1]] <- period_out$result + period_targets[[length(period_targets) + 1]] <- period_out$target + } + if (length(period_results) == 0) { + next + } + analysis_result <- dplyr::bind_rows(period_results) + analysis_period_targets <- dplyr::bind_rows(period_targets) + .lmdi_closure_check( + analysis_result, + analysis_cols, + tolerance_add, + tolerance_mult + ) + results_all[[length(results_all) + 1]] <- analysis_result + period_targets_all[[ + length(period_targets_all) + 1 + ]] <- analysis_period_targets + } + + if (length(results_all) == 0) { + if (verbose) { + message("No results produced.") + } + return(tibble::tibble()) + } + out <- dplyr::bind_rows(results_all) + period_targets_df <- dplyr::bind_rows(period_targets_all) + if (output_format == "clean") { + select_cols <- c( + analysis_cols, + "period", + "period_years", + "factor_label", + "component_type", + "identity", + "identity_var", + "additive", + "multiplicative", + "multiplicative_log" + ) + select_cols <- select_cols[select_cols %in% names(out)] + out <- out |> dplyr::select(dplyr::all_of(select_cols)) + } + attributes_list <- list( + identity = identity, + identity_labels = identity_labels, + identity_var = target_var, + period_targets = period_targets_df, + periods = periods, + analysis_by = analysis_cols, + rolling_mean = rolling_mean + ) + for (nm in names(attributes_list)) { + attr(out, nm) <- attributes_list[[nm]] + } + if (verbose) { + message("LMDI calculation complete.") + } + out +} + +.lmdi_prepare_rolling_mean <- function( + data, + identity, + target_var, + time_var, + rolling_mean, + verbose +) { + if (is.null(rolling_mean)) { + return(data) + } + if (!is.numeric(rolling_mean) || rolling_mean < 2) { + cli::cli_abort( + "rolling_mean must be a numeric value >= 2 (number of years for centered moving average)" + ) + } + all_vars <- unique(c( + target_var, + stringr::str_extract_all(identity, "[a-zA-Z0-9_]+")[[1]] + )) + numeric_vars <- all_vars[all_vars %in% names(data)] + numeric_vars <- numeric_vars[sapply(data[numeric_vars], is.numeric)] + group_cols <- setdiff(names(data), c(time_var, numeric_vars)) + if (verbose) { + message("\n--- LMDI Data Preparation ---") + message("Step 1: Panel balancing") + } + n_before <- nrow(data) + if (length(group_cols) > 0) { + data <- data |> + tidyr::complete( + !!!rlang::syms(c(time_var, group_cols)), + fill = as.list( + stats::setNames(rep(NA_real_, length(numeric_vars)), numeric_vars) + ) + ) + } + n_after <- nrow(data) + n_added <- n_after - n_before + if (verbose && n_added > 0) { + message(sprintf(" - 
Original: %d rows", n_before)) + message(sprintf( + " - Balanced: %d rows (+%d rows added)", + n_after, + n_added + )) + } + if (verbose) { + message("\nStep 2: Zero/NA treatment (Ang, 2015 methodology)") + } + epsilon <- 1e-10 + data <- data |> + dplyr::mutate( + dplyr::across( + dplyr::all_of(numeric_vars), + ~ dplyr::if_else(is.na(.x) | .x == 0, epsilon, .x) + ) + ) + if (verbose) { + message(sprintf(" - Replacement value: %.2e (epsilon)", epsilon)) + message("-----------------------------\n") + } + years_per_group <- if (length(group_cols) > 0) { + data |> + dplyr::group_by(dplyr::across(dplyr::all_of(group_cols))) |> + dplyr::summarise( + n_years = dplyr::n_distinct(.data[[time_var]]), + .groups = "drop" + ) + } else { + tibble::tibble(n_years = dplyr::n_distinct(data[[time_var]])) + } + min_years <- min(years_per_group$n_years) + k_orig <- as.integer(rolling_mean) + k_eff <- k_orig + if (min_years < k_orig) { + k_eff <- max( + 3L, + min(min_years, ifelse(k_orig %% 2 == 1, k_orig, k_orig - 1)) + ) + if (k_eff > min_years) { + k_eff <- max(3L, min_years - (1L - (min_years %% 2))) + } + if (verbose) { + message(sprintf( + "rolling_mean=%d larger than available years (min=%d). Using k=%d.", + k_orig, + min_years, + k_eff + )) + } + } + if (verbose) { + message(sprintf( + "Applying %d-year centered rolling mean to: %s", + k_eff, + paste(numeric_vars, collapse = ", ") + )) + } + if (length(group_cols) > 0) { + data <- data |> + dplyr::arrange(!!!rlang::syms(group_cols), .data[[time_var]]) |> + dplyr::group_by(dplyr::across(dplyr::all_of(group_cols))) |> + dplyr::mutate(dplyr::across( + dplyr::all_of(numeric_vars), + ~ zoo::rollmean(.x, k = k_eff, fill = NA, align = "center") + )) |> + dplyr::ungroup() |> + dplyr::filter(!is.na(.data[[target_var]])) + } else { + data <- data |> + dplyr::arrange(.data[[time_var]]) |> + dplyr::mutate(dplyr::across( + dplyr::all_of(numeric_vars), + ~ zoo::rollmean(.x, k = k_eff, fill = NA, align = "center") + )) |> + dplyr::filter(!is.na(.data[[target_var]])) + } + if (nrow(data) == 0) { + cli::cli_warn( + "No data remaining after applying rolling mean. Consider using a smaller rolling_mean value." 
+ ) + } + if (verbose) { + message(sprintf( + "Data reduced to %d rows after rolling mean", + nrow(data) + )) + } + data +} + +.lmdi_handle_identity_labels <- function(identity_labels, factors, target_var) { + if (!is.null(identity_labels)) { + if (length(identity_labels) != length(factors) + 1) { + stop(sprintf( + "identity_labels must have %d elements: 1 for target + %d for factors.", + length(factors) + 1, + length(factors) + )) + } + target_label_final <- identity_labels[1] + factor_labels <- identity_labels[-1] + } else { + target_label_final <- target_var + factor_labels <- factors + } + list(target_label_final = target_label_final, factor_labels = factor_labels) +} + +.lmdi_setup_periods <- function(data, time_var, periods, periods_2) { + if (is.null(periods)) { + years <- sort(unique(data[[time_var]])) + if (length(years) < 2) { + cli::cli_abort("Need at least two periods to perform the decomposition") + } + periods <- data.frame(t0 = years[-length(years)], tT = years[-1]) + } else { + if (is.matrix(periods) || is.data.frame(periods)) { + periods <- as.data.frame(periods) + names(periods) <- c("t0", "tT") + } else if (is.vector(periods)) { + if (length(periods) == 2) { + periods <- data.frame(t0 = periods[1], tT = periods[2]) + } else if (length(periods) > 2) { + periods <- data.frame( + t0 = periods[-length(periods)], + tT = periods[-1] + ) + } else { + cli::cli_abort("periods must be a vector of at least 2 years") + } + } else { + cli::cli_abort( + "periods must be a 2-column matrix/data.frame or a vector of years" + ) + } + } + if (!is.null(periods_2)) { + if (is.matrix(periods_2) || is.data.frame(periods_2)) { + periods_2 <- as.data.frame(periods_2) + names(periods_2) <- c("t0", "tT") + } else if (is.vector(periods_2) && length(periods_2) == 2) { + periods_2 <- data.frame(t0 = periods_2[1], tT = periods_2[2]) + } else { + cli::cli_abort( + "periods_2 must be a 2-column matrix/data.frame or a vector of length 2" + ) + } + periods <- rbind(periods, periods_2) + } + periods +} + +.lmdi_calc_period <- function( + subset_data, + periods, + i, + time_var, + group_vars, + target_var, + factors, + factor_labels, + target_label_final, + identity, + analysis_cols, + analysis_values +) { + t0 <- periods$t0[i] + t_final <- periods$tT[i] + d0 <- subset_data |> dplyr::filter(.data[[time_var]] == t0) + d_final <- subset_data |> dplyr::filter(.data[[time_var]] == t_final) + if (nrow(d0) == 0 || nrow(d_final) == 0) { + return(NULL) + } + if (!is.null(group_vars)) { + groups <- d0 |> + dplyr::select(dplyr::all_of(group_vars)) |> + dplyr::distinct() + group_index_str <- paste(group_vars, collapse = "+") + } else { + groups <- data.frame(dummy = 1) + group_index_str <- "" + } + y0_total <- sum(d0[[target_var]], na.rm = TRUE) + y_final_total <- sum(d_final[[target_var]], na.rm = TRUE) + total_change <- y_final_total - y0_total + period_years <- suppressWarnings( + as.numeric(as.character(t_final)) - as.numeric(as.character(t0)) + ) + if (is.na(period_years)) { + period_years <- 0 + } + period_contribs_add <- rep(0, length(factors)) + period_contribs_mult_log <- rep(0, length(factors)) + l_total <- .log_mean(y_final_total, y0_total) + for (g in seq_len(nrow(groups))) { + if (!is.null(group_vars)) { + group_vals <- as.list(groups[g, , drop = FALSE]) + } else { + group_vals <- list() + } + d0g <- .filter_by_group(d0, group_index_str, group_vals) + d_final_g <- .filter_by_group(d_final, group_index_str, group_vals) + y0 <- sum(d0g[[target_var]], na.rm = TRUE) + y_final <- sum(d_final_g[[target_var]], 
na.rm = TRUE) + if (y0 == 0 || y_final == 0) { + next + } + f0 <- sapply( + factors, + function(f) as.numeric(.eval_factor(d0, f, group_vals)) + ) + f_final <- sapply( + factors, + function(f) as.numeric(.eval_factor(d_final, f, group_vals)) + ) + if (any(is.na(f0)) || any(is.na(f_final))) { + next + } + l_val <- .log_mean(y_final, y0) + for (j in seq_along(factors)) { + if (f0[j] > 0 && f_final[j] > 0) { + period_contribs_add[j] <- period_contribs_add[j] + + l_val * log(f_final[j] / f0[j]) + } + } + if (l_total != 0) { + weight <- l_val / l_total + for (j in seq_along(factors)) { + if (f0[j] > 0 && f_final[j] > 0) { + period_contribs_mult_log[j] <- period_contribs_mult_log[j] + + weight * log(f_final[j] / f0[j]) + } + } + } + } + period_contribs_mult <- if (l_total == 0) { + rep(1, length(factors)) + } else { + exp(period_contribs_mult_log) + } + period_contribs_mult_log_vals <- ifelse( + period_contribs_mult > 0, + log(period_contribs_mult), + NA_real_ + ) + additive_gap <- total_change - sum(period_contribs_add) + target_ratio <- ifelse( + y0_total > 0, + y_final_total / y0_total, + NA_real_ + ) + mult_product <- prod(period_contribs_mult, na.rm = TRUE) + multiplicative_gap <- ifelse( + is.na(target_ratio) || mult_product == 0, + NA_real_, + target_ratio / mult_product + ) + period_id <- paste(t0, t_final, sep = "-") + factor_rows <- data.frame( + period = rep(period_id, length(factors)), + period_years = period_years, + factor_label = factor_labels, + factor_formula = factors, + component_type = "factor", + identity = identity, + identity_var = target_var, + target_initial = NA_real_, + target_final = NA_real_, + total_change = NA_real_, + percentage_change = NA_real_, + additive = period_contribs_add, + multiplicative = period_contribs_mult, + multiplicative_log = period_contribs_mult_log_vals, + closure_gap_additive = NA_real_, + closure_gap_ratio = NA_real_, + stringsAsFactors = FALSE + ) + target_row <- data.frame( + period = period_id, + period_years = period_years, + factor_label = target_label_final, + factor_formula = target_var, + component_type = "target", + identity = identity, + identity_var = target_var, + target_initial = y0_total, + target_final = y_final_total, + total_change = total_change, + percentage_change = if (y0_total != 0) { + (total_change / y0_total) * 100 + } else { + NA_real_ + }, + additive = total_change, + multiplicative = target_ratio, + multiplicative_log = ifelse( + !is.na(target_ratio) && target_ratio > 0, + log(target_ratio), + NA_real_ + ), + closure_gap_additive = additive_gap, + closure_gap_ratio = multiplicative_gap, + stringsAsFactors = FALSE + ) + if (length(analysis_cols) > 0) { + for (nm in analysis_cols) { + factor_rows[[nm]] <- analysis_values[[nm]] + target_row[[nm]] <- analysis_values[[nm]] + } + } + result <- dplyr::bind_rows(factor_rows, target_row) + period_target_df <- data.frame( + period = period_id, + t0 = t0, + tT = t_final, + target_initial = y0_total, + target_final = y_final_total, + total_change = total_change, + stringsAsFactors = FALSE, + check.names = FALSE + ) + if (length(analysis_cols) > 0) { + for (nm in analysis_cols) { + period_target_df[[nm]] <- analysis_values[[nm]] + } + } + list(result = result, target = period_target_df) +} + +.lmdi_closure_check <- function( + analysis_result, + analysis_cols, + tolerance_add, + tolerance_mult +) { + closure_group_cols <- c("period", analysis_cols) + closure_summary <- analysis_result |> + dplyr::group_by(dplyr::across(dplyr::all_of(closure_group_cols))) |> + dplyr::summarise( 
+ target_add = dplyr::first( + .data$additive[.data$component_type == "target"] + ), + sum_factors_add = sum( + .data$additive[.data$component_type == "factor"], + na.rm = TRUE + ), + target_mult = dplyr::first( + .data$multiplicative[.data$component_type == "target"] + ), + prod_factors_mult = { + vals <- .data$multiplicative[.data$component_type == "factor"] + if (length(vals) == 0) 1 else prod(vals, na.rm = TRUE) + }, + .groups = "drop" + ) + for (row_idx in seq_len(nrow(closure_summary))) { + add_diff <- closure_summary$target_add[row_idx] - + closure_summary$sum_factors_add[row_idx] + if (!is.na(add_diff) && abs(add_diff) > tolerance_add) { + context <- closure_summary[row_idx, closure_group_cols, drop = TRUE] + warning(sprintf( + "[%s] Additive contributions differ from target change by %.4g", + paste(context, collapse = ", "), + add_diff + )) + } + target_mult_val <- closure_summary$target_mult[row_idx] + prod_mult_val <- closure_summary$prod_factors_mult[row_idx] + if ( + !is.na(target_mult_val) && !is.na(prod_mult_val) && prod_mult_val != 0 + ) { + mult_diff <- abs(target_mult_val / prod_mult_val - 1) + if (mult_diff > tolerance_mult) { + context <- closure_summary[row_idx, closure_group_cols, drop = TRUE] + warning(sprintf( + "[%s] Multiplicative contributions differ from target ratio by %.4g", + paste(context, collapse = ", "), + mult_diff + )) + } + } + } +} + + +.log_mean <- function(a, b) { + ifelse(a == b, a, ifelse(a > 0 & b > 0, (a - b) / log(a / b), 0)) +} + +.parse_identity <- function(identity_expr) { + identity_expr <- gsub(" ", "", identity_expr) + parts <- strsplit(identity_expr, ":")[[1]] + if (length(parts) != 2) { + cli::cli_abort( + "identity must follow the pattern 'target:factor1*factor2*...'" + ) + } + target <- parts[1] + rhs <- parts[2] + factors <- strsplit(rhs, "\\*")[[1]] + factors <- gsub("^\\((.*)\\)$", "\1", factors) + list(target = target, factors = factors) +} + +.extract_selectors <- function(identity_expr) { + matches <- stringr::str_extract_all( + identity_expr, + "\\[([a-zA-Z0-9_+]+)\\]" + )[[1]] + if (length(matches) == 0) { + return(character(0)) + } + selectors <- gsub("\\[|\\]", "", matches) + unique(unlist(strsplit(selectors, "\\+"))) +} + +.filter_by_group <- function(df, index_str, group_vals) { + if (is.null(index_str) || index_str == "" || is.na(index_str)) { + return(df) + } + indices <- strsplit(index_str, "\\+")[[1]] + out <- df + for (idx in indices) { + val <- group_vals[[idx]] + if (!is.null(val) && !is.na(val)) { + out <- out |> dplyr::filter(.data[[idx]] == val) + } + } + out +} + +.eval_factor <- function(df_period, factor, group_vals) { + pat1 <- "^([a-zA-Z0-9_]+)\\[([a-zA-Z0-9_+]+)\\]$" + pat2 <- "^([a-zA-Z0-9_]+)\\/([a-zA-Z0-9_]+)$" + pat3 <- "^([a-zA-Z0-9_]+)\\[([a-zA-Z0-9_+]+)\\]\\/([a-zA-Z0-9_]+)$" + pat4 <- "^([a-zA-Z0-9_]+)\\/([a-zA-Z0-9_]+)\\[([a-zA-Z0-9_+]+)\\]$" + pat5 <- paste0( + "^([a-zA-Z0-9_]+)\\[([a-zA-Z0-9_+]+)\\]\\/", + "([a-zA-Z0-9_]+)\\[([a-zA-Z0-9_+]+)\\]$" + ) + + if (stringr::str_detect(factor, pat5)) { + m <- stringr::str_match(factor, pat5) + var1 <- m[2] + group1 <- m[3] + var2 <- m[4] + group2 <- m[5] + num_df <- .filter_by_group(df_period, group1, group_vals) + den_df <- .filter_by_group(df_period, group2, group_vals) + num <- sum(num_df[[var1]], na.rm = TRUE) + den <- sum(den_df[[var2]], na.rm = TRUE) + ifelse(den == 0, NA_real_, num / den) + } else if (stringr::str_detect(factor, pat3)) { + m <- stringr::str_match(factor, pat3) + var1 <- m[2] + group1 <- m[3] + var2 <- m[4] + num_df <- 
.filter_by_group(df_period, group1, group_vals) + num <- sum(num_df[[var1]], na.rm = TRUE) + den <- sum(df_period[[var2]], na.rm = TRUE) + ifelse(den == 0, NA_real_, num / den) + } else if (stringr::str_detect(factor, pat4)) { + m <- stringr::str_match(factor, pat4) + var1 <- m[2] + var2 <- m[3] + group2 <- m[4] + num <- sum(df_period[[var1]], na.rm = TRUE) + den_df <- .filter_by_group(df_period, group2, group_vals) + den <- sum(den_df[[var2]], na.rm = TRUE) + ifelse(den == 0, NA_real_, num / den) + } else if (stringr::str_detect(factor, pat2)) { + m <- stringr::str_match(factor, pat2) + var1 <- m[2] + var2 <- m[3] + num <- sum(df_period[[var1]], na.rm = TRUE) + den <- sum(df_period[[var2]], na.rm = TRUE) + ifelse(den == 0, NA_real_, num / den) + } else if (stringr::str_detect(factor, pat1)) { + m <- stringr::str_match(factor, pat1) + var1 <- m[2] + group1 <- m[3] + sub_df <- .filter_by_group(df_period, group1, group_vals) + sum(sub_df[[var1]], na.rm = TRUE) + } else { + sum(df_period[[factor]], na.rm = TRUE) + } +} diff --git a/R/gapfilling.R b/R/gapfilling.R index c59bca0..7eacea6 100644 --- a/R/gapfilling.R +++ b/R/gapfilling.R @@ -1,6 +1,26 @@ # Simple functions to fill gaps (NA values) in a time-dependent variable, # creating complete time series. +# Global variables to avoid R CMD check NOTE +utils::globalVariables(c( + ".", + ".N", + ".fill_scope", + ".smooth_var", + "ma_base", + "lag_source", + "lag_year", + "lag_weight", + "growth_weight", + "individual_growth", + "ind_growth", + "lag_src", + "lag_yr", + "w", + "g_val", + "o_val" +)) + #' Fill gaps by linear interpolation, or carrying forward or backward. #' #' @description @@ -9,21 +29,25 @@ #' the last or initial values, respectively. It also creates a new variable #' indicating the source of the filled values. #' -#' @param df A tibble data frame containing one observation per row. -#' @param var The variable of df containing gaps to be filled. -#' @param time_index The time index variable (usually year). +#' @param data A data frame containing one observation per row. +#' @param value_col The column containing gaps to be filled. +#' @param time_col Character. Name of the time column. Default: "year". #' @param interpolate Logical. If `TRUE` (default), #' performs linear interpolation. #' @param fill_forward Logical. If `TRUE` (default), #' carries last value forward. #' @param fill_backward Logical. If `TRUE` (default), #' carries first value backward. +#' @param value_smooth_window An integer specifying the window size for a +#' centered moving average applied to the variable before gap-filling. Useful +#' for variables with high inter-annual variability. If `NULL` (default), no +#' smoothing is applied. #' @param .by A character vector with the grouping variables (optional). #' -#' @return A tibble data frame (ungrouped) where gaps in var have been filled, -#' and a new "source" variable has been created indicating if the value is -#' original or, in case it has been estimated, the gapfilling method that has -#' been used. +#' @return A tibble data frame (ungrouped) where gaps in value_col have been +#' filled, and a new "source" variable has been created indicating if the +#' value is original or, in case it has been estimated, the gapfilling method +#' that has been used. 
#' #' @export #' @@ -36,111 +60,85 @@ #' ), #' value = c(NA, 3, NA, NA, 0, NA, 1, NA, NA, NA, 5, NA), #' ) -#' linear_fill(sample_tibble, value, year, .by = c("category")) -#' linear_fill( +#' fill_linear(sample_tibble, value, .by = c("category")) +#' fill_linear( #' sample_tibble, #' value, -#' year, #' interpolate = FALSE, #' .by = c("category"), #' ) -linear_fill <- function( - df, - var, - time_index, +fill_linear <- function( + data, + value_col, + time_col = "year", interpolate = TRUE, + fill_forward = TRUE, fill_backward = TRUE, + value_smooth_window = NULL, .by = NULL ) { - df |> + # Apply smoothing before main mutate to avoid tidy eval issues with if + use_smoothing <- !is.null(value_smooth_window) + + # Create smoothed variable in separate step + if (use_smoothing) { + data <- data |> + dplyr::mutate( + .smooth_var = zoo::rollmean( + {{ value_col }}, + k = value_smooth_window, + fill = NA, + align = "center" + ), + .by = dplyr::all_of(.by) + ) + } else { + data <- data |> + dplyr::mutate( + .smooth_var = {{ value_col }}, + .by = dplyr::all_of(.by) + ) + } + + data |> dplyr::mutate( - # relative to first/last non-NA + # relative to first/last non-NA (use smoothed values for position) place = dplyr::case_when( - !cummax(!is.na({{ var }})) ~ "left", - rev(!cummax(rev(!is.na({{ var }})))) ~ "right", + !cummax(!is.na(.smooth_var)) ~ "left", + rev(!cummax(rev(!is.na(.smooth_var)))) ~ "right", .default = "middle" ), fill_value = dplyr::case_when( place == "left" & fill_backward ~ - zoo::na.locf0({{ var }}, fromLast = TRUE), + zoo::na.locf0(.smooth_var, fromLast = TRUE), place == "right" & fill_forward ~ - zoo::na.locf0({{ var }}, fromLast = FALSE), + zoo::na.locf0(.smooth_var, fromLast = FALSE), place == "middle" & interpolate ~ zoo::na.approx( - {{ var }}, - x = {{ time_index }}, + .smooth_var, + x = .data[[time_col]], na.rm = FALSE ), .default = NA_real_ ), - fill_value = dplyr::coalesce({{ var }}, fill_value), - "source_{{var}}" := dplyr::case_when( - !is.na({{ var }}) ~ "Original", + # Use original values where available, fill otherwise + fill_value = dplyr::coalesce({{ value_col }}, fill_value), + "source_{{value_col}}" := dplyr::case_when( + !is.na({{ value_col }}) ~ "Original", place == "left" & !is.na(fill_value) ~ "First value carried backwards", place == "right" & !is.na(fill_value) ~ "Last value carried forward", place == "middle" & !is.na(fill_value) ~ "Linear interpolation", TRUE ~ "Gap not filled" ), - "{{var}}" := fill_value, + "{{value_col}}" := fill_value, place = NULL, fill_value = NULL, + .smooth_var = NULL, .by = dplyr::all_of(.by) ) } -#' Fill gaps using a proxy variable -#' -#' @description -#' Fills gaps in a variable based on changes in a proxy variable, using ratios -#' between the filled variable and the proxy variable, and labels output -#' accordingly. -#' -#' @param df A tibble data frame containing one observation per row. -#' @param var The variable of df containing gaps to be filled. -#' @param proxy_var The variable to be used as proxy. -#' @param time_index The time index variable (usually year). -#' @param ... Optionally, additional arguments that will be passed to -#' `linear_fill()` with the ratios. See that function to know the accepted -#' arguments. -#' -#' @return A tibble dataframe (ungrouped) where gaps in var have been filled, -#' a new proxy_ratio variable has been created, -#' and a new "source" variable has been created indicating if the value is -#' original or, in case it has been estimated, the gapfilling method that has -#' been used. 
-#' -#' @export -#' -#' @examples -#' sample_tibble <- tibble::tibble( -#' category = c("a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "b"), -#' year = c( -#' "2015", "2016", "2017", "2018", "2019", "2020", -#' "2015", "2016", "2017", "2018", "2019", "2020" -#' ), -#' value = c(NA, 3, NA, NA, 0, NA, 1, NA, NA, NA, 5, NA), -#' proxy_variable = c(1, 2, 2, 2, 2, 2, 1, 2, 3, 4, 5, 6) -#' ) -#' proxy_fill(sample_tibble, value, proxy_variable, year, .by = c("category")) -proxy_fill <- function(df, var, proxy_var, time_index, ...) { - df |> - dplyr::mutate(proxy_ratio = {{ var }} / {{ proxy_var }}) |> - linear_fill(proxy_ratio, {{ time_index }}, ...) |> - dplyr::mutate( - "source_{{var}}" := dplyr::case_when( - !is.na({{ var }}) ~ "Original", - source_proxy_ratio == "Linear interpolation" ~ "Proxy interpolated", - source_proxy_ratio == "Last value carried forward" ~ - "Proxy carried forward", - source_proxy_ratio == "First value carried backwards" ~ - "Proxy carried backwards", - .default = NA_character_ - ), - "{{var}}" := dplyr::coalesce({{ var }}, proxy_ratio * {{ proxy_var }}) - ) -} - #' Fill gaps summing the previous value of a variable to the value of #' another variable. #' @@ -151,18 +149,19 @@ proxy_fill <- function(df, var, proxy_var, time_index, ...) { #' series, it can either remain unfilled or assume an invisible 0 value before #' the first observation and start filling with cumulative sum. #' -#' @param df A tibble data frame containing one observation per row. -#' @param var The variable of df containing gaps to be filled. -#' @param change_var The variable whose values will be used to fill the gaps. +#' @param data A data frame containing one observation per row. +#' @param value_col The column containing gaps to be filled. +#' @param change_col The column whose values will be used to fill the gaps. +#' @param time_col Character. Name of the time column. Default: "year". #' @param start_with_zero Logical. If TRUE, assumes an invisible 0 value before #' the first observation and fills with cumulative sum starting from the first -#' change_var value. If FALSE (default), starting NA values remain unfilled. +#' change_col value. If FALSE (default), starting NA values remain unfilled. #' @param .by A character vector with the grouping variables (optional). #' -#' @return A tibble dataframe (ungrouped) where gaps in var have been filled, -#' and a new "source" variable has been created indicating if the value is -#' original or, in case it has been estimated, the gapfilling method that has -#' been used. +#' @return A tibble dataframe (ungrouped) where gaps in value_col have been +#' filled, and a new "source" variable has been created indicating if the +#' value is original or, in case it has been estimated, the gapfilling method +#' that has been used. #' #' @export #' @@ -178,39 +177,1098 @@ proxy_fill <- function(df, var, proxy_var, time_index, ...) 
{ #' value = c(NA, 3, NA, NA, 0, NA, 1, NA, NA, NA, 5, NA), #' change_variable = c(1, 2, 3, 4, 1, 1, 0, 0, 0, 0, 0, 1) #' ) -#' sum_fill( +#' fill_sum( #' sample_tibble, #' value, #' change_variable, #' start_with_zero = FALSE, #' .by = c("category") #' ) -#' sum_fill( +#' fill_sum( #' sample_tibble, #' value, #' change_variable, #' start_with_zero = TRUE, #' .by = c("category") #' ) -sum_fill <- function( - df, - var, - change_var, +fill_sum <- function( + data, + value_col, + change_col, + time_col = "year", start_with_zero = TRUE, .by = NULL ) { - df |> + data |> + dplyr::arrange(dplyr::across(dplyr::all_of(c(.by, time_col)))) |> dplyr::mutate( - groups = cumsum(!is.na({{ var }})), - prefilled = dplyr::coalesce({{ var }}, {{ change_var }}), - source_value = ifelse(is.na({{ var }}), "Filled with sum", "Original"), - "{{ var }}" := ave(prefilled, groups, FUN = cumsum), - "{{ var }}" := if (start_with_zero) {{ var }} else { - ifelse(groups == 0, NA, {{ var }}) + groups = cumsum(!is.na({{ value_col }})), + prefilled = dplyr::coalesce({{ value_col }}, {{ change_col }}), + source_value = ifelse( + is.na({{ value_col }}), + "Filled with sum", + "Original" + ), + "{{ value_col }}" := ave(prefilled, groups, FUN = cumsum), + "{{ value_col }}" := if (start_with_zero) {{ value_col }} else { + ifelse(groups == 0, NA, {{ value_col }}) }, - source_value = ifelse(is.na({{ var }}), NA_character_, source_value), + source_value = ifelse( + is.na({{ value_col }}), + NA_character_, + source_value + ), groups = NULL, prefilled = NULL, .by = dplyr::all_of(.by) ) } + +#' Fill gaps using growth rates from proxy variables +#' +#' @description +#' Fills missing values using growth rates from a proxy variable (reference +#' series). Supports regional aggregations, weighting, and linear +#' interpolation for small gaps. +#' +#' @param data A data frame containing time series data. +#' @param value_col Character. Name of the column with values to fill. +#' @param proxy_col Character or vector. Proxy variable(s) for calculating +#' growth rates. Supports multiple syntax formats: +#' - **Simple numeric proxy** (e.g., `"population"`): Auto-detects numeric +#' columns and uses them as proxy variable. Inherits the `group_by` +#' parameter to compute proxy values per group. +#' - **Simple categorical proxy** (e.g., `"region"`): Auto-detects +#' categorical columns and interprets as `value_col:region`. Aggregates +#' `value_col` by the specified groups. +#' - **Advanced syntax** (e.g., `"gdp:region"`): Format is +#' `"variable:group1+group2"`. Aggregates variable by specified groups. +#' - **Hierarchical fallback** (e.g., `c("population", "gdp:region")`): +#' Tries first proxy, falls back to second if first fails. +#' - **Weighted aggregation** (e.g., `"gdp[population]"`): Weight variable +#' by specified column during aggregation. +#' @param time_col Character. Name of the time column. Default: "year". +#' @param group_by Character vector of grouping column names. Default: NULL. +#' @param max_gap Numeric. Maximum gap size to fill using growth method. +#' Default: Inf. +#' @param max_gap_linear Numeric. Maximum gap size for linear interpolation +#' fallback. Default: 3. +#' @param fill_scope Quosure. Filter expression to limit filling scope. +#' Default: NULL. +#' @param smooth_window Integer. Window size for moving average smoothing of +#' proxy reference values before computing growth rates. Default: NULL. +#' @param output_format Character. Output format: "clean" or "detailed". +#' Default: "clean". 
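The `proxy_col` formats listed above are easiest to grasp from a call sketch. The following is illustrative only; it assumes a tibble with `country`, `year`, `emissions`, `gdp`, `population`, and `region` columns, which are not part of the package data:

```
# Hierarchical fallback: try per-group gdp growth first, then fall back to
# regional aggregates of the value column itself.
filled <- fill_growth(
  data,
  value_col = "emissions",
  proxy_col = c("gdp", "emissions:region"),
  group_by = "country"
)

# Weighted aggregation: aggregate gdp growth using population as weights.
filled_weighted <- fill_growth(
  data,
  value_col = "emissions",
  proxy_col = "gdp[population]",
  group_by = "country"
)
```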
+#' @param verbose Logical. Print progress messages. Default: TRUE. +#' +#' @return +#' A data frame with filled values. If output_format = "clean", returns +#' original columns with updated value_col and added method_col. If +#' "detailed", includes all intermediate columns. +#' +#' @details +#' **Combined Growth Sequence (Hierarchical Interpolation):** +#' +#' When using multiple proxies with hierarchical fallback, the function +#' implements an intelligent combined growth sequence strategy: +#' +#' 1. Better proxies (earlier in hierarchy) are tried first for each gap. +#' 2. If a better proxy has partial coverage within a gap, those growth +#' rates are used for the covered positions. +#' 3. Fallback proxies fill only the remaining positions where better +#' proxies are not available. +#' 4. Values filled by better proxies are protected from being overwritten. +#' +#' @export +#' +#' @examples +#' # Fill GDP using population as proxy +#' data <- tibble::tibble( +#' country = rep("ESP", 4), +#' year = 2010:2013, +#' gdp = c(1000, NA, NA, 1200), +#' population = c(46, 46.5, 47, 47.5) +#' ) +#' +#' fill_growth( +#' data, +#' value_col = "gdp", +#' proxy_col = "population", +#' group_by = "country" +#' ) +#' +#' @importFrom data.table := +#' @seealso [fill_linear()], [fill_sum()] +fill_growth <- function( + data, + value_col, + proxy_col, + time_col = "year", + group_by = NULL, + max_gap = Inf, + max_gap_linear = 3, + fill_scope = NULL, + smooth_window = 1, + output_format = "clean", + verbose = TRUE +) { + # 1. Setup and Validation + setup <- .fg_setup( + data, + value_col, + time_col, + group_by, + fill_scope, + smooth_window, + max_gap_linear + ) + + # 2. Calculate Growth Rates + data_work <- .fg_compute_all_growth_rates( + setup$data_work, + proxy_col, + value_col, + group_by, + setup$inputs$smooth_window, + verbose + ) + + # 3. Apply Filling (Hierarchical) + data_work <- .fg_apply_hierarchical_filling( + data_work, + proxy_col, + value_col, + setup$cols, + max_gap, + setup$inputs$max_gap_linear, + time_col, + group_by, + verbose + ) + + # 4. Finalize + .fg_finalize_and_report( + data_work, + data, + value_col, + setup$cols, + setup$scope_mask, + output_format, + verbose + ) +} + +# --- 1. 
Setup & Validation Helpers --- + +.fg_setup <- function( + data, + value_col, + time_col, + group_by, + fill_scope, + smooth_window, + max_gap_linear +) { + inputs <- .fg_validate_inputs( + data, + value_col, + time_col, + group_by, + smooth_window, + max_gap_linear + ) + + scope_mask <- .fg_calculate_scope_mask(inputs$data, fill_scope) + prep <- .fg_prepare_working_columns(inputs$data, value_col) + + list( + inputs = inputs, + scope_mask = scope_mask, + data_work = prep$data_work, + cols = list( + method = prep$method_col, + raw_missing = prep$raw_missing_col, + raw_numeric = prep$raw_numeric_col, + raw = prep$raw_col, + orig_val = prep$original_value_numeric, + orig_method = prep$original_method + ) + ) +} + +.fg_validate_inputs <- function( + data, + value_col, + time_col, + group_by, + smooth_window, + max_gap_linear +) { + if (!rlang::has_name(data, time_col)) { + cli::cli_abort("Time column '{time_col}' not found in data") + } + if (!rlang::has_name(data, value_col)) { + cli::cli_abort("Value column '{value_col}' not found in data") + } + if (!rlang::is_scalar_integerish(smooth_window) || smooth_window < 1) { + cli::cli_abort("`smooth_window` must be a positive integer") + } + smooth_window <- as.integer(smooth_window) + + if (!is.null(group_by)) { + missing <- setdiff(group_by, names(data)) + if (length(missing) > 0) { + cli::cli_abort( + "Group columns not found: {paste(missing, collapse = ', ')}" + ) + } + } + max_gap_linear <- max(0L, as.integer(max_gap_linear)) + list( + data = data, + smooth_window = smooth_window, + max_gap_linear = max_gap_linear + ) +} + +.fg_calculate_scope_mask <- function(data, fill_scope) { + if (is.null(fill_scope)) { + return(rep(TRUE, nrow(data))) + } + fill_scope_expr <- rlang::enquo(fill_scope) + scope_mask <- data |> + dplyr::mutate(scope_mask = !!fill_scope_expr) |> + dplyr::pull() + if (!is.logical(scope_mask) || length(scope_mask) != nrow(data)) { + cli::cli_abort( + "`fill_scope` must evaluate to a logical vector with length equal to nrow(data)" + ) + } + scope_mask[is.na(scope_mask)] <- FALSE + scope_mask +} + +.fg_prepare_working_columns <- function(data, value_col) { + function_tag <- "proxy" + # Original raw capture + raw_original_col <- paste0(value_col, "_raw_original") + if (!raw_original_col %in% names(data)) { + data[[raw_original_col]] <- suppressWarnings(as.numeric(data[[value_col]])) + } + # Snapshot columns logic + snapshot_base <- paste0(value_col, "_raw_", function_tag) + snapshot_col <- snapshot_base + snapshot_index <- 1L + while (snapshot_col %in% names(data)) { + snapshot_index <- snapshot_index + 1L + snapshot_col <- paste0(snapshot_base, "_", snapshot_index) + } + data[[snapshot_col]] <- suppressWarnings(as.numeric(data[[value_col]])) + + # Define working column names + raw_col <- snapshot_col + raw_numeric_col <- paste0(raw_col, "_numeric") + raw_missing_col <- paste0(raw_col, "_missing") + method_col <- paste0("method_", value_col) + has_method_col <- method_col %in% names(data) + + original_value_numeric <- suppressWarnings(as.numeric(data[[value_col]])) + original_value_numeric[!is.finite(original_value_numeric)] <- NA_real_ + original_method <- if (has_method_col) { + as.character(data[[method_col]]) + } else { + ifelse(is.na(original_value_numeric), "missing", "original") + } + + # Add working columns + data_work <- data |> + dplyr::mutate( + !!raw_col := .data[[value_col]], + !!raw_numeric_col := suppressWarnings(as.numeric(.data[[value_col]])) + ) |> + dplyr::mutate( + !!raw_numeric_col := replace( + 
.data[[raw_numeric_col]], + !is.finite(.data[[raw_numeric_col]]), + NA_real_ + ), + !!raw_missing_col := is.na(.data[[raw_numeric_col]]), + !!method_col := if (has_method_col) { + .data[[method_col]] + } else { + ifelse(.data[[raw_missing_col]], "missing", "original") + }, + !!value_col := .data[[raw_numeric_col]] + ) + data_work[[method_col]][is.na(data_work[[value_col]])] <- "missing" + + list( + data_work = data_work, + method_col = method_col, + raw_missing_col = raw_missing_col, + raw_numeric_col = raw_numeric_col, + raw_col = raw_col, + original_value_numeric = original_value_numeric, + original_method = original_method + ) +} + +.fg_finalize_and_report <- function( + data_work, + original_data, + value_col, + cols, + scope_mask, + output_format, + verbose +) { + result <- .fg_finalize_output( + data_work, + value_col, + cols, + scope_mask, + output_format + ) + + if (verbose) { + n_orig <- sum(is.na(original_data[[value_col]])) + n_final <- sum(is.na(result[[value_col]])) + message( + "Total filled: ", + n_orig - n_final, + " out of ", + n_orig, + " missing values" + ) + } + result +} + +# --- 2. Growth Calculation Logic --- + +.fg_compute_all_growth_rates <- function( + data_work, + proxy_col, + value_col, + group_by, + smooth_window, + verbose +) { + for (i in seq_along(proxy_col)) { + spec <- .parse_proxy_spec( + proxy_col[i], + data_work, + value_col, + group_by, + verbose + ) + # Calculate for this specific proxy + data_work <- .fg_calc_single_spec( + data_work, + spec, + i, + smooth_window, + group_by, + verbose + ) + } + data_work +} + +.fg_calc_single_spec <- function( + data, + spec, + idx, + smooth_window, + group_by, + verbose +) { + growth_col <- paste0("growth_", idx, "_", spec$spec_name) + obs_col <- paste0("n_obs_", idx, "_", spec$spec_name) + + if (verbose) { + message("Calculating growth rates for: ", spec$spec_name) + } + + # 1. Prepare Data.Table with Lags + dt <- .fg_growth_prep_dt(data, spec, group_by) + if (nrow(dt) == 0) { + return(.fg_add_empty_cols(data, growth_col, obs_col)) + } + + # 2. Compute Individual Row Growth (with Smoothing) + dt <- .fg_growth_calc_individual(dt, spec, smooth_window) + + # 3. Aggregate to Groups + summary_dt <- .fg_growth_aggregate(dt, spec, growth_col, obs_col) + + # 4. 
Join back + join_keys <- c("year", spec$present_group_vars) + if (length(spec$present_group_vars) == 0) { + join_keys <- "year" + } + + dplyr::left_join(data, tibble::as_tibble(summary_dt), by = join_keys) +} + +.fg_growth_prep_dt <- function(data, spec, group_by) { + if (!spec$source_var %in% names(data)) { + return(data.table::data.table()) + } + + lag_vars <- unique(c(group_by, spec$present_group_vars)) + lag_vars <- lag_vars[lag_vars %in% names(data)] + + cols <- unique(c("year", lag_vars, spec$source_var, spec$weight_col)) + dt <- data.table::as.data.table(data[, cols, drop = FALSE]) + + data.table::setorderv(dt, unique(c(lag_vars, "year"))) + dt +} + +.fg_growth_calc_individual <- function(dt, spec, window) { + # Helper to manage lag vars + by_vars <- if (length(spec$present_group_vars) > 0) { + spec$present_group_vars + } else { + NULL + } + + # Apply Smoothing (Moving Average) + if (window > 1) { + dt <- .compute_ma_base(dt, spec$source_var, window, by_vars) + val_col <- "ma_base" + } else { + val_col <- spec$source_var + } + + # Create Lags + if (!is.null(by_vars)) { + dt[, + `:=`( + lag_src = data.table::shift(get(val_col)), + lag_yr = data.table::shift(year) + ), + by = by_vars + ] + } else { + dt[, `:=`( + lag_src = data.table::shift(get(val_col)), + lag_yr = data.table::shift(year) + )] + } + + # Calculate Growth + dt[, + ind_growth := data.table::fifelse( + year == lag_yr + 1 & + !is.na(get(val_col)) & + !is.na(lag_src) & + lag_src > 0, + (get(val_col) - lag_src) / lag_src, + NA_real_ + ) + ] + + dt[!is.na(ind_growth)] +} + +.fg_growth_aggregate <- function(dt, spec, growth_col, obs_col) { + by_vars <- c("year", spec$present_group_vars) + if (length(spec$present_group_vars) == 0) { + by_vars <- "year" + } + + # Setup weights + has_w <- !is.null(spec$weight_col) + if (has_w) { + # Shift weights to align with growth period (previous year weight) + grp <- if (length(spec$present_group_vars) > 0) { + spec$present_group_vars + } else { + NULL + } + if (!is.null(grp)) { + dt[, w := data.table::shift(get(spec$weight_col)), by = grp] + } else { + dt[, w := data.table::shift(get(spec$weight_col))] + } + } + + res <- dt[, + { + val <- mean(ind_growth) # Default unweighted + cnt <- .N + + if (has_w) { + valid_w <- !is.na(w) & is.finite(w) & w > 0 + if (any(valid_w)) { + val <- sum(ind_growth[valid_w] * w[valid_w]) / sum(w[valid_w]) + cnt <- sum(valid_w) + } + } + .(g_val = val, o_val = cnt) + }, + by = by_vars + ] + + data.table::setnames(res, c("g_val", "o_val"), c(growth_col, obs_col)) + res +} + +.fg_add_empty_cols <- function(data, g_col, o_col) { + data[[g_col]] <- NA_real_ + data[[o_col]] <- 0L + data +} + +# --- 3. 
Filling Logic --- + +.fg_apply_hierarchical_filling <- function( + data, + proxy_cols, + value_col, + cols, + max_gap, + max_gap_lin, + time_col, + group_by, + verbose +) { + if (verbose) { + message("Step 2: Applying hierarchical filling...") + } + + # Pre-calculate spec names for bridge fallback + specs <- lapply(proxy_cols, function(p) { + .parse_proxy_spec(p, data, value_col, group_by, FALSE)$spec_name + }) + + for (i in seq_along(proxy_cols)) { + if (sum(is.na(data[[value_col]])) == 0) { + break + } + + spec_name <- specs[[i]] + g_col <- paste0("growth_", i, "_", spec_name) + m_name <- paste0("growth_", spec_name) + + if (verbose) { + message(" Applying proxy level ", i, ": ", proxy_cols[i]) + } + + # Define filling function wrapper + fill_fun <- function(df) { + .fg_fill_sequence( + df, + g_col, + m_name, + value_col, + cols$method, + cols$raw_missing, + max_gap, + max_gap_lin, + time_col, + i, + specs + ) + } + + if (!is.null(group_by)) { + data <- data |> + dplyr::group_by(dplyr::across(dplyr::all_of(group_by))) |> + dplyr::group_modify(~ fill_fun(.x)) |> + dplyr::ungroup() + } else { + data <- fill_fun(data) + } + } + data +} + +.fg_fill_sequence <- function( + df, + growth_col, + method_name, + val_col, + met_col, + miss_col, + max_gap, + max_lin, + time_col, + level, + all_specs +) { + # Ensure Order + df <- df[order(df[[time_col]]), , drop = FALSE] + + # 1. Identify Valid NA Runs (gaps that are allowed to be filled) + na_runs <- .fg_identify_na_runs(df[[miss_col]]) + + # 2. Forward Fill + df <- .fg_fill_direction( + df, + "forward", + val_col, + growth_col, + met_col, + method_name, + na_runs, + max_gap, + max_lin + ) + + # 3. Backward Fill + df <- .fg_fill_direction( + df, + "backward", + val_col, + growth_col, + met_col, + paste0(method_name, "_back"), + na_runs, + max_gap, + max_lin + ) + + # 4. 
Bridge Fill (Linear + Geometric) + df <- .fg_fill_bridge( + df, + val_col, + growth_col, + met_col, + miss_col, + time_col, + na_runs, + max_gap, + max_lin, + level, + all_specs, + method_name + ) + + df +} + +.fg_identify_na_runs <- function(missing_vec) { + if (is.null(missing_vec)) { + return(list()) + } + + # Convert to boolean, treating existing NAs as missing + is_miss <- missing_vec + is_miss[is.na(is_miss)] <- TRUE + + rle_res <- rle(is_miss) + ends <- cumsum(rle_res$lengths) + starts <- c(1, ends[-length(ends)] + 1) + + runs <- list() + for (i in seq_along(rle_res$values)) { + if (rle_res$values[i]) { + # Check if "internal" (surrounded by data, not at ends of series) + is_internal <- (starts[i] > 1) && + (ends[i] < length(missing_vec)) && + (!isTRUE(missing_vec[starts[i] - 1])) && + (!isTRUE(missing_vec[ends[i] + 1])) + + runs[[length(runs) + 1]] <- list( + start = starts[i], + end = ends[i], + len = rle_res$lengths[i], + internal = is_internal + ) + } + } + runs +} + +.fg_fill_direction <- function( + df, + direction, + v_col, + g_col, + m_col, + m_name, + runs, + max_gap, + max_lin +) { + vals <- df[[v_col]] + grw <- df[[g_col]] + mets <- df[[m_col]] + indices <- if (direction == "forward") { + seq_along(vals) + } else { + rev(seq_along(vals)) + } + + for (i in indices) { + if (!is.na(vals[i])) { + next + } + + # Check Constraints + is_valid_gap <- TRUE + for (r in runs) { + if (i >= r$start && i <= r$end) { + if (!is.null(max_gap) && r$len > max_gap) { + is_valid_gap <- FALSE + } + if ( + !is.null(max_lin) && + r$internal && + r$len > max_lin + ) { + is_valid_gap <- FALSE + } + } + } + if (!is_valid_gap) { + next + } + + # Calc Value + if (direction == "forward") { + if (i == 1) { + next + } + prev <- vals[i - 1] + if (!is.na(prev) && prev > 0 && !is.na(grw[i])) { + vals[i] <- prev * (1 + grw[i]) + if (mets[i] == "missing") mets[i] <- m_name + } + } else { + if (i == length(vals)) { + next + } + nxt <- vals[i + 1] + nxt_g <- grw[i + 1] + if (!is.na(nxt) && !is.na(nxt_g) && (1 + nxt_g) != 0) { + res <- nxt / (1 + nxt_g) + if (is.finite(res) && res > 0) { + vals[i] <- res + if (mets[i] == "missing") mets[i] <- m_name + } + } + } + } + df[[v_col]] <- vals + df[[m_col]] <- mets + df +} + +.fg_fill_bridge <- function( + df, + v_col, + g_col, + m_col, + miss_col, + t_col, + runs, + max_gap, + max_lin, + level, + specs, + base_method +) { + # Only process internal gaps (bridges) + bridges <- Filter(function(x) x$internal, runs) + + for (run in bridges) { + if (!is.null(max_gap) && run$len > max_gap) { + next + } + + # 1. Linear Interpolation (Small Gaps) + if (run$len <= max_lin) { + df <- .fg_bridge_linear(df, run, v_col, m_col, t_col) + next + } + + # 2. 
Geometric Bridge (Large Gaps with Lambda) + df <- .fg_bridge_geometric( + df, + run, + v_col, + g_col, + m_col, + level, + specs, + base_method + ) + } + df +} + +.fg_bridge_linear <- function(df, run, v_col, m_col, t_col) { + # Indices + idx_start <- run$start - 1 + idx_end <- run$end + 1 + idx_gap <- run$start:run$end + + v0 <- df[[v_col]][idx_start] + v1 <- df[[v_col]][idx_end] + + if (is.finite(v0) && is.finite(v1) && v0 > 0 && v1 > 0) { + approx_vals <- stats::approx( + x = c(df[[t_col]][idx_start], df[[t_col]][idx_end]), + y = c(v0, v1), + xout = df[[t_col]][idx_gap] + )$y + + df[[v_col]][idx_gap] <- approx_vals + df[[m_col]][idx_gap] <- "linear_interp" + } + df +} + +.fg_bridge_geometric <- function( + df, + run, + v_col, + g_col, + m_col, + level, + specs, + base_method +) { + idx_start <- run$start - 1 + idx_end <- run$end + 1 + idx_seq <- (idx_start + 1):idx_end # Growth rates needed for these positions + + # Combine growth rates from hierarchy (Step 1-current) + combined <- .fg_combine_growth_hierarchy(df, idx_seq, level, specs) + rates <- combined$rates + sources <- combined$sources + + # Lambda Adjustment + v_start <- df[[v_col]][idx_start] + v_end <- df[[v_col]][idx_end] + + pred_end <- v_start * prod(1 + rates) + + if (is.finite(pred_end) && pred_end > 0 && is.finite(v_end) && v_end > 0) { + lambda <- (v_end / pred_end)^(1 / length(rates)) + + curr_val <- v_start + # Iterate and fill + for (k in seq_along(rates)) { + target_idx <- idx_start + k + curr_val <- curr_val * (1 + rates[k]) * lambda + + # Only overwrite if it's strictly inside the gap (not the end anchor) + if (target_idx < idx_end) { + df[[v_col]][target_idx] <- curr_val + src <- sources[k] + if (is.na(src)) { + src <- base_method + } + df[[m_col]][target_idx] <- paste0("growth_", src, "_bridge") + } + } + } + df +} + +.fg_combine_growth_hierarchy <- function(df, indices, level, specs) { + # Default to current level + rates <- df[[paste0("growth_", level, "_", specs[[level]])]][indices] + sources <- rep(specs[[level]], length(indices)) + + # Overwrite with better (lower index) proxies if available + if (level > 1) { + for (l in 1:(level - 1)) { + col <- paste0("growth_", l, "_", specs[[l]]) + if (col %in% names(df)) { + better_rates <- df[[col]][indices] + # Take if available + mask <- !is.na(better_rates) + rates[mask] <- better_rates[mask] + sources[mask] <- specs[[l]] + } + } + } + list(rates = rates, sources = sources) +} + +.fg_finalize_output <- function(data, value_col, cols, mask, format) { + # Restore non-scope values + if (!all(mask)) { + data[[value_col]][!mask] <- cols$orig_val[!mask] + data[[cols$method]][!mask] <- cols$orig_method[!mask] + } + + if (format == "clean") { + # Remove all temp columns (growth_, n_obs_, raw_) + ptn <- "^(growth_|n_obs_|.*_raw_).*" + data <- data |> dplyr::select(-dplyr::matches(ptn)) + } else { + # Detailed: keep debug cols but clean raw numeric + data <- data |> + dplyr::select(-dplyr::any_of(c(cols$raw_numeric, cols$raw_missing))) + } + data +} + +# --- Shared Helpers (Original) --- + +.compute_ma_base <- function(dt, value_var, window, group_vars = NULL) { + if (window <= 1) { + return(dt) + } + + compute_ma <- function(vals, w) { + ma <- rep(NA_real_, length(vals)) + for (i in seq_along(vals)) { + if (i > 1) { + start_idx <- max(1, i - w) + window_vals <- vals[start_idx:(i - 1)] + window_vals <- window_vals[!is.na(window_vals)] + if (length(window_vals) > 0) { + ma[i] <- mean( + window_vals[ + max(1, length(window_vals) - w + 1):length(window_vals) + ] + ) + } + } + } 
+ ma + } + + if (length(group_vars) > 0) { + dt[, ma_base := compute_ma(get(value_var), window), by = group_vars] + } else { + dt[, ma_base := compute_ma(get(value_var), window)] + } + dt +} + +.parse_proxy_spec <- function(spec, data, value_col, group_by, verbose) { + weight_col <- NULL + col_name <- spec + if (grepl("\\[", spec)) { + start_pos <- regexpr("\\[", spec)[1] + if (start_pos > 0) { + after_bracket <- substr(spec, start_pos + 1, nchar(spec)) + end_pos <- regexpr("\\]", after_bracket)[1] + if (end_pos > 0) { + weight_col <- trimws(substr(after_bracket, 1, end_pos - 1)) + } + } + col_name <- sub("\\[.*$", "", spec) + } + + col_name <- trimws(col_name) + + if (col_name %in% c("global", "all", "total")) { + if (verbose) { + message("Using global aggregation (no grouping)") + } + return(list( + source_var = value_col, + group_vars = NULL, + weight_col = weight_col, + spec_name = paste0("global", if (!is.null(weight_col)) "_w" else "") + )) + } + + if (col_name %in% names(data)) { + col_data <- data[[col_name]] + is_numeric <- is.numeric(col_data) || + all(!is.na(suppressWarnings(as.numeric(as.character(col_data))))) + + if (is_numeric) { + if (verbose) { + message( + "Auto-detected '", + col_name, + "' as numeric -> ", + "using as source variable" + ) + } + return(list( + source_var = col_name, + group_vars = group_by, + weight_col = weight_col, + spec_name = paste0( + col_name, + if (!is.null(weight_col)) "_w" else "" + ) + )) + } else { + if (verbose) { + message( + "Auto-detected '", + col_name, + "' as categorical -> ", + "using as grouping variable" + ) + } + return(list( + source_var = value_col, + group_vars = col_name, + weight_col = weight_col, + spec_name = paste0( + value_col, + "_", + col_name, + if (!is.null(weight_col)) "_w" else "" + ) + )) + } + } else { + warning("Column '", col_name, "' not found in data") + return(list( + source_var = value_col, + group_vars = col_name, + weight_col = weight_col, + spec_name = paste0( + value_col, + "_", + col_name, + if (!is.null(weight_col)) "_w" else "" + ) + )) + } + + parts <- strsplit(spec, ":")[[1]] + source_var <- parts[1] + rhs <- if (length(parts) > 1) parts[2] else "" + + if (is.na(rhs)) { + rhs <- "" + } + + weight_col <- NULL + if (nchar(rhs) > 0 && grepl("\\[", rhs)) { + start_pos <- regexpr("\\[", rhs)[1] + if (start_pos > 0) { + after_bracket <- substr(rhs, start_pos + 1, nchar(rhs)) + end_pos <- regexpr("\\]", after_bracket)[1] + if (end_pos > 0) { + weight_col <- trimws(substr(after_bracket, 1, end_pos - 1)) + } + } + rhs <- sub("\\[.*$", "", rhs) + } + + group_vars <- if (rhs == "") NULL else trimws(strsplit(rhs, "\\+")[[1]]) + + if (is.null(group_vars) || length(group_vars) == 0) { + spec_name <- paste0(source_var, "_global") + } else { + spec_name <- paste0(source_var, "_", paste(group_vars, collapse = "_")) + } + if (!is.null(weight_col)) { + spec_name <- paste0(spec_name, "_w") + } + + list( + source_var = source_var, + group_vars = group_vars, + weight_col = weight_col, + spec_name = spec_name + ) +} diff --git a/R/whep-package.R b/R/whep-package.R index 62800aa..a65cf64 100644 --- a/R/whep-package.R +++ b/R/whep-package.R @@ -2,6 +2,5 @@ "_PACKAGE" ## usethis namespace: start -#' @importFrom rlang := ## usethis namespace: end NULL diff --git a/_pkgdown.yml b/_pkgdown.yml index 3bd672a..23928fa 100644 --- a/_pkgdown.yml +++ b/_pkgdown.yml @@ -71,6 +71,12 @@ reference: desc: > Functions to fill gaps (NA values) in time-dependent variables using different methods. 
contents: - - linear_fill - - proxy_fill - - sum_fill + - fill_linear + - fill_sum + - fill_growth + +- title: Decomposition analysis + desc: > + Functions for index decomposition analysis. + contents: + - calculate_lmdi diff --git a/man/calculate_lmdi.Rd b/man/calculate_lmdi.Rd new file mode 100644 index 0000000..603cba2 --- /dev/null +++ b/man/calculate_lmdi.Rd @@ -0,0 +1,169 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/calculate_lmdi.R +\name{calculate_lmdi} +\alias{calculate_lmdi} +\title{Calculate LMDI decomposition} +\usage{ +calculate_lmdi( + data, + identity, + identity_labels = NULL, + time_var = "year", + periods = NULL, + periods_2 = NULL, + analysis_by = NULL, + rolling_mean = NULL, + output_format = "clean", + verbose = TRUE +) +} +\arguments{ +\item{data}{A data frame containing the variables for decomposition. Must +include all variables specified in the identity, time variable, and any +grouping variables.} + +\item{identity}{Character. Decomposition identity in format +\code{"target:factor1*factor2*..."}. The target appears before the colon, +factors after, separated by asterisks. Supports explicit ratios with +\code{/} and structural decomposition with \verb{[]}.} + +\item{identity_labels}{Named character vector. Custom labels for factors +to use in output instead of variable names. Default: NULL uses variable +names as-is.} + +\item{time_var}{Character. Name of the time variable column in the data. +Default: "year". Must be numeric or coercible to numeric.} + +\item{periods}{Numeric vector. Years defining analysis periods. Each +consecutive pair defines one period. Default: NULL uses all available +years.} + +\item{periods_2}{Numeric vector. Additional period specification for +complex multi-period analyses. Default: NULL.} + +\item{analysis_by}{Character vector. Grouping variables for performing +separate decompositions. Default: NULL (single decomposition for all +data).} + +\item{rolling_mean}{Numeric. Window size for rolling mean smoothing +applied before decomposition. Default: NULL (no smoothing).} + +\item{output_format}{Character. Format of output data frame. Options: +\code{"clean"} (default) or \code{"detailed"}.} + +\item{verbose}{Logical. If TRUE (default), prints progress messages during +decomposition.} +} +\value{ +A data frame with LMDI decomposition results containing: +\itemize{ +\item Time variables and grouping variables (if specified). +\item \code{additive}: Additive contributions (sum equals total change in target). +\item \code{multiplicative}: Multiplicative indices (product equals target ratio). +\item \code{multiplicative_log}: Log of multiplicative indices. +\item Period identifiers and metadata. +} +} +\description{ +Performs LMDI (Log Mean Divisia Index) decomposition analysis with +flexible identity parsing, automatic factor detection, and support for +multiple periods and groupings. Supports sectoral decomposition using +bracket notation for both summing and grouping operations. +} +\details{ +The LMDI method decomposes changes in a target variable into contributions +from multiple factors using logarithmic mean weights. This implementation +supports: + +\strong{Flexible identity specification:} +\itemize{ +\item Automatic factor detection from identity string. +\item Support for ratio calculations (implicit division). +\item Sectoral aggregation with \verb{[]} notation. +\item Sectoral grouping with \code{{}} notation. 
+} + +\strong{Period analysis:} +The function can decompose changes over single or multiple periods. +Periods are defined by consecutive pairs in the \code{periods} vector. + +\strong{Grouping capabilities:} +Use \code{analysis_by} to perform separate decompositions for different +groups (e.g., countries, regions) while maintaining consistent factor +structure. +} +\section{Identity Syntax}{ + +The identity parameter uses a special syntax to define decomposition: + +\strong{Basic format:} \code{"target:factor1*factor2*factor3"} + +\strong{Simple decomposition (no sectors):} +\itemize{ +\item Basic: \code{"emissions:gdp*(emissions/gdp)"} +\item Complete: \code{"emissions:(emissions/gdp)*(gdp/population)*population"} +} + +\strong{Understanding bracket notation:} + +Square brackets \verb{[]} specify variables to sum across categories, enabling +structural decomposition. The bracket aggregates values BEFORE calculating +ratios. + +\strong{Single-level structural decomposition:} +\itemize{ +\item \code{"emissions:activity*(activity[sector]/activity)*(emissions[sector]/activity[sector])"} +\item Creates 3 factors: Activity level, Sectoral structure, Sectoral +intensity. +} + +\strong{Multi-level structural decomposition:} +\itemize{ +\item Two levels: \code{"emissions:activity*(activity[sector]/activity)*(activity[sector+fuel]/activity[sector])*(emissions[sector+fuel]/activity[sector+fuel])"} +\item Creates 4 factors: Activity level, Sector structure, Fuel structure, +Sectoral-fuel intensity. +} +} + +\section{Data Requirements}{ + +The input data frame must contain: +\itemize{ +\item All variables mentioned in the identity. +\item The time variable (default: "year"). +\item Grouping variables if using \code{analysis_by}. +\item No missing values in key variables for decomposition periods. +} +} + +\examples{ +# Simple LMDI decomposition +data <- data.frame( + year = rep(2010:2015, 2), + country = rep(c("ESP", "FRA"), each = 6), + emissions = c(100, 105, 110, 115, 120, 125, 200, 210, 220, 230, 240, 250), + gdp = c( + 1000, 1050, 1100, 1150, 1200, 1250, + 2000, 2100, 2200, 2300, 2400, 2500 + ), + population = c(46, 46.5, 47, 47.5, 48, 48.5, 65, 65.5, 66, 66.5, 67, 67.5) +) + +# Example 1: Simple decomposition by country +lmdi_basic <- calculate_lmdi( + data, + identity = "emissions:gdp*(emissions/gdp)", + time_var = "year", + analysis_by = "country" +) + +# Example 2: Complete form with multiple factors +lmdi_complete <- calculate_lmdi( + data, + identity = "emissions:(emissions/gdp)*(gdp/population)*population", + time_var = "year", + analysis_by = "country" +) + +} +\concept{calculations} diff --git a/man/fill_growth.Rd b/man/fill_growth.Rd new file mode 100644 index 0000000..1de19be --- /dev/null +++ b/man/fill_growth.Rd @@ -0,0 +1,107 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/gapfilling.R +\name{fill_growth} +\alias{fill_growth} +\title{Fill gaps using growth rates from proxy variables} +\usage{ +fill_growth( + data, + value_col, + proxy_col, + time_col = "year", + group_by = NULL, + max_gap = Inf, + max_gap_linear = 3, + fill_scope = NULL, + smooth_window = 1, + output_format = "clean", + verbose = TRUE +) +} +\arguments{ +\item{data}{A data frame containing time series data.} + +\item{value_col}{Character. Name of the column with values to fill.} + +\item{proxy_col}{Character or vector. Proxy variable(s) for calculating +growth rates. 
Supports multiple syntax formats: +\itemize{ +\item \strong{Simple numeric proxy} (e.g., \code{"population"}): Auto-detects numeric +columns and uses them as proxy variable. Inherits the \code{group_by} +parameter to compute proxy values per group. +\item \strong{Simple categorical proxy} (e.g., \code{"region"}): Auto-detects +categorical columns and interprets as \code{value_col:region}. Aggregates +\code{value_col} by the specified groups. +\item \strong{Advanced syntax} (e.g., \code{"gdp:region"}): Format is +\code{"variable:group1+group2"}. Aggregates variable by specified groups. +\item \strong{Hierarchical fallback} (e.g., \code{c("population", "gdp:region")}): +Tries first proxy, falls back to second if first fails. +\item \strong{Weighted aggregation} (e.g., \code{"gdp[population]"}): Weight variable +by specified column during aggregation. +}} + +\item{time_col}{Character. Name of the time column. Default: "year".} + +\item{group_by}{Character vector of grouping column names. Default: NULL.} + +\item{max_gap}{Numeric. Maximum gap size to fill using growth method. +Default: Inf.} + +\item{max_gap_linear}{Numeric. Maximum gap size for linear interpolation +fallback. Default: 3.} + +\item{fill_scope}{Quosure. Filter expression to limit filling scope. +Default: NULL.} + +\item{smooth_window}{Integer. Window size for moving average smoothing of +proxy reference values before computing growth rates. Default: NULL.} + +\item{output_format}{Character. Output format: "clean" or "detailed". +Default: "clean".} + +\item{verbose}{Logical. Print progress messages. Default: TRUE.} +} +\value{ +A data frame with filled values. If output_format = "clean", returns +original columns with updated value_col and added method_col. If +"detailed", includes all intermediate columns. +} +\description{ +Fills missing values using growth rates from a proxy variable (reference +series). Supports regional aggregations, weighting, and linear +interpolation for small gaps. +} +\details{ +\strong{Combined Growth Sequence (Hierarchical Interpolation):} + +When using multiple proxies with hierarchical fallback, the function +implements an intelligent combined growth sequence strategy: +\enumerate{ +\item Better proxies (earlier in hierarchy) are tried first for each gap. +\item If a better proxy has partial coverage within a gap, those growth +rates are used for the covered positions. +\item Fallback proxies fill only the remaining positions where better +proxies are not available. +\item Values filled by better proxies are protected from being overwritten. 
+} +} +\examples{ +# Fill GDP using population as proxy +data <- tibble::tibble( + country = rep("ESP", 4), + year = 2010:2013, + gdp = c(1000, NA, NA, 1200), + population = c(46, 46.5, 47, 47.5) +) + +fill_growth( + data, + value_col = "gdp", + proxy_col = "population", + group_by = "country" +) + +} +\seealso{ +\code{\link[=fill_linear]{fill_linear()}}, \code{\link[=fill_sum]{fill_sum()}} +} diff --git a/man/linear_fill.Rd b/man/fill_linear.Rd similarity index 59% rename from man/linear_fill.Rd rename to man/fill_linear.Rd index 2dbde12..fb45d3a 100644 --- a/man/linear_fill.Rd +++ b/man/fill_linear.Rd @@ -1,25 +1,26 @@ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/gapfilling.R -\name{linear_fill} -\alias{linear_fill} +\name{fill_linear} +\alias{fill_linear} \title{Fill gaps by linear interpolation, or carrying forward or backward.} \usage{ -linear_fill( - df, - var, - time_index, +fill_linear( + data, + value_col, + time_col = "year", interpolate = TRUE, fill_forward = TRUE, fill_backward = TRUE, + value_smooth_window = NULL, .by = NULL ) } \arguments{ -\item{df}{A tibble data frame containing one observation per row.} +\item{data}{A data frame containing one observation per row.} -\item{var}{The variable of df containing gaps to be filled.} +\item{value_col}{The column containing gaps to be filled.} -\item{time_index}{The time index variable (usually year).} +\item{time_col}{Character. Name of the time column. Default: "year".} \item{interpolate}{Logical. If \code{TRUE} (default), performs linear interpolation.} @@ -30,13 +31,18 @@ carries last value forward.} \item{fill_backward}{Logical. If \code{TRUE} (default), carries first value backward.} +\item{value_smooth_window}{An integer specifying the window size for a +centered moving average applied to the variable before gap-filling. Useful +for variables with high inter-annual variability. If \code{NULL} (default), no +smoothing is applied.} + \item{.by}{A character vector with the grouping variables (optional).} } \value{ -A tibble data frame (ungrouped) where gaps in var have been filled, -and a new "source" variable has been created indicating if the value is -original or, in case it has been estimated, the gapfilling method that has -been used. +A tibble data frame (ungrouped) where gaps in value_col have been +filled, and a new "source" variable has been created indicating if the +value is original or, in case it has been estimated, the gapfilling method +that has been used. 
} \description{ Fills gaps (\code{NA} values) in a time-dependent variable by @@ -53,11 +59,10 @@ sample_tibble <- tibble::tibble( ), value = c(NA, 3, NA, NA, 0, NA, 1, NA, NA, NA, 5, NA), ) -linear_fill(sample_tibble, value, year, .by = c("category")) -linear_fill( +fill_linear(sample_tibble, value, .by = c("category")) +fill_linear( sample_tibble, value, - year, interpolate = FALSE, .by = c("category"), ) diff --git a/man/sum_fill.Rd b/man/fill_sum.Rd similarity index 65% rename from man/sum_fill.Rd rename to man/fill_sum.Rd index ddbd780..4029a3d 100644 --- a/man/sum_fill.Rd +++ b/man/fill_sum.Rd @@ -1,30 +1,39 @@ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/gapfilling.R -\name{sum_fill} -\alias{sum_fill} +\name{fill_sum} +\alias{fill_sum} \title{Fill gaps summing the previous value of a variable to the value of another variable.} \usage{ -sum_fill(df, var, change_var, start_with_zero = TRUE, .by = NULL) +fill_sum( + data, + value_col, + change_col, + time_col = "year", + start_with_zero = TRUE, + .by = NULL +) } \arguments{ -\item{df}{A tibble data frame containing one observation per row.} +\item{data}{A data frame containing one observation per row.} + +\item{value_col}{The column containing gaps to be filled.} -\item{var}{The variable of df containing gaps to be filled.} +\item{change_col}{The column whose values will be used to fill the gaps.} -\item{change_var}{The variable whose values will be used to fill the gaps.} +\item{time_col}{Character. Name of the time column. Default: "year".} \item{start_with_zero}{Logical. If TRUE, assumes an invisible 0 value before the first observation and fills with cumulative sum starting from the first -change_var value. If FALSE (default), starting NA values remain unfilled.} +change_col value. If FALSE (default), starting NA values remain unfilled.} \item{.by}{A character vector with the grouping variables (optional).} } \value{ -A tibble dataframe (ungrouped) where gaps in var have been filled, -and a new "source" variable has been created indicating if the value is -original or, in case it has been estimated, the gapfilling method that has -been used. +A tibble dataframe (ungrouped) where gaps in value_col have been +filled, and a new "source" variable has been created indicating if the +value is original or, in case it has been estimated, the gapfilling method +that has been used. } \description{ Fills gaps in a variable with the sum of its previous value and the value @@ -43,14 +52,14 @@ sample_tibble <- tibble::tibble( value = c(NA, 3, NA, NA, 0, NA, 1, NA, NA, NA, 5, NA), change_variable = c(1, 2, 3, 4, 1, 1, 0, 0, 0, 0, 0, 1) ) -sum_fill( +fill_sum( sample_tibble, value, change_variable, start_with_zero = FALSE, .by = c("category") ) -sum_fill( +fill_sum( sample_tibble, value, change_variable, diff --git a/man/proxy_fill.Rd b/man/proxy_fill.Rd deleted file mode 100644 index 8f10d40..0000000 --- a/man/proxy_fill.Rd +++ /dev/null @@ -1,45 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/gapfilling.R -\name{proxy_fill} -\alias{proxy_fill} -\title{Fill gaps using a proxy variable} -\usage{ -proxy_fill(df, var, proxy_var, time_index, ...) 
-} -\arguments{ -\item{df}{A tibble data frame containing one observation per row.} - -\item{var}{The variable of df containing gaps to be filled.} - -\item{proxy_var}{The variable to be used as proxy.} - -\item{time_index}{The time index variable (usually year).} - -\item{...}{Optionally, additional arguments that will be passed to -\code{linear_fill()} with the ratios. See that function to know the accepted -arguments.} -} -\value{ -A tibble dataframe (ungrouped) where gaps in var have been filled, -a new proxy_ratio variable has been created, -and a new "source" variable has been created indicating if the value is -original or, in case it has been estimated, the gapfilling method that has -been used. -} -\description{ -Fills gaps in a variable based on changes in a proxy variable, using ratios -between the filled variable and the proxy variable, and labels output -accordingly. -} -\examples{ -sample_tibble <- tibble::tibble( - category = c("a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "b"), - year = c( - "2015", "2016", "2017", "2018", "2019", "2020", - "2015", "2016", "2017", "2018", "2019", "2020" - ), - value = c(NA, 3, NA, NA, 0, NA, 1, NA, NA, NA, 5, NA), - proxy_variable = c(1, 2, 2, 2, 2, 2, 1, 2, 3, 4, 5, 6) -) -proxy_fill(sample_tibble, value, proxy_variable, year, .by = c("category")) -} diff --git a/tests/testthat/test_calculate_lmdi.R b/tests/testthat/test_calculate_lmdi.R new file mode 100644 index 0000000..a34ca7e --- /dev/null +++ b/tests/testthat/test_calculate_lmdi.R @@ -0,0 +1,83 @@ +# Tests for calculate_lmdi() + +test_that("calculate_lmdi performs basic decomposition", { + data <- tibble::tribble( + ~year, ~emissions, ~gdp, ~population, + 2010, 100, 1000, 46, + 2011, 110, 1100, 47, + 2012, 120, 1200, 48 + ) + + result <- calculate_lmdi( + data, + identity = "emissions:gdp*population", + time_var = "year", + verbose = FALSE + ) + + expect_true(is.data.frame(result)) + expect_true("additive" %in% names(result)) +}) + +test_that("calculate_lmdi works with grouping", { + data <- tibble::tribble( + ~country, ~year, ~emissions, ~gdp, ~population, + "ESP", 2010, 100, 1000, 46, + "ESP", 2011, 110, 1100, 47, + "FRA", 2010, 200, 2000, 65, + "FRA", 2011, 220, 2200, 66 + ) + + result <- calculate_lmdi( + data, + identity = "emissions:gdp*population", + time_var = "year", + analysis_by = "country", + verbose = FALSE + ) + + expect_true("country" %in% names(result)) + expect_true("ESP" %in% result$country) + expect_true("FRA" %in% result$country) +}) + +test_that("calculate_lmdi returns expected columns", { + data <- tibble::tribble( + ~year, ~emissions, ~gdp, ~population, + 2010, 100, 1000, 46, + 2011, 110, 1100, 47, + 2012, 120, 1200, 48 + ) + + result <- calculate_lmdi( + data, + identity = "emissions:gdp*population", + time_var = "year", + verbose = FALSE + ) + + expect_true("component_type" %in% names(result)) + expect_true("additive" %in% names(result)) + expect_true("multiplicative" %in% names(result)) +}) + +test_that("calculate_lmdi handles multiple periods", { + data <- tibble::tribble( + ~year, ~emissions, ~gdp, ~population, + 2010, 100, 1000, 46, + 2011, 110, 1100, 47, + 2012, 120, 1200, 48, + 2013, 130, 1300, 49 + ) + + result <- calculate_lmdi( + data, + identity = "emissions:gdp*population", + time_var = "year", + periods = c(2010, 2012, 2013), + verbose = FALSE + ) + + expect_true(is.data.frame(result)) + expect_true(nrow(result) > 0) +}) diff --git a/tests/testthat/test_gapfilling.R b/tests/testthat/test_gapfilling.R index c8f263e..7b2be2a 100644 --- 
a/tests/testthat/test_gapfilling.R +++ b/tests/testthat/test_gapfilling.R @@ -1,6 +1,6 @@ # Helper fixtures -------------------------------------------------------------- -linear_fill_fixture <- function() { +fill_linear_fixture <- function() { tibble::tribble( ~category, ~year, ~value, "a", 2015, NA, @@ -38,25 +38,7 @@ single_anchor_series <- function(anchor = 42) { ) } -proxy_fill_fixture <- function() { - tibble::tribble( - ~category, ~year, ~value, ~proxy_variable, - "a", 2015, NA, 1, - "a", 2016, 3, 2, - "a", 2017, NA, 2, - "a", 2018, NA, 2, - "a", 2019, 0, 2, - "a", 2020, NA, 2, - "b", 2015, 1, 1, - "b", 2016, NA, 2, - "b", 2017, NA, 3, - "b", 2018, NA, 4, - "b", 2019, 5, 5, - "b", 2020, NA, 6 - ) -} - -sum_fill_fixture <- function() { +fill_sum_fixture <- function() { tibble::tribble( ~category, ~year, ~value, ~change_variable, "a", 2014, NA, 2, @@ -75,11 +57,11 @@ sum_fill_fixture <- function() { ) } -# linear_fill ------------------------------------------------------------------ +# fill_linear ------------------------------------------------------------------ -testthat::test_that("linear_fill fills gaps and preserves originals", { - result <- linear_fill_fixture() |> - linear_fill(value, year, .by = "category") +testthat::test_that("fill_linear fills gaps and preserves originals", { + result <- fill_linear_fixture() |> + fill_linear(value, .by = "category") result |> pointblank::expect_col_exists("source_value") |> @@ -103,11 +85,10 @@ testthat::test_that("linear_fill fills gaps and preserves originals", { testthat::expect_false() }) -testthat::test_that("linear_fill interpolates between anchor points, and adds flags", { - linear_fill_fixture() |> - linear_fill( +testthat::test_that("fill_linear interpolates between anchor points, and adds flags", { + fill_linear_fixture() |> + fill_linear( value, - year, interpolate = TRUE, fill_forward = FALSE, fill_backward = FALSE, @@ -132,11 +113,10 @@ testthat::test_that("linear_fill interpolates between anchor points, and adds fl ) }) -testthat::test_that("linear_fill carries values backward from first anchor, and adds flags", { - linear_fill_fixture() |> - linear_fill( +testthat::test_that("fill_linear carries values backward from first anchor, and adds flags", { + fill_linear_fixture() |> + fill_linear( value, - year, interpolate = FALSE, fill_forward = FALSE, fill_backward = TRUE, @@ -161,11 +141,10 @@ testthat::test_that("linear_fill carries values backward from first anchor, and ) }) -testthat::test_that("linear_fill carries values forward from last anchor, and adds flags", { - linear_fill_fixture() |> - linear_fill( +testthat::test_that("fill_linear carries values forward from last anchor, and adds flags", { + fill_linear_fixture() |> + fill_linear( value, - year, interpolate = FALSE, fill_forward = TRUE, fill_backward = FALSE, @@ -190,9 +169,9 @@ testthat::test_that("linear_fill carries values forward from last anchor, and ad ) }) -testthat::test_that("linear_fill interpolates grouped series", { - linear_fill_fixture() |> - linear_fill(value, year, .by = "category") |> +testthat::test_that("fill_linear interpolates grouped series", { + fill_linear_fixture() |> + fill_linear(value, .by = "category") |> pointblank::expect_col_vals_equal( value, c(3, 3, 2, 1, 0, 0), @@ -205,11 +184,10 @@ testthat::test_that("linear_fill interpolates grouped series", { ) }) -testthat::test_that("linear_fill propagates a single anchor value", { +testthat::test_that("fill_linear propagates a single anchor value", { single_anchor_series() |> - linear_fill( 
+ fill_linear( value, - year, interpolate = FALSE, fill_forward = TRUE, fill_backward = TRUE @@ -225,55 +203,106 @@ testthat::test_that("linear_fill propagates a single anchor value", { ) }) -# proxy_fill ------------------------------------------------------------------ +testthat::test_that("fill_linear with value_smooth_window uses smoothed values for filling", { + # Create data with high variability + noisy_data <- tibble::tribble( + ~year, ~value, + 2010, 100, + 2011, 120, + 2012, 80, + 2013, NA, + 2014, NA, + 2015, 110, + 2016, 90, + 2017, 130 + ) -testthat::test_that("proxy_fill scales gaps from proxy ratios", { - proxy_fill_fixture() |> - proxy_fill( - value, - proxy_variable, - year, - .by = "category" - ) |> - pointblank::expect_col_exists("proxy_ratio") |> - pointblank::expect_col_exists("source_value") |> - pointblank::expect_col_vals_in_set( - source_value, - c( - "Original", - "Proxy interpolated", - "Proxy carried forward", - "Proxy carried backwards" - ) - ) |> - pointblank::expect_col_vals_equal( + # Without smoothing: interpolation uses raw anchor values (80 and 110) + result_no_smooth <- noisy_data |> + fill_linear(value, fill_forward = FALSE, fill_backward = FALSE) + + # With smoothing (window = 3): uses moving average of anchors + result_smooth <- noisy_data |> + fill_linear( value, - c(3, 0, 1, 5), - preconditions = \(df) df |> dplyr::filter(source_value == "Original") - ) |> - pointblank::expect_col_vals_expr( - ~ dplyr::near(proxy_ratio, value / proxy_variable, tol = 1e-6), - preconditions = \(df) df |> dplyr::filter(!is.na(value)) + fill_forward = FALSE, + fill_backward = FALSE, + value_smooth_window = 3 ) + + # Both should fill the gaps + testthat::expect_false(any(is.na(result_no_smooth$value[4:5]))) + testthat::expect_false(any(is.na(result_smooth$value[4:5]))) + + # Smoothed result should differ from non-smoothed due to moving average + testthat::expect_false( + all(result_no_smooth$value[4:5] == result_smooth$value[4:5]) + ) + + # Original values should be preserved in both cases + testthat::expect_equal(result_no_smooth$value[1], 100) + testthat::expect_equal(result_smooth$value[1], 100) + testthat::expect_equal(result_no_smooth$value[6], 110) + testthat::expect_equal(result_smooth$value[6], 110) }) -testthat::test_that("proxy_fill works without grouping variables", { - tibble::tribble( - ~year, ~value, ~proxy_variable, - 2015, 10, 5, - 2016, NA, 10, - 2017, 30, 15 - ) |> - proxy_fill(value, proxy_variable, year) |> - pointblank::expect_col_exists("proxy_ratio") |> - pointblank::expect_col_vals_not_null(proxy_ratio) +testthat::test_that("fill_linear value_smooth_window NULL behaves as default", { + result_default <- simple_linear_series() |> + fill_linear(value) + + result_null <- simple_linear_series() |> + fill_linear(value, value_smooth_window = NULL) + + testthat::expect_equal(result_default, result_null) +}) + +testthat::test_that("fill_linear value_smooth_window works with carry forward/backward", { + # Data where smoothing affects the carried value + edge_data <- tibble::tribble( + ~year, ~value, + 2010, 100, + 2011, 120, + 2012, 80, + 2013, NA, + 2014, NA + ) + + # Without smoothing: carries 80 forward + result_no_smooth <- edge_data |> + fill_linear( + value, + interpolate = FALSE, + fill_forward = TRUE, + fill_backward = FALSE + ) + + # With smoothing (window = 3): carries smoothed value forward + # MA of (100, 120, 80) = 100 for 2011, MA of (120, 80, NA) won't work + # But 2012 has neighbours so its smoothed value = mean(120, 80, NA) = NA + # This 
tests edge behaviour + result_smooth <- edge_data |> + fill_linear( + value, + interpolate = FALSE, + fill_forward = TRUE, + fill_backward = FALSE, + value_smooth_window = 3 + ) + + # Without smoothing, should carry 80 + testthat::expect_equal(result_no_smooth$value[4], 80) + testthat::expect_equal(result_no_smooth$value[5], 80) + + # Original values preserved + testthat::expect_equal(result_smooth$value[1], 100) + testthat::expect_equal(result_smooth$value[3], 80) }) -# sum_fill --------------------------------------------------------------------- +# fill_sum -------------------------------------------------------------------- -testthat::test_that("sum_fill accumulates changes while keeping originals", { - sum_fill_fixture() |> - sum_fill( +testthat::test_that("fill_sum accumulates changes while keeping originals", { + fill_sum_fixture() |> + fill_sum( value, change_variable, start_with_zero = TRUE, @@ -297,7 +326,7 @@ testthat::test_that("sum_fill accumulates changes while keeping originals", { ) }) -testthat::test_that("sum_fill handles accumulation without explicit groups", { +testthat::test_that("fill_sum handles accumulation without explicit groups", { tibble::tribble( ~year, ~value, ~change_variable, 2015, 10, 0, @@ -305,7 +334,7 @@ testthat::test_that("sum_fill handles accumulation without explicit groups", { 2017, NA, 3, 2018, NA, 1 ) |> - sum_fill(value, change_variable) |> + fill_sum(value, change_variable) |> pointblank::expect_col_vals_equal(value, c(10, 12, 15, 16)) |> pointblank::expect_col_vals_in_set( source_value, @@ -313,22 +342,22 @@ testthat::test_that("sum_fill handles accumulation without explicit groups", { ) }) -testthat::test_that("sum_fill start_with_zero toggles behaviour", { +testthat::test_that("fill_sum start_with_zero toggles behaviour", { contiguous_gaps <- tibble::tribble( - ~value, ~change_variable, - NA, 1, - NA, 2, - NA, 3, - NA, 4 + ~year, ~value, ~change_variable, + 2015, NA, 1, + 2016, NA, 2, + 2017, NA, 3, + 2018, NA, 4 ) contiguous_gaps |> - sum_fill(value, change_variable) |> + fill_sum(value, change_variable) |> pointblank::expect_col_vals_equal(value, c(1, 3, 6, 10)) |> pointblank::expect_col_vals_equal(source_value, "Filled with sum") contiguous_gaps |> - sum_fill( + fill_sum( value, change_variable, start_with_zero = FALSE @@ -336,9 +365,9 @@ testthat::test_that("sum_fill start_with_zero toggles behaviour", { pointblank::expect_col_vals_null(value) }) -testthat::test_that("sum_fill respects grouping keys", { - sum_fill_fixture() |> - sum_fill( +testthat::test_that("fill_sum respects grouping keys", { + fill_sum_fixture() |> + fill_sum( value, change_variable, .by = "category" @@ -354,3 +383,332 @@ testthat::test_that("sum_fill respects grouping keys", { preconditions = \(df) df |> dplyr::filter(category == "b") ) }) + +# fill_growth ------------------------------------------------------------------ + +test_that("fill_growth fills missing values using proxy growth rates", { + data <- tibble::tribble( + ~country, ~year, ~gdp, ~population, + "ESP", 2010, 100, 46, + "ESP", 2011, NA, 47, + "ESP", 2012, 120, 48, + "ESP", 2013, NA, 49 + ) + + result <- fill_growth( + data, + value_col = "gdp", + proxy_col = "population", + group_by = "country", + verbose = FALSE + ) + + # Should have filled the NA values + expect_false(any(is.na(result$gdp))) +}) + +test_that("fill_growth respects max_gap parameter", { + data <- tibble::tribble( + ~year, ~value, ~proxy, + 2010, 100, 1000, + 2011, NA, 1100, + 2012, NA, 1200, + 2013, NA, 1300, + 2014, 150, 1400 + ) + + 
result <- fill_growth( + data, + value_col = "value", + proxy_col = "proxy", + max_gap = 2, + verbose = FALSE + ) + + # With max_gap = 2, should not fill 3 consecutive NAs + expect_true(is.na(result$value[3])) +}) + +test_that("fill_growth works with grouping", { + data <- tibble::tribble( + ~country, ~year, ~emissions, ~gdp, + "ESP", 2010, 100, 1000, + "ESP", 2011, NA, 1100, + "ESP", 2012, 130, 1200, + "FRA", 2010, 200, 2000, + "FRA", 2011, NA, 2200, + "FRA", 2012, 250, 2400 + ) + + result <- fill_growth( + data, + value_col = "emissions", + proxy_col = "gdp", + group_by = "country", + verbose = FALSE + ) + + # Check both groups have filled values + esp_filled <- result |> + dplyr::filter(country == "ESP", year == 2011) |> + dplyr::pull(emissions) + + fra_filled <- result |> + dplyr::filter(country == "FRA", year == 2011) |> + dplyr::pull(emissions) + + expect_false(is.na(esp_filled)) + expect_false(is.na(fra_filled)) +}) + +test_that("fill_growth returns same number of rows", { + data <- tibble::tribble( + ~year, ~value, ~proxy, + 2010, 100, 1000, + 2011, NA, 1100, + 2012, 120, 1200 + ) + + result <- fill_growth( + data, + value_col = "value", + proxy_col = "proxy", + verbose = FALSE + ) + + expect_equal(nrow(result), nrow(data)) +}) + +# Hierarchical Segmented Interpolation ----------------------------------------- + +test_that("fill_growth uses hierarchical segmentation with intermediate proxy data", { + # Spain wages example: household_ppp has gap 2008-2019 + # formal_ppp has data 2010-2018 (should be used for middle segment) + # gdp_pc_constant has complete data (fallback for edges) + + data_wages <- tibble::tribble( + ~country, ~year, ~household_ppp, ~formal_ppp, ~gdp_pc_constant, + "ESP", 2008, 100, NA, 50, + "ESP", 2009, NA, NA, 51, + "ESP", 2010, NA, 105, 52, + "ESP", 2011, NA, 108, 53, + "ESP", 2012, NA, 112, 54, + "ESP", 2013, NA, 115, 55, + "ESP", 2014, NA, 118, 56, + "ESP", 2015, NA, 122, 57, + "ESP", 2016, NA, 125, 58, + "ESP", 2017, NA, 130, 59, + "ESP", 2018, NA, 135, 60, + "ESP", 2019, 150, NA, 61 + ) + + result <- fill_growth( + data_wages, + value_col = "household_ppp", + proxy_col = c("formal_ppp", "gdp_pc_constant"), + group_by = "country", + output_format = "detailed", + verbose = FALSE + ) + + # All gaps should be filled + expect_false(any(is.na(result$household_ppp))) + + # Original values should be preserved + expect_equal(result$household_ppp[result$year == 2008], 100) + expect_equal(result$household_ppp[result$year == 2019], 150) + + # Check that method column indicates bridge interpolation was used + middle_methods <- result |> + dplyr::filter(year >= 2009, year <= 2018) |> + dplyr::pull(method_household_ppp) + + expect_true(any(grepl("bridge", middle_methods))) +}) + +test_that("fill_growth maintains continuity without jumps between segments", { + # Test that segmented interpolation produces smooth transitions with bridge + data_test <- tibble::tribble( + ~year, ~primary, ~proxy1, ~proxy2, + 2000, 100, 100, 100, + 2001, NA, 105, 102, + 2002, NA, 110, 104, + 2003, NA, 115, 106, + 2004, NA, 120, 108, + 2005, 200, 125, 110 + ) + + result <- fill_growth( + data_test, + value_col = "primary", + proxy_col = c("proxy1", "proxy2"), + verbose = FALSE + ) + + # Check for continuity: bridge should connect smoothly + values <- result$primary + + # First and last values should match original anchors + expect_equal(values[1], 100) + expect_equal(values[6], 200) + + # Check that bridge method was used (not simple forward/backfill) + expect_true(any(grepl("bridge", 
result$method_primary))) + + # Values should increase (since both proxies increase monotonically) + expect_true(all(diff(values) >= 0)) + + # Growth rates should be reasonable (smooth with bridge adjustment) + growth_rates <- diff(values) / values[-length(values)] + expect_true(all(abs(growth_rates) < 1.0)) +}) + +test_that("fill_growth respects proxy hierarchy in segmentation", { + # Better proxy (proxy1) should be used when available + data_hierarchy <- tibble::tribble( + ~year, ~value, ~proxy1, ~proxy2, + 2010, 100, NA, 50, + 2011, NA, NA, 52, + 2012, NA, 120, 54, + 2013, NA, 125, 56, + 2014, NA, NA, 58, + 2015, 180, NA, 60 + ) + + result <- fill_growth( + data_hierarchy, + value_col = "value", + proxy_col = c("proxy1", "proxy2"), + output_format = "detailed", + verbose = FALSE + ) + + # Should have used better proxy for middle segment + expect_false(any(is.na(result$value))) + expect_equal(nrow(result), 6) +}) + +test_that("fill_growth backward compatible: single proxy behaves as before", { + # Without intermediate data, should work exactly as old version + data_simple <- tibble::tribble( + ~year, ~value, ~proxy, + 2010, 100, 1000, + 2011, NA, 1100, + 2012, NA, 1200, + 2013, NA, 1300, + 2014, 150, 1400 + ) + + result <- fill_growth( + data_simple, + value_col = "value", + proxy_col = "proxy", + verbose = FALSE + ) + + # Should fill all gaps + expect_false(any(is.na(result$value))) + + # Should maintain anchors + expect_equal(result$value[result$year == 2010], 100) + expect_equal(result$value[result$year == 2014], 150) + + # Interpolated values should be between anchors + expect_true(all(result$value[2:4] > 100)) + expect_true(all(result$value[2:4] < 150)) +}) + +test_that("fill_growth handles case with no intermediate data gracefully", { + # Hierarchical proxies but none have intermediate data + data_no_intermediate <- tibble::tribble( + ~year, ~value, ~proxy1, ~proxy2, + 2010, 100, NA, 50, + 2011, NA, NA, 52, + 2012, NA, NA, 54, + 2013, 150, NA, 56 + ) + + result <- fill_growth( + data_no_intermediate, + value_col = "value", + proxy_col = c("proxy1", "proxy2"), + verbose = FALSE + ) + + # Should fall back to proxy2 for entire gap + expect_false(any(is.na(result$value))) + expect_equal(result$value[result$year == 2010], 100) + expect_equal(result$value[result$year == 2013], 150) +}) + +test_that("fill_growth preserves original data points when they exist", { + data_mixed <- tibble::tribble( + ~year, ~value, ~proxy, + 2010, 100, 1000, + 2011, NA, 1100, + 2012, 120, 1200, + 2013, NA, 1300, + 2014, 150, 1400 + ) + + result <- fill_growth( + data_mixed, + value_col = "value", + proxy_col = "proxy", + verbose = FALSE + ) + + # Original data points should be exactly preserved + expect_equal(result$value[result$year == 2010], 100) + expect_equal(result$value[result$year == 2012], 120) + expect_equal(result$value[result$year == 2014], 150) + + # Gaps should be filled + expect_false(is.na(result$value[result$year == 2011])) + expect_false(is.na(result$value[result$year == 2013])) +}) + +test_that("fill_growth preserves non-NA values", { + data <- tibble::tribble( + ~year, ~value, ~proxy, + 2010, 100, 1000, + 2011, NA, 1100, + 2012, 120, 1200 + ) + + result <- fill_growth( + data, + value_col = "value", + proxy_col = "proxy", + verbose = FALSE + ) + + # Original non-NA values should be unchanged + expect_equal(result$value[1], 100) + expect_equal(result$value[3], 120) +}) + +test_that("fill_growth extrapolates at ends with hierarchical growth", { + data <- tibble::tribble( + ~year, ~value, 
~proxy1, ~proxy2, + 2010, NA, 100, 50, + 2011, NA, 103, 51, + 2012, 120, 106, 52, + 2013, NA, 109, 53, + 2014, NA, 112, 54 + ) + + res <- fill_growth( + data, + value_col = "value", + proxy_col = c("proxy1", "proxy2"), + max_gap_linear = 1, + output_format = "detailed", + verbose = FALSE + ) + + # Start and end should be filled (edge extrapolation) + expect_false(any(is.na(res$value))) + # Methods at ends should be growth_* (not *_bridge) + expect_true(any(grepl("^growth_", res$method_value))) +})
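
A short, self-contained illustration of the lambda adjustment performed by `.fg_bridge_geometric()` above may help: the proxy growth rates fix the shape of the path across an internal gap, and a constant factor `lambda` rescales every step so the bridged path lands exactly on the right-hand anchor. This is a minimal sketch of the arithmetic only; `bridge_sketch()` is a hypothetical name, not part of the package, and the real helper additionally tracks method labels and the proxy hierarchy.

```r
# Minimal sketch (not package code) of the lambda-adjusted geometric bridge:
# pred_end is the end value implied by the proxy growth alone; lambda is the
# constant per-step correction that makes the bridged path hit v_end exactly.
bridge_sketch <- function(v_start, v_end, rates) {
  pred_end <- v_start * prod(1 + rates)
  lambda <- (v_end / pred_end)^(1 / length(rates))
  purrr::accumulate(rates, function(v, r) v * (1 + r) * lambda, .init = v_start)
}

bridge_sketch(100, 150, rates = c(0.05, 0.10, 0.05))
# The last element equals 150, the right-hand anchor, by construction.
```

In the package code the same values are written back one position at a time, and the end anchor itself is never overwritten.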
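
The hierarchical fallback between proxies that the `fill_growth()` documentation above describes can be reproduced with a small call mirroring the `test_gapfilling.R` fixtures; this is illustrative only and assumes the package from this diff is installed and attached. The documented `"variable:group"` and weighted `"gdp[population]"` spellings for `proxy_col` follow the same call pattern.

```r
# proxy1 is the preferred proxy and only covers 2012-2013; proxy2 is the
# fallback used for the rest of the 2011-2014 gap between the two anchors.
data <- tibble::tibble(
  year = 2010:2015,
  value = c(100, NA, NA, NA, NA, 180),
  proxy1 = c(NA, NA, 120, 125, NA, NA),
  proxy2 = c(50, 52, 54, 56, 58, 60)
)

fill_growth(
  data,
  value_col = "value",
  proxy_col = c("proxy1", "proxy2"),
  output_format = "detailed",
  verbose = FALSE
)
```

The `method_value` column in the result records which proxy and filling step produced each estimate, which is what the tests above check with `grepl("bridge", ...)`.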
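
The bracket syntax in the `calculate_lmdi()` help is only exercised with simple identities in its examples; the sketch below shows what a single-level structural decomposition call could look like. It is a hedged illustration, not package documentation: the identity string is taken verbatim from the Identity Syntax section above, but the long-format `sector_data` layout and its column names are assumptions and may need adapting to the actual implementation.

```r
# Assumed layout: one row per year and sector, with per-sector activity and
# emissions; the bracket terms refer to the "sector" column as described in
# the Identity Syntax section of the Rd above.
sector_data <- tibble::tibble(
  year = rep(2010:2012, each = 2),
  sector = rep(c("industry", "transport"), times = 3),
  activity = c(500, 300, 520, 320, 550, 330),
  emissions = c(60, 40, 61, 42, 63, 43)
)

lmdi_structural <- calculate_lmdi(
  sector_data,
  identity = paste0(
    "emissions:activity*(activity[sector]/activity)",
    "*(emissions[sector]/activity[sector])"
  ),
  time_var = "year",
  verbose = FALSE
)
```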