Simplified and Extended "Updating by Reference" Section in Joins Vignette #6847

Merged
merged 23 commits into from
Jun 23, 2025
Changes from 2 commits
48 changes: 31 additions & 17 deletions vignettes/datatable-joins.Rmd
@@ -698,23 +698,37 @@ Products[!"popcorn",

The `:=` operator in `data.table` is used for updating or adding columns by reference. This means it modifies the original `data.table` without creating a copy, which is very memory-efficient, especially for large datasets. When used inside a `data.table`, `:=` allows you to **add new columns** or **modify existing ones** as part of your query.
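
As a minimal sketch of what "by reference" means here, the following uses a small toy table (the table and column names are illustrative and not taken from the vignette) and compares the memory address of the object before and after using `:=`:

```r
library(data.table)

# Toy table; names are illustrative, not from the vignette
DT <- data.table(id = 1:3, price = c(10, 20, 30))
addr_before <- address(DT)

# Add a new column and modify an existing one by reference
DT[, price_with_tax := price * 1.1]
DT[id == 2L, price := 25]

# The table still lives at the same address: no copy was made
identical(addr_before, address(DT))
```

Because no assignment with `<-` is needed, the changes are visible through every name that refers to the same table.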

Let's update our `Products` table with the latest price from `ProductPriceHistory`:

```{r}
copy(Products)[ProductPriceHistory,
on = .(id = product_id),
j = `:=`(price = tail(i.price, 1),
last_updated = tail(i.date, 1)),
by = .EACHI][]
```

In this operation:

- The `copy` function creates a ***deep*** copy of the `Products` table, so the changes made by `:=` do not modify the original table by reference.
- We join `Products` with `ProductPriceHistory` based on `id` and `product_id`.
- We update the `price` column with the latest price from `ProductPriceHistory`.
- We add a new `last_updated` column to track when the price was last changed.
- The `by = .EACHI` ensures that the `tail` function is applied for each product in `ProductPriceHistory`.
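
The points above can be sketched on small stand-in tables. `Prod` and `Hist` are hypothetical substitutes for the vignette's `Products` and `ProductPriceHistory`, with made-up values, so treat this as an illustration rather than the vignette's own output:

```r
library(data.table)

# Hypothetical stand-ins for the vignette's tables
Prod <- data.table(id = 1:2, price = c(10, 20))
Hist <- data.table(
  product_id = c(1L, 1L, 2L),
  price      = c(11, 12, 21),
  date       = as.IDate(c("2024-01-01", "2024-02-01", "2024-01-15"))
)

# copy() makes a deep copy, so := only touches the copy
Updated <- copy(Prod)[Hist,
                      on = .(id = product_id),
                      `:=`(price = tail(i.price, 1),
                           last_updated = tail(i.date, 1)),
                      by = .EACHI]

names(Prod)     # the original keeps only its own columns
names(Updated)  # the copy gains last_updated
```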
1) Let's update our `Products` table with the latest price from `ProductPriceHistory`:
```{r Simple One-to-One Update}
Member

are these valid {knitr} chunk names? regardless, please use machine-readable names (a la https://style.tidyverse.org/files.html)

Products[ProductPriceHistory, on = .(id = product_id), price := i.price]
```
- The `price` column in `Products` is updated using the `price` column from `ProductPriceHistory`.
- The `on = .(id = product_id)` argument ensures that updates happen based on matching IDs.
- This method modifies `Products` in place, avoiding unnecessary copies.
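
A self-contained sketch of this one-to-one update, assuming toy tables (`Prod` and `NewPrices` are hypothetical names with made-up values), also shows that rows without a match are left untouched:

```r
library(data.table)

# Hypothetical tables: one new price per product, product 3 has none
Prod <- data.table(id = 1:3, price = c(10, 20, 30))
NewPrices <- data.table(product_id = c(1L, 2L), price = c(11, 21))

# i.price refers to the price column of NewPrices (the i table)
Prod[NewPrices, on = .(id = product_id), price := i.price]

Prod$price  # products 1 and 2 are updated, product 3 keeps its old price
```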

2) If we need to get the latest price and date (instead of all matches), we can still use `:=` efficiently:
Member

I'm not sure we've cleanly explained the difference between this join and the previous one. The only difference I see is tail() vs. last(), am I missing something?

IIUC the difference between tail() and last() would be that last() can skip NA values, right?

```{r Updating with the Latest Record}
Products[ProductPriceHistory,
on = .(id = product_id),
`:=`(price = last(i.price), last_updated = last(i.date)),
by = .EACHI]
```
- `last(i.price)` selects the last matching price, which is the latest one provided `ProductPriceHistory` is ordered by date.
- The `last_updated` column is added to track the last update date.
- `by = .EACHI` ensures that the last price is picked for each product.
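
Since "latest" depends on row order, a hedged sketch with deliberately unsorted toy data may help. `Prod` and `Hist` are hypothetical stand-ins, and sorting with `setorder()` first is an assumption added here, not a step taken from the diff:

```r
library(data.table)

# Hypothetical history, deliberately NOT ordered by date
Prod <- data.table(id = 1:2, price = c(10, 20))
Hist <- data.table(
  product_id = c(1L, 2L, 1L),
  price      = c(12, 21, 11),
  date       = as.IDate(c("2024-02-01", "2024-01-15", "2024-01-01"))
)

# Sort by reference so the last match per product is the most recent one
setorder(Hist, product_id, date)

Prod[Hist,
     on = .(id = product_id),
     `:=`(price = last(i.price), last_updated = last(i.date)),
     by = .EACHI]

Prod$price  # each product now carries its most recent price
```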

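3) When we need to update `Products` with multiple columns from `ProductPriceHistory`:
```{r Efficient Right Join Update }
# Every column of ProductPriceHistory except the join key
cols <- setdiff(names(ProductPriceHistory), 'product_id')
Products[ProductPriceHistory,
on = .(id = product_id),
# The i. prefix selects the columns of ProductPriceHistory, not those of Products
(cols) := mget(paste0("i.", cols))]
```
- Updates multiple columns in `Products` from `ProductPriceHistory` in a single join.
- `mget(paste0("i.", cols))` retrieves the matching columns of `ProductPriceHistory` dynamically. The `i.` prefix is needed because an unprefixed name such as `price` refers to the column that already exists in `Products`.
- This method avoids building a new table, so it typically needs less memory than `Products <- ProductPriceHistory[Products, on = ...]`.
- Note: `:=` updates `Products` in place, but does not modify `ProductPriceHistory`.
- Unlike a traditional RIGHT JOIN, `data.table` does not allow `i` (the right table) to be updated directly.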
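
A minimal runnable sketch of the multi-column update, assuming hypothetical tables `Prod` and `Latest` with one row per product, and using the `i.` prefix so that the values come from the right-hand table:

```r
library(data.table)

# Hypothetical tables with made-up values; one row per product
Prod <- data.table(id = 1:2, price = c(10, 20))
Latest <- data.table(
  product_id = 1:2,
  price      = c(12, 21),
  date       = as.IDate(c("2024-02-01", "2024-01-15"))
)

# All columns of Latest except the join key
cols <- setdiff(names(Latest), "product_id")

# Overwrite price and add date in Prod, reading both from Latest
Prod[Latest, on = .(id = product_id), (cols) := mget(paste0("i.", cols))]

names(Prod)  # id, price, date
```

`Latest` itself is left unchanged, since only the table in the `x` position is modified by `:=`.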

***
