By Corone


2013-02-04 14:59:49 8 Comments

I want to create a new column in a data.table calculated from the current value of one column and the previous of another. Is it possible to access previous rows?

E.g.:

> DT <- data.table(A=1:5, B=1:5*10, C=1:5*100)
> DT
   A  B   C
1: 1 10 100
2: 2 20 200
3: 3 30 300
4: 4 40 400
5: 5 50 500
> DT[, D := C + BPreviousRow] # What is the correct code here?

The correct answer should be

> DT
   A  B   C   D
1: 1 10 100  NA
2: 2 20 200 210
3: 3 30 300 320
4: 4 40 400 430
5: 5 50 500 540

7 comments

@Abdullah Al Mahmud 2018-07-05 10:51:14

Here is my intuitive solution:

#create data frame
df <- data.frame(A=1:5, B=seq(10,50,10), C=seq(100,500, 100))`
#subtract the shift from num rows
shift  <- 1 #in this case the shift is 1
invshift <- nrow(df) - shift
#Now create the new column
df$D <- c(NA, head(df$B, invshift)+tail(df$C, invshift))`

Here invshift, the number of rows minus 1, is 4. nrow(df) provides you with the number of rows in a data frame or in a vector. Similarly, if you want to take still earlier values, subtract from nrow 2, 3, ...etc, and also put NA's accordingly at the beginning.

@Arun 2013-02-04 15:02:24

With shift() implemented in v1.9.6, this is quite straightforward.

DT[ , D := C + shift(B, 1L, type="lag")]
# or equivalently, in this case,
DT[ , D := C + shift(B)]

From NEWS:

  1. New function shift() implements fast lead/lag of vector, list, data.frames or data.tables. It takes a type argument which can be either "lag" (default) or "lead". It enables very convenient usage along with := or set(). For example: DT[, (cols) := shift(.SD, 1L), by=id]. Please have a look at ?shift for more info.

See history for previous answers.

@SlowLearner 2013-02-04 15:24:33

Does that .N hold the current row number or something? Sorry to ask here, but I can't seem to find it in the help files...

@Steve Lianoglou 2013-02-04 16:02:27

@SlowLearner: You might also find .I useful, which holds the row indices for the rows in the curren group.

@mnel 2013-02-04 19:08:20

Use seq_len(.N - 1) instead of 1:(.N-1). This avoids problems associated with 1:0.

@MichaelChirico 2015-04-26 22:41:26

+1 for the .SD example--I was trying to use a lapply and getting funky results. this is much simpler.

@skan 2015-04-30 16:45:31

Where can I find an updated pdf with all this new information ? The official 1.9.4 vignettes and webminars don't include it. And the Rmd 1.9.5 vignettes are not comfortable and don't include it either.

@skan 2015-04-30 16:51:43

@Arun , your first solution says "Error in C + shift(B, 1L, type = "lag") : non-numeric argument to binary operator" in my data.table 1.9.5

@Arun 2015-04-30 17:10:54

@skan, update to the latest commit of data.table... Made changes yesterday. If using devel, better to watch the project on GitHub

@Andreas 2019-10-21 10:30:48

Note - for me at least - using shift was much slower that using 'slicing' with -N - see timings here (for a data.table of 1,5 mio. rows) stackoverflow.com/questions/58372627/…

@Steven Beaupré 2015-04-27 01:52:21

Using dplyr you could do:

mutate(DT, D = lag(B) + C)

Which gives:

#   A  B   C   D
#1: 1 10 100  NA
#2: 2 20 200 210
#3: 3 30 300 320
#4: 4 40 400 430
#5: 5 50 500 540

@B C 2018-11-19 18:25:57

Underrated answer - thank you for sharing.

@geneorama 2014-11-03 22:03:58

I added a padding argument and changed some names and called it shift. https://github.com/geneorama/geneorama/blob/master/R/shift.R

@geneorama 2015-01-09 20:06:56

Thanks so much for the note. I'll be on the lookout for it, and most likely use it and deprecate my geneorama version.

@dnlbrky 2014-08-01 16:24:39

Several folks have answered the specific question. See the code below for a general purpose function that I use in situations like this that may be helpful. Rather than just getting the prior row, you can go as many rows in the "past" or "future" as you'd like.

rowShift <- function(x, shiftLen = 1L) {
  r <- (1L + shiftLen):(length(x) + shiftLen)
  r[r<1] <- NA
  return(x[r])
}

# Create column D by adding column C and the value from the previous row of column B:
DT[, D := C + rowShift(B,-1)]

# Get the Old Faithul eruption length from two events ago, and three events in the future:
as.data.table(faithful)[1:5,list(eruptLengthCurrent=eruptions,
                                 eruptLengthTwoPrior=rowShift(eruptions,-2), 
                                 eruptLengthThreeFuture=rowShift(eruptions,3))]
##   eruptLengthCurrent eruptLengthTwoPrior eruptLengthThreeFuture
##1:              3.600                  NA                  2.283
##2:              1.800                  NA                  4.533
##3:              3.333               3.600                     NA
##4:              2.283               1.800                     NA
##5:              4.533               3.333                     NA

@geneorama 2014-11-03 19:52:37

This is a brilliant answer, I'm annoyed that I've already upvoted the other answers because this is a far more general answer. In fact, I'm going to use it in my geneorama package (if you don't mind).

@dnlbrky 2014-11-03 20:33:04

Sure, go for it. I was hoping to get some free time and submit it as a pull request to the data.table package, but alas...

@dnlbrky 2015-02-19 18:53:10

A similar function called shift has been added to data.table as of version 1.9.5. See the updated answer from @Arun.

@Gary Weissman 2014-05-04 04:25:05

Based on @Steve Lianoglou 's comment above, why not just:

DT[, D:= C + c(NA, B[.I - 1]) ]
#    A  B   C   D
# 1: 1 10 100  NA
# 2: 2 20 200 210
# 3: 3 30 300 320
# 4: 4 40 400 430
# 5: 5 50 500 540

And avoid using seq_len or head or any other function.

@Matthew 2014-09-02 18:39:00

Nice - however this would not work if you wanted to find the previous within a group.

@Gary Weissman 2015-02-15 20:00:33

@Matthew you are right. If subsetting by group I would replace .I with seq_len(.N)

@Ryogi 2013-02-04 15:53:25

Following Arun's solution, a similar results can be obtained without referring to to .N

> DT[, D := C + c(NA, head(B, -1))][]
   A  B   C   D
1: 1 10 100  NA
2: 2 20 200 210
3: 3 30 300 320
4: 4 40 400 430
5: 5 50 500 540

@Corone 2013-02-04 16:10:38

Is there a reason to prefer one method to another? Or is it simply an aesthetic difference?

@Ryogi 2013-02-04 16:24:10

I think that in this scenario (i.e. where .N is readily available) it is mostly aesthetic choice. I am not aware of any important difference.

Related Questions

Sponsored Content

2 Answered Questions

R data.table calculations with previous row dependencies

  • 2019-02-02 12:37:56
  • Sarel Louw
  • 92 View
  • 1 Score
  • 2 Answer
  • Tags:   r data.table

3 Answered Questions

[SOLVED] data.table vs dplyr: can one do something well the other can't or does poorly?

  • 2014-01-29 15:21:45
  • BrodieG
  • 114074 View
  • 723 Score
  • 3 Answer
  • Tags:   r data.table dplyr

1 Answered Questions

[SOLVED] calculate new column on factor level based on two data.frames/data.tables

  • 2018-06-14 13:26:39
  • crazysantaclaus
  • 62 View
  • 0 Score
  • 1 Answer
  • Tags:   r data.table

5 Answered Questions

[SOLVED] Creating a vector from data.table row without using apply

  • 2018-04-04 00:47:30
  • Dan Rosenheck
  • 111 View
  • 3 Score
  • 5 Answer
  • Tags:   r data.table

1 Answered Questions

1 Answered Questions

Referencing previous value in dplyr

  • 2017-12-11 20:03:36
  • Ziyan Xu
  • 173 View
  • 1 Score
  • 1 Answer
  • Tags:   r dplyr data.table

3 Answered Questions

5 Answered Questions

[SOLVED] calculating sum of previous 3 rows in R data.table (by grid-square)

1 Answered Questions

[SOLVED] Use previous calculated row value in r

2 Answered Questions

[SOLVED] How Can I Merge Multiple Files with Unequal Rows in R

Sponsored Content