By Jakob Gepp


2017-07-13 10:05:37 8 Comments

I want to select specific rows values (here TARGET) of a data.table (DT1) where the filter criterions are in an other data.table (DT2). It is not an exact filter, because if I have value 3 in DT2 I have a minimum and a maximum variable for this value in DT1. Also I have a string which shall contain a specific pattern. E.g : A = 3 in DT2 and the corresponding line in DT1 contains minA = 3, maxA = 6 and C = "Mon" which is in C = "Mon,Tue".

DT1
    INDEX1 minA maxA       C TARGET
9 :      9    3    6 Mon,Tue    109

DT2
   A   C INDEX2
1: 3 Mon      1

I am looking for the line with the range in which this value lies and also with the maximum target value.

I have the following simplified example:

# version 1.9.6
library(data.table)

DT1 <- data.table(INDEX1 = 1:12,
                  minA = c(1,1,1,2,2,2,3,3,3,4,4,4),
                  maxA = c(4,5,6),
                  C = c("Mon,Tue", "Mon,Wed", "Tue,Thu", "Wed,Thu"),
                  TARGET = c(101:112))
size <- 2
DT2 <- data.table(A = rep(c(3,4), size),
                  C = rep(c("Mon", "Thu"), size),
                  INDEX2 = 1:(2*size))

which looks like this:

DT1
   INDEX1 minA maxA       C TARGET
1 :      1    1    4 Mon,Tue    101
2 :      2    1    5 Mon,Wed    102
3 :      3    1    6 Tue,Thu    103
4 :      4    2    4 Wed,Thu    104
5 :      5    2    5 Mon,Tue    105
6 :      6    2    6 Mon,Wed    106
7 :      7    3    4 Tue,Thu    107
8 :      8    3    5 Wed,Thu    108
9 :      9    3    6 Mon,Tue    109
10:     10    4    4 Mon,Wed    110
11:     11    4    5 Tue,Thu    111
12:     12    4    6 Wed,Thu    112

DT2
   A   C INDEX2
1: 3 Mon      1
2: 4 Thu      2

I included size just for scaling and testing.

My solution so far is the following:

I wrote a function foo() which takes the filter-input values and returns the index (or some other more usefull variable) of DT1.

foo <- function(i.A, i.C){
  DT1[INDEX1 %in% grep(i.C,C) & minA <= i.A & maxA >= i.A,][TARGET == max(TARGET),]
}

I call this function for each row of DT2

DT2[, foo(i.A = A, i.C = C), by = INDEX2]

with the outout:

   INDEX2 INDEX1 minA maxA       C TARGET
1:      1      9    3    6 Mon,Tue    109
2:      2     12    4    6 Wed,Thu    112

And here is my problem:

This works fine for small data.tables, but I have a lot more rows in DT2. The functions takes much longer and i was wondering if there is a better / faster way for this kind of filtering? Maybe it is possible to "upgrade" foo() so that it can process the whole column instead of single row?

If possible i like to avoid to expand my DT1 like here:

and I think, I have a more complex filter than in these questions:

Thanks in advance for any help.

1 comments

@Jakob Gepp 2017-07-26 08:36:14

New solution

I realised that going through each line of the bigger data.table takes to much time, so I build a new function foo_new which works the other way around:

foo_new <- function(data, i.A, i.C){
  data[C %in% i.C & A %between% i.A, INDEX2]
}

Instead of machting every row of DT2 with a row of DT1, I select every row in DT2 which matches the values of one row of DT1. The ordering of DT2 is done because I need the row with the highest TARGET value. Also if a row in DT2 was already selected, it is removed for the next iteration.

The whole process is speeded up a lot:

   function   user  system elapsed 
   foo      61.511   0.327  62.052
   foo_new   0.045   0.003   0.047 

This is probaly only the case, when DT1 is smaller than DT2 - which is my case.


Here my whole simulation code:

rm(list = ls())
library(data.table)

DT1 <- data.table(INDEX1 = 1:12,
                  minA = c(1,1,1,2,2,2,3,3,3,4,4,4),
                  maxA = c(4,5,6),
                  C = c("Mon,Tue", "Mon,Wed", "Tue,Thu", "Wed,Thu"),
                  TARGET = c(101:112))

size <- 20000
DT2 <- data.table(A = rep(c(3,4), size),
                  C = rep(c("Mon", "Thu"), size),
                  INDEX2 = 1:(2*size))

foo <- function(i.A, i.C){
  DT1[INDEX1 %in% grep(i.C, C) &
        minA <= i.A &
        maxA >= i.A,
      ][TARGET == max(TARGET),]
}

foo_new <- function(data, i.A, i.C){
  data[C %in% i.C & A %between% i.A, INDEX2]
}

# with foo
DT2[, foo(i.A = A, i.C = C), by = INDEX2])

# with foo_new
DT1.ordered <- copy(DT1[order(TARGET, decreasing = TRUE)])
tmp.index <- list()
DT2[, TARGET := as.numeric(NA)]
for (i in c(1:dim(DT1.ordered)[1])) {
  # i <- 1
  restdata <- copy(DT2[is.na(TARGET),])
  tmp.index <- foo_new(data = restdata,
                       i.A  = unlist(DT1.ordered[i, list(minA, maxA)]),
                       i.C  = DT1.ordered[i, strsplit(C, ",")[[1]]])
  DT2[INDEX2 %in% tmp.index, TARGET := DT1.ordered[i, TARGET]]
}

Related Questions

Sponsored Content

4 Answered Questions

[SOLVED] data.table vs dplyr: can one do something well the other can't or does poorly?

  • 2014-01-29 15:21:45
  • BrodieG
  • 125209 View
  • 766 Score
  • 4 Answer
  • Tags:   r data.table dplyr

5 Answered Questions

[SOLVED] Filtering out duplicated/non-unique rows in data.table

16 Answered Questions

[SOLVED] Remove rows with all or some NAs (missing values) in data.frame

8 Answered Questions

[SOLVED] Convert column classes in data.table

  • 2011-10-18 20:53:22
  • Christoph_J
  • 101293 View
  • 117 Score
  • 8 Answer
  • Tags:   r data.table

3 Answered Questions

[SOLVED] What does .SD stand for in data.table in R

  • 2011-12-14 16:59:16
  • Farrel
  • 62071 View
  • 172 Score
  • 3 Answer
  • Tags:   r data.table

8 Answered Questions

[SOLVED] How do you delete a column by name in data.table?

  • 2012-02-08 22:20:31
  • Maiasaura
  • 159665 View
  • 196 Score
  • 8 Answer
  • Tags:   r data.table

5 Answered Questions

[SOLVED] Select multiple columns in data.table by their numeric indices

  • 2012-11-14 17:18:44
  • jamborta
  • 179383 View
  • 142 Score
  • 5 Answer
  • Tags:   r data.table

6 Answered Questions

[SOLVED] How to delete a row by reference in data.table?

  • 2012-05-28 20:41:07
  • Florian Oswald
  • 48944 View
  • 149 Score
  • 6 Answer
  • Tags:   r data.table

2 Answered Questions

7 Answered Questions

[SOLVED] Test if a vector contains a given element

  • 2009-07-23 02:20:53
  • medriscoll
  • 681388 View
  • 520 Score
  • 7 Answer
  • Tags:   r vector r-faq

Sponsored Content