By ishan


2019-07-12 12:56:44 8 Comments

I have a data.table with multiple columns of a variable "Performance" in specific years and a column named "ExPerf". I want to create a new column called FLAG which would indicate rows flagged for manual review based on these two conditions:

  1. Any of the "Performance" columns has a negative value
  2. The "ExPerf" column is different from any of the columns by more than 50%.

A mock data.table similar to the one I have:

library(data.table)
dt <- data.table(Id = c("N23", "N34", "N11", "N65", "N55", "N78", "N88"),
                 Name = c("ABCD", "ACBD", "ACCD", "ADBN", "ADDD", "DBCA", "CBDA"),
                 Type = c("T", "B", "B", "T", "T", "B", "B"),
                 Sold = c(500, 300, 350, 500, 350, 400, 450),
                 Bl = c(2000, 2100, 2000, 1500, 1890, 1900, 2000),
                 P_2016 = c(-200, 420, 800, 900, -10, 75, 400),
                 P_2017 = c(500, 300, -20, 700, 50, 80, 370),
                 P_2018 = c(1000, 400, 600, 800, 40, 500, 300),
                 EP_2019 = c(1500, 380, 500, 850, 30, 400, 350))
dt

Id Name Type Sold Baseline Perf_2016 Perf_2017 Perf_2018 ExpPerf_2019
N23 ABCD T   500  2000     -200      500       1000      1500
N34 ACBD B   300  2100     420       300       400       380
N11 ACCD B   350  2000     800       -20       600       500
N65 ADBN T   500  1500     900       700       800       850
N55 ADDD T   350  1890     -10       50        40        30
N78 DBCA B   400  1900     75        80        500       400
N88 CBDA B   450  2000     400       370       300       350

For this data.table the desired output would add the FLAG column as seen below:

    Id Name Type Sold Baseline Perf_2016 Perf_2017 Perf_2018 ExpPerf_2019  FLAG
1: N23 ABCD    T  500     2000      -200       500      1000         1500  TRUE
2: N34 ACBD    B  300     2100       420       300       400          380 FALSE
3: N11 ACCD    B  350     2000       800       -20       600          500  TRUE
4: N65 ADBN    T  500     1500       900       700       800          850 FALSE
5: N55 ADDD    T  350     1890       -10        50        40           30  TRUE
6: N78 DBCA    B  400     1900        75        80       500          400  TRUE
7: N88 CBDA    B  450     2000       400       370       300          350 FALSE

2 comments

@Frank 2019-07-15 15:41:20

  1. Any of the performance columns has a negative value
  2. The expected performance column is different from any of the performance columns by more than 50%.

In other words, there are common min and max bounds for these columns:

  • the min is max(0, ExpPerf*0.5)
  • the max is ExpPerf*1.5

So...

dt[, v := !Reduce(`&`, 
  lapply(.SD, between, pmax(0, ExpPerf_2019*0.5), ExpPerf_2019*1.5)
), .SDcols=grep("^Perf_", names(dt), value=TRUE)]

    Id Name Type Sold Baseline Perf_2016 Perf_2017 Perf_2018 ExpPerf_2019     v
1: N23 ABCD    T  500     2000      -200       500      1000         1500  TRUE
2: N34 ACBD    B  300     2100       420       300       400          380 FALSE
3: N11 ACCD    B  350     2000       800       -20       600          500  TRUE
4: N65 ADBN    T  500     1500       900       700       800          850 FALSE
5: N55 ADDD    T  350     1890       -10        50        40           30  TRUE
6: N78 DBCA    B  400     1900        75        80       500          400  TRUE
7: N88 CBDA    B  450     2000       400       370       300          350 FALSE

How it works:

  • between checks if a column lies between the min and max
  • lapply applies the check to each column, returning a list
  • Reduce with & checks whether all columns meet the condition
  • ! negates the result, so we identify cases where at least one column fails the condition

between, & and ! are vectorized operators, so we end up with a vector of results, one for each row. I would probably write this sequence in magrittr so the steps are simpler to follow:

library(magrittr)

dt[, v := .SD %>% 
  lapply(between, pmax(0, ExpPerf_2019*0.5), ExpPerf_2019*1.5) %>%
  Reduce(f=`&`) %>%
  not
, .SDcols=grep("^Perf_", names(dt), value=TRUE)]

not is a relabeling of !, offered by magrittr for convenience.

.SD is a special symbol for the subset of data operated on inside the j part of DT[i, j, by]. In this case, there is no i or by, so only .SDcols is subsetting (to select the columns of interest).

Comment

  • The code would be simpler if the OP chose to format the data in long format.
  • My answer uses the same steps as Gilean's, but is vectorised instead of calculating per row.

@Gilean0709 2019-07-15 15:12:36

You can use the following code to check for your two conditions:

dt[, FLAG := any(.SD < 0 | .SD < ExpPerf_2019 - .5*ExpPerf_2019 | .SD > ExpPerf_2019 + .5*ExpPerf_2019),
   by = Id,
   .SDcols = grep("^Perf", colnames(dt), value = TRUE)
   ]

The result:

> dt
    Id Name Type Sold Baseline Perf_2016 Perf_2017 Perf_2018 ExpPerf_2019  FLAG
1: N23 ABCD    T  500     2000      -200       500      1000         1500  TRUE
2: N34 ACBD    B  300     2100       420       300       400          380 FALSE
3: N11 ACCD    B  350     2000       800       -20       600          500  TRUE
4: N65 ADBN    T  500     1500       900       700       800          850 FALSE
5: N55 ADDD    T  350     1890       -10        50        40           30  TRUE
6: N78 DBCA    B  400     1900        75        80       500          400  TRUE
7: N88 CBDA    B  450     2000       400       370       300          350 FALSE

Related Questions

Sponsored Content

8 Answered Questions

[SOLVED] Convert column classes in data.table

  • 2011-10-18 20:53:22
  • Christoph_J
  • 101294 View
  • 117 Score
  • 8 Answer
  • Tags:   r data.table

8 Answered Questions

[SOLVED] How do you delete a column by name in data.table?

  • 2012-02-08 22:20:31
  • Maiasaura
  • 159666 View
  • 196 Score
  • 8 Answer
  • Tags:   r data.table

5 Answered Questions

[SOLVED] Select multiple columns in data.table by their numeric indices

  • 2012-11-14 17:18:44
  • jamborta
  • 179383 View
  • 142 Score
  • 5 Answer
  • Tags:   r data.table

19 Answered Questions

[SOLVED] How to sort a dataframe by multiple column(s)

2 Answered Questions

1 Answered Questions

[SOLVED] Summarizing multiple columns with data.table

  • 2013-05-13 01:41:45
  • Tahnoon Pasha
  • 60105 View
  • 54 Score
  • 1 Answer
  • Tags:   r data.table

2 Answered Questions

[SOLVED] Remove multiple columns from data.table

  • 2013-05-19 19:16:58
  • matt_k
  • 36930 View
  • 72 Score
  • 2 Answer
  • Tags:   r data.table

6 Answered Questions

[SOLVED] How to delete a row by reference in data.table?

  • 2012-05-28 20:41:07
  • Florian Oswald
  • 48944 View
  • 149 Score
  • 6 Answer
  • Tags:   r data.table

2 Answered Questions

[SOLVED] How to reorder data.table columns (without copying)

  • 2012-09-01 22:47:27
  • Michael
  • 67585 View
  • 117 Score
  • 2 Answer
  • Tags:   r data.table

1 Answered Questions

[SOLVED] R: set default color for ascii grid plot

Sponsored Content