By frankc


2012-09-12 15:46:41 8 Comments

How do you refer to variables in a data.table if the variable names are stored in a character vector? For instance, this works for a data.frame:

df <- data.frame(col1 = 1:3)
colname <- "col1"
df[colname] <- 4:6
df
#   col1
# 1    4
# 2    5
# 3    6

How can I perform this same operation for a data.table, either with or without := notation? The obvious thing of dt[ , list(colname)] doesn't work (nor did I expect it to).

3 comments

@Matt Dowle 2012-09-12 16:05:27

Two ways to programmatically select variable(s):

  1. with = FALSE:

    DT = data.table(col1 = 1:3)
    colname = "col1"
    DT[, colname, with = FALSE] 
    #    col1
    # 1:    1
    # 2:    2
    # 3:    3
    
  2. 'dot dot' (..) prefix:

    DT[, ..colname]    
    #    col1
    # 1:    1
    # 2:    2
    # 3:    3
    

For further description of the 'dot dot' (..) notation, see New Features in 1.10.2 (it is currently not described in help text).

To assign to variable(s), wrap the LHS of := in parentheses:

DT[, (colname) := 4:6]    
#    col1
# 1:    4
# 2:    5
# 3:    6

The latter is known as a column plonk, because you replace the whole column vector by reference. If a subset i was present, it would subassign by reference. The parens around (colname) is a shorthand introduced in version v1.9.4 on CRAN Oct 2014. Here is the news item:

Using with = FALSE with := is now deprecated in all cases, given that wrapping the LHS of := with parentheses has been preferred for some time.

colVar = "col1"
DT[, colVar := 1, with = FALSE]                 # deprecated, still works silently
DT[, (colVar) := 1]                             # please change to this
DT[, c("col1", "col2") := 1]                    # no change
DT[, 2:4 := 1]                                  # no change
DT[, c("col1","col2") := list(sum(a), mean(b)]  # no change
DT[, `:=`(...), by = ...]                       # no change

See also Details section in ?`:=`:

DT[i, (colnamevector) := value]
# [...] The parens are enough to stop the LHS being a symbol

And to answer further question in comment, here's one way (as usual there are many ways) :

DT[, colname := cumsum(get(colname)), with = FALSE]
#    col1
# 1:    4
# 2:    9
# 3:   15 

or, you might find it easier to read, write and debug just to eval a paste, similar to constructing a dynamic SQL statement to send to a server :

expr = paste0("DT[,",colname,":=cumsum(",colname,")]")
expr
# [1] "DT[,col1:=cumsum(col1)]"

eval(parse(text=expr))
#    col1
# 1:    4
# 2:   13
# 3:   28

If you do that a lot, you can define a helper function EVAL :

EVAL = function(...)eval(parse(text=paste0(...)),envir=parent.frame(2))

EVAL("DT[,",colname,":=cumsum(",colname,")]")
#    col1
# 1:    4
# 2:   17
# 3:   45

Now that data.table 1.8.2 automatically optimizes j for efficiency, it may be preferable to use the eval method. The get() in j prevents some optimizations, for example.

Or, there is set(). A low overhead, functional form of :=, which would be fine here. See ?set.

set(DT, j = colname, value = cumsum(DT[[colname]]))
DT
#    col1
# 1:    4
# 2:   21
# 3:   66

@frankc 2012-09-12 16:48:48

Thanks for the reply Matthew. The with=FALSE definitely solves part of my problem. In reality though, I want to replace the column with the cumsum of the column. Can I reference the column name by variable on the right-hand side of the assignment somehow?

@frankc 2012-09-12 17:18:58

Acutally, I just storded the cumsum externally with a different name that doesn't exist inside the dt and that works fine.

@Matt Dowle 2012-09-12 17:27:15

But that would be whole extra line! Not very elegant :) But ok sometimes it's useful. In those cases best to start the variable name with ., or .. to avoid any potential masking if DT ever did contain that symbol as a column name in future (and stick to the convention that column names don't start with .). There are some feature requests to make it more robust to scope issues like that, such as adding .() and ..().

@frankc 2012-09-12 17:32:52

I replied before I noticed you edited your answer. My first thought had been eval(parse()) but for some reason I was having trouble getting it to work, when it dawned on me to just do it externally. This is a great answer with lots of things I didn't think about. Thanks for data.table in general, it's a great package.

@G. Grothendieck 2013-01-14 14:11:36

Note that you could use the quasi-perl type string interpolation of fn$ from the gsubfn package to improve the readability of the EVAL solution: library(gsubfn); fn$EVAL( "DT[,$colname:=cumsum($colname)]" ) .

@Matt Dowle 2013-01-14 14:38:56

@G.Grothendieck Nice! Didn't know about that. Thanks.

@jan-glx 2015-08-18 08:57:19

From ?':=' it seems like this answer needs an update: __________ "DT[i,colnamevector:=value,with=FALSE] # old syntax. The contents of colnamevector in calling scope determine the column names or positions to update (or add) _____________________________ DT[i,(colnamevector):=value] # same, shorthand. Now preferred. The parens are enough to stop the LHS being a symbol"

@Matt Dowle 2015-08-18 22:04:46

@YAK Thanks. Done.

@Lazarus Thurston 2018-03-18 07:55:50

@MattDowle, Apologies to reopen a very old question, but could you also clarify how to reference a column of a data table if the column number (which is also its name) is stored in an integer variable (which could be another column of the same data table)? I tried to pass the variable inside (), inside .() and inside c() ... also tried to convert from integer to character but it always returns the numbers themselves. But if we just hardcode the numbers like DT[,c("5","10")] it outputs the columns named 5 and 10.

@Matt Dowle 2018-03-19 17:39:07

@SanjayMehrotra Not quite following (please ask a new question with reproducible example) but maybe first item in NEWS for v1.10.2 helps.

@Sathish 2017-04-17 04:04:38

For multiple columns and a function applied on column values.

When updating the values from a function, the RHS must be a list object, so using a loop on .SD with lapply will do the trick.

The example below converts integer columns to numeric columns

a1 <- data.table(a=1:5, b=6:10, c1=letters[1:5])
sapply(a1, class)  # show classes of columns
#         a           b          c1 
# "integer"   "integer" "character" 

# column name character vector
nm <- c("a", "b")

# Convert columns a and b to numeric type
a1[, j = (nm) := lapply(.SD, as.numeric ), .SDcols = nm ]

sapply(a1, class)
#         a           b          c1 
# "numeric"   "numeric" "character" 

@efh0888 2015-06-30 16:26:42

*This is not an answer really, but I don't have enough street cred to post comments :/

Anyway, for anyone who might be looking to actually create a new column in a data table with a name stored in a variable, I've got the following to work. I have no clue as to it's performance. Any suggestions for improvement? Is it safe to assume a nameless new column will always be given the name V1?

colname <- as.name("users")
# Google Analytics query is run with chosen metric and resulting data is assigned to DT
DT2 <- DT[, sum(eval(colname, .SD)), by = country]
setnames(DT2, "V1", as.character(colname))

Notice I can reference it just fine in the sum() but can't seem to get it to assign in the same step. BTW, the reason I need to do this is colname will be based on user input in a Shiny app.

@neuropsych 2016-01-24 21:39:12

+1 for just working: I agree this must not be "the way" to do this, but having just spent like 45 minutes pouring over every SO post on this subject, this is the only solution that I have actually been able to get to work - thanks for taking the time to point it out!

@efh0888 2016-01-25 21:27:42

Glad I could help! Unfortunately, I never did find a more elegant solution directly using data.tables, although this 3 liner isn't terrible. In my scenario, I did realize a simpler alternative would have been to use tidyr to just make my data "long" instead of "wide", since based on user input, I could always filter on a single column rather than selecting from a set of columns.

@dracodoc 2016-08-10 19:45:36

It's not safe to assume V1 is the new name. For example, if you read csv with fread and there is a unnamed column, it will hasV1 name (and read.csv will give X). So it's possible your table already have a V1. Maybe just get the name by names(DT)[length(names(DT))]

Related Questions

Sponsored Content

1 Answered Questions

[SOLVED] R data.table grouped sum for column referenced by name stored in a variable

  • 2019-01-29 22:14:26
  • Laurimann
  • 39 View
  • 0 Score
  • 1 Answer
  • Tags:   r data.table

1 Answered Questions

[SOLVED] Multiply columns of a data.table by a vector

  • 2018-11-15 11:57:40
  • beni
  • 448 View
  • 0 Score
  • 1 Answer
  • Tags:   r data.table

1 Answered Questions

2 Answered Questions

[SOLVED] Match() and the dynamic selection of columns of a data.table

  • 2017-12-19 10:22:36
  • koteletje
  • 74 View
  • 1 Score
  • 2 Answer
  • Tags:   r data.table

3 Answered Questions

1 Answered Questions

1 Answered Questions

[SOLVED] data.table: why is it not always possible to pass column names directly?

  • 2015-12-20 19:32:21
  • PatrickT
  • 8232 View
  • 17 Score
  • 1 Answer
  • Tags:   r data.table

1 Answered Questions

[SOLVED] How to select columns in data.table using a character vector of certain column names?

  • 2015-08-24 14:01:20
  • bourbaki4481472
  • 40018 View
  • 28 Score
  • 1 Answer
  • Tags:   r data.table

1 Answered Questions

Sponsored Content