By moku


2014-12-03 15:05:38 8 Comments

I'm having trouble working out how to dummy code the following dataset.

Example data, let's say dataframe = mydata:

ID |     NAMES      |
-- | -------------- |
1  | 4444, 333, 456 |
2  | 333            |
3  | 456, 765       |

I'd like to cast only the unique values in NAMES as column variables and code whether each row has that value or not, i.e. 1 or 0.

Desired Output:

ID |     NAMES      | 4444 | 333 | 456 | 765 |
-- | -------------- |------|-----|-----|-----|
1  | 4444, 333, 456 |   1  |  1  |  1  |   0 |
2  | 333            |   0  |  1  |  0  |   0 |
3  | 456, 765       |   0  |  0  |  1  |   1 |

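What I've done so far is create a vector of the unique values:

library(stringr)   # str_split(), str_trim()
library(reshape2)  # dcast()

# Split each NAMES entry on commas
split <- str_split(string = mydata$NAMES, pattern = ",")

# Trim whitespace, keep the unique values, and drop empty strings
vec <- unique(str_trim(unlist(split)))
remove <- ""
vec <- as.data.frame(vec[! vec %in% remove])
colnames(vec) <- "var"
vecRef <- as.vector(vec$var)

# Cast the unique values into columns, then drop the leading "." column
namesCast <- dcast(data = vec, formula = . ~ var)
namesCast <- namesCast[, 2:ncol(namesCast)]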

This yields a vector of unique NAMES with spaces/irregularities removed. From there I have no idea how to do the matching/dummy coding so any help would be greatly appreciated!

1 comment

@A5C1D2H2I1M1N2O1R2T1 2014-12-03 15:10:55

You can use cSplit_e from my "splitstackshape" package, like this:

library(splitstackshape)
cSplit_e(mydata, "NAMES", sep = ",", type = "character", fill = 0)
#   ID          NAMES NAMES_333 NAMES_4444 NAMES_456 NAMES_765
# 1  1 4444, 333, 456         1          1         1         0
# 2  2            333         1          0         0         0
# 3  3       456, 765         0          0         1         1

If you want to see the underlying function that is called when you use those arguments, you can look at splitstackshape:::charMat, which takes a list generated by strsplit and creates a matrix from it.

Calling the function directly would give you something like this:

splitstackshape:::charMat(
  lapply(strsplit(as.character(mydata$NAMES), ","),
         function(x) gsub("^\\s+|\\s+$", "", x)))
#      333 4444 456 765
# [1,]   1    1   1  NA
# [2,]   1   NA  NA  NA
# [3,]  NA   NA   1   1 
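For readers who would rather not add a package dependency, the same split-then-match idea can be sketched in base R. This is only an illustration of the approach, not how charMat is actually implemented; the names parts, levs, dummies and result are made up for this example, and mydata is rebuilt from the sample data in the question.

```r
# Example data from the question
mydata <- data.frame(
  ID = 1:3,
  NAMES = c("4444, 333, 456", "333", "456, 765"),
  stringsAsFactors = FALSE
)

# Split each entry on commas and trim surrounding whitespace
parts <- lapply(strsplit(as.character(mydata$NAMES), ","), trimws)

# Unique values in order of first appearance, dropping empty strings
levs <- unique(unlist(parts))
levs <- levs[levs != ""]

# One row per record, one column per unique value: 1 if present, 0 otherwise
dummies <- do.call(rbind, lapply(parts, function(x) as.integer(levs %in% x)))
colnames(dummies) <- levs

# Attach the indicator columns to the original data
result <- cbind(mydata, dummies)
print(result)
```

Unlike the charMat call above, this fills absent values with 0 rather than NA, and it keeps the columns in order of first appearance (4444, 333, 456, 765) instead of sorting them.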

@moku 2014-12-03 15:21:44

Ha, I knew someone would just come back with one line of code that blows my mind. Thanks, it works great!
