By jkebinger


2010-12-03 22:29:15 8 Comments

I'd like to take data of the form

before = data.frame(attr = c(1,30,4,6), type=c('foo_and_bar','foo_and_bar_2'))
  attr          type
1    1   foo_and_bar
2   30 foo_and_bar_2
3    4   foo_and_bar
4    6 foo_and_bar_2

and use split() on the column "type" from above to get something like this:

  attr type_1 type_2
1    1    foo    bar
2   30    foo  bar_2
3    4    foo    bar
4    6    foo  bar_2

I came up with something unbelievably complex involving some form of apply that worked, but I've since misplaced it. It seemed far too complicated to be the best way. I can use strsplit as below, but then it's unclear how to get the result back into 2 columns in the data frame.

> strsplit(as.character(before$type),'_and_')
[[1]]
[1] "foo" "bar"

[[2]]
[1] "foo"   "bar_2"

[[3]]
[1] "foo" "bar"

[[4]]
[1] "foo"   "bar_2"

Thanks for any pointers. I haven't quite grokked R lists just yet.

15 comments

@Joe 2018-02-17 03:44:05

base but probably slow:

n <- 1
for(i in strsplit(as.character(before$type),'_and_')){
     before[n, 'type_1'] <- i[[1]]
     before[n, 'type_2'] <- i[[2]]
     n <- n + 1
}

##   attr          type type_1 type_2
## 1    1   foo_and_bar    foo    bar
## 2   30 foo_and_bar_2    foo  bar_2
## 3    4   foo_and_bar    foo    bar
## 4    6 foo_and_bar_2    foo  bar_2

@Yannis P. 2017-11-01 17:26:04

The subject is almost exhausted, but I'd like to offer a solution to a slightly more general version where you don't know the number of output columns a priori. So, for example, you have

before = data.frame(attr = c(1,30,4,6), type=c('foo_and_bar','foo_and_bar_2', 'foo_and_bar_2_and_bar_3', 'foo_and_bar'))
  attr                    type
1    1             foo_and_bar
2   30           foo_and_bar_2
3    4 foo_and_bar_2_and_bar_3
4    6             foo_and_bar

We can't use tidyr's separate() directly because we don't know the number of result columns before the split, so I created a function that uses stringr to split a column, given the pattern and a name prefix for the generated columns. I hope the coding patterns used are correct.

library(stringr)
library(tibble)

split_into_multiple <- function(column, pattern = ", ", into_prefix){
  cols <- str_split_fixed(column, pattern, n = Inf)
  # Replace the ""s used to pad the matrix to the right with NAs, which are more useful
  cols[which(cols == "")] <- NA
  # Name the columns 'into_prefix_1', 'into_prefix_2', ..., 'into_prefix_m',
  # where m = number of columns of 'cols'
  m <- dim(cols)[2]
  colnames(cols) <- paste(into_prefix, 1:m, sep = "_")
  # as_tibble() replaces the deprecated as.tibble(); naming the matrix first
  # avoids the name-repair warning for unnamed columns
  return(as_tibble(cols))
}

We can then use split_into_multiple in a dplyr pipe as follows:

after <- before %>% 
  bind_cols(split_into_multiple(.$type, "_and_", "type")) %>% 
  # selecting those that start with 'type_' will remove the original 'type' column
  select(attr, starts_with("type_"))

> after
  attr type_1 type_2 type_3
1    1    foo    bar   <NA>
2   30    foo  bar_2   <NA>
3    4    foo  bar_2  bar_3
4    6    foo    bar   <NA>

And then we can use gather to tidy up...

after %>% 
  gather(key, val, -attr, na.rm = TRUE)

   attr    key   val
1     1 type_1   foo
2    30 type_1   foo
3     4 type_1   foo
4     6 type_1   foo
5     1 type_2   bar
6    30 type_2 bar_2
7     4 type_2 bar_2
8     6 type_2   bar
11    4 type_3 bar_3

@Swifty McSwifterton 2017-09-28 20:14:42

This question is pretty old, but I'll add the solution I found to be the simplest at present.

library(reshape2)
before = data.frame(attr = c(1,30,4,6), type=c('foo_and_bar','foo_and_bar_2'))
newColNames <- c("type1", "type2")
newCols <- colsplit(before$type, "_and_", newColNames)
after <- cbind(before, newCols)
after$type <- NULL
after

@Rich Scriven 2017-08-28 19:15:06

Since R version 3.4.0 you can use strcapture() from the utils package (included with base R installs), binding the output onto the other column(s).

out <- strcapture(
    "(.*)_and_(.*)",
    as.character(before$type),
    data.frame(type_1 = character(), type_2 = character())
)

cbind(before["attr"], out)
#   attr type_1 type_2
# 1    1    foo    bar
# 2   30    foo  bar_2
# 3    4    foo    bar
# 4    6    foo  bar_2

@A5C1D2H2I1M1N2O1R2T1 2014-09-27 15:46:59

To add to the options, you could also use my splitstackshape::cSplit function like this:

library(splitstackshape)
cSplit(before, "type", "_and_")
#    attr type_1 type_2
# 1:    1    foo    bar
# 2:   30    foo  bar_2
# 3:    4    foo    bar
# 4:    6    foo  bar_2

@Nicki 2017-08-03 13:21:16

3 years later - this option is working best for a similar problem I have - however the dataframe I am working with has 54 columns and I need to split all of them into two. Is there a way to do this using this method - short of typing out the above command 54 times? Many thanks, Nicki.

@A5C1D2H2I1M1N2O1R2T1 2017-08-04 16:12:23

@Nicki, Have you tried providing a vector of the column names or the column positions? That should do it....

@Nicki 2017-08-07 13:20:06

It wasn't just renaming the columns - I needed to literally split the columns as above, effectively doubling the number of columns in my df. The below was what I used in the end: df2 <- cSplit(df1, splitCols = 1:54, "/")

@Soumya Das 2017-05-26 18:08:56

tp <- c("a-c","d-e-f","g-h-i","m-n")

temp <- strsplit(as.character(tp), '-')

x <- c()
y <- c()
z <- c()

for (i in seq_along(temp)) {
  l <- length(temp[[i]])

  if (l == 2) {
    # Two pieces: the first goes to x, the second to z, and the middle column is missing
    x <- c(x, temp[[i]][1])
    y <- c(y, NA)   # a real NA rather than the string "NA"
    z <- c(z, temp[[i]][2])
  } else {
    # Three pieces: fill x, y and z in order
    x <- c(x, temp[[i]][1])
    y <- c(y, temp[[i]][2])
    z <- c(z, temp[[i]][3])
  }
}

# Build the data frame once, after the loop
df <- as.data.frame(cbind(x, y, z))

@42- 2010-12-03 23:35:27

Notice that sapply with "[" can be used to extract either the first or second items in those lists so:

before$type_1 <- sapply(strsplit(as.character(before$type),'_and_'), "[", 1)
before$type_2 <- sapply(strsplit(as.character(before$type),'_and_'), "[", 2)
before$type <- NULL

And here's a gsub method:

before$type_1 <- gsub("_and_.+$", "", before$type)
before$type_2 <- gsub("^.+_and_", "", before$type)
before$type <- NULL
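
One thing to watch with the gsub patterns above: they are greedy, so if the separator occurs more than once in a value, they keep only the first and the last piece. A minimal base R sketch, assuming you want to split on the first occurrence only (the vector x is hypothetical example data, not from the question):

x <- c("foo_and_bar", "foo_and_bar_2_and_bar_3")

# regexpr() locates only the first match in each string;
# invert = TRUE returns the text before and after that match
parts <- regmatches(x, regexpr("_and_", x), invert = TRUE)

type_1 <- sapply(parts, `[`, 1)
type_2 <- sapply(parts, `[`, 2)

type_1
type_2

Here type_2 keeps everything after the first "_and_", including any later separators.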

@David Arenburg 2015-10-14 14:14:40

5 years later adding the obligatory data.table solution

library(data.table) ## v 1.9.6+ 
setDT(before)[, paste0("type", 1:2) := tstrsplit(type, "_and_")]
before
#    attr          type type1 type2
# 1:    1   foo_and_bar   foo   bar
# 2:   30 foo_and_bar_2   foo bar_2
# 3:    4   foo_and_bar   foo   bar
# 4:    6 foo_and_bar_2   foo bar_2

We could also both make sure that the resulting columns will have correct types and improve performance by adding type.convert and fixed arguments (since "_and_" isn't really a regex)

setDT(before)[, paste0("type", 1:2) := tstrsplit(type, "_and_", type.convert = TRUE, fixed = TRUE)]

@lmo 2016-07-22 20:34:38

Here is a base R one liner that overlaps a number of previous solutions, but returns a data.frame with the proper names.

out <- setNames(data.frame(before$attr,
                  do.call(rbind, strsplit(as.character(before$type),
                                          split="_and_"))),
                  c("attr", paste0("type_", 1:2)))
out
  attr type_1 type_2
1    1    foo    bar
2   30    foo  bar_2
3    4    foo    bar
4    6    foo  bar_2

It uses strsplit to break up the variable, and data.frame with do.call/rbind to put the data back into a data.frame. The additional incremental improvement is the use of setNames to add variable names to the data.frame.
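
Note that do.call(rbind, ...) assumes every row splits into the same number of pieces; with ragged input, rbind recycles the shorter vectors. A rough base R sketch of one way around that, assuming padding with NA is acceptable (the vector type below is hypothetical example data):

type <- c("foo_and_bar", "foo_and_bar_2_and_bar_3")

pieces <- strsplit(type, "_and_", fixed = TRUE)
m <- max(lengths(pieces))

# `length<-` pads each character vector with NA up to m elements
padded <- lapply(pieces, `length<-`, m)

out <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(out) <- paste0("type_", seq_len(m))
out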

@hadley 2014-06-11 16:50:59

Another option is to use the new tidyr package.

library(dplyr)
library(tidyr)

before <- data.frame(
  attr = c(1, 30 ,4 ,6 ), 
  type = c('foo_and_bar', 'foo_and_bar_2')
)

before %>%
  separate(type, c("foo", "bar"), "_and_")

##   attr foo   bar
## 1    1 foo   bar
## 2   30 foo bar_2
## 3    4 foo   bar
## 4    6 foo bar_2

@Jelena-bioinf 2016-01-11 11:42:29

Is there a way to limit number of splits with separate? Let's say I want to split on '_' only once (or do it with str_split_fixed and adding columns to existing dataframe)?

@hadley 2016-01-12 00:00:08

Yes. See the docs

@hadley 2010-12-04 04:21:27

Use stringr::str_split_fixed

library(stringr)
str_split_fixed(before$type, "_and_", 2)

@LearneR 2015-07-28 06:53:12

This worked fine for my problem today as well, but it was adding a 'c' at the beginning of each row. Any idea why that is? left_right <- str_split_fixed(as.character(split_df),'\">',2)

@user3841581 2016-03-14 08:15:50

I would like to split on a pattern that contains "...", but when I apply that function it returns nothing. What could be the problem? My type is something like "test...score".

@thelatemail 2017-08-09 04:30:09

@user3841581 - old query of yours I know, but this is covered in the documentation - str_split_fixed("aaa...bbb", fixed("..."), 2) works fine with fixed() to "Match a fixed string" in the pattern= argument. A "." means 'any character' in regex.
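
The same applies in base R: strsplit() also treats its split argument as a regular expression unless fixed = TRUE is given. A small sketch using the example string from the comment above:

x <- "aaa...bbb"

# As a regex, "..." matches any three characters, so every chunk is consumed
# as a separator and only empty strings are left
regex_split <- strsplit(x, "...")[[1]]

# fixed = TRUE matches the three literal dots
fixed_split <- strsplit(x, "...", fixed = TRUE)[[1]]

regex_split
fixed_split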

@cloudscomputes 2017-09-15 03:28:43

Thanks hadley, very convenient method, but there is one thing that could be improved: if there is an NA in the original column, after separation it becomes several empty strings in the result columns, which is unwanted. I want the NA to stay NA after separation.
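
For what it's worth, the base R route keeps missing values missing: strsplit() returns NA for an NA input, so extracting the pieces with `[` yields NA rather than empty strings. A minimal sketch with hypothetical data:

x <- c("foo_and_bar", NA)

pieces <- strsplit(x, "_and_", fixed = TRUE)

# Indexing past the end of the single NA element also gives NA
type_1 <- sapply(pieces, `[`, 1)
type_2 <- sapply(pieces, `[`, 2)

data.frame(type_1, type_2)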

@maycca 2018-05-22 19:32:10

Works well i.e. if the separator is missing ! i.e. if I have a vector 'a<-c("1N", "2N")' that I would like to separate in columns '1,1, "N", "N"' I run 'str_split_fixed(s, "", 2)'. I am just not sure how to name my new columns in this approach, 'col1<-c(1,1)' and 'col2<-c("N", "N")'

@Ramnath 2010-12-04 02:09:23

here is a one liner along the same lines as aniko's solution, but using hadley's stringr package:

do.call(rbind, str_split(before$type, '_and_'))

@schultem 2013-03-07 09:46:23

this also works with strsplit from the base package

@Melka 2016-03-30 11:34:08

Good catch, best solution for me. Though a bit slower than with the stringr package.

@Aniko 2010-12-04 00:51:30

Yet another approach: use rbind on out:

before <- data.frame(attr = c(1,30,4,6), type=c('foo_and_bar','foo_and_bar_2'))  
out <- strsplit(as.character(before$type),'_and_') 
do.call(rbind, out)

     [,1]  [,2]   
[1,] "foo" "bar"  
[2,] "foo" "bar_2"
[3,] "foo" "bar"  
[4,] "foo" "bar_2"

And to combine:

data.frame(before$attr, do.call(rbind, out))

@alexis_laz 2016-11-10 18:23:33

Another alternative on newer R versions is strcapture("(.*)_and_(.*)", as.character(before$type), data.frame(type_1 = "", type_2 = ""))

@ashaw 2010-12-03 23:52:51

Another approach if you want to stick with strsplit() is to use the unlist() command. Here's a solution along those lines.

tmp <- matrix(unlist(strsplit(as.character(before$type), '_and_')), ncol=2,
   byrow=TRUE)
after <- cbind(before$attr, as.data.frame(tmp))
names(after) <- c("attr", "type_1", "type_2")

@Gavin Simpson 2010-12-03 23:36:58

An easy way is to use sapply() and the [ function:

before <- data.frame(attr = c(1,30,4,6), type=c('foo_and_bar','foo_and_bar_2'))
out <- strsplit(as.character(before$type),'_and_')

For example:

> data.frame(t(sapply(out, `[`)))
   X1    X2
1 foo   bar
2 foo bar_2
3 foo   bar
4 foo bar_2

sapply()'s result is a matrix and needs transposing and casting back to a data frame. It is then some simple manipulations that yield the result you wanted:

after <- with(before, data.frame(attr = attr))
after <- cbind(after, data.frame(t(sapply(out, `[`))))
names(after)[2:3] <- paste("type", 1:2, sep = "_")

At this point, after is what you wanted

> after
  attr type_1 type_2
1    1    foo    bar
2   30    foo  bar_2
3    4    foo    bar
4    6    foo  bar_2
