By Matias Andina


2017-01-18 20:02:35 8 Comments

I have a program that gives me data in this format

toy
                file_path Condition Trial.Num A B  C  ID A B  C   ID  A B  C    ID
1     root/some.extension  Baseline         1 2 3  5 car 2 1  7 bike  4 9  0 plane
2    root/thing.extension  Baseline         2 3 6 45 car 5 4  4 bike  9 5  4 plane
3     root/else.extension  Baseline         3 4 4  6 car 7 5  4 bike 68 7 56 plane
4 root/uniquely.extension Treatment         1 5 3  7 car 1 7 37 bike  9 8  7 plane
5  root/defined.extension Treatment         2 6 7  3 car 4 6  8 bike  9 0  8 plane

My goal is to tidy the format into something that at least can be easier to finally tidy with reshape having unique column names

tidy_toy
                 file_path Condition Trial.Num  A B  C    ID
1      root/some.extension  Baseline         1  2 3  5   car
2     root/thing.extension  Baseline         2  3 6 45   car
3      root/else.extension  Baseline         3  4 4  6   car
4  root/uniquely.extension Treatment         1  5 3  7   car
5   root/defined.extension Treatment         2  6 7  3   car
6      root/some.extension  Baseline         1  2 1  7  bike
7     root/thing.extension  Baseline         2  5 4  4  bike
8      root/else.extension  Baseline         3  7 5  4  bike
9  root/uniquely.extension Treatment         1  1 7 37  bike
10  root/defined.extension Treatment         2  4 6  8  bike
11     root/some.extension  Baseline         1  4 9  0 plane
12    root/thing.extension  Baseline         2  9 5  4 plane
13     root/else.extension  Baseline         3 68 7 56 plane
14 root/uniquely.extension Treatment         1  9 8  7 plane
15  root/defined.extension Treatment         2  9 0  8 plane

If I try to melt from toy it doesn't work because only the first ID column will get used for id.vars (hence everything will get tagged as cars). Identical variables will get dropped.

Here's the dput of both tables

   structure(list(file_path = structure(c(3L, 4L, 2L, 5L, 1L), .Label = c("root/defined.extension", 
    "root/else.extension", "root/some.extension", "root/thing.extension", 
    "root/uniquely.extension"), class = "factor"), Condition = structure(c(1L, 
    1L, 1L, 2L, 2L), .Label = c("Baseline", "Treatment"), class = "factor"), 
        Trial.Num = c(1L, 2L, 3L, 1L, 2L), A = 2:6, B = c(3L, 6L, 
        4L, 3L, 7L), C = c(5L, 45L, 6L, 7L, 3L), ID = structure(c(1L, 
        1L, 1L, 1L, 1L), .Label = "car", class = "factor"), A = c(2L, 
        5L, 7L, 1L, 4L), B = c(1L, 4L, 5L, 7L, 6L), C = c(7L, 4L, 
        4L, 37L, 8L), ID = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "bike", class = "factor"), 
        A = c(4L, 9L, 68L, 9L, 9L), B = c(9L, 5L, 7L, 8L, 0L), C = c(0L, 
        4L, 56L, 7L, 8L), ID = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "plane", class = "factor")), .Names = c("file_path", 
    "Condition", "Trial.Num", "A", "B", "C", "ID", "A", "B", "C", 
    "ID", "A", "B", "C", "ID"), class = "data.frame", row.names = c(NA, 
    -5L))


structure(list(file_path = structure(c(3L, 4L, 2L, 5L, 1L, 3L, 
4L, 2L, 5L, 1L, 3L, 4L, 2L, 5L, 1L), .Label = c("root/defined.extension", 
"root/else.extension", "root/some.extension", "root/thing.extension", 
"root/uniquely.extension"), class = "factor"), Condition = structure(c(1L, 
1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L), .Label = c("Baseline", 
"Treatment"), class = "factor"), Trial.Num = c(1L, 2L, 3L, 1L, 
2L, 1L, 2L, 3L, 1L, 2L, 1L, 2L, 3L, 1L, 2L), A = c(2L, 3L, 4L, 
5L, 6L, 2L, 5L, 7L, 1L, 4L, 4L, 9L, 68L, 9L, 9L), B = c(3L, 6L, 
4L, 3L, 7L, 1L, 4L, 5L, 7L, 6L, 9L, 5L, 7L, 8L, 0L), C = c(5L, 
45L, 6L, 7L, 3L, 7L, 4L, 4L, 37L, 8L, 0L, 4L, 56L, 7L, 8L), ID = structure(c(2L, 
2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L), .Label = c("bike", 
"car", "plane"), class = "factor")), .Names = c("file_path", 
"Condition", "Trial.Num", "A", "B", "C", "ID"), class = "data.frame", row.names = c(NA, 
-15L))

2 comments

@Moody_Mudskipper 2019-09-14 03:20:32

with tidyverse we can do :

library(tidyverse)
toy %>%
  repair_names(sep="_") %>%
  pivot_longer(-(1:3),names_to = c(".value","id"), names_sep="_") %>%
  select(-id)

#> # A tibble: 15 x 7
#>    file_path               Condition Trial.Num     A     B     C ID   
#>    <fct>                   <fct>         <int> <int> <int> <int> <fct>
#>  1 root/some.extension     Baseline          1     2     3     5 car  
#>  2 root/some.extension     Baseline          1     2     1     7 bike 
#>  3 root/some.extension     Baseline          1     4     9     0 plane
#>  4 root/thing.extension    Baseline          2     3     6    45 car  
#>  5 root/thing.extension    Baseline          2     5     4     4 bike 
#>  6 root/thing.extension    Baseline          2     9     5     4 plane
#>  7 root/else.extension     Baseline          3     4     4     6 car  
#>  8 root/else.extension     Baseline          3     7     5     4 bike 
#>  9 root/else.extension     Baseline          3    68     7    56 plane
#> 10 root/uniquely.extension Treatment         1     5     3     7 car  
#> 11 root/uniquely.extension Treatment         1     1     7    37 bike 
#> 12 root/uniquely.extension Treatment         1     9     8     7 plane
#> 13 root/defined.extension  Treatment         2     6     7     3 car  
#> 14 root/defined.extension  Treatment         2     4     6     8 bike 
#> 15 root/defined.extension  Treatment         2     9     0     8 plane
#> Warning message:
#> Expected 2 pieces. Missing pieces filled with `NA` in 4 rows [1, 2, 3, 4].

@Jaap 2017-01-18 20:13:56

You can use the make.unique-function to create unique column names. After that you can use melt from the data.table-package which is able to create multiple value-columns based on patterns in the columnnames:

# make the column names unique
names(toy) <- make.unique(names(toy))
# let the 'Condition' column start with a small letter 'c'
# so it won't be detected by the patterns argument from melt
names(toy)[2] <- tolower(names(toy)[2])

# load the 'data.table' package
library(data.table)
# tidy the data into long format
tidy_toy <- melt(setDT(toy), 
                 measure.vars = patterns('^A','^B','^C','^ID'), 
                 value.name = c('A','B','C','ID'))

which gives:

 > tidy_toy
                  file_path condition Trial.Num variable  A B  C    ID
 1:     root/some.extension  Baseline         1        1  2 3  5   car
 2:    root/thing.extension  Baseline         2        1  3 6 45   car
 3:     root/else.extension  Baseline         3        1  4 4  6   car
 4: root/uniquely.extension Treatment         1        1  5 3  7   car
 5:  root/defined.extension Treatment         2        1  6 7  3   car
 6:     root/some.extension  Baseline         1        2  2 1  7  bike
 7:    root/thing.extension  Baseline         2        2  5 4  4  bike
 8:     root/else.extension  Baseline         3        2  7 5  4  bike
 9: root/uniquely.extension Treatment         1        2  1 7 37  bike
10:  root/defined.extension Treatment         2        2  4 6  8  bike
11:     root/some.extension  Baseline         1        3  4 9  0 plane
12:    root/thing.extension  Baseline         2        3  9 5  4 plane
13:     root/else.extension  Baseline         3        3 68 7 56 plane
14: root/uniquely.extension Treatment         1        3  9 8  7 plane
15:  root/defined.extension Treatment         2        3  9 0  8 plane

Another option is to use a list of column-indexes for measure.vars:

tidy_toy <- melt(setDT(toy), 
                 measure.vars = list(c(4,8,12), c(5,9,13), c(6,10,14), c(7,11,15)), 
                 value.name = c('A','B','C','ID'))

Making the column-names unique isn't necessary then.


A more complicated method that creates names that are better distinguishable by the patterns argument:

# select the names that are not unique
tt <- table(names(toy))
idx <- which(names(toy) %in% names(tt)[tt > 1])
nms <- names(toy)[idx]

# make them unique
names(toy)[idx] <- paste(nms, 
                         rep(seq(length(nms) / length(names(tt)[tt > 1])), 
                             each = length(names(tt)[tt > 1])), 
                         sep = '.')

# your columnnames are now unique:
> names(toy)
 [1] "file_path" "Condition" "Trial.Num" "A.1"       "B.1"       "C.1"       "ID.1"      "A.2"      
 [9] "B.2"       "C.2"       "ID.2"      "A.3"       "B.3"       "C.3"       "ID.3"     

# tidy the data into long format
tidy_toy <- melt(setDT(toy), 
                 measure.vars = patterns('^A.\\d','^B.\\d','^C.\\d','^ID.\\d'), 
                 value.name = c('A','B','C','ID'))

which will give the same end-result.


As mentioned in the comments, the janitor-package can be helpful for this problem as well. The clean_names() works similar as the make.unique function. See here for an explanation.

@Matias Andina 2017-01-18 20:22:57

This is very creative! I will try to play with the patterns of the variables (I have more columns in my original data than A,B,C)

@Jake Kaupp 2017-01-18 20:24:20

I'm a big fan of the janitor package for cleaning data frames. Helps for situations like this.

Related Questions

Sponsored Content

16 Answered Questions

[SOLVED] Convert data.frame columns from factors to characters

  • 2010-05-17 16:52:02
  • Mike Dewar
  • 541405 View
  • 327 Score
  • 16 Answer
  • Tags:   r dataframe

10 Answered Questions

[SOLVED] How to select rows from a DataFrame based on column values?

23 Answered Questions

[SOLVED] Renaming columns in pandas

19 Answered Questions

[SOLVED] How to sort a dataframe by multiple column(s)

15 Answered Questions

[SOLVED] Selecting multiple columns in a pandas dataframe

16 Answered Questions

[SOLVED] Create an empty data.frame

  • 2012-05-21 16:35:18
  • Jeff Allen
  • 758842 View
  • 446 Score
  • 16 Answer
  • Tags:   r dataframe r-faq

13 Answered Questions

[SOLVED] Delete column from pandas DataFrame

19 Answered Questions

[SOLVED] Drop data frame columns by name

  • 2011-01-05 14:34:29
  • Btibert3
  • 1320097 View
  • 800 Score
  • 19 Answer
  • Tags:   r dataframe r-faq

16 Answered Questions

[SOLVED] Remove rows with all or some NAs (missing values) in data.frame

Sponsored Content