By shib


2018-05-16 15:09:27 8 Comments

I have the following dataframe:

    df <- structure(list(country = c("Ghana", "Eritrea", "Ethiopia", "Ethiopia", 
"Congo - Kinshasa", "Ethiopia", "Ethiopia", "Ghana", "Botswana", 
"Nigeria"), CommodRank = c(1L, 2L, 3L, 1L, 3L, 1L, 1L, 1L, 1L, 
1L), topCommodInCountry = c(TRUE, FALSE, FALSE, TRUE, FALSE, 
TRUE, TRUE, TRUE, TRUE, TRUE), Main_Commod = c("Gold", "Copper", 
"Nickel", "Gold", "Gold", "Gold", "Gold", "Gold", "Diamonds", 
"Iron Ore")), row.names = c(NA, -10L), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"), vars = "country", drop = TRUE, indices = list(
    8L, 4L, 1L, c(2L, 3L, 5L, 6L), c(0L, 7L), 9L), group_sizes = c(1L, 
1L, 1L, 4L, 2L, 1L), biggest_group_size = 4L, labels = structure(list(
    country = c("Botswana", "Congo - Kinshasa", "Eritrea", "Ethiopia", 
    "Ghana", "Nigeria")), row.names = c(NA, -6L), class = "data.frame", vars = "country", drop = TRUE, .Names = "country"), .Names = c("country", 
"CommodRank", "topCommodInCountry", "Main_Commod"))

df

            country CommodRank topCommodInCountry Main_Commod
1             Ghana          1               TRUE        Gold
2           Eritrea          2              FALSE      Copper
3          Ethiopia          3              FALSE      Nickel
4          Ethiopia          1               TRUE        Gold
5  Congo - Kinshasa          3              FALSE        Gold
6          Ethiopia          1               TRUE        Gold
7          Ethiopia          1               TRUE        Gold
8             Ghana          1               TRUE        Gold
9          Botswana          1               TRUE    Diamonds
10          Nigeria          1               TRUE    Iron Ore  

I am trying to add another column showing the top commodity (top CommodRank) for every country in this dataset, but I'm not sure how. I'm able to label 'topcommod' with the 'Main_Commod' where CommodRank == 1, but I want to copy this same value to cases where CommodRank != 1. Looking below, both Ethiopia values at rows 3 & 4 should read 'Gold'.

df %>% mutate(topcommod = ifelse(CommodRank == 1, Main_Commod, 'unknown'))


            country CommodRank topCommodInCountry Main_Commod topcommod
1             Ghana          1               TRUE        Gold      Gold
2           Eritrea          2              FALSE      Copper   unknown
3          Ethiopia          3              FALSE      Nickel   unknown
4          Ethiopia          1               TRUE        Gold      Gold
5  Congo - Kinshasa          3              FALSE        Gold   unknown
6          Ethiopia          1               TRUE        Gold      Gold
7          Ethiopia          1               TRUE        Gold      Gold
8             Ghana          1               TRUE        Gold      Gold
9          Botswana          1               TRUE    Diamonds  Diamonds
10          Nigeria          1               TRUE    Iron Ore  Iron Ore

I'm ideally looking for a dplyr solution I can add to an existing long series of pipe %>% function calls, but any solution would help.

3 comments

@docendo discimus 2018-05-16 15:13:28

IIUC, there are multiple ways to do this, for example:

df %>% mutate(topCom = if(!any(topCommodInCountry)) "unknown" 
                       else Main_Commod[which.max(topCommodInCountry)])

# A tibble: 10 x 5
# Groups:   country [6]
   country          CommodRank topCommodInCountry Main_Commod topCom  
   <chr>                 <int> <lgl>              <chr>       <chr>   
 1 Ghana                     1 TRUE               Gold        Gold    
 2 Eritrea                   2 FALSE              Copper      unknown 
 3 Ethiopia                  3 FALSE              Nickel      Gold    
 4 Ethiopia                  1 TRUE               Gold        Gold    
 5 Congo - Kinshasa          3 FALSE              Gold        unknown 
 6 Ethiopia                  1 TRUE               Gold        Gold    
 7 Ethiopia                  1 TRUE               Gold        Gold    
 8 Ghana                     1 TRUE               Gold        Gold    
 9 Botswana                  1 TRUE               Diamonds    Diamonds
10 Nigeria                   1 TRUE               Iron Ore    Iron Ore

Regarding OP's question in comment how to handle ties of multiple top Commodities, you could do the following:

df %>% 
  mutate(topCom = if(!any(topCommodInCountry)) "unknown" 
              else paste(unique(Main_Commod[topCommodInCountry]), collapse = "/"))

If there are multiple unique top Commodities in a country, they will be paste together into a single string, separated by /.

@shib 2018-05-16 15:43:50

Thanks so much! Off the top of your head is there an obvious way to split and label ties here, such that topCom gets assigned to something like "Gold / Diamonds / ..."? (Say there are 2 or more Main_Commods with CommodRank == 1)

@shib 2018-05-16 17:28:23

Nevermind you just use 'which' instead of 'which.max' to get all row indices, which you can then access and paste unique names together : df %>% mutate(topCom = Main_Commod[which(topCommodInCountry == max(topCommodInCountry))]) %>% unique %>% paste (sep = '', collapse = '/'))

@docendo discimus 2018-05-17 07:11:07

@shib, I updated my answer

@shib 2018-05-17 16:13:01

I see you removed which.max from the answer, could you say why it runs without the function?

@DJV 2018-05-16 16:29:20

It's not an answer but learning greatly from @docendo discimus answer, it took me a second to understand the "if negative" (!any(topCommodInCountry)), and I was wondering if it's only me or it would take my computer a second more to do that too :)

Using the same dataset I examined the idea of making the if else positive. First I tested for identical between the two solutions:

identical(
  #Negative
  df %>% 
    mutate(topCom = if(!any(topCommodInCountry)) "unknown" 
           else Main_Commod[which.max(topCommodInCountry)]), 
  #Positive
  df %>% 
    mutate(topCom = if(any(topCommodInCountry)) Main_Commod[which.max(topCommodInCountry)] 
           else "unknown"))

[1] TRUE

Next, I tested the benchmark of the two:

require(rbenchmark)

benchmark("Negative" = {
  df %>% 
    mutate(topCom = if(!any(topCommodInCountry)) "unknown" 
           else Main_Commod[which.max(topCommodInCountry)])
},
"Positive" = {
  df %>% 
    mutate(topCom = if(any(topCommodInCountry)) Main_Commod[which.max(topCommodInCountry)] 
           else  "unknown")
},
replications = 10000,
columns = c("test", "replications", "elapsed",
            "relative", "user.self", "sys.self"))

The difference is not that big but I'm assuming that with a bigger dataset it will increase.

      test replications elapsed relative user.self sys.self
1 Negative        10000   12.59    1.015     12.44        0
2 Positive        10000   12.41    1.000     12.30        0 

@Ryan 2018-05-16 16:47:35

On my four year old laptop, running for(i in 1:1e6) !TRUE takes about 1/10th of a second. Not worth worrying about.

@Ryan 2018-05-16 17:06:01

It may be worth removing unnecessary ! just for readability, but for what it's worth I think it's pretty intuitive if ! is read as "not" i.e. "If not any topCommodInCountry"

@Nate 2018-05-16 15:21:47

another pattern with dplyr...

df %>% arrange(CommodRank) %>%
    mutate(topCommod = Main_Commod[1])

@akrun 2018-05-16 15:23:47

It is already grouped by 'country'

@Ryan 2018-05-16 15:28:56

Sorting the entire data frame (group) will be much slower than getting the max of a single column (group)

@DJV 2018-05-16 16:37:05

In addition to @Ryan comment, if you don't arrange your dataset correctly, doing 'Main_Commod[1]' can be very dangerous/wrong

Related Questions

Sponsored Content

1 Answered Questions

[SOLVED] Create new columns based on vector in dplyr

  • 2018-03-14 14:24:12
  • C. Braun
  • 61 View
  • 0 Score
  • 1 Answer
  • Tags:   r dplyr

1 Answered Questions

dplyr: Create new column based on rows from another column

  • 2018-01-10 19:19:49
  • Nadescha
  • 168 View
  • -1 Score
  • 1 Answer
  • Tags:   r dataframe dplyr

3 Answered Questions

[SOLVED] Summarize data with condition and create new row (dplyr)

  • 2017-12-03 14:45:51
  • Mandy
  • 67 View
  • 0 Score
  • 3 Answer
  • Tags:   r dplyr

1 Answered Questions

[SOLVED] grouping two columns based on a condition in dplyr

  • 2017-11-07 18:51:52
  • Niranjan
  • 55 View
  • 0 Score
  • 1 Answer
  • Tags:   r dplyr

2 Answered Questions

[SOLVED] Create a ranking variable with dplyr?

  • 2014-09-29 18:25:58
  • Ignacio
  • 15379 View
  • 26 Score
  • 2 Answer
  • Tags:   r dplyr

2 Answered Questions

[SOLVED] New List Column from other vector columns with dplyr and rowwise

  • 2017-09-25 01:12:44
  • New_code
  • 109 View
  • 1 Score
  • 2 Answer
  • Tags:   r dplyr

2 Answered Questions

[SOLVED] R: creating new variable with conditions using dplyr

  • 2017-06-01 20:07:06
  • Gan Uger
  • 75 View
  • 2 Score
  • 2 Answer
  • Tags:   r dplyr

2 Answered Questions

[SOLVED] How can I create a new column based on conditional statements and dplyr?

  • 2016-03-15 05:59:29
  • maximusdooku
  • 1047 View
  • 1 Score
  • 2 Answer
  • Tags:   r dplyr

1 Answered Questions

[SOLVED] Using Dplyr In A Function To Create New Dataframes

  • 2015-11-20 06:01:28
  • ClintWeathers
  • 484 View
  • 0 Score
  • 1 Answer
  • Tags:   r dplyr

2 Answered Questions

[SOLVED] how to make a bar plot for a list of dataframes?

  • 2014-02-27 15:52:03
  • antonio
  • 311 View
  • 1 Score
  • 2 Answer
  • Tags:   r list plot

Sponsored Content