By nomad10

2019-05-15 16:16:13 8 Comments

I have a DataFrame like this:

import pandas as pd
import numpy as np

data = {'col1': ['A', 'B', 'B', 'A', 'B', 'C', 'B', 'B', 'B',
                 'A', 'C', 'A', 'B', 'C'],
        'col2': [np.nan, 'comment1', 'comment2', np.nan, 'comment3', np.nan,
                 'comment4', 'comment5', 'comment6',
                 np.nan, np.nan, np.nan, 'comment7', np.nan]}

frame = pd.DataFrame(data)

col1  col2
A     NaN
B     comment1
B     comment2
A     NaN
B     comment3
C     NaN
B     comment4
B     comment5
B     comment6
A     NaN
C     NaN
A     NaN
B     comment7
C     NaN

Each row with col1 == 'B' has a comment which will be a string. I need to aggregate the comments and fill the preceding row (where col1 != 'B') with the resulting aggregated string.

Any given row where col1 != 'B' could have zero, one, or many corresponding comment rows (col1 == 'B'), which seems to be the crux of the problem. I can't just use fillna(method='bfill') etc.
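For illustration, here is a minimal sketch (on a shortened version of the frame above) of why a plain backfill falls short: it copies only the single next comment into each NaN and never aggregates a run of comments.

```python
import pandas as pd
import numpy as np

# Shortened version of the question's frame
frame = pd.DataFrame({
    'col1': ['A', 'B', 'B', 'A', 'B'],
    'col2': [np.nan, 'comment1', 'comment2', np.nan, 'comment3'],
})

# bfill fills each NaN with only the next non-null value,
# so row 0 gets 'comment1' and 'comment2' is never aggregated in
filled = frame['col2'].bfill()
```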

I have looked into iterrows(), groupby(), while loops, and tried to build my own function, but I don't think I fully understand how any of them work.

Finished product should look like this:

col1    col2
A       comment1 + comment2
B       comment1
B       comment2
A       comment3
B       comment3
C       comment4 + comment5 + comment6
B       comment4
B       comment5
B       comment6
A       NaN
C       NaN
A       comment7
B       comment7
C       NaN

Eventually I will be dropping all rows where col1 == 'B', but for now I'd like to keep them for verification.


@nomad10 2019-05-22 20:11:08

from functools import reduce

# Tag each non-B row and its following B rows with the same group id
df['col_group'] = -1
col_group = 0
for i in df.index:
    if df.loc[i, 'col1'] != 'B':
        col_group += 1
    df.loc[i, 'col_group'] = col_group

comments = df[df['col1'] == 'B']
transactions = df[df['col1'] != 'B']
agg_comments = comments.groupby('col_group')['col2'].apply(
    lambda x: reduce(lambda i, j: i + "&$#" + j, x)).reset_index()
df = pd.merge(transactions, agg_comments, on='col_group', how='outer')

@nomad10 2019-05-28 15:43:09

So, this does work but it is extremely slow for hundreds of thousands to 1 million rows. Does anyone have any suggestions on how this could be sped up?
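One likely speed-up (an editorial sketch, not from the thread, assuming the same frame as in the question): the row-by-row Python loop that builds col_group can be replaced with a vectorized cumsum, and the merge with an index-aligned map, which also leaves groups without comments as NaN instead of raising a shape error.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'col1': ['A', 'B', 'B', 'A', 'B', 'C'],
    'col2': [np.nan, 'comment1', 'comment2', np.nan, 'comment3', np.nan],
})

# Each non-B row starts a new group; cumsum replaces the explicit loop
df['col_group'] = (df['col1'] != 'B').cumsum()

# Join the B-row comments within each group, then map them back onto
# the non-B rows; a group with no B rows simply stays NaN
agg = df.loc[df['col1'] == 'B'].groupby('col_group')['col2'].agg(' + '.join)
mask = df['col1'] != 'B'
df.loc[mask, 'col2'] = df.loc[mask, 'col_group'].map(agg)
```

Both steps are vectorized, so this should scale far better to the million-row case than per-row .loc assignments.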

@yatu 2019-05-15 16:22:46

Here's one way using GroupBy with a custom grouper to concatenate the strings where col1 is B:

where_a = frame.col1.ne('B')
g = where_a.cumsum()
com = frame[frame.col1.eq('B')].groupby(g).col2.agg(' + '.join)
till = (frame.col2.isna() & frame.col2.shift(-1).notna())[::-1].idxmax()
ixs = where_a[:till+1].reindex(frame.index).fillna(False)
frame.loc[ixs, 'col2'] = com.values


    col1                         col2
0     A             comment1 + comment2
1     B                        comment1
2     B                        comment2
3     A                        comment3
4     B                        comment3
5     C  comment4 + comment5 + comment6
6     B                        comment4
7     B                        comment5
8     B                        comment6
9     A                             NaN
10    C                             NaN

@nomad10 2019-05-15 16:42:59

So, this did work for the example I gave, but not for the larger dataframe I am working with. I am getting the following ValueError: "Must have equal len keys and value when setting with an iterable."

@yatu 2019-05-15 16:46:33

This means that you have some col1 value that does not contain A or B @nomad10, perhaps consecutive As? Let me check these cases later

@yatu 2019-05-15 16:50:16

So I'm assuming that any value other than B must be filled with the following comments? @nomad10

@nomad10 2019-05-15 16:59:24

Yes, your assumption is correct. Tried the updated code but still got the same error.

@yatu 2019-05-15 19:46:17

Okay, should work now @nomad10. Make sure the NaNs are proper NaNs, i.e. np.nan. A string 'NaN' isn't useful

@nomad10 2019-05-15 20:09:55

Okay, I've verified NaNs and still getting same issue

@nomad10 2019-05-15 20:20:50

Yes it is, I must be missing something.

@nomad10 2019-05-16 12:25:11

I've updated the question to include the scenario where there are additional comments after the last few NaN values. Adding these lines in is a better reflection of the full dataset. Also, the example has been updated to reflect the changes. If I run the current answer against the current example, I get the following ValueError: shape mismatch: value array of shape (4,) could not be broadcast to indexing result of shape (6,).

@yatu 2019-05-16 12:27:52

Hmm okay, I'll check in a bit. We'll get it to work some day @nomad10 :)

@nomad10 2019-05-20 15:44:25

Hi @yatu, I'm still struggling to figure this out. Do you know of a way to get more people to see the question?

@nomad10 2019-05-20 22:33:24

Figured it out with the help of a friend. I'll post the answer shortly

@yatu 2019-05-21 07:11:14

Hi sorry I completely forgot about it, and was too busy yesterday to look into it. Glad you got it to work @nomad10
