By user1234440


2012-07-01 21:03:16 8 Comments

I have data in different columns but I don't know how to extract it to save it in another variable.

index  a   b   c
1      2   3   4
2      3   4   5

How do I select 'a', 'b' and save it in to df1?

I tried

df1 = df['a':'b']
df1 = df.ix[:, 'a':'b']

None seem to work.

15 comments

@tozCSS 2018-08-15 18:13:41

Starting with 0.21.0, using .loc or [] with a list with one or more missing labels is deprecated in favor of .reindex. So, the answer to your question is:

df1 = df.reindex(columns=['b','c'])

In prior versions, using .loc[list-of-labels] would work as long as at least 1 of the keys was found (otherwise it would raise a KeyError). This behavior is deprecated and now shows a warning message. The recommended alternative is to use .reindex().

Read more at Indexing and Selecting Data

@Kareem Jeiroudi 2019-05-26 19:21:50

I've seen several answers on that, but on remained unclear to me. How would you select those columns of interest? The answer to that is that if you have them gathered in a list, you can just reference the columns using the list.

Example

print(extracted_features.shape)
print(extracted_features)

(63,)
['f000004' 'f000005' 'f000006' 'f000014' 'f000039' 'f000040' 'f000043'
 'f000047' 'f000048' 'f000049' 'f000050' 'f000051' 'f000052' 'f000053'
 'f000054' 'f000055' 'f000056' 'f000057' 'f000058' 'f000059' 'f000060'
 'f000061' 'f000062' 'f000063' 'f000064' 'f000065' 'f000066' 'f000067'
 'f000068' 'f000069' 'f000070' 'f000071' 'f000072' 'f000073' 'f000074'
 'f000075' 'f000076' 'f000077' 'f000078' 'f000079' 'f000080' 'f000081'
 'f000082' 'f000083' 'f000084' 'f000085' 'f000086' 'f000087' 'f000088'
 'f000089' 'f000090' 'f000091' 'f000092' 'f000093' 'f000094' 'f000095'
 'f000096' 'f000097' 'f000098' 'f000099' 'f000100' 'f000101' 'f000103']

I have the following list/numpy array extracted_features, specifying 63 columns. The original dataset has 103 columns, and I would like to extract exactly those, then I would use

dataset[extracted_features]

And you will end up with this

enter image description here

This something you would use quite often in Machine Learning (more specifically, in feature selection). I would like to discuss other ways too, but I think that has already been covered by other stackoverflowers. Hope this've been helpful!

@Ankita 2018-10-15 11:43:43

One different and easy approach : iterating rows

using iterows

`df1= pd.DataFrame() #creating an empty dataframe
 for index,i in df.iterrows():
 df1.loc[index,'A']=df.loc[index,'A']
 df1.loc[index,'B']=df.loc[index,'B']
 df1.head()

@cs95 2019-06-09 03:49:09

Please do not recommend the use of iterrows(). It is a blatant enabler of the worst anti-pattern in the history of pandas.

@Ankita 2019-06-09 19:41:28

Could you please explain what do you mean by "worst anti-pattern" ?

@cs95 2019-06-09 19:41:57

@Elf 2019-08-16 14:07:01

IMHO, iterrows() should be the last option when using pandas.

@Puneet Sinha 2019-01-29 05:37:50

you can also use df.pop()

>>> df = pd.DataFrame([('falcon', 'bird',    389.0),
...                    ('parrot', 'bird',     24.0),
...                    ('lion',   'mammal',   80.5),
...                    ('monkey', 'mammal', np.nan)],
...                   columns=('name', 'class', 'max_speed'))
>>> df
     name   class  max_speed
0  falcon    bird      389.0
1  parrot    bird       24.0
2    lion  mammal       80.5
3  monkey  mammal 

>>> df.pop('class')
0      bird
1      bird
2    mammal
3    mammal
Name: class, dtype: object

>>> df
     name  max_speed
0  falcon      389.0
1  parrot       24.0
2    lion       80.5
3  monkey        NaN

let me know if this helps so for you , please use df.pop(c)

@ely 2012-07-02 02:43:02

The column names (which are strings) cannot be sliced in the manner you tried.

Here you have a couple of options. If you know from context which variables you want to slice out, you can just return a view of only those columns by passing a list into the __getitem__ syntax (the []'s).

df1 = df[['a','b']]

Alternatively, if it matters to index them numerically and not by their name (say your code should automatically do this without knowing the names of the first two columns) then you can do this instead:

df1 = df.iloc[:,0:2] # Remember that Python does not slice inclusive of the ending index.

Additionally, you should familiarize yourself with the idea of a view into a Pandas object vs. a copy of that object. The first of the above methods will return a new copy in memory of the desired sub-object (the desired slices).

Sometimes, however, there are indexing conventions in Pandas that don't do this and instead give you a new variable that just refers to the same chunk of memory as the sub-object or slice in the original object. This will happen with the second way of indexing, so you can modify it with the copy() function to get a regular copy. When this happens, changing what you think is the sliced object can sometimes alter the original object. Always good to be on the look out for this.

df1 = df.iloc[0,0:2].copy() # To avoid the case where changing df1 also changes df

To use iloc, you need to know the column positions (or indices). As the column positions may change, instead of hard-coding indices, you can use iloc along with get_loc function of columns method of dataframe object to obtain column indices.

{df.columns.get_loc(c):c for idx, c in enumerate(df.columns)}

Now you can use this dictionary to access columns through names and using iloc.

@Wes McKinney 2012-07-08 17:54:17

Note: df[['a','b']] produces a copy

@ely 2012-07-08 18:09:16

Yes this was implicit in my answer. The bit about the copy was only for use of ix[] if you prefer to use ix[] for any reason.

@hobs 2012-10-31 18:58:23

ix indexes rows, not columns. I thought the OP wanted columns.

@ely 2012-10-31 19:02:47

ix accepts slice arguments, so you can also get columns. For example, df.ix[0:2, 0:2] gets the upper left 2x2 sub-array just like it does for a NumPy matrix (depending on your column names of course). You can even use the slice syntax on string names of the columns, like df.ix[0, 'Col1':'Col5']. That gets all columns that happen to be ordered between Col1 and Col5 in the df.columns array. It is incorrect to say that ix indexes rows. That is just its most basic use. It also supports much more indexing than that. So, ix is perfectly general for this question.

@Antoine 2016-04-06 14:51:26

what if I want to select non-consecutive columns, e.g., index_retain=c(1,3,9)? in R I would just do my_data_frame[,index_retain]

@ely 2016-04-06 16:23:19

@Antoine The relevant documentation is linked here. The short answer is to use the iloc indexing helper, such as my_data_frame.iloc[:, [1, 3, 9]] in your example. Keep in mind that Python indexing is zero-based. More broadly, though, since a DataFrame is typically meant to represent a relational data structure with no implicit ordering on the columns, it is sometimes safer to just write your own helper functions that refer to the columns by name and remove any need to access them by position.

@ely 2016-04-06 16:26:01

For example, if you instantiate a DataFrame from a dict, since dict is inherently unordered and the order of iterated items will depend on the (arbitrary) order in which items were stored during the dict constructor, then from one execution of your program to the next, the column ordering could change. Similar things can happen if you serialize the data to e.g. a .csv file and then load from file in a different program. Many times the order will be stable, but it is the rare few corner cases when you assumed positional order was stable yet it was not that really cause problems.

@bincob 2017-03-11 03:16:32

The slicing in .ix() is inclusive when you use use labels and when the labels themselves are integers, then slicing happens inclusive of the ending index. Try out the below snippet... df = pd.DataFrame(np.ones([2,3])) print df df.ix[:, 1:2] = 0 print df

@Andrew Cassidy 2017-06-23 01:31:17

df.ix[:,0:2] does slice inclusive? df.ix[:,0:2].shape #(2, 3)? I know ix is deprecated

@Ted Petrou 2017-07-01 13:55:53

@AndrewCassidy Never use .ix again. If you want to slice with integers use .iloc which is exclusive of the last position just like Python lists.

@Archan Joshi 2018-01-13 01:43:20

Does the "mean" function avoid the null values? I mean if I had a total of 3 columns, and one of the columns was empty for a particular row, then the mean function should just return the average of the two columns. Does it do that?

@Bowen Liu 2018-09-19 20:33:18

How would you use the .iloc method if we need to select a and c column, which are not next to each other? Thanks.

@Connor 2019-03-24 20:17:20

@ely ix is deprecated.

@Limok Palantaemon 2019-07-18 15:08:52

Old thread, but requesting a clarification, since it probably isn't worth its own question. In the case of indexing like df[['a','b']] is the order of the columns guaranteed to be preserve (i.e a then b)? So for example df.loc[0, ['a','b']].to_numpy(), it would matter which order the columns were.

@324 2019-09-12 01:40:20

How would you take, say, only columns 2 and 5?

@ely 2019-09-12 12:19:16

@dte324 If your DataFrame is named df then use df.iloc[:, [1, 4]]. Usually if you want this type of access pattern, you'll already know these particular column names, and you can just use df.loc[:, ['name2', 'name5']] where 'name2' and 'name5' are your column string names for the respective columns you want, or look the names up with e.g. name2 = df.columns[1].

@pink.slash 2019-02-20 01:01:58

You can use pandas. I create the DataFrame:

    import pandas as pd
    df = pd.DataFrame([[1, 2,5], [5,4, 5], [7,7, 8], [7,6,9]], 
                      index=['Jane', 'Peter','Alex','Ann'],
                      columns=['Test_1', 'Test_2', 'Test_3'])

The DataFrame:

           Test_1  Test_2  Test_3
    Jane        1       2       5
    Peter       5       4       5
    Alex        7       7       8
    Ann         7       6       9

To select 1 or more columns by name:

    df[['Test_1','Test_3']]

           Test_1  Test_3
    Jane        1       5
    Peter       5       5
    Alex        7       8
    Ann         7       9

You can also use:

    df.Test_2

And yo get column Test_2

    Jane     2
    Peter    4
    Alex     7
    Ann      6

You can also select columns and rows from these rows using .loc(). This is called "slicing". Notice that I take from column Test_1to Test_3

    df.loc[:,'Test_1':'Test_3']

The "Slice" is:

            Test_1  Test_2  Test_3
     Jane        1       2       5
     Peter       5       4       5
     Alex        7       7       8
     Ann         7       6       9

And if you just want Peter and Ann from columns Test_1 and Test_3:

    df.loc[['Peter', 'Ann'],['Test_1','Test_3']]

You get:

           Test_1  Test_3
    Peter       5       5
    Ann         7       9

@ayhan 2016-04-30 12:39:08

As of version 0.11.0, columns can be sliced in the manner you tried using the .loc indexer:

df.loc[:, 'C':'E']

is equivalent of

df[['C', 'D', 'E']]  # or df.loc[:, ['C', 'D', 'E']]

and returns columns C through E.


A demo on a randomly generated DataFrame:

import pandas as pd
import numpy as np
np.random.seed(5)
df = pd.DataFrame(np.random.randint(100, size=(100, 6)), 
                  columns=list('ABCDEF'), 
                  index=['R{}'.format(i) for i in range(100)])
df.head()

Out: 
     A   B   C   D   E   F
R0  99  78  61  16  73   8
R1  62  27  30  80   7  76
R2  15  53  80  27  44  77
R3  75  65  47  30  84  86
R4  18   9  41  62   1  82

To get the columns from C to E (note that unlike integer slicing, 'E' is included in the columns):

df.loc[:, 'C':'E']

Out: 
      C   D   E
R0   61  16  73
R1   30  80   7
R2   80  27  44
R3   47  30  84
R4   41  62   1
R5    5  58   0
...

Same works for selecting rows based on labels. Get the rows 'R6' to 'R10' from those columns:

df.loc['R6':'R10', 'C':'E']

Out: 
      C   D   E
R6   51  27  31
R7   83  19  18
R8   11  67  65
R9   78  27  29
R10   7  16  94

.loc also accepts a boolean array so you can select the columns whose corresponding entry in the array is True. For example, df.columns.isin(list('BCD')) returns array([False, True, True, True, False, False], dtype=bool) - True if the column name is in the list ['B', 'C', 'D']; False, otherwise.

df.loc[:, df.columns.isin(list('BCD'))]

Out: 
      B   C   D
R0   78  61  16
R1   27  30  80
R2   53  80  27
R3   65  47  30
R4    9  41  62
R5   78   5  58
...

@Vivek Ananthan 2018-11-21 15:32:56

With pandas,

wit column names

dataframe[['column1','column2']]

with iloc, column index can be used like

dataframe[:,[1,2]]

with loc column names can be used like

dataframe[:,['column1','column2']]

hope it helps !

@W.Perrin 2018-01-03 07:56:07

If you want to get one element by row index and column name, you can do it just like df['b'][0]. It is as simple as you can image.

Or you can use df.ix[0,'b'],mixed usage of index and label.

Note: Since v0.20 ix has been deprecated in favour of loc / iloc.

@Harshit 2018-07-21 21:28:41

The different approaches discussed in above responses are based on the assumption that either the user knows column indices to drop or subset on, or the user wishes to subset a dataframe using a range of columns (for instance between 'C' : 'E'). pandas.DataFrame.drop() is certainly an option to subset data based on a list of columns defined by user (though you have to be cautious that you always use copy of dataframe and inplace parameters should not be set to True!!)

Another option is to use pandas.columns.difference(), which does a set difference on column names, and returns an index type of array containing desired columns. Following is the solution:

df = pd.DataFrame([[2,3,4],[3,4,5]],columns=['a','b','c'],index=[1,2])
columns_for_differencing = ['a']
df1 = df.copy()[df.columns.difference(columns_for_differencing)]
print(df1)

The output would be: b c 1 3 4 2 4 5

@Bazyli Debowski 2018-08-08 17:20:55

The copy() is not necessary. i.e: df1 = df[df.columns.difference(columns_for_differencing)] will return a new/copied dataframe. You will be able to modify df1 without altering df. Thank you, btw. This was exactly what I needed.

@Alvis 2017-05-02 09:41:52

I found this method to be very useful:

# iloc[row slicing, column slicing]
surveys_df.iloc [0:3, 1:4]

More details can be found here

@324 2019-09-12 01:40:55

How would you take, say, only columns 2 and 5?

@Julian Gorfer 2019-10-03 08:43:11

That would be surveys_df.iloc [:, [2,5]] then.

@hobs 2012-10-31 18:57:33

Assuming your column names (df.columns) are ['index','a','b','c'], then the data you want is in the 3rd & 4th columns. If you don't know their names when your script runs, you can do this

newdf = df[df.columns[2:4]] # Remember, Python is 0-offset! The "3rd" entry is at slot 2.

As EMS points out in his answer, df.ix slices columns a bit more concisely, but the .columns slicing interface might be more natural because it uses the vanilla 1-D python list indexing/slicing syntax.

WARN: 'index' is a bad name for a DataFrame column. That same label is also used for the real df.index attribute, a Index array. So your column is returned by df['index'] and the real DataFrame index is returned by df.index. An Index is a special kind of Series optimized for lookup of it's elements' values. For df.index it's for looking up rows by their label. That df.columns attribute is also a pd.Index array, for looking up columns by their labels.

@ely 2012-10-31 19:11:08

As I noted in my comment above, .ix is not just for rows. It is for general purpose slicing, and can be used for multidimensional slicing. It is basically just an interface to NumPy's usual __getitem__ syntax. That said, you can easily convert a column-slicing problem into a row-slicing problem by just applying a transpose operation, df.T. Your example uses columns[1:3], which is a little misleading. The result of columns is a Series; be careful not to just treat it like an array. Also, you should probably change it to be columns[2:3] to match up with your "3rd & 4th" comment.

@hobs 2016-02-04 17:26:58

@Mr.F: My [2:4] is correct. Your [2:3] is wrong. And using standard python slicing notation to generate a sequence/Series is not misleading IMO. But I like your bypass of the DataFrame interface to access the underlying numpy array with ix.

@ely 2016-02-04 18:05:49

You are correct in this case, but the point I was trying to make is that in general, slicing with labels in Pandas is inclusive of the slice endpoint (or at least this was the behavior in most previous Pandas versions). So if you retrieve df.columns and want to slice it by label, then you'd have different slice semantics than if you slice it by integer index position. I definitely did not explain it well in my previous comment though.

@hobs 2016-02-05 00:17:40

Ahh, now I see your point. I forgot that columns is an immutable Series and the getter has been overridden to use labels as indices. Thanks for taking the time to clarify.

@Martien Lubberink 2017-07-01 23:57:22

Note the Deprecation Warning: .ix is deprecated. Therefore this makes sense: newdf = df[df.columns[2:4]]

@zerovector 2016-02-04 14:05:35

I realize this question is quite old, but in the latest version of pandas there is an easy way to do exactly this. Column names (which are strings) can be sliced in whatever manner you like.

columns = ['b', 'c']
df1 = pd.DataFrame(df, columns=columns)

@Banjocat 2017-11-28 07:05:50

This can only be done on creation. The question is asking if you already have it in a dataframe.

@mhery 2019-08-16 13:46:55

@Banjocat, it works with an existing dataframe

@Muthu Chithambara Jothi 2014-09-03 11:30:59

You could provide a list of columns to be dropped and return back the DataFrame with only the columns needed using the drop() function on a Pandas DataFrame.

Just saying

colsToDrop = ['a']
df.drop(colsToDrop, axis=1)

would return a DataFrame with just the columns b and c.

The drop method is documented here.

@Wes McKinney 2012-07-08 17:55:12

In [39]: df
Out[39]: 
   index  a  b  c
0      1  2  3  4
1      2  3  4  5

In [40]: df1 = df[['b', 'c']]

In [41]: df1
Out[41]: 
   b  c
0  3  4
1  4  5

@kuanb 2017-02-14 20:30:58

What if I wanted to rename the column, for example something like: df[['b as foo', 'c as bar'] such that the output renames column b as foo and column c as bar?

@Greg 2017-08-25 22:48:43

df[['b', 'c']].rename(columns = {'b' : 'foo', 'c' : 'bar'})

Related Questions

Sponsored Content

22 Answered Questions

[SOLVED] Adding new column to existing DataFrame in Python pandas

10 Answered Questions

[SOLVED] How to select rows from a DataFrame based on column values?

23 Answered Questions

[SOLVED] Renaming columns in pandas

19 Answered Questions

[SOLVED] Add one row to pandas DataFrame

18 Answered Questions

[SOLVED] How to iterate over rows in a DataFrame in Pandas?

33 Answered Questions

[SOLVED] How do I UPDATE from a SELECT in SQL Server?

18 Answered Questions

[SOLVED] Get list from pandas DataFrame column headers

19 Answered Questions

[SOLVED] How to sort a dataframe by multiple column(s)

13 Answered Questions

[SOLVED] Delete column from pandas DataFrame

19 Answered Questions

[SOLVED] Drop data frame columns by name

  • 2011-01-05 14:34:29
  • Btibert3
  • 1317698 View
  • 799 Score
  • 19 Answer
  • Tags:   r dataframe r-faq

Sponsored Content