By TheDon


2020-06-29 22:35:09 8 Comments

Hi all and thanks for help in advance.

The problem I am trying to solve is as follows:

I have two columns within one CSV file: Column A and Column B.

There are certain patterns that need to be present in my data in each row underneath column A and B.

For example, if there is a "1" in row 1 of column A , there must be a "5" adjacent to it in row 1 of column B.

If there was a "1" in row two of Column A, and a "2" adjacent to it in row 2 of column B I would need this to be flagged and printed out as "does not follow pattern"

The rules go as follows:

  • Any time theres a 1 in column A, next to it there should be a 5 in column B

  • Any time theres a 3 in column A, next to it there should be a 6 in column B

  • Any time theres a 2 in column A, next to it there should be a 4 in column B

  • Anytime these rules are not followed a return statement should say "pattern not followed"

Here is where I am at on the code, I can't seem to figure out a way of doing this data check.

import numpy as np
import pandas as pd
import os  # filepaths
import glob
import getpass  # Login information


unane = getpass.getuser()

# Paths:
path2proj = os.path.join('C:', os.sep, 'Users', unane, 'Documents', 'Home','Expts','TSP', '')
path2data = os.path.join(path2proj,'Data','')
path2asys = os.path.join(path2proj,'Analysis', '')
path2figs = os.path.join(path2asys, 'figures', '')
path2hddm = os.path.join(path2asys, 'modeling', '')

df = pd.read_csv(path2data + '001_2012_Dec_19_0932_PST_train.csv')

os.chdir(path2data)

# extension = 'csv'
# all_filenames = [i for i in glob.glob('*.{}'.format(extension))]

all_filenames = glob.glob("*.csv")
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames]) 
combined_csv.to_csv("combined_csv.csv",index=False, encoding='utf-8-sig')

df = pd.read_csv(path2data + 'combined_csv.csv')
df['left_stim_number'].equals(df['right_stim_number'])
df = pd.read_csv(path2data + 'combined_csv.csv')
df1 = pd.DataFrame(df, columns=['left_stim_number'])
df2 = pd.DataFrame(df, columns=['right_stim_number'])
df1['match'] = np.where(df1['left_stim_number']== df2['right_stim_number'], True, False)

# Checking to see if there are any errors as all should add up to 7
df1['add'] = np.where(df1['left_stim_number']== df2['right_stim_number'], 0, df1['left_stim_number'] + df2['right_stim_number'])

# def see_correct(df):
#    if df1['add'] == ['7']:
#        return 1
#    else: 
#        return 0
    
# df1.tail(10)

combined_csv.isna().sum()
combined_csv.dropna()

df.loc[df['left_stim_number'] != df['right_stim_number'],:]






---

Example of CSV data

A(left_stim_number)        Column B (Right_stim_number)
1                 5

1                 5

3                 6

1                 5

3                 6

2                 4

2                 4

2                 4

1                 5

1 comments

@wwii 2020-06-30 00:36:20

Since we don't have an example I'll make one up - a pandas DataFrame with two columns of integers.

import numpy as np
import pandas as pd
np.random.seed(2)
df = pd.DataFrame({'colA':np.random.randint(0,10,100),
                   'colB':np.random.randint(0,10,100)})

>>> df.head()
   colA  colB
0     8     7
1     8     1
2     6     9
3     2     2
4     8     1

There might be more concise ways to do this nut this is pretty clear what is happening. This uses a lot of boolean indexing.

Your rules exclude any row in colA that is not 1,2,or 3. Rows in colB that are not 4,5,or 6 are also excluded. You can make a mask for all the excluded rows.

mask = ~df.colA.isin([1,2,3]) | ~df.colB.isin([4,5,6])

>>> df[mask].head()
   colA  colB
0     8     7
1     8     1
2     6     9
3     2     2
4     8     1
>>>

You can use the mask to assign "pattern not followed" to a new column for all those rows.

df.loc[mask,'colC'] = 'pattern not followed'

>>> df.head()
   colA  colB                  colC
0     8     7  pattern not followed
1     8     1  pattern not followed
2     6     9  pattern not followed
3     2     2  pattern not followed
4     8     1  pattern not followed

You can also use the mask to find all the rows that might match your criteria. Notice colC is NaN for these rows.

>>> df[~mask]
    colA  colB colC
13     3     5  NaN
35     2     6  NaN
39     1     5  NaN
61     2     5  NaN
62     1     5  NaN
65     1     6  NaN
69     1     5  NaN
70     2     4  NaN
77     3     5  NaN
92     1     6  NaN
98     2     5  NaN
>>>

Set colC the rows that meet the criteria to True(?).

df.loc[(df.colA == 1) & (df.colB == 5),'colC'] = True
df.loc[(df.colA == 3) & (df.colB == 6),'colC'] = True
df.loc[(df.colA == 2) & (df.colB == 4),'colC'] = True

That leaves some outliers.

>>> df.loc[df.colC.isna()]
    colA  colB colC
13     3     5  NaN
35     2     6  NaN
61     2     5  NaN
65     1     6  NaN
77     3     5  NaN
92     1     6  NaN
98     2     5  NaN

Which can be fixed with.

df.loc[df.colC.isna(),'colC'] = 'pattern not followed'

After looking at that only the last four operations are needed.

df.loc[(df.colA == 1) & (df.colB == 5),'colC'] = True
df.loc[(df.colA == 3) & (df.colB == 6),'colC'] = True
df.loc[(df.colA == 2) & (df.colB == 4),'colC'] = True

df.loc[df.colC.isna(),'colC'] = 'pattern not followed'


>>> df.loc[df.colC == True]
    colA  colB  colC
39     1     5  True
62     1     5  True
69     1     5  True
70     2     4  True
>>>

If the text in csv file looks like this-

4,9
8,3
4,6
2,4
7,5
1,3
 .
 .
 .

The Dataframe can be made with -

df = pd.read_csv('data.csv',names=['colA','colB'])

Related Questions

Sponsored Content

45 Answered Questions

[SOLVED] How do I merge two dictionaries in a single expression in Python?

21 Answered Questions

[SOLVED] What are metaclasses in Python?

27 Answered Questions

[SOLVED] Renaming columns in pandas

15 Answered Questions

[SOLVED] Delete column from pandas DataFrame

24 Answered Questions

[SOLVED] Adding new column to existing DataFrame in Python pandas

22 Answered Questions

[SOLVED] How to iterate over rows in a DataFrame in Pandas

26 Answered Questions

[SOLVED] Does Python have a ternary conditional operator?

62 Answered Questions

[SOLVED] Calling an external command from Python

10 Answered Questions

[SOLVED] Does Python have a string 'contains' substring method?

10 Answered Questions

[SOLVED] How to select rows from a DataFrame based on column values?

Sponsored Content