By LinkCoder


2019-07-06 09:29:18 8 Comments

A string is given as input (e.g. "What is your name?"). The input always contains a question, which I want to extract. The problem is that the input always comes with extra, unwanted text around the question.

So the input could be (but not limited to) the following:

1- "eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn"
2- "What is your\nlastname and email?\ndasf?lkjas"
3- "askjdmk.\nGiven your skills\nhow would you rate yourself?\nand your name? dasf?"

(Notice that in the third input, the question starts with the word "Given" and ends with "yourself?")

The above input examples are generated by the pytesseract OCR library by scanning an image and converting it into text.

I only want to extract the question from the garbage input and nothing else.

I tried using the string method find('?', 1) to get the index of the end of the question (assuming for now that the first question mark always ends the question and is not part of the text I don't want). But I can't figure out how to get the index of the first letter of the question. I tried looping in reverse to find the first \n before the question mark, but the question isn't always preceded by \n.

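def extractQuestion(input):
    # Index of the first question mark, assumed to end the question
    index_end_q = input.find('?', 1)
    index_first_letter_of_q = 0  # TODO: find where the question starts
    # Slice up to and including the question mark
    question = input[index_first_letter_of_q:index_end_q + 1]
    return question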


2 comments

@game0ver 2019-07-06 09:42:46

A way to find the question's first word index would be to search for the first word that has an actual meaning (you're interested in English words I suppose). A way to do that would be using pyenchant:

#!/usr/bin/env python

import enchant

GLOSSARY = enchant.Dict("en_US")

def isWord(word):
    # Dict.check() already returns a bool
    return GLOSSARY.check(word)

sentences = [
"eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn",
"What is your\nlastname and email?\ndasf?lkjas",
"\nGiven your skills\nhow would you rate yourself?\nand your name? dasf?"]

for sentence in sentences:
    for i,w in enumerate(sentence.split()):
        if isWord(w):
            print('index: {} => {}'.format(i, w))
            break

The above piece of code gives as a result:

index: 3 => What
index: 0 => What
index: 0 => Given

@LinkCoder 2019-07-06 09:49:15

Yes, this would be a great solution for the problem, but what if the text BEFORE the question is also a valid English word? I updated the question, by the way.

@game0ver 2019-07-06 09:53:20

@LinkCoder Then the problem is much more complicated than the one you initially described. Maybe NLTK could help you recognize "logical" sentences, but as far as I know that's not an easy problem to solve (and I think it hadn't been solved at the time I posted this answer).

@LinkCoder 2019-07-06 09:59:18

Alright, then assuming the text before the question is garbage, how could I find the index of the first letter of the question within the whole input? Then I could slice it.

@game0ver 2019-07-06 10:04:06

@LinkCoder that's easy, you can use Python's built-in string method find().
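A minimal sketch of that idea, assuming the text before the question contains no dictionary words: take the first token that passes the word check, locate it with find(), and slice up to the first question mark after it. The KNOWN_WORDS set and the names is_word and extract_question are hypothetical stand-ins (not from the answer) so the snippet runs without pyenchant; GLOSSARY.check could be used in place of the set lookup.

```python
# Hypothetical stand-in for the pyenchant dictionary check used in the answer.
KNOWN_WORDS = {"what", "is", "your", "name", "given", "skills"}

def is_word(token):
    return token.lower() in KNOWN_WORDS

def extract_question(text):
    """Return the text from the first recognized word through the first '?' after it."""
    for token in text.split():
        if is_word(token):
            # Character index of the first occurrence of that token.
            # Caveat: find() matches substrings, so an earlier garbage token
            # that happens to contain the word would be picked up instead.
            start = text.find(token)
            end = text.find("?", start)   # first question mark at or after the word
            if end == -1:
                return None               # no question mark found
            return text[start:end + 1]
    return None                           # no recognized word at all

print(extract_question("eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn"))
```

This only holds under the garbage-prefix assumption; if the unwanted text can contain real words, the slice will start too early.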

@tobias_k 2019-07-06 09:43:31

You could try a regular expression like \b[A-Z][a-z][^?]+\?, meaning:

  • The start of a word \b with an upper case letter [A-Z] followed by a lower case letter [a-z],
  • then a sequence of non-questionmark-characters [^?]+,
  • followed by a literal question mark \?.

This can still have some false positives or misses, e.g. if a question actually starts with an acronym, or if there is a name in the middle of the question, but for your examples it works quite well.

>>> tests = ["eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn",
             "What is your\nlastname and email?\ndasf?lkjas",
             "\nGiven your skills\nhow would you rate yourself?\nand your name? dasf?"]

>>> import re
>>> p = r"\b[A-Z][a-z][^?]+\?"
>>> [re.search(p, t).group() for t in tests]
['What is your name?',
 'What is your\nlastname and email?',
 'Given your skills\nhow would you rate yourself?']

If that's one blob of text, you can use findall instead of search:

>>> text = "\n".join(tests)
>>> re.findall(p, text)
['What is your name?',
 'What is your\nlastname and email?',
 'Given your skills\nhow would you rate yourself?']

Actually, this also seems to work reasonably well for questions with names in them:

>>> t = "asdGARBAGEasd\nHow did you like St. Petersburg? more stuff with ?" 
>>> re.search(p, t).group()
'How did you like St. Petersburg?'
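The caveats mentioned earlier can be checked directly. The inputs below are made-up examples, not OCR output from the question: one where the question starts with an acronym, and one where the question starts in lower case and contains a name.

```python
import re

p = r"\b[A-Z][a-z][^?]+\?"

# Miss: the question starts with an acronym, so no word begins with
# an upper-case letter followed by a lower-case one.
print(re.search(p, "xx yy\nNASA was founded when?"))   # prints None

# False positive: the question starts in lower case, so the match
# begins at the name instead of at the start of the question.
print(re.search(p, "who is Bob?").group())             # prints Bob?
```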

@tobias_k 2019-07-06 10:59:36

@LinkCoder Wouldn't that be exactly what I did in the lower code with text?

@LinkCoder 2019-07-06 11:04:09

Yes, I figured it out. I just used re.search(regex_pattern, input, flags=re.S).group().replace("\n", " "), so thanks for that. By the way, is there a regex that doesn't miss names but does the same thing as what you did above?

@tobias_k 2019-07-06 11:06:20

Actually, I think it should work okay if there are names in the question, as long as the start of the question is the first word in the sentence starting with an upper-case letter.
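Putting the thread together, a hedged sketch of a helper built on the same pattern might look like this. The name extract_question is hypothetical. It returns None instead of raising when nothing matches, and it collapses all whitespace runs (not only "\n") into single spaces. The re.S flag is left out because it only changes what "." matches, and [^?] already matches newlines.

```python
import re

QUESTION_RE = re.compile(r"\b[A-Z][a-z][^?]+\?")

def extract_question(text):
    """Return the first question-like span on one line, or None if nothing matches."""
    match = QUESTION_RE.search(text)
    if match is None:
        return None
    # Collapse the line breaks the OCR left inside the question
    return " ".join(match.group().split())

print(extract_question("What is your\nlastname and email?\ndasf?lkjas"))
```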
