By LinkCoder


2019-07-06 09:29:18 8 Comments

A string is given as input (e.g. "What is your name?"). The input always contains a question that I want to extract. The problem I am trying to solve is that the question is always surrounded by unneeded text.

So the input could be (but is not limited to) the following:

1- "eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn"
2- "What is your\nlastname and email?\ndasf?lkjas"
3- "askjdmk.\nGiven your skills\nhow would you rate yourself?\nand your name? dasf?"

(Notice that in the third input, the question starts with the word "Given" and ends with "yourself?")

The above input examples were generated by the pytesseract OCR library, which scans an image and converts it into text.

I only want to extract the question from the garbage input and nothing else.

I tried to use the string method find('?', 1) to get the index of the end of the question (assuming for now that the first question mark is always the end of the question and not part of the input that I don't want). But I can't figure out how to get the index of the first letter of the question. I tried looping in reverse to find the first \n before the question mark, but the question doesn't always have a \n right before its first letter.

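def extractQuestion(input):
    # Index of the first question mark, assumed to end the question.
    index_end_q = input.find('?', 1)
    index_first_letter_of_q = 0 # TODO
    # Slice up to and including the question mark.
    question = input[index_first_letter_of_q:index_end_q + 1]
    return question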


2 comments

@game0ver 2019-07-06 09:42:46

A way to find the question's first word index would be to search for the first word that has an actual meaning (you're interested in English words I suppose). A way to do that would be using pyenchant:

#!/usr/bin/env python

import enchant

GLOSSARY = enchant.Dict("en_US")

def isWord(word):
    # Dict.check() already returns a bool.
    return GLOSSARY.check(word)

sentences = [
"eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn",
"What is your\nlastname and email?\ndasf?lkjas",
"\nGiven your skills\nhow would you rate yourself?\nand your name? dasf?"]

for sentence in sentences:
    for i,w in enumerate(sentence.split()):
        if isWord(w):
            print('index: {} => {}'.format(i, w))
            break

The above piece of code gives as a result:

index: 3 => What
index: 0 => What
index: 0 => Given

@LinkCoder 2019-07-06 09:49:15

Yes, this would be a great solution to the problem, but what if the text BEFORE the question is also a valid English word? I updated the question, by the way.

@game0ver 2019-07-06 09:53:20

@LinkCoder Then the problem is much more complicated than the one you initially described. Maybe NLTK could help you recognize "logical" sentences, but as far as I know that's not an easy problem to solve (and I think it hadn't been solved at the time I posted this answer).

@LinkCoder 2019-07-06 09:59:18

Alright, then assuming the input before the question is garbage, how could I find the index of the first letter of the question within the whole input? Then I can slice it.

@game0ver 2019-07-06 10:04:06

@LinkCoder that's easy: once you know the word, you can use Python's built-in str.find() method.
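To make the find() suggestion concrete, here is a minimal sketch that turns the first recognized word into a character index and slices from there up to the next question mark. The small KNOWN_WORDS set is a hypothetical stand-in for the pyenchant dictionary check in the answer (so the snippet runs without extra packages), and the helper names is_word and first_question are made up for illustration.

```python
# Hypothetical stand-in for GLOSSARY.check(); a real dictionary would be far larger.
KNOWN_WORDS = {"what", "is", "your", "name", "given", "skills"}

def is_word(word):
    # Ignore case and trailing punctuation when looking the word up.
    return word.lower().strip("?.,!") in KNOWN_WORDS

def first_question(text):
    """Return the text from the first known word up to the next '?', or None."""
    for word in text.split():
        if is_word(word):
            start = text.find(word)       # character index of that word
            end = text.find("?", start)   # first question mark after it
            if end == -1:
                return None
            return text[start:end + 1]
    return None

print(first_question("eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn"))
```

Note that text.find(word) returns the first place the word occurs as a substring, so a short word such as "is" could be found inside earlier garbage; this sketch does not guard against that.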

@LinkCoder 2019-07-06 10:19:44

Ok I get it now thanks for the effort that you have put into your answer :)

@game0ver 2019-07-06 10:21:35

@LinkCoder No problem, I'm glad my answer was of some help!

@tobias_k 2019-07-06 09:43:31

You could try a regular expression like \b[A-Z][a-z][^?]+\?, meaning:

  • The start of a word \b with an upper case letter [A-Z] followed by a lower case letter [a-z],
  • then a sequence of non-questionmark-characters [^?]+,
  • followed by a literal question mark \?.

This can still produce some false positives or misses, e.g. if a question actually starts with an acronym, or if there is a capitalized word in the garbage before the question, but for your examples it works quite well.

>>> tests = ["eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn",
             "What is your\nlastname and email?\ndasf?lkjas",
             "\nGiven your skills\nhow would you rate yourself?\nand your name? dasf?"]

>>> import re
>>> p = r"\b[A-Z][a-z][^?]+\?"
>>> [re.search(p, t).group() for t in tests]
['What is your name?',
 'What is your\nlastname and email?',
 'Given your skills\nhow would you rate yourself?']

If that's one blob of text, you can use findall instead of search:

>>> text = "\n".join(tests)
>>> re.findall(p, text)
['What is your name?',
 'What is your\nlastname and email?',
 'Given your skills\nhow would you rate yourself?']

Actually, this also seems to work reasonably well for questions with names in them:

>>> t = "asdGARBAGEasd\nHow did you like St. Petersburg? more stuff with ?" 
>>> re.search(p, t).group()
'How did you like St. Petersburg?'

@LinkCoder 2019-07-06 10:19:16

Thanks for the effort that you put into your answer. I have a question though: How can I do the exact thing that you did in your answer on a single string variable?

@tobias_k 2019-07-06 10:59:36

@LinkCoder Wouldn't that be exactly what I did in the lower code with text?

@LinkCoder 2019-07-06 11:04:09

Yes, I figured it out. I just used re.search(regex_pattern, input, flags=re.S).group().replace("\n", " "), so thanks for that. By the way, is there a regex that doesn't miss names but otherwise does the same thing as what you did above?
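For reference, the single-string variant described in this comment could be wrapped up roughly as follows. This is only a sketch: the function name extract_question and the None return for "no match" are assumptions, not part of the answer. The re.S flag is left out because it only changes how "." behaves, and this pattern contains no "."; the class [^?] already matches newlines.

```python
import re

# Pattern from the answer: a capitalized word start, then anything up to the first '?'.
QUESTION_RE = re.compile(r"\b[A-Z][a-z][^?]+\?")

def extract_question(text):
    """Return the first matched question on a single line, or None if nothing matches."""
    match = QUESTION_RE.search(text)
    if match is None:
        return None
    # Collapse newlines (and any other whitespace runs) inside the question.
    return " ".join(match.group().split())

print(extract_question("What is your\nlastname and email?\ndasf?lkjas"))
```

Checking for None first avoids the AttributeError that re.search(...).group() raises when the input contains no match.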

@tobias_k 2019-07-06 11:06:20

Actually, I think it should work okay if there are names in the question, as long as the start of the question is the first word in the sentence starting with an upper-case letter.
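A quick sketch of where that condition breaks down, using the same pattern on made-up inputs (the test strings and the helper first_match are illustrative assumptions, not taken from the question):

```python
import re

p = r"\b[A-Z][a-z][^?]+\?"

def first_match(text):
    """Return the first match of p in text, or None when there is none."""
    m = re.search(p, text)
    return m.group() if m else None

# A capitalized word in the leading noise makes the match start too early.
too_early = first_match("Hello kgda\nWhat is your name?")

# A question starting with an acronym is picked up only from a later capitalized word.
too_late = first_match("OCR is used by Tesseract how?")

# No capital letter followed by a lower-case letter at a word start: no match at all.
missed = first_match("USA or UK?")

print(repr(too_early), repr(too_late), repr(missed))
```

So the pattern is reliable only when the question's first word is the first capitalized word in the input, as described above.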

@LinkCoder 2019-07-06 11:11:57

Ok I think I understand it thanks very much for your effort and time :)
