
Q: Extracting text that ends with an escape character in Python

Stack Overflow user

Asked 2022-09-05 15:59:17

Answers: 1, Views: 38, Followers: 0, Votes: -1

I am trying to parse key details out of PDF files with Python and extract each paper's title, its authors, and their email addresses.

from PyPDF2 import PdfReader

reader = PdfReader("paper.pdf")
text = ""
for page in reader.pages:
    text += page.extract_text() + "\n"

This returns the raw text of the PDF:

'Title\nGoes\nHere\nAuthor Name (sdsd@mail.net)\nUniversity of Teeyab\nSeptember 6, 2022\nSome text in the Document.\n'

I have a function that removes newlines, tabs, and so on:

def remove_newlines_tabs(text):
    """
    This function will remove all the occurrences of newlines, tabs, and combinations like: \\n, \\.

    arguments:
        text: "text" of type "String".

    return:
        value: "text" after removal of newlines, tabs, \\n, \\ characters.

    Example:
    Input : This is her \\ first day at this place.\\n Please,\\t Be nice to her.\\n
    Output : This is her first day at this place. Please, Be nice to her.

    """

    # Replace literal "\n" sequences, real newlines, tabs, and stray backslashes
    # with a space, then rejoin ". com", which the extraction sometimes splits.
    formatted_text = text.replace('\\n', ' ').replace('\n', ' ').replace('\t', ' ').replace('\\', ' ').replace('. com', '.com')
    return formatted_text

It returns:

'Title Goes Here Author Name (sdsd@mail.net) University of Teeyab September 6, 2022 Some text in the Document. '

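This makes extracting the email easy. How can I extract the title and the author of the PDF? The title is the most important, but I'm not sure of the best way to do it.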
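For illustration, the email step mentioned above could look like the following minimal sketch. The `extract_emails` name and the pattern are assumptions made for this example and are not part of the original question; the pattern is deliberately simple and is not a complete email validator.

```python
import re

# Hypothetical helper: a simple pattern for "name@domain.tld"-style strings.
# It is a sketch, not a full RFC-compliant email matcher.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def extract_emails(text):
    """Return every substring of `text` that looks like an email address."""
    return EMAIL_PATTERN.findall(text)


flat_text = (
    'Title Goes Here Author Name (sdsd@mail.net) University of Teeyab '
    'September 6, 2022 Some text in the Document. '
)
print(extract_emails(flat_text))  # ['sdsd@mail.net']
```

Note that once the newlines have been replaced by spaces, nothing in the flattened string marks where the title ends and the author begins, which is why the title is the harder part.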

python

regex

1 Answer

Stack Overflow user

Answered 2022-09-05 17:03:14

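Here is a regex-based solution that relies on the following assumptions:

every word of the title is separated by a newline \n

every word of the author's name is separated by a whitespace

the email address is always enclosed in parentheses ()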

import re


test_string = 'Title\nGoes\nHere\nAuthor Name (sdsd@mail.net)\nUniversity of Teeyab\nSeptember 6, 2022\nSome text in the Document.\n'

# \w matches letters, digits, and underscore
# \s matches any whitespace character: space, \t, \n, \r, \f, \v
# first, extract the string that appears before the opening parenthesis
result = re.search(r"([\w\s]+)", test_string)
print(result) # <re.Match object; span=(0, 28), match='Title\nGoes\nHere\nAuthor Name '>

# clean up leading and trailing whitespaces using strip() and
# split the string by \n to separate title and author
title_author = result[0].strip().split("\n")
print(title_author) # ['Title', 'Goes', 'Here', 'Author Name']

# join the words of title as a single string
title = " ".join(title_author[:-1])
author = title_author[-1]

print(title) # Title Goes Here
print(author) # Author Name
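The steps above can also be wrapped in a single function that returns the email as well and does not fail when the text has a different layout. This is only a sketch under the same three assumptions; the `parse_header` name and the email sub-pattern are introduced here for illustration and are not from the original answer.

```python
import re


def parse_header(raw_text):
    """Return (title, author, email) from raw PDF text, or None if no match.

    Assumes title words are newline-separated, author words are
    space-separated, and the email is enclosed in parentheses.
    """
    # Group 1: the run of word/whitespace characters before "(".
    # Group 2: the parenthesised token containing "@".
    match = re.search(r"([\w\s]+)\(([^()\s]+@[^()\s]+)\)", raw_text)
    if match is None:
        return None

    # The last newline-separated piece is the author; the rest is the title.
    lines = match.group(1).strip().split("\n")
    title = " ".join(lines[:-1])
    author = lines[-1]
    return title, author, match.group(2)


test_string = (
    'Title\nGoes\nHere\nAuthor Name (sdsd@mail.net)\nUniversity of Teeyab\n'
    'September 6, 2022\nSome text in the Document.\n'
)
print(parse_header(test_string))
# ('Title Goes Here', 'Author Name', 'sdsd@mail.net')
```

Because `[\w\s]+` does not match punctuation, a title containing a colon or a hyphen would be cut off at that character, so the character class would need to be widened for real papers.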

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud Xiaowei's IT-domain engine.

Original link:

https://stackoverflow.com/questions/73612039
