I am trying to parse key details out of a PDF file with Python and extract the paper's title, its authors, and their email addresses.
from PyPDF2 import PdfReader
reader = PdfReader("paper.pdf")
text = ""
for page in reader.pages:
    text += page.extract_text() + "\n"

This returns the raw text of the PDF:

'Title\nGoes\nHere\nAuthor Name (sdsd@mail.net)\nUniversity of Teeyab\nSeptember 6, 2022\nSome text in the Document.\n'

I have a function that removes newlines, tabs, and so on:
def remove_newlines_tabs(text):
    """
    Remove all occurrences of newlines, tabs, and combinations like: \\n, \\.

    arguments:
        text: the input text, of type str.
    return:
        the text after removal of newlines, tabs, \\n, and \\ characters.
    Example:
        Input : This is her \\ first day at this place.\n Please,\t Be nice to her.\\n
        Output : This is her first day at this place. Please, Be nice to her.
    """
    # Replace every \n, \\n, \t, and \\ with a space, then rejoin '. com'.
    formatted_text = (
        text.replace('\\n', ' ')
        .replace('\n', ' ')
        .replace('\t', ' ')
        .replace('\\', ' ')
        .replace('. com', '.com')
    )
    return formatted_text

This returns:

'Title Goes Here Author Name (sdsd@mail.net) University of Teeyab September 6, 2022 Some text in the Document. '

This makes extracting the email easy. How can I extract the title and the author from the PDF? The title is the most important, but I'm not sure of the best approach.
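
For reference, the email step mentioned above could look roughly like the sketch below. The pattern and the names `EMAIL_PATTERN` and `extract_emails` are illustrative assumptions, not part of the original post, and the pattern is deliberately simple, so it will not cover every valid address.

```python
import re

# Deliberately simple, hypothetical pattern: local part, "@", then a dotted domain.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def extract_emails(text):
    """Return every email-like substring found in text, in order of appearance."""
    return EMAIL_PATTERN.findall(text)


flat_text = (
    "Title Goes Here Author Name (sdsd@mail.net) University of Teeyab "
    "September 6, 2022 Some text in the Document. "
)
print(extract_emails(flat_text))  # ['sdsd@mail.net']
```

Because the parentheses are not part of the character classes, the match stops cleanly at the closing bracket.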
Posted on 2022-09-05 17:03:14
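Here is a solution using regex, based on the following assumptions:
- Each word of the title is separated by a newline (\n).
- The author name is followed by the email in parentheses ().

import re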
test_string = 'Title\nGoes\nHere\nAuthor Name (sdsd@mail.net)\nUniversity of Teeyab\nSeptember 6, 2022\nSome text in the Document.\n'
# \w matches letters, digits, and underscore
# \s matches whitespace: space, \t, \n, \r, \f, and \v
# first, let's extract string that appears before parentheses
result = re.search(r"([\w\s]+)", test_string)
print(result) # <re.Match object; span=(0, 28), match='Title\nGoes\nHere\nAuthor Name '>
# clean up leading and trailing whitespaces using strip() and
# split the string by \n to separate title and author
title_author = result[0].strip().split("\n")
print(title_author) # ['Title', 'Goes', 'Here', 'Author Name']
# join the words of title as a single string
title = " ".join(title_author[:-1])
author = title_author[-1]
print(title) # Title Goes Here
print(author) # Author Name

https://stackoverflow.com/questions/73612039
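
One limitation of the approach above is that `[\w\s]+` stops at the first character that is not a word character or whitespace, so a title containing a colon or hyphen would be cut short. A possible variant, sketched below under the same two assumptions, works line by line instead: it looks for the first line that contains an email in parentheses and treats everything before that line as the title. The function name `split_title_author` is hypothetical, not something from the original answer.

```python
import re


def split_title_author(raw_text):
    """Return (title, author, email) from newline-separated PDF text.

    Assumes the author line is the first line holding an email in
    parentheses and that every line before it belongs to the title.
    Returns None when no such line exists.
    """
    # Drop empty lines and surrounding whitespace.
    lines = [line.strip() for line in raw_text.splitlines() if line.strip()]
    for index, line in enumerate(lines):
        # Group 1: text before the parentheses; group 2: the email inside them.
        match = re.search(r"^(.*?)\s*\(([^()\s]+@[^()\s]+)\)", line)
        if match:
            title = " ".join(lines[:index])
            return title, match.group(1), match.group(2)
    return None


sample = 'Title\nGoes\nHere\nAuthor Name (sdsd@mail.net)\nUniversity of Teeyab\nSeptember 6, 2022\nSome text in the Document.\n'
print(split_title_author(sample))
# ('Title Goes Here', 'Author Name', 'sdsd@mail.net')
```

Returning None when nothing matches lets the caller handle PDFs whose first page does not follow this layout, rather than failing on a missing match object.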