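I have been extracting text from PDFs using Python's PyPDF2 and its extractText() method. Now I am trying to do the same with non-copyable PDFs, and it returns an empty string.
I turn a simple copyable PDF into a non-copyable one with this online tool: https://online-pdf-no-copy.com/
Here is my code: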
from PyPDF2 import PdfFileReader


def get_info(path):
    with open(path, 'rb') as f:
        pdf = PdfFileReader(f)
        # Try the empty password if the file is encrypted
        if pdf.isEncrypted:
            pdf.decrypt('')
        page = pdf.getPage(1)  # pages are 0-indexed, so this is the second page
        # print(page)
        print('Page type: {}'.format(str(type(page))))
        text = page.extractText()
        print(text)


if __name__ == '__main__':
    path = 'pdfs/finalNoCopy.pdf'
    get_info(path)

My output:
Page type: <class 'PyPDF2.pdf.PageObject'>

Process finished with exit code 0

It gives me an empty string.
Posted on 2020-01-10 17:34:52
You can try this code:
import fitz  # PyMuPDF; install with: pip install pymupdf

text1 = ""
file_path = r'your_file_name_with_path.pdf'
doc = fitz.open(file_path)
for page in doc:
    # Newer PyMuPDF releases name this method get_text()
    text1 += page.getText()

https://stackoverflow.com/questions/59677920
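If it helps, here is a slightly fuller sketch of the same loop that also reports which pages came back empty, so you can tell whether PyMuPDF really found text. The function names `extract_page_texts` and `empty_page_numbers` are made up for this example, and the file path is the same placeholder as above. It assumes PyMuPDF is installed and falls back to the older `getText()` name when `get_text()` is not available.

```python
def extract_page_texts(path):
    """Return one text string per page of the PDF at `path` (hypothetical helper)."""
    import fitz  # PyMuPDF; imported here so the helper below works without it

    doc = fitz.open(path)
    try:
        texts = []
        for page in doc:
            # get_text() in newer PyMuPDF releases, getText() in older ones
            get_text = getattr(page, "get_text", None) or page.getText
            texts.append(get_text())
        return texts
    finally:
        doc.close()


def empty_page_numbers(page_texts):
    """Return 1-based numbers of pages whose text is empty or whitespace only."""
    return [n for n, text in enumerate(page_texts, start=1) if not text.strip()]


if __name__ == "__main__":
    # Placeholder path, replace with your own file
    texts = extract_page_texts(r"your_file_name_with_path.pdf")
    print("Pages with no text:", empty_page_numbers(texts))
    print("".join(texts))
```

If every page is still listed as empty, the file most likely has no text layer at all (for example, the pages were turned into images), and no text extractor will get anything out of it directly.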