I am currently using requests.get to fetch PDFs from an API. I don't want to download them; I just want to extract the text from them.
response_pdf = requests.get(url, auth=TokenAuth(key))
text = convert_pdf_to_txt(response_pdf.content)
Here is the code for the function convert_pdf_to_txt:
def convert_pdf_to_txt(filename):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    #codec ='ISO-8859-1'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(filename, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    text = retstr.getvalue()
    fp.close()
    device.close()
    retstr.close()
    text = str(text)
    text = text.replace("\\n", "")
    text = text.lower()
    return text
I get the following error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 11: invalid start byte
response_pdf.content is a <class 'bytes'> object, and I don't know how to extract the text from it.
Any help would be greatly appreciated!
Posted on 2019-09-13 12:09:32
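You are passing a byte string, and open() interprets it as the name of a file to open, which cannot work.
Instead, you can wrap the bytes in io.BytesIO() and pass that object in as fp.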
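The difference between the two calls can be seen without any PDF library. The following is a minimal sketch using only the standard library; pdf_bytes is a made-up stand-in for response_pdf.content (a few bytes shaped like the start of a PDF, not a real document), and try_open_as_path is a hypothetical helper named for illustration only.

```python
import io

# Hypothetical stand-in for response_pdf.content: a few bytes shaped like
# the start of a PDF file, not a complete document.
pdf_bytes = b"%PDF-1.7\r\n%\xb5\xb5\xb5\xb5\r\n"


def try_open_as_path(data):
    """Return the name of the exception raised when bytes are used as a path."""
    try:
        with open(data, "rb"):
            return None
    except (OSError, ValueError) as exc:
        # The exact exception depends on the platform and on the bytes;
        # UnicodeDecodeError is a subclass of ValueError.
        return type(exc).__name__


error_name = try_open_as_path(pdf_bytes)

# Wrapping the same bytes in BytesIO gives a binary file-like object
# that lives entirely in memory.
pdf_stream = io.BytesIO(pdf_bytes)
header = pdf_stream.read(5)
pdf_stream.seek(0)  # rewind so a parser would start from the first byte

print(error_name, header)
```

Which exception open() raises varies by operating system, so the sketch only reports its name. The BytesIO object supports read() and seek() like a file opened in 'rb' mode, which is why it can stand in for fp.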
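import io
from io import StringIO

import requests
# The classes below are assumed to come from pdfminer (pdfminer.six on Python 3).
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(fp):
    # fp is any binary file-like object: an open file or an io.BytesIO.
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = "utf-8"
    # codec ='ISO-8859-1'
    laparams = LAParams()
    device = TextConverter(
        rsrcmgr, retstr, codec=codec, laparams=laparams
    )
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(
        fp,
        pagenos,
        maxpages=maxpages,
        password=password,
        caching=caching,
        check_extractable=True,
    ):
        interpreter.process_page(page)
    text = retstr.getvalue()
    device.close()
    retstr.close()
    text = str(text)
    text = text.replace("\\n", "")
    text = text.lower()
    return text
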
response_pdf = requests.get(url, auth=TokenAuth(key))
pdf_stream = io.BytesIO(response_pdf.content)
text = convert_pdf_to_txt(pdf_stream)
A side benefit is that you can still use the function with files on disk:
with open('my_pdf', 'rb') as pdf_stream:
    text = convert_pdf_to_txt(pdf_stream)
https://stackoverflow.com/questions/57923331