Q: How to extract text from a PDF at a URL in Python without downloading it

Stack Overflow user

Asked 2019-09-13 12:05:00

Answers: 1, Views: 412, Following: 0, Votes: 0

I am currently using requests.get to fetch PDFs from an API. I don't want to save them to disk; I only want to extract the text from them.

response_pdf = requests.get(url, auth=TokenAuth(key))
text = convert_pdf_to_txt(response_pdf.content)

Here is the code for the convert_pdf_to_txt function:

def convert_pdf_to_txt(filename):

    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    #codec ='ISO-8859-1'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)

    fp = open(filename, 'rb')

    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    text = str(text)
    text = text.replace("\\n", "")
    text = text.lower()

    return text

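I get the following error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 11: invalid start byte

response_pdf.content is a <class 'bytes'> object, and I don't know how to extract text from it.

Any help would be greatly appreciated!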

pdf

python-requests

python

1 Answer

Stack Overflow user

Accepted answer

Answered 2019-09-13 12:09:32

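You are passing a byte string to open(), which interprets it as the name of a file to open. That cannot work, because the bytes are the PDF's contents, not a path.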
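As a rough illustration of where the error can come from: when open() receives a bytes object it treats it as a path, and on platforms where Python decodes byte paths as UTF-8 (Windows on Python 3.6 and later) raw PDF bytes will often fail to decode. The sketch below reproduces the same message by decoding a made-up PDF header directly. The header bytes are an assumption chosen for illustration, not the asker's actual file.

```python
# Hypothetical first bytes of a PDF response body (an assumption for
# illustration): a version header followed by a binary comment line.
pdf_bytes = b"%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n"


def try_decode(data: bytes) -> str:
    """Decode data as UTF-8 and report the outcome as a string."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first undecodable byte.
        return f"failed at position {exc.start}: {exc.reason}"


message = try_decode(pdf_bytes)
print(message)
```

The byte 0xb5 can only appear as a continuation byte in UTF-8, so it is rejected as soon as the decoder reaches it.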

Instead, you can wrap the bytes in an io.BytesIO() and pass that in as fp:

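import io
from io import StringIO

import requests
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(fp):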
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = "utf-8"
    # codec ='ISO-8859-1'
    laparams = LAParams()
    device = TextConverter(
        rsrcmgr, retstr, codec=codec, laparams=laparams
    )

    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()

    for page in PDFPage.get_pages(
        fp,
        pagenos,
        maxpages=maxpages,
        password=password,
        caching=caching,
        check_extractable=True,
    ):
        interpreter.process_page(page)

    text = retstr.getvalue()
    device.close()
    retstr.close()
    text = str(text)
    text = text.replace("\\n", "")
    text = text.lower()
    return text

response_pdf = requests.get(url, auth=TokenAuth(key))
pdf_stream = io.BytesIO(response_pdf.content)
text = convert_pdf_to_txt(pdf_stream)
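
The reason this works is that io.BytesIO exposes the same read/seek/tell interface as a file opened in 'rb' mode, so a parser that expects a file object cannot tell the difference. A minimal sketch with placeholder bytes standing in for response_pdf.content (the fake_content value and the looks_like_pdf helper are hypothetical, introduced only for this example):

```python
import io

# Placeholder standing in for response_pdf.content (an assumption).
fake_content = b"%PDF-1.4\n...rest of the document...\n%%EOF\n"


def looks_like_pdf(stream) -> bool:
    """Check the magic header without disturbing the stream position."""
    start = stream.tell()
    header = stream.read(5)
    stream.seek(start)  # rewind so a later parser starts where it expects
    return header == b"%PDF-"


pdf_stream = io.BytesIO(fake_content)
is_pdf = looks_like_pdf(pdf_stream)
position_after_check = pdf_stream.tell()
```

A check along these lines could catch an HTML error page returned by the API before it reaches the PDF parser.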

This has the added benefit that you can still use it with files on disk:

with open('my_pdf', 'rb') as pdf_stream:
  text = convert_pdf_to_txt(pdf_stream)
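
If callers may hold either raw bytes or a path, a small wrapper could normalize both into a binary stream before calling convert_pdf_to_txt. This is only a sketch: open_pdf_source is a hypothetical helper name, not part of pdfminer or requests, and it assumes anything that is not bytes is a path that open() accepts.

```python
import io
from contextlib import contextmanager


@contextmanager
def open_pdf_source(source):
    """Yield a binary file-like object for raw bytes or a filesystem path."""
    if isinstance(source, (bytes, bytearray)):
        # In-memory data: nothing is written to disk.
        stream = io.BytesIO(bytes(source))
    else:
        # Anything else is assumed to be a path accepted by open().
        stream = open(source, "rb")
    try:
        yield stream
    finally:
        stream.close()


# Intended use with the function from the answer:
#     with open_pdf_source(response_pdf.content) as pdf_stream:
#         text = convert_pdf_to_txt(pdf_stream)
```

Either way the stream is closed when the with block exits, so the caller does not need to know which kind of source it was given.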

Votes: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/57923331
