
Extracting images from a PDF using Python

Stack Overflow user
Asked 2019-05-30 16:13:10
Answers: 2 · Views: 5.6K · Followers: 0 · Votes: 1

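How can we extract the images (and only the images) from a PDF?

I have tried many online tools, and none of them works in general. For most PDFs, the tool produces a screenshot of the whole page rather than the embedded images. Link to the PDF -> sg.inflibnet.ac.in:8080/jspui/bitstream/10603/121661/9/09_chapter 4.pdf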


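2 Answers

Stack Overflow user

Posted 2019-05-30 17:07:56

Here is some code that uses pyPdf to read a PDF file, extract its images, and yield them as PIL.Image objects. You will need to adapt it to your needs; it only demonstrates how to walk the object tree.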

import io
import pyPdf
import PIL.Image


def iter_pdf_images(infile_name):
    """Yield each supported image in the PDF as a PIL.Image."""
    with open(infile_name, 'rb') as in_f:
        in_pdf = pyPdf.PdfFileReader(in_f)
        for page_no in range(in_pdf.getNumPages()):
            page = in_pdf.getPage(page_no)

            # Images are part of a page's `/Resources/XObject`
            r = page['/Resources']
            if '/XObject' not in r:
                continue
            for k, v in r['/XObject'].items():
                vobj = v.getObject()
                # We are only interested in images...
                if vobj['/Subtype'] != '/Image' or '/Filter' not in vobj:
                    continue
                if vobj['/Filter'] == '/FlateDecode':
                    # A raw bitmap
                    buf = vobj.getData()
                    # Notice that we need metadata from the object
                    # so we can make sense of the image data
                    size = tuple(map(int, (vobj['/Width'], vobj['/Height'])))
                    # 'RGB' assumes 8-bit /DeviceRGB data; other color
                    # spaces need a different PIL mode
                    img = PIL.Image.frombytes('RGB', size, buf,
                                              decoder_name='raw')
                    yield img
                elif vobj['/Filter'] == '/DCTDecode':
                    # A JPEG stream; the stored bytes are already a JPEG file
                    img = PIL.Image.open(io.BytesIO(vobj._data))
                    yield img


for img in iter_pdf_images('my.pdf'):
    print(img.size)  # Do something with `img`...
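
The FlateDecode branch above always passes 'RGB' to PIL, which only matches 8-bit /DeviceRGB data. As a rough sketch (not part of the original answer), the PIL mode could instead be chosen from the image dictionary's /ColorSpace and /BitsPerComponent entries. The helper names `pil_mode_for` and `expected_raw_size` are hypothetical, and only the simplest device color spaces are handled; anything else (for example /Indexed or /ICCBased arrays) is reported as unsupported.

```python
def pil_mode_for(color_space, bits_per_component=8):
    """Return a PIL mode for a simple PDF image, or None if unsupported.

    Hypothetical helper: covers only plain device color spaces.
    """
    if color_space == '/DeviceRGB' and bits_per_component == 8:
        return 'RGB'
    if color_space == '/DeviceCMYK' and bits_per_component == 8:
        return 'CMYK'
    if color_space == '/DeviceGray':
        if bits_per_component == 8:
            return 'L'
        if bits_per_component == 1:
            return '1'
    return None


def expected_raw_size(mode, width, height):
    """Byte count of a raw bitmap in this mode (1-bit rows are byte-padded)."""
    if mode == '1':
        return ((width + 7) // 8) * height
    channels = {'L': 1, 'RGB': 3, 'CMYK': 4}[mode]
    return width * height * channels


if __name__ == '__main__':
    mode = pil_mode_for('/DeviceGray', 8)
    print(mode, expected_raw_size(mode, 640, 480))
```

Inside the loop, one could look up `vobj['/ColorSpace']`, skip the image when the helper returns None, and compare `len(buf)` with `expected_raw_size(...)` before calling `PIL.Image.frombytes`, so that a mismatch is caught early instead of producing a garbled image.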
Votes: 3

Stack Overflow user

Posted 2019-06-08 04:29:06

Here is a solution using PyMuPDF:

#!python3.6
import fitz  # PyMuPDF


def get_pixmaps_in_pdf(pdf_filename):
    # Uses the snake_case names of recent PyMuPDF releases; older releases
    # spelled these pageCount, getPageImageList and writePNG.
    doc = fitz.open(pdf_filename)
    xrefs = set()
    for page_index in range(doc.page_count):
        for image in doc.get_page_images(page_index):
            xrefs.add(image[0])  # Add XREFs to set so duplicates are ignored
    pixmaps = [fitz.Pixmap(doc, xref) for xref in xrefs]
    doc.close()
    return pixmaps


def write_pixmaps_to_pngs(pixmaps):
    for i, pixmap in enumerate(pixmaps):
        pixmap.save(f'{i}.png')  # Might want to come up with a better name


pixmaps = get_pixmaps_in_pdf(r'C:\StackOverflow\09_chapter 4.pdf')
write_pixmaps_to_pngs(pixmaps)
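
The comment in the code above suggests coming up with a better output name than the loop index. One possible approach, sketched here under the assumption that each image's xref is kept alongside its pixmap, is to build the name from the PDF's file name and the xref. The helper `image_filename` is hypothetical and not part of PyMuPDF.

```python
import re
from pathlib import PureWindowsPath


def image_filename(pdf_filename, xref, ext='png'):
    """Build an output name such as '09_chapter_4_xref00012.png'.

    Hypothetical helper. PureWindowsPath is used because it accepts both
    backslash and forward-slash separators.
    """
    stem = PureWindowsPath(pdf_filename).stem
    # Replace anything outside a conservative character set with '_'
    safe_stem = re.sub(r'[^A-Za-z0-9._-]+', '_', stem)
    return f'{safe_stem}_xref{xref:05d}.{ext}'


if __name__ == '__main__':
    print(image_filename(r'C:\StackOverflow\09_chapter 4.pdf', 12))
```

Because an xref identifies one object within a PDF, names built this way stay stable across runs and do not depend on the iteration order of the set.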
Votes: 1
Original content provided by Stack Overflow; translation support provided by Tencent Cloud's Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/56374258
