
Note that <code>extractText()</code> still has problems extracting the text properly. From the documentation for <code>extractText()</code>:

<blockquote>
 This works well for some PDF files,
 but poorly for others, depending on
 the generator used. This will be
 refined in the future. Do not rely on
 the order of text coming out of this
 function, as it will change if this
 function is made more sophisticated.
</blockquote>

Since it is the text you want, you can use the Linux command <code>pdftotext</code>. 

To invoke that using Python, you can do this:

<pre><code>&gt;&gt;&gt; import subprocess
&gt;&gt;&gt; subprocess.call(['pdftotext', 'forms.pdf', 'output'])
</code></pre>

The text is extracted from <code>forms.pdf</code> and saved to <code>output</code>. 

This works in the case of your PDF file and extracts the text you want.
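If you would rather get the text back as a Python string than as a file on disk, something along these lines might work. It is only a sketch: the helper names `build_pdftotext_cmd` and `pdf_to_text` are made up for illustration, it assumes `pdftotext` is on your `PATH`, and it assumes the output is UTF-8 (the default encoding can differ between builds). Passing `-` as the output file asks `pdftotext` to write to stdout, and `-f` / `-l` select the first and last page.

```python
import subprocess


def build_pdftotext_cmd(pdf_path, out_path="-", first_page=None, last_page=None):
    """Build the pdftotext argument list; "-" as out_path means stdout."""
    cmd = ["pdftotext"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]  # first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]  # last page to convert
    cmd += [pdf_path, out_path]
    return cmd


def pdf_to_text(pdf_path, first_page=None, last_page=None):
    """Run pdftotext and return the extracted text as a string."""
    cmd = build_pdftotext_cmd(pdf_path, "-", first_page, last_page)
    raw = subprocess.check_output(cmd)
    # Assumption: the output is UTF-8; undecodable bytes are replaced.
    return raw.decode("utf-8", errors="replace")
```

For example, `pdf_to_text("forms.pdf", first_page=1, last_page=82)` would return only the text of the first 82 pages.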


You could also try the <a href="http://www.unixuser.org/~euske/python/pdfminer/index.html" rel="nofollow">pdfminer</a> library (also written in Python) and see whether it is better at extracting the text. For splitting, however, you will have to stick with pyPdf, as pdfminer doesn't support that.


I sometimes find it useful to convert the file to <code>ps</code> (try both <code>pdf2ps</code> and <code>pdftops</code>, since their output can differ) and then back to <code>pdf</code> (<code>ps2pdf</code>). Then try your original script again.
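If you want to script that round trip, a rough sketch could look like the following. The function names and the intermediate file name are my own placeholders, and it assumes `pdf2ps`, `pdftops` and `ps2pdf` are installed and on your `PATH`.

```python
import subprocess


def roundtrip_commands(src_pdf, tmp_ps, dst_pdf, to_ps="pdf2ps"):
    """Return the two commands for a PDF -> PostScript -> PDF round trip."""
    if to_ps not in ("pdf2ps", "pdftops"):
        raise ValueError("to_ps must be 'pdf2ps' or 'pdftops'")
    return [[to_ps, src_pdf, tmp_ps], ["ps2pdf", tmp_ps, dst_pdf]]


def roundtrip(src_pdf, tmp_ps, dst_pdf, to_ps="pdf2ps"):
    """Run both conversions; raises CalledProcessError if either one fails."""
    for cmd in roundtrip_commands(src_pdf, tmp_ps, dst_pdf, to_ps):
        subprocess.check_call(cmd)
```

Calling `roundtrip("forms.pdf", "forms.ps", "forms_rebuilt.pdf", to_ps="pdftops")` would then leave a regenerated PDF that you can feed to the original script.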


This isn't really an answer, but the problem with pyPdf is this: it doesn't yet support CMaps. PDF allows fonts to use CMaps to map character IDs (bytes in the PDF) to Unicode character codes. When you have a PDF that contains non-ASCII characters, there's probably a CMap in use, and sometimes one is in use even when there are no non-ASCII characters. When pyPdf encounters strings that are not in standard Unicode encoding, it just sees a bunch of bytes; it can't convert those bytes to Unicode, so it just gives you empty strings. I actually had this same problem and I'm working on the source code at the moment. It's time-consuming, but I hope to send a patch to the maintainer some time around mid-2011.
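To make the idea concrete, here is a toy illustration of what a CMap lookup does. It is not pyPdf code and not a full CMap parser: it only reads `beginbfchar` / `endbfchar` entries, ignores `bfrange` sections, and assumes every character code is a single byte. The sample CMap text and the function names are invented for the example.

```python
import re

BFCHAR_SECTION = re.compile(r"beginbfchar(.*?)endbfchar", re.S)
BFCHAR_ENTRY = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")

# Invented sample: character code 0x01 maps to "H", 0x02 maps to "i".
SAMPLE_CMAP = """
2 beginbfchar
<01> <0048>
<02> <0069>
endbfchar
"""


def parse_bfchar(cmap_text):
    """Map each source code to its Unicode string (targets are UTF-16BE hex)."""
    mapping = {}
    for section in BFCHAR_SECTION.findall(cmap_text):
        for src, dst in BFCHAR_ENTRY.findall(section):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def decode_string(raw, mapping):
    """Decode raw bytes one code at a time; unmapped codes become U+FFFD."""
    return "".join(mapping.get(code, "\ufffd") for code in raw)


print(decode_string(b"\x01\x02", parse_bfchar(SAMPLE_CMAP)))  # prints "Hi"
```

Without the mapping, the bytes `\x01\x02` carry no readable text at all, which is roughly the situation an extractor without CMap support is in.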


I had a similar problem with some PDFs on Windows, and this works very well for me:

1.- Download the Xpdf tools for Windows.

2.- Copy pdftotext.exe from xpdf-tools-win-4.00\bin32 to C:\Windows\System32 and also to C:\Windows\SysWOW64.

3.- Use subprocess to run the command from the console:

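<pre><code>import subprocess

filePath = 'forms.pdf'  # example path; point this at your own PDF

try:
    # '-' as the output file makes pdftotext write the text to stdout
    extInfo = subprocess.check_output(['pdftotext.exe', filePath, '-'], stderr=subprocess.STDOUT).strip()
except Exception as e:
    print(e)
</code></pre>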


I'm starting to think I should adopt a messy two-part solution. There are two sections to the PDF: pp 1-82, which have text page labels (pdftotext can extract these), and pp 83-end, which have no page labels but which pyPdf can extract, and pyPdf explicitly knows which page is which.

I think I need to combine the two. Clunky, but I don't see any way round it. Sadly I'm having to do this on a Windows machine.
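One way the combining step might look, purely as a sketch: get one list of page texts from each tool, then keep whichever is non-blank for each page. The function names are made up, and the splitting assumes `pdftotext` ends each page with a form feed character (`\f`), which is what it normally does unless page breaks are switched off.

```python
def split_pdftotext_pages(text):
    """Split pdftotext output into pages on the form feed that ends each page."""
    pages = text.split("\f")
    if pages and pages[-1] == "":
        pages.pop()  # drop the empty piece after the final form feed
    return pages


def merge_page_texts(primary, fallback):
    """Per page, keep the primary text unless it is blank, else use the fallback."""
    merged = []
    for i in range(max(len(primary), len(fallback))):
        first = primary[i] if i < len(primary) else ""
        second = fallback[i] if i < len(fallback) else ""
        merged.append(first if first.strip() else second)
    return merged
```

Here `primary` could be the list of `page.extractText()` results from pyPdf and `fallback` the pages split out of the `pdftotext` output, or the other way round.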

I'm trying to use pyPdf to extract and print pages from a multipage PDF. Problem is, text is not extracted from some pages. I've put an example file here:

<a href="http://www.4shared.com/document/kmJF67E4/forms.html" rel="noreferrer">http://www.4shared.com/document/kmJF67E4/forms.html</a>

If you run the following, the first 81 pages return no text, while the final 11 extract properly. Can anyone help?

<pre><code>from pyPdf import PdfFileReader

# Open the PDF in binary mode and print the text of each page
reader = PdfFileReader(open("forms.pdf", "rb"))
for page in reader.pages:
    print(page.extractText())
</code></pre>

pyPdf unable to extract text from some pages in my PDF


