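I am using "fitz" from the PyMuPDF module to extract data, and then pandas to convert the extracted data into a DataFrame.
Code to read multiple PDFs from a folder: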
from pathlib import Path
# Collect every path in the given directory that has a .pdf extension
pdf_search = Path("C:/Users/Ayesha.Gondekar/Eversana-CVs/").glob("*.pdf")
# Convert the glob generator output to a list of absolute path strings
pdf_files = [str(file.absolute()) for file in pdf_search]
Code to extract the data:
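for pdf in pdf_files:
    with fitz.open(pdf) as doc:
        pypdf_text = ""
        for page in doc:
            pypdf_text += page.getText()
However, the code above only extracts data from the last PDF in the folder, so I only get the result for that one PDF. The goal is to extract data from every PDF in the folder, one at a time.
Please help me understand why this happens and how to fix it.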
Posted on 2022-01-26 04:17:54
Change the following code:
Path("C:/Users/Ayesha.Gondekar/Eversana-CVs/").glob("*.pdf")
to
files_pdf = [file for file in glob.glob(path + r"\*.pdf", recursive=True)]
and pass the path in as a variable (this needs import glob).
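A side note on the glob call: as far as I understand the standard library, recursive=True only has an effect when the pattern contains "**"; with a plain "*.pdf" pattern only the top-level folder is listed. A minimal sketch, assuming a hypothetical helper named find_pdfs and empty placeholder files created in a temporary directory just for the demo:

```python
import glob
import os
import tempfile


def find_pdfs(folder, recursive=False):
    """Return sorted .pdf paths in folder, including subfolders if recursive."""
    if recursive:
        # "**" expands to zero or more nested directories, but only
        # when recursive=True is also passed to glob.glob().
        pattern = os.path.join(folder, "**", "*.pdf")
    else:
        pattern = os.path.join(folder, "*.pdf")
    return sorted(glob.glob(pattern, recursive=recursive))


# Demo on a throwaway directory; the file names are made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    for rel in ("a.pdf", "b.pdf", os.path.join("sub", "c.pdf"), "notes.txt"):
        open(os.path.join(tmp, rel), "w").close()

    top_level = [os.path.basename(p) for p in find_pdfs(tmp)]
    all_levels = [os.path.basename(p) for p in find_pdfs(tmp, recursive=True)]

print(top_level)
print(all_levels)
```

Building the pattern with os.path.join also avoids writing a backslash inside a normal string literal.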
Posted on 2022-01-26 05:55:33
The code below worked for me:
from pathlib import Path
import fitz  # PyMuPDF
import pandas as pd
# Collect every path in the given directory that has a .pdf extension
pdf_search = Path("C:/Users/Ayesha.Gondekar/Eversana-CVs/").glob("*.pdf")
# Convert the glob generator output to a list of absolute path strings
pdf_files = [str(file.absolute()) for file in pdf_search]
Code to extract the data:
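# Created once, before the loop, so text from every PDF accumulates
pdf_txt = ""
for pdf in pdf_files:
    with fitz.open(pdf) as doc:
        for page in doc:
            # get_text() is the current name; older PyMuPDF releases spelled it getText()
            pdf_txt += page.get_text()
Converting the extracted data to a DataFrame: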
# Write the combined text to a text file
with open('pdf_txt.txt', 'w', encoding='utf-8') as f:
    f.write(pdf_txt)
# Read the text file back into a DataFrame, one row per line
data = pd.read_table('pdf_txt.txt', sep='\n')
Thank you!
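For anyone wondering why this version behaves differently from the code in the question: the only change that matters for the extraction is where the empty string is created. In the question it is reset inside the loop for every PDF, so only the last file's text survives. Here it is created once, before the loop. A minimal sketch of the two patterns, using plain strings as stand-ins for PDF pages so it runs without PyMuPDF (the names last_only, all_documents and fake_pdfs are made up for illustration):

```python
def last_only(documents):
    """Pattern from the question: the accumulator is reset per document."""
    for pages in documents:
        text = ""  # reset on every iteration, discarding earlier documents
        for page in pages:
            text += page
    return text


def all_documents(documents):
    """Pattern from this answer: the accumulator is created once, up front."""
    text = ""
    for pages in documents:
        for page in pages:
            text += page
    return text


# Each inner list stands in for the pages of one PDF (made-up sample strings).
fake_pdfs = [["A1|", "A2|"], ["B1|"], ["C1|", "C2|"]]

print(last_only(fake_pdfs))      # C1|C2|
print(all_documents(fake_pdfs))  # A1|A2|B1|C1|C2|
```

If the text of each PDF is needed separately rather than as one combined string, appending each document's text to a list inside the outer loop would be one way to keep them apart.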
https://stackoverflow.com/questions/70849771