我有一个PDF文件,其中包含彩票获奖者,我想提取所有中奖根据他们的奖品。
PDF文件
我试过这个:
import re
import pdfplumber
prize_re = re.compile(r"^\d[a-z]")
cons_prize_re = re.compile(r"^Cons")
ticket1_line_re = re.compile(r"^\d[)]")
ticket2_line_re = re.compile(r"^\d{4}")
ticket3_line_re = re.compile(r"[A-Z] \d{6}")
with pdfplumber.open("./test11.pdf") as pdf:
for i in range(len(pdf.pages)):
page_text = pdf.pages[i].extract_text()
for line in page_text.split("\n"):
if prize_re.match(line) or cons_prize_re.match(line) or ticket1_line_re.match(line) or ticket2_line_re.match(line) or ticket3_line_re.search(line):
print(line)我知道了,我不知道如何分配每一张奖券给它的奖品,而且Cons奖券号码似乎有点奇怪,我不知道为什么( 867952AO 867952AP应该是=>一个867952 AO 867952 AP.):
1st Prize Rs :7000000/- 1) AU 867952 (MANANTHAVADY)
Cons Prize-Rs :8000/- AN 867952AO 867952AP 867952 AR 867952AS 867952
AT 867952 AV 867952 AW 867952AX 867952AY 867952
AZ 867952
2nd Prize Rs :500000/- 1) AZ 499603 (ADOOR)
3rd Prize Rs :100000/- 1) AN 215264 (KOTTAYAM)
2) AO 852774 (PATTAMBI)
3) AP 953655 (KOTTAYAM)
4) AR 638904 (PAYYANUR)
5) AS 496774 (VAIKKOM)
6) AT 878990 (WAYANADU)
7) AU 703702 (PUNALUR)
8) AV 418446 (WAYANADU)
9) AW 994685 (KOZHIKKODE)
10) AX 317550 (PATTAMBI)
11) AY 854780 (CHITTUR)
12) AZ 899905 (KARUNAGAPALLY
...相反,我想得到:
[
{
"1st Prize Rs :7000000",
"tickets": [
"AU 867952"
]
},
{
"Cons Prize-Rs :8000",
"tickets": [
"AN 867952",
"AO 867952",
"AP 867952",
"AR 867952",
...
]
},
...
]我怎样才能做到这一点?
发布于 2022-03-20 21:10:00
您可以首先从捕获组中的所有页面中获取所有完整的部分。
然后,您可以处理第三个捕获组,以获得单独的“票证”,并在一个循环中创建想要的数据结构。
对于第一个单独的组,可以使用与每个奖励节的开始匹配的模式,并捕获所有值直到下一个奖励部分。
^(\w+ Prize[-\s]Rs\s*):(\d+)/-(?:\s*\d+\))?\s*(.*(?:\n(?!\w+ Prize\b).*)*)对于后处理,您可以为票证格式使用一个模式,它匹配2个大写字符、空格和6个数字,或者匹配4个或更多位数,后面跟着空格边界。
(?:[A-Z]{2} \d{6}(?!\d)|(?<!\S)\d{4,}(?!\S))使用问题中的pdf文件的示例代码:
import re
import pdfplumber
import json
pattern = r"^(\w+ Prize[-\s]Rs\s*):(\d+)/-(?:\s*\d+\))?\s*(.*(?:\n(?!\w+ Prize\b).*)*)"
with pdfplumber.open("./test11.pdf") as pdf:
all_text = ""
for page in pdf.pages:
all_text += '\n' + page.extract_text()
matches = re.finditer(pattern, all_text, re.MULTILINE)
coll = []
for matchNum, match in enumerate(matches):
dct = {}
dct[match.group(1)] = match.group(2)
dct["tickets"] = re.findall(r"(?:[A-Z]{2} \d{6}(?!\d)|(?<!\S)\d{4,}(?!\S))", match.group(3))
coll.append(dct)
print(json.dumps(coll, indent=4))输出
[
{
"1st Prize Rs ": "120000000",
"tickets": [
"XG 218582"
]
},
{
"Cons Prize-Rs ": "500000",
"tickets": [
"XA 218582",
"XB 218582",
"XC 218582",
"XD 218582",
"XE 218582"
]
},
{
"2nd Prize Rs ": "5000000",
"tickets": [
"XA 788417",
"XB 161796",
"XC 319503",
"XD 713832",
"XE 667708",
"XG 137764"
]
},
....https://stackoverflow.com/questions/71542733
复制相似问题