I'm trying to use a regex to identify the posts of different students.
The posts always have the following form:
‘http://www.harryresume.com’
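How can I use a regex to build a list whose elements are each student's post, in the order they were posted?
Students can post anything, so I used [\s\S]+ to capture it. My attempt was re.findall('(U\d+\n[\s\S]+?)', text). However, this returns only the student IDs, not their text: ['U3951583\n ', 'U39501492\n ', 'U5235098\n ']
How can I match this with a regex?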
Posted on 2019-06-24 07:10:30
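As background on why the attempt in the question returns only the IDs: a lazy quantifier such as `+?` matches as little as it can, and when nothing follows it in the pattern, a single character is already enough. The sketch below reproduces that behaviour on the sample text and then adds a lookahead as a stopping point. The second pattern is an illustrative assumption that stays close to the asker's original; it is not the pattern recommended in this answer.

```python
import re

text = (
    "U3951583\n Hi there my name is Harry. Check out http://www.harryresume.com. "
    "That's my website. \n U39501492\n That's a cool website. \n U5235098\n I'll have a look too"
)

# Lazy quantifier at the very end of the pattern: one character satisfies it,
# so each match stops right after the ID, the newline, and one more character.
lazy_only = re.findall(r'(U\d+\n[\s\S]+?)', text)

# Same lazy quantifier, now followed by a lookahead that tells it where to stop:
# at the next ID line or at the end of the string (illustrative variant).
with_stop = re.findall(r'(U\d+\n[\s\S]+?)(?=U\d+\n|\Z)', text)

print(lazy_only)
print(with_stop)
```

The first call gives the ID-only list from the question; the second returns three complete posts.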
You can use re.findall with a lazy match that stops at the next ID:
import re
txt = "U3951583\n Hi there my name is Harry. Check out http://www.harryresume.com. That's my website. \n U39501492\n That's a cool website. \n U5235098\n I'll have a look too"
print(re.findall(r'\bU\d{7,8}\b.*?(?=\bU\d{7,8}\b|\Z)', txt, re.S))
# => ["U3951583\n Hi there my name is Harry. Check out http://www.harryresume.com. That's my website. \n ", "U39501492\n That's a cool website. \n ", "U5235098\n I'll have a look too"]
A variant that gets the name and the content separately:
for name, content in re.findall(r'\b(U\d{7,8})\b(.*?)(?=\bU\d{7,8}\b|\Z)', txt, re.S):
    print("{}:{}".format(name.strip(), content.strip()))
Output:
U3951583:Hi there my name is Harry. Check out http://www.harryresume.com. That's my website.
U39501492:That's a cool website.
U5235098:I'll have a look too
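The regex used is
\b(U\d{7,8})\b(.*?)(?=\bU\d{7,8}\b|\Z)
Details
- `\b` A word boundary (no letter, digit, or _ may appear immediately to the left of the current position)
- `(U\d{7,8})` Group 1: U followed by 7 or 8 digits
- `\b` A word boundary
- `(.*?)` Group 2: any 0+ chars, as few as possible (with re.S, the dot also matches newlines)
- `(?=\bU\d{7,8}\b|\Z)` A positive lookahead that requires either the name pattern described above or (|) the end of the string (\Z) immediately to the right of the current position
Python 3.7+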
In recent Python versions, re.split can split on a pattern that matches an empty string:
>>> import re
>>> txt = "U3951583\n Hi there my name is Harry. Check out http://www.harryresume.com. That's my website. \n U39501492\n That's a cool website. \n U5235098\n I'll have a look too"
>>> print(re.split(r'(?!^)(?=\bU\d{7,8}\b)', txt))
["U3951583\n Hi there my name is Harry. Check out http://www.harryresume.com. That's my website. \n ", "U39501492\n That's a cool website. \n ", "U5235098\n I'll have a look too"]
So if you don't need the name and the content separately, this may be the simpler approach.
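When the ID and the text are needed as separate fields, the findall variant can be wrapped in a small helper. This is only a sketch: `split_posts` and `POST_RE` are hypothetical names introduced here, and the ID format (U plus 7 or 8 digits) is assumed from the sample data.

```python
import re
from typing import List, Tuple

# Assumed ID format, taken from the sample data: "U" followed by 7 or 8 digits.
POST_RE = re.compile(r'\b(U\d{7,8})\b(.*?)(?=\bU\d{7,8}\b|\Z)', re.S)


def split_posts(text: str) -> List[Tuple[str, str]]:
    """Return (student_id, post_text) pairs in the order they appear in text."""
    # findall returns one (group 1, group 2) tuple per match, left to right.
    return [(sid, body.strip()) for sid, body in POST_RE.findall(text)]


txt = "U3951583\n Hi there my name is Harry. Check out http://www.harryresume.com. That's my website. \n U39501492\n That's a cool website. \n U5235098\n I'll have a look too"

posts = split_posts(txt)
for sid, body in posts:
    print(sid, "->", body)
```

One caveat of this approach: a post whose text itself contains something shaped like an ID (for example U1234567) would be split at that point, since the pattern cannot tell it apart from a real header.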
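Posted on 2019-06-24 07:25:12
You can match U and 7-8 digits, followed by all subsequent lines that do not start with the same pattern.
\bU\d{7,8}(?:\r?\n(?![ ]*U\d{7}).*)*
Explanation
- `\bU\d{7,8}` Word boundary, match U followed by 7-8 digits
- `(?:` Non-capturing group
- `\r?\n` Match a newline
- `(?!` Negative lookahead, assert that what is directly to the right is not
- `[ ]*U\d{7}` Match a space 0+ times, followed by U and 7 digits
- `).*` Close negative lookahead and match any char except a newline 0+ times
- `)*` Close non-capturing group and repeat 0+ times to match all following lines
For example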
import re
s = "U3951583\n Hi there my name is Harry. Check out http://www.harryresume.com. That's my website. \n U39501492\n That's a cool website. \n U5235098\n I'll have a look too"
regex = r"\bU\d{7,8}(?:\r?\n(?![ ]*U\d{7}).*)*"
print(re.findall(regex, s))
Result
["U3951583\n Hi there my name is Harry. Check out http://www.harryresume.com. That's my website. ", "U39501492\n That's a cool website. ", "U5235098\n I'll have a look too"]
Posted on 2019-06-24 07:09:33
https://stackoverflow.com/questions/56731379