Question: Split a string at regex matches without removing any characters

Stack Overflow user

Asked 2015-09-26 20:32:29

6 answers, 122 views, 0 followers, 0 votes

I want to split this text on the dates, without removing the dates from the string:

sep 25 fri The Phenomenauts, The Atom Age, Los Pistoleros, The Shames
   at Jub Jubs, 71 S Wells Avenue, Reno, NV 21+ 8pm *** @
sep 25 fri The Holdup, The Wheeland Brothers
   at the El Rey Theatre, Chico 18+ (a/a with adult) 7:30pm/8:30pm **

The first element of the array would be:

sep 25 fri The Phenomenauts, The Atom Age, Los Pistoleros, The Shames
   at Jub Jubs, 71 S Wells Avenue, Reno, NV 21+ 8pm *** @

Entries span a variable number of lines, so I can't split on newlines.

The date format is:

month_abbreviation + space(or two) + day_number

Something like this pseudocode:

three_letter_word + whitespace(s) + one_or_two_digit_number

would work.

ruby

regex

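6 Answers

Stack Overflow user

Accepted answer

Posted 2015-09-27 00:17:07

You specified that you want to split on dates. Accordingly, I have not split on strings that have the specified date format but cannot be converted to dates, including "Sep 31 Sat" and "Sep 26 Wed" (the latter being a "Sat" this year). I have assumed that the date substrings can appear anywhere in the string. If you want to require them to start at the beginning of a line, that is of course a simple modification.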

str =
"sep 25 fri The Phenomenauts, The Atom Age, Los Pistoleros, The Shames
       at Jub Jubs, 71 S Wells Avenue, Reno, NV 21+ 8pm *** @
sep 31 mon at some other place 
oct 26 sat The Holdup, The Wheeland Brothers
       at the El Rey Theatre, Chico 18+ (a/a with adult) 7:30pm/8:30pm **"

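require 'date'

# Collect every run of three consecutive words that strptime accepts as
# "month day weekday"; impossible dates such as "Sep 31" raise and are dropped.
arr = str.split.
          map(&:capitalize).
          each_cons(3).
          map { |a| a.join(' ') }.
          select { |s| Date.strptime(s, '%b %d %a') rescue nil }
  #=> ["Sep 25 Fri", "Oct 26 Sat"]

# Case-insensitive pattern; the capture group makes split keep the dates.
r = /(#{ arr.join('|') })/i
  #=> /(Sep 25 Fri|Oct 26 Sat)/i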

str.split(r)
  #=> ["",
  #    "sep 25 fri",
  # " The Phenomenauts, The Atom Age, Los Pistoleros, The Shames\n\
  #  at Jub Jubs, 71 S Wells Avenue, Reno, NV 21+ 8pm *** @\n    sep 31\
  #   mon at some other place \n    ",
  # "oct 26 sat",
  # " The Holdup, The Wheeland Brothers\n           at the El Rey Theatre,\
  #   Chico 18+ (a/a with adult) 7:30pm/8:30pm **"]

To avoid empty strings at the beginning and end of the returned array, use:

str.split(r).delete_if(&:empty?)
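Splitting on a capture group returns the dates and the text between them as separate elements. If each date should stay attached to the entry it introduces, one possible follow-up is to pair the pieces back up. The sketch below is not part of the answer: it assumes the text begins with a date, uses a shortened, made-up sample, and substitutes a simplified pattern (date_re, an illustrative name) for the strptime-validated one.

```ruby
# Shortened, hypothetical sample shaped like the listing in the question.
text = <<~EOT
  sep 25 fri The Phenomenauts, The Atom Age
     at Jub Jubs, Reno, NV 21+ 8pm
  oct 26 sat The Holdup, The Wheeland Brothers
     at the El Rey Theatre, Chico 18+ 7:30pm
EOT

# Assumed delimiter: month abbreviation, day number, weekday abbreviation.
date_re = /\b([a-z]{3} +\d{1,2} +[a-z]{3})\b/i

# split keeps the captured dates as their own elements; drop empty strings,
# then glue each date to the text that follows it.
pieces  = text.split(date_re).reject(&:empty?)
entries = pieces.each_slice(2).map(&:join)

puts entries.first
```

If the text can start with something other than a date, the first piece has no date of its own and the pairing shifts by one, so that case would need separate handling.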

Votes 1

Stack Overflow user

Posted 2015-09-26 22:53:08

Ruby has a nice method called slice_before, which Array inherits from Enumerable. I'd use it like this:

str = <<EOT
sep 25 fri The Phenomenauts, The Atom Age, Los Pistoleros, The Shames
    at Jub Jubs, 71 S Wells Avenue, Reno, NV 21+ 8pm *** @
sep 25 fri The Holdup, The Wheeland Brothers
    at the El Rey Theatre, Chico 18+ (a/a with adult) 7:30pm/8:30pm **
EOT

MONTHS = %w[jan feb mar apr may jun jul aug sep oct nov dec]
MONTH_PATTERN = Regexp.union(MONTHS).source # => "jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec"
MONTH_REGEX = /^(?:#{ MONTH_PATTERN })\b/i # => /^(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\b/i

schedule = str.lines.slice_before(MONTH_REGEX).to_a
# => [["sep 25 fri The Phenomenauts, The Atom Age, Los Pistoleros, The Shames\n",
#      "    at Jub Jubs, 71 S Wells Avenue, Reno, NV 21+ 8pm *** @\n"],
#     ["sep 25 fri The Holdup, The Wheeland Brothers\n",
#      "    at the El Rey Theatre, Chico 18+ (a/a with adult) 7:30pm/8:30pm **\n"]]

schedule[0]
# => ["sep 25 fri The Phenomenauts, The Atom Age, Los Pistoleros, The Shames\n",
#     "    at Jub Jubs, 71 S Wells Avenue, Reno, NV 21+ 8pm *** @\n"]

schedule[1]
# => ["sep 25 fri The Holdup, The Wheeland Brothers\n",
#     "    at the El Rey Theatre, Chico 18+ (a/a with adult) 7:30pm/8:30pm **\n"]

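slice_before doesn't work on strings; it works on arrays and other enumerables, so the first step is to use lines (which returns an array of lines in Ruby 2.0 and later) to split the string on line endings. slice_before then looks at each element and starts a new sub-array at every element that matches MONTH_REGEX.

/^(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\b/i basically says: "Starting at the beginning of the line, look for a word that matches a three-letter month name, regardless of letter case."

Because a regular expression marks the "slice before" points, it is easy to customize the exact pattern to match. In this case, lines with leading whitespace are continuation lines; in other words, they are secondary, not primary. You'll see this kind of data output occasionally. Lines without leading whitespace are the break lines that mark the start of a new record.

I could have used the pattern /^\S/, which means "find a line that starts with a non-whitespace character", but I felt that matching something more specific, namely the month abbreviation, was useful and specific enough without wasting time during matching. /^\w{3} \d{1,2} \w{3} / would also work but would be overkill, because the ^ anchor already requires the match to occur at the start of the line. If that doesn't make sense, read the documentation for the Regexp class and experiment in IRB; none of this is difficult.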
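For comparison, the looser /^\S/ pattern mentioned above can be sketched on a shortened sample. The sample text and the names listing and records are illustrative assumptions, not part of the original answer.

```ruby
# Shortened, hypothetical sample: header lines start at column 0,
# continuation lines are indented.
listing = <<~EOT
  sep 25 fri The Phenomenauts, The Atom Age
      at Jub Jubs, Reno, NV 21+ 8pm
  sep 25 fri The Holdup, The Wheeland Brothers
      at the El Rey Theatre, Chico 18+ 7:30pm
EOT

# A line that starts with a non-whitespace character opens a new record;
# indented lines are appended to the record before them.
records = listing.lines.slice_before(/^\S/).map(&:join)

puts records.size
```

This variant relies only on indentation, so it would also start a new record at any unindented line that is not a date, which is the trade-off against the month-specific pattern.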

If needed, the sub-arrays can be joined back into strings:

schedule.map(&:join)
# => ["sep 25 fri The Phenomenauts, The Atom Age, Los Pistoleros, The Shames\n    at Jub Jubs, 71 S Wells Avenue, Reno, NV 21+ 8pm *** @\n",
#     "sep 25 fri The Holdup, The Wheeland Brothers\n    at the El Rey Theatre, Chico 18+ (a/a with adult) 7:30pm/8:30pm **\n"]

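This is a technique we use in-house to parse huge configuration files, by breaking them into lines and finding the section markers with regular expressions.

Votes 1

Stack Overflow user

Posted 2015-09-26 21:03:54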

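Assuming the OP's description:

three_letter_word + whitespace(s) + one_or_two_digit_number would work

is correct,

# ^ limits matches to the start of a line; without it, "ico 18" in "Chico 18+" also matches.
text.split(/^(?=\w{3} +\d{1,2}\b)/)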
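A small sketch of how a lookahead split behaves on a shortened sample, contrasting a line-anchored pattern with an unanchored one. The sample text and the variable names are illustrative, and the leading ^ reflects an added assumption that every entry starts at the beginning of a line.

```ruby
# Shortened, hypothetical sample shaped like the listing in the question.
sample = <<~EOT
  sep 25 fri The Phenomenauts, The Atom Age
     at Jub Jubs, Reno, NV 21+ 8pm
  sep 25 fri The Holdup, The Wheeland Brothers
     at the El Rey Theatre, Chico 18+ 7:30pm
EOT

# Zero-width lookahead: split *before* each date and consume nothing,
# so every character of the input survives in the pieces.
anchored = sample.split(/^(?=\w{3} +\d{1,2}\b)/)

# Without the anchor, any three word characters followed by spaces and
# digits qualify, including "ico 18" inside "Chico 18+".
unanchored = sample.split(/(?=\w{3} +\d{1,2})/)

puts anchored.size
puts unanchored.size
```

Because the lookahead matches an empty string, joining the pieces reproduces the input exactly in both cases; the two patterns differ only in where the cuts fall.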

Votes 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/32801755
