I have two text files that overlap slightly, i.e.:
text1 = """Some of the major unsolved problems in physics are theoretical, meaning that existing theories seem incapable of explaining a certain observed phenomenon or experimental result. The others are experimental, meaning that there is a difficulty in creating an experiment to test a proposed theory or investigate a phenomenon in"""
text2 = """theory or investigate a phenomenon in greater detail.There are still some deficiencies in the Standard Model of physics, such as the origin of mass, the strong CP problem, neutrino mass, matter–antimatter asymmetry, and the nature of dark matter and dark energy."""
As you can see, the last sentence of text1 and the first sentence of text2 overlap slightly. I now want to remove this overlap, i.e. delete from text2 the string that also appears at the end of text1's last sentence.
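To do this, I can extract the last sentence of text1:
text1_last_sentence = list(filter(None,text1.split(".")))[-1]
and the first sentence of text2:
text2_first_sentence = text2.split(".")[0]
But now the question is:
How do I find the part of text2's first sentence that should stay in text2, and then put everything back together?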
Edit 1:
Expected output:
text1 = """Some of the major unsolved problems in physics are theoretical, meaning that existing theories seem incapable of explaining a certain observed phenomenon or experimental result. The others are experimental, meaning that there is a difficulty in creating an experiment to test a proposed theory or investigate a phenomenon in"""
text2 = """greater detail.There are still some deficiencies in the Standard Model of physics, such as the origin of mass, the strong CP problem, neutrino mass, matter–antimatter asymmetry, and the nature of dark matter and dark energy."""
Edit 2:
Here is the full code:
text1 = """Some of the major unsolved problems in physics are theoretical, meaning that existing theories seem incapable of explaining a certain observed phenomenon or experimental result. The others are experimental, meaning that there is a difficulty in creating an experiment to test a proposed theory or investigate a phenomenon in"""
text2 = """theory or investigate a phenomenon in greater detail.There are still some deficiencies in the Standard Model of physics, such as the origin of mass, the strong CP problem, neutrino mass, matter–antimatter asymmetry, and the nature of dark matter and dark energy."""
text1_last_sentence = list(filter(None,text1.split(".")))[-1]
text2_first_sentence = text2.split(".")[0]
print(text1_last_sentence, "\n")
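print(text2_first_sentence, "\n")
This prints:
 The others are experimental, meaning that there is a difficulty in creating an experiment to test a proposed theory or investigate a phenomenon in
theory or investigate a phenomenon in greater detail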
Answered 2019-10-19 10:22:39
Here is a way to find the largest possible overlap:
text1 = """Some of the major unsolved problems in physics are theoretical, meaning that existing theories seem incapable of explaining a certain observed phenomenon or experimental result. The others are experimental, meaning that there is a difficulty in creating an experiment to test a proposed theory or investigate a phenomenon in"""
text2 = """theory or investigate a phenomenon in greater detail.There are still some deficiencies in the Standard Model of physics, such as the origin of mass, the strong CP problem, neutrino mass, matter–antimatter asymmetry, and the nature of dark matter and dark energy."""
def remove_overlap(text1, text2):
    """Returns the part of text2 that doesn't overlap with text1"""
    words1 = text1.split()
    words2 = text2.split()
    # all appearances of the last word of text1 in text2
    last_word_appearances = [index for index, word in enumerate(words2) if word == words1[-1]]
    # we look for the largest possible overlap
    for n in reversed(last_word_appearances):
        # are the first n+1 words of text2 the same as the (n+1) last from text1?
        if words2[:n+1] == words1[-(n+1):]:
            return ' '.join(words2[n+1:])
    else:
        # no overlap found
        return text2
remove_overlap(text1, text2)
# 'greater detail.There are still some deficiencies in [...]
Answered 2019-10-19 09:52:18
This is a bit hacky, but it works:
text1 = """Some of the major unsolved problems in physics are theoretical, meaning that existing theories seem incapable of explaining a certain observed phenomenon or experimental result. The others are experimental, meaning that there is a difficulty in creating an experiment to test a proposed theory or investigate a phenomenon in"""
text2 = """theory or investigate a phenomenon in greater detail.There are still some deficiencies in the Standard Model of physics, such as the origin of mass, the strong CP problem, neutrino mass, matter–antimatter asymmetry, and the nature of dark matter and dark energy."""
text1_ls = list(filter(None,text1.split(".")))[-1]
text2_fs = text2.split(".")[0]
temp2 = text2_fs.split(" ")
for i in range(1, len(temp2)):
    if " ".join(temp2[:i]) not in text1_ls:
        text2_fs = " ".join(temp2[(i - 1):])
        break
print(text1_ls, "\n")
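print(text2_fs, "\n")
Basically, you take larger and larger substrings from the start of text2_fs until one is no longer a substring of text1_ls. That tells you that the last word of that text2_fs substring is the first word that is not present in text1_ls.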
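A related idea, sketched here as an assumption rather than as part of the answer above, is to require that the candidate sits at the very end of text1 instead of anywhere inside it. The helper name `strip_overlap` and the shortened sample strings are hypothetical and only serve the illustration. It works on characters rather than words, so the leading space survives and the two pieces can simply be concatenated again.

```python
def strip_overlap(left: str, right: str) -> str:
    """Return `right` without its longest prefix that is also a suffix of `left`."""
    # Try the longest candidate first so that the largest overlap wins.
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return right[size:]
    # No overlap found: leave `right` unchanged.
    return right


# Shortened stand-ins for the question's text1 and text2 (hypothetical samples).
text1_tail = "to test a proposed theory or investigate a phenomenon in"
text2_head = "theory or investigate a phenomenon in greater detail."

remainder = strip_overlap(text1_tail, text2_head)
merged = text1_tail + remainder
print(repr(remainder))
print(merged)
```

Because the match is character-based, it can also strip a partial word (for example, when the left text ends in "in" and the right text starts with "inside"), so a word-level check may be preferable when that matters.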
Answered 2019-10-19 10:14:08
This may not work for all corner cases, but it works for the given text.
first_word_text2 = text2.split()[0]
pos = len(text1) - text1.rfind(first_word_text2)
text2[pos:].strip()
https://stackoverflow.com/questions/58462328