
Removing the overlap between two blocks of text with Python

Stack Overflow user
Asked 2019-10-19 09:40:10
3 answers, 37 views, 0 followers, 1 vote

I have two text files that overlap slightly, i.e.:

text1 = """Some of the major unsolved problems in physics are theoretical, meaning that existing theories seem incapable of explaining a certain observed phenomenon or experimental result. The others are experimental, meaning that there is a difficulty in creating an experiment to test a proposed theory or investigate a phenomenon in"""

text2 = """theory or investigate a phenomenon in greater detail.There are still some deficiencies in the Standard Model of physics, such as the origin of mass, the strong CP problem, neutrino mass, matter–antimatter asymmetry, and the nature of dark matter and dark energy."""

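As you can see, the last sentence of text1 and the first sentence of text2 overlap slightly. I now want to get rid of this overlap, i.e. remove from text2 the words that already appear in the last sentence of text1.

To do that, I can extract the last sentence of text1:
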
text1_last_sentence = list(filter(None,text1.split(".")))[-1]

And the first sentence of text2:

text2_first_sentence = text2.split(".")[0]

... but now the question is:

How do I find the part of the first sentence of text2 that should stay in text2, and then put everything back together?

Edit 1

Expected output:

text1 = """Some of the major unsolved problems in physics are theoretical, meaning that existing theories seem incapable of explaining a certain observed phenomenon or experimental result. The others are experimental, meaning that there is a difficulty in creating an experiment to test a proposed theory or investigate a phenomenon in"""

text2 = """greater detail.There are still some deficiencies in the Standard Model of physics, such as the origin of mass, the strong CP problem, neutrino mass, matter–antimatter asymmetry, and the nature of dark matter and dark energy."""

Edit 2

Here is the complete code:

text1 = """Some of the major unsolved problems in physics are theoretical, meaning that existing theories seem incapable of explaining a certain observed phenomenon or experimental result. The others are experimental, meaning that there is a difficulty in creating an experiment to test a proposed theory or investigate a phenomenon in"""
text2 = """theory or investigate a phenomenon in greater detail.There are still some deficiencies in the Standard Model of physics, such as the origin of mass, the strong CP problem, neutrino mass, matter–antimatter asymmetry, and the nature of dark matter and dark energy.""" 

text1_last_sentence = list(filter(None,text1.split(".")))[-1]
text2_first_sentence = text2.split(".")[0]

print(text1_last_sentence, "\n")
print(text2_first_sentence, "\n")

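Output:

The others are experimental, meaning that there is a difficulty in creating an experiment to test a proposed theory or investigate a phenomenon in

theory or investigate a phenomenon in greater detail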

3 Answers

Stack Overflow user

Accepted answer

Posted 2019-10-19 10:22:39

Here is one way to find the largest possible overlap:

text1 = """Some of the major unsolved problems in physics are theoretical, meaning that existing theories seem incapable of explaining a certain observed phenomenon or experimental result. The others are experimental, meaning that there is a difficulty in creating an experiment to test a proposed theory or investigate a phenomenon in"""
text2 = """theory or investigate a phenomenon in greater detail.There are still some deficiencies in the Standard Model of physics, such as the origin of mass, the strong CP problem, neutrino mass, matter–antimatter asymmetry, and the nature of dark matter and dark energy."""

def remove_overlap(text1, text2):
    """Returns the part of text2 that doesn't overlap with text1"""

    words1 = text1.split()
    words2 = text2.split()

    # all appearances of the last word of text1 in text2
    last_word_appearances = [index for index, word in enumerate(words2) if word == words1[-1]]
    # we look for the largest possible overlap
    for n in reversed(last_word_appearances):
        # are the first n+1 words of text2 the same as the (n+1) last from text1? 
        if words2[:n+1] == words1[-(n+1):]:
            return ' '.join(words2[n+1:])
    else:
        # no overlap found
        return text2


remove_overlap(text1, text2)
# 'greater detail.There are still some deficiencies in [...]
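
Note that `' '.join(...)` in the function above rebuilds the result with single spaces, so any newlines or repeated spaces in text2 are collapsed. Below is a minimal sketch of a variant that slices the original string instead, assuming words are whitespace-delimited. The name `remove_overlap_keep_spacing` is hypothetical and not part of the answer.

```python
import re

def remove_overlap_keep_spacing(text1, text2):
    """Return text2 without its leading word overlap with the end of text1."""
    words1 = text1.split()
    # (word, end offset in text2) for every whitespace-delimited word
    tokens2 = [(m.group(), m.end()) for m in re.finditer(r"\S+", text2)]
    # try the largest candidate overlap first
    for n in range(min(len(words1), len(tokens2)), 0, -1):
        if [word for word, _ in tokens2[:n]] == words1[-n:]:
            # slice the original string so the remaining spacing is untouched
            return text2[tokens2[n - 1][1]:].lstrip()
    # no overlap found
    return text2
```

Unlike the answer's version, this compares every possible overlap length rather than only the positions of text1's last word, which is simpler but does a little more work.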
Votes: 2

Stack Overflow user

Posted 2019-10-19 09:52:18

This is a bit hacky, but it works:

text1 = """Some of the major unsolved problems in physics are theoretical, meaning that existing theories seem incapable of explaining a certain observed phenomenon or experimental result. The others are experimental, meaning that there is a difficulty in creating an experiment to test a proposed theory or investigate a phenomenon in"""
text2 = """theory or investigate a phenomenon in greater detail.There are still some deficiencies in the Standard Model of physics, such as the origin of mass, the strong CP problem, neutrino mass, matter–antimatter asymmetry, and the nature of dark matter and dark energy.""" 

text1_ls = list(filter(None,text1.split(".")))[-1]
text2_fs = text2.split(".")[0]

temp2 = text2_fs.split(" ")

for i in range(1, len(temp2)):  
    if " ".join(temp2[:i]) not in text1_ls:
        text2_fs = " ".join(temp2[(i - 1):])
        break

print(text1_ls, "\n")
print(text2_fs, "\n")

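Basically, you take increasingly long substrings from text2_fs until one is no longer a substring of text1_ls, which tells you that the last word of that text2_fs substring is the first word that is not part of text1_ls.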
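
One caveat: the substring test only checks that the prefix occurs somewhere in the last sentence of text1, not that text1 actually ends with it, and the loop never tests the complete first sentence. Below is a hedged sketch of the same prefix idea with the match anchored at the end of text1. The name `strip_overlapping_prefix` is made up for illustration, and splitting on single spaces is an assumption carried over from the code above.

```python
def strip_overlapping_prefix(text1, text2):
    """Drop the longest word-prefix of text2 that text1 ends with."""
    words2 = text2.split(" ")
    tail = text1.rstrip()
    # try the longest prefix first so the largest overlap wins
    for i in range(len(words2), 0, -1):
        prefix = " ".join(words2[:i])
        # require a word boundary so "in" does not match the end of "within"
        if prefix and (tail == prefix or tail.endswith(" " + prefix)):
            return " ".join(words2[i:])
    # no overlap found
    return text2
```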

Votes: 1

Stack Overflow user

Posted 2019-10-19 10:14:08

This may not work for all corner cases, but it works for the given text.

first_word_text2 = text2.split()[0]
pos = len(text1) - text1.rfind(first_word_text2)
text2[pos:].strip()
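
As a concrete corner case, with text1 = "the cat" and text2 = "the dog" the snippet above returns an empty string, because `rfind` locates "the" in text1 even though the two texts do not overlap. Below is a rough sketch that keeps the first-word lookup but verifies each candidate before cutting. The name `remove_overlap_by_first_word` and the whitespace-based word-boundary checks are assumptions, not part of the answer.

```python
def remove_overlap_by_first_word(text1, text2):
    """Return text2 without the part that repeats the end of text1."""
    text1 = text1.rstrip()
    text2 = text2.lstrip()
    words2 = text2.split()
    if not words2:
        return text2
    first_word = words2[0]
    # scan left to right so the largest overlap is tried first
    idx = text1.find(first_word)
    while idx != -1:
        tail = text1[idx:]
        # the candidate must start and end on a word boundary
        starts_word = idx == 0 or text1[idx - 1].isspace()
        ends_word = len(tail) == len(text2) or text2[len(tail):len(tail) + 1].isspace()
        if starts_word and ends_word and text2.startswith(tail):
            return text2[len(tail):].strip()
        idx = text1.find(first_word, idx + 1)
    # no verified overlap found
    return text2
```

To put the two texts back together, joining text1 and the returned remainder with a single space should reproduce the combined text, assuming the overlap ended on a word boundary.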
Votes: 1
The original content of this page is provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/58462328
