
How to query the text-file version of the LibreOffice thesaurus in bash (joining lines)

Stack Overflow user

Asked 2022-07-07 20:08:59

Answers: 4 | Views: 98 | Followers: 0 | Votes: 0

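I'm trying to write a simple bash script that queries the LibreOffice thesaurus extension as a text file. For each input query string, I want the output to be all of the related strings. I'd like to do this in bash.

To download and extract the thesaurus, I run: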

wget "https://extensions.libreoffice.org/assets/downloads/41/1653961771/dict-en-20220601_lo.oxt" # download LO dictionary & thesaurus

unzip -p dict-en-20220601_lo.oxt th_en_US_v2.dat > lo # extract contents of thesaurus to text file

Here is a portion of the text file:

nine|3
(adj)|9|ix|cardinal (similar term)
(noun)|9|IX|niner|Nina from Carolina|ennead|digit (generic term)|figure (generic term)
(noun)|baseball club|ball club|club|baseball team (generic term)
nine-banded armadillo|1
(noun)|peba|Texas armadillo|Dasypus novemcinctus|armadillo (generic term)
nine-fold|1
(adj)|nonuple|ninefold|multiple (similar term)
nine-membered|1
(adj)|9-membered|membered (similar term)
nine-sided|1
(adj)|multilateral (similar term)|many-sided (similar term)
nine-spot|1
(noun)|spot (generic term)

So, for example, I'd like to be able to enter "nine" as the query and have bash return something like:

9
ix
cardinal
9
IX
niner
Nina from Carolina
ennead
digit
figure
baseball club
ball club
club
baseball team

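I think this should be fairly easy with the right syntax in awk or sed, especially since no line containing a query term begins with "(", while every line containing related terms does begin with "(".

But I'm still a novice and haven't figured it out. The crux of the problem, it seems to me, is getting the query term and all of its related terms onto a single line. From there, I know how to sed my way to victory. But getting to that point has been challenging for me.

Thanks in advance for your help!

P.S. I tried to adapt something similar, but my case is a bit different and I don't know the syntax well enough to modify it for my needs: https://www.unix.com/unix-for-dummies-questions-and-answers/184649-sed-join-lines-do-not-match-pattern.html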
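As a minimal sketch of the joining step described above, assuming the extracted file follows the layout shown in the excerpt, awk can append every line that begins with "(" to the entry line before it. The function name `th_join` is hypothetical and only used for illustration.

```shell
#!/usr/bin/env bash
# Sketch: put each entry and its sense lines on one line.
# th_join is a hypothetical helper name, not part of any tool.
th_join() {
    awk '
        /^\(/ { line = line "|" $0; next }      # sense line: append to the current entry
        { if (NR > 1) print line; line = $0 }   # entry line: flush the previous entry
        END { if (NR > 0) print line }          # flush the last entry
    ' "$1"
}

# Demo on two entries copied from the excerpt.
demo=$(mktemp)
printf '%s\n' \
    'nine-fold|1' \
    '(adj)|nonuple|ninefold|multiple (similar term)' \
    'nine-sided|1' \
    '(adj)|multilateral (similar term)|many-sided (similar term)' > "$demo"
th_join "$demo"
rm -f "$demo"
```

Once every entry sits on one line, something like `th_join lo | grep '^nine|'` would select the joined line for a single query term.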

Tags: bash, awk, sed

4 Answers

Stack Overflow user

Accepted answer

Posted 2022-07-08 00:15:03

This might work for you (GNU sed):

v=nine
sed -n ':a;/^'"${v}"'|/{:b;n;/^[^(]/ba;s/^[^|]*|\| ([^)]*)//g;y/|/\n/;p;bb}' file

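Focus on the lines that follow a match of the input variable.

Fetch the next line and, if it does not begin with (, start over from the top.

Otherwise, remove the first field and anything in parentheses, replace the field separator | with newlines, print the result, and repeat.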

v=nine # set variable v to `nine`
sed -n ':a # turn off implicit printing and set goto label a
        /^'"${v}"'|/{ # match a line beginning with variable v
          :b # set goto label b
          n # fetch next line (do not print see option -n)
          /^[^(]/ba # goto label a if line does not begin (
          s/^[^|]*|\| ([^)]*)//g # remove first field and parens
          y/|/\n/ # translate | to newline for entire line
          p # print the result
          bb # goto label b
        }' file

To see what the sed script is doing, invoke it with the --debug option.
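For repeated lookups, the one-liner above can be wrapped in a shell function. The following is only a sketch: the function name `th_lookup` and the temporary sample file are assumptions made for the demo, and it relies on GNU sed just as the answer does.

```shell
#!/usr/bin/env bash
# Sketch: wrap the GNU sed one-liner from this answer in a function.
# th_lookup is a hypothetical name; $1 is the query term, $2 the thesaurus file.
th_lookup() {
    local v=$1 file=$2
    sed -n ':a;/^'"${v}"'|/{:b;n;/^[^(]/ba;s/^[^|]*|\| ([^)]*)//g;y/|/\n/;p;bb}' "$file"
}

# Small sample in the thesaurus format, copied from the excerpt in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
nine|3
(adj)|9|ix|cardinal (similar term)
(noun)|9|IX|niner|Nina from Carolina|ennead|digit (generic term)|figure (generic term)
(noun)|baseball club|ball club|club|baseball team (generic term)
nine-banded armadillo|1
(noun)|peba|Texas armadillo|Dasypus novemcinctus|armadillo (generic term)
EOF

th_lookup nine "$sample"   # prints one related term per line
rm -f "$sample"
```

Note that the query is interpolated into the regular expression unescaped, so a term containing characters such as `.` or `/` would need escaping first.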

Votes: 1

Stack Overflow user

Posted 2022-07-07 22:49:30

Using sed

$ cat script.sed
N
{
    /\(/ {
        /9/!s/[^|]*\|//
        s/\n/ /
        {
            /[^|]*\|(9\|)/ { 
                s//\1/
                s/([^|]*)\|/\1\n/g
                s/\([^)]*\)//
                s/\([^)]*\)//g
                p
            }
        }
    }
}

$ sed -Enf script.sed input_file
nine
9
ix
cardinal
9
IX
niner
Nina from Carolina
ennead
digit
figure
baseball club
ball club
club
baseball team

Votes: 1

Stack Overflow user

Posted 2022-07-07 21:13:00

If I understand your question correctly, here is an awk solution.

File search.awk:

#! /usr/bin/awk -f

# This block is executed BEFORE input file treatment.
BEGIN {
    # Field Separator
    FS = "|"
}

# The next blocks are executed for each input file line only if the condition in front of the block is true

# '$1' is the first field/column. Remember, the field separator is the pipe (|)
$1 == KEY {
    # Key found, flag it
    flag  = 1
    # Associated words init
    words = ""
    # Do not check the next blocks conditions, process the next line of the input file
    next
}

# If the flag is 1 and the line begins with an open parenthesis.
flag == 1 && $0 ~ /^\(/ {
    # Association found
    # For all associations (field)
    # The line treatment starts with the second field
    idx = 2
    # NF is the Number of Fields in the current line
    while (idx <= NF) {
        # get the current field's word (idx is the field number, $idx is its value)
        word = $idx
        # remove term in parenthesis
        # (in fact, replace all characters after the ' (' token by an empty string)
        gsub(/ \(.*$/, "", word)
        # save it (append it to the 'words' string with a comma as separator)
        words = words "," word
        # next field
        idx += 1
    }
}

# If the flag is 1 and the line does NOT begin with an open parenthesis,
# it's the end of the KEY treatment
flag == 1 && $0 !~ /^\(/ {
    # End of association
    flag = 0
    # Print Key and words
    if (words != "") {
        print KEY words
    }
    # Reinit words
    words = ""
}

# This block is executed AFTER input file treatment.
END {
    # Special case, last word in thesaurus
    # Print Key and words
    if (words != "") {
        print KEY words
    }
}

Make it executable:

chmod 755 ./search.awk

Use it like this:

./search.awk -v KEY="nine" lo

Output:

nine,9,ix,cardinal,9,IX,niner,Nina from Carolina,ennead,digit,figure,baseball club,ball club,club,baseball team
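The question asked for one related term per line rather than a comma-separated list. A shorter awk variant along those lines might look like the sketch below; the function name `th_terms` is hypothetical, and it assumes the same file layout as the excerpt in the question.

```shell
#!/usr/bin/env bash
# Sketch: print each related term of a query on its own line.
# th_terms is a hypothetical name; $1 is the query term, $2 the thesaurus file.
th_terms() {
    awk -F'|' -v key="$1" '
        !/^\(/ { found = ($1 == key); next }    # entry line: remember whether it matches
        found {                                 # sense line under the matching entry
            for (i = 2; i <= NF; i++) {
                term = $i
                sub(/ \([^)]*\)$/, "", term)    # drop a trailing " (generic term)" and the like
                print term
            }
        }
    ' "$2"
}

# Demo on a few lines copied from the excerpt in the question.
demo=$(mktemp)
printf '%s\n' \
    'nine|3' \
    '(adj)|9|ix|cardinal (similar term)' \
    'nine-spot|1' \
    '(noun)|spot (generic term)' > "$demo"
th_terms nine "$demo"
rm -f "$demo"
```

Because the key is compared with `==` instead of being used as a regular expression, query terms containing regex metacharacters need no escaping here.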

Votes: 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/72903668
