
How to query a text-file version of the LibreOffice thesaurus in bash (joining lines)

Stack Overflow user
Asked 2022-07-07 20:08:59
4 answers, 98 views, 0 followers, 0 votes

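I'm trying to write a simple bash script that queries the LibreOffice thesaurus extension as a text file. For each input query string, I want the output to be all of the related strings. I want to do this in bash.

To download and unzip the thesaurus, I run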

wget "https://extensions.libreoffice.org/assets/downloads/41/1653961771/dict-en-20220601_lo.oxt" # download LO dictionary & thesaurus

unzip -p dict-en-20220601_lo.oxt th_en_US_v2.dat > lo # extract contents of thesaurus to text file

Here is part of the text file:

nine|3
(adj)|9|ix|cardinal (similar term)
(noun)|9|IX|niner|Nina from Carolina|ennead|digit (generic term)|figure (generic term)
(noun)|baseball club|ball club|club|baseball team (generic term)
nine-banded armadillo|1
(noun)|peba|Texas armadillo|Dasypus novemcinctus|armadillo (generic term)
nine-fold|1
(adj)|nonuple|ninefold|multiple (similar term)
nine-membered|1
(adj)|9-membered|membered (similar term)
nine-sided|1
(adj)|multilateral (similar term)|many-sided (similar term)
nine-spot|1
(noun)|spot (generic term)

So, for example, I'd like to be able to enter "nine" as the query and have bash return something like

9
ix
cardinal
9
IX
niner
Nina from Carolina
ennead
digit
figure
baseball club
ball club
club
baseball team

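I think this should be fairly easy with the right awk or sed syntax, especially since every line that contains a query term does not start with "(", while every line that contains related terms does start with "(".

But I'm still a novice and haven't figured it out. The crux of the problem seems to be getting the query term and all of its related terms onto a single line. From there, I know how to sed my way to victory. Getting to that point is the challenging part for me.

TIA, and thanks for your help!

P.S. I'm trying to do something similar to the following, but my situation is a bit different and I don't understand the syntax well enough to adapt it to my needs: https://www.unix.com/unix-for-dummies-questions-and-answers/184649-sed-join-lines-do-not-match-pattern.html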
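
The joining step described in the question can be sketched with awk. This is only an illustration under stated assumptions: the sample file holds two entries copied from the excerpt, and join_entries is a hypothetical helper name, not part of any tool.

```shell
# Hypothetical sample file with two entries taken from the excerpt.
sample=$(mktemp)
cat > "$sample" <<'EOF'
nine|3
(adj)|9|ix|cardinal (similar term)
(noun)|9|IX|niner|Nina from Carolina|ennead|digit (generic term)|figure (generic term)
(noun)|baseball club|ball club|club|baseball team (generic term)
nine-fold|1
(adj)|nonuple|ninefold|multiple (similar term)
EOF

# join_entries FILE: print each headword line and its related-term lines as one line.
join_entries() {
    awk '
        # Related-terms line (starts with "("): append it to the current entry.
        /^\(/ { entry = entry "|" $0; next }
        # Headword line: flush the previous entry, then start a new one.
        {
            if (entry != "") print entry
            entry = $0
        }
        # Flush the last entry in the file.
        END { if (entry != "") print entry }
    ' "$1"
}

# Keep only the joined line for the headword "nine".
join_entries "$sample" | grep '^nine|'
```

Once each entry sits on one line, the remaining cleanup (dropping the part-of-speech tags and the parenthesized notes) is ordinary sed substitution.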

4 answers

Stack Overflow user

Accepted answer

Posted 2022-07-08 00:15:03

This might work for you (GNU sed):

v=nine
sed -n ':a;/^'"${v}"'|/{:b;n;/^[^(]/ba;s/^[^|]*|\| ([^)]*)//g;y/|/\n/;p;bb}' file

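Focus on the lines that follow a line matching the input variable.

Fetch the next line and, if it does not begin with (, start over from the top.

Otherwise, remove the first field and any parenthesized text, replace the field separator | with newlines, print the result, and repeat.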

v=nine # set variable v to `nine`
sed -n ':a # turn off implicit printing and set goto label a
        /^'"${v}"'|/{ # match a line beginning with variable v
          :b # set goto label b
          n # fetch next line (do not print see option -n)
          /^[^(]/ba # goto label a if line does not begin (
          s/^[^|]*|\| ([^)]*)//g # remove first field and parens
          y/|/\n/ # translate | to newline for entire line
          p # print the result
          bb # goto label b
        }' file

To see what the sed script is doing, run it with the --debug option.
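
As a hedged sketch, the same sed logic can be wrapped in a bash function so that the search word is passed as an argument. The function name lookup and the sample file are hypothetical. The one change in behaviour is that regex metacharacters in the word are escaped first, so a query such as nin. does not match nine. The alternation from the one-liner is split into two s commands here, which avoids relying on the GNU-only \| syntax.

```shell
# Hypothetical sample file with two entries taken from the question's excerpt.
sample=$(mktemp)
cat > "$sample" <<'EOF'
nine|3
(adj)|9|ix|cardinal (similar term)
(noun)|9|IX|niner|Nina from Carolina|ennead|digit (generic term)|figure (generic term)
(noun)|baseball club|ball club|club|baseball team (generic term)
nine-fold|1
(adj)|nonuple|ninefold|multiple (similar term)
EOF

# lookup WORD FILE: print the related terms of WORD, one per line.
lookup() {
    local key
    # Backslash-escape basic-regex metacharacters and the / address
    # delimiter so that WORD is matched literally.
    key=$(printf '%s\n' "$1" | sed 's,[][\\.*^$/],\\&,g')
    sed -n '
        :a
        /^'"$key"'|/{
            :b
            n
            /^[^(]/ba
            s/^[^|]*|//
            s/ ([^)]*)//g
            y/|/\n/
            p
            bb
        }
    ' "$2"
}

lookup nine "$sample"
```

Against the extracted thesaurus from the question, the call would be lookup nine lo.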

Votes: 1

Stack Overflow user

Posted 2022-07-07 22:49:30

Using sed

$ cat script.sed
N
{
    /\(/ {
        /9/!s/[^|]*\|//
        s/\n/ /
        {
            /[^|]*\|(9\|)/ { 
                s//\1/
                s/([^|]*)\|/\1\n/g
                s/\([^)]*\)//
                s/\([^)]*\)//g
                p
            }
        }
    }
}

$ sed -Enf script.sed input_file
nine
9
ix
cardinal
9
IX
niner
Nina from Carolina
ennead
digit
figure
baseball club
ball club
club
baseball team

Votes: 1

Stack Overflow user

Posted 2022-07-07 21:13:00

If I understand your question correctly, here is an awk solution.

File search.awk

#! /usr/bin/awk -f

# This block is executed BEFORE input file treatment.
BEGIN {
    # Field Separator
    FS = "|"
}

# The following blocks are executed for each input line, but only if the condition in front of the block is true

# '$1' is the first field/column. Remember, the field separator is the pipe (|)
$1 == KEY {
    # Key found, flag it
    flag  = 1
    # Associated words init
    words = ""
    # Do not check the next blocks conditions, process the next line of the input file
    next
}

# If the flag is 1 and the line begins with an open parenthesis.
flag == 1 && $0 ~ /^\(/ {
    # Association found
    # For all associations (field)
    # The line treatment starts with the second field
    idx = 2
    # NF is the Number of Fields in the current line
    while (idx <= NF) {
        # get the current field word (idx is the field number, $idx is its value)
        word = $idx
        # remove term in parenthesis
        # (in fact, replace all characters after the ' (' token by an empty string)
        gsub(/ \(.*$/, "", word)
        # save it (add it to the 'words' string with a comma as separator)
        words = words "," word
        # next field
        idx += 1
    }
}

# If the flag is 1 and the line does NOT begin with an open parenthesis,
#  the entry for KEY has ended
flag == 1 && $0 !~ /^\(/ {
    # End of association
    flag = 0
    # Print Key and words
    if (words != "") {
        print KEY words
    }
    # Reinit words
    words = ""
}

# This block is executed AFTER input file treatment.
END {
    # Special case, last word in thesaurus
    # Print Key and words
    if (words != "") {
        print KEY words
    }
}

Make it executable:

chmod 755 ./search.awk

Use it like this:

./search.awk -v KEY="nine" lo

Output:

nine,9,ix,cardinal,9,IX,niner,Nina from Carolina,ennead,digit,figure,baseball club,ball club,club,baseball team
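
The output above is a single comma-separated line, while the question asked for one term per line. A minimal sketch of that variant follows. It uses a different technique from search.awk: instead of testing for a leading parenthesis, it assumes the number after the headword (the 3 in nine|3) is the count of related-term lines that follow, as the excerpt in the question suggests. The names lookup_lines and sample are hypothetical and used only for illustration.

```shell
# Hypothetical sample file with two entries taken from the question's excerpt.
sample=$(mktemp)
cat > "$sample" <<'EOF'
nine|3
(adj)|9|ix|cardinal (similar term)
(noun)|9|IX|niner|Nina from Carolina|ennead|digit (generic term)|figure (generic term)
(noun)|baseball club|ball club|club|baseball team (generic term)
nine-fold|1
(adj)|nonuple|ninefold|multiple (similar term)
EOF

# lookup_lines WORD FILE: print the related terms of WORD, one per line.
lookup_lines() {
    awk -F'|' -v key="$1" '
        # Still inside the entry for key: print fields 2..NF of this line.
        remaining > 0 {
            for (i = 2; i <= NF; i++) {
                term = $i
                # Drop a trailing note such as " (generic term)".
                sub(/ \([^)]*\)$/, "", term)
                print term
            }
            remaining--
            next
        }
        # Headword line: remember how many related-term lines follow (assumed meaning of field 2).
        $1 == key { remaining = $2 + 0 }
    ' "$2"
}

lookup_lines nine "$sample"
```

Against the extracted thesaurus from the question, the call would be lookup_lines nine lo. Because the headword is compared with == rather than a regular expression, no escaping of the query is needed.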
Votes: 0
Original content provided by Stack Overflow.
Source: https://stackoverflow.com/questions/72903668