Q: Regex search in Python using defined strings

Stack Overflow user

Asked 2020-03-04 11:01:21

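I would like to improve the script below. I want to know whether I can use defined strings (such as 'G', 'SG', 'PF', 'PG', 'SF', 'F', 'UTIL', 'C') to search for the names between them, and then use those strings as the column names. The problem with my current setup is that if a name starts with two uppercase letters, as in the example below, the script cannot tell it apart from a position label. Being able to set the strings to search for with a regex, and then return the text between them, is what I think the next step in improving the function would be.

Previous question: Python: Regex or Dictionary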

import pandas as pd, numpy as np
import re

dk_cont_lineup_df = pd.DataFrame(data=np.array([['G CJ McCollum SG Donovan Mitchell PF Robert Covington PG Collin Sexton SF Bojan Bogdanovic F Larry Nance Jr. UTIL Trey Lyles C Maxi Kleber'],['UTIL Nikola Vucevic PF Kevin Love F Robert Covington SG Collin Sexton SF Bojan Bogdanovic G Coby White PG RJ Barrett C Larry Nance Jr.']]))
dk_cont_lineup_df.rename(columns={ dk_cont_lineup_df.columns[0]: 'Lineup' }, inplace = True)


def calc_col(col):
    '''This function takes a string,
    finds the upper case letters or words placed as delimiter,
    converts it to a list,
    adds a number to the list elements if recurring.
    Eg. input list :['W','W','W','D','D','G','C','C','UTIL']
    o/p list: ['W1','W2','W3','D1','D2','G','C1','C2','UTIL']
    '''
    col_list = re.findall(" ?([A-Z]+) ", col)
    col_list2 = []
    for i_pos in col_list:
        cnt = col_list.count(i_pos)
        if cnt == 1:
            col_list2.append(i_pos)
        if cnt > 1:
            # skip labels that were already numbered (exact match, not substring)
            if i_pos + "1" in col_list2:
                continue
            col_list2 += [i_pos+str(k) for k in range(1,cnt+1)]
    return col_list2


# Split the lineup into separate columns
extr_row = dk_cont_lineup_df['Lineup'].replace(to_replace =" ?[A-Z]+ ", value="\n", regex = True) # replace each position label with a newline
df_final = pd.DataFrame(columns = sorted(calc_col(dk_cont_lineup_df['Lineup'].iloc[0]))) # create an empty DataFrame with sorted columns
for i_pos in range(len(extr_row)): # traverse all rows of the original DataFrame and append the formatted rows to df_final
    df_temp = pd.DataFrame((extr_row.values[i_pos].split("\n")[1:])).T
    df_temp.columns = calc_col(dk_cont_lineup_df['Lineup'].iloc[i_pos])
    df_temp= df_temp[sorted(df_temp)]
    df_final = pd.concat([df_final, df_temp])  # DataFrame.append was removed in pandas 2.0
df_final.reset_index(drop = True, inplace = True)

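Output:

Desired output:

I want to use this script for other data that contains other strings, so it would be easier if I could define what I am looking for. As the input dataframe shows, the search strings do not appear in the same order in every row. The script above sorts them into order, as can be seen in the desired output dataframe.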

python

regex

1 Answer

Stack Overflow user

Answered 2020-03-05 07:44:26

We can simply update your regex to check whether the uppercase word comes directly after another uppercase word.

r"(?<![A-Z] )\b([A-Z]+) "

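Note that we have added a negative lookbehind. If the uppercase word is immediately preceded by an uppercase letter and a space (that is, by another uppercase token), it is not matched. This is what stops CJ in "G CJ McCollum" from being treated as a position label.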
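
As a quick check of the lookbehind, the following minimal sketch runs both patterns over the first lineup string from the question. The variable names (`lineup`, `old_pattern`, `new_pattern`) are chosen here for illustration only.

```python
import re

# First lineup string from the question
lineup = ("G CJ McCollum SG Donovan Mitchell PF Robert Covington "
          "PG Collin Sexton SF Bojan Bogdanovic F Larry Nance Jr. "
          "UTIL Trey Lyles C Maxi Kleber")

old_pattern = r" ?([A-Z]+) "              # original pattern from the question
new_pattern = r"(?<![A-Z] )\b([A-Z]+) "   # pattern with the negative lookbehind

old_labels = re.findall(old_pattern, lineup)
new_labels = re.findall(new_pattern, lineup)

print(old_labels)  # includes 'CJ', which is part of a player name
print(new_labels)  # 'CJ' is rejected because it directly follows 'G '
```

The old pattern returns nine labels for this string, the new one returns the eight position labels.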

You can find a more in-depth explanation of the regex above here: https://regex101.com/r/j6RbSP/1

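Now you can update your code to include the new regex pattern, making sure you remember to prefix the string with r so that it is a raw string.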

import pandas as pd, numpy as np
import re

dk_cont_lineup_df = pd.DataFrame(data=np.array([['G CJ McCollum SG Donovan Mitchell PF Robert Covington PG Collin Sexton SF Bojan Bogdanovic F Larry Nance Jr. UTIL Trey Lyles C Maxi Kleber'],['UTIL Nikola Vucevic PF Kevin Love F Robert Covington SG Collin Sexton SF Bojan Bogdanovic G Coby White PG RJ Barrett C Larry Nance Jr.']]))
dk_cont_lineup_df.rename(columns={ dk_cont_lineup_df.columns[0]: 'Lineup' }, inplace = True)


def calc_col(col):
    '''This function takes a string,
    finds the upper case letters or words placed as delimiter,
    converts it to a list,
    adds a number to the list elements if recurring.
    Eg. input list :['W','W','W','D','D','G','C','C','UTIL']
    o/p list: ['W1','W2','W3','D1','D2','G','C1','C2','UTIL']
    '''
    col_list = re.findall(r"(?<![A-Z] )\b([A-Z]+) ", col)
    col_list2 = []
    for i_pos in col_list:
        cnt = col_list.count(i_pos)
        if cnt == 1:
            col_list2.append(i_pos)
        if cnt > 1:
            # skip labels that were already numbered (exact match, not substring)
            if i_pos + "1" in col_list2:
                continue
            col_list2 += [i_pos+str(k) for k in range(1,cnt+1)]
    return col_list2


extr_row = dk_cont_lineup_df['Lineup'].replace(to_replace =r"(?<![A-Z] )\b([A-Z]+) ", value="\n", regex = True) # replace each position label with a newline
df_final = pd.DataFrame(columns = sorted(calc_col(dk_cont_lineup_df['Lineup'].iloc[0])))

for i_pos in range(len(extr_row)): # traverse all rows of the original DataFrame and append the formatted rows to df_final
    df_temp = pd.DataFrame((extr_row.values[i_pos].split("\n")[1:])).T
    df_temp.columns = calc_col(dk_cont_lineup_df['Lineup'].iloc[i_pos])
    df_temp= df_temp[sorted(df_temp)]
    df_final = pd.concat([df_final, df_temp])  # DataFrame.append was removed in pandas 2.0
df_final.reset_index(drop = True, inplace = True)

print(df_final.to_string())

This produces the desired output:

                 C                  F             G                 PF              PG                 SF                 SG             UTIL
0      Maxi Kleber   Larry Nance Jr.   CJ McCollum   Robert Covington   Collin Sexton   Bojan Bogdanovic   Donovan Mitchell       Trey Lyles 
1  Larry Nance Jr.  Robert Covington    Coby White         Kevin Love      RJ Barrett   Bojan Bogdanovic      Collin Sexton   Nikola Vucevic
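
The question also asks whether the labels can be supplied as a defined list rather than inferred from capitalisation. The following is only a sketch of one possible way to do that, not part of the original answer: it builds an alternation from the list and splits each lineup on whole-token matches. The names `POSITIONS`, `LABEL_RE` and `split_lineup` are hypothetical, and the sketch assumes that no player name contains a token identical to one of the labels.

```python
import re
from collections import Counter

# Labels to search for, taken from the question
POSITIONS = ['G', 'SG', 'PF', 'PG', 'SF', 'F', 'UTIL', 'C']

# Longest labels first (a defensive choice); the lookarounds require each
# label to be a whole whitespace-delimited token, so 'C' never matches in 'CJ'.
alternation = "|".join(sorted(map(re.escape, POSITIONS), key=len, reverse=True))
LABEL_RE = re.compile(rf"(?<!\S)({alternation})(?!\S)")


def split_lineup(lineup):
    """Return {label: name} for one lineup string, numbering repeated labels."""
    # With one capture group, re.split returns
    # [text before first label, label1, text1, label2, text2, ...]
    parts = LABEL_RE.split(lineup)
    labels, names = parts[1::2], parts[2::2]
    totals = Counter(labels)
    seen = Counter()
    row = {}
    for label, name in zip(labels, names):
        if totals[label] > 1:          # e.g. G, G -> G1, G2
            seen[label] += 1
            label = f"{label}{seen[label]}"
        row[label] = name.strip()
    return row


lineups = [
    "G CJ McCollum SG Donovan Mitchell PF Robert Covington PG Collin Sexton "
    "SF Bojan Bogdanovic F Larry Nance Jr. UTIL Trey Lyles C Maxi Kleber",
    "UTIL Nikola Vucevic PF Kevin Love F Robert Covington SG Collin Sexton "
    "SF Bojan Bogdanovic G Coby White PG RJ Barrett C Larry Nance Jr.",
]
rows = [split_lineup(text) for text in lineups]
```

Because each row is a plain dict keyed by label, `pd.DataFrame(rows)` should give one column per label regardless of the order in which the labels appear in each string, and the columns can then be sorted if needed.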

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/60518454
