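I would like to improve the script below. I want to know whether I can use a defined set of strings (such as 'G', 'SG', 'PF', 'PG', 'SF', 'F', 'UTIL', 'C') to search for the names between them, and then use those strings as the column names. The problem with my current setup is that if a name starts with two capital letters, as in the example below, it cannot tell the difference. Being able to set the strings to search for, using a regex, and then return the text between them would, I think, be the next step in improving the function.
Previous question: Python: Regex or Dictionary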
import re
import pandas as pd, numpy as np

dk_cont_lineup_df = pd.DataFrame(data=np.array([['G CJ McCollum SG Donovan Mitchell PF Robert Covington PG Collin Sexton SF Bojan Bogdanovic F Larry Nance Jr. UTIL Trey Lyles C Maxi Kleber'],['UTIL Nikola Vucevic PF Kevin Love F Robert Covington SG Collin Sexton SF Bojan Bogdanovic G Coby White PG RJ Barrett C Larry Nance Jr.']]))
dk_cont_lineup_df.rename(columns={dk_cont_lineup_df.columns[0]: 'Lineup'}, inplace=True)

def calc_col(col):
    '''This function takes a string,
    finds the upper case letters or words placed as delimiters,
    converts them to a list,
    and adds a number to the list elements if recurring.
    E.g. input list:  ['W','W','W','D','D','G','C','C','UTIL']
         output list: ['W1','W2','W3','D1','D2','G','C1','C2','UTIL']
    '''
    col_list = re.findall(" ?([A-Z]+) ", col)
    col_list2 = []
    for i_pos in col_list:
        cnt = col_list.count(i_pos)
        if cnt == 1:
            col_list2.append(i_pos)
        elif i_pos + "1" not in col_list2:
            # First sighting of a recurring label: add all numbered variants at once
            col_list2 += [i_pos + str(k) for k in range(1, cnt + 1)]
    return col_list2

# START OF SPLIT LINEUP INTO SEPARATE COLUMNS
# Replace every upper-case word with a newline so each row can be split on it
extr_row = dk_cont_lineup_df['Lineup'].replace(to_replace=" ?[A-Z]+ ", value="\n", regex=True)
rows = []
for i_pos in range(len(extr_row)):  # traverse all the rows in the original dataframe
    df_temp = pd.DataFrame(extr_row.values[i_pos].split("\n")[1:]).T
    df_temp.columns = calc_col(dk_cont_lineup_df['Lineup'].iloc[i_pos])
    rows.append(df_temp[sorted(df_temp)])  # sort the columns of each formatted row
# DataFrame.append was removed in pandas 2.0, so the rows are concatenated once here
df_final = pd.concat(rows, ignore_index=True)

Output:

Desired output:

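I want to use this script for other data that contains other strings, so it would help to be able to define what I am looking for. As the input dataframe shows, the search strings do not always appear in the same order. The script above sorts them, as can be seen in the desired output dataframe.
Posted on 2020-03-05 07:44:26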
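We can simply update your regex to check whether the uppercase word comes immediately after another uppercase word.
r"(?<![A-Z] )\b([A-Z]+) "
Note that we have added a negative lookbehind: the pattern does not match when the text just before the word is an uppercase letter followed by a space, that is, when the previous word ends in [A-Z].
You can find a more in-depth explanation of the regex above here: https://regex101.com/r/j6RbSP/1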
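As a quick check of what the lookbehind changes, here is a small sketch that runs both patterns over a shortened lineup string. The `sample` string and the names `OLD` and `NEW` are made up for illustration; only the two patterns come from the question and this answer.

```python
import re

# Shortened, made-up lineup containing two names that start with all-caps words
sample = "G CJ McCollum SG Donovan Mitchell PG RJ Barrett C Larry Nance Jr."

OLD = r" ?([A-Z]+) "              # pattern from the question
NEW = r"(?<![A-Z] )\b([A-Z]+) "   # pattern with the negative lookbehind

old_labels = re.findall(OLD, sample)
new_labels = re.findall(NEW, sample)

print(old_labels)  # ['G', 'CJ', 'SG', 'PG', 'RJ', 'C']
print(new_labels)  # ['G', 'SG', 'PG', 'C']
```

The lookbehind is a heuristic rather than a list of known labels: any all-caps word that does not directly follow another all-caps word is still treated as a label.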
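Now you can update your code to include the new regex pattern, making sure you remember to prefix the string with r so that it is a raw string.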
import pandas as pd, numpy as np
import re

dk_cont_lineup_df = pd.DataFrame(data=np.array([['G CJ McCollum SG Donovan Mitchell PF Robert Covington PG Collin Sexton SF Bojan Bogdanovic F Larry Nance Jr. UTIL Trey Lyles C Maxi Kleber'],['UTIL Nikola Vucevic PF Kevin Love F Robert Covington SG Collin Sexton SF Bojan Bogdanovic G Coby White PG RJ Barrett C Larry Nance Jr.']]))
dk_cont_lineup_df.rename(columns={dk_cont_lineup_df.columns[0]: 'Lineup'}, inplace=True)

def calc_col(col):
    '''This function takes a string,
    finds the upper case letters or words placed as delimiters,
    converts them to a list,
    and adds a number to the list elements if recurring.
    E.g. input list:  ['W','W','W','D','D','G','C','C','UTIL']
         output list: ['W1','W2','W3','D1','D2','G','C1','C2','UTIL']
    '''
    col_list = re.findall(r"(?<![A-Z] )\b([A-Z]+) ", col)
    col_list2 = []
    for i_pos in col_list:
        cnt = col_list.count(i_pos)
        if cnt == 1:
            col_list2.append(i_pos)
        elif i_pos + "1" not in col_list2:
            # First sighting of a recurring label: add all numbered variants at once
            col_list2 += [i_pos + str(k) for k in range(1, cnt + 1)]
    return col_list2

# Replace every position label with a newline so each row can be split on it
extr_row = dk_cont_lineup_df['Lineup'].replace(to_replace=r"(?<![A-Z] )\b([A-Z]+) ", value="\n", regex=True)
rows = []
for i_pos in range(len(extr_row)):  # traverse all the rows in the original dataframe
    # The new pattern no longer consumes the space before a label, so strip each name
    names = [name.strip() for name in extr_row.values[i_pos].split("\n")[1:]]
    df_temp = pd.DataFrame(names).T
    df_temp.columns = calc_col(dk_cont_lineup_df['Lineup'].iloc[i_pos])
    rows.append(df_temp[sorted(df_temp)])  # sort the columns of each formatted row
# DataFrame.append was removed in pandas 2.0, so the rows are concatenated once here
df_final = pd.concat(rows, ignore_index=True)
print(df_final.to_string())

This produces the desired output:
                 C                 F            G                PF             PG                SF                SG          UTIL
0      Maxi Kleber   Larry Nance Jr.  CJ McCollum  Robert Covington  Collin Sexton  Bojan Bogdanovic  Donovan Mitchell      Trey Lyles
1  Larry Nance Jr.  Robert Covington   Coby White        Kevin Love     RJ Barrett  Bojan Bogdanovic     Collin Sexton  Nikola Vucevic
https://stackoverflow.com/questions/60518454
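The question also asks whether a predefined list of strings can drive the search. One possible way to do that, sketched here under assumptions, is to build the pattern from the list itself and split on it. The helper names `build_label_regex` and `split_lineup` are hypothetical, and the sketch assumes that every label appears as a whole, whitespace-delimited word and that no player name contains a word identical to a label.

```python
import re
from collections import Counter

# Label set copied from the question; swap in other labels for other data
POSITIONS = ['G', 'SG', 'PF', 'PG', 'SF', 'F', 'UTIL', 'C']


def build_label_regex(labels):
    """Compile a pattern matching any label as a whole whitespace-delimited word."""
    # Longest first, so a longer label is preferred when one label prefixes another
    alternation = "|".join(sorted(map(re.escape, labels), key=len, reverse=True))
    return re.compile(rf"(?<!\S)({alternation})(?!\S)")


def split_lineup(lineup, label_re):
    """Return {label: name} for one lineup string, numbering repeated labels."""
    # With one capturing group, split() yields [prefix, label, name, label, name, ...]
    parts = label_re.split(lineup)
    labels, names = parts[1::2], parts[2::2]
    totals, seen, row = Counter(labels), Counter(), {}
    for label, name in zip(labels, names):
        if totals[label] > 1:  # e.g. W, W becomes W1, W2
            seen[label] += 1
            label = f"{label}{seen[label]}"
        row[label] = name.strip()
    return row


lineups = [
    'G CJ McCollum SG Donovan Mitchell PF Robert Covington PG Collin Sexton SF Bojan Bogdanovic F Larry Nance Jr. UTIL Trey Lyles C Maxi Kleber',
    'UTIL Nikola Vucevic PF Kevin Love F Robert Covington SG Collin Sexton SF Bojan Bogdanovic G Coby White PG RJ Barrett C Larry Nance Jr.',
]
label_re = build_label_regex(POSITIONS)
rows = [split_lineup(s, label_re) for s in lineups]
print(rows[0]['G'])   # CJ McCollum
print(rows[1]['PG'])  # RJ Barrett
```

Passing `rows` to `pd.DataFrame` and selecting the columns in sorted order should give the same layout as the table above. Because the pattern only matches the listed labels, all-caps name parts such as CJ or RJ are never mistaken for column names.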