I wrote a Python script that calls Unix sort through the subprocess module. I am trying to sort a table by two columns (2 and 6). Here is what I did:
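    sort_bt=open("sort_blast.txt",'w+')
    sort_file_cmd="sort -k2,2 -k6,6n {0}".format(tab.name)
    subprocess.call(sort_file_cmd,stdout=sort_bt,shell=True)

However, the output file contains an incomplete line, which causes an error when I parse the table. When I check the corresponding entry in the input file that was passed to sort, the line looks perfect. I guess something goes wrong when sort tries to write its results to the specified file, but I don't know how to solve it.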
The line looks like this in the input file:
    gi|191252805|ref|NM_001128633.1| Homo sapiens RIMS binding protein 3C (RIMBP3C), mRNA gnl|BL_ORD_ID|4614 gi|124487059|ref|NP_001074857.1| RIMS-binding protein 2 [Mus musculus] 103 2877 3176 846 941 1.0102e-07 138.0
However, in the output file only gi|19125 is printed. How can I fix this?
Any help would be greatly appreciated.
Ram
Posted on 2013-11-09 09:33:29
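Considering that Python has a built-in way to sort items, calling an external sort tool through subprocess seems rather silly.
Looking at the sample data, it appears to be structured data with a | delimiter. Here is how to open the file and iterate over the lines in sorted order in Python: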
    from functools import cmp_to_key

    def custom_sorter(first, second):
        """A custom comparison function which compares lines
        based on the values in the 2nd and 6th columns."""
        # First, break each line into a list, splitting on the pipe character.
        first_items, second_items = first.split('|'), second.split('|')
        if len(first_items) >= 6 and len(second_items) >= 6:
            # We have enough items to compare.
            first_key = (first_items[1], first_items[5])
            second_key = (second_items[1], second_items[5])
            if first_key > second_key:
                return 1
            elif first_key < second_key:
                return -1
            else:  # They are the same,
                return 0  # so the order doesn't matter.
        else:
            return 0

    # src_file_path and dst_sorted_file_path stand for your own file paths.
    with open(src_file_path, 'r') as src_file:
        data = src_file.read()  # Read in the source file all at once. Hope the file isn't too big!

    with open(dst_sorted_file_path, 'w+') as dst_sorted_file:
        # Python 3's sorted() has no cmp= argument, so wrap the comparator with cmp_to_key.
        for line in sorted(data.splitlines(), key=cmp_to_key(custom_sorter)):  # Sort the data on the fly
            dst_sorted_file.write(line + '\n')  # splitlines() strips the newline, so add it back.

FYI, this code might need some tweaking. I haven't tested it very well.
Posted on 2013-11-09 13:22:01
What you are seeing might be the result of several processes trying to write to the file at the same time.
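A related way to end up with a line cut off in the middle is buffering. The sketch below assumes (the question does not say so) that the input file was written by the same script and was still open when the child process started; in that case the last part of the data can still be sitting in Python's buffer. The `write_then_sort` helper and the `CHILD_SORT` one-liner are hypothetical names used only for illustration, and a child Python process stands in for the external `sort` command:

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical stand-in for the external `sort` call: a child Python process
# that reads the file named on its command line and prints the sorted lines.
CHILD_SORT = "import sys; sys.stdout.writelines(sorted(open(sys.argv[1])))"


def write_then_sort(lines):
    """Write lines to a temporary file and have a child process sort it."""
    tab = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False)
    try:
        tab.writelines(line + "\n" for line in lines)
        # Without this flush, the tail of the data may still sit in Python's
        # buffer, and the child could read a file that ends mid-line.
        tab.flush()
        output = subprocess.check_output(
            [sys.executable, "-c", CHILD_SORT, tab.name]
        )
    finally:
        tab.close()
        os.remove(tab.name)
    return output.decode().splitlines()
```

If this is what happens in your script, calling `tab.flush()` (or closing `tab`) before starting sort should be enough.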
To emulate the shell command sort -k2,2 -k6,6n ${tabname} > sort_blast.txt in Python:
    from subprocess import check_call

    with open("sort_blast.txt", 'wb') as output_file:
        check_call("sort -k2,2 -k6,6n".split() + [tab.name], stdout=output_file)

You could also write it in pure Python, e.g., for a small input file:
    def custom_key(line):
        fields = line.split()  # split line on any whitespace
        return fields[1], float(fields[5])  # Python uses zero-based indexing

    with open(tab.name) as input_file, open("sort_blast.txt", 'w') as output_file:
        L = input_file.read().splitlines()  # read from the input file
        L.sort(key=custom_key)  # sort it
        output_file.write("\n".join(L))  # write to the output file

If you need to sort a file that does not fit in memory, see "Sorting text file by using Python".
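For the case where the file does not fit in memory, one common approach is an external merge sort: sort fixed-size chunks in memory, write each chunk to a temporary file, then merge the sorted chunks with `heapq.merge`. This is only a rough sketch, not necessarily what the linked question recommends. The `external_sort` function and its `chunk_lines` parameter are made-up names, and `heapq.merge(..., key=...)` requires Python 3.5 or later:

```python
import heapq
import itertools
import os
import tempfile


def custom_key(line):
    fields = line.split()  # split line on any whitespace
    return fields[1], float(fields[5])  # 2nd column as text, 6th as a number


def external_sort(src_path, dst_path, key=custom_key, chunk_lines=100000):
    """Sort the lines of src_path into dst_path, chunk_lines lines at a time."""
    chunk_paths = []
    try:
        with open(src_path) as src:
            while True:
                # Read at most chunk_lines lines into memory.
                chunk = list(itertools.islice(src, chunk_lines))
                if not chunk:
                    break
                # The last line of the file may lack a trailing newline.
                chunk = [l if l.endswith("\n") else l + "\n" for l in chunk]
                chunk.sort(key=key)
                fd, path = tempfile.mkstemp(suffix=".chunk")
                with os.fdopen(fd, "w") as tmp:
                    tmp.writelines(chunk)
                chunk_paths.append(path)
        # Merge the sorted chunk files lazily, one line at a time.
        files = [open(p) for p in chunk_paths]
        try:
            with open(dst_path, "w") as dst:
                dst.writelines(heapq.merge(*files, key=key))
        finally:
            for f in files:
                f.close()
    finally:
        for p in chunk_paths:
            os.remove(p)
```

You would call it as external_sort(tab.name, "sort_blast.txt"). A smaller chunk_lines uses less memory but creates more temporary files, and every chunk file stays open during the merge.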
https://stackoverflow.com/questions/19857907