My system is OS7, and I installed torque-6.1.0, configured with ./configure --prefix=/opt/pbs --with-debug --with-scp --disable-gcc-warnings.
My server is named "node00", and I added a compute node named "node01".
[root@node00 torque]# pbsnodes
node01
state = free
power_state = Running
np = 16
ntype = cluster
status = opsys=linux,uname=Linux node01 2.6.32-358.el6.x86_64 #1 SMP Fri Feb 22 00:31:26 UTC 2013 x86_64,nsessions=0,nusers=0,idletime=7057,totmem=98382176kb,availmem=97993700kb,physmem=32846184kb,ncpus=16,loadave=0.00,gres=,netload=286314300,state=free,varattr= ,cpuclock=Fixed,macaddr=0c:c4:7a:02:ba:98,version=6.1.0,rectime=1481028058,jobs=
mom_service_port = 15002
mom_manager_port = 15003
I submitted a simple job with echo "sleep 5" | qsub, and qstat -f then showed an error message:
queue_type = E
sched_hint = Unable to copy files back - please see the mother superior's
log for exact details.
comment = Job started on Tue Dec 06 at 21:35
So I read the mother superior's log with vi /var/spool/torque/mom_logs/20161206:
12/06/2016 21:35:33.397;02; pbs_mom.14693;Svr;Log;Log opened
12/06/2016 21:35:33.397;02; pbs_mom.14693;Svr;pbs_mom;Torque Mom Version = 6.1.0, loglevel = 0
12/06/2016 21:35:33.404;02; pbs_mom.14693;Svr;setpbsserver;node00
12/06/2016 21:35:33.404;02; pbs_mom.14693;Svr;mom_server_add;server node00 added
12/06/2016 21:35:33.405;02; pbs_mom.14694;n/a;initialize;independent
12/06/2016 21:35:33.405;02; pbs_mom.14694;Svr;dep_initialize;mom is now oom-killer safe
12/06/2016 21:35:33.405;02; pbs_mom.14694;Svr;read_mom_hierarchy;No local mom hierarchy file found, will request from server.
12/06/2016 21:35:33.407;128; pbs_mom.14694;Svr;pbs_mom;before init_abort_jobs
12/06/2016 21:35:33.410;02; pbs_mom.14694;Svr;pbs_mom;Is up
12/06/2016 21:35:33.410;02; pbs_mom.14694;Svr;setup_program_environment;MOM executable path and mtime at launch: /opt/pbs/sbin/pbs_mom 1481027487
12/06/2016 21:35:33.414;02; pbs_mom.14694;Svr;pbs_mom;Torque Mom Version = 6.1.0, loglevel = 0
12/06/2016 21:35:33.419;01; pbs_mom.14706;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0
12/06/2016 21:35:33.419;01; pbs_mom.14706;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Could not read reply for protocol 4 command 4: End of File
12/06/2016 21:35:33.419;01; pbs_mom.14706;Svr;pbs_mom;LOG_ERROR::mom_server_update_stat, Couldn't read a reply from the server
12/06/2016 21:35:33.419;01; pbs_mom.14706;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Could not contact any of the servers to send an update
12/06/2016 21:35:33.419;01; pbs_mom.14706;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Status not successfully updated for 1 MOM status update intervals
12/06/2016 21:36:18.445;01; pbs_mom.14795;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0
12/06/2016 21:36:18.445;01; pbs_mom.14795;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Could not read reply for protocol 4 command 4: End of File
12/06/2016 21:36:18.445;01; pbs_mom.14795;Svr;pbs_mom;LOG_ERROR::mom_server_update_stat, Couldn't read a reply from the server
12/06/2016 21:36:18.445;01; pbs_mom.14795;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Could not contact any of the servers to send an update
12/06/2016 21:36:18.445;01; pbs_mom.14795;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Status not successfully updated for 2 MOM status update intervals
It seems that node01 and node00 cannot send data to each other. Is that right? How can I fix this?
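Posted on 2018-10-13 15:41:04
Regarding the message in the title, "read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0": this is a bug, which shows up on our systems as follows.
In every case, tcpdump on the pbs_server side shows that the mom is sent a TCP reset right after it sends its status update. This is easy to trace: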
tcpdump -i <interface> tcp port 15001 and tcp[13]=4
08:15:25.162128 IP 10.44.0.94.15001 > 10.44.1.220.215: Flags [R], seq 2437471647, win 0, length 0
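For readers unfamiliar with the tcp[13]=4 expression in the filter above: byte 13 of the TCP header holds the flag bits, and the value 4 is the RST bit on its own. Here is a small Python sketch of that decoding. The function names are mine, purely for illustration, and are not part of tcpdump or Torque.

```python
# Bit values of the TCP flags byte (byte offset 13 of the TCP header).
TCP_FLAGS = {
    0x01: "FIN",
    0x02: "SYN",
    0x04: "RST",
    0x08: "PSH",
    0x10: "ACK",
    0x20: "URG",
    0x40: "ECE",
    0x80: "CWR",
}


def decode_tcp_flags(flags_byte):
    """Return the names of the flags set in a TCP flags byte, lowest bit first."""
    return [name for bit, name in sorted(TCP_FLAGS.items()) if flags_byte & bit]


def matches_exact_rst(flags_byte):
    """Mimic the filter tcp[13]=4: true only when RST is the sole flag set."""
    return flags_byte == 0x04


def matches_any_rst(flags_byte):
    """Mimic a looser filter, tcp[13] & 4 != 0: true whenever RST is set."""
    return flags_byte & 0x04 != 0


if __name__ == "__main__":
    for value in (0x04, 0x14, 0x18):
        print(hex(value), decode_tcp_flags(value),
              matches_exact_rst(value), matches_any_rst(value))
```

Note that the exact comparison will not match a reset that also carries ACK (flags byte 0x14). If you see nothing with tcp[13]=4, a filter such as 'tcp[13] & 4 != 0' catches every packet with RST set.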
On the node this is logged:
10/13/2018 08:15:25.161;01; pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0
10/13/2018 08:15:25.161;01; pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Could not read reply for protocol 4 command 4: End of File
10/13/2018 08:15:25.161;01; pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::mom_server_update_stat, Couldn't read a reply from the server
10/13/2018 08:15:25.161;01; pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Could not contact any of the servers to send an update
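10/13/2018 08:15:25.161;01; pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Status not successfully updated for 1 MOM status update intervals
Update: we eventually solved this by implementing a mom hierarchy in the file /var/spool/torque/server_priv/mom_hierarchy. For a 500-node cluster we defined 8 groups (paths in mom_hierarchy), each with 2 nodes at the top level and the rest of the group's nodes at a second level. Like this: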
<path>
<level>node1,node2</level>
<level> comma separated list of some 60 nodes</level>
</path>
<path>
<level>node2,node1</level>
<level>comma separated list of some 60 nodes</level>
</path>
<path>
<level>node3,node4</level>
<level>comma separated list of some 60 nodes</level>
</path>
<path>
<level>node4,node3</level>
<level>comma separated list of some 60 nodes</level>
</path>
.....
https://stackoverflow.com/questions/40995829
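Writing such a file by hand for hundreds of nodes is tedious, so here is a rough Python sketch that prints paths in the same shape as the listing above: consecutive paths share a pair of top-level nodes in swapped order, and each path has its own second level. The node names, group size and helper names (chunk, build_path, build_hierarchy) are made up for illustration. Adapt them to your cluster and check the output against your Torque version's documentation before putting it in server_priv/mom_hierarchy.

```python
def chunk(items, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_path(top_nodes, other_nodes):
    """Render one <path> element with two <level> lines."""
    return "\n".join([
        "<path>",
        "<level>" + ",".join(top_nodes) + "</level>",
        "<level>" + ",".join(other_nodes) + "</level>",
        "</path>",
    ])


def build_hierarchy(top_pairs, leaf_groups):
    """Render one path per leaf group.

    Every two consecutive groups share one pair of top-level nodes;
    the second group of each pair lists them in swapped order.
    """
    paths = []
    for index, leaves in enumerate(leaf_groups):
        first, second = top_pairs[index // 2]
        top = (first, second) if index % 2 == 0 else (second, first)
        paths.append(build_path(top, leaves))
    return "\n".join(paths) + "\n"


if __name__ == "__main__":
    # Hypothetical node names; a real cluster would use its own host list.
    top_pairs = [("node1", "node2"), ("node3", "node4")]
    leaves = ["node%d" % n for n in range(5, 17)]
    print(build_hierarchy(top_pairs, chunk(leaves, 3)), end="")
```

With the sample input this prints four paths: two headed by node1/node2 (in both orders) and two headed by node3/node4, each followed by three second-level nodes.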