Error message when running a job with Torque: read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0

Stack Overflow user
Asked 2016-12-06 12:49:39
1 answer, 2.9K views, 0 followers, 0 votes

My system is OS7, and I installed torque-6.1.0 configured with ./configure --prefix=/opt/pbs --with-debug --with-scp --disable-gcc-warnings.

My server is named "node00", and I added a compute node named "node01".

[root@node00 torque]# pbsnodes
node01
     state = free
     power_state = Running
     np = 16
     ntype = cluster
     status = opsys=linux,uname=Linux node01 2.6.32-358.el6.x86_64 #1 SMP Fri Feb 22 00:31:26 UTC 2013 x86_64,nsessions=0,nusers=0,idletime=7057,totmem=98382176kb,availmem=97993700kb,physmem=32846184kb,ncpus=16,loadave=0.00,gres=,netload=286314300,state=free,varattr= ,cpuclock=Fixed,macaddr=0c:c4:7a:02:ba:98,version=6.1.0,rectime=1481028058,jobs=
     mom_service_port = 15002
     mom_manager_port = 15003

I submitted a simple job with echo "sleep 5" | qsub, and qstat -f then showed this error message for it:

queue_type = E
sched_hint = Unable to copy files back - please see the mother superior's
    log for exact details.
comment = Job started on Tue Dec 06 at 21:35

So I read the mother superior's log with vi /var/spool/torque/mom_logs/20161206:

12/06/2016 21:35:33.397;02;   pbs_mom.14693;Svr;Log;Log opened
12/06/2016 21:35:33.397;02;   pbs_mom.14693;Svr;pbs_mom;Torque Mom Version = 6.1.0, loglevel = 0
12/06/2016 21:35:33.404;02;   pbs_mom.14693;Svr;setpbsserver;node00
12/06/2016 21:35:33.404;02;   pbs_mom.14693;Svr;mom_server_add;server node00 added
12/06/2016 21:35:33.405;02;   pbs_mom.14694;n/a;initialize;independent
12/06/2016 21:35:33.405;02;   pbs_mom.14694;Svr;dep_initialize;mom is now oom-killer safe
12/06/2016 21:35:33.405;02;   pbs_mom.14694;Svr;read_mom_hierarchy;No local mom hierarchy file found, will request from server.
12/06/2016 21:35:33.407;128;   pbs_mom.14694;Svr;pbs_mom;before init_abort_jobs
12/06/2016 21:35:33.410;02;   pbs_mom.14694;Svr;pbs_mom;Is up
12/06/2016 21:35:33.410;02;   pbs_mom.14694;Svr;setup_program_environment;MOM executable path and mtime at launch: /opt/pbs/sbin/pbs_mom 1481027487
12/06/2016 21:35:33.414;02;   pbs_mom.14694;Svr;pbs_mom;Torque Mom Version = 6.1.0, loglevel = 0
12/06/2016 21:35:33.419;01;   pbs_mom.14706;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0
12/06/2016 21:35:33.419;01;   pbs_mom.14706;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Could not read reply for protocol 4 command 4: End of File
12/06/2016 21:35:33.419;01;   pbs_mom.14706;Svr;pbs_mom;LOG_ERROR::mom_server_update_stat, Couldn't read a reply from the server
12/06/2016 21:35:33.419;01;   pbs_mom.14706;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Could not contact any of the servers to send an update
12/06/2016 21:35:33.419;01;   pbs_mom.14706;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Status not successfully updated for 1 MOM status update intervals
12/06/2016 21:36:18.445;01;   pbs_mom.14795;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0
12/06/2016 21:36:18.445;01;   pbs_mom.14795;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Could not read reply for protocol 4 command 4: End of File
12/06/2016 21:36:18.445;01;   pbs_mom.14795;Svr;pbs_mom;LOG_ERROR::mom_server_update_stat, Couldn't read a reply from the server
12/06/2016 21:36:18.445;01;   pbs_mom.14795;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Could not contact any of the servers to send an update
12/06/2016 21:36:18.445;01;   pbs_mom.14795;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Status not successfully updated for 2 MOM status update intervals

It looks as if node01 and node00 cannot send data to each other. Is that right? How can I fix this?

1 Answer

Stack Overflow user

Answered 2018-10-13 15:41:04

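Regarding the title text, "read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0": this error shows up on systems in the following cases: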

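  1. pbs_mom is running on a node that pbs_server does not know about (the node is not listed in the nodes file).
  2. The /var/spool/torque/server_priv/jobs directory is clogged with job files that should have been deleted when their jobs terminated. This can easily grow to thousands of files, because pbs_server is known to be poor at cleaning up. The same applies to the /var/spool/torque/server_priv/arrays directory.
  3. On a system with 400 nodes and 1000 jobs (queued and/or running), the error can still be seen after the two cases above have been cleared. In that situation it happens 5-10 times per hour.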
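The first two cases can be checked quickly from the server. The following is only a rough sketch: the function names are made up for illustration, the paths are passed in as arguments, and it assumes the nodes file lives at /var/spool/torque/server_priv/nodes with the hostname as the first field of each line.

```shell
#!/bin/sh
# Hypothetical helpers for checking cases 1 and 2 above.
# Nothing here is part of Torque itself; paths are supplied by the caller.

# node_is_listed NODES_FILE NODE
# Succeeds if NODE is the first whitespace-separated field of some line
# in the nodes file (assumed: /var/spool/torque/server_priv/nodes).
node_is_listed() {
    awk -v n="$2" '$1 == n { found = 1 } END { exit !found }' "$1"
}

# count_files DIR
# Prints how many regular files sit directly inside DIR, for example
# /var/spool/torque/server_priv/jobs or /var/spool/torque/server_priv/arrays.
# Uses -maxdepth, which GNU and BSD find support.
count_files() {
    find "$1" -maxdepth 1 -type f | wc -l | tr -d ' '
}
```

For example, node_is_listed /var/spool/torque/server_priv/nodes node01 || echo "node01 is unknown to pbs_server", and count_files /var/spool/torque/server_priv/jobs to see whether the count is far above the number of jobs qstat reports.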

In all of these cases, tcpdump on the pbs_server side shows that the mom is sent a TCP reset after it sends its status update. This is easy to trace:

    tcpdump -i <interface> 'tcp port 15001 and tcp[13] = 4'

    08:15:25.162128 IP 10.44.0.94.15001 > 10.44.1.220.215: Flags [R], seq 2437471647, win 0, length 0

On the node this is logged:
    10/13/2018 08:15:25.161;01; pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0
    10/13/2018 08:15:25.161;01;   pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Could not read reply for protocol 4 command 4: End of File
    10/13/2018 08:15:25.161;01;   pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::mom_server_update_stat, Couldn't read a reply from the server
    10/13/2018 08:15:25.161;01;   pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Could not contact any of the servers to send an update
    10/13/2018 08:15:25.161;01;   pbs_mom.17548;Svr;pbs_mom;LOG_ERROR::send_update_to_a_server, Status not successfully updated for 1 MOM status update intervals

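Update: we eventually solved this by implementing a mom hierarchy in the file /var/spool/torque/server_priv/mom_hierarchy. For a 500-node cluster we defined 8 groups (paths in mom_hierarchy), each with 2 nodes at the top level and the rest of that group's nodes at a second level. Like this: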

<path>
<level>node1,node2</level>
<level>comma separated list of some 60 nodes</level>
</path>
<path>
<level>node2,node1</level>
<level>comma separated list of some 60 nodes</level>
</path>
<path>
<level>node3,node4</level>
<level>comma separated list of some 60 nodes</level>
</path>
<path>
<level>node4,node3</level>
<level>comma separated list of some 60 nodes</level>
</path>
.....
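Writing eight such blocks by hand for some 60 nodes each is error-prone, so the paths could be generated instead. This is only a sketch under assumptions: the function names and the per-group file of hostnames (one per line) are hypothetical, and the output simply mirrors the layout shown above.

```shell
#!/bin/sh
# join_nodes FILE
# Turns a file with one hostname per line into a comma-separated list.
join_nodes() {
    paste -sd, "$1"
}

# emit_path TOP_LEVEL SECOND_LEVEL
# Prints one <path> block with two <level> entries, in the same layout as
# the example above. Both arguments are comma-separated hostname lists.
emit_path() {
    printf '<path>\n<level>%s</level>\n<level>%s</level>\n</path>\n' "$1" "$2"
}
```

For example, with a hypothetical file group1.txt holding the second-level nodes of the first group: emit_path "node1,node2" "$(join_nodes group1.txt)". The output can be reviewed first and then placed in the mom_hierarchy file.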
Votes: 0

Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/40995829
