Issues with ssh-agent connecting to a large number of hosts at once

Bob Belnap bbelnap at gmail.com
Thu Apr 16 11:15:41 MDT 2009


Hi,

I'm having problems with ssh-agent when connecting to a large number
(several hundred) of hosts at once.  I'm using kanif
(http://taktuk.gforge.inria.fr/kanif/), a very nice package that
distributes ssh connections across the hosts you are connecting to (a
fan-out approach, so the connections do not all come from one host).
However, every host has to authenticate, so every host has to wind its
way back to the ssh-agent.  The problem isn't isolated to kanif, though:
I also see it with other utilities that rely on many concurrent
connections to the ssh-agent.

When I run strace on the ssh-agent, things start out OK, but then it goes
sour and starts spitting out:

read(160, 0xbf8f300a, 1024) = -1 EAGAIN (Resource temporarily unavailable)
read(160, 0xbf8f300a, 1024) = -1 EAGAIN (Resource temporarily unavailable)
read(160, 0xbf8f300a, 1024) = -1 EAGAIN (Resource temporarily unavailable)

while pegging the CPU.  Tracking the number of connections to the agent once
every second (while true; do netstat -an | grep -c <agent socket name>;
sleep 1; done) looks like:

5
5
5
35
98
154
155
200
287
287

At that point I kill the agent, but the count sticks at that value if I
don't.  It's not always 287; it varies.  I've seen it as high as 447
connections at once, but it's usually in the 200 range.
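
For anyone who wants to reproduce the count without netstat, here is a rough
Python sketch of the same sampling loop.  It assumes a Linux box where
/proc/net/unix lists bound unix sockets with the path in the last column;
the function names are made up for illustration.  Like the grep above, the
count includes the agent's listening socket as well as accepted connections.

```python
import time


def count_socket_entries(proc_net_unix_text, socket_path):
    """Count entries in /proc/net/unix-style text bound to socket_path."""
    count = 0
    for line in proc_net_unix_text.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 7)
        # The path is the optional eighth column; unbound sockets omit it.
        if len(fields) == 8 and fields[7].strip() == socket_path:
            count += 1
    return count


def sample_counts(socket_path, samples=10, interval=1.0):
    """Sample the entry count once per interval (Linux only)."""
    counts = []
    for _ in range(samples):
        with open("/proc/net/unix") as f:
            counts.append(count_socket_entries(f.read(), socket_path))
        time.sleep(interval)
    return counts
```

Calling sample_counts(os.environ["SSH_AUTH_SOCK"]) (after importing os)
should give a list comparable to the numbers above, provided the agent's
socket path is exported in SSH_AUTH_SOCK.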

I've tried different ssh-agents on different kernels and machines, and
haven't found a combination that works.  However, I have tried it on a
FreeBSD box which did not have the problem.

It seems to me that I'm hitting some kind of kernel limit (open file limit
perhaps?), but I've fiddled with various sysctl values with no good
results.  Has anyone run across this, or does anyone have further debugging
suggestions?
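
One way to test the open-file-limit guess would be to compare the agent's
descriptor count with its own limit.  This is only a sketch, assuming a
Linux kernel that provides /proc/<pid>/fd and /proc/<pid>/limits; the helper
names are hypothetical.

```python
import os


def count_open_fds(pid):
    """Count a process's open descriptors via /proc/<pid>/fd (Linux only)."""
    return len(os.listdir("/proc/%d/fd" % pid))


def parse_max_open_files(limits_text):
    """Return (soft, hard) from /proc/<pid>/limits text, or None if absent.

    The values are returned as strings because either may be "unlimited".
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            soft, hard = line.split()[3:5]
            return soft, hard
    return None
```

With the agent's pid (for example from SSH_AGENT_PID, if it is set), reading
/proc/<pid>/limits through parse_max_open_files and comparing the soft value
with count_open_fds(pid) would show whether the agent is anywhere near its
limit when the connection count stalls.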

--Bob



More information about the PLUG mailing list