Issues with ssh-agent connecting to a large number of hosts at once

Bob Belnap bbelnap at gmail.com
Wed Apr 22 09:21:38 MDT 2009


Thanks for your help, Frank.

On Tue, Apr 21, 2009 at 7:57 PM, Frank Sorenson <frank at tuxrocks.com> wrote:

>
> The manpage for read(2) shows:
>       EAGAIN Non-blocking I/O has been selected using O_NONBLOCK and no
>       data was immediately available for reading.
>
> Can you show us the output of:  readlink /proc/`pidof ssh-agent`/fd/160
>  (change 160 to whatever fd is giving the EAGAIN)
> Or even just:  ls /proc/`pidof ssh-agent`/fd


How about both :)

root@chub:~# ls /proc/29019/fd/
0    106  114  122  130  139  147  155  19  27  35  43  51  6   68  76  84  92
1    107  115  123  131  14   148  156  2   28  36  44  52  60  69  77  85  93
10   108  116  124  132  140  149  157  20  29  37  45  53  61  7   78  86  94
100  109  117  125  133  141  15   158  21  3   38  46  54  62  70  79  87  95
101  11   118  126  134  142  150  159  22  30  39  47  55  63  71  8   88  96
102  110  119  127  135  143  151  16   23  31  4   48  56  64  72  80  89  97
103  111  12   128  136  144  152  160  24  32  40  49  57  65  73  81  9   98
104  112  120  129  137  145  153  17   25  33  41  5   58  66  74  82  90  99
105  113  121  13   138  146  154  18   26  34  42  50  59  67  75  83  91

That listing was taken while strace was showing:

read(160, 0xbfe1452a, 1024)             = -1 EAGAIN (Resource temporarily unavailable)
read(160, 0xbfe1452a, 1024)             = -1 EAGAIN (Resource temporarily unavailable)

readlink for 160 shows:

root@chub:~# readlink /proc/29019/fd/160
socket:[6380248]

I believe this should map to:

bob@chub:~$ netstat -anp | grep 6380248
unix  3      [ ]         STREAM     CONNECTED     6380248  -                   /tmp/keyring-gNQ6hA/ssh
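
The readlink-then-netstat lookup above can be scripted. This is only a rough sketch, assuming a Linux /proc and a POSIX-style shell; the function names socket_inode and unix_socket_path are made up for illustration, and the optional second argument exists only so the lookup can be tried against a saved copy of /proc/net/unix:

```shell
# Hypothetical helpers, not part of OpenSSH or net-tools.

# Print the inode from a /proc/PID/fd link target such as "socket:[6380248]".
# Returns non-zero if the target is not a socket.
socket_inode() {
    local target
    target=$1
    case $target in
        'socket:['*']')
            target=${target#'socket:['}
            printf '%s\n' "${target%']'}"
            ;;
        *)
            return 1
            ;;
    esac
}

# Print the path bound to a unix socket inode, or "-" if it has no name.
# In /proc/net/unix, field 7 is the inode and field 8 the optional path
# (a path containing spaces would be cut at the first space).
unix_socket_path() {
    local inode table
    inode=$1
    table=${2:-/proc/net/unix}
    awk -v ino="$inode" '$7 == ino { print ($8 == "" ? "-" : $8) }' "$table"
}
```

For the fd in question that would be something like: unix_socket_path "$(socket_inode "$(readlink /proc/29019/fd/160)")", run as root or as the user who owns the agent.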


> With so many ssh connections, I'd be curious to see what your entropy
> pool looks like.  Do you have any remaining in
> /proc/sys/kernel/random/entropy_avail or has the pool been exhausted?


I have plenty of entropy available; it only goes down slightly during the
whole process.
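
For anyone who wants to watch the pool themselves while the connections are going out, a small sketch along these lines should do. The function name sample_entropy and its arguments are hypothetical, and the file argument is there only so it can be pointed at something other than the real /proc entry:

```shell
# Hypothetical helper: print COUNT readings of the entropy pool, one per
# line, sleeping INTERVAL seconds between readings.
sample_entropy() {
    local count interval file i
    count=${1:-10}
    interval=${2:-1}
    file=${3:-/proc/sys/kernel/random/entropy_avail}
    i=0
    while [ "$i" -lt "$count" ]; do
        cat "$file" || return 1
        i=$((i + 1))
        # No need to sleep after the last reading.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
    return 0
}
```

Running sample_entropy 60 1 in a second terminal gives one reading per second for a minute.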

Another clue to the puzzle: I have 1300 or so machines in a DC in Hong
Kong, reachable only through a jump server in the same DC.  If I run my
agent on my local machine and connect to all the machines through the jump
server, connections time out, the agent locks up, etc.  However, if I copy
my keys to the jump box and run the agent from there, no connections fail
and all connections complete very quickly.  I assume this is because
connections open and close quickly enough that whatever limit I'm hitting
isn't reached (netstat snapshots taken every second show around 200
concurrent connections at most).
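
A related measurement, separate from the netstat snapshots, is how many sockets the agent itself is holding open at a given moment. The sketch below counts them from the fd directory; the function name count_socket_fds is made up, and it assumes a Linux /proc where socket fds show up as links of the form socket:[inode]:

```shell
# Hypothetical helper: count entries in an fd directory that point at sockets.
count_socket_fds() {
    local dir n fd
    dir=$1
    n=0
    for fd in "$dir"/*; do
        # Each entry under /proc/PID/fd is a symlink; sockets read as "socket:[inode]".
        case $(readlink "$fd" 2>/dev/null) in
            'socket:['*) n=$((n + 1)) ;;
        esac
    done
    printf '%s\n' "$n"
}
```

Something like: while sleep 1; do count_socket_fds /proc/`pidof ssh-agent`/fd; done would print one count per second, which could be compared between the local-agent and jump-box cases.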


--Bob


