A lot of TIME_WAIT TCP connections when video walls are on

Answered

  • Sergey Yuldashev

    Hello Mykola,

    A high number of TCP sockets in the TIME_WAIT state is not a bad thing in general, at least when you are dealing with a lot of short-lived TCP connections, which is the case with the Nx Witness server.

    Every time you do anything in the Client GUI (video wall included), it sends a lot of HTTP requests to the Media Server. Whenever you open a camera item, change a resource state, or issue any command, the client initiates an individual connection. The more cameras you "touch" and the more actions you perform, the higher the values you get.

    The values from your screenshot do not even look troubling. On one "high-load" server I dealt with, I saw numbers like 40K-50K. In that case I had to reduce the default TIME_WAIT timeout in the TCP configuration; otherwise the Server ran out of free sockets. Eventually, of course, we had to offload the server and add more nodes.
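
    As a rough way to reason about such numbers: the average count of TIME_WAIT sockets is approximately the rate at which connections are closed multiplied by how long each socket stays in that state (Little's law). The sketch below is only a back-of-the-envelope estimate; the 60-second timeout and the rates are assumptions for illustration, not values measured on any Nx Witness server.

```python
def steady_state_time_wait(closes_per_sec: float, time_wait_sec: float) -> float:
    """Average number of sockets sitting in TIME_WAIT (rate * time in state)."""
    return closes_per_sec * time_wait_sec


def implied_close_rate(time_wait_sockets: float, time_wait_sec: float) -> float:
    """Connection close rate that would sustain the observed TIME_WAIT count."""
    return time_wait_sockets / time_wait_sec


# Assumption: TIME_WAIT lasts 60 seconds; the real value depends on the OS and its settings.
ASSUMED_TIME_WAIT_SEC = 60.0

print(steady_state_time_wait(10.0, ASSUMED_TIME_WAIT_SEC))
print(implied_close_rate(45_000, ASSUMED_TIME_WAIT_SEC))
```

    Under that assumed timeout, a few hundred TIME_WAIT sockets correspond to roughly ten closed connections per second, while tens of thousands imply hundreds per second.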

     

    Bottom line - I would not treat that as something suspicious in the current state or any time soon, just something worth checking regularly. 

  • Mykola Klitovchenko

    Sergey Yuldashev

    That sounds pretty clear to me, but I have issues that I can reproduce.
    So let me clarify my steps to reproduce.

    So, when I start using video walls in our current project (I sent the video wall configuration earlier), we get a lot of issues with PTZ camera control. AFAIS, they correlate with the number of TCP TIME_WAIT connections. I understand that this may not be relevant, but it may be helpful.

    When the video walls are on, we cannot get a good PTZ control experience.

    TCP TIME_WAIT means the server closed the connection. But why?

    I've just analyzed the client logs and found pretty clear evidence that the TCP connections used for PTZ manipulation can be reused:

    2021-08-09 20:42:38.183 309269000 VERBOSE nx::network::http::AsyncClient(0x7fe147eec000): Sending request POST /api/ptz?cameraId=%7B3d0b7346-d418-5eae-fd8d-162141a4e704%7D&command=ContinuousMovePtzCommand&rotationSpeed=0&sequenceId=%7B85f7e5dd-9b29-4484-95b1-3fa5563fb286%7D&sequenceNumber=156&type=operational&xSpeed=0&ySpeed=0&zSpeed=-1 HTTP/1.1 via reused connection

    In other words, when the operator presses the "Left arrow" button and nothing happens (just the "Moving..." text box in the UI), he keeps pressing the button, and the problem snowballs.

    Smooth PTZ control is one of the main goals of our project, and we need to implement it ASAP.

  • Mykola Klitovchenko

    BTW, using AsyncClient for PTZ controls seems quite strange to me, but maybe I don't understand what AsyncClient means at all.

    As I see it, all PTZ actions should be strictly synchronous: until we get an OK response from the camera or hit a timeout, another action shouldn't be invoked. I am pretty sure that the Nx Witness product has extensive race-condition test coverage, but all async actions produce races.

  • Mykola Klitovchenko

    I've just tried running

    netstat -anpl | grep TIME_WAIT 

    and got

    tcp        0      0 192.168.135.13:22790    192.168.135.11:7001     TIME_WAIT   -                   
    tcp 0 0 192.168.135.13:61562 192.168.135.12:7001 TIME_WAIT -
    tcp 0 0 192.168.135.13:21782 192.168.135.11:7001 TIME_WAIT -
    tcp 0 0 192.168.135.13:64028 192.168.135.12:7001 TIME_WAIT -
    tcp 0 0 192.168.135.13:61694 192.168.135.12:7001 TIME_WAIT -
    tcp 0 0 192.168.135.13:23494 192.168.135.11:7001 TIME_WAIT -
    tcp 0 0 192.168.135.13:20208 192.168.135.11:7001 TIME_WAIT -
    tcp 0 0 192.168.135.13:19820 192.168.135.11:7001 TIME_WAIT -
    tcp 0 0 192.168.135.13:64092 192.168.135.12:7001 TIME_WAIT -
    tcp 0 0 192.168.135.13:19430 192.168.135.11:7001 TIME_WAIT -
    tcp 0 0 192.168.135.13:21592 192.168.135.11:7001 TIME_WAIT -
    tcp 0 0 192.168.135.13:23278 192.168.135.11:7001 TIME_WAIT -
    tcp 0 0 192.168.135.13:63368 192.168.135.12:7001 TIME_WAIT -
    tcp 0 0 192.168.135.13:20632 192.168.135.11:7001 TIME_WAIT -
    tcp 0 0 192.168.135.13:61496 192.168.135.12:7001 TIME_WAIT -
    tcp 0 0 192.168.135.13:62378 192.168.135.12:7001 TIME_WAIT -
    tcp 0 0 192.168.135.13:61282 192.168.135.12:7001 TIME_WAIT -
    tcp 0 0 192.168.135.13:20694 192.168.135.11:7001 TIME_WAIT -
    tcp 0 0 192.168.135.13:61924 192.168.135.12:7001 TIME_WAIT -
    tcp 0 0 192.168.135.13:22616 192.168.135.11:7001 TIME_WAIT -
    tcp 0 0 192.168.135.13:64902 192.168.135.12:7001 TIME_WAIT -
    tcp 0 0 192.168.135.13:22066 192.168.135.11:7001 TIME_WAIT -
    tcp 0 0 192.168.135.13:22938 192.168.135.11:7001 TIME_WAIT -
    tcp 0 0 192.168.135.13:23034 192.168.135.11:7001 TIME_WAIT -
    tcp 0 0 192.168.135.13:19852 192.168.135.11:7001 TIME_WAIT -
    tcp 0 0 192.168.135.13:60688 192.168.135.12:7001 TIME_WAIT -
    tcp 0 0 192.168.135.13:63054 192.168.135.12:7001 TIME_WAIT -
    tcp 0 0 192.168.135.13:63596 192.168.135.12:7001 TIME_WAIT -
    tcp 0 0 192.168.135.13:23230 192.168.135.11:7001 TIME_WAIT -
    tcp 0 0 192.168.135.13:23132 192.168.135.11:7001 TIME_WAIT -
    tcp 0 0 192.168.135.13:60936 192.168.135.12:7001 TIME_WAIT -
    tcp 0 0 192.168.135.13:20202 192.168.135.11:7001 TIME_WAIT -
    tcp 0 0 192.168.135.13:22348 192.168.135.11:7001 TIME_WAIT -
    tcp 0 0 192.168.135.13:22522 192.168.135.11:7001 TIME_WAIT -
    tcp 0 0 192.168.135.13:22578 192.168.135.11:7001 TIME_WAIT -

    where *.13, *.12 and *.11 are Nx servers.

    What are those connections?
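
    One way to summarise output like the listing above is to count the TIME_WAIT lines per remote endpoint. The sketch below parses netstat-style text; it assumes the Linux column layout shown here (protocol, two queue counters, local address, remote address, state) and uses only the first four lines of the listing as sample input. The helper name is hypothetical.

```python
from collections import Counter

# First four lines of the netstat output quoted in this thread.
SAMPLE = """\
tcp 0 0 192.168.135.13:22790 192.168.135.11:7001 TIME_WAIT -
tcp 0 0 192.168.135.13:61562 192.168.135.12:7001 TIME_WAIT -
tcp 0 0 192.168.135.13:21782 192.168.135.11:7001 TIME_WAIT -
tcp 0 0 192.168.135.13:64028 192.168.135.12:7001 TIME_WAIT -
"""


def time_wait_by_remote(netstat_output: str) -> Counter:
    """Count TIME_WAIT sockets per remote 'ip:port' in netstat-style text."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed columns: proto, recv-q, send-q, local, remote, state, pid/program
        if len(fields) >= 6 and fields[0].startswith("tcp") and fields[5] == "TIME_WAIT":
            counts[fields[4]] += 1
    return counts


for remote, count in time_wait_by_remote(SAMPLE).most_common():
    print(remote, count)
```

    In this sample every remote endpoint is port 7001 on another machine, while the local ports differ on every line.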

  • Sergey Yuldashev

    Hi Mykola Klitovchenko,

    Let me address some of your points in reverse order, as I read them now.

    > where *.13, *.12 and *.11 are Nx servers.
    > What are those connections?

    Those are traces of the "client" connections from server .13 to the other two, which are part of the regular server workflow.
    Servers constantly communicate with each other for many different reasons.
    Just in case you ask "why would server .13 use some weird ports to access the standard port 7001 of servers .12 and .11": this is how TCP works. The side that initiates a connection gets a temporary (ephemeral) source port from the operating system.
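
    To illustrate why the local ports look random, here is a minimal loopback sketch: the listener has a known port, and the connecting side is given a source port by the operating system. The listener here is a stand-in for a server port such as 7001; binding to port 0 is only a convenience so the example runs on any machine.

```python
import socket

# Listener standing in for a server's well-known port (7001 in this thread).
# Port 0 asks the OS for any free port so the sketch does not collide with real services.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
server_port = listener.getsockname()[1]

# The connecting side never chooses its source port; the OS assigns one.
client = socket.create_connection(("127.0.0.1", server_port))
accepted, peer = listener.accept()
client_port = client.getsockname()[1]

print("server port:", server_port)
print("client source port:", client_port)

for sock in (client, accepted, listener):
    sock.close()
```

    Each new outgoing connection gets its own source port, which is why every TIME_WAIT line in the netstat output shows a different local port against the same remote port.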

     

    > As I see it, all PTZ actions should be strictly synchronous: until we get an OK response from the camera or hit a timeout, another action shouldn't be invoked. I am pretty sure that the Nx Witness product has extensive race-condition test coverage, but all async actions produce races.

    Let me disagree here. In my experience, I have come across devices that could take seconds to execute regular commands. Waiting for that synchronously does not make any sense.
    But if you make the async calls with proper timeout management, the server itself won't suffer from that at all.
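
    A minimal sketch of that idea, using Python's asyncio rather than the actual Nx Witness code: each command is awaited with its own timeout, so a slow device neither blocks other commands nor hangs forever. The function names, delays, and timeout values are hypothetical.

```python
import asyncio


async def send_ptz_command(name: str, device_delay: float) -> str:
    # Hypothetical stand-in for a camera that needs `device_delay` seconds to answer.
    await asyncio.sleep(device_delay)
    return f"{name}: ok"


async def send_with_timeout(name: str, device_delay: float, timeout: float) -> str:
    try:
        return await asyncio.wait_for(send_ptz_command(name, device_delay), timeout)
    except asyncio.TimeoutError:
        # The slow call is cancelled and the caller gets a definite answer.
        return f"{name}: timed out"


async def main() -> list:
    # Both commands run concurrently; the slow one does not delay the fast one.
    return await asyncio.gather(
        send_with_timeout("fast-camera", 0.01, timeout=0.2),
        send_with_timeout("slow-camera", 1.0, timeout=0.2),
    )


results = asyncio.run(main())
print(results)
```

    The fast command completes normally while the slow one is reported as timed out after 0.2 seconds, without either call blocking the other.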

    > So, when I start using video walls in our current project (I sent the video wall configuration earlier), we get a lot of issues with PTZ camera control. AFAIS, they correlate with the number of TCP TIME_WAIT connections. I understand that this may not be relevant, but it may be helpful.
    > When the video walls are on, we cannot get a good PTZ control experience.

    Correlation here does not mean that it's what caused the issues. Again, numbers like 500-600 TCP sockets in the TIME_WAIT state would not cause any issues at all. By default, even Windows Servers are capable of allocating up to 16K of such sockets.
    In this case I'd say you are seeing some issue that increases the frequency of connections being opened, and the number of TIME_WAIT sockets rises as a consequence or a co-symptom.

    AFAIK, you have opened at least one more topic related to the PTZ issue itself, so let's leave it aside in this particular thread and give my colleagues a chance to troubleshoot it without unnecessary intervention.

    > TCP TIME_WAIT means the server closed the connection. But why?

    It is by design that every active TCP connection is eventually closed and, on the side that closes first, moves to the TIME_WAIT state.
    When it comes to service HTTP or WS connections that carry short database updates, data proxying, or short video transmissions, the whole process looks like this (I'll consider the HTTP case):

    1. The Server A discovery module initiates an API call to Server B from the same system: http://<server_b_ip>:port/api/moduleInformation. This is a typical example, just to show that servers DO that all the time.
    2. The HTTP data transmission takes ~10 ms; it's just a small JSON payload. During that time the "client" socket on Server A is in the ESTABLISHED state.
    3. After the transmission is finished, the servers begin to close the TCP connection. Server A (being the "client" in this particular connection) sends a FIN to Server B, and Server B responds with an ACK.
      Then Server B sends its own FIN and Server A responds with an ACK. This usually takes 1-3 ms.
    4. At this point the network socket on Server A moves to the TIME_WAIT state as a fail-safe. How long it stays in this state depends on the operating system you use. In Windows, for example, it can be up to 4 minutes.
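
    The steps above can be sketched as a small state table for the side that closes first (Server A in this example). This follows the standard TCP close sequence and uses the state names that netstat prints on Linux; it is an illustration only and not code from the product.

```python
# Transitions for the side that sends the first FIN (the "active closer").
ACTIVE_CLOSE = {
    ("ESTABLISHED", "send FIN"): "FIN_WAIT1",
    ("FIN_WAIT1", "recv ACK"): "FIN_WAIT2",
    ("FIN_WAIT2", "recv FIN, send ACK"): "TIME_WAIT",
    ("TIME_WAIT", "timer expires"): "CLOSED",
}


def walk(start: str, events: list) -> list:
    """Return the list of states visited while applying the events in order."""
    state = start
    path = [state]
    for event in events:
        state = ACTIVE_CLOSE[(state, event)]
        path.append(state)
    return path


path = walk(
    "ESTABLISHED",
    ["send FIN", "recv ACK", "recv FIN, send ACK", "timer expires"],
)
print(" -> ".join(path))
```

    Only the last transition is slow: the first three take milliseconds, while the socket waits in TIME_WAIT until the operating system's timer expires.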

    I.e., if you're using Windows, a short-lived (5-10 ms) connection produces a TIME_WAIT line in the netstat output that stays there for several minutes after the connection is actually finished. And that is expected.

     

    Nothing we have discussed here so far can help you resolve the issues with the PTZ functions.
    I would recommend looking in a different direction and using other tools at this point.
    E.g., examine the network traffic between the particular server and the endpoint device with Wireshark to see whether all the API requests are fulfilled by the camera and are not unexpectedly closed.
    But I'm pretty sure that my colleagues working on this case are already doing or suggesting a similar approach.

