Precaution Notice on FusionInsight LibrA & Elk Random Port Number Exhaustion
Precaution ID: ENE-P-A-201824
Problem description
Random port resources of the OS on a cluster data node are used up. As a result, new connections fail to be established between nodes and the error message 'ERROR: pooler: failed to create connections in parallel mode for thread XXXX' is displayed.
Trigger conditions
• The cluster communication model establishes connections between every pair of CNs and DNs. As the number of concurrent services or DNs grows, the number of connections between nodes grows with it and consumes more random port resources.
• If only one database is deployed on a physical server, random port usage must satisfy the following condition:
Number of random ports occupied by the database on a single physical server
= Number of random ports occupied by the CN process + Number of random ports occupied by DN processes
= Number of concurrent services × Total number of DNs in the cluster + Number of DNs on the local server × Total number of DNs in the cluster × 2
< Total number of random ports of the OS × 80%
Note 1:
The DN counts include only primary DNs. If standby DNs on the local server are promoted to primary, the number of primary DNs on that server increases accordingly.
Note 2:
The total number of random ports of the OS is the size of the port range returned by the /sbin/sysctl -n net.ipv4.ip_local_port_range command.
The default range is 32768 to 61000 (both ends included), that is, 28,233 ports.
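The condition above can be checked with a few lines of shell arithmetic. The following is a minimal sketch and not part of the product: the function names (required_ports, port_budget, ports_sufficient) and the sample figures in the note below are assumptions made for illustration.

```shell
# required_ports CONCURRENCY TOTAL_DNS LOCAL_DNS
# Random ports needed on one server: CN part + DN part of the formula.
required_ports() {
    echo $(( $1 * $2 + $3 * $2 * 2 ))
}

# port_budget LOW HIGH
# 80% of the inclusive random port range LOW..HIGH, rounded down.
port_budget() {
    echo $(( ($2 - $1 + 1) * 80 / 100 ))
}

# ports_sufficient CONCURRENCY TOTAL_DNS LOCAL_DNS LOW HIGH
# Exit status 0 if the requirement stays below the 80% budget.
ports_sufficient() {
    [ "$(required_ports "$1" "$2" "$3")" -lt "$(port_budget "$4" "$5")" ]
}
```

For example, with a hypothetical 500 concurrent services, 48 primary DNs in the cluster, and 4 primary DNs on the local server, required_ports 500 48 4 prints 24384. That exceeds the 22586-port budget of a 32768 to 61000 range but fits within the 45875-port budget of an 8192 to 65535 range. The LOW and HIGH arguments can be taken from the output of /sbin/sysctl -n net.ipv4.ip_local_port_range.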
Probability
This problem occurs when random port resources are used up.
Impact and Risk
After the problem is triggered, a query that cannot obtain a random port reports an error. A query that has already obtained its random ports is not affected.
Preparation
Identification Methods
• Substitute the actual numbers of concurrent services and DNs into the formula above to determine whether the condition is violated.
• If an error is reported for a service query, use grep to search the CN log for the keyword "port is sufficient". If information similar to the following is displayed, the random port exhaustion problem has been triggered:
ERROR: pooler: failed to create connections in parallel mode for thread 140401007466240, Error Message: Connection bind 100.144.192.219 is not successfull and errno[98]:Address already in use, please check if the port is sufficient
• Check whether the number of random ports in use is close to the total number of random ports of the OS:
cat /proc/net/tcp|awk '{print $2}'|awk -F : '{print $2}'|sort|uniq -c|grep "
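As a separate illustration of the last check, the sketch below counts the distinct local ports in /proc/net/tcp that fall inside a given random port range. It is not a reconstruction of the command above; the function name count_random_ports_in_use is hypothetical, and the hexadecimal conversion is written out by hand so that the script does not depend on a particular awk implementation.

```shell
# count_random_ports_in_use LOW HIGH
# Reads /proc/net/tcp-style text on stdin and prints how many distinct
# local ports lie within the inclusive range LOW..HIGH.
count_random_ports_in_use() {
    awk -v low="$1" -v high="$2" '
        # Convert a hexadecimal string to a decimal number.
        function hex2dec(h,    i, n) {
            n = 0
            h = toupper(h)
            for (i = 1; i <= length(h); i++)
                n = n * 16 + index("0123456789ABCDEF", substr(h, i, 1)) - 1
            return n
        }
        # Field 2 is local_address in the form HEXIP:HEXPORT; the header line has no colon.
        $2 ~ /^[0-9A-Fa-f]+:[0-9A-Fa-f]+$/ {
            split($2, addr, ":")
            port = hex2dec(addr[2])
            if (port >= low + 0 && port <= high + 0)
                seen[port] = 1
        }
        END {
            n = 0
            for (p in seen) n++
            print n
        }
    '
}
```

A typical call would be count_random_ports_in_use 32768 61000 < /proc/net/tcp, with the two bounds replaced by the range reported by /sbin/sysctl -n net.ipv4.ip_local_port_range. A result close to the size of that range suggests that random ports are nearly exhausted.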
Workaround
• Widen the range of net.ipv4.ip_local_port_range to provide more random ports, and make sure that listening ports are not allocated as random ports. If a listening port falls within the random port range, add it to the local port reservation list by setting net.ipv4.ip_local_reserved_ports.
If the cluster uses the default port configuration and a single node has eight or fewer primary DNs deployed, use the following port range settings:
net.ipv4.ip_local_port_range = 8192 65535
net.ipv4.ip_local_reserved_ports = 20000,20002,20003,20006-20028,21201,28443,21730,21731,21732,21700,21701,21702,21750,21780,25300-25308,25330-25362,25490-25522,25650-25682,25990,25991
After the settings are written to /etc/sysctl.conf, run sysctl -p to make them take effect.
If the cluster does not use the default port configuration or the node has more than eight primary DNs deployed, contact Huawei technical support to customize the port ranges.
• Enable the random port multiplexing function (that is, enable enable_stateless_pooler_reuse, which is enabled by default in R6C10) so that existing connections are reused when multiple users access the same database. This relieves the pressure on random port resources.
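For the first workaround, it can help to verify that every listening port inside the widened random port range is covered by the reservation list. The sketch below shows one way to do that; the function names port_is_reserved and needs_reservation are hypothetical, and the list argument is assumed to use the comma-separated port and port-range syntax shown in the settings above.

```shell
# port_is_reserved PORT RESERVED_LIST
# Exit status 0 if PORT matches a single port or an inclusive range in
# RESERVED_LIST (for example "20000,20006-20028").
port_is_reserved() {
    awk -v port="$1" -v list="$2" 'BEGIN {
        n = split(list, items, ",")
        for (i = 1; i <= n; i++) {
            m = split(items[i], r, "-")
            lo = r[1] + 0
            hi = (m == 2 ? r[2] : r[1]) + 0
            if (port + 0 >= lo && port + 0 <= hi) exit 0
        }
        exit 1
    }'
}

# needs_reservation PORT LOW HIGH RESERVED_LIST
# Exit status 0 if PORT lies inside the random port range LOW..HIGH
# but is missing from RESERVED_LIST.
needs_reservation() {
    [ "$1" -ge "$2" ] && [ "$1" -le "$3" ] && ! port_is_reserved "$1" "$4"
}
```

With the range 8192 to 65535 and a list that contains 20006-20028, needs_reservation 20010 8192 65535 "$list" fails (the port is already reserved), whereas a listening port inside the range that is absent from the list succeeds and should be added. The current values can be read with /sbin/sysctl -n net.ipv4.ip_local_port_range and /sbin/sysctl -n net.ipv4.ip_local_reserved_ports.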
Solution
Upgrade FusionInsight LibrA & Elk to V100R002C80SPC300, in which the port multiplexing function has been incorporated.