This post describes how to calculate the delay between the image returned from the far end and the local sound.
The following compares the multi-picture image with the local sound.
The local image is sent to the MCU for processing and then returned. Along the way it passes through the terminal codec, network equipment transmission, the MCU codec, and other key processing stages, each of which takes time. The local sound is not sent to the MCU for processing, so the image returned from the far end is necessarily delayed relative to the local sound.

As shown in the figure, in a multi-picture scene the speaker at local site A speaks at time T1, and the local image is captured at the same time T1. The local image is transmitted to the MCU and then returned. Let the terminal codec delay be T2, the network device processing delay be T3 (incurred once in each direction), and the MCU codec delay be T4. Then:
the time at which the local site hears the local sound is: T1;
the time at which the local site sees the local picture in the multi-picture returned by the MCU is: T1+T2+T3*2+T4.
Therefore, the delay between the local sound and the site's own image returned from the far end is at least T2+T3*2+T4. That is, it consists of at least the terminal codec delay, the network device processing delay, and the MCU codec delay. On top of this come the television display delay, the matrix processing delay, the signal converter delay, and other delays.
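As a rough illustration, the lower bound T2+T3*2+T4 can be sketched in a few lines of Python. The function name, parameter names, and millisecond values below are hypothetical and chosen only for the example; they are not measurements from any real terminal, network, or MCU.

```python
def min_image_delay_ms(terminal_codec_ms, network_ms, mcu_codec_ms, extra_ms=()):
    """Lower bound on how far the returned local image lags the local sound.

    terminal_codec_ms : T2, terminal codec delay
    network_ms        : T3, network device processing delay (counted twice:
                        once towards the MCU, once on the way back)
    mcu_codec_ms      : T4, MCU codec delay
    extra_ms          : optional additional delays (display, matrix,
                        signal converters, ...), if they are known
    """
    return terminal_codec_ms + 2 * network_ms + mcu_codec_ms + sum(extra_ms)


# Hypothetical values in milliseconds, for illustration only.
t2, t3, t4 = 80, 20, 100

# Codec and network delays only: T2 + T3*2 + T4
print(min_image_delay_ms(t2, t3, t4))                     # 220

# Adding assumed display and converter delays on top
print(min_image_delay_ms(t2, t3, t4, extra_ms=[30, 10]))  # 260
```

With these assumed numbers, the returned image lags the local sound by at least 220 ms before any display or conversion delay is added.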
In summary, a delay between the local sound and the image returned from the far end is a normal phenomenon. The exact size of the delay depends on the signal conversion stages on the input and output paths.