
[FI Components] Working principles of Elasticsearch

Latest reply: Apr 12, 2022 06:41:13

Hello all,

Does anyone know the working principles of Elasticsearch? If not, please read on.

Internal architecture

Elasticsearch provides various access interfaces through RESTful APIs or language-specific clients (such as Java), uses a cluster discovery mechanism, and supports script languages and various plug-ins. The underlying layer is based on Lucene, which Elasticsearch keeps completely independent, and indexes can be stored in local files, shared files, or HDFS, as shown in Figure 2.

Figure 2 Internal architecture

Inverted indexing

In the traditional search mode (forward indexing), as shown in Figure 3, documents are organized by their IDs. During a search, each document is scanned for the keywords to find all documents containing them. Forward indexing is easy to maintain but time consuming to search.

Figure 3 Forward indexing

For Elasticsearch (Lucene), as shown in Figure 4, there is a dictionary composed of keywords and their statistics, such as document IDs, positions, and document frequencies. In this search mode, Elasticsearch looks up a keyword in the dictionary to locate the document IDs and positions, and then retrieves the documents, much like looking up a word in a dictionary or using a book's index to find the content on a specific page. Inverted indexing is efficient for searching, but constructing the index is time consuming and maintaining it is costly.

Figure 4 Inverted indexing
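The dictionary-style lookup described above can be sketched in a few lines of Python. This is a minimal toy model, not Lucene's actual data structures: a real inverted index also stores positions and frequencies, while this sketch keeps only document IDs.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, *terms):
    """Return IDs of documents that contain all the given terms."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "Elasticsearch is built on Lucene",
    2: "Lucene stores inverted indexes",
    3: "Forward indexing scans every document",
}
index = build_inverted_index(docs)
print(search(index, "lucene"))  # documents 1 and 2
```

Note how the search never scans document bodies: the cost of scanning is paid once at index-construction time, which is exactly the trade-off the text describes.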

Distributed indexing flow

Figure 5 shows the process of Elasticsearch distributed indexing flow.

Figure 5 Distributed indexing flow

The process is as follows:

Phase 1: The client sends an index request to any node, for example, Node 1.

Phase 2: Node 1 determines the shard that should store the document based on the request. Assume the shard is shard 0. Node 1 then forwards the request to Node 3, where primary shard P0 of shard 0 resides.

Phase 3: Node 3 executes the request on primary shard P0 of shard 0. If the request succeeds, Node 3 forwards it concurrently to the replica shards R0 on Node 1 and Node 2. When a replica shard successfully executes the request, it returns an acknowledgment to Node 3. After receiving acknowledgments from all replica shards, Node 3 returns a success message to the client.
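The three phases above can be modeled with a small in-memory sketch. The `Shard` class and node names are hypothetical stand-ins; the point is only the control flow: write to the primary first, then replicate, and report success only when every replica acknowledges.

```python
class Shard:
    """Toy stand-in for an Elasticsearch shard copy."""
    def __init__(self, name):
        self.name = name
        self.docs = {}

    def index(self, doc_id, doc):
        self.docs[doc_id] = doc
        return True  # acknowledgment

def index_document(primary, replicas, doc_id, doc):
    """Phases 2-3: execute on the primary shard, then forward to all
    replica shards; succeed only when every replica acknowledges."""
    if not primary.index(doc_id, doc):
        return False
    acks = [r.index(doc_id, doc) for r in replicas]
    return all(acks)

p0 = Shard("P0 on Node 3")
replicas = [Shard("R0 on Node 1"), Shard("R0 on Node 2")]
ok = index_document(p0, replicas, "1", {"title": "hello"})
```

Because the client's success message waits for all replica acknowledgments, a successful response means the document is durable on every copy of the shard.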

Distributed searching flow

The Elasticsearch distributed searching flow consists of two phases: query and acquisition (fetch).

Figure 6 shows the query phase.

Figure 6 Query phase of the distributed searching flow

The process is as follows:

Phase 1: The client sends a retrieval request to any node, for example, Node 3.

Phase 2: Node 3 forwards the retrieval request to every shard of the index using a polling policy: for each shard, one copy is randomly selected from the primary shard and its replica shards, which balances the read load. Each selected shard copy performs the retrieval locally and sorts the results on its own node.

Phase 3: Each shard returns its local result to Node 3. Node 3 merges these results and performs global sorting.
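The merge in Phase 3 can be sketched with `heapq.merge`, which is efficient precisely because each shard's list is already sorted. The scores and document IDs below are made up for illustration.

```python
import heapq

def query_phase(shard_results, size=3):
    """Merge locally sorted (score, doc_id) hits from each shard into
    one globally ranked list, as the coordinating node does."""
    merged = heapq.merge(*shard_results, key=lambda hit: -hit[0])
    return list(merged)[:size]

# Each shard returns its own hits, already sorted by descending score.
shard0 = [(0.9, "d3"), (0.4, "d1")]
shard1 = [(0.8, "d7"), (0.7, "d2")]
top = query_phase([shard0, shard1], size=3)
print(top)  # [(0.9, 'd3'), (0.8, 'd7'), (0.7, 'd2')]
```

Only the lightweight (score, ID) pairs travel over the network in this phase; the full documents are fetched later, which is the point of splitting search into two phases.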

In the query phase, the data to be retrieved is located. In the acquisition phase, this data is collected and returned to the client. Figure 7 shows the acquisition phase.

Figure 7 Acquisition phase of the distributed searching flow

The process is as follows:

Phase 1: After all data to be retrieved is located, Node 3 sends a request to related shards.

Phase 2: Each shard that receives the request from Node 3 reads the related documents and returns them to Node 3.

Phase 3: After obtaining all the files returned by the shards, Node 3 combines them into a summary result and returns it to the client.
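Continuing the sketch, the acquisition phase takes the globally ranked (shard, doc_id) pairs produced by the query phase and pulls the full documents from the shards that hold them. The dictionary-of-dictionaries below is a hypothetical stand-in for per-shard storage.

```python
def fetch_phase(ranked_hits, shard_stores):
    """Given globally ranked (shard, doc_id) pairs from the query phase,
    fetch the full documents from the shards that hold them and combine
    them into the summary result returned to the client."""
    return [shard_stores[shard][doc_id] for shard, doc_id in ranked_hits]

stores = {
    "shard0": {"d3": {"title": "doc three"}},
    "shard1": {"d7": {"title": "doc seven"}},
}
docs = fetch_phase([("shard0", "d3"), ("shard1", "d7")], stores)
```

The result list preserves the global ranking established in the query phase, so the client sees one consistently ordered response.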

Distributed bulk indexing flow

Figure 8 Distributed bulk indexing flow

The process is as follows:

Phase 1: The client sends a bulk request to Node 1.

Phase 2: Node 1 constructs a bulk sub-request for each shard involved and forwards each sub-request to the corresponding primary shard.

Phase 3: Each primary shard executes its operations one by one, in order. After each operation completes, the primary shard forwards the new document (or the deletion) to the corresponding replica shards before moving on to the next operation. When all operations are complete, each node reports back to the coordinating node (Node 1), which sorts the responses and returns them to the client.
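Phase 2's splitting step can be sketched as follows. Python's built-in `hash` stands in for Elasticsearch's real routing hash (murmur3), so the exact shard assignments are illustrative only; what matters is that every operation in the bulk request is grouped by its target shard.

```python
def group_bulk_by_shard(operations, num_primary_shards):
    """Split one bulk request into per-shard sub-requests, using the
    same routing rule as single-document indexing."""
    per_shard = {}
    for op in operations:
        # Stand-in for shard = hash(routing) % number_of_primary_shards
        shard = hash(op["_id"]) % num_primary_shards
        per_shard.setdefault(shard, []).append(op)
    return per_shard

ops = [{"_id": f"doc{i}", "action": "index"} for i in range(6)]
batches = group_bulk_by_shard(ops, num_primary_shards=3)
```

Each batch then follows the same primary-then-replica path as a single-document request, which is why a bulk request's total cost is dominated by its slowest shard.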

Distributed bulk searching flow

Figure 9 Distributed bulk searching flow

The process is as follows:

Phase 1: The client sends an mget request to Node 1.

Phase 2: Node 1 constructs a multi-data retrieval request for each shard and forwards the requests to the primary shard or its replica shard based on the requests. When all replies are received, Node 1 constructs a response and returns it to the client.

Routing algorithm

Elasticsearch provides two routing algorithms:

§  Default routing: shard = hash(routing) % number_of_primary_shards. With this policy, the shard count is fixed once the index is created: during capacity expansion, the number of shards must be multiplied (ES 6.x), so when creating an index you need to plan for future expansion. ES 5.x does not support expanding the shard count, while ES 7.x can expand it freely.

§  Custom routing: the routing value can be specified to determine the shard to which a document is written, and a search can likewise be restricted to a specified routing value.
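The default routing formula can be demonstrated directly. Here `zlib.crc32` stands in for Elasticsearch's real murmur3 hash, so the exact shard numbers differ from a real cluster, but the behavior the text warns about is the same: changing the shard count remaps most documents, which is why the count is fixed at index creation.

```python
import zlib

def route(routing_value, number_of_primary_shards):
    """Default routing rule: shard = hash(routing) % number_of_primary_shards.
    (zlib.crc32 is a stand-in for ES's murmur3 hash.)"""
    return zlib.crc32(routing_value.encode()) % number_of_primary_shards

# The same routing value always lands on the same shard...
assert route("user-42", 5) == route("user-42", 5)

# ...which is why changing the shard count would strand documents:
moved = sum(route(f"doc{i}", 5) != route(f"doc{i}", 6) for i in range(1000))
print(f"{moved} of 1000 documents would land elsewhere with 6 shards instead of 5")
```

By default the routing value is the document ID; custom routing simply substitutes a caller-chosen value (for example a user ID) so that related documents land on the same shard.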

Balancing algorithm

Elasticsearch provides the automatic balance function for capacity expansion, capacity reduction, and data import scenarios. The algorithm is as follows:

weight_index(node, index) = indexBalance * (node.numShards(index) - avgShardsPerNode(index))

weight_node(node, index) = shardBalance * (node.numShards() - avgShardsPerNode)

weight(node, index) = weight_index(node, index) + weight_node(node, index)
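The formulas above translate directly into code. The 0.55/0.45 factors below are the defaults of the `cluster.routing.allocation.balance.index` and `cluster.routing.allocation.balance.shard` settings; the rebalancer moves shards from nodes with high weight toward nodes with low weight.

```python
def weight(node_shards_of_index, node_total_shards,
           avg_shards_per_node_index, avg_shards_per_node,
           index_balance=0.55, shard_balance=0.45):
    """Combined balancing weight of a node for one index, per the
    formulas above. A node exactly at both averages has weight 0."""
    weight_index = index_balance * (node_shards_of_index - avg_shards_per_node_index)
    weight_node = shard_balance * (node_total_shards - avg_shards_per_node)
    return weight_index + weight_node

# A node holding exactly the average load is perfectly balanced:
print(weight(2, 4, 2, 4))  # 0.0
# An overloaded node gets a positive weight, marking it as a move source:
print(weight(3, 5, 2, 4))
```

The two terms pull in different directions: `weight_index` spreads the shards of a single index across nodes, while `weight_node` equalizes the total shard count per node.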

Single-node multi-instance deployment

Multiple Elasticsearch instances can be deployed on the same node, differentiated from each other by IP address and port number. This method increases the utilization of the node's CPU, memory, and disk, and improves Elasticsearch indexing and searching throughput.
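A minimal sketch of what such a deployment might look like, assuming ES 7.x setting names (`transport.port`; earlier versions use `transport.tcp.port`) and hypothetical paths and node names:

```yaml
# elasticsearch.yml for the first instance on the host
node.name: node1-es1
http.port: 9200
transport.port: 9300
path.data: /data/es1
path.logs: /var/log/es1

# A second instance on the same host joins the same cluster but
# must use different ports and paths, for example:
#   node.name: node1-es2
#   http.port: 9201
#   transport.port: 9301
#   path.data: /data/es2
#   path.logs: /var/log/es2
```

Each instance is a full Elasticsearch node, so the cluster sees the host as multiple nodes; the replica-allocation concern discussed in the next section follows directly from this.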


Cross-node replica allocation policy

When multiple instances are deployed on a single node and an index has replicas, replicas of the same shard may be allocated to different instances on the same physical node, so a single point of failure can occur. Solve this problem by setting cluster.routing.allocation.same_shard.host to true, which prevents copies of the same shard from being allocated to the same host.



More content about Bigdata, stay tuned to our Community!

The post is synchronized to: FusionInsight Components


Created Mar 15, 2020 20:29:47

Thanks for sharing :)

Admin Created Mar 16, 2020 09:34:44


Created Apr 11, 2022 13:54:00

Thanks for sharing

Created Apr 12, 2022 06:11:51

Good share

olive.zhao Created Apr 12, 2022 06:41:13

Quite a read indeed.

