Index Layer

Latest reply: Mar 27, 2021 09:47:14

Problem Information

Table 1 Basic problem information

Storage type: Distributed storage
Product version: FusionStorage OBS 7.0; FusionStorage 8.0.0; FusionStorage 8.0.1
Problem type: Basic service
Keyword: Flow control


Problem Symptom

Services fail due to flow control, and error codes 503 and 9996 are reported.

Problem Diagnosis

  1. Based on the failed client, find the service node that handles the service, and then run the following command to search the index client log:

    cat /var/log/index_layer/client/27010/mongo/router.log | grep "Chunk FC"

    For historical logs, find the compressed package for the corresponding time and run the following commands to search the logs:

    cd /var/log/index_layer/client/27010/mongo/

    zgrep -a "Chunk FC" router.log.2019-01-24T08-30-17_0.tar.gz

    If log information similar to the following is displayed, flow control has occurred:

    (Image: router.log output containing the "Chunk FC" flow control message.)

  2. Run the following commands to search the shard logs on all nodes:

    cd /var/log/index_layer/shard/27021/mongo/

    zgrep -a  "Stopping writes\|Stopping write\|Stalling write\|Stalling writes" shard.log.2018-08-28T*

    If the following log is found, flow control has occurred on the shard:

    (Image: shard.log output containing a "Stopping writes"/"Stalling writes" message.)

  3. Run the following command to check whether a compaction error or flush error has occurred on the chunk under flow control:

    zgrep -a "Compaction error" shard.log.2018-08-28T*

  4. Run the following command to check whether the length of the manifest file exceeds 128 MB.

    zgrep -a "switch manifest " shard.log.2018-08-28T*

    The following log is displayed:

    shard.log.2018-08-30T18-56-03_0.tar.gz:][2018-08-30 10:10.370+0800][INFO][STORAGE ][thread117][35743,17][[RocksDB][0000000001e00559][ERROR] switch manifest:new file num: 3362692, old file num 3268646 size(134322704, 134217728), isSplit(0)

    As shown in the example log above, the values of new file num and old file num differ, indicating that the manifest file has been rewritten once.

    If the manifest file is rewritten every few dozen seconds on the same chunk, the chunk is too large. Run the following commands in the mongoshell client to check whether the split switch is enabled:

    use admin

    db.adminCommand({balancerStatus:1})

    If the split switch is enabled, you need to locate the reason why the chunk is not split.

  5. Run the following command to check which level's files are over-stacked and causing the flow control:

    zgrep -a  "Stopping writes\|Stopping write\|Stalling write\|Stalling writes" shard.log.2018-08-28T*

    For example, the following logs are found:

    [2018-08-30 10:10.645+0800][INFO][STORAGE ][thread202][32163,7][[RocksDB][0000000000600178][WARN] [default] Stalling writes because we have 8 level-0 files rate 316908]

    [2018-08-30 10:10.313+0800][INFO][STORAGE ][conn62515][45567,0][[RocksDB][0000000000600178][WARN] [index] Stalling writes because we have 8 level-0 files rate 190144]

    The log shows that over-stacked level-0 files cause flow control.

    1. If "writes because we have %d immutable memtables" is displayed, the WAL files are over-stacked and the flush speed is slow. Run the following command to check the ftds dotting data:

      grep "rocksdb_flush_main" *_show.ios

      Example output:

      2018-08-30 10:10.581 rocksdb_flush_main 03010196 3 3 0 571573 722666 495917 1714719

      2018-08-30 10:10.254 rocksdb_flush_main 03010196 2 2 0 495139 502839 487439 990278

      2018-08-30 10:10.290 rocksdb_flush_main 03010196 1 1 0 507945 507945 507945 507945

    2. If "because we have * level-0 files" is displayed, the level-0 files are over-stacked and the compaction of level-0 files is slow. Run the following command to check the ftds dotting data:

      grep "rocksdb_compact_run_cfd1_l0_l1\|rocksdb_compact_run_cfd0_l0_l1"  *_show.ios

      2018-08-30 10:10.119 rocksdb_compact_run_cfd0_l0_l1 03010188 1 1 0 1038857 1038857 1038857 1038857

      2018-08-30 10:10.119 rocksdb_compact_run_cfd1_l0_l1 0301018a 1 1 0 694628 694628 694628 694628

    3. If "writes because of estimated pending compaction" is displayed, the level-1 files are over-stacked, especially in the index column family. Each key-value (KV) pair in the index column family is an index entry and therefore small, so compaction of such files consumes a lot of CPU cycling through the KV data. When CPU usage is high, this type of compaction is time-consuming. For this cause of flow control, check the CPU usage to locate the source of the high load.
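The stall-message checks in steps 2 and 5 can be scripted. The following is a minimal sketch (the `classify_stall` helper is hypothetical, not part of FusionStorage tooling) that maps a RocksDB stall/stop warning line to the likely bottleneck described above:

```python
import re

# Hypothetical helper: map a RocksDB "Stalling/Stopping writes" log line
# to the bottleneck categories described in the diagnosis steps above.
def classify_stall(line: str) -> str:
    if "immutable memtables" in line:
        # Memtable/WAL backlog: flush is too slow.
        return "flush too slow (WAL/memtable backlog)"
    if re.search(r"level-0 files", line):
        # Too many L0 files: L0->L1 compaction is too slow.
        return "level-0 compaction too slow"
    if "estimated pending compaction" in line:
        # Pending compaction bytes backlog: check CPU usage.
        return "level-1+ compaction backlog (check CPU usage)"
    return "unknown"

# Example log line quoted from the shard log above.
line = ("[2018-08-30 10:10.645+0800][INFO][STORAGE ][thread202][32163,7]"
        "[[RocksDB][0000000000600178][WARN] [default] Stalling writes "
        "because we have 8 level-0 files rate 316908]")
print(classify_stall(line))  # level-0 compaction too slow
```

Such a helper can be run over the output of the `zgrep` commands above to tally which stall reason dominates across nodes.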

Causes

There are four reasons for flow control:

  1. A compaction error or flush error occurs on RocksDB.

  2. The read and write rates at the bottom layer are slow.

  3. CPU usage is high.

  4. The net length of the manifest file exceeds 128 MB. The manifest file must be rewritten whenever a file is compacted or flushed. As a result, compaction becomes too slow and flow control occurs.
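Cause 4 can be confirmed from the "switch manifest" log entry shown in step 4. The following is a sketch (the `manifest_rewritten` parser is an assumption for illustration, not a FusionStorage tool) that extracts the fields and checks them against the 128 MB (134217728-byte) limit:

```python
import re

# Matches the "switch manifest" entry quoted in the diagnosis section.
MANIFEST_RE = re.compile(
    r"switch manifest:new file num: (\d+), old file num (\d+) "
    r"size\((\d+), (\d+)\)"
)

# Hypothetical parser: report whether the manifest was rewritten
# (file numbers differ) and whether its length exceeded the limit.
def manifest_rewritten(line: str):
    m = MANIFEST_RE.search(line)
    if not m:
        return None
    new_num, old_num, size, limit = map(int, m.groups())
    return {"rewritten": new_num != old_num, "over_limit": size > limit}

# Example line quoted from the shard log above.
line = ("switch manifest:new file num: 3362692, old file num 3268646 "
        "size(134322704, 134217728), isSplit(0)")
print(manifest_rewritten(line))  # {'rewritten': True, 'over_limit': True}
```

If many such rewrites cluster on one chunk within seconds of each other, that matches the "chunk too large / not split" condition described in step 4.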

Solution

  1. If the problem is caused by a chunk that fails to split, log in to the node where the chunk is located and run the following commands:

    #mongoshell shard

    >db.adminCommand({dbrepair:"db.collection",chunkId:"chunkid", operation:"ManualCompaction", options:""});

  2. If the problem is caused by slow read and write rates at the P layer, rectify the P-layer fault or reduce the pressure on the P layer.

Check After Recovery

Run the following command to check whether flow control logs are still being generated:

tailf /var/log/index_layer/client/27010/mongo/router.log | grep "Chunk FC"
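The same check can be scripted for batch runs across nodes. This is a minimal sketch replicating the grep above in Python; the sample lines are hypothetical placeholders, not real router.log content:

```python
# Hypothetical helper: keep only the lines that still report
# chunk flow control, mirroring `grep "Chunk FC"`.
def flow_control_lines(lines):
    return [line for line in lines if "Chunk FC" in line]

sample = [
    "normal request handled",     # hypothetical line
    "WARN Chunk FC triggered",    # hypothetical line
]
print(flow_control_lines(sample))  # ['WARN Chunk FC triggered']
```

An empty result over a recent log window indicates that flow control has cleared.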

Suggestion and Summary

N/A

Applicable Versions

All
