Error E40007 when trying to start YoloV3 training

Latest reply: Oct 25, 2021 09:02:50

Hi all, I keep getting error E40007 when running YoloV3 training on Atlas 800 (3010). 

I'm using this version of Yolo: https://www.hiascend.com/en/software/modelzoo/detail/2/7b5f73072a24453389602051affe9b31


The error occurs when trying to load darknet weights.


Here is the error message: 

INFO:tensorflow:Restoring parameters from ../../.././data/darknet_weights/darknet53.ckpt
2021-09-30 13:32:55.849385: W tf_adapter/util/infershape_util.cc:275] AddNode failed, errormsg is Dimension 0 in both shapes must be equal, but are 60 and 255. Shapes are [60] and [255]. for 'save/Assign_290' (op: 'Assign') with input shapes: [60], [255]..
2021-09-30 13:32:55.849444: W tf_adapter/util/infershape_util.cc:275] AddNode failed, errormsg is Dimension 3 in both shapes must be equal, but are 60 and 255. Shapes are [1,1,512,60] and [1,1,512,255]. for 'save/Assign_291' (op: 'Assign') with input shapes: [1,1,512,60], [1,1,512,255]..
2021-09-30 13:32:55.849462: W tf_adapter/util/infershape_util.cc:275] AddNode failed, errormsg is Dimension 0 in both shapes must be equal, but are 60 and 255. Shapes are [60] and [255]. for 'save/Assign_332' (op: 'Assign') with input shapes: [60], [255]..
2021-09-30 13:32:55.849477: W tf_adapter/util/infershape_util.cc:275] AddNode failed, errormsg is Dimension 3 in both shapes must be equal, but are 60 and 255. Shapes are [1,1,256,60] and [1,1,256,255]. for 'save/Assign_333' (op: 'Assign') with input shapes: [1,1,256,60], [1,1,256,255]..
2021-09-30 13:32:55.849491: W tf_adapter/util/infershape_util.cc:275] AddNode failed, errormsg is Dimension 0 in both shapes must be equal, but are 60 and 255. Shapes are [60] and [255]. for 'save/Assign_349' (op: 'Assign') with input shapes: [60], [255]..
2021-09-30 13:32:55.849505: W tf_adapter/util/infershape_util.cc:275] AddNode failed, errormsg is Dimension 3 in both shapes must be equal, but are 60 and 255. Shapes are [1,1,1024,60] and [1,1,1024,255]. for 'save/Assign_350' (op: 'Assign') with input shapes: [1,1,1024,60], [1,1,1024,255]..
2021-09-30 13:32:55.849522: W tf_adapter/util/infershape_util.cc:313] The InferenceContext of node _SOURCE is null.
2021-09-30 13:32:55.849529: W tf_adapter/util/infershape_util.cc:313] The InferenceContext of node _SINK is null.
2021-09-30 13:32:55.853388: W tf_adapter/util/infershape_util.cc:313] The InferenceContext of node save/Assign_290 is null.
2021-09-30 13:32:55.853409: W tf_adapter/util/infershape_util.cc:313] The InferenceContext of node save/Assign_291 is null.
2021-09-30 13:32:55.853557: W tf_adapter/util/infershape_util.cc:313] The InferenceContext of node save/Assign_332 is null.
2021-09-30 13:32:55.853567: W tf_adapter/util/infershape_util.cc:313] The InferenceContext of node save/Assign_333 is null.
2021-09-30 13:32:55.853652: W tf_adapter/util/infershape_util.cc:313] The InferenceContext of node save/Assign_349 is null.
2021-09-30 13:32:55.853660: W tf_adapter/util/infershape_util.cc:313] The InferenceContext of node save/Assign_350 is null.
2021-09-30 13:32:55.854274: W tf_adapter/util/infershape_util.cc:313] The InferenceContext of node save/restore_all is null.
2021-09-30 13:32:55.863400: W tf_adapter/kernels/geop_npu.cc:1176] [GEOP] There is no infershape of node : save/Assign_290
2021-09-30 13:32:55.863439: W tf_adapter/kernels/geop_npu.cc:1176] [GEOP] There is no infershape of node : save/Assign_291
2021-09-30 13:32:55.863871: W tf_adapter/kernels/geop_npu.cc:1176] [GEOP] There is no infershape of node : save/Assign_332
2021-09-30 13:32:55.863899: W tf_adapter/kernels/geop_npu.cc:1176] [GEOP] There is no infershape of node : save/Assign_333
2021-09-30 13:32:55.864175: W tf_adapter/kernels/geop_npu.cc:1176] [GEOP] There is no infershape of node : save/Assign_349
2021-09-30 13:32:55.864200: W tf_adapter/kernels/geop_npu.cc:1176] [GEOP] There is no infershape of node : save/Assign_350
2021-09-30 13:32:57.587722: W tensorflow/core/framework/op_kernel.cc:1639] Unavailable: failed
2021-09-30 13:33:02.338308: F tf_adapter/kernels/geop_npu.cc:766] GeOp3_0GEOP::::DoRunAsync Failed
Error Message is : 
E40007: Prebuild op[save/Assign_333] failed, oppath[/usr/local/Ascend/nnae/5.0.1/opp/op_impl/built-in/ai_core/tbe/impl/assign.py], optype[Assign]. Please check the op's compilation error message.
        Prebuild op[save/Assign_332] failed, oppath[/usr/local/Ascend/nnae/5.0.1/opp/op_impl/built-in/ai_core/tbe/impl/assign.py], optype[Assign]. Please check the op's compilation error message.
        Prebuild op[save/Assign_291] failed, oppath[/usr/local/Ascend/nnae/5.0.1/opp/op_impl/built-in/ai_core/tbe/impl/assign.py], optype[Assign]. Please check the op's compilation error message.
run_yolov3.sh: line 30: 55170 Aborted                 (core dumped) taskset -c $PID_START-$PID_END python3 $3/train.py --mode $4
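
A note on reading the warnings above: every failing Assign pairs a last dimension of 60 with one of 255. In YOLOv3 each detection head has anchors_per_scale * (num_classes + 5) output channels, so 255 corresponds to 3 * (80 + 5), the 80-class COCO head. The sketch below parses one warning line and inverts that formula. It assumes the first shape is the graph variable and the second is the checkpoint tensor (the input order of Assign), and that 3 anchors per scale are used; the helper names are illustrative only.

```python
import re

# Matches the shape pair and node name in a tf_adapter "AddNode failed" warning.
MISMATCH_RE = re.compile(r"Shapes are \[([\d,]+)\] and \[([\d,]+)\]\. for '([^']+)'")


def parse_mismatch(line):
    """Return (node, variable_shape, checkpoint_shape), or None if no mismatch is reported."""
    m = MISMATCH_RE.search(line)
    if m is None:
        return None

    def to_shape(text):
        return tuple(int(d) for d in text.split(","))

    return m.group(3), to_shape(m.group(1)), to_shape(m.group(2))


def implied_classes(channels, anchors_per_scale=3):
    """Invert channels = anchors_per_scale * (num_classes + 5); None if it does not fit."""
    per_anchor, remainder = divmod(channels, anchors_per_scale)
    if remainder or per_anchor < 5:
        return None
    return per_anchor - 5


line = (
    "AddNode failed, errormsg is Dimension 3 in both shapes must be equal, "
    "but are 60 and 255. Shapes are [1,1,512,60] and [1,1,512,255]. "
    "for 'save/Assign_291' (op: 'Assign') with input shapes: [1,1,512,60], [1,1,512,255].."
)
node, variable_shape, checkpoint_shape = parse_mismatch(line)
print(node, implied_classes(variable_shape[-1]), implied_classes(checkpoint_shape[-1]))
```

If those assumptions hold, the graph was built for 15 classes while the checkpoint carries an 80-class head, which would explain why only the six head variables (weights and biases of the three output convolutions) fail to restore.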


Also, here are the error messages from the /root/ascend/log/plog/ folder:

[ERROR] GE(50377,python3):2021-09-30-10:26:00.727.952 [error_manager.cc:131]50646 ReportErrMessage: ErrorNo: -1(failed) [COMP][PRE_OPT][Report][Error]error_code  is not registered
[ERROR] TEFUSION(50377,python3):2021-09-30-10:26:00.727.997 [fusion_op.cc:5071]PrintOp Failed to pre-compile node [name:save/Assign_333, type: Assign]. Detailed pre-compilation info is: 
[ERROR] TEFUSION(50377,python3):2021-09-30-10:26:00.728.007 [fusion_op.cc:5328]GetFinishedCompilationTask FinishedTask[0]: taskID[139807837619968:2406], status[1], kernel[None], File[/usr/local/Ascend/nnae/5.0.1/opp/op_impl/built-in/ai_core/tbe/impl/assign.py], result: prebuild failed. module[impl.assign] func[assign], compile_info_key: 
[ERROR] TEFUSION(50377,python3):2021-09-30-10:26:00.728.011  ===========FOLLOWING IS ARGUMENTS OF NODE save/Assign_333, TYPE Assign=============
[ERROR] TEFUSION(50377,python3):2021-09-30-10:26:00.728.028  input0: {'shape': (1, 1, 256, 60), 'ori_shape': (1, 1, 256, 60), 'format': 'NHWC', 'sub_format': 0, 'ori_format': 'NHWC', 'dtype': 'float32', 'addr_type': 0, 'valid_shape': (), 'slice_offset': (), 'L1_workspace_size': -1, 'L1_fusion_type': -1, 'L1_addr_offset': 0, 'total_shape': (), 'split_index': 0}
[ERROR] TEFUSION(50377,python3):2021-09-30-10:26:00.728.032  input1: {'shape': (1, 1, 256, 255), 'ori_shape': (1, 1, 256, 255), 'format': 'NHWC', 'sub_format': 0, 'ori_format': 'NHWC', 'dtype': 'float32', 'addr_type': 0, 'valid_shape': (), 'slice_offset': (), 'L1_workspace_size': -1, 'L1_fusion_type': -1, 'L1_addr_offset': 0, 'total_shape': (), 'split_index': 0}
[ERROR] TEFUSION(50377,python3):2021-09-30-10:26:00.728.040  outputs0: {'shape': (1, 1, 256, 255), 'ori_shape': (1, 1, 256, 255), 'format': 'NHWC', 'sub_format': 0, 'ori_format': 'NHWC', 'dtype': 'float32', 'addr_type': 0, 'valid_shape': (), 'slice_offset': (), 'L1_workspace_size': -1, 'L1_fusion_type': -1, 'L1_addr_offset': 0, 'total_shape': (), 'split_index': 0}
[ERROR] TEFUSION(50377,python3):2021-09-30-10:26:00.728.044  Attributes are: {}
[ERROR] TEFUSION(50377,python3):2021-09-30-10:26:00.728.047  ===========FOLLOWING IS STACK INFO OF NODE save/Assign_333=============
[ERROR] TEFUSION(50377,python3):2021-09-30-10:26:00.728.053  Stack: 
[ERROR] TEFUSION(50377,python3):2021-09-30-10:26:00.728.060  Stack: 
[ERROR] TEFUSION(50377,python3):2021-09-30-10:26:00.728.064  ===========END OF DETAILED OP INFO OF NODE save/Assign_333=============
[ERROR] FE(50377,python3):2021-09-30-10:26:00.728.161 [tbe_op_store_adapter.cc:142]50646 ProcessFailPreCompTask:"tid[139807837619968], taskId[2406], node[save/Assign_333], precompile failed"
[ERROR] FE(50377,python3):2021-09-30-10:26:00.728.165 [tbe_op_store_adapter.cc:295]50646 ParallelPreCompileOp:"Pre-build Tbe op failed, graph_id[139807837619968]"
[ERROR] FE(50377,python3):2021-09-30-10:26:00.728.188 [op_compiler.cc:399]50646 PreCompileOp:"PreCompileOp failed, graph [partition0_rank187_new_sub_graph646]"
[ERROR] GE(50377,python3):2021-09-30-10:26:00.728.209 [graph_optimize.cc:118]50646 OptimizeSubGraph: ErrorNo: -1(failed) [COMP][PRE_OPT][OptimizeSubGraph][OptimizeFusedGraph]: graph optimize failed, ret:-1
[ERROR] GE(50377,python3):2021-09-30-10:26:00.728.214 [graph_manager.cc:2699]50646 ProcessSubGraphWithMultiThreads: ErrorNo: -1(failed) [COMP][PRE_OPT]SubGraph optimize Failed AIcoreEngine
[ERROR] GE(50377,python3):2021-09-30-10:26:00.737.531 [error_manager.cc:131]50637 ReportErrMessage: ErrorNo: -1(failed) [COMP][PRE_OPT][Report][Error]error_code  is not registered
[ERROR] TEFUSION(50377,python3):2021-09-30-10:26:00.737.576 [fusion_op.cc:5071]PrintOp Failed to pre-compile node [name:save/Assign_332, type: Assign]. Detailed pre-compilation info is: 
[ERROR] TEFUSION(50377,python3):2021-09-30-10:26:00.737.585 [fusion_op.cc:5328]GetFinishedCompilationTask FinishedTask[0]: taskID[139807988107008:2428], status[1], kernel[None], File[/usr/local/Ascend/nnae/5.0.1/opp/op_impl/built-in/ai_core/tbe/impl/assign.py], result: prebuild failed. module[impl.assign] func[assign], compile_info_key: 
[ERROR] TEFUSION(50377,python3):2021-09-30-10:26:00.737.589  ===========FOLLOWING IS ARGUMENTS OF NODE save/Assign_332, TYPE Assign=============
[ERROR] TEFUSION(50377,python3):2021-09-30-10:26:00.737.604  input0: {'shape': (60,), 'ori_shape': (60,), 'format': 'ND', 'sub_format': 0, 'ori_format': 'ND', 'dtype': 'float32', 'addr_type': 0, 'valid_shape': (), 'slice_offset': (), 'L1_workspace_size': -1, 'L1_fusion_type': -1, 'L1_addr_offset': 0, 'total_shape': (), 'split_index': 0}
[ERROR] TEFUSION(50377,python3):2021-09-30-10:26:00.737.607  input1: {'shape': (255,), 'ori_shape': (255,), 'format': 'ND', 'sub_format': 0, 'ori_format': 'ND', 'dtype': 'float32', 'addr_type': 0, 'valid_shape': (), 'slice_offset': (), 'L1_workspace_size': -1, 'L1_fusion_type': -1, 'L1_addr_offset': 0, 'total_shape': (), 'split_index': 0}
[ERROR] TEFUSION(50377,python3):2021-09-30-10:26:00.737.615  outputs0: {'shape': (255,), 'ori_shape': (255,), 'format': 'ND', 'sub_format': 0, 'ori_format': 'ND', 'dtype': 'float32', 'addr_type': 0, 'valid_shape': (), 'slice_offset': (), 'L1_workspace_size': -1, 'L1_fusion_type': -1, 'L1_addr_offset': 0, 'total_shape': (), 'split_index': 0}
[ERROR] TEFUSION(50377,python3):2021-09-30-10:26:00.737.619  Attributes are: {}
[ERROR] TEFUSION(50377,python3):2021-09-30-10:26:00.737.622  ===========FOLLOWING IS STACK INFO OF NODE save/Assign_332=============
[ERROR] TEFUSION(50377,python3):2021-09-30-10:26:00.737.629  Stack: 
[ERROR] TEFUSION(50377,python3):2021-09-30-10:26:00.737.636  Stack: 
[ERROR] TEFUSION(50377,python3):2021-09-30-10:26:00.737.640  ===========END OF DETAILED OP INFO OF NODE save/Assign_332=============
[ERROR] FE(50377,python3):2021-09-30-10:26:00.737.712 [tbe_op_store_adapter.cc:142]50637 ProcessFailPreCompTask:"tid[139807988107008], taskId[2428], node[save/Assign_332], precompile failed"
[ERROR] FE(50377,python3):2021-09-30-10:26:00.737.717 [tbe_op_store_adapter.cc:295]50637 ParallelPreCompileOp:"Pre-build Tbe op failed, graph_id[139807988107008]"
[ERROR] FE(50377,python3):2021-09-30-10:26:00.737.743 [op_compiler.cc:399]50637 PreCompileOp:"PreCompileOp failed, graph [partition0_rank201_new_sub_graph667]"
[ERROR] GE(50377,python3):2021-09-30-10:26:00.737.765 [graph_optimize.cc:118]50637 OptimizeSubGraph: ErrorNo: -1(failed) [COMP][PRE_OPT][OptimizeSubGraph][OptimizeFusedGraph]: graph optimize failed, ret:-1
[ERROR] GE(50377,python3):2021-09-30-10:26:00.737.770 [graph_manager.cc:2699]50637 ProcessSubGraphWithMultiThreads: ErrorNo: -1(failed) [COMP][PRE_OPT]SubGraph optimize Failed AIcoreEngine
[ERROR] GE(50377,python3):2021-09-30-10:26:00.778.326 [graph_manager.cc:707]50434 OptimizeSubGraphWithMultiThreads: ErrorNo: -1(failed) [COMP][PRE_OPT]subgraph 186 optimize failed
[ERROR] GE(50377,python3):2021-09-30-10:26:00.878.258 [graph_manager.cc:785]50434 SetSubgraph: ErrorNo: -1(failed) [COMP][PRE_OPT]Multiply optimize subgraph failed
[ERROR] GE(50377,python3):2021-09-30-10:26:00.878.330 [graph_manager.cc:3230]50434 OptimizeSubgraph: ErrorNo: -1(failed) [COMP][PRE_OPT]Graph set subgraph Failed
[ERROR] GE(50377,python3):2021-09-30-10:26:00.878.350 [graph_manager.cc:837]50434 PreRunOptimizeSubGraph: ErrorNo: -1(failed) [COMP][PRE_OPT]Failed to process GraphManager_OptimizeSubgraph
[ERROR] GE(50377,python3):2021-09-30-10:26:00.878.358 [graph_manager.cc:946]50434 PreRun: ErrorNo: -1(failed) [COMP][PRE_OPT]Run PreRunOptimizeSubGraph failed for graph:ge_default_20210930102559.
[ERROR] GE(50377,python3):2021-09-30-10:26:00.881.802 [graph_manager.cc:3094]50434 ReturnError: ErrorNo: -1(failed) [COMP][PRE_OPT]PreRun Failed..
[ERROR] FMK(50377,python3):2021-09-30-10:26:03.881.991 [tf_adapter/common/adp_logger.cc:34][TF_ADAPTER] [tf_adapter/kernels/geop_npu.cc:763] GeOp3_0GEOP::::DoRunAsync Failed


Could you please help me solve the issue?

Hello, the Atlas 800 (Model: 3010) is an inference server and cannot be used for training. See the product introduction: https://www.hiascend.com/en/hardware/ai-server. The Atlas 800 (Model: 9000 and Model: 9010) are the training servers.


Posted by wangyiyi at 2021-10-21 06:36 Holle,Atlas 800(Model: 3010) is a inference server  and cannot be used for training.The product int ...

Hello, I didn't express myself clearly. The server has an x86 CPU and two training cards (Atlas 300T). So even though the iBMC shows the model name as Atlas 800 3010, it is in fact a training server with two training cards.

Here's the output of npu-smi info:

npu-smi info



Hello, so that you can get a more professional response, we have raised an issue in the Ascend open-source community.

You can follow the progress of this issue at the link below, and we will reply to you there as soon as possible.

https://gitee.com/ascend/modelzoo/issues/I4F0ZQ

Posted by user_3128835 at 2021-10-22 07:21 Hello, in order for you to get a more professional response, we have raised an issue on the ascend o ...
Hello again, we figured out that the problem occurs when restoring the pretrained weights. Once restoring is disabled, training works correctly. I downloaded the darknet weights from the link given in the README and converted them with the convert_weight script, but the converted checkpoint does not seem to match the model.

Since pretrained weights could be useful for my problem, is there another way to convert them?
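
One common way to keep the pretrained backbone when only the head shapes differ is to restore just the variables whose names and shapes match the checkpoint, and let the rest train from their initial values. The following is a minimal sketch of that filter, not the ModelZoo script's own method; the variable names in the example are hypothetical, and the TensorFlow lines in the trailing comment are an untested outline for TF 1.x-style code.

```python
def select_restorable(model_shapes, ckpt_shapes):
    """Split model variable names into (restorable, skipped).

    A variable is restorable only if the checkpoint holds a tensor
    with the same name and exactly the same shape.
    """
    restorable, skipped = [], []
    for name, shape in model_shapes.items():
        if name in ckpt_shapes and list(ckpt_shapes[name]) == list(shape):
            restorable.append(name)
        else:
            skipped.append(name)
    return sorted(restorable), sorted(skipped)


# Hypothetical names; the shapes mirror the mismatch reported in the log.
model_shapes = {
    "backbone/conv_0/weights": [3, 3, 3, 32],
    "head/conv_out/weights": [1, 1, 512, 60],
    "head/conv_out/biases": [60],
}
ckpt_shapes = {
    "backbone/conv_0/weights": [3, 3, 3, 32],
    "head/conv_out/weights": [1, 1, 512, 255],
    "head/conv_out/biases": [255],
}
restorable, skipped = select_restorable(model_shapes, ckpt_shapes)
print("restore:", restorable)
print("skip:", skipped)

# Outline of how the filter could feed a Saver (assumes `sess` and `ckpt_path` exist):
#   import tensorflow as tf
#   ckpt_shapes = dict(tf.train.list_variables(ckpt_path))
#   variables = {v.op.name: v for v in tf.compat.v1.global_variables()}
#   model_shapes = {n: v.shape.as_list() for n, v in variables.items()}
#   names, _ = select_restorable(model_shapes, ckpt_shapes)
#   saver = tf.compat.v1.train.Saver(var_list=[variables[n] for n in names])
#   saver.restore(sess, ckpt_path)
```

With this approach the skipped head variables must still be initialized normally before training starts. The alternative is to regenerate the checkpoint so that its class count matches the training configuration.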

This issue is now resolved
