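[MindSpore 6th Two-Day Training Camp] MindElec Homework Notes
The 6th MindSpore Two-Day Training Camp ran on Bilibili on November 6 and 7, 2021. If you missed the livestream at https://live.bilibili.com/22127570, the recordings are still available at the following links:
Day 1:
6th Two-Day Training Camp | MindSpore AI Electromagnetic Simulation https://www.bilibili.com/video/BV1Y34y1Z7E8?spm_id_from=333.999.0.0
6th Two-Day Training Camp | MindSpore Parallelism Enabling Large-Model Training https://www.bilibili.com/video/BV193411b7on?spm_id_from=333.999.0.0
6th Two-Day Training Camp | MindSpore Boost: Make Your Training Fly https://www.bilibili.com/video/BV1c341187ML?spm_id_from=333.999.0.0
Day 2:
6th Two-Day Training Camp | Overview of MindSpore Control Flow https://www.bilibili.com/video/BV1A34y1d7G7?spm_id_from=333.999.0.0
6th Two-Day Training Camp | MindSpore Lite 1.5 Feature Release: A New On-Device AI Experience https://www.bilibili.com/video/BV1f34y1o7mR?spm_id_from=333.999.0.0
6th Two-Day Training Camp | Visual Cluster Tuning Released: Tune Anything from LeNet to the Pangu Large Model https://www.bilibili.com/video/BV1dg411K7Nb?spm_id_from=333.999.0.0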
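Let's start with the first session of Day 1: MindElec, the electromagnetic simulation suite of MindScience.
The homework for the first session is as follows:
In fact, Zhang Xiaobai has already tried out MindSPONGE, the molecular simulation suite of MindScience:
The links are:
Forum: https://bbs.huaweicloud.cn/forum/forum.php?mod=viewthread&tid=159269
Blog: https://bbs.huaweicloud.cn/blogs/302842
However, since Assignment 2 requires an electromagnetic simulation with MindElec, Assignment 1 might as well be done with MindElec too.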
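I. Purchase an ECS GPU cloud server
We use an ECS GPU cloud server for the MindElec part of this homework. For the MindSPONGE part, see the links above.
In the Huawei Cloud console, go to ECS, switch to the CN North-Beijing4 region, and purchase a server as shown in the figure below:
Click Buy Now:
Since the server costs a bit over 7 yuan per hour, Zhang Xiaobai logged in without delay.
First, a look at the memory and the CUDA version, which is 11.0:
II. Install the Anaconda environment
MindSpore has traditionally used Python 3.7.5 (later releases also support Python 3.9), so the first step is to install conda: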
...
...
source ~/.bashrc
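The installed version turned out to be too old, so the latest Anaconda had to be downloaded instead:
After downloading, upload it to the server and run: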
bash ./Anaconda3-2021.05-Linux-x86_64.sh
The installer naturally reports that the target directory already exists, so remove it:
rm -rf /root/anaconda3
Then run the installer again:
bash ./Anaconda3-2021.05-Linux-x86_64.sh
III. Create the mindspore1.5 conda environment:
conda create -n mindspore1.5 python=3.7.5
...
conda activate mindspore1.5
conda install -c conda-forge pythonocc-core=7.5.1 cudatoolkit=11.1
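Press Y to continue:
The CUDA 11.1 package for the conda environment is fairly large (1.2 GB), so be patient while it downloads.
pythonocc is installed along with it.
IV. Install the GPU version of MindSpore 1.5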
pip install https://ms-release.obs.cn-north-4.myhuaweicloud.com/1.5.0/MindSpore/gpu/x86_64/cuda-11.1/mindspore_gpu-1.5.0-cp37-cp37m-linux_x86_64.whl --trusted-host ms-release.obs.cn-north-4.myhuaweicloud.com -i https://pypi.tuna.tsinghua.edu.cn/simple
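V. Install MindElec:
Let's install the MindElec package provided on the official site directly. Although the file name says ascend, the instructor said it also works on GPU.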
pip install ./mindscience_mindelec_ascend-0.1.0-cp37-cp37m-linux_x86_64.whl -i https://pypi.tuna.tsinghua.edu.cn/simple
Verify the installation:
It failed: the installed CUDA is version 11.0, and cuDNN does not appear to be installed.
VI. Install CUDA 11.1 and the matching cuDNN 8.0.5
sh cuda_11.1.0_455.23.05_linux.run
Select the options as shown below:
Edit ~/.bashrc as prompted in the figure:
Add /usr/local/cuda-11.1/bin to PATH
Add /usr/local/cuda-11.1/lib64 to LD_LIBRARY_PATH
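If MindSpore later fails to locate the CUDA libraries, it can help to confirm that both directories actually made it into the environment of the current shell. The following is only a minimal sketch: the helper name `missing_cuda_entries` is hypothetical, and the install prefix is assumed to be the `/usr/local/cuda-11.1` used above.

```python
import os

CUDA_HOME = "/usr/local/cuda-11.1"  # assumed install prefix from this walkthrough


def missing_cuda_entries(env, cuda_home=CUDA_HOME):
    """Return (variable, directory) pairs that are not configured in env."""
    wanted = {
        "PATH": f"{cuda_home}/bin",
        "LD_LIBRARY_PATH": f"{cuda_home}/lib64",
    }
    missing = []
    for var, directory in wanted.items():
        # Split the colon-separated list and ignore trailing slashes.
        entries = [e.rstrip("/") for e in env.get(var, "").split(":") if e]
        if directory not in entries:
            missing.append((var, directory))
    return missing


if __name__ == "__main__":
    for var, directory in missing_cuda_entries(os.environ):
        print(f"{var} is missing {directory}")
```

Run it in a fresh shell after `source ~/.bashrc`; no output means both entries are present.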
Check the CUDA version again:
nvidia-smi
It now reports 11.1.
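One caveat worth keeping in mind: `nvidia-smi` reports the CUDA version supported by the driver, while `nvcc --version` reports the toolkit that is actually installed, so comparing the two can be useful. The sketch below assumes the usual `CUDA Version: X.Y` header in `nvidia-smi` output and the `release X.Y` line in `nvcc --version` output; the function names are hypothetical.

```python
import re
import shutil
import subprocess


def driver_cuda_version(nvidia_smi_output):
    """Extract the driver-supported CUDA version from nvidia-smi output."""
    match = re.search(r"CUDA Version:\s*(\d+\.\d+)", nvidia_smi_output)
    return match.group(1) if match else None


def toolkit_cuda_version(nvcc_output):
    """Extract the toolkit version from `nvcc --version` output."""
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def run(cmd):
    """Return the command's stdout, or an empty string if the tool is absent."""
    if shutil.which(cmd[0]) is None:
        return ""
    return subprocess.run(cmd, capture_output=True, text=True).stdout


if __name__ == "__main__":
    print("driver supports CUDA:", driver_cuda_version(run(["nvidia-smi"])))
    print("toolkit (nvcc) CUDA: ", toolkit_cuda_version(run(["nvcc", "--version"])))
```

If the two values differ, the toolkit value is the one to match against the MindSpore wheel.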
Download cuDNN 8.0.5 for CUDA 11.1 (other versions also work as long as they match CUDA 11.1) and upload it to the server:
Extract it:
tar -zxvf cudnn-11.1-linux-x64-v8.0.5.39.tgz
Copy the extracted files into the corresponding CUDA directories:
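The exact commands for this step are not shown here. As a minimal sketch, assuming the archive extracts to a `cuda/` directory containing `include/` and `lib64/` (the usual layout of cuDNN 8.0.x tarballs) and that CUDA lives in `/usr/local/cuda-11.1`, the copy could look like the following. `copy_cudnn` is a hypothetical helper, and writing under `/usr/local` requires root privileges.

```python
import glob
import os
import shutil


def copy_cudnn(extracted_root, cuda_home):
    """Copy cuDNN headers and libraries into an existing CUDA install.

    extracted_root: directory created by extracting the cuDNN archive
                    (assumed to contain include/ and lib64/).
    cuda_home:      CUDA install prefix, e.g. /usr/local/cuda-11.1.
    Returns the list of destination paths that were written.
    """
    copied = []
    for subdir, pattern in (("include", "cudnn*.h"), ("lib64", "libcudnn*")):
        dst_dir = os.path.join(cuda_home, subdir)
        os.makedirs(dst_dir, exist_ok=True)
        for src in sorted(glob.glob(os.path.join(extracted_root, subdir, pattern))):
            dst = os.path.join(dst_dir, os.path.basename(src))
            if os.path.lexists(dst):
                os.remove(dst)  # replace files left over from an earlier copy
            # follow_symlinks=False keeps libcudnn.so -> libcudnn.so.8 links as links
            shutil.copy2(src, dst, follow_symlinks=False)
            copied.append(dst)
    return copied
```

From the directory where the archive was extracted, a call such as `copy_cudnn("cuda", "/usr/local/cuda-11.1")` would perform the copy.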
VII. Verify the MindSpore 1.5 and MindElec installation:
python -c "import mindspore;mindspore.run_check()"
Alternatively, create a test script with vi test.py and run it:
python test.py
Verify the MindElec installation:
python -c 'import mindelec'
Everything seems to be in place.
So can the MindElec examples actually be run successfully?
VIII. Download the MindElec code repository:
git clone https://gitee.com/mindspore/mindscience.git
IX. Install the dependencies
1. Install easydict
2. Install opencv
pip install opencv-python -i https://pypi.tuna.tsinghua.edu.cn/simple
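Before moving on to the examples, it can save time to confirm that every package installed so far is importable from the active conda environment. A minimal sketch follows; the module list is an assumption based on the steps above (`cv2` is the import name of opencv-python), and `check_installed` is a hypothetical helper.

```python
import importlib.util

# Top-level modules the walkthrough has installed so far (assumed list).
REQUIRED = ["mindspore", "mindelec", "easydict", "cv2"]


def check_installed(modules):
    """Map each top-level module name to True if it can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, ok in check_installed(REQUIRED).items():
        print(f"{name:10s} {'OK' if ok else 'MISSING'}")
```

This only checks that the modules can be found. It does not import them, so it will not catch CUDA or cuDNN loading errors.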
X. Verification
1. Data-driven parameterized electromagnetic simulation:
https://gitee.com/mindspore/mindscience/tree/master/MindElec/examples/data_driven/parameterization
All of the experiments below require changing Ascend to GPU in the relevant code before running; this will not be repeated for each one.
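Switching the device usually comes down to the `device_target` argument passed to `context.set_context`. Below is a minimal sketch of that edit as a text substitution, assuming the example scripts spell it as `device_target="Ascend"`. The helper names and the sample line are illustrative, and the actual scripts may differ.

```python
import re
from pathlib import Path

# Matches device_target="Ascend" or device_target='Ascend'.
_DEVICE_RE = re.compile(r"""(device_target\s*=\s*)(["'])Ascend\2""")


def use_gpu(source):
    """Return source text with device_target switched from Ascend to GPU."""
    return _DEVICE_RE.sub(r"\1\2GPU\2", source)


def patch_file(path):
    """Rewrite one script in place; return True if anything changed."""
    script = Path(path)
    original = script.read_text(encoding="utf-8")
    patched = use_gpu(original)
    if patched != original:
        script.write_text(patched, encoding="utf-8")
    return patched != original


line = 'context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")'
print(use_gpu(line))
```

Editing the file by hand works just as well. The substitution is mainly convenient when several example scripts need the same change.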
...
It finally finished:
The results are as follows:
epoch: 9966 step: 55, loss is 1.067301e-06
epoch time: 156.272 ms, per step time: 2.841 ms
epoch: 9967 step: 55, loss is 1.6718128e-06
epoch time: 161.586 ms, per step time: 2.938 ms
epoch: 9968 step: 55, loss is 1.9428162e-06
epoch time: 165.269 ms, per step time: 3.005 ms
epoch: 9969 step: 55, loss is 1.1494253e-06
epoch time: 160.396 ms, per step time: 2.916 ms
epoch: 9970 step: 55, loss is 1.2750754e-06
epoch time: 154.781 ms, per step time: 2.814 ms
epoch: 9971 step: 55, loss is 1.2550026e-06
epoch time: 160.627 ms, per step time: 2.920 ms
epoch: 9972 step: 55, loss is 1.4948789e-06
epoch time: 159.846 ms, per step time: 2.906 ms
epoch: 9973 step: 55, loss is 1.8957531e-06
epoch time: 164.061 ms, per step time: 2.983 ms
epoch: 9974 step: 55, loss is 1.8941449e-06
epoch time: 164.542 ms, per step time: 2.992 ms
epoch: 9975 step: 55, loss is 2.340197e-06
epoch time: 166.823 ms, per step time: 3.033 ms
epoch: 9976 step: 55, loss is 1.5545256e-06
epoch time: 152.811 ms, per step time: 2.778 ms
epoch: 9977 step: 55, loss is 9.994957e-07
epoch time: 171.435 ms, per step time: 3.117 ms
epoch: 9978 step: 55, loss is 2.12672e-06
epoch time: 154.989 ms, per step time: 2.818 ms
epoch: 9979 step: 55, loss is 1.5981371e-06
epoch time: 159.917 ms, per step time: 2.908 ms
epoch: 9980 step: 55, loss is 1.6546201e-06
epoch time: 151.021 ms, per step time: 2.746 ms
epoch: 9981 step: 55, loss is 1.5869264e-06
epoch time: 162.313 ms, per step time: 2.951 ms
epoch: 9982 step: 55, loss is 1.1969032e-06
epoch time: 168.984 ms, per step time: 3.072 ms
epoch: 9983 step: 55, loss is 1.1927513e-06
epoch time: 163.749 ms, per step time: 2.977 ms
epoch: 9984 step: 55, loss is 1.0608298e-06
epoch time: 160.595 ms, per step time: 2.920 ms
epoch: 9985 step: 55, loss is 1.964669e-06
epoch time: 155.398 ms, per step time: 2.825 ms
epoch: 9986 step: 55, loss is 1.5706166e-06
epoch time: 165.935 ms, per step time: 3.017 ms
epoch: 9987 step: 55, loss is 1.3382705e-06
epoch time: 163.523 ms, per step time: 2.973 ms
epoch: 9988 step: 55, loss is 1.2119517e-06
epoch time: 168.339 ms, per step time: 3.061 ms
epoch: 9989 step: 55, loss is 1.7882771e-06
epoch time: 159.096 ms, per step time: 2.893 ms
epoch: 9990 step: 55, loss is 1.1589409e-06
epoch time: 160.459 ms, per step time: 2.917 ms
epoch: 9991 step: 55, loss is 8.78855e-07
epoch time: 156.461 ms, per step time: 2.845 ms
epoch: 9992 step: 55, loss is 1.3546548e-06
epoch time: 157.824 ms, per step time: 2.870 ms
epoch: 9993 step: 55, loss is 3.1089023e-06
epoch time: 158.035 ms, per step time: 2.873 ms
epoch: 9994 step: 55, loss is 1.4939134e-06
epoch time: 160.428 ms, per step time: 2.917 ms
epoch: 9995 step: 55, loss is 2.164372e-06
epoch time: 155.159 ms, per step time: 2.821 ms
epoch: 9996 step: 55, loss is 9.635824e-07
epoch time: 156.919 ms, per step time: 2.853 ms
epoch: 9997 step: 55, loss is 1.0471658e-06
epoch time: 160.262 ms, per step time: 2.914 ms
epoch: 9998 step: 55, loss is 1.4574234e-06
epoch time: 160.660 ms, per step time: 2.921 ms
epoch: 9999 step: 55, loss is 2.0352143e-06
epoch time: 150.130 ms, per step time: 2.730 ms
epoch: 10000 step: 55, loss is 9.816508e-07
epoch time: 156.031 ms, per step time: 2.837 ms
Eval current epoch: 10000 loss: 0.0002412886533234922 l2_s11: 0.0030976369803562306
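A log of this length is easier to judge from a summary than by eye. The sketch below assumes the two-line-per-epoch format shown above (`epoch: N step: M, loss is X` followed by `... per step time: T ms`) and reports the last loss, the best loss, and the mean per-step time. The `summarize` helper is hypothetical, and the sample is taken from the last two epochs of the log.

```python
import re

LOSS_RE = re.compile(r"epoch:\s*(\d+)\s+step:\s*\d+,\s*loss is\s*([-+0-9.eE]+)")
STEP_RE = re.compile(r"per step time:\s*([0-9.]+)\s*ms")


def summarize(log_text):
    """Summarize loss and per-step time from MindSpore training log text."""
    losses = [(int(epoch), float(loss)) for epoch, loss in LOSS_RE.findall(log_text)]
    step_ms = [float(t) for t in STEP_RE.findall(log_text)]
    if not losses:
        return None
    best_epoch, best_loss = min(losses, key=lambda item: item[1])
    return {
        "epochs_seen": len(losses),
        "last_loss": losses[-1][1],
        "best_epoch": best_epoch,
        "best_loss": best_loss,
        "mean_step_ms": sum(step_ms) / len(step_ms) if step_ms else None,
    }


sample = """\
epoch: 9999 step: 55, loss is 2.0352143e-06
epoch time: 150.130 ms, per step time: 2.730 ms
epoch: 10000 step: 55, loss is 9.816508e-07
epoch time: 156.031 ms, per step time: 2.837 ms
"""
print(summarize(sample))
```

To run it on a full training log, redirect the training output to a file and pass the file contents to `summarize`.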
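The ckpt directory should contain the trained model:
The eval_res directory contains 49 images:
After downloading them, you can see:
2. Physics-driven AI solution of the frequency-domain Maxwell equations: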
cd ~/mindscience/MindElec/examples/physics_driven/frequency_domain_maxwell
python solve.py
...
The results are as follows:
(mindspore1.5) root@ecs-zhanghui-gpu:~/mindscience/MindElec/examples/physics_driven/frequency_domain_maxwell# python solve.py
pid: 2676
check test dataset shape: (10201, 2), (10201, 1)
[WARNING] OPTIMIZER(2676,7feb1bba3740,python):2021-11-09-00:05:06.369.176 [mindspore/ccsrc/frontend/optimizer/ad/dfunctor.cc:803] GetPrimalUser] J operation has no relevant primal call in the same graph. Func graph: 679_75_construct.92, J user: 679_75_construct.92:construct{[0]: [CNode]93, [1]: x0, [2]: u}
[WARNING] OPTIMIZER(2676,7feb1bba3740,python):2021-11-09-00:05:06.382.175 [mindspore/ccsrc/frontend/optimizer/ad/dfunctor.cc:803] GetPrimalUser] J operation has no relevant primal call in the same graph. Func graph: 622_132_construct.94, J user: 622_132_construct.94:construct{[0]: [CNode]95, [1]: x0, [2]: u}
[WARNING] OPTIMIZER(2676,7feb1bba3740,python):2021-11-09-00:05:06.595.722 [mindspore/ccsrc/frontend/optimizer/ad/dfunctor.cc:803] GetPrimalUser] J operation has no relevant primal call in the same graph. Func graph: 894_465_7_construct.116, J user: 894_465_7_construct.116:construct{[0]: [CNode]117, [1]: [CNode]118, [2]: [CNode]119}
[WARNING] OPTIMIZER(2676,7feb1bba3740,python):2021-11-09-00:05:06.614.336 [mindspore/ccsrc/frontend/optimizer/ad/dfunctor.cc:803] GetPrimalUser] J operation has no relevant primal call in the same graph. Func graph: 894_465_7_construct.116, J user: 894_465_7_construct.116:construct{[0]: [CNode]120, [1]: [CNode]118, [2]: [CNode]121}
[WARNING] CORE(2676,7feb1bba3740,python):2021-11-09-00:05:07.738.476 [mindspore/core/ir/anf_extends.cc:65] fullname_with_scope] Input 0 of cnode is not a value node, its type is CNode.
epoch: 1 step: 78, loss is 600.0
epoch time: 11268.853 ms, per step time: 144.472 ms
epoch: 2 step: 78, loss is 225.4
epoch time: 1389.687 ms, per step time: 17.816 ms
epoch: 3 step: 78, loss is 199.9
================================Start Evaluation================================
Total prediction time: 0.19255661964416504 s
l2_error: 0.20626515080160301
=================================End Evaluation=================================
epoch time: 1610.614 ms, per step time: 20.649 ms
epoch: 4 step: 78, loss is 10.19
epoch time: 1730.271 ms, per step time: 22.183 ms
epoch: 5 step: 78, loss is 2.803
epoch time: 1429.185 ms, per step time: 18.323 ms
epoch: 6 step: 78, loss is 2.316
================================Start Evaluation================================
Total prediction time: 0.0025403499603271484 s
l2_error: 0.019291123630052236
=================================End Evaluation=================================
epoch time: 1420.687 ms, per step time: 18.214 ms
epoch: 7 step: 78, loss is 2.2
epoch time: 1844.602 ms, per step time: 23.649 ms
epoch: 8 step: 78, loss is 1.953
epoch time: 1408.553 ms, per step time: 18.058 ms
epoch: 9 step: 78, loss is 1.856
================================Start Evaluation================================
Total prediction time: 0.0025916099548339844 s
l2_error: 0.015916268073532643
=================================End Evaluation=================================
epoch time: 1404.208 ms, per step time: 18.003 ms
epoch: 10 step: 78, loss is 1.33
epoch time: 1459.013 ms, per step time: 18.705 ms
l2 error: 0.0159162681
per step time: 18.7052916258
3. Physics-driven AI solution of the point-source Maxwell equations (incremental_learning example):
cd ~/mindscience/MindElec/examples/physics_driven/incremental_learning
After switching the device to GPU, run:
python piad.py --mode=pretrain
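...
Wait patiently:
It then turned out that the pretrain stage is configured for 3000 epochs:
Being short on funds, Zhang Xiaobai decisively stopped the training:
The MindSpore team presumably chose this number deliberately, and it probably takes the full 3000 epochs to bring the loss below 0.1. The loss is converging, but it is still fairly high.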
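Whether a long run fits the budget can be estimated up front from the per-epoch time and the hourly price. The sketch below is a rough calculation only. The per-epoch time of this pretrain run is not given here, so the example plugs in figures from the data-driven experiment above (10000 epochs at roughly 160 ms per epoch) and assumes a flat 7 yuan per hour; `estimate_cost` is a hypothetical helper.

```python
def estimate_cost(epochs, epoch_ms, yuan_per_hour):
    """Estimate the rental cost in yuan of a run with a steady epoch time."""
    hours = epochs * epoch_ms / 1000.0 / 3600.0
    return hours * yuan_per_hour


# Figures from the data-driven experiment; the hourly rate is an assumption.
print(f"{estimate_cost(10000, 160.0, 7.0):.2f} yuan")
```

Substituting the epoch time printed in the first few epochs of any other example gives a quick estimate before committing to the full run.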
4. Physics-driven AI solution of the point-source Maxwell equations (time_domain_maxwell example):
cd ~/mindscience/MindElec/examples/physics_driven/time_domain_maxwell
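Switch the device to GPU.
Having learned from the previous experiment, decisively edit the configuration to reduce the number of epochs:
Reduce the epochs from 6000 to 100.
Start training:
100 epochs go by fairly quickly.
Likewise, even with fewer epochs the loss is clearly converging, and the full 6000 epochs would presumably yield a fully trained model.
However, Zhang Xiaobai cannot spend his hard-earned money finding out, so shutting down the server and walking away is the best relief at this point.
With that, the MindElec homework for MindScience is basically complete.
(End of article. Thanks for reading.)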