今日推荐英文原文：《How to PyTorch in Production》
推荐理由：相信大家都有看过前端或者后台的学习路线图之类的 Roadmap 了，这次要介绍的是一个关于 Node.js 的路线图，推荐给那些想要学习 Node.js 的朋友或者不知道自己下一步应该学习什么的朋友，其中包括了 web 框架，实时通信，数据库和设计模式等几个大方面。不过别忘了，路线图只是一个指引的作用，自己学会了解什么时候什么工具最适合才是最重要的。
今日推荐英文原文：《How to PyTorch in Production》作者：Taras Matsyk
推荐理由：如何避免使用 PyTorch 的时候犯下一些常见错误
How to PyTorch in ProductionML is fun, ML is popular, ML is everywhere. Most of the companies use either TensorFlow or PyTorch. There are some oldfags who prefer caffe, for instance. Mostly it’s all about Google vs Facebook battle.
Most of my experience goes to PyTorch, even though most of the tutorials and online tutorials use TensofFlow (or hopefully bare numpy). Currently, at Lalafo (AI-driven classified), we are having fun with PyTorch. No, really, we tried caffe, it is awesome unless you have not spent a few days on installing it already. Even better, PyTorch is 1.0 now, we were using it from 0.3 and it was dead simple and robust. Ahhh.. maybe a few tweaks here, a few tweaks there. Most of the issues were easy to fix and did not cause any problems for us. Walk in the park, really.
Here, I want to share the most common 5 mistakes for using PyTorch in production. Thinking about using a CPU? Multithreading? Using more GPU memory? We’ve gone through it. Now let me guide you too.
Mistake #1 — Storing dynamic graph during in the inference modeIf you have used TensorFlow back in the days, you are probably aware of the key difference between TF and PT — static and dynamic graphs. It was extremely hard to debug TFlow due to rebuilding graph every time your model has changed. It took time, efforts and your hope away too. Of course, TensorFlow is better now.
Overall, to make debugging easier ML frameworks use dynamic graphs which are related to so-called Variables in PyTorch. Every variable you use links to the previous variable building relationships for back-propagation.
Here is how it looks in practice:
In most of cases, you want to optimize all computations after the model has been trained. If you look at the torch interface, there are a lot of options, especially in optimization. A lot of confusion caused by eval mode, detach and no_grad methods. Let me clarify how they work. After the model is trained and deployed here are things you care about: Speed, Speed and CUDA Out of Memory exception
To speed up pytorch model you need to switch it into eval mode. It notifies all layers to use batchnorm and dropout layers in inference mode (simply saying deactivation dropouts). Now, there is a detach method which cuts variable from its computational graph. It’s useful when you are building model from scratch but not very when you want to reuse State of Art mdoel. A more global solution would be to wrap forward in torch.no_grad context which reduces memory consumptions by not storing graph links in results. It saves memory, simplifies computations thus - you get more speed and less memory used. Bingo!
Mistake #2 — Not enabling cudnn optimization algorithmsThere is a lot of boolean flags you can set in nn.Module, the one you must be aware of stored in cudnn namespace. To enable cudnn optimization use cudnn.benchmark = True. To make sure cudnn does look for optimal algorithms, enable it by setting cudnn.enabled = True. NVIDIA does a lot of magic for you in terms of optimization which you could benefit from.
Please be aware your data must be on GPU and model input size should not vary. The more variety in shape of data — the fewer optimizations can be done. To normalize data you can pre-process images, for instance. Overall, be creative, but not too much.
Mistake #3 — Re-using JIT-compilationPyTorch provides an easy way to optimize and reuse your models from different languages (read Python-To-Cpp). You might be more creative and inject your model in other languages if you are brave enough (I am not, CUDA: Out of memory is my motto)
JIT-compilation allows optimizing computational graph if input does not change in shape. What it means is if your data does not vary too much (see Mistake #2) JIT is a way to go. To be honest, it did not make a huge difference comparing to no_grad and cudnn mentioned above, but it might. This is only the first version and has huge potential.
Please be aware that it does not work if you have conditions in your model which is a common case in RNNs.
Full documentation can be found on pytorch.org/docs/stable/jit
Mistake #4 — Trying to scale using CPU instancesGPUs are expensive, as VMs in the cloud. Even if you check AWS one instance will cost you around 100$/day (the lowest price is 0.7$/h) Reference: aws.amazon.com/ec2/pricing/on-demand/. Another useful cheatsheet I use is www.ec2instances.info Every person who graduated from 3d grade could think: “Ok, what if I buy 5 CPU instances instead of 1 GPU”. Everyone who has tried to run NN model on CPU knows this is a dead end. Yes, you could optimize a model for CPU, however in the end it still will be slower than a GPU one. I strongly recommend to relax and forget about this idea, trust me.
Mistake #5 — Processing vectors instead of matrices
- cudnn - check
- no_grad - check
- GPU with correct version of CUDA - check
- JIT-compilation - check
Now it’s time to use a bit of math. If you remember how most of NN are trained using so-called Tensor(s). Tensor is an N-dimensional array or multi-linear geometric vectors mathematically speaking. What you could do is to group inputs (if you have a luxury to) into tensors or matrix and feed it into your model. For instance, using an array of images as a matrix sent to PyTorch. Performance gain equals to the number of objects passed simultaneously.
This is an obvious solution but few people actually using it as most of the time objects are processed one by one and it might be a bit hard to set up such flow architecturally. Do not worry, you’ll make it!
What’s next?There are definitely more tips on how to optimize models in PyTorch. I will continue posting on our experience using Facebook kid in the wild. What about you, what are your tips to achieve better performance on inference?