What I studied today:
- Multi-GPU Training
- Huge data -> need for powerful hardware
- GPU vs Node
- GPU is a card
- Node is a computer
- Multi Node Multi GPU: a cluster of fucking awesome computing machines, each with several GPUs
- Parallelism
- Model Parallel
- Data Parallel
- e.g. each GPU computes the loss on its own split of the data independently, and the gradients are averaged to update the model
- simple data parallel is very easy
# just have to add a single line of code!
model = torch.nn.DataParallel(model)  # replicates the model on every visible GPU and splits each batch across them
- Distributed data parallel (DDP) is a bit more complicated (it needs one process per GPU via multiprocessing); see the sketch below
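- a minimal DDP sketch (my own toy example; the model, data, and port number are placeholders)
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank, world_size):
    # each spawned process joins the process group and owns one GPU
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    model = DDP(torch.nn.Linear(10, 1).to(rank), device_ids=[rank])  # wrap the model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    x, y = torch.randn(32, 10).to(rank), torch.randn(32, 1).to(rank)  # placeholder batch
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()            # gradients are averaged (all-reduced) across processes
    optimizer.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(train, args=(world_size,), nprocs=world_size)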
- Hyperparameter Tuning
- finding the learning rate, number of epochs, batch size, etc.
- these days hyperparameter tuning usually leaves only a small room for improvement
- Grid vs Random (sketch below)
- grid: tries every combination from a designated search space
- random: literally samples the values at random
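- a toy comparison of the two (my own illustration; evaluate() and the search space are made up)
import itertools
import random

def evaluate(lr, batch_size):
    # placeholder: in practice this would train a model and return a validation metric
    return -(lr - 0.01) ** 2 - (batch_size - 64) ** 2 * 1e-6

# grid search: try every combination from the designated space
lrs, batch_sizes = [0.001, 0.01, 0.1], [32, 64, 128]
grid_best = max(itertools.product(lrs, batch_sizes), key=lambda p: evaluate(*p))

# random search: sample the same number of combinations literally at random
candidates = [(10 ** random.uniform(-4, -1), random.choice(batch_sizes)) for _ in range(9)]
random_best = max(candidates, key=lambda p: evaluate(*p))
print(grid_best, random_best)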
- Ray
- standard multi-node multiprocessing module for ML and DL
- numerous modules for hyperparameter search (Ray Tune)
- the whole training process should be in a single function to use Ray; see the sketch below
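- a minimal Ray Tune sketch (my own toy example; the fake training loop and metric are placeholders, and the exact reporting API differs between Ray versions)
from ray import tune

def train_fn(config):
    # the entire "training process" lives inside this one function
    for epoch in range(5):
        loss = (config["lr"] - 0.01) ** 2 + 1.0 / (epoch + 1)  # fake loss
        tune.report(loss=loss)  # report the metric back to Tune each epoch

search_space = {
    "lr": tune.loguniform(1e-4, 1e-1),        # sampled at random
    "batch_size": tune.choice([32, 64, 128]),
}
analysis = tune.run(train_fn, config=search_space, num_samples=10)
print(analysis.get_best_config(metric="loss", mode="min"))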
- PyTorch Troubleshooting
- OOM (out-of-memory) errors are hard to trace: hard to know why and where they occurred
- how to solve? use a smaller batch size and rerun!
- Dealing with the problem on the GPU
- check whether GPU memory is leaking, i.e. whether memory usage keeps increasing at each iteration
from GPUtil import showUtilization
showUtilization()  # prints per-GPU utilization and memory usage
- deleting unused tensors and emptying the cache to free up needed GPU memory
del tensorList  # drop the Python reference so the tensors can actually be freed
torch.cuda.empty_cache()  # release unused cached memory held by PyTorch's allocator
- tensor variables accumulate: a tensor that requires grad keeps its whole computation graph alive
- convert it to a basic Python value (e.g. loss.item()) to prevent the accumulation; see the sketch below
- Python variables inside a loop still exist after the loop, so the last tensor stays in memory
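- a small sketch of the accumulation pitfall (my own toy example; the model and data are placeholders)
import torch

model = torch.nn.Linear(10, 1)
total_loss = 0.0
for _ in range(100):
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    # total_loss += loss        would keep every iteration's computation graph alive
    total_loss += loss.item()   # a plain Python float, so the graph can be freed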
- torch.no_grad()
- inside no_grad, no computation graph is built for the backward pass -> less memory use (sketch below)
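- a minimal no_grad sketch for inference (placeholder model and data)
import torch

model = torch.nn.Linear(10, 1)
x = torch.randn(32, 10)
with torch.no_grad():   # no computation graph is built inside this block
    preds = model(x)    # uses noticeably less memory than a training-mode forward pass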
- Most of the errors in CNNs come from a mismatch of tensor sizes; see the sketch below
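- a quick way to catch size mismatches: push a dummy input through the model and print the shape (the CNN below is a made-up example)
import torch

cnn = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, kernel_size=3, padding=1),
    torch.nn.MaxPool2d(2),
    torch.nn.Flatten(),
    torch.nn.Linear(16 * 16 * 16, 10),  # must match the flattened feature size exactly
)
dummy = torch.randn(1, 3, 32, 32)       # (batch, channels, height, width)
print(cnn(dummy).shape)                 # torch.Size([1, 10]) when the sizes line up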
- can change the precision of tensors to 16-bit (half precision) to save memory; see the sketch below
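- a tiny half-precision sketch (placeholder tensor; in real training, mixed precision is usually handled with torch.cuda.amp)
import torch

x = torch.randn(32, 10)
print(x.dtype)        # torch.float32
x16 = x.half()        # torch.float16: half the memory per element
print(x16.dtype)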