Techniques for Efficient Memory Management in Deep Learning

2020. 2. 13. 02:06 | Research

1. Activation Compression

This approach trades a very small amount of accuracy for a compressed representation of the data being stored.

 

"Faster Neural Networks Straight from JPEG"에서는 일부 디코딩된 JPEG 이미지에서 뽑아낸 discrete cosine transform code를 통해서 ImageNet 이미지를 분류하는 DNN 알고리즘을 고안하였다. 이를 통해서 inferencing 속도 향상과 memory 점유를 낮출 수 있었다.

Faster Neural Networks Straight from JPEG

The simple, elegant approach of training convolutional neural networks (CNNs) directly from RGB pixels has enjoyed overwhelming empirical success. But could more performance be squeezed out of networks by using different input representations? In this paper we propose and explore a simple idea: train CNNs directly on the blockwise discrete cosine transform (DCT) coefficients computed and available in the middle of the JPEG codec. Intuitively, when processing JPEG images using CNNs, it seems unnecessary to decompress a blockwise frequency representation to an expanded pixel representation, shuffle it from CPU to GPU, and then process it with a CNN that will learn something similar to a transform back to frequency representation in its first layers. Why not skip both steps and feed the frequency domain into the network directly? In this paper, we modify libjpeg to produce DCT coefficients directly, modify a ResNet-50 network to accommodate the differently sized and strided input, and evaluate performance on ImageNet. We find networks that are both faster and more accurate, as well as networks with about the same accuracy but 1.77x faster than ResNet-50.
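To make the idea concrete, here is a minimal sketch (not the paper's actual pipeline, which modifies libjpeg and a ResNet-50): it computes 8x8 blockwise DCT coefficients with SciPy and feeds them to a small PyTorch CNN as a 64-channel input at 1/8 spatial resolution. The block size, network, and shapes are illustrative assumptions.

import numpy as np
import torch
import torch.nn as nn
from scipy.fft import dctn

def blockwise_dct(img, block=8):
    """(H, W) grayscale array -> (block*block, H//block, W//block) DCT coefficients."""
    H = (img.shape[0] // block) * block
    W = (img.shape[1] // block) * block
    out = np.zeros((block * block, H // block, W // block), dtype=np.float32)
    for i in range(0, H, block):
        for j in range(0, W, block):
            c = dctn(img[i:i + block, j:j + block], norm="ortho")  # 2-D DCT of one block
            out[:, i // block, j // block] = c.reshape(-1)
    return out

# Toy CNN that consumes 64 DCT channels instead of 3 RGB channels.
cnn = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(128, 1000),                                   # e.g. ImageNet classes
)

img = np.random.rand(224, 224).astype(np.float32)           # stand-in for a decoded luma plane
x = torch.from_numpy(blockwise_dct(img)).unsqueeze(0)       # shape (1, 64, 28, 28)
logits = cnn(x)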


"Gist: Efficient Data Encoding for Deep Neural Network Training"에서는 뉴럴넷 학습시에 feature map이나 activation의 메모리 점유율이 매우 크다는 점을 해결하기 위해서 activation의 (불필요하게 긴) precision을 줄임으로 써 메모리 사용량을 절반으로 줄였다.

Modern deep neural networks (DNNs) training typically relies on GPUs to train complex hundred-layer deep networks. A significant problem facing both researchers and industry practitioners is that, as the networks get deeper, the available GPU main memory becomes a primary bottleneck, limiting the size of networks it can train. In this paper, we investigate widely used DNNs and find that the major contributors to memory footprint are intermediate layer outputs (feature maps). We then introduce a framework for DNN-layer-specific optimizations (e.g., convolution, ReLU, pool) that significantly reduce this source of main memory pressure on GPUs. We find that a feature map typically has two uses that are spread far apart temporally. Our key approach is to store an encoded representation of feature maps for this temporal gap and decode this data for use in the backward pass; the full-fidelity feature maps are used in the forward pass and relinquished immediately. Based on this approach, we present Gist, our system that employs two classes of layer-specific encoding schemes – lossless and lossy – to exploit existing value redundancy in DNN training to significantly reduce the memory consumption of targeted feature maps. For example, one insight is by taking advantage of the computational nature of back propagation from pool to ReLU layer, we can store the intermediate feature map using just 1 bit instead of 32 bits per value. We deploy these mechanisms in a state-of-the-art DNN framework (CNTK) and observe that Gist reduces the memory footprint to upto 2× across 5 state-of-the-art image classification DNNs, with an average of 1.8× with only 4% performance overhead. We also show that further software (e.g., CuDNN) and hardware (e.g., dynamic allocation) optimizations can result in even larger footprint reduction (upto 4.1×).
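As a rough illustration of the lossy encoding idea (a PyTorch sketch, not Gist itself), a custom autograd function can stash only a float16 copy of the ReLU output for the temporal gap between the forward and backward passes; Gist goes further and stores just 1 bit per value for the ReLU-to-pool pattern.

import torch

class HalfStashReLU(torch.autograd.Function):
    """Illustrative only: keep a lossy, half-precision stash for the backward
    pass instead of the full-precision ReLU output."""

    @staticmethod
    def forward(ctx, x):
        y = x.clamp(min=0)
        ctx.save_for_backward(y.half())         # half-size stash
        return y

    @staticmethod
    def backward(ctx, grad_out):
        (y_half,) = ctx.saved_tensors
        mask = (y_half > 0).to(grad_out.dtype)  # ReLU backward only needs the sign pattern
        return grad_out * mask

x = torch.randn(4, 1024, requires_grad=True)
y = HalfStashReLU.apply(x)
y.sum().backward()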


Compression techniques lose no accuracy up to a certain compression ratio (usually only a constant factor), but beyond that point accuracy degrades sharply.


2. Checkpointing and Rematerialization

Rematerialization means discarding a value that may be reused later, instead of storing it and loading it back from memory, and recomputing it when it is actually needed. Intuitively, a value that will be used again should be stored and reloaded so that no extra computation is spent, but in some cases recomputing it later is more efficient. (For example, when a compiler generates machine code under register spilling, recomputing a value can be cheaper than spilling and reloading it.)

 

[NOTE] Register Spilling: when a compiler generates machine code and there are more live variables than the machine has registers, some values inevitably have to be stored from registers to memory and loaded back when they are needed again. This situation, like pouring more water into an already full cup, is called a register spill.

 

For compilers, rematerialization only pays off for a few exceptional values whose dependencies are register-resident, but in deep learning memory optimization settings such as Checkmate, an entire operation subgraph can be recomputed, because the cost of spilling from GPU memory to host RAM is very high.
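This store-versus-recompute trade-off is exposed directly in PyTorch as gradient checkpointing. A minimal usage sketch with a toy model (sizes are illustrative): only the activations at segment boundaries are kept, and everything inside a segment is rematerialized during the backward pass.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Toy 8-layer model split into 4 checkpointed segments.
model = nn.Sequential(*[nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(8)])
x = torch.randn(32, 1024, requires_grad=True)

y = checkpoint_sequential(model, 4, x)   # (functions, segments, input)
y.sum().backward()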


3. Reversible Networks

During DNN training, intermediate values can be recomputed from values produced later in the forward propagation. Just as with gradient checkpointing, forward-pass activations do not have to be kept in memory; they can be recomputed during the backward pass. "In-Place Activated BatchNorm for Memory-Optimized Training of DNNs" replaces the ReLU and batch norm layers with invertible operations so that intermediate values can be recovered during the backward pass, cutting memory usage by up to 50%.

 

Memory optimization based on reversibility is reportedly not yet widely used.

 

In this work we present In-Place Activated Batch Normalization (INPLACE-ABN) – a novel approach to drastically reduce the training memory footprint of modern deep neural networks in a computationally efficient way. Our solution substitutes the conventionally used succession of BatchNorm + Activation layers with a single plugin layer, hence avoiding invasive framework surgery while providing straightforward applicability for existing deep learning frameworks. We obtain memory savings of up to 50% by dropping intermediate results and by recovering required information during the backward pass through the inversion of stored forward results, with only minor increase (0.8- 2%) in computation time. Also, we demonstrate how frequently used checkpointing approaches can be made computationally as efficient as INPLACE-ABN. In our experiments on image classification, we demonstrate on-par results on ImageNet-1k with state-of-the-art approaches. On the memory-demanding task of semantic segmentation, we report results for COCO-Stuff, Cityscapes and Mapillary Vistas, obtaining new state-of-the-art results on the latter without additional training data but in a single-scale and -model scenario
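A minimal sketch of the reversibility idea, using an additive-coupling block in the spirit of reversible networks rather than the INPLACE-ABN layer itself: the block's inputs can be reconstructed exactly from its outputs, so they do not need to be stored for the backward pass.

import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    """Additive coupling: y1 = x1 + F(x2), y2 = x2 + G(y1).
    The inverse recovers (x1, x2) from (y1, y2) by recomputing G and F."""

    def __init__(self, f, g):
        super().__init__()
        self.f, self.g = f, g

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2

f = nn.Sequential(nn.Linear(64, 64), nn.ReLU())
g = nn.Sequential(nn.Linear(64, 64), nn.ReLU())
block = ReversibleBlock(f, g)

x1, x2 = torch.randn(8, 64), torch.randn(8, 64)
y1, y2 = block(x1, x2)
rx1, rx2 = block.inverse(y1, y2)
assert torch.allclose(rx1, x1) and torch.allclose(rx2, x2)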

 

4. Distributed Computation

Model parallelism is a good way to overcome memory limits, but it is fairly hard to apply: it requires extra engineering effort, additional accelerators, and a fast network between them.

 

Gradient accumulation is another way to work around the limits of GPU memory.

 

Accumulating gradients just means that, before calling optimizer.step() to perform a step of gradient descent, we will sum the gradients of several backward operations in the 'parameter.grad' tensors. This is straightforward to do in PyTorch as the gradient tensors are not reset unless we call 'model.zero_grad()' or 'optimizer.zero_grad()'. We’ll also need to divide by the number of accumulation steps if our loss is averaged over the training samples.

 

Gradient Accumulation

 

# Assumes model, optimizer, loss_function, training_set, accumulation_steps,
# evaluation_steps and evaluate_model are already defined elsewhere.
model.zero_grad()                                   # Reset gradient tensors
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)                     # Forward pass
    loss = loss_function(predictions, labels)       # Compute loss function
    loss = loss / accumulation_steps                # Normalize our loss (if averaged)
    loss.backward()                                 # Backward pass (gradients accumulate in .grad)
    if (i+1) % accumulation_steps == 0:             # Wait for several backward steps
        optimizer.step()                            # Now we can do an optimizer step
        model.zero_grad()                           # Reset gradient tensors
        if (i+1) % evaluation_steps == 0:           # Evaluate the model when we...
            evaluate_model()                        # ...have no gradients accumulated