[ICLR 2019] ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware

2020. 7. 28. 02:41 · Uncategorized

  • Directly applying NAS to a large-scale task (e.g. ImageNet) is computationally expensive or even infeasible.
  • To reduce this cost, some works propose searching for building blocks on proxy tasks, such as training for fewer epochs, starting with a smaller dataset (e.g. CIFAR-10), or learning with fewer blocks. $\to$ Architectures found on a proxy cannot be guaranteed to be optimal on the target task.
  • ProxylessNAS directly learns the architectures on the target task and target hardware, without any proxy.

 

 

Method

1. Construction of Over-Parameterized Network

Notation

  • $\mathcal{N}(e_1, \dots, e_n)$: a neural network composed of edges $e_i$, where each $e_i$ represents a certain edge in the DAG.
  • $O = \{o_i\}$: the set of $N$ candidate primitive operations (e.g. conv, pooling, ...)

 

  • To construct the over-parameterized network that includes any architecture in the search space, each edge is set to be a mixed operation ($m_O$) with $N$ parallel paths as shown in Figure 2.
  • Given input $x$, the output of a mixed operation $m_O$ is defined based on the outputs of its $N$ paths. Following DARTS, the paths are combined as a weighted sum, $m_O(x) = \sum_{i=1}^{N} p_i \, o_i(x)$, where the path weights $p_i = \frac{\exp(\alpha_i)}{\sum_j \exp(\alpha_j)}$ come from a softmax over the architecture parameters $\alpha_i$.
  • Under this formulation, the output feature maps of all $N$ paths are calculated and stored in memory, while training a compact model only involves one path (a minimal code sketch of such a mixed operation is given after this list).

  • Because all $N$ paths are kept, the over-parameterized network requires roughly $N$ times the memory of a compact model, which easily exceeds hardware memory limits.
  • This problem can be solved by binarizing the paths.
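
The following is a minimal PyTorch sketch of such a DARTS-style mixed operation, illustrating why all $N$ candidate outputs end up in memory. The candidate operation set, channel count, and the `MixedOp` class name are illustrative assumptions, not the paper's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """Weighted sum over N candidate operations: every path is executed."""
    def __init__(self, channels):
        super().__init__()
        # An illustrative candidate set O = {o_i}; the real search space differs.
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.Conv2d(channels, channels, kernel_size=5, padding=2),
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Identity(),
        ])
        # One architecture parameter alpha_i per candidate path.
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        p = F.softmax(self.alpha, dim=0)          # path weights p_i
        # All N outputs o_i(x) are computed and kept for the backward pass,
        # so activation memory grows roughly N-fold.
        return sum(p[i] * op(x) for i, op in enumerate(self.ops))
```

Because the weighted sum touches every `op(x)`, autograd keeps all $N$ activation maps alive for backpropagation; this is exactly the memory cost that the binarized paths below remove.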

 

2. Learning Binarized Path

  • To save memory, only one path is kept when training the over-parameterized network.
  • Among the $N$ different paths, only one path of activation is kept active in memory at run-time by using binary gates $g$ (a code sketch follows the equations below):

$$g = \operatorname{binarize}(p_1, \dots, p_N) = \begin{cases} [1, 0, \dots, 0] & \text{with probability } p_1 \\ \quad\cdots \\ [0, 0, \dots, 1] & \text{with probability } p_N \end{cases}$$

  • The output of the binarized mixed operation is then given as:

$$m_O^{\text{Binary}}(x) = \sum_{i=1}^{N} g_i \, o_i(x) = \begin{cases} o_1(x) & \text{with probability } p_1 \\ \quad\cdots \\ o_N(x) & \text{with probability } p_N \end{cases}$$
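
A minimal sketch of this binarized mixed operation is given below, assuming the same kind of candidate-op list as in the earlier sketch. The `BinaryMixedOp` name is hypothetical, and the straight-through trick used so that the sampled path still passes a gradient to $\alpha$ is a common simplification, not the paper's exact gradient estimator for the binary gates.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinaryMixedOp(nn.Module):
    """Mixed operation where only one sampled path runs per forward pass."""
    def __init__(self, ops):
        super().__init__()
        self.ops = nn.ModuleList(ops)                     # candidate operations o_i
        self.alpha = nn.Parameter(torch.zeros(len(ops)))  # architecture parameters

    def forward(self, x):
        p = F.softmax(self.alpha, dim=0)                  # path probabilities p_i
        idx = torch.multinomial(p.detach(), 1).item()     # sample the binary gate g
        # Straight-through approximation: the gate evaluates to 1 in the forward
        # pass, but its gradient flows back into p (and hence alpha) for the
        # sampled path. The paper's estimator differs; this is a simplification.
        gate = 1.0 - p[idx].detach() + p[idx]
        return gate * self.ops[idx](x)                    # only one o_i(x) in memory
```

Only the sampled `ops[idx]` is executed, so the activation memory of one training step matches that of a single compact model rather than growing with $N$.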

 

  • When training weight parameters, the architecture parameters are frozen, and binary gates are stochastically sampled according to the above equation for each input batch.
  • Then the weight parameters of active paths are updated via standard gradient descent.
  • When training architecture parameters, the weight parameters are frozen, then binary gates are reset, and architecture parameters are updated based on the validation set.
  • These two update steps are performed alternately.
  • Once the training of architecture parameters is finished, the compact architecture is derived by keeping the path with the highest path weight and pruning the redundant paths (see the training-loop sketch after this list).
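
Below is a minimal sketch of this alternating search loop, assuming a supernet built from `BinaryMixedOp` blocks like the one sketched above; the data loaders, loss, optimizers, and learning rates are illustrative assumptions, and the paper's exact architecture-gradient estimator is not reproduced here.

```python
import torch
import torch.nn as nn

def run_search(model, train_loader, valid_loader, epochs=1):
    criterion = nn.CrossEntropyLoss()
    # Split weight parameters from architecture parameters (the alphas).
    w_params = [p for n, p in model.named_parameters() if "alpha" not in n]
    a_params = [p for n, p in model.named_parameters() if "alpha" in n]
    w_opt = torch.optim.SGD(w_params, lr=0.05, momentum=0.9)  # assumed hyper-params
    a_opt = torch.optim.Adam(a_params, lr=1e-3)

    for _ in range(epochs):
        for (x_w, y_w), (x_a, y_a) in zip(train_loader, valid_loader):
            # 1) Weight step on a training batch. The alphas stay frozen because
            #    only w_opt is stepped; a fresh binary gate is sampled inside
            #    every forward pass of the mixed operations.
            w_opt.zero_grad()
            criterion(model(x_w), y_w).backward()
            w_opt.step()

            # 2) Architecture step on a validation batch. The weights stay
            #    frozen because only a_opt is stepped.
            a_opt.zero_grad()
            criterion(model(x_a), y_a).backward()
            a_opt.step()

    # After the search, keep the path with the highest path weight in every
    # mixed operation; the remaining paths are pruned.
    return {name: module.alpha.argmax().item()
            for name, module in model.named_modules() if hasattr(module, "alpha")}
```

In the paper, the architecture step additionally samples two paths per mixed operation and updates their parameters with a factorized, rescaled gradient to keep memory at the single-path level; the sketch above keeps the simpler single-path straight-through update.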