Understanding Pooling in Transformer Architectures: Aggregating Outputs for Downstream Tasks


In the context of transformers, pooling refers to summarizing the token-level outputs of the transformer layers into a single fixed-size vector, which can then be used for downstream tasks such as classification.

In a transformer architecture, the input sequence is processed by a series of self-attention and feedforward layers. Each layer produces one output vector per input token, encoding the sequence at a progressively higher level of abstraction. Pooling takes the output vectors from one or more of these layers and aggregates them into a single vector, as in the sketch below.
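
As a rough illustration of where these per-token output vectors come from, the following PyTorch sketch builds a small encoder stack and runs a batch of embeddings through it; the dimensions and module choices are illustrative assumptions, not something prescribed by the architecture itself.

```python
import torch
import torch.nn as nn

# Illustrative encoder stack: d_model, nhead, and num_layers are arbitrary choices.
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

# A batch of 8 already-embedded sequences, 20 tokens each, 64 dimensions per token.
tokens = torch.randn(8, 20, 64)
hidden_states = encoder(tokens)

# One output vector per input token; pooling collapses the sequence dimension.
print(hidden_states.shape)  # torch.Size([8, 20, 64])
```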

There are several types of pooling mechanisms used in transformer architectures, including the ones below (a code sketch after the list illustrates each):

  1. Max Pooling: where the element-wise maximum across the sequence of output vectors is taken as the summary representation.

  2. Mean Pooling: where the average of the output vectors is taken as the summary representation.

  3. Last Hidden State: where the output vector at the final token position (or at a special token such as [CLS]) of the last layer is used as the summary representation.

  4. Self-Attention Pooling: where a weighted sum of the output vectors is computed, with the weights determined by a learned attention mechanism.
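
As a rough sketch of the four strategies above, assuming a hidden-state tensor of shape (batch, seq_len, hidden_dim) and a padding mask, the PyTorch code below computes each pooled representation; the tensor shapes, the mask, and the learned scorer layer are hypothetical choices for illustration only.

```python
import torch

# Hypothetical encoder outputs: 8 sequences, 20 tokens each, 64-dim vectors.
hidden_states = torch.randn(8, 20, 64)
attention_mask = torch.ones(8, 20)        # 1 for real tokens, 0 for padding

# 1. Max pooling: element-wise maximum over the sequence dimension.
max_pooled = hidden_states.max(dim=1).values                        # (8, 64)

# 2. Mean pooling: average over non-padded positions only.
mask = attention_mask.unsqueeze(-1)                                  # (8, 20, 1)
mean_pooled = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1)    # (8, 64)

# 3. Last hidden state: vector at the final position
#    (assumes no right-padding; otherwise index the last real token via the mask).
last_pooled = hidden_states[:, -1, :]                                # (8, 64)

# 4. Self-attention pooling: a learned scorer assigns a weight to each token,
#    and the pooled vector is the weighted sum of the token vectors.
scorer = torch.nn.Linear(64, 1)                                      # trained with the model
scores = scorer(hidden_states).squeeze(-1)                           # (8, 20)
weights = torch.softmax(scores, dim=1).unsqueeze(-1)                 # (8, 20, 1)
attn_pooled = (weights * hidden_states).sum(dim=1)                   # (8, 64)
```

Mean pooling with a mask is a common default for sentence embeddings, while the last-position and [CLS] variants are typical for decoder-style and BERT-style classifiers, respectively.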

Overall, pooling is an important component of transformer architectures, as it allows for the extraction of a fixed-size representation of the input sequence, which can be used for a variety of downstream tasks.


Author: robot learner