GlobEnc/gen
Table of Contents
1. @citations/0 @2022/May GlobEnc: Quantifying Global Token Attribution by Incorporating the Whole Encoder Layer in Transformers
1.2. @read [1401/07/29]
1.3. @abstract
There has been a growing interest in interpreting the underlying dynamics of Transformers. While self-attention patterns were initially deemed as the primary option, recent studies have shown that integrating other components can yield more accurate explanations. This paper introduces a novel token attribution analysis method that incorporates all the components in the encoder block and aggregates this throughout layers. Through extensive quantitative and qualitative experiments, we demonstrate that our method can produce faithful and meaningful global token attributions. Our experiments reveal that incorporating almost every encoder component results in increasingly more accurate analysis in both local (single layer) and global (the whole model) settings. Our global attribution analysis significantly outperforms previous methods on various tasks regarding correlation with gradient-based saliency scores. Our code is freely available at https://github.com/mohsenfayyaz/GlobEnc.
@inproceedings{Modarressi2022GlobEncQG, title={GlobEnc: Quantifying Global Token Attribution by Incorporating the Whole Encoder Layer in Transformers}, author={A. Modarressi and Mohsen Fayyaz and Yadollah Yaghoobzadeh and Mohammad Taher Pilehvar}, booktitle={NAACL}, year={2022} }
1.4. @ideas
In future work, we plan to apply our global analysis method to various datasets and models, to provide valuable insights into model decisions and interpretability.
- @idea/small We could also run tests with a fixed residual mixing ratio larger than 0.5 (e.g., 0.6, 0.7, 0.8, 0.9, 0.95, and 0.99). (In the paper’s notation this is actually a ratio smaller than 0.5, since the paper’s ratio weights the context rather than the residual.) See the sketch after this list.
- @idea/accepted We can try extending the work to other Transformer architectures (encoder-decoder, decoder-only) and other tasks. (Classification is the only task currently studied.)
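A minimal sketch of the fixed-ratio idea above, assuming we already have one per-layer attribution matrix whose rows can be normalized to sum to one; the function and its inputs are hypothetical and not taken from the GlobEnc code:

```python
import numpy as np

def aggregate_with_fixed_ratio(layer_attributions, context_ratio=0.4):
    """Aggregate per-layer attribution matrices across layers, mixing in the
    residual connection with a fixed ratio instead of a norm-derived one.

    layer_attributions: list of (seq_len, seq_len) matrices, one per layer,
        where row i describes how much each input of that layer contributes
        to output token i (hypothetical inputs for this sketch).
    context_ratio: weight of the context (attention) part; the residual gets
        1 - context_ratio, so values below 0.5 favor the residual stream.
    """
    n = layer_attributions[0].shape[0]
    joint = np.eye(n)
    for attr in layer_attributions:
        attr = attr / attr.sum(axis=-1, keepdims=True)    # make rows sum to 1
        mixed = (1 - context_ratio) * np.eye(n) + context_ratio * attr
        joint = mixed @ joint                             # roll up through the layers
    return joint

# Example: three random "layers" over a 5-token sequence.
rng = np.random.default_rng(0)
layers = [rng.random((5, 5)) for _ in range(3)]
for r in (0.1, 0.3, 0.5, 0.9):
    print(r, aggregate_with_fixed_ratio(layers, context_ratio=r)[0].round(3))
```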
1.5. @highlights
- Norm-based attention
While one may interpret the attention mechanism using the attention weights A, Kobayashi et al. (2020) argued that doing so would ignore the norm of the transformed vectors multiplied by the weights, elucidating that the weights are insufficient for interpretation.
A small vector will have little impact even if it has a large attention weight (see the toy sketch after these notes).
- QUESTION A central assumption of this whole Kobayashi line of work is that a small vector norm means little information. Is this warranted? Aren’t the positional embeddings, for example, small in magnitude? Ultimately, a lower bit of a weight can carry as much information as a higher bit, no?
- The magnitude of the input vector is not the only other factor that matters; the magnitudes of the linear projections that turn it into a value vector and then mix the concatenated head outputs matter as well.
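A toy sketch of the weight-vs-norm contrast, using random tensors rather than a real model; Kobayashi et al. actually use the fully transformed vector \(f(x_j)\) (through \(W_V\) and \(W_O\)), but the value-vector norm alone already shows the effect:

```python
import torch

torch.manual_seed(0)
seq_len, d_model, d_head = 4, 8, 8

# Toy single-head ingredients (random stand-ins, not real model weights).
alpha = torch.softmax(torch.randn(seq_len, seq_len), dim=-1)  # attention weights A
W_V = torch.randn(d_model, d_head)                            # value projection
x = torch.randn(seq_len, d_model)
x[2] *= 0.01                                                  # token 2 has a tiny hidden state

v = x @ W_V                                                   # value vectors v(x_j)

weight_based = alpha                   # weight-only view: column j = attention paid to token j
norm_based = alpha * v.norm(dim=-1)    # norm-based view: ||alpha_ij * v(x_j)||

print(weight_based[:, 2])  # can be sizeable
print(norm_based[:, 2])    # nearly zero: a small vector has little impact despite its weight
```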
- By reformulating Equation 1, we can consider \(z_i\) as a summation over the attention heads:
- the value vectors \(v(x_j)\)
- the projection \(W_O\) with shape (head_count * head_dim, hidden_dim)
- CONFIRM One might wonder whether \(W_O\) is superfluous, since there is no nonlinearity between \(v^h\) and \(W_O\). I think the key point is that these two matrices are just a (rank-limited) decomposition of a bigger matrix B (see the check after this list):
\(v^h_{hidden \times head\_dim} \times W^h_{O,\ head\_dim \times hidden} = B_{hidden \times hidden}\)
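A quick numerical check of the CONFIRM above (toy shapes and random weights; attention weights and biases dropped for clarity): per head, the value projection and that head's slice of \(W_O\) multiply out to a single hidden×hidden matrix, so \(W_O\) adds no nonlinearity, only a rank-limited factorization.

```python
import torch

torch.manual_seed(0)
d_model, n_heads = 8, 2
d_head = d_model // n_heads

x_j = torch.randn(d_model)                   # one token's hidden state
W_V = torch.randn(n_heads, d_model, d_head)  # per-head value projections (toy weights)
W_O = torch.randn(n_heads, d_head, d_model)  # per-head slices of the output projection

# Two-step view: project to each head's value vector, then through W_O's slice.
two_step = sum(x_j @ W_V[h] @ W_O[h] for h in range(n_heads))

# One-step view: each W_V^h W_O^h is a d_model x d_model matrix B^h of rank <= d_head.
B = sum(W_V[h] @ W_O[h] for h in range(n_heads))
one_step = x_j @ B

print(torch.allclose(two_step, one_step, atol=1e-5))  # True
```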
1.6. @questions
- @togrok
- QUESTION Does “computational intensity” mean computational cost?
Additionally, gradient-based alternatives (Simonyan et al., 2014; Kindermans et al., 2016; Li et al., 2016) have been argued to provide a more robust basis for token attribution analysis (Atanasova et al., 2020; Brunner et al., 2020; Pascual et al., 2021). Nonetheless, the gradient-based alternatives have not been able to fully replace attention-based counterparts, mainly due to their high computational intensity.
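For context on the cost, here is a minimal gradient×input saliency sketch (one common gradient-based variant, not necessarily the exact saliency formulation used in the paper). The checkpoint name is only an illustrative SST-2 model; the key point is that every example needs its own full backward pass.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "textattack/bert-base-uncased-SST-2"   # example checkpoint, not the paper's setup
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).eval()

enc = tok("a gripping and deeply moving film", return_tensors="pt")
embeds = model.get_input_embeddings()(enc["input_ids"]).detach().requires_grad_(True)

logits = model(inputs_embeds=embeds, attention_mask=enc["attention_mask"]).logits
logits[0, logits[0].argmax()].backward()      # one full backward pass per example

saliency = (embeds.grad * embeds).norm(dim=-1).squeeze(0)  # gradient x input, per token
for token, score in zip(tok.convert_ids_to_tokens(enc["input_ids"][0]), saliency.tolist()):
    print(f"{token:>10s}  {score:.3f}")
```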
- CONFIRM Does \(1[i=j]\) mean that if \(i=j\), output one else output zero?
- CONFIRM So Equation 10 is not an exact equality, right? It shows the decomposition we want, but the equality does not actually hold.
- QUESTION Shouldn’t the attribution matrix \(\mathcal{N}\) be normalized along each of its rows?
- QUESTION This is especially concerning when we multiply the attribution matrices of consecutive layers. Without the normalization, this operation doesn’t make sense IMO (see the sketch after the example).
See this example: \(input_{1}\) and \(input_{3}\) contribute equally IMO, but \(input_{1}\) gets an attribution of 10 while \(input_{3}\) gets 1!
Of course, this particular example might be impossible due to the layer norms, but the general principle stands.
\begin{aligned} \mathcal{N}_1 &= \begin{bmatrix} 10 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \\ \end{bmatrix} \\ \mathcal{N}_2 &= \begin{bmatrix} 1 & 0 & 1 \\ \hdotsfor{3} \\ \hdotsfor{3} \\ \end{bmatrix} \\ \mathcal{N}_2 \times \mathcal{N}_1 &= \begin{bmatrix} 1 & 0 & 1 \\ \hdotsfor{3} \\ \hdotsfor{3} \\ \end{bmatrix} \times \begin{bmatrix} 10 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \\ \end{bmatrix} = \begin{bmatrix} 10 & 0 & 1 \\ \hdotsfor{3} \\ \hdotsfor{3} \\ \end{bmatrix} \end{aligned}
- QUESTION The diagrams are also unclear. Is each row supposed to sum to one?!
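A quick check of the example above, filling in the unspecified rows of \(\mathcal{N}_2\) arbitrarily: with per-layer row normalization the first row comes out as [0.5, 0, 0.5], i.e. \(input_1\) and \(input_3\) contribute equally, whereas without it \(input_1\) looks ten times more important.

```python
import numpy as np

N1 = np.array([[10., 0., 0.],
               [ 1., 0., 0.],
               [ 0., 0., 1.]])
N2 = np.array([[ 1., 0., 1.],
               [ 1., 1., 0.],    # the note left these two rows unspecified;
               [ 0., 1., 1.]])   # arbitrary values, they don't affect the first row

def rownorm(m):
    return m / m.sum(axis=-1, keepdims=True)

print((N2 @ N1)[0])                    # [10.  0.  1.]
print((rownorm(N2) @ rownorm(N1))[0])  # [0.5  0.  0.5]
```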
- Figure 3: Spearman’s rank correlation of aggregated attribution scores with saliency scores across layers. The 99% confidence intervals are shown as (narrow) shaded areas around each line.
- QUESTION Why are some of the coefficients negative? Does this mean that more raw attention paid was actually negatively correlated with impact on the final output?!
- QUESTION We are using the saliency scores just as a way to check our work, correct? They are not the actual ground truth, are they?
- If they are, why are we bothering to calculate GlobEnc in the first place?
- @citations/76 @2020/October The elephant in the interpretability room: Why use attention as explanation when we have saliency methods?
- @DrPilehvar
- The gradient-based Hidden Token Attributions (HTAs) are very expensive to compute. GlobEnc, in contrast, is cheap and can show us the attribution across all the layers; an ordinary gradient w.r.t. the output only gives us the final layer, not the intermediate ones.
- Gradient-based methods require us to focus on the effect of the inputs on a single continuous output, so they are a good fit when we have a classifier.
- Because GlobEnc skips the feed-forward networks (one of its limitations), it can help us isolate the effects of the different parts of the model. For example, our experiments show that GlobEnc becomes mostly fixed after some initial fine-tuning (around one epoch if the dataset is not too big), which suggests that the later epochs are mostly adjusting those feed-forward networks.
- QUESTION I did not understand what use these HTAs were, and how exactly GlobEnc was “better” than them.
Using a newly proposed and improved version of Hidden Token Attribution, we demonstrated that encoder-based attribution analysis is more accurate when compared to other partial solutions in a single layer (local-level). This is consistent with our global observations.
- QUESTION Which weights are these?
They can’t be the bias weights (those are not incorporated in GlobEnc). So are they the element-wise standard deviations of the result?