Summary
- Working in a group of two students, we analyzed a baseline machine translation model for German-to-English translation. We both studied the code and evaluated the model's performance using the BLEU score and other metrics (a small BLEU sketch follows this list).
- To improve performance, we then implemented the lexical attention model described by Nguyen and Chiang (2017).
- Furthermore, we analyzed a basic implementation of the Transformer architecture and implemented the multi-head attention mechanism following Vaswani et al. (2017); a sketch of that mechanism also follows this list. All of this work was done in PyTorch.
- We also spent a good amount of time analyzing and optimizing the training data.
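
To illustrate the BLEU evaluation mentioned above: this summary does not state which tooling we used, so the snippet below is only a minimal sketch that assumes the sacrebleu package and made-up example sentences.

```python
# Minimal sketch of corpus-level BLEU evaluation.
# The tooling is assumed (sacrebleu); the sentences are invented examples.
import sacrebleu

hypotheses = [
    "the cat sat on the mat",
    "there is a book on the table",
]
references = [
    "the cat sat on the mat",
    "a book lies on the table",
]

# corpus_bleu expects a list of hypotheses and a list of reference lists
# (the extra nesting allows multiple reference sets per sentence).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")
```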
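
The following is a minimal PyTorch sketch of multi-head attention in the spirit of Vaswani et al. (2017). It is not our project code; the module structure, dimensions, and variable names are illustrative assumptions.

```python
# Sketch of multi-head attention (scaled dot-product attention over several heads).
# Illustrative only; not the implementation from the project.
import math
import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_k = d_model // num_heads      # per-head dimension
        self.num_heads = num_heads
        # Learned projections for queries, keys, values and the output.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)

        def split_heads(x, proj):
            # (batch, seq, d_model) -> (batch, heads, seq, d_k)
            return proj(x).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        q = split_heads(query, self.w_q)
        k = split_heads(key, self.w_k)
        v = split_heads(value, self.w_v)

        # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = torch.softmax(scores, dim=-1)
        context = torch.matmul(attn, v)

        # Concatenate heads and project back to d_model.
        context = context.transpose(1, 2).contiguous().view(batch_size, -1, self.num_heads * self.d_k)
        return self.w_o(context)


# Example usage (self-attention): out = MultiHeadAttention(512, 8)(x, x, x)
# for x of shape (batch, seq, 512).
```
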
Main learnings
- I learned to turn machine-learning papers into working implementations. This requires understanding a paper in detail and translating its formulas and diagrams into code.
- In addition, I learned to explore and extend an existing codebase.
- Furthermore, I gained a fairly deep understanding of Transformers and multi-head attention. Additionally, I learned about related topics such as beam search (a small sketch follows this list).
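
As an illustration of beam search, here is a minimal, framework-free sketch. The scoring function stands in for a real model's next-token log-probabilities; the vocabulary, beam size, and toy distribution are assumptions made for the example.

```python
# Minimal sketch of beam search decoding over a toy next-token distribution.
import math


def beam_search(step_log_probs, beam_size=3, max_len=5, eos=0):
    """step_log_probs(prefix) -> dict mapping next token to its log-probability."""
    beams = [([], 0.0)]              # (token sequence, accumulated log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, lp in step_log_probs(seq).items():
                candidates.append((seq + [tok], score + lp))
        # Keep only the beam_size highest-scoring partial hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            (finished if seq[-1] == eos else beams).append((seq, score))
        if not beams:
            break
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])


# Toy distribution: prefers token 1, then ends the sequence with the EOS token 0.
def toy_model(prefix):
    if len(prefix) >= 3:
        return {0: math.log(0.9), 1: math.log(0.1)}
    return {1: math.log(0.6), 2: math.log(0.3), 0: math.log(0.1)}


print(beam_search(toy_model))
```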