Visual piano transcription (VPT) focuses on extracting a symbolic representation of a piano performance from visual information alone (e.g., a top-down video of the piano keyboard). In this work, we propose a Vision Transformer (ViT) based system for VPT that surpasses previous visual methods based on convolutional neural networks (CNNs). Our system is trained on a new dataset, R3, which contains 31 hours of synchronized video, MIDI, and audio recordings of piano performances. We additionally introduce an approach for predicting note offsets, which has not previously been explored in this context, and apply it both to our method and to the CNN-based methods. We show that our system outperforms the state of the art on the PianoYT dataset, and that training on our dataset, combined with the proposed offset prediction method, improves the performance of both our ViT-based system and the CNN-based methods.
As supplementary material, we provide a Google Colab notebook showcasing the inference procedure of our model, a YouTube playlist with example predictions across all datasets, and our codebase.