MaMMUT: A simple vision-encoder text-decoder architecture for multimodal tasks | Heykuki News