Dual-Modal Transformer with Enhanced Inter- and Intra-Modality Interactions for Image Captioning