Can interleaved cross-attention learn image-text correlations better than CLIP? | Heykuki News