I have a decent amount of experience with ML and have previously worked with an image-to-text generator on the flowers and birds datasets.
However, this is a very different problem. For some context, New Yorker-style cartoons have a scene and a line below it that acts as the punchline. Taken on its own, the scene would suggest a different description from what the caption says.
For example, the image here is basically a deserted island, but the text below the scene talks about searching for someone: https://www.instagram.com/p/CeZSmJ6hr8z/?utm_source=ig_web_copy_link
I believe it would be better to focus on just the visual description of the image rather than the scene-based description, but I would love some input on that.
Also, would it be possible to take a larger text-to-image model and fine-tune it on a few images to get this style of output? Any resources on that would be appreciated!