Details: I am researching several computer programs, and would like a random sample of ~100 real-world files created by users of each application. For example, I'd like to collect a set of .pub (Publisher), .vpp (Visual Paradigm), .eap (Enterprise Architect), and .vsd (Visio) files.
Google and Bing index a few popular binary file formats, so I can do searches like this:
mySearchTerm filetype:pdf
mySearchTerm filetype:doc
mySearchTerm filetype:pptx
However, they don't support these lesser-known binary formats in the same way. As a result, searching for these other file extensions yields far fewer results. I suspect there are many more files with those extensions on the internet that simply are not turning up in my searches. Maybe the search engines ignore files with a proprietary MIME type.
Also, many of the results are not in the file format of interest; they are just regular web pages that happen to have that string at the end of the URL. For example, searching for filetype:eap returns www.facebook.com/rachel.eap, which is not what I am looking for in this context.
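One way to weed out such false positives after the search is to look at the first bytes of each candidate URL instead of trusting the extension. The sketch below assumes that binary .vsd and .pub files are OLE2 compound files, which start with the signature D0 CF 11 E0 A1 B1 1A E1. The .eap and .vpp formats would need their own signatures, which are not covered here. The function names and the 512-byte read size are illustrative choices.

```python
import urllib.request

# Signature that opens every OLE2 / Compound File Binary container.
# Assumption: binary .vsd and .pub files use this container.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
UTF8_BOM = b"\xef\xbb\xbf"
HTML_HINTS = (b"<!doctype", b"<html", b"<head")


def fetch_head(url: str, n: int = 512) -> bytes:
    """Download only the first n bytes of a URL (hypothetical helper)."""
    req = urllib.request.Request(url, headers={"Range": f"bytes=0-{n - 1}"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        # Servers may ignore the Range header, so cap the read as well.
        return resp.read(n)


def classify(head: bytes) -> str:
    """Return 'ole2', 'html', or 'unknown' for the leading bytes of a file."""
    if head.startswith(OLE2_MAGIC):
        return "ole2"
    if head.startswith(UTF8_BOM):
        head = head[len(UTF8_BOM):]
    start = head.lstrip()[:64].lower()
    if start.startswith(HTML_HINTS):
        return "html"
    return "unknown"
```

A result such as www.facebook.com/rachel.eap would come back as "html" and could be discarded, while classify(fetch_head(url)) == "ole2" marks a URL worth downloading in full. The signature only identifies the container, so a Word .doc would pass too. A further check on the content is still needed to tell the formats apart.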
My next idea was to search for a string that occurs in the binary file if I open it in a text editor. For example, Visio *.vsd files tend to have the string _VPID_ALTERNATENAMES buried in the binary. But that approach didn't work very well.
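Even if the string does not help as a search term, it may still be usable locally to confirm that a downloaded candidate is the right kind of file. The sketch below is a minimal check under one assumption: I don't know whether the marker is stored as single-byte text or as UTF-16, so it tries both encodings. The function name is hypothetical.

```python
def contains_marker(data: bytes, marker: str) -> bool:
    """Check whether a marker string occurs in raw file bytes.

    Assumption: the marker may be stored either as ASCII or as
    UTF-16LE, so both encodings are tried.
    """
    candidates = (marker.encode("ascii"), marker.encode("utf-16-le"))
    return any(c in data for c in candidates)
```

For a downloaded file, this could be called as contains_marker(open(path, "rb").read(), "_VPID_ALTERNATENAMES"). Since .vsd files only tend to contain that string, a miss does not prove the file is not a Visio document.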
Any ideas on how I can achieve the goal?