Ask HN: Databases, Scraping, and Copyright Question

8 points

16 years ago

Hope you guys can settle an argument I am having with another HNer. I'm going to make up a hypothetical question because I don't want to discuss the real one in detail.

Let's say I have a site that is all about politics and stocks. I list each stock and its board of directors, and site members can submit their political opinions about how they think the corporate officers are acting. In addition, I write a program to go out and find the Facebook, Twitter, etc. accounts for these officers, and I list them with the stocks.

Not wanting to do a lot of data entry, I find a database on an FTP site somewhere that has stocks and the corporate officers related to them. It is part of some huge system dedicated to doing something completely different, like a brokerage system or a system for investing. Although they provide the data openly, each dump carries a notice saying you can't use the information for commercial purposes (but you are free to use it for personal purposes and/or download, copy, and give it to friends). Their price for commercial use is $50K. For that amount of money, you might be able to use part of the data, depending on what your actual needs are (translated: we want to know how much we can get out of you, so we're not committing to anything).

Do I have to physically type in each stock and its directors? And then update it all the time? When there are multiple sources of publicly available information that give me the same thing? That sounds crazy.

I was speaking with another HNer just now. He said that if I downloaded the database tables I needed directly, that was bad. But if I crawled a site generated from the database and pulled the static information from each stock's page, that was different. I say that matters of fact -- which people are officers for which company -- cannot be copyrighted. It would be different if I were pulling down big hunks of the database and reusing/re-purposing them, but I'm only asking a simple question about information that is very widely distributed.

So who is more right here? If you were trying out an idea for a website, what would you do?

9 comments