Best JavaScript scraping infrastructure

11 years ago

Over the years, I've had many projects that involve Web scraping for public health or academic research purposes. These projects progressed from text retrieval to element retrieval to entity recognition and annotation, so I've learned a lot, but I'm frustrated that I've never found a way to carry work forward from one project to the next. My most recent attempt became SenseBase[1], which worked well for the most part and had some leading-edge features, but I want to work with something easier to support.

I'm facing a couple more scraping projects, which are not my full-time focus, so I'd like to find some loosely coupled yet cohesive components which will work for years.

I've settled on JavaScript as a language and ecosystem that is likely to be around for some time and is practical for Web scraping, particularly since it can be embedded in pages (and speaks the page's own language).

I'm strongly committed to open source and open data (I'd like to see all sites recognize the value of exposing their content as data).

Here's what I'd identify as components:

1. Shared schemas and functions to identify sites, their sections, and content elements.

2. A way to identify when a site has an API that relates to a page, so more precise data can be retrieved.

3. A robust and easy to maintain store for original text and annotations.

4. A programming-language-neutral way to interact with the store in item 3.
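
To make item 1 concrete, here is a rough sketch of what a shared schema and an identifying function might look like. Everything in it is an assumption for illustration: the field names (`hostPattern`, `pathPattern`, `elements`), the `identify` function, and the example.org site are all hypothetical. Patterns are kept as strings so the schema stays plain JSON, which also fits the language-neutral goal in item 4.

```javascript
// Hypothetical schema: every field name and value here is made up for illustration.
// Patterns are plain strings so the whole schema can be stored and shared as JSON.
const siteSchemas = [
  {
    id: 'example-journal',
    hostPattern: '(^|\\.)example\\.org$',
    sections: [
      {
        name: 'article',
        pathPattern: '^/articles/\\d+',
        // CSS selectors for the content elements of interest in this section.
        elements: { title: 'h1.title', abstract: 'section.abstract p' },
      },
      {
        name: 'listing',
        pathPattern: '^/issues/',
        elements: { links: 'ul.toc a' },
      },
    ],
  },
];

// Return the matching site and section for a URL, or null if no site matches.
// A known site with an unknown path yields section: null and no selectors.
function identify(url, schemas = siteSchemas) {
  const { hostname, pathname } = new URL(url);
  for (const site of schemas) {
    if (!new RegExp(site.hostPattern).test(hostname)) continue;
    const section = site.sections.find((s) => new RegExp(s.pathPattern).test(pathname));
    return {
      site: site.id,
      section: section ? section.name : null,
      elements: section ? section.elements : {},
    };
  }
  return null;
}
```

Calling `identify('https://www.example.org/articles/42')` would return the `article` section with its selectors, which a scraper could then pass to `document.querySelectorAll` in the page or to a server-side DOM library.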

I'm looking for something as lightweight as possible. A front end is a separate concern; the goal is schemas and a store that can be developed and interacted with over years.

Semantic Web solutions would handle item 3, but add so much weight that the project gets sidetracked. In the past I've used Elasticsearch, which worked fine, but I'm considering LevelDB, as projects built on it seem to support useful text and annotation capabilities. Using the Open Annotation spec seems reasonable with an appropriate library.
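
As a rough sketch of how original text and annotations might sit together in a key-value store, the snippet below uses a plain `Map` as a stand-in for LevelDB so that it runs without dependencies. The key layout (`doc!…`, `anno!…`) and all function names are hypothetical. The annotation shape follows my understanding of the W3C Web Annotation Data Model (the successor to Open Annotation), using a `TextPositionSelector`; treat it as an approximation of the spec, not a validated implementation.

```javascript
// In-memory stand-in for an ordered key-value store such as LevelDB.
// A Map is used only so this sketch runs on its own; values are JSON strings,
// so any language that can read the store can read the records.
const store = new Map();

// Hypothetical key layout: one key per document, one per annotation.
// Note: a URL that itself contains "!" could collide with this naive scheme.
const docKey = (url) => `doc!${url}`;
const annoKey = (url, n) => `anno!${url}!${String(n).padStart(8, '0')}`;

// Save the original text of a page, untouched, with a fetch timestamp.
function putDocument(url, text) {
  const record = { url, text, fetched: new Date().toISOString() };
  store.set(docKey(url), JSON.stringify(record));
}

// List all annotations for a document, in key order.
function annotationsFor(url) {
  const prefix = `anno!${url}!`;
  return [...store.entries()]
    .filter(([key]) => key.startsWith(prefix))
    .sort(([a], [b]) => (a < b ? -1 : 1))
    .map(([, value]) => JSON.parse(value));
}

// Annotate the first occurrence of `exact` in the stored text.
function annotate(url, exact, bodyText) {
  const raw = store.get(docKey(url));
  if (raw === undefined) throw new Error(`no document stored for ${url}`);
  const { text } = JSON.parse(raw);
  const start = text.indexOf(exact);
  if (start === -1) throw new Error(`"${exact}" not found in ${url}`);
  const annotation = {
    '@context': 'http://www.w3.org/ns/anno.jsonld',
    type: 'Annotation',
    body: { type: 'TextualBody', value: bodyText },
    target: {
      source: url,
      selector: { type: 'TextPositionSelector', start, end: start + exact.length },
    },
  };
  store.set(annoKey(url, annotationsFor(url).length), JSON.stringify(annotation));
  return annotation;
}
```

Keeping the original text and its annotations as separate JSON values means annotations can be added or discarded over the years without touching the source text, and the character offsets remain valid for as long as that stored text is unchanged.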

I know this is a tough general problem. Thanks for any comments.

1. https://github.com/vid/SenseBase