Scraping the web with Node.js
So what do we mean by scraping the web? Scraping consists of viewing a web page and pulling selected information out of it. I am fairly new to Node and have a slight fascination with web scraping, so I set out to find what I would need to build a simple web scraper. A short search brought back cheerio. Cheerio is billed as a “Fast, flexible, and lean implementation of core jQuery designed specifically for the server.” It can be used to load and parse HTML. First things first, you must install Cheerio: open your command prompt and type “npm install cheerio”. This will install the library.
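To get a feel for what Cheerio does before pointing it at a live page, here is a minimal sketch. The HTML string and the “fruit” class are made up purely for illustration; the point is that cheerio.load() parses a string of HTML and hands back a jQuery-like $ function you can query with selectors.
var cheerio = require("cheerio");

// A made-up HTML snippet, just to demonstrate loading and querying.
var html = "<ul><li class='fruit'>Apple</li><li class='fruit'>Pear</li></ul>";

// cheerio.load() parses the HTML and returns a jQuery-like function.
var $ = cheerio.load(html);

// Select every element with the "fruit" class and print its text.
$(".fruit").each(function() {
console.log($(this).text()); // Apple, then Pear
});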
But how do we get content for cheerio to process? The request library to the rescue. If you do not already have it, install it by typing “npm install request”.
var request = require("request");
var cheerio = require("cheerio");
This loads the two libraries and assigns them to their respective variables.
Now to retrieve the page. Retrieving a web page with Node is super simple:
request({
uri: "http://www.google.com",
}, function(error, response, body) {
console.log(body);
});
The code above contacts the website located in “uri” and loads its contents into the body variable. The body is then output to the log.
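In a real scraper you would want to check the callback’s error and response arguments before trusting body. Here is a minimal sketch of that; the status-code check is my own addition, not part of the original snippet:
request({
uri: "http://www.google.com",
}, function(error, response, body) {
// Bail out if the request itself failed (DNS error, timeout, etc.).
if (error) {
return console.error(error);
}
// Only use the body when the server answered with 200 OK.
if (response.statusCode === 200) {
console.log(body);
}
});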
Now the fun begins. Using cheerio we can start to do really cool things with the data that is returned.
For example, let's say we wanted to return all the links on a page:
request({
uri: "http://www.google.com",
}, function(error, response, body) {
// Load the returned HTML into cheerio to get a jQuery-like $.
var $ = cheerio.load(body);
// Loop over every anchor tag and print its text and href.
$("a").each(function() {
var link = $(this);
var href = link.attr("href");
console.log(link.text() + " - " + href);
});
});
This will spit out all the links in the supplied uri.
Cheerio uses most of the same syntax as jQuery, so it is fairly easy to pick up. As you can see, this makes it extremely simple to grab data from a web page using Node.
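To show how far the jQuery-style syntax goes, here is a quick sketch using a couple of other common patterns Cheerio supports, such as id selectors, find(), and first(). The HTML snippet and the “news”/“headline” names are made up for illustration:
var cheerio = require("cheerio");

// A made-up snippet to demonstrate a few more jQuery-style selectors.
var $ = cheerio.load(
"<div id='news'>" +
"<h2 class='headline'>First story</h2>" +
"<h2 class='headline'>Second story</h2>" +
"</div>"
);

// Select by id, then find child elements, just like jQuery.
$("#news").find(".headline").each(function() {
console.log($(this).text());
});

// first() and text() also work as you would expect.
console.log($(".headline").first().text()); // "First story"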
Next up: Saving data to RavenDB using Node!
-codedragon