Apache Nutch

From TYO Lab Wiki

For beginners, Nutch 2.1 with MySQL support is recommended because it is easy to experiment with. Once you have more experience and MySQL no longer meets your needs, try Nutch 2.3.

2.3

Even basic data storage in this version depends on a backend such as HBase, which in turn requires ZooKeeper, so it is not easy to get Nutch running without trouble.

2.1

It has MySQL support and is easy to use and experiment with. However, you have to create the database and the table(s) for your crawlId before using it. The default database is nutch, and the default table is webpage.

Nutch Database

CREATE DATABASE nutch;

Crawler Table

  • Create the default table.
CREATE TABLE `webpage` (
  `id` VARCHAR(767) CHARACTER SET latin1 NOT NULL,
  `headers` BLOB,
  `text` text,
  `status` INT(11) DEFAULT NULL,
  `markers` BLOB,
  `parseStatus` BLOB,
  `modifiedTime` BIGINT(20) DEFAULT NULL,
  `score` FLOAT DEFAULT NULL,
  `typ` VARCHAR(32) CHARACTER SET latin1 DEFAULT NULL,
  `baseUrl` VARCHAR(512) CHARACTER SET latin1 DEFAULT NULL,
  `content` mediumblob,
  `title` VARCHAR(2048) DEFAULT NULL,
  `reprUrl` VARCHAR(512) CHARACTER SET latin1 DEFAULT NULL,
  `fetchInterval` INT(11) DEFAULT NULL,
  `prevFetchTime` BIGINT(20) DEFAULT NULL,
  `inlinks` mediumblob,
  `prevSignature` BLOB,
  `outlinks` mediumblob,
  `fetchTime` BIGINT(20) DEFAULT NULL,
  `retriesSinceFetch` INT(11) DEFAULT NULL,
  `protocolStatus` BLOB,
  `signature` BLOB,
  `metadata` BLOB,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
  • Or create the table with a crawlId prefix, in which case the table name is [crawler id]_webpage.
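
As a minimal sketch of the second option, assuming the nutch database and the default webpage table from the steps above already exist, MySQL can copy the table definition under the prefixed name. The crawl ID mycrawl is a hypothetical placeholder, not a name Nutch requires. If the default table does not exist, rerun the CREATE TABLE statement above with the table name changed instead.

```sql
-- Work inside the database created earlier.
USE nutch;

-- `mycrawl` is a hypothetical crawl ID; replace it with your own.
-- CREATE TABLE ... LIKE copies the column definitions and indexes
-- of `webpage`, but not its rows.
CREATE TABLE `mycrawl_webpage` LIKE `webpage`;

-- List the crawl tables to confirm that both now exist.
SHOW TABLES LIKE '%webpage';
```

The crawl would then presumably be run with the same ID so that Nutch reads and writes mycrawl_webpage rather than the default webpage table.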