Saturday, August 30, 2008

Configuring Apache Nutch

Configuring Apache Nutch to search local files and online files.
(please add comments if you found this post useful)

Nutch is an open-source search engine that we can configure to index and search our internal documents. We can also use it to search the web.
For full documentation, visit:
http://lucene.apache.org/nutch/



Below is the list of software and configuration steps needed to run Nutch:

Software


You can install this software in any directory you want. For the sake of simplicity, I have mentioned the directories I used.



Step 1: Install Nutch

Unzip Nutch to C:\nutch-0.9. If you are getting the source from SVN, check it out to C:\nutch-0.9.



Step 2: Install Java

Install Java (the JDK, which the Ant build in Step 8 needs) to C:\Program Files\



Step 3: Install Apache Tomcat

Install Tomcat and start it. Make sure that it responds at http://localhost:8080/ (or at your custom port, if you have changed the port number).



Step 4: Install Cygwin

Cygwin provides a Linux-like environment in which to run the Nutch commands.



Step 5: Set JAVA_HOME and update the PATH

Set the JAVA_HOME environment variable [e.g. C:\Program Files\Java\jdk1.6.0_05].
Add %JAVA_HOME%\bin to the PATH environment variable.



Step 6: Install Apache ant

Install Apache Ant, e.g. to C:\apache-ant-1.7.0\



Step 7: Add Ant to the PATH

Add Ant's bin directory to the PATH environment variable [e.g. C:\apache-ant-1.7.0\bin].
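
Before building, it can help to confirm from the Cygwin console that the settings from Steps 5 to 7 took effect. The following is only a small sanity-check sketch; check_tool is a hypothetical helper name introduced here, not something shipped with Nutch or Ant.

```shell
# check_tool is a hypothetical helper: it reports whether a command
# can be found on the PATH of the current shell.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: NOT found"
  fi
}

# The build in Step 8 needs both of these on the PATH.
check_tool java
check_tool ant

# Show JAVA_HOME, or a placeholder when it is not set.
echo "JAVA_HOME=${JAVA_HOME:-<not set>}"
```

If either tool is reported as NOT found, re-open the console after changing the environment variables, since a running console does not pick up new values.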



Step 8: Build the project

From the command prompt:
cd C:\nutch-0.9
ant
ant war



Step 9: Testing Current Environment

Open the Cygwin console and run:
cd c:/
cd nutch-0.9/bin/
./nutch
The output will be:

    Usage: nutch COMMAND
    where COMMAND is one of:
      crawl             one-step crawler for intranets
      readdb            read / dump crawl db
      convdb            convert crawl db from pre-0.9 format
      mergedb           merge crawldb-s, with optional filtering
      readlinkdb        read / dump link db
      inject            inject new urls into the database
      generate          generate new segments to fetch from crawl db
      freegen           generate new segments to fetch from text files
      fetch             fetch a segment's pages
      fetch2            fetch a segment's pages using Fetcher2 implementation
      parse             parse a segment's pages
      readseg           read / dump segment data
      mergesegs         merge several segments, with optional filtering and slicing
      updatedb          update crawl db from segments after fetching
      invertlinks       create a linkdb from parsed segments
      mergelinkdb       merge linkdb-s, with optional filtering
      index             run the indexer on parsed segments and linkdb
      merge             merge several segment indexes
      dedup             remove duplicates from a set of segment indexes
      plugin            load a plugin and run one of its classes main()
      server            run a search server
     or
      CLASSNAME         run the class named CLASSNAME
    Most commands print help when invoked w/o parameters.



Step 10: Create the urls directory

Now create a directory called ‘urls’ inside C:\nutch-0.9.



Step 11: Create a seed file listing the URLs to crawl

Inside the urls directory, create a file with any name; I have named mine source.txt. Enter the sites to be crawled, one URL per line.
For example:
file:///c:/MySearch/samplefiles/
http://www.apache.org/
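
Steps 10 and 11 can also be done from the Cygwin console. The sketch below assumes you run it from the Nutch directory (for example after cd c:/nutch-0.9); NUTCH_HOME is a hypothetical variable used only in this snippet and falls back to the current directory when it is not set.

```shell
# Assumption: NUTCH_HOME is a hypothetical variable pointing at the Nutch
# directory; it falls back to the current directory when unset.
NUTCH_HOME="${NUTCH_HOME:-.}"

# Step 10: create the seed directory.
mkdir -p "$NUTCH_HOME/urls"

# Step 11: write the seed file, one URL per line.
cat > "$NUTCH_HOME/urls/source.txt" <<'EOF'
file:///c:/MySearch/samplefiles/
http://www.apache.org/
EOF

# Show what the crawler will read.
cat "$NUTCH_HOME/urls/source.txt"
```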



Step 12: Edit conf/crawl-urlfilter.txt

# skip ftp: and mailto: urls (file: is not skipped, because we crawl local files)
-^(ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept hosts in MY.DOMAIN.NAME
+^file://c:/MySearch/samplefiles/*
+^http://([a-z0-9]*\.)*apache.org/

# Accept everything else
+.
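
To get a feel for how these rules behave, here is a rough sketch that replays them in the Cygwin console. As far as I understand the regex URL filter, the rules are tried from top to bottom, the first pattern found in the URL decides (+ accepts, - rejects), and a URL that matches no rule is rejected. check_url is a hypothetical helper, and grep -E only approximates Java regular expressions, so treat the result as a hint rather than as what Nutch will do in every case. The back-reference rule is left out to keep the patterns portable.

```shell
# Rules in file order: "+" accepts, "-" rejects.
# The loop-breaking back-reference rule is intentionally omitted.
rules='-^(ftp|mailto):
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
-[?*!@=]
+^file://c:/MySearch/samplefiles/*
+^http://([a-z0-9]*\.)*apache.org/
+.'

# check_url is a hypothetical helper: it prints "accept" or "reject"
# for one URL, using the first rule whose pattern is found in the URL.
check_url() {
  url=$1
  while IFS= read -r rule; do
    pattern=${rule#?}   # the rule without its leading + or -
    if printf '%s\n' "$url" | grep -Eq -- "$pattern"; then
      case $rule in
        +*) echo accept ;;
        *)  echo reject ;;
      esac
      return 0
    fi
  done <<EOF
$rules
EOF
  echo reject           # no rule matched
}

check_url 'http://www.apache.org/'
check_url 'http://www.apache.org/images/logo.gif'
```

Note that the final +. rule accepts every URL that was not rejected earlier, so the crawl is not limited to the two accepted locations.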



Step 13: Edit conf/nutch-site.xml

Make sure to edit at least the following entries:
searcher.dir--> the directory that holds Nutch’s crawl data and index. This is the crawl directory created in Step 14.
plugin.includes--> the plugins Nutch should load (protocols, parsers, indexing and query plugins)
file.content.limit--> Set it to -1 so that file content is not truncated
http.agent.name--> Give your search agent a name

E.g.:

<property>

<name>searcher.dir</name>

<value>C:\nutch-0.9\crawl</value>

</property>

<property>

<name>plugin.includes</name>

<value>protocol-file|protocol-httpclient|protocol-http|urlfilter-regex|parse-(text|html|js|msword|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>

</property>

<property>

<name>file.content.limit</name>

<value>-1</value>

</property>

<property>

<name>http.agent.name</name>

<value>MySearch</value>

<description>My Search Engine </description>

</property>



Step 14: Run the crawl command

Once all the steps above are done, it is time to run the crawler.
The most common options of the crawl command are:
-dir dir names the directory to put the crawl in.
-threads threads determines the number of threads that will fetch in parallel.
-depth depth indicates the link depth from the root page that should be crawled.
-topN N determines the maximum number of pages that will be retrieved at each level up to the depth.

You can now run the crawl command from the Cygwin console. Run it from C:\nutch-0.9 so that the relative paths urls and crawl resolve to C:\nutch-0.9\urls and C:\nutch-0.9\crawl.
Open Cygwin:
cd c:/
cd nutch-0.9

bin/nutch crawl urls -dir crawl -depth 3 -topN 50



Step 15: Copy the war file to Tomcat

Copy nutch-0.9.war from C:\nutch-0.9\build to Tomcat’s webapps directory.
Restart Tomcat so that the war file is deployed.



Step 16: Configure nutch-site.xml

Open C:\tomcat\webapps\nutch-0.9\WEB-INF\classes\nutch-site.xml and make sure that searcher.dir points to the crawl directory (the directory you passed to -dir in the crawl command).

<property>

<name>searcher.dir</name>

<value>C:\nutch-0.9\crawl</value>

</property>

Restart tomcat.
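
If the search page later returns no hits, a wrong searcher.dir in the deployed file is a likely cause. The sketch below prints the value currently configured; searcher_dir is a hypothetical helper, and it assumes that the name and value elements sit on separate lines, as in the listings in this post.

```shell
# searcher_dir is a hypothetical helper: it prints the <value> that
# follows <name>searcher.dir</name> in the given nutch-site.xml.
# Assumption: <name> and <value> are on separate lines.
searcher_dir() {
  awk '
    /<name>searcher\.dir<\/name>/ { found = 1; next }
    found && /<value>/ {
      sub(/.*<value>/, "")      # drop everything up to the opening tag
      sub(/<\/value>.*/, "")    # drop the closing tag and anything after it
      print
      exit
    }
  ' "$1"
}

# Usage: pass the path of the deployed file, for example
#   searcher_dir path/to/webapps/nutch-0.9/WEB-INF/classes/nutch-site.xml
```

The printed value should be the same directory that the crawl in Step 14 wrote to.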


Step 17: Access Nutch search

Open a browser and go to http://localhost:8080/nutch-0.9
Enter your search string.

