Configuring Apache Nutch to search local files and online files.
(please add comments if you found this post useful)
Nutch is an open-source search tool that can be configured to search our own internal documents. It can also be used to search the web.
For full documentation, visit: http://lucene.apache.org/nutch/
Below is the list of software and configuration needed to run Nutch:
Software
- Nutch 0.9: http://www.apache.org/dyn/closer.cgi/lucene/nutch/ (If things don't work at some stage and you suspect a problem in the source, try downloading it from their svn: http://svn.apache.org/repos/asf/lucene/nutch/branches/branch-0.9/ )
- Java JDK 6: http://java.sun.com/javase/downloads/index.jsp
- Apache Tomcat web server 6: http://tomcat.apache.org/download-60.cgi
- Cygwin: http://www.cygwin.com/
- Apache Ant: http://ant.apache.org/bindownload.cgi
You can install this software in any directory you want. For the sake of simplicity, I have listed the directories I used.
Step 1: Install Nutch
Unzip Nutch to C:\nutch-0.9. If you are getting the source from svn, check it out to C:\nutch-0.9.
Step 2: Install Java
Install Java to C:\Program Files\
Step 3: Install Apache Tomcat
Install Tomcat and run it. Make sure that it is running at http://localhost:8080/ (or on another port, if you have configured a custom one).
Step 4: Install Cygwin
This gives us a Linux-like environment in which to run the commands.
Step 5: Set JAVA_HOME and update the PATH
Set the JAVA_HOME environment variable [Eg: C:\Program Files\Java\jdk1.6.0_05 ].
Add %JAVA_HOME%\bin to the PATH environment variable (the command search path, not the Java classpath).
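If you would rather set these for a single Cygwin session than through the Windows settings, a minimal sketch follows. It assumes Cygwin's default /cygdrive/c mapping of the C: drive and the example JDK directory from this step; adjust both to your machine.

```shell
# Assumption: the JDK is in the example directory from this step,
# seen through Cygwin's default /cygdrive mount of the C: drive.
export JAVA_HOME="/cygdrive/c/Program Files/Java/jdk1.6.0_05"

# Put the JDK tools (java, javac) on the command search path.
export PATH="$JAVA_HOME/bin:$PATH"

echo "JAVA_HOME=$JAVA_HOME"
```

These exports last only for the current console; add them to ~/.bashrc if you want them in every Cygwin session.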
Step 6: Install Apache ant
Install Apache Ant, eg: C:\apache-ant-1.7.0\
Step 7: Add Ant to the PATH
Add Ant's bin directory to the PATH environment variable [eg: C:\apache-ant-1.7.0\bin]
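Before building, it can help to confirm from a console that both tools are actually reachable. The check_tools function below is a hypothetical helper written for this post, not part of Nutch or Ant; it only relies on the standard shell built-in command -v.

```shell
# Hypothetical helper: report whether each named tool can be found on the PATH.
# Returns 0 if all tools were found, 1 if at least one is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# The two tools the build step needs.
check_tools java ant || echo "Fix the PATH before building."
```

If either tool is reported missing, open a new console after changing the environment variables, since running consoles keep the old values.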
Step 8: Build the project
From command prompt:
cd C:\nutch-0.9
ant
ant war
Step 9: Testing Current Environment
Open the cygwin console and:
cd c:/
cd nutch-0.9/bin/
./nutch
The output will be:
Usage: nutch COMMAND
where COMMAND is one of:
  crawl             one-step crawler for intranets
  readdb            read / dump crawl db
  convdb            convert crawl db from pre-0.9 format
  mergedb           merge crawldb-s, with optional filtering
  readlinkdb        read / dump link db
  inject            inject new urls into the database
  generate          generate new segments to fetch from crawl db
  freegen           generate new segments to fetch from text files
  fetch             fetch a segment's pages
  fetch2            fetch a segment's pages using Fetcher2 implementation
  parse             parse a segment's pages
  readseg           read / dump segment data
  mergesegs         merge several segments, with optional filtering and slicing
  updatedb          update crawl db from segments after fetching
  invertlinks       create a linkdb from parsed segments
  mergelinkdb       merge linkdb-s, with optional filtering
  index             run the indexer on parsed segments and linkdb
  merge             merge several segment indexes
  dedup             remove duplicates from a set of segment indexes
  plugin            load a plugin and run one of its classes main()
  server            run a search server
 or
  CLASSNAME         run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
Step 10: Create the urls directory
Now create a directory called ‘urls’ inside C:\nutch-0.9
Step 11: Create a file for the crawler to find the urls
Inside the urls directory, create a file (any name will do) listing the urls to crawl, one per line. I have created a file named source.txt.
For example:
file:///c:/MySearch/samplefiles/
http://www.apache.org/
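Steps 10 and 11 can also be done from the Cygwin console. This is a minimal sketch: NUTCH_HOME is a hypothetical variable introduced here for the install directory (for example /cygdrive/c/nutch-0.9), and it falls back to the current directory when unset.

```shell
# Assumption: NUTCH_HOME is a hypothetical variable naming the Nutch install
# directory; when it is not set, the current directory is used.
NUTCH_HOME="${NUTCH_HOME:-.}"

# Step 10: the directory that holds the seed list.
mkdir -p "$NUTCH_HOME/urls"

# Step 11: one seed url per line (a local folder and a web site).
cat > "$NUTCH_HOME/urls/source.txt" <<'EOF'
file:///c:/MySearch/samplefiles/
http://www.apache.org/
EOF

# Show what the crawler will be given.
cat "$NUTCH_HOME/urls/source.txt"
```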
Step 12: Edit conf/crawl-urlfilter.txt
# skip ftp: and mailto: urls (file: is no longer skipped, so that local files can be crawled)
-^(ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept the local sample folder and hosts in MY.DOMAIN.NAME
+^file:///c:/MySearch/samplefiles/
+^http://([a-z0-9]*\.)*apache.org/
# Accept everything else
+.
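The rules are read from top to bottom, and the first pattern that matches a url decides whether it is accepted (+) or skipped (-); a url that matches no rule is skipped. The sketch below imitates that behaviour with grep -E so you can try a few urls before crawling. It is only an approximation: Nutch evaluates these patterns with Java regular expressions, the suffix rule is shortened here, and filter_url is a hypothetical helper, not a Nutch command.

```shell
# Hypothetical helper: print "+" if the url would be accepted, "-" if skipped.
# Rules are tried in order; the first matching rule decides.
filter_url() {
  url="$1"
  while IFS= read -r rule; do
    sign=$(printf '%s' "$rule" | cut -c1)   # leading + or -
    pattern=${rule#?}                       # the regex after the sign
    if printf '%s\n' "$url" | grep -Eq -- "$pattern"; then
      echo "$sign"
      return
    fi
  done <<'EOF'
-^(ftp|mailto):
-\.(gif|jpg|png|zip|exe)$
-[?*!@=]
+^file:///c:/MySearch/samplefiles/
+^http://([a-z0-9]*\.)*apache\.org/
+.
EOF
  echo "-"   # no rule matched
}

filter_url "http://www.apache.org/foundation/"   # prints +
filter_url "http://www.apache.org/logo.png"      # prints -
filter_url "ftp://example.com/file.txt"          # prints -
filter_url "http://example.com/"                 # prints + (caught by the final +. rule)
```

Note the last case: because of the final +. rule, the crawl is not limited to the sample folder and apache.org. Remove that rule if you want to stay inside them.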
Step 13: Edit conf/nutch-site.xml
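Make sure to edit at least the following entries:
searcher.dir: the crawl directory, where Nutch keeps its database and index. The search web application reads from this folder.
plugin.includes: the plugins Nutch should load, written as a regular expression of plugin names. It must include protocol-file to crawl local files.
file.content.limit: set it to -1 so that local files are not truncated, whatever their size.
http.agent.name: give your search agent a name.
Eg (these property elements go inside the configuration element of the file):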
<property>
<name>searcher.dir</name>
<value>C:\nutch-0.9\crawl</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-file|protocol-httpclient|protocol-http|urlfilter-regex|parse-(text|html|js|msword|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
<name>file.content.limit</name>
<value>-1</value>
</property>
<property>
<name>http.agent.name</name>
<value>MySearch</value>
<description>My Search Engine </description>
</property>
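A quick way to confirm that none of the four entries was forgotten is to search the file for their names. This is a minimal sketch: missing_props is a hypothetical helper, and NUTCH_CONF is a hypothetical variable that defaults to conf/nutch-site.xml relative to the current directory. It checks only that each name is present, not that the value is sensible.

```shell
# Hypothetical helper: print every property name that does not appear
# as <name>...</name> in the given config file.
missing_props() {
  conf_file="$1"
  shift
  for prop in "$@"; do
    # -F: treat the name as a fixed string, so the dots are not wildcards.
    grep -qF "<name>$prop</name>" "$conf_file" || echo "$prop"
  done
}

# Assumption: run from the Nutch directory unless NUTCH_CONF says otherwise.
CONF="${NUTCH_CONF:-conf/nutch-site.xml}"
if [ -f "$CONF" ]; then
  missing_props "$CONF" searcher.dir plugin.includes file.content.limit http.agent.name
else
  echo "config file not found: $CONF"
fi
```

No output from missing_props means all four names were found.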
Step 14: Run the crawl command
Once all the above steps are done, it is time to run the crawler.
The most common options of the crawl command are:
-dir <dir> names the directory to put the crawl in.
-threads <threads> determines the number of threads that will fetch in parallel.
-depth <depth> indicates the link depth from the root page that should be crawled.
-topN <N> determines the maximum number of pages that will be retrieved at each level up to the depth.
You can now run the crawl command from the cygwin console.
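Open cygwin and run the crawl from the Nutch directory, so that the relative paths urls and crawl resolve to C:\nutch-0.9\urls and C:\nutch-0.9\crawl. Type the options with plain hyphens:
cd c:/nutch-0.9
bin/nutch crawl urls -dir crawl -depth 3 -topN 50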
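When the crawl finishes, the directory given with -dir should contain the crawl database, the link database, the fetched segments and the index. The sketch below checks for those sub-directories. The names crawldb, linkdb, segments, indexes and index are what I expect Nutch 0.9 to produce, so treat them as an assumption; missing_crawl_parts and CRAWL_DIR are hypothetical names used only for this check.

```shell
# Hypothetical helper: print each expected sub-directory that is absent
# from the crawl directory.
missing_crawl_parts() {
  crawl_dir="$1"
  # Assumption: these are the sub-directories a Nutch 0.9 crawl creates.
  for part in crawldb linkdb segments indexes index; do
    [ -d "$crawl_dir/$part" ] || echo "$part"
  done
}

# Assumption: CRAWL_DIR defaults to the "crawl" directory used with -dir.
CRAWL_DIR="${CRAWL_DIR:-crawl}"
missing=$(missing_crawl_parts "$CRAWL_DIR")
if [ -z "$missing" ]; then
  echo "crawl output looks complete"
else
  echo "missing in $CRAWL_DIR:" $missing
fi
```

If parts are missing, look through the crawl output on the console for fetch or parse errors before moving on to the web application.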
Step 15: Copy the war file to Tomcat
Copy nutch-0.9.war from C:\nutch-0.9\build to Tomcat's webapps directory.
Restart Tomcat so that it unpacks the war file.
Step 16: Configure nutch-site.xml
Open C:\tomcat\webapps\nutch-0.9\WEB-INF\classes\nutch-site.xml (adjust the path to your Tomcat install directory) and make sure that searcher.dir points to the crawl directory (the directory you gave with -dir in the crawl command):
<property>
<name>searcher.dir</name>
<value>C:\nutch-0.9\crawl</value>
</property>
Restart tomcat.
Step 17: Access Nutch search
Open a browser and go to http://localhost:8080/nutch-0.9
Enter the text you want to search for.