Saturday, August 30, 2008

Configuring Apache Nutch

Configuring Apache Nutch to search local files and online files.
(please add comments if you found this post useful)

Nutch is an open-source search tool that can be configured to search our internal documents as well as the web.
For the full documentation, visit:
http://lucene.apache.org/nutch/



Below is the list of software and configuration needed to run Nutch:

Software


You can install the software in any directory you want. For the sake of simplicity, I have mentioned the directories I used.



Step 1: Install Nutch

Unzip Nutch to C:\nutch-0.9. If you are getting the source from SVN, check it out to C:\nutch-0.9.



Step 2: Install Java

Install Java to C:\Program Files\



Step 3: Install Apache Tomcat

Install Tomcat and run it. Make sure that it is running at http://localhost:8080/, or on another port if you have configured a custom port number.



Step 4: Install Cygwin

This provides a Linux-like environment in which to run the commands.



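Step 5: Set JAVA_HOME and update the PATH

Set the JAVA_HOME environment variable [e.g. C:\Program Files\Java\jdk1.6.0_05].
Add %JAVA_HOME%\bin to the PATH environment variable. It is the PATH, not the Java classpath, that lets the shell find the java command.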



Step 6: Install Apache Ant

Install Apache Ant, e.g. to C:\apache-ant-1.7.0\



Step 7: Add Ant to the PATH

Add Ant's bin directory to the PATH [e.g. C:\apache-ant-1.7.0\bin]
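
Steps 5 and 7 only help if the shell can actually find the java and ant commands. A minimal sketch for checking this from Java follows; the class name PathCheck and the method findOnPath are made up for this example, and the file names java.exe and ant.bat assume a Windows installation.

```java
import java.io.File;

class PathCheck {
    // Returns the first file with one of the candidate names found in any
    // directory listed in pathValue, or null when none is found.
    static File findOnPath(String pathValue, String... candidateNames) {
        if (pathValue == null) {
            return null;
        }
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String name : candidateNames) {
                File candidate = new File(dir, name);
                if (candidate.isFile()) {
                    return candidate;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String path = System.getenv("PATH");
        System.out.println("JAVA_HOME = " + System.getenv("JAVA_HOME"));
        // A null result means the command was not found in any PATH directory.
        System.out.println("java: " + findOnPath(path, "java", "java.exe"));
        System.out.println("ant:  " + findOnPath(path, "ant", "ant.bat"));
    }
}
```

If either line prints null, recheck the environment variables and open a new console so that the changes are picked up.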



Step 8: Build the project

From command prompt:
cd C:\nutch-0.9
ant
ant war



Step 9: Testing Current Environment

Open the Cygwin console and run:
cd c:/
cd nutch-0.9/bin/
./nutch
The output will be:

    Usage: nutch COMMAND
    where COMMAND is one of:
      crawl             one-step crawler for intranets
      readdb            read / dump crawl db
      convdb            convert crawl db from pre-0.9 format
      mergedb           merge crawldb-s, with optional filtering
      readlinkdb        read / dump link db
      inject            inject new urls into the database
      generate          generate new segments to fetch from crawl db
      freegen           generate new segments to fetch from text files
      fetch             fetch a segment's pages
      fetch2            fetch a segment's pages using Fetcher2 implementation
      parse             parse a segment's pages
      readseg           read / dump segment data
      mergesegs         merge several segments, with optional filtering and slicing
      updatedb          update crawl db from segments after fetching
      invertlinks       create a linkdb from parsed segments
      mergelinkdb       merge linkdb-s, with optional filtering
      index             run the indexer on parsed segments and linkdb
      merge             merge several segment indexes
      dedup             remove duplicates from a set of segment indexes
      plugin            load a plugin and run one of its classes main()
      server            run a search server
     or
      CLASSNAME         run the class named CLASSNAME
    Most commands print help when invoked w/o parameters.



Step 10: Create the urls directory

Now create a directory called ‘urls’ inside C:\nutch-0.9



Step 11: Create a file listing the URLs to crawl

Create a file with any name inside the urls directory to hold the URLs to crawl. I have created a file named source.txt. Enter the sites to be crawled, one per line.
For example:
file:///c:/MySearch/samplefiles/
http://www.apache.org/
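
Steps 10 and 11 can also be done from code. The following is only a sketch: the class SeedFileWriter and its method are hypothetical names, and the Nutch home directory is the one used in this post.

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.UncheckedIOException;

class SeedFileWriter {
    // Creates <nutchHome>/urls if needed and writes one seed URL per line.
    static File writeSeedFile(File nutchHome, String fileName, String... seedUrls) {
        File urlsDir = new File(nutchHome, "urls");
        if (!urlsDir.isDirectory() && !urlsDir.mkdirs()) {
            throw new IllegalStateException("Cannot create " + urlsDir);
        }
        File seedFile = new File(urlsDir, fileName);
        try (BufferedWriter out = new BufferedWriter(new FileWriter(seedFile))) {
            for (String url : seedUrls) {
                out.write(url);
                out.newLine();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return seedFile;
    }

    public static void main(String[] args) {
        // Paths and URLs are the examples from this post.
        File seedFile = writeSeedFile(new File("C:\\nutch-0.9"), "source.txt",
                "file:///c:/MySearch/samplefiles/",
                "http://www.apache.org/");
        System.out.println("Wrote " + seedFile.getAbsolutePath());
    }
}
```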



Step 12: Edit conf/crawl-urlfilter.txt

# skip ftp: & mailto: urls (file: is not skipped, so that local files can be crawled)
-^(ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept hosts in MY.DOMAIN.NAME
+^file://c:/MySearch/samplefiles/*
+^http://([a-z0-9]*\.)*apache.org/

# Accept everything else
+.
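
To see how such a filter file behaves, here is a small sketch that imitates the rule evaluation. It assumes the behaviour described in the header of Nutch's regex filter file: rules are tried from top to bottom, the first pattern found in the URL decides (+ accepts, - rejects), and a URL that matches nothing is rejected. The class UrlFilterSketch is made up for this example and is not part of Nutch; the suffix rule is shortened.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class UrlFilterSketch {
    // One rule from the filter file: '+' accepts, '-' rejects.
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, String regex) {
            this.accept = accept;
            this.pattern = Pattern.compile(regex);
        }
    }

    private final List<Rule> rules = new ArrayList<Rule>();

    UrlFilterSketch(String... lines) {
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            rules.add(new Rule(line.charAt(0) == '+', line.substring(1)));
        }
    }

    // The first rule whose pattern is found in the URL decides.
    boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false; // no rule matched
    }

    public static void main(String[] args) {
        UrlFilterSketch filter = new UrlFilterSketch(
                "# skip ftp: & mailto: urls",
                "-^(ftp|mailto):",
                "-\\.(gif|GIF|jpg|JPG|png|PNG|zip|exe)$",
                "-[?*!@=]",
                "+^http://([a-z0-9]*\\.)*apache.org/",
                "+.");
        System.out.println(filter.accepts("http://www.apache.org/index.html")); // true
        System.out.println(filter.accepts("http://www.apache.org/logo.png"));   // false
        System.out.println(filter.accepts("mailto:someone@example.com"));       // false
    }
}
```

Because the last rule is +., anything not rejected by an earlier rule is accepted, so the order of the lines matters.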



Step 13: Edit conf/nutch-site.xml

Make sure to edit at least the following entries:
searcher.dir --> the directory that will hold Nutch’s database. All the indexing will be done in this folder.
plugin.includes --> the required plugins
file.content.limit --> set it to -1 so that file content is not truncated
http.agent.name --> give your search agent a name

Eg:

<property>

<name>searcher.dir</name>

<value>C:\nutch-0.9\crawl</value>

</property>

<property>

<name>plugin.includes</name>

<value>protocol-file|protocol-httpclient|protocol-http|urlfilter-regex|parse-(text|html|js|msword|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>

</property>

<property>

<name>file.content.limit</name>

<value>-1</value>

</property>

<property>

<name>http.agent.name</name>

<value>MySearch</value>

<description>My Search Engine </description>

</property>
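
In the real nutch-site.xml these property elements sit inside a single <configuration> root element. To double-check what an edited file contains, the name/value pairs can be listed with the JDK's DOM parser. This is only a sketch with a made-up class name (SiteConfigSketch); it does not use Nutch's own configuration classes, and the XML is inlined to keep the example self-contained.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class SiteConfigSketch {
    // Collects the <name>/<value> pair of every <property> element, in file order.
    static Map<String, String> readProperties(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> result = new LinkedHashMap<String, String>();
            NodeList properties = doc.getElementsByTagName("property");
            for (int i = 0; i < properties.getLength(); i++) {
                Element property = (Element) properties.item(i);
                String name = property.getElementsByTagName("name").item(0).getTextContent().trim();
                String value = property.getElementsByTagName("value").item(0).getTextContent().trim();
                result.put(name, value);
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a readable configuration: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        String xml = "<configuration>"
                + "<property><name>searcher.dir</name><value>C:\\nutch-0.9\\crawl</value></property>"
                + "<property><name>file.content.limit</name><value>-1</value></property>"
                + "<property><name>http.agent.name</name><value>MySearch</value>"
                + "<description>My Search Engine</description></property>"
                + "</configuration>";
        for (Map.Entry<String, String> entry : readProperties(xml).entrySet()) {
            System.out.println(entry.getKey() + " = " + entry.getValue());
        }
    }
}
```

A file that is not well-formed, for example one with a broken closing value tag, makes the parser fail here, which is a quick way to catch editing mistakes.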



Step 14: Run the crawl command

Once all the above steps are done, it is time to run the crawler.
The most common options of the crawl command are:
-dir dir names the directory to put the crawl in.
-threads threads determines the number of threads that will fetch in parallel.
-depth depth indicates the link depth from the root page that should be crawled.
-topN N determines the maximum number of pages that will be retrieved at each level up to the depth.

You can now run the crawl command from the Cygwin console. Run it from the Nutch root directory so that urls and crawl resolve to C:\nutch-0.9\urls and C:\nutch-0.9\crawl, the directory that searcher.dir points to. With -depth 3 and -topN 50, the crawl fetches at most 150 pages.
Open Cygwin:
cd c:/
cd nutch-0.9

bin/nutch crawl urls -dir crawl -depth 3 -topN 50



Step 15: Copy the war file to Tomcat

Copy nutch-0.9.war from C:\nutch-0.9\build to Tomcat’s webapps directory.
Restart Tomcat.



Step 16: Configure nutch-site.xml

Open C:\tomcat\webapps\nutch-0.9\WEB-INF\classes\nutch-site.xml and make sure that searcher.dir points to the crawl directory (the directory you passed to -dir in the crawl command).

<property>

<name>searcher.dir</name>

<value>C:\nutch-0.9\crawl</value>

</property>

Restart tomcat.


Step 17: Access Nutch search

Open a browser and go to http://localhost:8080/nutch-0.9
Enter the string to search for.



Tuesday, August 12, 2008

Convert StAXSource to StreamSource



import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.StringReader;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class SourceConvertor
{
    // Copies any Source (for example a StAXSource) into a temporary file with an
    // identity transform and returns a StreamSource that reads the file back.
    private static Source convertStaxToStream(Source request)
            throws TransformerException, IOException
    {
        TransformerFactory factory = TransformerFactory.newInstance();
        Transformer transformer = factory.newTransformer();
        File fp = new File("tempFile.txt");
        transformer.transform(request, new StreamResult(fp));
        // Failures propagate to the caller instead of producing a
        // StreamSource wrapped around a null stream.
        return new StreamSource(new FileInputStream(fp));
    }

    public static void main(String args[])
    {
        try
        {
            // The transformer needs well-formed XML as input; the <message>
            // wrapper element is only an illustrative name.
            String message = "<message>RaiGodOfSmallThings</message>";
            Source original = new StreamSource(new StringReader(message));
            Source converted = convertStaxToStream(original);

            // Print the converted source to verify the round trip
            TransformerFactory factory = TransformerFactory.newInstance();
            Transformer transformer = factory.newTransformer();
            transformer.transform(converted, new StreamResult(System.out));
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }
}
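
The program above feeds a StreamSource into the conversion method. As a variation, the sketch below starts from a real StAXSource and keeps everything in memory instead of going through a temporary file. The class and method names are made up for this example, and the sample XML is an assumption; the identity transform is the same idea as above.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Source;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class InMemorySourceConvertor {
    // Builds a StAXSource that reads the given XML text.
    static Source staxSourceOf(String xml) {
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            return new StAXSource(reader);
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot read XML", e);
        }
    }

    // Serializes any Source to a string with an identity transform.
    static String sourceToString(Source source) {
        try {
            StringWriter buffer = new StringWriter();
            TransformerFactory.newInstance().newTransformer()
                    .transform(source, new StreamResult(buffer));
            return buffer.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Transform failed", e);
        }
    }

    // Converts any Source into a StreamSource backed by an in-memory string.
    static Source toStreamSource(Source source) {
        return new StreamSource(new StringReader(sourceToString(source)));
    }

    public static void main(String[] args) {
        Source stax = staxSourceOf("<message>RaiGodOfSmallThings</message>");
        Source stream = toStreamSource(stax);
        System.out.println(sourceToString(stream));
    }
}
```

A StAXSource can only be read once, so convert it before anything else consumes the underlying reader.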

Saturday, August 2, 2008

Program to write to a file


package com.milestone.snippets;

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;

/**
 * @author AARYA
 *
 */
public class CodeSnippetTester
{
    /**
     * @param args
     */
    public static void main(String[] args)
    {
        // try-with-resources closes the writer even if write() fails
        try (BufferedWriter out = new BufferedWriter(new FileWriter("addressbook.txt")))
        {
            out.write("contact1");
        } catch (IOException e)
        {
            e.printStackTrace();
        }
    }
}

Program to read text from a file


package com.milestone.snippets;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

/**
 * @author AARYA
 *
 */
public class CodeSnippetTester
{
    /**
     * @param args
     */
    public static void main(String[] args)
    {
        try (BufferedReader in = new BufferedReader(new FileReader("addressbook.txt")))
        {
            String str;
            // Print each line as it is read; once the loop ends, str is always null
            while ((str = in.readLine()) != null)
            {
                System.out.println("Output : " + str);
            }
        } catch (IOException e)
        {
            e.printStackTrace();
        }
    }
}

Program to read text from console


package com.milestone.snippets;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

/**
 * @author AARYA
 *
 */
public class CodeSnippetTester
{
    /**
     * @param args
     */
    public static void main(String[] args)
    {
        String str = "";
        try
        {
            // Wrap System.in so that a whole line can be read at once
            BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
            str = in.readLine();
        } catch (IOException e)
        {
            e.printStackTrace();
        }
        System.out.println("The string you just entered is: " + str);
    }
}

Program to Delete a file

Java code to delete a file:


package com.milestone.snippets;

import java.io.File;

/**
 * @author AARYA
 *
 */
public class CodeSnippetTester
{
    /**
     * @param args
     */
    public static void main(String[] args)
    {
        // delete() returns false if the file is missing or cannot be removed
        boolean flag = (new File("addressbook.txt")).delete();
        if (!flag)
        {
            System.out.println("Unable to delete the file");
        }
    }
}

Program to create a new file

Source code to create a new file:


package com.milestone.snippets;

import java.io.File;
import java.io.IOException;

/**
 * @author AARYA
 *
 */
public class CodeSnippetTester
{
    /**
     * @param args
     */
    public static void main(String[] args)
    {
        try
        {
            File file = new File("addressbook.txt");
            // createNewFile() returns false if the file already exists
            boolean success = file.createNewFile();
            if (success)
            {
                System.out.println("New File Created");
            } else
            {
                System.out.println("File Already Exists");
            }
        } catch (IOException e)
        {
            // Do not swallow the error silently
            e.printStackTrace();
        }
    }
}

Program to check if a file exists or not

It is always a good idea to check whether a file exists before taking it as input. Find below a code snippet for this:


package com.milestone.snippets;

import java.io.File;

/**
 * @author AARYA
 *
 */
public class FileExistsOrNot
{
    /**
     * @param args
     */
    public static void main(String[] args)
    {
        boolean exists = (new File("inputfile.txt")).exists();
        if (exists)
        {
            System.out.println("The File Exists");
        } else
        {
            System.out.println("The File Doesn't Exist");
        }
    }
}
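
Putting the check to use, here is a small sketch that reads a file only when it exists and returns an empty list otherwise. The class SafeFileReader and its method name are made up for this example; isFile() is used instead of exists() so that a directory with the same name is not mistaken for an input file.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class SafeFileReader {
    // Returns all lines of the file, or an empty list when the file is missing.
    static List<String> readLinesIfExists(File file) {
        List<String> lines = new ArrayList<String>();
        if (!file.isFile()) {
            return lines;
        }
        try (BufferedReader in = new BufferedReader(new FileReader(file))) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> lines = readLinesIfExists(new File("inputfile.txt"));
        System.out.println("Read " + lines.size() + " line(s)");
    }
}
```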