[Vertical Search Engine] A Simple Introduction to Vertical Search Engines (Java + Lucene)

Vertical Search Engine

I. Introduction to Vertical Search Engines

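A vertical search engine is one kind of search engine, a specialization and extension of general search. Put simply, it is a search engine restricted to one domain: for example, searching all of your own documents for relevant content, or searching your project files for documents that contain the word "test".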

II. Introduction to Lucene

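Lucene is an open-source, high-performance, scalable information retrieval library. It is implemented in Java, distributed as jar files, and used to manage a search engine's index. The latest release can be downloaded from the official Lucene website. This article uses an older release, version 4.6, which other users have also made available for download on CSDN.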

III. How Search Engines Work

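The first thing a full search engine such as Google or Baidu has to do is acquire information. For them, that means using crawlers to fetch most of the content on the web. How a crawler traverses the web, how it avoids fetching the same site twice, and similar questions are not covered here. When we use a search engine, it finds the information we need in an extremely short time, ranks it, and sends it back. Searching a huge database that quickly is obviously hard, and the reason a search engine can do it lies in the database's index.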

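The crawler breaks the fetched content down, analyzes it, and stores it in the database in the form of huge tables; this process is index building. The core data structure of a search engine is the inverted index, so named in contrast to the forward index. First a forward index stores the list of words for each document; then the inverted index is built from it, mapping each word to the ids of the documents that contain it.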
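To make the forward-to-inverted step concrete, here is a minimal sketch in plain Java. It is not how Lucene stores its index internally; the class name InvertedIndexSketch, the method names, and the three sample sentences are all made up for illustration, and tokenization is just lowercasing plus splitting on whitespace.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration class; not part of Lucene.
class InvertedIndexSketch {

    // Forward index: document id -> list of words in that document.
    static Map<Integer, List<String>> buildForwardIndex(List<String> docs) {
        Map<Integer, List<String>> forward = new LinkedHashMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            // Naive tokenization (an assumption): lowercase, split on whitespace.
            String[] words = docs.get(docId).toLowerCase().trim().split("\\s+");
            forward.put(docId, Arrays.asList(words));
        }
        return forward;
    }

    // Inverted index: word -> sorted set of ids of the documents containing it.
    static Map<String, Set<Integer>> invert(Map<Integer, List<String>> forward) {
        Map<String, Set<Integer>> inverted = new TreeMap<>();
        for (Map.Entry<Integer, List<String>> entry : forward.entrySet()) {
            for (String word : entry.getValue()) {
                inverted.computeIfAbsent(word, k -> new TreeSet<>()).add(entry.getKey());
            }
        }
        return inverted;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "lucene is a search library",
                "a search engine uses an inverted index",
                "lucene builds an inverted index");
        Map<String, Set<Integer>> inverted = invert(buildForwardIndex(docs));
        System.out.println("lucene -> " + inverted.get("lucene")); // lucene -> [0, 2]
        System.out.println("search -> " + inverted.get("search")); // search -> [0, 1]
    }
}
```

Looking up a word is then a single map access instead of a scan over every document, which is the reason the index makes search fast.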

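After the user enters a query, the search engine first tokenizes it, then removes stop words and does similar cleanup, and only then searches the index for all content that contains the query terms. The last step is to rank everything that matched. Ranking methods are fairly involved and quite interesting, but they are not covered here. This article implements building an inverted index and returning search results without any custom ranking (Lucene's default scoring is left untouched).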
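The query-side steps (tokenize, drop stop words, look up the remaining terms) can be sketched in plain Java as follows. This is only an illustration under assumptions: the class QuerySketch, the tiny stop-word list, and the hand-built index are invented for the example, and splitting on non-word characters only suits English text. Chinese text needs a real analyzer, which is what Lucene provides in the code later on.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration class; not part of Lucene.
class QuerySketch {

    // Assumed, deliberately tiny stop-word list.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "in"));

    // Steps 1 and 2: tokenize the query, then drop stop words.
    static List<String> analyze(String query) {
        List<String> terms = new ArrayList<>();
        for (String token : query.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    // Step 3: collect every document that contains at least one query term (no ranking).
    static Set<Integer> search(Map<String, Set<Integer>> index, String query) {
        Set<Integer> hits = new TreeSet<>();
        for (String term : analyze(query)) {
            hits.addAll(index.getOrDefault(term, Collections.<Integer>emptySet()));
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hand-built inverted index: word -> document ids.
        Map<String, Set<Integer>> index = new HashMap<>();
        index.put("lucene", new HashSet<>(Arrays.asList(0, 2)));
        index.put("index", new HashSet<>(Arrays.asList(1, 2)));

        System.out.println(analyze("The index of Lucene"));       // [index, lucene]
        System.out.println(search(index, "The index of Lucene")); // [0, 1, 2]
    }
}
```

The sketch returns the union of the matches; requiring every term to match would mean intersecting the sets instead.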

How a search engine works: [figure not preserved]

IV. Environment Setup

Since Lucene is Java-based, you first need a JDK installed and configured; that is not covered here.

After downloading Lucene 4.6, copy it into your own Java project. My IDE is Eclipse, where the relevant jars have to be imported into the project. The jars needed here are core/lucene-core-4.6.0.jar, analysis/lucene-analyzers-common-4.6.0.jar, and queryparser/lucene-queryparser-4.6.0.jar.

V. The Code

package Test1;

import java.io.File;  
import java.io.FileReader;

import org.apache.lucene.analysis.Analyzer;  
import org.apache.lucene.analysis.standard.StandardAnalyzer;  
import org.apache.lucene.document.Document;  
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;  
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;  
import org.apache.lucene.store.FSDirectory;  
import org.apache.lucene.util.Version;  
  
public class HelloLucene {  
    /**
     * Build the index.
     */
    public void index() {
        IndexWriter indexWriter = null;
        try {
            // 1. Create the Directory (the index is stored on disk under index/)
            Directory directory = FSDirectory.open(new File("index/"));

            // 2. Create the IndexWriter
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
            IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_46, analyzer);
            indexWriter = new IndexWriter(directory, indexWriterConfig);

            File dFile = new File("documents/");
            File[] files = dFile.listFiles();
            for (File file : files) {
                // 3. Create a Document object
                Document document = new Document();

                // 4. Add Fields to the Document
                // The third argument is a FieldType; the predefined types are static constants
                // on TextField, which is not easy to work out from the API docs
                document.add(new Field("content", new FileReader(file), TextField.TYPE_NOT_STORED));
                document.add(new Field("filename", file.getName(), TextField.TYPE_STORED));
                document.add(new Field("filepath", file.getAbsolutePath(), TextField.TYPE_STORED));

                // 5. Add the document to the index through the IndexWriter
                indexWriter.addDocument(document);
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {  
            try {  
                if (indexWriter != null) {  
                    indexWriter.close();  
                }  
            } catch (Exception e) {  
                e.printStackTrace();  
            }  
        }
    }
    
    // Search method; prints the file name and file path of each hit
    public void search(String con) {
        DirectoryReader directoryReader = null;
        try {
            // 1. Create the Directory
            Directory directory = FSDirectory.open(new File("index/"));
            // 2. Create the IndexReader
            directoryReader = DirectoryReader.open(directory);
            // 3. Create the IndexSearcher from the IndexReader
            IndexSearcher indexSearcher = new IndexSearcher(directoryReader);
            // 4. Create the Query to search with
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
            // Create a parser that decides which field is searched; the second argument is the default field
            QueryParser queryParser = new QueryParser(Version.LUCENE_46, "content", analyzer);

            // Create a Query for documents whose content field contains con
            Query query = queryParser.parse(con);

            // 5. Search with the searcher and get the TopDocs (only the top 1 hit is requested)
            TopDocs topDocs = indexSearcher.search(query, 1);

            // 6. Get the ScoreDoc objects from the TopDocs
            ScoreDoc[] scoreDocs = topDocs.scoreDocs;
            for (ScoreDoc scoreDoc : scoreDocs) {
                // 7. Fetch the actual Document from the reader using the ScoreDoc's document id
                Document document = directoryReader.document(scoreDoc.doc);
                // 8. Read the stored values from the Document
                System.out.println(document.get("filename") + " " + document.get("filepath"));
            }
        } catch (Exception e) {
            e.printStackTrace();  
        } finally {  
            try {  
                if (directoryReader != null) {  
                    directoryReader.close();  
                }  
            } catch (Exception e) {  
                e.printStackTrace();  
            }  
        }  
    }
}

package Test1;

public class LuceneTest {
	public static void main(String args[]){
		HelloLucene helloLucene = new HelloLucene();
		helloLucene.index();
		helloLucene.search("测试");
	}
}

Data contents:

1. test.txt

测试lucene

2. hahah.txt

测试2

3. pupup.txt

hahahaha
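If you would rather not create these three files by hand, a small helper along the following lines can generate them. This is only a convenience sketch: the class name SampleData and its write method are made up here and are not part of the original code. It writes with the platform default charset because the indexing code reads the files with FileReader, which uses that same default.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; not part of the original project.
class SampleData {

    // Writes the three sample files listed above into the given directory.
    static void write(Path dir) throws IOException {
        Map<String, String> files = new LinkedHashMap<>();
        files.put("test.txt", "测试lucene");
        files.put("hahah.txt", "测试2");
        files.put("pupup.txt", "hahahaha");

        Files.createDirectories(dir);
        for (Map.Entry<String, String> entry : files.entrySet()) {
            // Same charset that FileReader will use when the files are indexed.
            byte[] bytes = entry.getValue().getBytes(Charset.defaultCharset());
            Files.write(dir.resolve(entry.getKey()), bytes);
        }
    }

    public static void main(String[] args) throws IOException {
        // "documents" matches the directory that HelloLucene.index() reads from.
        write(Paths.get("documents"));
    }
}
```

Run it once from the project root before running LuceneTest.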

Run result: [screenshot not preserved]