Learning Lucene 3.0 (Part 1)

This content is licensed under CC 4.0 BY-SA.

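This article shares hands-on experience getting started with Lucene 3.0. By walking through the official demo in detail, it aims to help beginners quickly grasp Lucene's core principles and usage.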

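I recently started learning Lucene for a project. I first downloaded the resources from the official site (Lucene 3.0.1), then searched online for getting-started guides and study notes, hoping to get up to speed quickly on the experience of others. Most of those documents, however, are based on 2.x or even older versions and are not compatible with the new release (Lucene's backward compatibility does not seem to be very good). In the end, I gained an initial understanding of Lucene by analyzing and modifying the demo code that ships with it. This post summarizes what I learned. I hope it helps people who are new to Lucene, and I hope to exchange ideas with other Lucene enthusiasts so that we can improve together.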

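I. Running the Demo

The demo programs that ship with Lucene include three search-related files:

java org.apache.lucene.demo.IndexFiles

Purpose: builds an index over all files under a given directory and saves the result to the designated index directory.

Usage: pass the path of the directory to index as the only argument.

java org.apache.lucene.demo.SearchFiles

Purpose: in theory, returns search results for files such as txt, html, xls, doc, .js and .htm.

Usage: no arguments are required.

org.apache.lucene.demo.FileDocument

Purpose: decides which content of each document is indexed, and how.

Usage: called by IndexFiles.

To run the demo, first add the required jar files to the classpath, then run IndexFiles to build the index, then run SearchFiles to search. Results are returned as file paths, and several file formats such as txt, html, doc and xls are supported.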

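Further testing reveals the following problems with the demo:

1. Chinese is not supported: if the text contains Chinese, queries do not return correct results.

2. Searching by file name is not supported.

3. Results are not displayed as text snippets, the form commonly seen in Google, Baidu and similar search engines.

There are also further problems to solve:

4. How is the index kept in sync with the documents? In other words, how is incremental indexing handled?

5. How can more file types be supported?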
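To make problem 4 more concrete, here is a minimal sketch of how the files that need re-indexing could be detected. It assumes the index remembers each file's path and last-modified time (the demo's FileDocument, shown later, stores both). The class and method names are hypothetical, and the code uses only the JDK, not Lucene. It is one possible approach, not what the demo does.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: decides which files an incremental indexing run has to touch.
class IncrementalIndexPlanSketch {

    // Paths that are new on disk, or whose timestamp differs from the recorded one.
    static List<String> toReindex(Map<String, Long> indexed, Map<String, Long> onDisk) {
        List<String> result = new ArrayList<>();
        // TreeMap gives a stable, sorted iteration order.
        for (Map.Entry<String, Long> e : new TreeMap<>(onDisk).entrySet()) {
            Long recorded = indexed.get(e.getKey());
            if (recorded == null || !recorded.equals(e.getValue())) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    // Paths that are in the index but no longer exist on disk.
    static List<String> toDelete(Map<String, Long> indexed, Map<String, Long> onDisk) {
        List<String> result = new ArrayList<>();
        for (String path : new TreeMap<>(indexed).keySet()) {
            if (!onDisk.containsKey(path)) {
                result.add(path);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> indexed = new HashMap<>();
        indexed.put("a.txt", 100L);
        indexed.put("b.txt", 200L);
        indexed.put("c.txt", 300L);

        Map<String, Long> onDisk = new HashMap<>();
        onDisk.put("a.txt", 100L); // unchanged
        onDisk.put("b.txt", 250L); // modified
        onDisk.put("d.txt", 400L); // new

        System.out.println(toReindex(indexed, onDisk)); // [b.txt, d.txt]
        System.out.println(toDelete(indexed, onDisk));  // [c.txt]
    }
}
```

In a real application, the "indexed" map would have to be read back from the index itself and the resulting lists fed to the index writer. That part is deliberately left out here.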

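II. Basic Principles

To solve these problems, we first need to understand how Lucene works [2]:

As in the demo, Lucene's work is split into two parts: building the index and searching it. Building the index means running the given content through language processing to obtain a series of Terms, and then building a dictionary from them. (A Term here is the basic unit of an indexed document: for English, a word after case, singular/plural forms, tense and so on have been normalized; for Chinese, a sequence of meaningful words.) Searching the index means first analyzing the user's query, which likewise yields a series of Terms, then looking those Terms up in the index, scoring and sorting the matches by relevance, and finally producing the output.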
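The two phases can be illustrated with a toy inverted index. This is a minimal sketch in plain Java, not Lucene code: the class name is made up, "language processing" is reduced to lowercasing and splitting on non-alphanumeric characters, and there is no relevance scoring at all, whereas Lucene ranks its results.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index, for illustration only.
class InvertedIndexSketch {

    // The "dictionary": term -> sorted ids of the documents containing it.
    private final Map<String, TreeSet<Integer>> postings = new TreeMap<>();

    // Simplest possible analysis: lowercase, then split on anything
    // that is not a letter or a digit.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    // Indexing phase: analyze the text and record each term's document id.
    void addDocument(int docId, String text) {
        for (String term : analyze(text)) {
            TreeSet<Integer> docs = postings.get(term);
            if (docs == null) {
                docs = new TreeSet<>();
                postings.put(term, docs);
            }
            docs.add(docId);
        }
    }

    // Search phase: analyze the query the same way, then return the ids of
    // the documents that contain every query term.
    List<Integer> search(String query) {
        TreeSet<Integer> result = null;
        for (String term : analyze(query)) {
            TreeSet<Integer> docs = postings.get(term);
            if (docs == null) {
                return new ArrayList<>(); // one term matches nothing
            }
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new ArrayList<Integer>() : new ArrayList<>(result);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument(1, "Lucene builds an index");
        index.addDocument(2, "Searching the Lucene index is fast");
        index.addDocument(3, "Unrelated text");
        System.out.println(index.search("LUCENE index")); // [1, 2]
        System.out.println(index.search("fast"));         // [2]
        System.out.println(index.search("missing"));      // []
    }
}
```

The point to notice is that the query goes through the same analysis as the documents, which is why "LUCENE" matches "Lucene". The demo follows the same rule by using the same analyzer for indexing and for query parsing.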

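III. Source Code Analysis

The source is analyzed below in light of these principles (some of the original comments and code unrelated to searching are omitted):

1. org.apache.lucene.demo.IndexFiles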

import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IndexFiles {

  // Directory where the index is stored
  static final File INDEX_DIR = new File("index");

  /** Index all text files under a directory. */
  public static void main(String[] args) {
    final File docDir = new File(args[0]); // path of the documents to index
    try {
      // IndexWriter is what builds the index.
      // new StandardAnalyzer(Version.LUCENE_CURRENT) is one of the analyzers bundled with Lucene.
      IndexWriter writer = new IndexWriter(FSDirectory.open(INDEX_DIR), new StandardAnalyzer(Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.LIMITED); // ------------- ①
      System.out.println("Indexing to directory '" + INDEX_DIR + "'...");
      // Recursively index every file under the given path
      indexDocs(writer, docDir); // -------- ②
      System.out.println("Optimizing...");
      writer.optimize(); // optimize the index
      writer.close();
    } catch (IOException e) {
      System.out.println(" caught a " + e.getClass() +
          "\n with message: " + e.getMessage());
    }
  }

  // Recursively indexes each file
  static void indexDocs(IndexWriter writer, File file)
      throws IOException {
    if (file.canRead()) {
      if (file.isDirectory()) {
        String[] files = file.list();
        if (files != null) {
          for (int i = 0; i < files.length; i++) {
            indexDocs(writer, new File(file, files[i]));
          }
        }
      } else {
        System.out.println("adding " + file);
        try {
          // Add the file to the index
          writer.addDocument(FileDocument.Document(file)); // ------- ③
        } catch (FileNotFoundException fnfe) {
          // Skip files that cannot be opened
        }
      }
    }
  }
}

2. org.apache.lucene.demo.FileDocument

import java.io.File;
import java.io.FileReader;
import org.apache.lucene.document.DateTools;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class FileDocument {

  public static Document Document(File f)
      throws java.io.FileNotFoundException {
    Document doc = new Document();

    // Each field of the document is indexed separately here

    // Index the file path
    doc.add(new Field("path", f.getPath(), Field.Store.YES, Field.Index.NOT_ANALYZED)); // ------------- ④

    // Index the file's last-modified date
    doc.add(new Field("modified",
        DateTools.timeToString(f.lastModified(), DateTools.Resolution.MINUTE),
        Field.Store.YES, Field.Index.NOT_ANALYZED));

    // Note that FileReader expects the file to be in the system's default encoding.
    // If that's not the case searching for special characters will fail.

    // Index the file contents
    doc.add(new Field("contents", new FileReader(f))); // ---------- ⑤

    // return the document
    return doc;
  }

  private FileDocument() {}
}
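The comment above ⑤ points out that FileReader reads files in the system's default encoding. Since the Field constructor used at ⑤ takes a Reader, one way to avoid that dependency is to hand it a reader with an explicit charset. The sketch below uses only the JDK and assumes the files are UTF-8; the class name and the helper openUtf8 are hypothetical, and whether this alone fixes Chinese search has not been verified here, because the analyzer also matters.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reads files as UTF-8 regardless of the platform default.
class ExplicitCharsetReaderSketch {

    // Opens the file with an explicit charset (assumption: the file is UTF-8).
    static Reader openUtf8(File f) {
        try {
            return new BufferedReader(
                new InputStreamReader(new FileInputStream(f), StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes the text to a temporary UTF-8 file and reads it back through openUtf8.
    static String roundTrip(String text) {
        try {
            File tmp = File.createTempFile("charset-sketch", ".txt");
            tmp.deleteOnExit();
            try (Writer w = new OutputStreamWriter(new FileOutputStream(tmp), StandardCharsets.UTF_8)) {
                w.write(text);
            }
            StringBuilder sb = new StringBuilder();
            try (Reader r = openUtf8(tmp)) {
                int c;
                while ((c = r.read()) != -1) {
                    sb.append((char) c);
                }
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "\u4e2d\u6587" is the word for "Chinese", written with escapes so that
        // the source file's own encoding does not matter.
        String text = "Lucene \u4e2d\u6587";
        System.out.println(text.equals(roundTrip(text))); // true
    }
}
```

In FileDocument, the line at ⑤ would then become something like doc.add(new Field("contents", openUtf8(f))), assuming the helper is made available there.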

3. org.apache.lucene.demo.SearchFiles

public class SearchFiles {

  public static void main(String[] args) throws Exception {
    // No arguments are needed to run it
    String index = "index"; // directory where the index is stored
    String field = "contents"; // field to search: the contents here, but it could also be the path, the modification time, etc., as long as that field was indexed, as at ④
    String queries = null;
    int repeat = 0;
    boolean raw = false;
    String normsField = null;
    boolean paging = true;
    int hitsPerPage = 10;

    // Open the index (read-only)
    IndexReader reader = IndexReader.open(FSDirectory.open(new File(index)), true);
    Searcher searcher = new IndexSearcher(reader); // ---------- ⑥
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);

    BufferedReader in = null;
    if (queries != null) {
      in = new BufferedReader(new FileReader(queries));
    } else {
      in = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
    }

    // Query parser: parses the syntax of the search keywords
    QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, field, analyzer); // ---------------- ⑦

    while (true) {
      if (queries == null) // prompt the user
        System.out.println("Enter query: ");

      String line = in.readLine();
      if (line == null) // end of input
        break;

      line = line.trim();
      if (line.length() == 0)
        break;

      // Parse the search keywords
      Query query = parser.parse(line); // ----------------- ⑧
      System.out.println("Searching for: " + query.toString(field));

      if (repeat > 0) { // repeat & time as benchmark
        Date start = new Date();
        for (int i = 0; i < repeat; i++) {
          searcher.search(query, null, 100);
        }
        Date end = new Date();
        System.out.println("Time: " + (end.getTime() - start.getTime()) + "ms");
      }

      if (paging) {
        doPagingSearch(in, searcher, query, hitsPerPage, raw, queries == null);
      } else {
        doStreamingSearch(searcher, query);
      }
    }
    reader.close();
  }

  public static void doPagingSearch(BufferedReader in, Searcher searcher, Query query, int hitsPerPage, boolean raw, boolean interactive) throws IOException {

    // Most of this method implements the paged display of search results;
    // only the following few lines actually deal with searching.

    // Collector that holds the search results
    TopScoreDocCollector collector = TopScoreDocCollector.create(
        5 * hitsPerPage, false);

    // Run the search
    searcher.search(query, collector); // ------------- ⑨

    // The hits are stored in descending order of relevance score
    ScoreDoc[] hits = collector.topDocs().scoreDocs; // ----------- ⑩

    …….

    for (int i = start; i < end; i++) {

      // From hits you can obtain information about each document, such as its score, contents, path, and so on