Xapian and Chinese Indexing & Searching

Posted by Cofyc on April 27, 2008, 7:04 pm

Xapian is an Open Source Search Engine Library, released under the GPL. It's written in C++, with bindings to allow use from Perl, Python, PHP, Java, Tcl, C# and Ruby (so far!)

Xapian is a highly adaptable toolkit which allows developers to easily add advanced indexing and search facilities to their own applications. It supports the Probabilistic Information Retrieval model and also supports a rich set of boolean query operators.

If you're after a packaged search engine for your website, you should take a look at Omega: an application we supply built upon Xapian. Unlike most other website search solutions, Xapian's versatility allows you to extend Omega to meet your needs as they grow.

-- www.xapian.org

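An open-source search engine library, released under the GPL and written in C++. Bindings for Perl, Python, PHP, and other languages are available through SWIG.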

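I spent two days studying it, and it is quite interesting. Its approach to operation and indexing differs somewhat from Lucene's: Xapian is based on the probabilistic model, while Lucene is based on the vector space model. Out of the box, Xapian only supports languages such as English, Danish, French, and Spanish; it does not support Chinese, which is written in Chinese characters. There is still very little Chinese-language material on Xapian, and I have not yet figured out how to write my own stemmer. For now, the only option is to segment the Chinese text into words beforehand, separate the words with spaces, treat each one as if it were an English word, and index them with the English stemmer. Even so, this does achieve a working index.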
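The workaround described above can be sketched roughly as follows. This is a minimal illustration, not the method from any particular article: `segment`, `to_indexable_text`, and `index_texts` are hypothetical helper names, the forward-maximum-matching segmenter and its tiny vocabulary stand in for whatever real word segmenter you would use, and the indexing part assumes the Xapian Python bindings are installed (on older Xapian releases `flush()` takes the place of `commit()`).

```python
def segment(text, vocabulary, max_len=4):
    """Split text into tokens using forward maximum matching.

    `vocabulary` is a set of known Chinese words (an assumption for this
    sketch; a real segmenter would use a much larger dictionary).
    """
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1
            continue
        if ch.isascii():
            # Keep a run of ASCII characters together as one token.
            j = i
            while j < len(text) and text[j].isascii() and not text[j].isspace():
                j += 1
            tokens.append(text[i:j])
            i = j
            continue
        # Try the longest dictionary word first; fall back to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


def to_indexable_text(text, vocabulary):
    """Join the segmented words with spaces so they look like English words."""
    return " ".join(segment(text, vocabulary))


def index_texts(db_path, texts, vocabulary):
    """Index pre-segmented texts with the English stemmer."""
    import xapian  # imported lazily so the helpers above work without it

    db = xapian.WritableDatabase(db_path, xapian.DB_CREATE_OR_OPEN)
    indexer = xapian.TermGenerator()
    indexer.set_stemmer(xapian.Stem("english"))
    for text in texts:
        doc = xapian.Document()
        doc.set_data(text)  # store the original, unsegmented text
        indexer.set_document(doc)
        indexer.index_text(to_indexable_text(text, vocabulary))
        db.add_document(doc)
    db.commit()
```

At search time the query string would need to go through the same `to_indexable_text` step before it is parsed, so that the query terms line up with the indexed ones.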

Here is an article about using Xapian for Chinese indexing and searching: Chinese Xapian Indexing and Searching


Except where otherwise noted, content on this site is licensed under a Creative Commons Attribution 3.0 License