Wednesday, January 07, 2009
 Indexing Strategies For Lucene
The Basics | Stemming, Synonyms & Linguistic Control | Semantic Indexing | Bibliography
 

 Introducing Lucene

 The open source Apache Lucene project allows developers to create powerful search solutions.

Here are some insights for planning and carrying out a standard search solution.

Documents And Fields

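Indexing in Lucene is based on creating Documents, each of which is built from Fields. Search engines generally rely on an inverted file data structure to store the index: each term is mapped to the list of documents that contain it. This reduces the size of the index and speeds up searches, since a query looks up its terms directly instead of scanning every document.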
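To make the inverted file idea concrete, here is a minimal sketch in plain Python. It is not Lucene's API or on-disk format; the function names `build_inverted_index` and `search` and the sample documents are assumptions made up for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def search(index, term):
    """Look the term up directly; no document text is scanned at query time."""
    return index.get(term.lower(), [])

# Hypothetical sample documents, keyed by document id.
docs = {
    1: "Lucene builds an inverted index",
    2: "An index speeds up searches",
    3: "Lucene is open source",
}
index = build_inverted_index(docs)
print(search(index, "lucene"))  # [1, 3]
print(search(index, "index"))   # [1, 2]
```

Each term is stored once, however many documents contain it, which is roughly where the saving in size and lookup time comes from.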

 

However, Lucene is flexible in that it gives the indexer several choices.

  • Fields can be stored within the index or they can be left out. If the original documents are available, storing may not be needed.
  • Document term position vectors can be stored along with the ... position and with offsets.
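The two choices above can be sketched in plain Python. This is an illustration only, not how Lucene represents fields internally; `term_vector`, `index_field` and the `store` flag are hypothetical names, and tokens are assumed to be simple runs of word characters.

```python
import re

def term_vector(text):
    """Return {term: [(position, start_offset, end_offset), ...]} for one field."""
    vector = {}
    for position, match in enumerate(re.finditer(r"\w+", text)):
        term = match.group().lower()
        vector.setdefault(term, []).append((position, match.start(), match.end()))
    return vector

def index_field(text, store=False):
    """Hypothetical helper: always index the terms, keep the original text only if asked."""
    entry = {"terms": term_vector(text)}
    if store:
        entry["stored"] = text
    return entry

entry = index_field("to be or not to be", store=False)
print(entry["terms"]["be"])  # [(1, 3, 5), (5, 16, 18)]
print("stored" in entry)     # False
```

The position is the token's ordinal within the field, while the offsets are character positions in the original text. Leaving the text unstored keeps the entry smaller, at the cost of having to fetch the original document from elsewhere.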

 

One Field Or Many

There is some advantage in creating a single full-text field.

  1. It can simplify indexing.
  2. It can be used during highlighting.
  3. Its size can be capped.

Adding multiple fields can make the index look more like a structured data file.
If information beyond pure text exists in the document, placing it into fields can enable more advanced search solutions.
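The trade-off between one full-text field and several fields can be sketched as follows. This is plain Python rather than Lucene code; the field names (`title`, `author`, `body`, `fulltext`), the helper names and the cap of six terms are all assumptions chosen for illustration.

```python
from collections import defaultdict

MAX_FULLTEXT_TERMS = 6  # assumed cap, chosen only for illustration

def make_document(title, author, body):
    """Build a document with separate fields plus one capped catch-all field."""
    words = " ".join([title, author, body]).split()
    fulltext = " ".join(words[:MAX_FULLTEXT_TERMS])
    return {"title": title, "author": author, "body": body, "fulltext": fulltext}

def build_field_index(docs):
    """Index every field separately: (field, term) -> set of document ids."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for field, text in doc.items():
            for term in text.lower().split():
                index[(field, term)].add(doc_id)
    return index

# Hypothetical sample documents.
docs = {
    1: make_document("Indexing Basics", "Smith", "How an index is built"),
    2: make_document("Smith Charts", "Jones", "Plotting impedance"),
}
index = build_field_index(docs)

print(sorted(index[("fulltext", "smith")]))  # [1, 2]
print(sorted(index[("author", "smith")]))    # [1]
```

Searching the catch-all field finds "smith" anywhere in either document, while the `author` field can answer the narrower question of who wrote it. That narrower question is the kind of search the extra fields make possible.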

Stop Words 

In every language some words appear more frequently than others. Examples in English are the words And, Or, and The. A list of such words is called a stop word list. Removing the stop words has the advantage of reducing index size and speeding up searches. A more subtle advantage is the improvement in accuracy: this can be understood when one considers stop words as noise within the document's information. In standard IR practice they offer few advantages, so they are stripped at indexing time and are also removed from user queries.

Some words, like "The", are far more common than others, and excluding stop words is common practice among major search engines. Lucene allows stop words to be removed at indexing time as well as at search time.
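A minimal sketch of stop word removal at both indexing and query time, in plain Python. The stop list holds only the three example words named above, and `analyze`, `build_index` and `search` are hypothetical helpers, not Lucene's analyzer classes.

```python
STOP_WORDS = {"and", "or", "the"}  # the examples from the text; real lists are longer

def analyze(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]

def build_index(docs):
    """Index only the terms that survive analysis."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every non-stop-word query term."""
    terms = analyze(query)
    if not terms:
        return set()
    return set.intersection(*(index.get(t, set()) for t in terms))

# Hypothetical sample documents.
docs = {1: "The cat and the hat", 2: "The dog or the cat"}
index = build_index(docs)
print(sorted(index))                     # ['cat', 'dog', 'hat']
print(sorted(search(index, "the cat")))  # [1, 2]
print(sorted(search(index, "the hat")))  # [1]
```

Running the same `analyze` step over documents and queries matters: if stop words were stripped from the index but left in the query, a query such as "the hat" would look for a term that was never indexed and match nothing.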

 

