Document Collection

Choose a document collection by clicking the button "Choose" and selecting a directory. Soekia parses all HTML documents ending in ".htm" or ".html".

If you want to search all subdirectories for HTML files, please select the option "include subdirectories".
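The file selection described above can be sketched roughly as follows. This is only an illustration, not Soekia's actual code: the function name `find_html_files`, the parameter `include_subdirectories`, and the case-insensitive matching of the file extension are assumptions.

```python
from pathlib import Path

# Extensions named in the text; matching them case-insensitively is an assumption.
HTML_SUFFIXES = {".htm", ".html"}


def find_html_files(collection_dir, include_subdirectories=False):
    """Return the HTML files of a document collection, sorted by path.

    Hypothetical helper: with include_subdirectories=True the whole
    directory tree is searched, otherwise only the chosen directory itself.
    """
    root = Path(collection_dir)
    candidates = root.rglob("*") if include_subdirectories else root.iterdir()
    return sorted(
        p for p in candidates
        if p.is_file() and p.suffix.lower() in HTML_SUFFIXES
    )
```

Calling `find_html_files("docs", include_subdirectories=True)` would correspond to selecting the directory "docs" with the option "include subdirectories" switched on.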

The buttons:

[Help icon] Opens this help window.
[Info icon] Displays information about Soekia.

Index Location

The index is stored in a directory. You have to specify a directory on your hard disk. Click the button "Choose" to open a file chooser dialog. Soekia creates a subdirectory called "soekia-index" unless the specified directory has this name.
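The naming rule for the index directory can be illustrated with a short sketch. The function name `resolve_index_dir` is hypothetical; only the subdirectory name "soekia-index" comes from the text.

```python
from pathlib import Path

INDEX_DIR_NAME = "soekia-index"  # name taken from the text


def resolve_index_dir(chosen_dir):
    """Return the directory in which the index would be stored.

    Hypothetical helper: if the chosen directory is already called
    "soekia-index", it is used directly; otherwise a subdirectory
    of that name is used.
    """
    chosen = Path(chosen_dir)
    if chosen.name == INDEX_DIR_NAME:
        return chosen
    return chosen / INDEX_DIR_NAME
```

Under this reading, choosing either "data" or "data/soekia-index" leads to the same index location.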

Index Parameters

Language: You can choose between English and German. The selected language determines which stemming algorithm and which stop word list are used.
Stop words: Stop words are very frequent words that are left out of the index. A predefined list of the 50 most frequent words of the selected language is provided. Alternatively, you can specify your own list.
Stemming: For English, the well-known Porter stemming algorithm is used. For German, we developed a simple algorithm that cuts off the most frequent German endings as well as the prefixes ge-, ver- and un-.
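To give an impression of how stop word removal and a simple German stemmer of this kind could work together, here is a minimal sketch. It is not Soekia's implementation. The list of endings, the minimum stem length, the order of the steps, and all names are assumptions; only the prefixes ge-, ver- and un- are taken from the text.

```python
# Prefixes named in the text.
GERMAN_PREFIXES = ("ge", "ver", "un")
# Assumed example endings, longest first; the real list is not given in the text.
GERMAN_ENDINGS = ("ungen", "ung", "en", "er", "es", "e", "n", "s")


def simple_german_stem(word, min_stem=3):
    """Strip at most one prefix and one ending, keeping at least min_stem letters."""
    word = word.lower()
    for prefix in GERMAN_PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_stem:
            word = word[len(prefix):]
            break
    for ending in GERMAN_ENDINGS:
        if word.endswith(ending) and len(word) - len(ending) >= min_stem:
            word = word[:-len(ending)]
            break
    return word


def index_terms(tokens, stop_words, stem):
    """Drop stop words first, then stem the remaining tokens (assumed order)."""
    lowered = (token.lower() for token in tokens)
    return [stem(token) for token in lowered if token not in stop_words]


# "die" is filtered as a stop word; the other words are reduced to stems.
print(index_terms(["Die", "Kinder", "verkaufen"], {"die"}, simple_german_stem))
# prints ['kind', 'kauf']
```

Such a crude stemmer makes mistakes, for example on words that merely happen to begin with "ge" or "un", which is one reason why the choice of language and stemming matters for what ends up in the index.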

Index Creation

Clicking the button "Create index" builds the index for the specified document collection using the selected index parameters. If an index already exists, it is overwritten. If index creation takes a long time, a dialog with a progress bar and a cancel button appears. If you cancel the creation, the index is corrupt and has to be rebuilt before it can be used.
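Conceptually, an index of this kind maps each term to the documents that contain it. The following sketch shows one way to build such a structure in memory. It is a simplified assumption, not Soekia's index format: HTML parsing is omitted, the input is assumed to be plain text already, and the names `build_index`, `stop_words` and `stem` are hypothetical.

```python
from collections import defaultdict


def build_index(documents, stop_words=frozenset(), stem=lambda word: word):
    """Build a mapping: term -> {document name: number of occurrences}.

    documents maps a document name to its plain text (assumed input).
    """
    index = defaultdict(dict)
    for name, text in documents.items():
        for token in text.lower().split():
            if token in stop_words:
                continue  # stop words are not listed in the index
            term = stem(token)
            index[term][name] = index[term].get(name, 0) + 1
    return dict(index)


docs = {"a.html": "the cat and the dog", "b.html": "the dog"}
print(build_index(docs, stop_words={"the", "and"}))
# prints {'cat': {'a.html': 1}, 'dog': {'a.html': 1, 'b.html': 1}}
```

Changing the stop word list or the stemming function changes which terms appear as keys, which is why the index has to be rebuilt after the index parameters change.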

The button "Show index" opens a browser window displaying the index in a table. You can open several windows to compare different indices. Sometimes the program runs out of memory and cannot display the index.

Query

To start a query, type one or more search terms into the text field and click "Search". The order of the search terms has no effect on the result.

The ranking principles determine how the result is sorted. Soekia provides the following ranking principles:

  1. The more search terms a document contains, the more relevant it is.
  2. The more frequently a search term occurs in a document, the more relevant the document is.
  3. A document that contains rare search terms is more relevant than a document that contains frequent search terms.
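The three principles above can be made concrete with a small sketch. The scoring formulas are assumptions chosen for illustration; the text states only the principles, not how Soekia computes them. In particular, the logarithmic weight used for rare terms is a common textbook choice, and the names `score`, `rank`, "coverage", "frequency" and "rarity" are hypothetical.

```python
import math
from collections import Counter


def score(query, doc_terms, doc_freq, num_docs, principle):
    """Score one document under a single ranking principle."""
    counts = Counter(doc_terms)
    matched = [term for term in query if counts[term] > 0]
    if principle == "coverage":   # principle 1: number of search terms contained
        return float(len(matched))
    if principle == "frequency":  # principle 2: total occurrences of search terms
        return float(sum(counts[term] for term in matched))
    if principle == "rarity":     # principle 3: rare terms weigh more (assumed log weight)
        return sum(math.log(num_docs / doc_freq[term]) for term in matched)
    raise ValueError(f"unknown principle: {principle}")


def rank(query_terms, docs, principle):
    """Return the names of matching documents, most relevant first."""
    query = set(query_terms)  # the order of the search terms does not matter
    doc_freq = Counter(term for terms in docs.values() for term in set(terms))
    scored = [
        (score(query, terms, doc_freq, len(docs), principle), name)
        for name, terms in docs.items()
        if query & set(terms)
    ]
    return [name for _, name in sorted(scored, key=lambda item: (-item[0], item[1]))]


docs = {
    "a.html": ["search", "engine", "index"],
    "b.html": ["search", "search", "search"],
    "c.html": ["index", "cat"],
}
print(rank(["search", "index"], docs, "coverage"))   # ['a.html', 'b.html', 'c.html']
print(rank(["search", "index"], docs, "frequency"))  # ['b.html', 'a.html', 'c.html']
print(rank(["cat", "search"], docs, "rarity"))       # ['c.html', 'a.html', 'b.html']
```

The example shows that the same collection can be ordered differently depending on the principle: "a.html" wins when counting distinct search terms, "b.html" wins when counting occurrences, and "c.html" wins when the rare term "cat" is weighted more heavily than the common term "search".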