Developer Info
DocSearcher is a Java (Swing) application.
Download the source archive or use git clone from Codeberg.
Notes:
The main class is DocSearch (DocSearch.java).
The method that performs indexing is createNewIndex.
Other classes of interest are the wrapper objects for the various file types:
- WordProps ; for working with POI HDF API
- ExcelProps ; for working with POI HSSF API
- PdfToText ; for working with the PDFBox API
- RtfToText ; uses the javax.swing.text.rtf API
- OoToText ; uses the java.util.zip API to unzip StarOffice / OpenOffice documents and extract the content XML file
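The approach described for OoToText can be sketched as follows. This is a minimal illustration, not DocSearcher's actual code: the class name ContentXmlSketch and both method names are hypothetical, the entry name content.xml is assumed from the OpenOffice file layout, and the tag stripper is deliberately crude (it does not decode XML entities). It needs Java 9 or later for readAllBytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not part of DocSearcher.
class ContentXmlSketch {

    /** Opens the document as a ZIP archive and returns the raw text of its content.xml entry. */
    static String readContentXml(Path document) throws IOException {
        try (ZipFile zip = new ZipFile(document.toFile())) {
            ZipEntry entry = zip.getEntry("content.xml"); // assumed entry name
            if (entry == null) {
                throw new IOException("No content.xml in " + document);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
    }

    /** Replaces every XML tag with a space, then collapses runs of whitespace. */
    static String stripTags(String xml) {
        return xml.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ContentXmlSketch.java some-document.odt
        System.out.println(stripTags(readContentXml(Path.of(args[0]))));
    }
}
```

A real extractor would more likely parse the XML properly, so that entities and paragraph boundaries are preserved; the regular expression here only keeps the sketch short.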
DocSearcher creates and stores its indexes and all related files in the .docsearcher2 folder underneath the user's home directory. On a Linux system this might be
/home/<username>/.docsearcher2
and on a Windows system it might be something like:
C:\users\<username>\.docsearcher2
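To locate that folder from code, one option is to combine the standard user.home system property with the folder name. The sketch below assumes that the home directory is resolved this way; the class and method names are hypothetical, and DocSearcher itself may resolve the path differently.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of DocSearcher.
class IndexHomeSketch {

    /** Returns the .docsearcher2 folder underneath the given home directory. */
    static Path indexHome(String userHome) {
        return Paths.get(userHome, ".docsearcher2");
    }

    public static void main(String[] args) {
        // user.home is a standard Java system property.
        System.out.println(indexHome(System.getProperty("user.home")));
    }
}
```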
DocSearcher indexes are Lucene indexes with the following fields and types:
Field | Description | Indexing properties
author | taken from the document metadata | stored, tokenized, indexed
path | file handle | stored
mod_date | date the document was last modified | stored
title | title obtained from the metadata (if it exists), otherwise a grab of the first few lines or characters | stored, tokenized, indexed
summary | first few lines of text | stored, tokenized, indexed
body | text of the entire document (without metadata) | tokenized, indexed
URL | if the index is created as a "web" index, DocSearcher constructs a URL for each file | stored, tokenized, indexed
keywords | taken from the document metadata (if it exists); mostly relevant for indexed web page documents | stored, tokenized, indexed
size | size in bytes | stored
type | document suffix (htm, doc, pdf, etc.) | stored, tokenized, indexed
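The field table can also be written down as a small data structure, which makes it easy to see which fields can be read back from a search hit (the stored ones) and which are only searchable. The sketch below is plain Java and does not use the Lucene API; the names FieldSchemaSketch, Prop and DocField are hypothetical, and only the field names and properties come from the table.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of the field table; not DocSearcher or Lucene code.
class FieldSchemaSketch {

    enum Prop { STORED, TOKENIZED, INDEXED }

    enum DocField {
        AUTHOR("author", Prop.STORED, Prop.TOKENIZED, Prop.INDEXED),
        PATH("path", Prop.STORED),
        MOD_DATE("mod_date", Prop.STORED),
        TITLE("title", Prop.STORED, Prop.TOKENIZED, Prop.INDEXED),
        SUMMARY("summary", Prop.STORED, Prop.TOKENIZED, Prop.INDEXED),
        BODY("body", Prop.TOKENIZED, Prop.INDEXED),
        URL("URL", Prop.STORED, Prop.TOKENIZED, Prop.INDEXED),
        KEYWORDS("keywords", Prop.STORED, Prop.TOKENIZED, Prop.INDEXED),
        SIZE("size", Prop.STORED),
        TYPE("type", Prop.STORED, Prop.TOKENIZED, Prop.INDEXED);

        final String fieldName;
        final Set<Prop> props;

        DocField(String fieldName, Prop first, Prop... rest) {
            this.fieldName = fieldName;
            this.props = EnumSet.of(first, rest);
        }
    }

    /** Names of the fields whose values are kept in the index and can be retrieved. */
    static List<String> storedFieldNames() {
        List<String> names = new ArrayList<>();
        for (DocField f : DocField.values()) {
            if (f.props.contains(Prop.STORED)) {
                names.add(f.fieldName);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(storedFieldNames());
    }
}
```

In this model body is the only field that is searchable but not stored, which matches the table: the full document text is indexed, while only the summary is kept for display.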