1. Introduction

pyygle is a simple search engine for text-based files.

Content files will be scanned and put into a database.

The search space is a chapter, which can be a part of a document or the whole document. If more than one word is searched for, all of these words must occur in the same chapter. In HTML documents, chapters start with <h1> or <h2>.
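
The chapter rule can be illustrated with a short Python sketch. It is not pyygle's actual parser: the function names split_chapters and chapter_matches are made up for this example, and it assumes that text before the first heading forms a chapter of its own.

```python
import re

# Assumption: a chapter starts at every <h1> or <h2> opening tag.
HEADING = re.compile(r"<h[12][\s>]", re.IGNORECASE)

def split_chapters(html):
    """Cut html into chunks, each beginning at an <h1>/<h2> tag."""
    starts = [m.start() for m in HEADING.finditer(html)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)          # text before the first heading
    ends = starts[1:] + [len(html)]
    return [html[s:e] for s, e in zip(starts, ends) if html[s:e].strip()]

def chapter_matches(chapter, words):
    """True if every search word occurs as a word in the chapter."""
    text = re.sub(r"<[^>]+>", " ", chapter).lower()   # drop the tags
    tokens = set(re.findall(r"\w+", text))
    return all(word.lower() in tokens for word in words)

doc = ("<h1>Burning</h1><p>Burn the ISO file.</p>"
       "<h2>Booting</h2><p>Boot from the disc.</p>")
chapters = split_chapters(doc)
# Only the first chapter contains both "burn" and "iso".
hits = [c for c in chapters if chapter_matches(c, ["burn", "iso"])]
```

A document without any heading yields exactly one chapter, which corresponds to the "whole document" case.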

Words are normalized both when they are stored in the index and before a search. This improves the results: a search for "burn" also finds "burned".
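
The effect of normalizing on both sides can be sketched in Python as follows. The mapping NORM and the functions normalize and lookup are hypothetical and only illustrate the idea; lowercasing is an additional assumption, and pyygle keeps its variants in the database rather than in a dictionary.

```python
# Hypothetical mapping: variant -> normalized form.
NORM = {"burned": "burn", "burns": "burn", "burner": "burn",
        "took": "take", "taken": "take"}

def normalize(word, norm=NORM):
    """Map a word to its normalized form; unknown words stay unchanged."""
    word = word.lower()              # assumption: case is ignored
    return norm.get(word, word)

# Indexing: store the normalized form of every word per chapter.
texts = {1: "The file must be burned", 2: "Take a blank disc"}
index = {}
for chapter_id, text in texts.items():
    for word in text.split():
        index.setdefault(normalize(word), set()).add(chapter_id)

def lookup(word):
    """Searching: normalize the query word the same way as the index."""
    return index.get(normalize(word), set())
```

Because both sides use the same mapping, lookup("burn"), lookup("burns") and lookup("burned") all return chapter 1.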

2. Database

create table word (
 word_id integer PRIMARY KEY,
 word varchar(64),
 count integer,
 hits integer
);
create table document (
 doc_id integer PRIMARY KEY,
 doc_type integer,
 link varchar(255),
 date integer,
 size integer,
 doctree_id integer,
 FOREIGN KEY(doctree_id) REFERENCES docTree(doctree_id)
);
create table chapter (
 chapter_id integer PRIMARY KEY,
 doc_id integer,
 size integer,
 anchor varchar(255),
 pure_text text,
 rating integer,
 title varchar(255),
 FOREIGN KEY(doc_id) REFERENCES document(doc_id)
);
create table wordOfChapter (
 word_id integer,
 chapter_id integer,
 count integer,
 FOREIGN KEY(word_id) REFERENCES word(word_id),
 FOREIGN KEY(chapter_id) REFERENCES chapter(chapter_id)
);
create table normOfWord (
 variant varchar(64),
 word_id integer,
 FOREIGN KEY(word_id) REFERENCES word(word_id)
);
create table rawWord (
 word varchar(64)
);
create table docTree (
 doctree_id integer PRIMARY KEY,
 url varchar(255),
 path varchar(255)
);
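
The following Python sketch shows how the tables word and wordOfChapter could answer the question "which chapters contain all search words". It is an illustration only, not the query pyygle itself runs: the function chapters_with_all and the sample rows are invented, the schema is trimmed to three tables, and the foreign keys are left out for brevity.

```python
import sqlite3

# Trimmed copy of the schema above (assumption: enough for this query).
SCHEMA = """
create table word (word_id integer PRIMARY KEY, word varchar(64),
                   count integer, hits integer);
create table chapter (chapter_id integer PRIMARY KEY, doc_id integer,
                      size integer, anchor varchar(255), pure_text text,
                      rating integer, title varchar(255));
create table wordOfChapter (word_id integer, chapter_id integer,
                            count integer);
"""

def chapters_with_all(conn, words):
    """Return the ids of the chapters that contain every given word."""
    words = sorted(set(words))
    if not words:
        return []
    marks = ", ".join("?" for _ in words)
    sql = (
        "SELECT wc.chapter_id FROM wordOfChapter wc "
        "JOIN word w ON w.word_id = wc.word_id "
        f"WHERE w.word IN ({marks}) "
        "GROUP BY wc.chapter_id "
        "HAVING COUNT(DISTINCT w.word_id) = ? "
        "ORDER BY wc.chapter_id"
    )
    return [row[0] for row in conn.execute(sql, words + [len(words)])]

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Invented sample data: two chapters, three normalized words.
conn.executemany("insert into word (word_id, word) values (?, ?)",
                 [(1, "burn"), (2, "iso"), (3, "boot")])
conn.executemany("insert into chapter (chapter_id, title) values (?, ?)",
                 [(1, "Burning"), (2, "Booting")])
conn.executemany("insert into wordOfChapter values (?, ?, ?)",
                 [(1, 1, 2), (2, 1, 1), (3, 2, 1), (2, 2, 1)])

both = chapters_with_all(conn, ["burn", "iso"])   # chapter 1 only
```

The HAVING clause enforces that all words lie in one chapter: a chapter is reported only if the number of distinct matching words equals the number of search words.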

3. Application

The application is pyygle.

3.1. Usage

Call it without arguments and you will see this short message:
pyygle, a simple search engine (C) Hamatoma 2012 Version %s
usage: %s <global_opts> mode <args>
<mode>:
  db      administers the database
  parse   traverses a file tree for indexing text files
  search  searches in the database
<global_opts>:
  --db=<path>          name of the database. Default: /var/lib/pyygle/pyygle.db
  --logfile=<file>     name of the logfile. Default: /tmp/pyygle.log
  --quiet              no logging of more info
<db-mode-args>:
  create               initializes the database: create the tables
  import-norm <file>   reads <file> and puts the entries into the table normOfWord
  statistic            displays some statistic data
  export-raw-words <file> 
                       writes the table rawWord into <file>
<parse-mode-args>:
  fill-db [--add] [<directory> [<pattern>]]
                       scans the directory and inserts the data into the db
    <directory>        the start of the file search. All subdirectories will be
                       inspected too. Default: the current directory
    <pattern>          pattern is a regular expression for the files to index
    --add              the directory data will be added: normally the db 
                       will be created first
<search-mode-args>:
  <search-opts> <phrase_1> ...
  --url=<url>          the prefix of the links in the result file
  --output=<file>      the result will be written to this file. Default: stdout
  --browser=<browser>  the result will be shown with this browser
  --no-frame           the result has pure info, no html header
  <phrase_N>:          a word: no prefix
                       an excluded word: prefix '~'
                       an exact matching phrase:  prefix: '='
example of a search:
burn =iso-file        finds the following chapter ("burn" also matches "burned", "iso-file" must match exactly):
   The downloaded file must be burned as an ISO-file. ....
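
The prefix rules for <phrase_N> can be read as in the following Python sketch. The function classify_phrases and the keys of its result are hypothetical names chosen for this example; how pyygle handles the phrases internally is not shown here.

```python
def classify_phrases(phrases):
    """Sort search phrases by their prefix: '~' excludes, '=' is exact."""
    query = {"words": [], "excluded": [], "exact": []}
    for phrase in phrases:
        if phrase.startswith("~"):
            query["excluded"].append(phrase[1:])
        elif phrase.startswith("="):
            query["exact"].append(phrase[1:])
        else:
            query["words"].append(phrase)    # no prefix: normal word
    return query

# "dvd" is an invented third phrase to show the exclusion prefix.
query = classify_phrases(["burn", "=iso-file", "~dvd"])
```

Here "burn" ends up as a normal word, "iso-file" as an exact phrase and "dvd" as an excluded word.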

3.2. Examples

Shows some statistical data of a database:
pyygle --db=/var/lib/sidu-manual/manual.db db statistic 
Builds a new database with all files from a directory tree:
pyygle --db=/var/lib/sidu-manual/manual.db parse fill-db /usr/share/sidu-manual/statics/de "-de.htm"
Adds some other files to the existing database:
pyygle --db=/var/lib/sidu-manual/manual.db parse fill-db --add /usr/share/sidu-manual/debian
Search for a chapter containing the words "burn" and "iso". The result should be displayed with iceweasel:
pyygle --db=/var/lib/sidu-manual/manual.db search --browser=iceweasel burn iso
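
The fill-db example above passes "-de.htm" as <pattern>. How such a regular expression could select files while walking a directory tree is sketched below in Python. The function find_files is hypothetical, and the sketch assumes that the pattern may match anywhere in the file name; pyygle may apply it differently.

```python
import os
import re
import tempfile

def find_files(directory, pattern):
    """Walk directory recursively; return files whose name matches pattern."""
    regex = re.compile(pattern)
    found = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if regex.search(name):   # assumption: match anywhere in the name
                found.append(os.path.join(root, name))
    return sorted(found)

# Demonstration with an invented file tree in a temporary directory.
with tempfile.TemporaryDirectory() as top:
    os.makedirs(os.path.join(top, "sub"))
    for rel in ("welcome-de.htm", "welcome-en.htm",
                os.path.join("sub", "burn-de.htm")):
        open(os.path.join(top, rel), "w").close()
    matches = [os.path.relpath(p, top) for p in find_files(top, "-de.htm")]
```

Note that the dot in "-de.htm" is a regular expression metacharacter and matches any character; write "-de\.htm$" if only names ending in "-de.htm" should be selected.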

4. Word Normalizing

A comfortable search requires that the database knows the forms of each word: all forms of a word must be imported into the table normOfWord. This can be done with:
pyygle --db=/var/lib/sidu-manual/manual.db db import-norm /usr/share/sidu-manual/resources/normalized_de.txt
The format of the file: all forms of one word are listed on one line, separated by whitespace (blanks or tabs). Lines beginning with # are comments and are ignored.
Example:
burn burned burns burner
take took taken
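
Reading such a file can be sketched in Python as follows. The function parse_norm_lines is hypothetical and not part of pyygle. The sketch assumes that the first form on a line is the normalized form and that every form, including the first, is mapped to it; the file format description does not state this explicitly.

```python
def parse_norm_lines(lines):
    """Build a mapping variant -> normalized form from the file's lines."""
    mapping = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue                 # comment or empty line
        forms = line.split()         # splits on blanks and tabs
        base = forms[0]              # assumption: first form is the norm
        for variant in forms:
            mapping[variant] = base
    return mapping

sample = [
    "# verbs",
    "burn burned burns burner",
    "take\ttook taken",
]
mapping = parse_norm_lines(sample)
```

With the sample lines, "burned", "burns" and "burner" all map to "burn", and "took" and "taken" map to "take". For a real file, pass the open file object instead of the list.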