pyygle is a simple search engine for text-based files.
Content files are scanned and their contents are stored in a database.
The search space is a chapter, which can be a part of a document or the whole document. If more than one word is searched for, all of these words must occur in the same chapter. In HTML documents, a chapter starts with <h1> or <h2>.
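The chapter rule can be illustrated with a small Python sketch. It is not pyygle's actual parser: the names ChapterSplitter, split_chapters and chapters_with_all_words are hypothetical, and splitting words with a simple regular expression is an assumption made for the example.

```python
import re
from html.parser import HTMLParser


class ChapterSplitter(HTMLParser):
    """Collects document text and starts a new chapter at every <h1> or <h2>."""

    def __init__(self):
        super().__init__()
        self.chapters = [[]]  # one list of text fragments per chapter

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2"):
            self.chapters.append([])

    def handle_data(self, data):
        self.chapters[-1].append(data)


def split_chapters(html):
    """Returns the non-empty chapter texts with whitespace collapsed."""
    splitter = ChapterSplitter()
    splitter.feed(html)
    splitter.close()
    texts = [" ".join(" ".join(parts).split()) for parts in splitter.chapters]
    return [text for text in texts if text]


def chapters_with_all_words(chapters, words):
    """Keeps only chapters that contain every searched word."""
    wanted = [word.lower() for word in words]
    result = []
    for chapter in chapters:
        tokens = set(re.findall(r"\w+", chapter.lower()))
        if all(word in tokens for word in wanted):
            result.append(chapter)
    return result
```

With this sketch, a search for "burn" and "iso" only returns a chapter in which both words appear. A chapter containing just one of them is skipped.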
Words are normalized both when they are stored in the index and before searching. This gives better results: a search for "burn" also finds "burned".
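A minimal sketch of the idea in Python, assuming that normalization is a lookup from a variant to its normal form (as the table normOfWord suggests). The VARIANTS dictionary and the normalize function are hypothetical, and lowercasing the word first is an assumption of this example.

```python
# Hypothetical variant table; in pyygle such pairs would come from normOfWord.
VARIANTS = {
    "burned": "burn",
    "burns": "burn",
    "took": "take",
    "taken": "take",
}


def normalize(word, variants=VARIANTS):
    """Maps a word to its normal form; unknown words are returned lowercased."""
    lowered = word.lower()
    return variants.get(lowered, lowered)
```

Applying the same function to indexed words and to search words is what lets "burn" match a chapter that contains "burned".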
create table word (
    word_id integer PRIMARY KEY,
    word varchar(64),
    count integer,
    hits integer
);
create table document (
    doc_id integer PRIMARY KEY,
    doc_type integer,
    link varchar(255),
    date integer,
    size integer,
    doctree_id integer,
    FOREIGN KEY(doctree_id) REFERENCES docTree(doctree_id)
);
create table chapter (
    chapter_id integer PRIMARY KEY,
    doc_id integer,
    size integer,
    anchor varchar(255),
    pure_text text,
    rating integer,
    title varchar(255),
    FOREIGN KEY(doc_id) REFERENCES document(doc_id)
);
create table wordOfChapter (
    word_id integer,
    chapter_id integer,
    count integer,
    FOREIGN KEY(word_id) REFERENCES word(word_id),
    FOREIGN KEY(chapter_id) REFERENCES chapter(chapter_id)
);
create table normOfWord (
    variant varchar(64),
    word_id integer,
    FOREIGN KEY(word_id) REFERENCES word(word_id)
);
create table rawWord (
    word varchar(64)
);
create table docTree (
    doctree_id integer PRIMARY KEY,
    url varchar(255),
    path varchar(255)
);
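The following Python sketch shows how the tables word, chapter and wordOfChapter could answer a multi-word search. It assumes the database is SQLite (suggested by the .db default path, but not stated here). The query, the functions build_demo_db and find_chapters, and the sample rows are hypothetical and are not taken from pyygle's source. The demo schema omits the foreign keys and the other tables for brevity.

```python
import sqlite3

# Reduced copy of the schema above: only the tables needed for the query.
SCHEMA = """
create table word (word_id integer PRIMARY KEY, word varchar(64),
    count integer, hits integer);
create table chapter (chapter_id integer PRIMARY KEY, doc_id integer,
    size integer, anchor varchar(255), pure_text text, rating integer,
    title varchar(255));
create table wordOfChapter (word_id integer, chapter_id integer, count integer);
"""


def build_demo_db():
    """Creates an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("insert into word (word_id, word) values (?, ?)",
                     [(1, "burn"), (2, "iso"), (3, "boot")])
    conn.executemany("insert into chapter (chapter_id, title) values (?, ?)",
                     [(1, "Burning the image"), (2, "Booting")])
    conn.executemany(
        "insert into wordOfChapter (word_id, chapter_id, count) values (?, ?, ?)",
        [(1, 1, 2), (2, 1, 1), (3, 2, 1), (2, 2, 1)])
    return conn


def find_chapters(conn, words):
    """Returns (chapter_id, title) of chapters containing all given words."""
    unique = sorted(set(words))
    if not unique:
        return []
    placeholders = ", ".join("?" for _ in unique)
    sql = f"""
        select c.chapter_id, c.title
        from chapter c
        join wordOfChapter wc on wc.chapter_id = c.chapter_id
        join word w on w.word_id = wc.word_id
        where w.word in ({placeholders})
        group by c.chapter_id
        having count(distinct w.word_id) = ?
        order by c.chapter_id
    """
    return conn.execute(sql, [*unique, len(unique)]).fetchall()
```

Grouping by chapter and comparing the number of distinct matched words with the number of searched words is one way to express "all words must occur in the same chapter" in SQL.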
pyygle, a simple search engine (C) Hamatoma 2012
Version %s
usage: %s <global_opts> mode <args>
<mode>:
  db       administrates the database
  parse    traverse a file tree for indexing text files
  search   search in the database
<global_opts>:
  --db=<path>        name of the database. Default: /var/lib/pyygle/pyygle.db
  --logfile=<file>   name of the logfile. Default: /tmp/pyygle.log
  --quiet            no logging of more info
<db-mode-args>:
  create                    initializes the database: create the tables
  import-norm <file>        reads <file> and puts the entries into the table normOfWord
  statistic                 displays some statistic data
  export-raw-words <file>   writes the table rawWord into <file>
<parse-mode-args>:
  fill-db [--add] [<directory> [<pattern>]]
                 scans the directory and inserts the data into the db
    <directory>  the start of the file search. All subdirectories will be inspected too.
                 Default: the current directory
    <pattern>    pattern is a regular expression for the files to index
    --add        the directory data will be added: normally the db will be created first
<search-mode-args>:
  <search-opts> <phrase_1> ...
  --url=<url>           the prefix of the links in the result file
  --output=<file>       the result will be written to this file. Default: stdout
  --browser=<browser>   the result will be shown with this browser
  --no-frame            the result has pure info, no html header
<phrase_N>:
  a word: no prefix
  an excluded word: prefix '~'
  an exact matching phrase: prefix '='
example for a search:
  burn =iso-file
  finds the following chapter: "burn" finds "burned" too, "iso-file" must be exact.
  The downloaded file must be burned as an ISO-file. ....
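The three phrase kinds from the usage text can be separated by their prefix. The Python sketch below only illustrates that rule. The function classify_phrases and the keys of its result are hypothetical names, not part of pyygle.

```python
def classify_phrases(phrases):
    """Sorts search phrases into normal, excluded ('~') and exact ('=') ones."""
    normal, excluded, exact = [], [], []
    for phrase in phrases:
        if phrase.startswith("~"):
            excluded.append(phrase[1:])  # word that must not occur
        elif phrase.startswith("="):
            exact.append(phrase[1:])     # matched literally, not normalized
        else:
            normal.append(phrase)        # normalized before searching
    return {"normal": normal, "excluded": excluded, "exact": exact}
```

For the search from the usage text, burn =iso-file, the sketch puts "burn" into the normal list and "iso-file" into the exact list.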
Displays some statistic data of a database:
pyygle --db=/var/lib/sidu-manual/manual.db db statistic
Builds a new database with all files from a directory tree:
pyygle --db=/var/lib/sidu-manual/manual.db parse fill-db /usr/share/sidu-manual/statics/de "-de.htm"
Adds some other files to the existing database:
pyygle --db=/var/lib/sidu-manual/manual.db parse fill-db --add /usr/share/sidu-manual/debian
Searches for a chapter containing the words "burn" and "iso". The result is displayed with iceweasel:
pyygle --db=/var/lib/sidu-manual/manual.db search --browser=iceweasel burn iso
Imports normalized word forms from a file:
pyygle --db=/var/lib/sidu-manual/manual.db db import-normalized=/usr/share/sidu-manual/resources/normalized_de.txt
The format of the file: all forms of one verb are listed on one line, separated by whitespace (blank or tab). Lines beginning with # are comments and are ignored.
burn burned burns burner
take took taken
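Reading such a file could look like the following Python sketch. It assumes that the first word of each line is the normal form and that the remaining words are its variants, which the description above does not state explicitly. The function parse_norm_lines is a hypothetical name, and skipping blank lines is an addition of this example.

```python
def parse_norm_lines(lines):
    """Turns normalization lines into (variant, normal_form) pairs."""
    pairs = []
    for line in lines:
        # Comment lines start with '#'; blank lines are skipped as well.
        if line.startswith("#") or not line.strip():
            continue
        forms = line.split()  # splits on blanks and tabs
        base = forms[0]       # assumption: the first form is the normal form
        pairs.extend((variant, base) for variant in forms[1:])
    return pairs
```

Each resulting pair corresponds to one variant column entry of the table normOfWord, with the normal form still to be resolved to a word_id.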