gelapas

Section: User Commands (1)
Updated: Oct 2001

NAME

gelapas - extract links and print a summary

SYNOPSIS

gelapas [options]

DESCRIPTION

Extract information from files. The default settings (and the shorthand options) are useful for extracting information such as the title or meta tags from HTML files, but gelapas can also be used for other kinds of documents.

gelapas crawls the file tree for files which match a specified file pattern (defaults to HTML files).

Options

Below is a summary of all options that gelapas accepts. Most options have two equivalent names, one of which is a single letter preceded by -, and the other of which is a long name preceded by --. Multiple single-letter options (unless they take an argument) can be combined into a single command-line word: -td is equivalent to -t -d. Long-named options can be abbreviated to any unique prefix of their name. Brackets ([ and ]) mark an optional part of an option name: --[no]title stands for both --title (include the field) and --notitle (exclude it).
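The option conventions described above can be tried out with a small sketch. This is only an analogy: it uses Python's argparse, which happens to follow the same conventions for combined short options and unique-prefix abbreviations, and says nothing about how gelapas parses its own command line. The function name build_parser is made up for this illustration.

```python
import argparse

def build_parser():
    """Build a toy parser with two of the field flags (illustration only)."""
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument("-t", "--title", action="store_true")
    parser.add_argument("-d", "--desc", action="store_true")
    return parser

parser = build_parser()
combined = parser.parse_args(["-td"])        # one command-line word
separate = parser.parse_args(["-t", "-d"])   # two command-line words
abbreviated = parser.parse_args(["--ti"])    # unique prefix of --title
print(vars(combined) == vars(separate), abbreviated.title)
```

An ambiguous prefix is rejected by argparse; the man page does not say how gelapas reacts in that case.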
General options:
--exclude pattern
Exclude files that match the given pattern. Can be repeated.
-p regexp

--pattern regexp
File pattern. Defaults to ".html?$" to get all .htm and .html files.
-s directory

--startdir directory
Start at the specified directory instead of the current directory.
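To make the interplay of the general options concrete, here is a minimal sketch of a crawl in Python. It is not the gelapas implementation: the function name find_files is hypothetical, Python's re module stands in for Perl regular expressions, and it assumes that both the file pattern and the exclude patterns are searched against the path relative to the start directory.

```python
import os
import re

def find_files(startdir=".", pattern=r".html?$", excludes=()):
    """Return relative paths under startdir that match pattern.

    Assumption: pattern and excludes are regular expressions searched
    against the relative path; the real tool may match differently.
    """
    include = re.compile(pattern)
    exclude = [re.compile(e) for e in excludes]
    found = []
    for root, _dirs, names in os.walk(startdir):
        for name in names:
            rel = os.path.relpath(os.path.join(root, name), startdir)
            rel = rel.replace(os.sep, "/")  # normalise separators
            if include.search(rel) and not any(e.search(rel) for e in exclude):
                found.append(rel)
    return sorted(found)
```

Note that the dot in the default pattern ".html?$" is unescaped, so as a regular expression it matches any character before "htm" or "html", not only a literal dot.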
Fields:
-a

--[no]author
Include [Exclude] author.
Regexp: <link rev="made" href="([^"]*)"
-C

--[no]CONTENT
Include [Exclude] content.
Regexp: <body>(.*)</body>
-d

--[no]desc
Include [Exclude] the description.
Regexp: <meta name="description" content="([^"]*)
-F

--[no]FILELONG
Include [Exclude] filename (full path).
-f

--[no]fileshort
Include [Exclude] filename (relative to the start directory).
-k

--[no]keywords
Include [Exclude] the keywords.
Regexp: <meta name="keywords" content="([^"]*)
-l

--[no]level
Include [Exclude] the hierarchy level.
-t

--[no]title
Include [Exclude] document title.
Regexp: <title>(.*)</title>
-x pattern

--xtract pattern
Extract information according to the specified regular expression. The pattern is matched case-insensitively, and gelapas automatically inserts some white space expressions so that it matches more HTML tags (see the option --noise below). The text matched by the first pair of parentheses is returned as the data.
Example:
-x "<head>(.*)</head>"
This extracts the full content of the head element.
-T string

--fieldtitle string
Sets the title for a field. This is only used with the XML output format. Repeat this option to set multiple titles. Set titles either for the xtract fields only or for all fields; if you set titles for some but not all of the shorthand options, you'll get an ugly mix.
-n

--[no]noise
Toggles insertion of some white space expressions into the patterns to match more HTML tags. This is handy because HTML is very loosely formatted. If you extract information from files other than HTML, it may be practical to disable this additional noise. Noise is enabled by default.
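The effect of the field patterns and of the noise option can be sketched in Python. The man page does not document which white space expressions gelapas inserts, so the two rules in add_noise are purely hypothetical, as are the function names. Letting "." span line breaks (re.DOTALL) is also an assumption.

```python
import re

def add_noise(pattern):
    """Loosen white space in a pattern (hypothetical rules, not gelapas's own)."""
    pattern = pattern.replace(" ", r"\s+")      # any run of white space
    pattern = pattern.replace("=", r"\s*=\s*")  # optional space around '='
    return pattern

def extract(pattern, text, noise=True):
    """Return the first parenthesized group of a case-insensitive match."""
    if noise:
        pattern = add_noise(pattern)
    # Assumption: '.' may also match newlines so that fields can span lines.
    match = re.search(pattern, text, re.IGNORECASE | re.DOTALL)
    return match.group(1) if match else None

html = ('<HTML><head><TITLE>My page</TITLE>\n'
        '<meta  name = "description"\n content="A test page"></head>'
        '<body>Hi</body></html>')
desc = '<meta name="description" content="([^"]*)'
print(extract("<title>(.*)</title>", html))
print(extract(desc, html))
print(extract(desc, html, noise=False))
```

With these assumed rules the loosely formatted meta tag is found only when noise is enabled; with noise disabled the pattern requires exactly the spacing written in it.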
Output options:
--csv
Print a comma-separated list. The fields appear in the order specified on the command line (see BUGS below for an exception).
--xml
Print XML output. You can specify the field names using the -T option. If no -T option is specified, convenient default values are used. For the fields specified with the xtract option, there are no default values.
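As a rough illustration of the two output formats, the following Python sketch renders the same extracted rows as CSV and as XML. Everything about the layout is an assumption: the man page documents neither the CSV quoting rules nor the XML schema, so the element names files and file, the field titles, and the function names to_csv and to_xml are invented for this example.

```python
import csv
import io
import xml.etree.ElementTree as ET

def to_csv(rows):
    """Render rows as comma-separated lines (quoting rules are assumed)."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()

def to_xml(rows, titles):
    """Render rows as XML, one child element per field title."""
    root = ET.Element("files")               # hypothetical root element
    for row in rows:
        item = ET.SubElement(root, "file")   # hypothetical record element
        for title, value in zip(titles, row):
            ET.SubElement(item, title).text = value
    return ET.tostring(root, encoding="unicode")

rows = [["index.html", "Home, sweet home"]]
print(to_csv(rows), end="")
print(to_xml(rows, ["filename", "title"]))
```

The titles passed to to_xml play the role of the names set with -T: each field needs a name before it can become an XML element, which is why xtract fields without a title have no default to fall back on.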

BUGS

The fields from xtract and from the shorthand options are not output in the order given on the command line: the xtract fields always come first.

AUTHOR

Patrice Neff <software@patrice.ch>.
www.patrice.ch/en/computer/programs/gelapas

SEE ALSO

perlre(1)