gelapas
Section: User Commands (1)
Updated: Oct 2001
NAME
gelapas - extract links and print a summary
SYNOPSIS
gelapas [options]
DESCRIPTION
gelapas extracts information from files.
The default settings and the shorthand options are suited to extracting information such as the title or meta tags from HTML files, but gelapas can also be used for other kinds of documents.
gelapas crawls the file tree for files that match a specified file pattern (HTML files by default).
Options
Below is a summary of all options that gelapas accepts. Most options have two equivalent names, one of which is a single letter preceded by -, and the other of which is a long name preceded by --. Multiple single-letter options (unless they take an argument) can be combined into a single command-line word: -td is equivalent to -t -d. Long option names can be abbreviated to any unique prefix of their name. Brackets ([ and ]) indicate that an option takes an optional argument. In long option names such as --[no]title, the bracketed "no" prefix is optional and turns the option off.
- General options:
- --exclude pattern
Exclude files that match the given pattern. This option can be repeated.
- -p regexp, --pattern regexp
Regular expression that selects the files to process. Defaults to ".html?$", which matches all .htm and .html files.
- -s directory, --startdir directory
Start at the specified directory instead of the current directory.
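The file selection described by the general options can be sketched in Python as follows. This is only an illustration and is not taken from gelapas itself: the function name select_files is hypothetical, and matching the patterns against the full path (rather than the bare file name) is an assumption.

```python
import os
import re


def select_files(startdir=".", pattern=r".html?$", excludes=()):
    """Yield paths below startdir that match pattern and no exclude pattern.

    Assumption: patterns are searched in the full path of each file.
    """
    include = re.compile(pattern)
    exclude = [re.compile(e) for e in excludes]
    for root, _dirs, files in os.walk(startdir):
        for name in sorted(files):
            path = os.path.join(root, name)
            if not include.search(path):
                continue  # does not match the file pattern
            if any(e.search(path) for e in exclude):
                continue  # matches one of the exclude patterns
            yield path


if __name__ == "__main__":
    # Roughly what "--startdir . --exclude backup" would select.
    for path in select_files(".", excludes=["backup"]):
        print(path)
```

Because --exclude can be repeated, the sketch takes a sequence of exclude patterns and drops a file as soon as any one of them matches.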
- Fields:
- -a, --[no]author
Include [Exclude] author.
Regexp: <link rev="made" href="([^"]*)"
- -C, --[no]CONTENT
Include [Exclude] content.
Regexp: <body>(.*)</body>
- -d, --[no]desc
Include [Exclude] the description.
Regexp: <meta name="description" content="([^"]*)
- -F, --[no]FILELONG
Include [Exclude] filename (full path).
- -f, --[no]fileshort
Include [Exclude] filename (relative to the start directory).
- -k, --[no]keywords
Include [Exclude] the keywords.
Regexp: <meta name="keywords" content="([^"]*)
- -l, --[no]level
Include [Exclude] the hierarchy level.
- -t, --[no]title
Include [Exclude] document title.
Regexp: <title>(.*)</title>
- -x pattern, --xtract pattern
Extract information according to the specified regular expression. The pattern is matched case-insensitively, and gelapas automatically inserts some white space expressions so that it matches more HTML tags (see the --noise option below). The contents of the first parenthesized group are returned as the data.
Example:
-x "<head>(.*)</head>"
This extracts the full content of the head element.
- -T string, --fieldtitle string
Sets the title for a field. This is only used with the XML output format. Repeat this option to set multiple titles. Set titles either only for the xtract fields or for all fields. If you set titles for some but not all of the shorthand options, you will get an inconsistent mix of titles.
- -n, --[no]noise
Toggles the insertion of some white space expressions into the patterns so that they match more HTML tags. This is handy because HTML is very loosely formatted. If you extract information from files other than HTML, it may be practical to disable this additional noise. Noise is enabled by default.
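The field extraction described above (a case-insensitive match whose first group is returned, optionally loosened by noise) can be illustrated with a short Python sketch. It is not gelapas's implementation. This page does not say which white space expressions gelapas inserts, so the add_noise rewrite below is a guess, and letting "." match newlines is likewise an assumption. Both function names are hypothetical.

```python
import re


def add_noise(pattern):
    """Loosen a tag pattern (assumed rules, not gelapas's own).

    A literal space becomes \\s+, and optional white space is allowed
    after "<" and before ">". Patterns that use "<" or ">" in regex
    syntax such as (?<=...) are not handled by this simple rewrite.
    """
    pattern = pattern.replace(" ", r"\s+")
    pattern = pattern.replace("<", r"<\s*")
    pattern = pattern.replace(">", r"\s*>")
    return pattern


def extract(pattern, text, noise=True):
    """Return the first group of a case-insensitive match, or None."""
    if noise:
        pattern = add_noise(pattern)
    match = re.search(pattern, text, re.IGNORECASE | re.DOTALL)
    return match.group(1) if match else None


page = (
    "<HTML><HEAD>\n"
    "<TITLE >Home page</TITLE>\n"
    '<META  NAME="description" CONTENT="A test page">\n'
    "</HEAD></HTML>"
)

title = extract(r"<title>(.*)</title>", page)
strict_title = extract(r"<title>(.*)</title>", page, noise=False)
desc = extract(r'<meta name="description" content="([^"]*)', page)
print(title, strict_title, desc, sep=" | ")
```

With noise, the title and description patterns still match the irregularly spaced tags in the sample page. Without noise, the strict title pattern finds nothing because of the space inside the opening tag, which is the kind of loose formatting the noise option is meant to absorb.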
- Output options:
- --csv
Print a comma-separated list. The fields appear in the order specified on the command line (see BUGS below for an exception).
- --xml
Print XML output. You can specify the field names using the -T option. If no -T option is specified, convenient default values are used. Fields specified with the xtract option have no default names.
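The two output formats can be pictured with the following Python sketch. It only illustrates the idea of one record per file with named fields. The quoting rules of the CSV output and the element names "files" and "file" are assumptions, since this page does not describe the exact output of gelapas, and the function names are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET


def to_csv(records):
    """Render one comma-separated line per record, quoting where needed."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(records)
    return buf.getvalue()


def to_xml(records, names):
    """Render records as XML, using names as the per-field element names."""
    root = ET.Element("files")  # hypothetical root element
    for record in records:
        entry = ET.SubElement(root, "file")  # hypothetical record element
        for name, value in zip(names, record):
            ET.SubElement(entry, name).text = value
    return ET.tostring(root, encoding="unicode")


# Fields in command-line order, as for "-t -f".
records = [
    ["Home, sweet home", "index.html"],
    ["A & B", "about/index.html"],
]
print(to_csv(records), end="")
print(to_xml(records, ["title", "fileshort"]))
```

In this sketch the names passed to to_xml play the role of the titles set with -T. A field without a name would have no element to go into, which is why xtract fields need an explicit title for XML output.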
BUGS
The xtract fields and the shorthand fields are not printed in the order given on the command line: the xtract fields always come first.
AUTHOR
Patrice Neff
<software@patrice.ch>.
www.patrice.ch/en/computer/programs/gelapas
SEE ALSO
perlre(1)