cl-langutils

https://github.com/eslick/cl-langutils.git

git clone 'https://github.com/eslick/cl-langutils.git'

(ql:quickload :cl-langutils)

LANGUTILS LIBRARY

This file is a brief guide to the main functions and files of the langutils library. The code is reasonably documented with doc strings and inline comments; write to the author with any questions. Also see docs/LISP2005-langutils.pdf for a more detailed exposition of the implementation and performance issues in the toolkit.

The library provides a hierarchy of major and auxiliary functions related to the structured analysis and processing of open text. The major functions, working from raw text up, are tokenization, part-of-speech tagging, chunking, and concept handling; each has its own section in the reference below.

We also provide auxiliary functions that operate on strings, tokens, or vector-documents. The Lisp functions implementing this functionality can be found under the appropriately labeled sections in the reference below:

Strings:

Tokens:

Vector-Documents:

Miscellaneous:

INTERFACE REFERENCE

This documents the important functions of the langutils toolkit. Documentation entries are of the form:

----------------------------------------
function ( args )
----------------------------------------
Input:  arg1 - description
        arg2 - description

Output: description

Notes:  discussion of use cases, etc.

Functions are explicitly referenced by putting () around them; variables or parameters are written as <name>.

TOKENS and TOKENIZATION

----------------------------------------
tokenize-stream (stream &key (by-sentence nil) (fragment ""))
----------------------------------------
Input:  stream      - a standard Lisp stream containing the characters to analyze;
                      the stream can be of any length
        by-sentence - stop the tokenization process after each processed sentence,
                      i.e. each validly parsed period, exclamation mark, or question mark
        fragment    - provide a fragment from a prior call to tokenize-stream at the
                      beginning of the parse stream

Output: (multiple values)
        1 - parsing success (t) or failure (nil)
        2 - the current index into the stream; starts from 0 on every call
        3 - a string containing the tokenized data parsed up to 'index'
        4 - if parsing was a success, a fragment of any unparsed data
            (primarily in by-sentence mode)

Notes: This function is intended to be called all at once or in batches. For large strings or files it should be called in by-sentence mode in a loop that captures any fragments and passes them to the next call. The function operates by grabbing one character at a time from the stream and writing it into a temporary array. When it reaches a punctuation character, it inserts a whitespace, then backs up to the beginning of the current token, checks whether the token should have included the punctuation, and fixes up the temporary array. Upon completion of the current parse (end of stream or end of sentence) it returns the values described above.
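
For example, a minimal sketch of the by-sentence loop described above; the langutils package prefix and the process-sentence handler are assumptions for illustration, not part of the library's documented API:

(with-open-file (stream "corpus.txt" :direction :input)
  (let ((fragment ""))
    (loop
      (multiple-value-bind (success index tokenized remainder)
          (langutils:tokenize-stream stream :by-sentence t :fragment fragment)
        (declare (ignore index))
        ;; stop at end of stream or when parsing fails
        (unless success (return))
        ;; hypothetical handler for each tokenized sentence
        (process-sentence tokenized)
        ;; carry any unparsed fragment into the next call
        (setf fragment (or remainder ""))))))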

----------------------------------------
tokenize-string (string)
----------------------------------------
Input:  string - the string to tokenize

Output: A string which is the result of calling (tokenize-stream) on a stream version of the input string.
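
A sketch (package prefix assumed; the exact spacing of the result depends on the tokenizer's rules):

(langutils:tokenize-string "Don't panic, Mr. Adams wrote.")
;; => a single string with punctuation split off into separate tokens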

----------------------------------------
tokenize-file (source target &key (if-exists :supersede))
----------------------------------------
Input:  source    - the file whose contents to tokenize
        target    - the file to write the tokenized text to
        if-exists - passed through when opening the target file

----------------------------------------
id-for-token ( token )
----------------------------------------
Input:  token - a string token

Output: A fixnum providing a unique id for the provided string token.

Notes: Tokens are case sensitive, so 'The', 'the', and 'THE' all map to different tokens but should map to the same entry in the lexicon. The root form of a lexicon word is the lower-case representation.

----------------------------------------
token-for-id ( id )
----------------------------------------
Input:  id - a fixnum token id

Output: The original token string.
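
A sketch of the round trip between tokens and ids (package prefix assumed; actual id values depend on the loaded token map):

(let ((id (langutils:id-for-token "dog")))   ;; => some fixnum
  (langutils:token-for-id id))               ;; => "dog"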

----------------------------------------
tokens-for-id ( ids )
----------------------------------------
Input:  ids - a list of fixnum token ids

Output: A list of string representations of each id

----------------------------------------
save-token-map ( filename )
----------------------------------------
Input:  filename - the file to save the token map to

Output: t on success or nil otherwise

Notes: This procedure will default to the filename in default-token-map-file-int, which can be set via the asdf-config parameter 'token-map'.

----------------------------------------
load-token-map ( filename )
----------------------------------------
Input:  filename - the file to load the token map from

Output: t on success or nil otherwise

Notes: This procedure will default to the filename in default-token-map-file-int, which can be set via the asdf-config parameter 'token-map'.
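
A sketch of persisting the token map at the end of one session and restoring it in the next (package prefix and path are illustrative assumptions):

(langutils:save-token-map "/tmp/token-map.dat")   ;; => t on success
;; ...later, in a fresh session:
(langutils:load-token-map "/tmp/token-map.dat")   ;; => t on success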

----------------------------------------
suspicious-word? ( word )
----------------------------------------
Input:

Output: A boolean representing whether this word has been labelled as fishy

----------------------------------------
suspicious-string? ( string )
----------------------------------------
Input:  string - the string to check

Output: A boolean representing whether the word is fishy as determined by parameters set in tokens.lisp (max numbers, total length and other characters in the token). This is used inside id-for-token to keep the hash for suspicious-word? up to date.

POS TAGGING AND OPERATIONS ON TOKENS

----------------------------------------
tag ( string )
----------------------------------------
Input:  string - a raw text string to tokenize and tag

Output: A tagged string in word/TAG format, where the tags are symbols taken from the Penn Treebank 2 tagset. Literal slash characters show up as '///': a slash word and a slash tag, slash-separated!

Note: This procedure calls the tokenizer to ensure that the input string is properly tokenized in advance.
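
A sketch (package prefix assumed; the exact tags depend on the lexicon and the tagger's rules):

(langutils:tag "The dog chased the cat.")
;; => something like "The/DT dog/NN chased/VBD the/DT cat/NN ./."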

----------------------------------------
tag-tokenized ( string )
----------------------------------------
Input:  string - a pre-tokenized text string

Output: A tagged string as in ‘tag’ above.

----------------------------------------
vector-tag ( string )
----------------------------------------
Input:  string - the text string to tag

Output: A CLOS object of type vector-document with the token array initialized to fixnum representations of the word tokens and the tag array initialized with symbols representing the selected tags.

----------------------------------------
vector-tag-tokenized ( string &key end-tokens )
----------------------------------------
Input:

Output: A vector-document as in vector-tag

Note: As with tag-tokenized, this interface does not tokenize the input string.
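
A sketch, assuming the input has already been tokenized (note the space before the final period; package prefix assumed):

(langutils:vector-tag-tokenized "The dog chased the cat .")
;; => a vector-document whose token array holds fixnum token ids and whose
;;    tag array holds keyword tag symbols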

----------------------------------------
get-lexicon-entry ( word )
----------------------------------------
Input:  word - the word to look up in the lexicon

Output: A lexicon-entry structure related to the lexical characteristics of the token

Notes: The lexical-entry can be manipulated with a set of accessor functions: lexicon-entry-tag, lexicon-entry-tags, lexical-entry-id, lexical-entry-roots, lexical-entry-surface-forms, lexical-entry-case-forms, get-lexicon-default-pos. These functions are not all exported from the library package, however.

----------------------------------------
initial-tag ( token )
----------------------------------------
Input:  token - a string token

Output: A keyword symbol of the initially guessed tag (:PP :NN, etc)

Notes: Provides an initial guess based purely on lexical features and lexicon information of the provided string token.
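
A sketch (package prefix assumed; the guess depends on the lexicon):

(langutils:initial-tag "dog")   ;; => a keyword tag such as :NN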

----------------------------------------
read-file-as-tagged-document ( file )
----------------------------------------
Input:  file - the file to read and tag

Output: A vector-document representing the tagged contents of file

Notes: Loads the file into a string then calls vector-tag

----------------------------------------
read-and-tag-file ( file )
----------------------------------------
Input:  file - the file to read and tag

Output: A string with tag annotations of the contents of file

Notes: Uses tag on the string contents of file

----------------------------------------
get-lemma ( word &key pos (noun t) porter )
----------------------------------------
Input:  word - the word string to find the lemma of

Output: A string representing the lemma of the word, if found

----------------------------------------
get-lemma-for-id ( id &key pos (noun t) porter )
----------------------------------------
Input:  id - a fixnum token id

Output: The lemma id

----------------------------------------
lemmatize ( (sequence list/array) &key strip-det pos (noun t) porter last-only )
----------------------------------------
Input:  sequence - a list or array of fixnum token ids

Output: The lemmatized list of tokens

Notes: The main method for performing lemmatization. Valid on lists and arrays of fixnum values only. Useful for getting the lemmatization of short phrases.
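
A sketch contrasting the string-level and id-level interfaces (package prefix assumed; results depend on the lexicon):

(langutils:get-lemma "dogs")   ;; => "dog", if a root form is found
(langutils:lemmatize
 (mapcar #'langutils:id-for-token '("the" "dogs" "chased")))
;; => a list of lemmatized token ids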

----------------------------------------
morph-surface-forms ( root &optional pos-class )
----------------------------------------
Input:

Output: A list of surface ids

----------------------------------------
morph-surface-forms-text ( root &optional pos-class )
----------------------------------------

String to string form of the above function
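
A sketch (package prefix assumed; the available surface forms depend on the morphology data):

(langutils:morph-surface-forms-text "run")
;; => a list of surface strings, e.g. ("run" "runs" "ran" "running")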

----------------------------------------
stopword? ( id )
----------------------------------------
Input:  id - a fixnum token id

Output: boolean

----------------------------------------
concise-stopword? ( id )
----------------------------------------
Input:  id - a fixnum token id

Output: boolean

----------------------------------------
contains-is? ( ids )
----------------------------------------
Input:  ids - a list of fixnum token ids

Output: boolean

Notes: A sometimes useful utility. Searches the list for the token for ‘is’

----------------------------------------
string-stopword?, string-concise-stopword?, string-contains-is? ( string )
----------------------------------------

The three functions above, but accepting string or list-of-string arguments.
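
A sketch (package prefix assumed):

(langutils:string-stopword? "the")                    ;; => t
(langutils:string-contains-is? '("he" "is" "tall"))   ;; => t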

CHUNKING

----------------------------------------
chunk ( text )
----------------------------------------
Input:  text - a raw text string

Output: A list of phrases referencing a document created from the text

Note: Runs the tokenizer on the text prior to POS tagging

----------------------------------------
chunk-tokenized ( text )
----------------------------------------
Input:  text - a pre-tokenized text string

Output: A list of phrases referencing a document created from the text

Note: Does not run the tokenizer on text prior to POS tagging
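
A sketch (package prefix assumed):

(langutils:chunk "The big dog chased the cat over the fence.")
;; => a list of phrase objects (noun, verb, adverb and prepositional chunks)
;;    that reference the underlying tagged vector-document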

----------------------------------------
get-all-chunks ( doc )
----------------------------------------
Input:  doc - a vector-document

Output: A list of chunks of all the primitive types (verb, adverb, prepositional, and noun phrases)

Related functions:

Notes:

These two functions could search over sequences of phrases, but usually those searches are done alone and not on top of a more primitive verb/noun/adverb decomposition. Also note that common prepositional idioms (by way of, in front of, etc.) are not typically captured properly and would need to be special-cased (i.e., they would be VP-sNP-P-NP, where sNP is a special type of NP, instead of the usual VP-P-NP verb-argument formulation).

CONCEPTS

Concepts are a CLOS abstraction over token sequences that establishes identity over lemmatized phrases. This supports special applications (ConceptNet, LifeNet) at the MIT Media Lab but might be more generally useful.

----------------------------------------
concept
----------------------------------------
The 'concept' is a CLOS object with the following methods:

----------------------------------------
lookup-canonical-concept-instance ( ta )
----------------------------------------
Input:

Output: A concept instance

EXAMPLE USES

See the file example.lisp. This shows basic use of the tagger, tokenizer, lemmatizer and chunker interfaces.

More examples of use can be generated if enough mail is sent to the author to invoke a guilt-driven re-release of the library with improved documentation.