<div>
  <h1>chtml-matcher</h1>
  <p>
    <a href="https://github.com/eslick/chtml-matcher.git">https://github.com/eslick/chtml-matcher.git</a>
  </p>

  
    <code class="clone">
    git clone 'https://github.com/eslick/chtml-matcher.git'
      <br>
      <br>
(ql:quickload :chtml-matcher)
    </code>
  

  <hr>
  

<h2>A simple Lisp-based DSL for extracting information from web pages</h2>

<p>chtml-matcher performs pattern-based unification over HTML via a set
of compiled nested closures. It uses the closure-html library to
parse HTML into LHTML, a Lisp form of HTML. You pass a template list to
<code>(match-template template lhtml)</code>, which returns a bindings object
containing an alist of all the extracted information.</p>
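
<p>To make the data shapes concrete, here is a minimal sketch of an LHTML
fragment and a template that mirrors it. The fragment follows the
(tag attributes children...) layout that closure-html&apos;s lhtml builder
produces. The sample data, the helper names and the template itself are
assumptions made for illustration; they are not part of chtml-matcher.</p>

<pre><code>;; LHTML: keyword tags, attributes as (name value) lists, text as strings.
(defparameter *sample-lhtml*
  '(:div ((:class &quot;post&quot;))
     (:a ((:href &quot;/thread/42&quot;)) &quot;Thread title&quot;)
     &quot;trailing text&quot;))

;; Hypothetical accessors, defined only for this sketch.
(defun node-tag (node) (first node))
(defun node-attributes (node) (second node))
(defun node-children (node) (cddr node))
(defun node-attribute (node name)
  (second (assoc name (node-attributes node))))

;; A template mirrors the same shape, with ?variables where data
;; should be captured.
(defparameter *sample-template*
  '(div ((class &quot;post&quot;))
     (a ((href ?uri)) ?title)))

;; Per the description above, the two would then be combined with:
;;   (match-template *sample-template* *sample-lhtml*)
</code></pre>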

<p>The semantics are reasonably intuitive, but it may take a little
experimentation to get a feel for how to solve the most common
problems. The API is small, and package.lisp provides pointers to
where to look. The whole library is less than 1k lines of code, so it
is easy enough to read through.</p>

<h2>Download and Dependencies</h2>

<p>Clone it from GitHub. The <a href="http://common-lisp.net/project/chtml-matcher">old
repository</a> on
common-lisp.net is deprecated.</p>

<p>chtml-matcher depends on my home-brew cl-stdutils, closure-html,
cl-ppcre, and f-underscore, although all but closure-html could be
removed if necessary.</p>

<h2>LXML Template Unification</h2>

<p>The DSL provides a light-weight way to extract fields from nested
HTML/XML structure represented in LHTML (as produced by closure-html).
A template is a declarative representation of substructure with
embedded variables that are bound when the substructure matches.</p>

<p>Substructure is loosely matched: if any given body element
doesn&apos;t match, the next child is considered, until either all the template
body elements have matched an LHTML element or the end of the element
has been reached without a match.</p>
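
<p>The loose matching rule above can be sketched in a few lines of plain
Common Lisp. This is a toy model that compares tag names only, not the
library&apos;s implementation; <code>tag-name=</code> and
<code>loose-match-p</code> are hypothetical names.</p>

<pre><code>;; Compare tags by symbol name so that TD in a template matches :TD in LHTML.
(defun tag-name= (template-tag lhtml-tag)
  (string= (symbol-name template-tag) (symbol-name lhtml-tag)))

;; True if every tag in TEMPLATE-TAGS matches some element of CHILDREN,
;; in order. Children that do not match (including text nodes) are skipped.
(defun loose-match-p (template-tags children)
  (cond ((null template-tags) t)          ; every template element matched
        ((null children) nil)             ; ran out of children first
        ((and (consp (first children))
              (tag-name= (first template-tags) (first (first children))))
         (loose-match-p (rest template-tags) (rest children)))
        (t (loose-match-p template-tags (rest children)))))

(loose-match-p '(td td)
               '((:th nil) (:td nil &quot;a&quot;) &quot;text&quot; (:td nil &quot;b&quot;)))  ; =&gt; T
(loose-match-p '(td td)
               '((:td nil &quot;a&quot;)))                                ; =&gt; NIL
</code></pre>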

<p>Prepending &lt; to a tag enables a depth-first search for that tag, so you
can avoid specifying the parent path (similar to // in XPath).</p>
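
<p>As a rough illustration of what such a search does, the following
sketch walks an LHTML tree depth-first and returns the first element
with a given tag. It is a stand-alone toy, not the code the library
compiles for a &lt; tag; <code>find-tag-depth-first</code> is a
hypothetical name.</p>

<pre><code>;; Return the first element named TAG in NODE or its descendants, or NIL.
(defun find-tag-depth-first (tag node)
  (cond ((not (consp node)) nil)          ; text nodes have no tag
        ((string= (symbol-name tag) (symbol-name (first node))) node)
        (t (some (lambda (child) (find-tag-depth-first tag child))
                 (cddr node)))))          ; skip the tag and attribute list

(find-tag-depth-first
 'tbody
 '(:html nil (:body nil (:div nil (:table nil (:tbody nil (:tr nil)))))))
;; =&gt; (:TBODY NIL (:TR NIL))
</code></pre>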

<p>When a matching template includes a variable reference, a binding set
is created and returned, provided all elements of the template node
match.</p>

<p>Additional reserved operators give you more flexibility in
managing what is matched and how bindings from subtrees are combined:</p>

<p><code>all</code>: matches the same template multiple times over the children of a
     given node and stores the results as a list attached to a fresh bindings list.</p>

<p><code>merge</code>: creates a single binding out of the sub-bindings. A
     node body has an implicit merge.</p>

<p><code>nth</code>: finds the nth instance that matches the full body of this operator.</p>

<p><code>regex</code>: matches if the regex returns register values for a string (as a list).</p>

<p><code>fn</code>: runs the referenced function symbol on the current parse state and
    returns bindings, t, or nil as appropriate.</p>
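
<p>The difference between how <code>merge</code> and <code>all</code> combine
sub-bindings may be easier to see on plain alists. The sketch below only
models the resulting shapes as described above; the function names and
sample data are hypothetical, and the library&apos;s bindings object may
differ in detail.</p>

<pre><code>;; Two sub-binding alists, as two matched subtrees might produce them.
(defparameter *name-bindings* '((:thread-name . &quot;First thread&quot;)))
(defparameter *uri-bindings*  '((:thread-uri . &quot;/t/1&quot;)))

;; merge: fold the sub-bindings into one flat alist.
(defun merge-sketch (alists)
  (reduce #'append alists :from-end t))

;; all: keep each match's alist separate, stored as a list under KEY
;; in a fresh alist.
(defun all-sketch (key alists)
  (list (cons key alists)))

(merge-sketch (list *name-bindings* *uri-bindings*))
;; =&gt; ((:THREAD-NAME . &quot;First thread&quot;) (:THREAD-URI . &quot;/t/1&quot;))

(all-sketch :records (list *name-bindings* *uri-bindings*))
;; =&gt; ((:RECORDS ((:THREAD-NAME . &quot;First thread&quot;)) ((:THREAD-URI . &quot;/t/1&quot;))))
</code></pre>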

<h2>Example</h2>

<p>I&apos;ve recently been mining some posts from vBulletin sites. I go to the
last day&apos;s posts, get a list of all the new posts, then go to the
thread and grab the post body. The following two templates do 90% of
the work. Of course, I have to write code to convert the data I
extract to web page fetches, etc.</p>

<pre><code>(defparameter *vbulletin-search-template*
  '(&lt;tbody nil
     (all ?records
       (tr nil
         (td nil)
         (td ((class &quot;alt1&quot;))
           (div nil
             (a ((href ?thread-uri))
               ?thread-name)))
         (td ((class &quot;alt2&quot;) (title ?activity))
           (div nil ?post-date
             (span nil ?post-time)
             (a ((href ?user-uri))
               ?username)
             (a ((href ?last-post-uri)))))))))
    =&gt;
    '(:records ((:thread-name . &quot;Thread name&quot;)
                (:thread-uri  . &quot;Thread URI&quot;)
                (:post-date   . &quot;Date String&quot;)
                (:post-time   . &quot;Time String&quot;)
                (:username    . &quot;Username&quot;)
                (:user-uri    . &quot;URI String&quot;)
                (:last-post-uri . &quot;URI String&quot;))
               ...)
</code></pre>

<p>This looks for a table body in the search results page, then gets
bindings for all matching <code>&lt;tr&gt;</code> elements and puts them within another
bindings object bound to :records as specified by &lsquo;all&rsquo;. The pattern
pulls out all the user, thread, post and date information for all
results. You can match elements on strings, regular expressions and
arbitrary function calls as well.</p>
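
<p>Once the records are extracted, ordinary alist functions are enough to
work with them. The sketch below assumes the result is available as an
alist whose :records entry holds one alist per row, following the shape
illustrated above; the sample data and helper names are made up, and the
real bindings object may need to be unwrapped first.</p>

<pre><code>;; Hypothetical sample data in the shape illustrated above.
(defparameter *sample-result*
  '((:records
     ((:thread-name . &quot;First thread&quot;)  (:thread-uri . &quot;/t/1&quot;) (:username . &quot;alice&quot;))
     ((:thread-name . &quot;Second thread&quot;) (:thread-uri . &quot;/t/2&quot;) (:username . &quot;bob&quot;)))))

;; Look up one field of a single record.
(defun record-field (record key)
  (cdr (assoc key record)))

;; Collect the thread URI of every record, e.g. to fetch each thread next.
(defun thread-uris (result)
  (mapcar (lambda (record) (record-field record :thread-uri))
          (cdr (assoc :records result))))

(thread-uris *sample-result*)   ; =&gt; (&quot;/t/1&quot; &quot;/t/2&quot;)
</code></pre>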

<p>I use subst to customize the following pattern to find a particular
post in a page. It replaces &lsquo;post_message_?&rsquo; with a unique id for a
post then returns its thread number and the entire post body.</p>

<pre><code>(defparameter *vbulletin-post-template*
  `(&lt;tbody nil
     (tr nil (&lt;a ((name ?post-num))))
     (tr nil)
     (tr nil (?post-body &lt;div ((id &quot;post_message_?&quot;))))))
</code></pre>
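
<p>The subst customization described above might look like the following
sketch. The template is repeated under a different name so the snippet
stands alone; <code>template-for-post</code> and the post id are
hypothetical. Note that <code>subst</code> compares with
<code>eql</code> by default, so replacing a string needs
<code>:test #'equal</code>.</p>

<pre><code>(defparameter *post-template-sketch*
  '(&lt;tbody nil
     (tr nil (&lt;a ((name ?post-num))))
     (tr nil)
     (tr nil (?post-body &lt;div ((id &quot;post_message_?&quot;))))))

;; Return a copy of TEMPLATE in which the placeholder id is replaced by
;; the id for POST-ID. The original template is left untouched.
(defun template-for-post (template post-id)
  (subst (format nil &quot;post_message_~A&quot; post-id)
         &quot;post_message_?&quot;
         template
         :test #'equal))

(template-for-post '(div ((id &quot;post_message_?&quot;))) 1234)
;; =&gt; (DIV ((ID &quot;post_message_1234&quot;)))
</code></pre>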

<p>I use Firebug in Firefox to inspect the HTML tree, identify the best
unique enclosing context I can specify, and then provide enough
structure to uniquely capture the data I want. This approach is highly
robust to many small HTML changes and should be reasonably fast.</p>

  
</div>