Content.pm

Code Index:



NAME

EulerMB::Content - Parses, validates, and formats (as HTML) EulerMB Text


SYNOPSIS

 use EulerMB::Content;
 my $content = new EulerMB::Content(
     texpath        => '/cgi-bin/texserver',
     eulermbpath    => '/emb',
     features       => 'ALL',
     meid           => undef,
     transaction_id => undef,
     javapics       => undef,
     thread_link    => sub { ... }
 );
 my $html = $content->embtext_to_html(\$embtext, validate => 1);
 my $warnings = $content->warnings();

Example content:

 a_{n}[sqrt]2 + e^{x+1} = y.
 \object{graph.png} [myjavaimage]
  4     6
 --- + ---
  5     7
 x < y > z.
 <b>text</b><span style="color:blue">t&eacute;st</span>
 \begin{rawtext}
   How to input: <b>text</b><span style="color:blue">t&eacute;st</span>
 \end{rawtext}
 \begin{rawhtml}
   <table><tr>
   <td>123</td><td>234</td>
   </tr></table>
 \end{rawhtml}
 \( \int_2^3 \frac{x}{x+1} dx \)
 \( \sqrt{2} \) \[ \sqrt{2} \] \begin{math} \sqrt{2} \end{math}


Description

This module parses, validates, and formats (as HTML) EulerMB Text (Embtex). Embtext is a text language for composing messages within the EulerMB message board software (http://math2.org/eulermb/), which the Math Message Board (http://math2.org/mmb/) uses.

Embtext is a mix of HTML, LaTeX, plain text, and other formatting. It is intended to be written by people (rather than computers). In particular it has support for math notation and graphics as it was originally developed for a math message board. This module supports environments where the Embtext input is untrusted (e.g. message boards). The HTML and LaTeX notation is validated and sanitized before display.

With a few exceptions, LaTeX support is limited to math notation and is primarily used for inserting professionally formatted math notation into documents. The LaTeX math support requires the texserver, which is an external component (not currently publicly available), which in turn requires LaTeX. Some additional non-math primitives are available using a LaTeX-like notation (e.g. \thread{123}), but these are processed by this module itself without calling into LaTeX.

A few additional notations are supported for convenience. Block tags, e.g. [sqrt], can be used instead of LaTeX to insert simple math symbols. Other block tags display certain images inline drawing using the JavaDraw Java applet drawing tool (not currently publically available). A super- and subscript notation, e^{x^{2}+1}+a_{b}, is converted to HTML SUP/SUB tags.

This module can be used in a variety of applications, such as for message boards, as an HTML compiler, or as a dynamic page filter on a web server. The module is intended to be reasonably fast (for Perl), mature and robust (with an extensive test suite), and well designed. In the design, the lexing, validating, and rendering are implemented as a pipeline of separate coroutines, which allows selected combinations of these steps to be used or your own processes injected in the middle. For example, to display a message board message as HTML, the validation step may be omitted if message has already been validated and accepted.

Design criteria

(1) Robustness Principle: ``Be conservative in what you do, be liberal in what you accept from others.'' http://www.ietf.org/rfc/rfc793.txt.

This mandates that while we may accept Embtext input that is not fully valid HTML (e.g. <u>></u>), the HTML output generated by this module must be valid HTML (e.g. <u>&gt;</u>).

However, the meaning of input not always initially clear (e.g. ``x<y and y>z''). Here, the author's intention is that this math expression be recognized as text rather than as containing the HTML tag <y and y>, where the attribute values are missing. Therefore, this module recognizes strings as HTML only when the interpretation is clear (e.g. the attribute values are present and quoted). So, this example is escaped as ``x&lty and y&gt;z''.

(2) Secure on untrusted input

This module shall accept untrusted input (e.g. from arbitrary users on the web) and securely convert it to HTML that can be securely displayed on other users' web browsers. Some features of HTML (e.g. JavaScript) are prevented or filtered out of the user input.

(3) Reasonably terse syntax, easy to read in source code format.

Embtext is designed to be created by people rather than computers. It may be authored in an text-only environment, such as HTML TEXTAREA form element, not a realtime graphical ``what-you-see-it-what-you-get'' (WYSIWYG) one. Therefore, the syntax must be clear and terse. We do not encourage verbose syntax such as XML and especially MathML. Rather, we prefer LaTeX-like notations, simple HTML, or notations similar to that used by Wikipedia (http://www.wikipedia.org) or Epytext (http://epydoc.sourceforge.net/epytext.html).

(4) Represent structure not style.

This is a goal, at least in part, of HTML4, LaTeX, and others (e.g. DocBook, http://www.docbook.org). The idea is that the text represents structure, and the specific style (e.g. font, colors, positioning) can be easily adjusted by someone other than the author.

(5) Sufficiently flexible

Embtext should be able to extend to support new features. Currently, new commands can be added using LaTeX-like notation (e.g. \thread{123}).

Implemenation Design Criteria

(1) The lexing, parsing, and HTML rendering steps are independent.

The conversion from Embtext to HTML is done in a pipeline implemented with semi-coroutines provided by the the EulerMB::Coroutine manpage module. Selected combinations of these steps can be used or omitted from the pipeline, or your own steps can be added at any position. For example, when generating HTML, the validation step may be omitted if the content was already validated.

(2) Conversion is O(1) in memory usage.

The module lexes, parses, and translated to HTML token-by-token. This is a relatively minor goal, but the module achieves it without much added complexity by using semi-coroutines. The issues here are well described in ``Coroutines in C'' (http://www.chiark.greenend.org.uk/~sgtatham/coroutines.html).

Grammar Constructs

The following constructs are supported in the language.

Block-style tags for symbols or JavaDraw images. e.g. [sqrt].

Superscripts and subscripts. e.g. e^{x^{2}+1} + log_{b} x.

Embedded images uploaded. e.g. \object{myfunction.png}.

Limited set of HTML. e.g. <b>This is bold.</b> <span style=``color:blue''>test</a>.

Special LaTeX-like tags. \thread{123} - hyperlink to EulerMB discussion thread with ID 123

ASCII text graphics. ***** ****** * ** *****

Raw HTML (to prevent extra whitespace being inserted as above): \begin{rawhtml} <table> <tr><td>123</td><td>234</td></tr> </table> \end{rawhtml}

Raw text (to prevent characters being rendered as HTML): \begin{rawtext} How to write HTML: <html><body>...</body></html> \end{rawtext}

LaTeX math (when the texserver is installed). \[ \frac{x+1}{x-\sqrt{y}} + \int_a^b f(x) \, dx. \] \( ... \) \begin{math}...\end{math}

Grammar

The Embtext grammar is roughly as follows:

 embtext := 
   latex_environment | latex_command | block_tag |
   html_tag | html_entity | text | script
 latex_environment := begin_environment latex_content end_environment
 latex_content := math_content | rawtext_content | rawhtml_content |
   other_content
 rawhtml_content := (html_tag | html_entity | text)*
 script := script_begin embtext script_end
 begin_environment := E
 end_environment := F
 latex_command := C
 math_content := I
 rawtext_content := T
 other_content := I
 html_tag := H
 html_entity := N
 script_begin := R
 script_end := S
 block_tag := B
 text := T

Functions and Methods

new
Constructs a new parser/validator/formatter object.
 my $content = new EulerMB::Content(%params);

where %params is key-value pairs:

  texpath        default  '/cgi-bin/texserver'
  eulermbpath    default  '/emb'
  features       default  'ALL'
  meid           default  undef
  transaction_id default  undef
  javapics       default  undef
  thread_link    default  undef (coderef)

texpath is an absolute URL to the texserver. The generated HTML may contain links to the texserver in order to display LaTeX math. This parameter is only needed if HTMLization is performed and LaTeX math is not disabled.

``eulermbpath'' is an absolute URL to the EulerMB static files. These include images that get embedded into the generated HTML. This parameter is only needed if HTMLization is performed and certain features that would result in such links (e.g. [sqrt] tags) are not disabled.

``features'' identifies which types of tokens shall be recognized in the lexing, validation, and rendering. See the features methods for details.

``meid'' is the message identifier. This parameter is only needed if HTMLization is performed and certain features (e.g. file attachments) are not disabled. The generated HTML may contain links that reference this identifier.

``transaction_id'' is the transaction identifier. This parameter is only needed if HTMLization is performed and certain features (e.g. file attachments) are not disabled. This parameter is complimentary to meid and is typically used in previewing a message in the composition stage (when neither the message nor its message identifier yet exist).

``javapics'' is any JavaDraw data associated with the current message. This parameter is only needed if HTMLization is performed and the JavaDraw feaure is not disabled. This parameter is used to properly generate the image tags in the HTML (e.g. width and height parameters).

``thread_link'' is code reference that will handle \thread{...} tags. This parameter is only needed if HTMLization if performed and \thread{...} tags are not disabled. If absent, default rendering is applied. Often, default rendering is not sufficient because additional information is needed (e.g. the title of the thread), and such information may require querying a database.

clone
Clones the object. Additional parameters (%params) accepted by ``new'' may optionally be passed as well and those will override any parameters in the original object.

features
Gets of sets the features enabled. The parameter or return value is a hash of key-value pairs. Optionally, you may pass a comma-delimited string of features or ``ALL'', which represents all features.
 $self->features(H => 1, R => 1);
 $self->features('H,E,ET,EH,EO,C,R,B');
 $self->features('ALL');
 my %features = $self->features();

Possible features:

 H - enable HTML tags
 HE - enable HTML entity references
 E - enable LaTeX-like environments
  ET - enable rawtext environments
  EH - enable rawhtml environments
  EO - enable math environments
 C - enable LaTeX-like commands (e.g. \thread{123})
 R - enable super- and subscripts notation ^{...} and _{...}.
 B - enable block tag (e.g. [sqrt] or [myimage])

gen_lex_embtext
Generate a coroutine to lex Embtext into tokens.
 my $source = $self->lex_embtext(\$embtext);
 while(my @token = $source->()) {
   ...
 }

$embtext is a string.

The format of each returned token varies by type, but generally it is of the form

 ($token_type, $raw_text, @additional_data)

$token_type is

 H - HTML tag
 N - HTML entity reference
 E - LaTeX environment begin tag
 F - LaTeX environment end tag
 I - LaTeX inside environment
 C - LaTeX simple tag
 R - Super/subscript begin
 S - Super/subscript end
 B - Block tag
 T - text or raw text
 X - error (exception)

$raw_text is the raw input text.

@additional data varies by token type. For example,

 'X' - ($error_message, $error_id, @more_data)

gen_validate_embtext
Generate a coroutine to validate a token stream.
 my $source = $self->gen_lex_embtex(\$embtext);
 $source2 = $self->gen_validate_embtex($source);
 while(my @token = $source2->()) {
   ...
 }

$source is either a token iterator or a list of tokens.

This coroutine performs higher-level validation (e.g. matching HTML start end-tags) than the validation performed by the lexer. This coroutine may insert ``X'' type tokens into the input stream as well as add or remove tokens to resolve formatting errors (e.g. add missing HTML end tags).

c<gen_token_iterator>
Returns a closure that iterators over the given list (of tokens).
 my $source = $content->gen_token_iterator(\@tokens);
 while(my $token = $source->()) { ... }

$source is either a token iterator or a list of tokens.

validate_embtext
Lex and validate Embtex. This is a wrapper around gen_lex_embtext and gen_validate_embtex.
 my @tokens = $self->validate_embtext($source);
 while(my @token = $source->()) {
   ...
 }

$source is a string reference, a token iterator, or a list of tokens.

lex_embtext
Lex Embtex into tokens. This is a wrapper around gen_lex_embtex.
 my @tokens = $self->lex_embtext(\$embtext);

textize
Generates a string representation of token source. This is essentially the inverse operation of embtext_lex.
 my $text = $content->textize($source);
 my $warnings = $content->warnings();

$source is either a token iterator or a list of tokens.OA

htmlize
Generates and string of HTML from a token source.
 my $html = $content->htmlize($source);
 my $warnings = $content->warnings();

$source is either a token iterator or a list of tokens.

embtext_to_html
Convert Embtext to HTML. This is a wrapper around the htmlizer, gen_lex_embtex, and gen_validate_embtex.
 my $html = $content->embtext_to_html(\$embtext, validate => $validate);
 my $warnings = $content->warnings();

where $embtext is a string of Embtext. $validate is a boolean indicating whether validation should be performed on the input. (optional--default 1) Any warnings triggered by the conversion can be retrieved with the warnings method.

text_to_html
Convert text to HTML by escaping characters and preserving appropriate whitespace. This is similar to CGI::escapeHTML except that it preserves whitespace.
 $html = EulerMB::Content::text_to_html($text);

match_latex
Recognize LaTeX environment at current position in string.
 my $info = $content->match_latex(\$text);

$text is a string to match on. matching is performed at pos($text). On success, $info is [$name, $pos_inner_start, $pos_inner_end], where $name is the environment name, $pos_inner_start and $pos_inner_end are 0-based character indicies indicated here before the ``*'' character here: ``\begin{math}*...\*end{math}''. pos($text) then points to the character immediately following the match. On error, return undef and leave pos($text) unchanged.

Generate HTML for LaTeX math image. =item make_latex_image

Generate HTML link for LaTeX image.

 my $html = $content->make_latex_image($tex);

$tex is string containin math LaTeX. $html is returned HTML.

make_symbol
Generate HTML for block symbols (e.g. [sqrt]).
 my $html = $content->make_symbol($name);

$name is the symbol identifier string. $html is the return HTML string. Returns undef on symbol not found.

make_embtex
Generate HTML for LaTeX-like command processed by Embtext.
 my $html = $content->make_embtext($name, $param);

$name is string containing command name. $param is string containing parameter $html is string containing returned HTML.

nbsp
Generate HTML containing proper whitespacing for plain text. Replace a sequence of N > 1 spaces not at the start of a line with one space and (N-1) &nbsp;s. Replace a sequence of N > 0 spaced at the start of a line with N &nbsp;s. This makes spacing and ASCII art render correctly in web browsers.
 $html = EulerMB::Content::nbsp($text);

make_javapicture
Generate HTML for inline JavaDraw picture.
 my $html = $content->make_javapicture($name);

$name is the identifier for the picture. Data for the picture is given in the javapics property of the content object. $html is the returned HTML as a string. Returns undef if picture not found or JavaDraw disabled.

make_attachment
Generate HTML for inline uploaded image. \object{...}
 my $html = $content->make_attachment($name);

$name is the identifier for the attachment. $html is the returned HTML.

match_htmltag
Parses HTML tag to tag type, name, and attribute key/values.
  my $info = EulerMB::Content::match_htmltag($tag);
 $tag is the HTML as a string.
 return
   ['S', $tagname, \%attr] for each start tag <tag>
   ['SE', $tagname, \%attr] for each start/end tag <tag/>
   ['E', $tagname] for each end tag </tag>
   $error on parsing error (e.g. duplicate attribute name)
   undef on no match   
 where
   $tagname - string - tag name
   %attr - hash of key-value attribute pairs
   $error - string - error

warnings
Retrieves reference to list of warning strings. Warnings can be generated by the embtext_to_html, htmlize, and textize methods.
 my $warnings = $self->warnings();

HTML Processing notes

Lexer Validation

Special characters < and > are not permitted within HTML tag attribute values. The presence of these does not cause an error but rather causes the string to be recognized as text. (possible change: returning an error).

HTML attribute values must be present on attributes and must be quoted, else the string will be recognized as text. HTML: <a href=``http://www.google.com''>. Text: <a href=http://www.google.com>. Text: <a href>. The HTML specification discourages unquoted attributes, and it is not conformant with XHTML. A decision to permit unquoted attributes could cause input such as ``a<b and x=2 and z>b'' to be recognized as HTML rather than text. The ambiguity would not occur in the rawhtml environment though.

Parser Validation

HTML start tags must have a matching end tag if and only if an end tag is required. Good: <br>. Bad: <br></br>. Good: <br />. Bad: <p>. Good: <p></p>. Good: <p/>. Missing end tags are added and flagged. Extraneous end tags are removed and flagged.

HTML start and end tags can overlap. Good: <b><i></i></b>. OK: <b><i></b></i> (IMPROVE: subject to change--the latter should be made an error).

HTML element and attribute names may be upper- or lower-case. Lower-case is preferred. Good: <a href=``...''>. OK: <A HREF=``...''>. XHTML conformance requires lower-case. (Possible change: allow upper case as input but internally convert to lower-case either in the validation or HTML rendering stage.)

Unrecognized HTML elements and attributes are not permitted. These tags are removed and flagged as errors. BAD: <asdf sdfg=``dfgh''>.

HTML forms, JavaScript, form elements, and embedded objects (e.g. applet, object tags) are not permitted. Such tags are removed and flagged as errors. Additional checks are made such as ``javascript:'' not being permitted in the href attribute of an ``a'' tag.

Various HTML tags and attributes that are possibly insecure are not permitted. These are removed and flagged as errors. Bad: <img>, <script>, <frame>, <form>, <p onclick=``...''>, ....

Numeric entity references, url(), and expression() are not permitted in CSS.

Various representations of entity references for ASCII '\0' are nor permitted. These are removed and flagged.

HTML rendering

The generated HTML is well-formed regardless of the Embtext input, at least when validation is enabled. This is an important design criteria.

Special characters <, >, and & not recognized as part of a valid HTML tag are escaped.

Errors (type ``X'') are rendered in-line, where they occur. The error message is HTML escaped.

TO DO

Possibly incorporate spell checker into the processing pipeline (validation step?) instead of having it be independent as it is now.

Possibly support new notations:

- standard notation for quotes referring to text in other messages.

- lists

- links to popular web sites (e.g. Mathworld or Wikipedia). This is partly an extension to the \thread{...} notation.

- Simplified math notation. Translating maybe \m{(x+1)/(x-1)+y/sqrt(z)} to \(\frac{x+1}{x-1}+\frac{y}{\sqrt{x}}\) and having it rendered by LaTeX as an image.

- standard notation for the inclusion of source code, possibly syntax highlighted (e.g. GNU enscript, http://people.ssh.fi/mtr/genscript/).

Cookbook

To validate Embtext, and alter the token stream right before the validation step:

 my $tokens = $content->lex_embtext(\$text);
 for my $token (@$tokens) {
     if($token->[0] eq 'H') { # HTML
         my $original = $token->[1];
         $token->[1] = &normalize_html_tag($original);
     }
 }
 $tokens = $content->gen_validate_embtext($tokens);
 my $text = $content->textize($tokens);
 my $warnings = $content->warnings();

Such a method was used in a database cleanup routine.

Dependencies

the EulerMB::Coroutine manpage - for semi-coroutine language construct

URI::Escape, CGI - for HTML generation, proper escaping

Params::Validate - to validate parameters to methods

References

(1) Jeffrey E. F. Freidl, Mastering Regular Expressions, 2nd ed, O'Reilly & Associates, 2002.


COPYRIGHT

(c) 1998-2004, David Manura. http://math2.org/david/contact. This module is licensed under the same terms as Perl itself.


package EulerMB::Content;
use EulerMB::Coroutine qw(:all);
use strict;
use URI::Escape;
use CGI;
use Params::Validate qw(:all);
#use Data::Dumper; # DEBUG
# note: for portability/robustness, use EulerMB::Coroutine not Coro
#use Coro::Cont; # REMOVED

our $VERSION = '2.00';

sub _trace(@);

# Mapping: symbol names in block tags --> image file names.
# This is used to render symbols (e.g. "[sqrt]") as images.
# Case of first character matters.
my %symbols = (
    'd_dx'    => 'd-dx.gif', # FIX: was "d/dx". "/" not recognized
    'inf'     => 'inf.gif',
    'integral'=> 'integral.gif',
    'pder'    => 'pder.gif',
    'product' => 'product.gif',
    'sqrt'    => 'sqrt.gif',
    'sum'     => 'sum.gif',
  
    'alpha'   => 'alpha-l.gif',
    'Alpha'   => 'alpha-u.gif',
    'beta'    => 'beta-l.gif',
    'Beta'    => 'beta-u.gif',
    'gamma'   => 'gamma-l.gif',
    'Gamma'   => 'gamma-u.gif',
    'delta'   => 'delta-l.gif',
    'Delta'   => 'delta-u.gif',
    'epsilon' => 'epsilon-l.gif',
    'Epsilon' => 'epsilon-u.gif',
    'zeta'    => 'zeta-l.gif',
    'Zeta'    => 'zeta-u.gif',
    'eta'     => 'eta-l.gif',
    'Eta'     => 'eta-u.gif',
    'theta'   => 'theta-l.gif',
    'Theta'   => 'theta-u.gif',
    'iota'    => 'iota-l.gif',
    'Iota'    => 'iota-u.gif',
    'kappa'   => 'kappa-l.gif',
    'Kappa'   => 'kappa-u.gif',
    'lambda'  => 'lambda-l.gif',
    'Lambda'  => 'lambda-u.gif',
    'mu'      => 'mu-l.gif',
    'Mu'      => 'mu-u.gif',
    'nu'      => 'nu-l.gif',
    'Nu'      => 'nu-u.gif',
    'xi'      => 'xi-l.gif',
    'Xi'      => 'xi-u.gif',
    'omicron' => 'omicron-l.gif',
    'Omicron' => 'omicron-u.gif',
    'pi'      => 'pi-l.gif',
    'Pi'      => 'pi-u.gif',
    'rho'     => 'rho-l.gif',
    'Rho'     => 'rho-u.gif',
    'sigma'   => 'sigma-l.gif',
    'Sigma'   => 'sigma-u.gif',
    'tau'     => 'tau-l.gif',
    'Tau'     => 'tau-u.gif',
    'upsilon' => 'upsilon-l.gif',
    'Upsilon' => 'upsilon-u.gif',
    'phi'     => 'phi-l.gif',
    'Phi'     => 'phi-u.gif',
    'chi'     => 'chi-l.gif',
    'Chi'     => 'chi-u.gif',
    'psi'     => 'psi-l.gif',
    'Psi'     => 'psi-u.gif',
    'omega'   => 'omega-l.gif',
    'Omega'   => 'omega-u.gif'
);

# HTML element information used for validation.
# From http://www.w3.org/TR/1999/REC-html401-19991224/index/elements.html
# Format: name => [allow, has_end_tag].
#   allow - whether the tag is allowed
#   has_end_tag - whether the tag can contain content and has matching
#     start and end tags (e.g. "<p>...</p>" v.s. "<br>".
# IMPROVE: should depreciated tags be allowed?
#   (e.g. HTML strict v.s. transitional)
# D = depreciated
my $element_infos = {
    'a'          => [1, 1],
    'abbr'       => [1, 1],
    'acronym'    => [1, 1],
    'address'    => [1, 1],
    'applet'     => [0, 1], # D
    'area'       => [0, 0],
    'b'          => [1, 1],
    'base'       => [0, 0],
    'basefont'   => [0, 0], # D
    'bdo'        => [0, 1],
    'big'        => [1, 1],
    'blockquote' => [1, 1],
    'body'       => [0, 1],
    'br'         => [1, 0],
    'button'     => [0, 1],
    'caption'    => [1, 1],
    'center'     => [1, 1], # D
    'cite'       => [1, 1],
    'code'       => [1, 1],
    'col'        => [1, 0],
    'colgroup'   => [1, 1],
    'dd'         => [1, 1],
    'del'        => [1, 1],
    'dfn'        => [1, 1],
    'dir'        => [1, 1], # D
    'div'        => [1, 1],
    'dl'         => [1, 1],
    'dt'         => [1, 1],
    'em'         => [1, 1],
    'fieldset'   => [0, 1],
    'font'       => [1, 1], # D
    'form'       => [0, 1],
    'frame'      => [0, 0],
    'frameset'   => [0, 1],
    'h1'         => [1, 1],
    'h2'         => [1, 1],
    'h3'         => [1, 1],
    'h4'         => [1, 1],
    'h5'         => [1, 1],
    'h6'         => [1, 1],
    'head'       => [0, 1],
    'hr'         => [1, 0],
    'html'       => [0, 1],
    'i'          => [1, 1],
    'iframe'     => [0, 1],
    'img'        => [0, 0],
    'input'      => [0, 0],
    'ins'        => [1, 1],
    'isindex'    => [0, 0], # D
    'kbd'        => [0, 1],
    'label'      => [0, 1],
    'legend'     => [0, 1],
    'li'         => [1, 1],
    'link'       => [0, 0],
    'map'        => [0, 1],
    'menu'       => [1, 1], # D
    'meta'       => [0, 0],
    'noframes'   => [0, 1],
    'noscript'   => [0, 1],
    'object'     => [0, 1],
    'ol'         => [1, 1],
    'optgroup'   => [0, 1],
    'option'     => [0, 1],
    'p'          => [1, 1],
    'param'      => [0, 0],
    'pre'        => [1, 1],
    'q'          => [1, 1],
    's'          => [1, 1], # S
    'samp'       => [1, 1],
    'script'     => [0, 1],
    'select'     => [0, 1],
    'small'      => [1, 1],
    'span'       => [1, 1],
    'strike'     => [1, 1], # S
    'strong'     => [1, 1],
    'style'      => [0, 1],
    'sub'        => [1, 1],
    'sup'        => [1, 1],
    'table'      => [1, 1],
    'tbody'      => [1, 1],
    'td'         => [1, 1],
    'textarea'   => [0, 1],
    'tfoot'      => [1, 1],
    'th'         => [1, 1],
    'thead'      => [1, 1],
    'title'      => [0, 1],
    'tr'         => [1, 1],
    'tt'         => [1, 1],
    'u'          => [1, 1],
    'ul'         => [1, 1],
    'var'        => [1, 1]
};

# HTML attribute info used for validation.
# From http://www.w3.org/TR/1999/REC-html401-19991224/index/attributes.html
# Format: tag_name => [allow, modifier, atts]
#    allow - whether attribute is allowed at all
#    modifier -
#      'a' - accept list follows
#      'd' - deny list follows
#    atts - accepted or denied elements associated with this attribute.
#      This is a listref transformed into a hashref at runtime.
# Note: a number of these are depreciated either globally or on
#   certain elements.
my $att_infos =
{
  'abbr'           => [1, 'a', ['td', 'th']],
  'accept-charset' => [0, 'a', ['form']],
  'accept'         => [0, 'a', ['form', 'input']],
  'accesskey'      => [0, 'a', ['a', 'area', 'button', 'input',
                                'label', 'legend', 'textarea']],
  'action'         => [0, 'a', ['form']],
  'align'          => [1, 'a', ['caption', 'applet', 'iframe',
                                'img', 'input', 'object', 'legend',
                                'table', 'hr', 'div',
                                'h1', 'h2', 'h3', 'h4', 'h5', 'h6',
                                'p', 'col', 'colgroup', 'tbody', 'td',
                                'tfoot', 'th', 'thead', 'tr']],
  'alink'          => [0, 'a', ['body']],
  'alt'            => [0, 'a', ['applet', 'area', 'img', 'input']],
  'archive'        => [0, 'a', ['applet', 'object']],
  'axis'           => [1, 'a', ['td', 'th']],
  'background'     => [0, 'a', ['body']],
  'bgcolor'        => [1, 'a', ['table', 'tr', 'td', 'th', 'body']],
  'border'         => [1, 'a', ['table', 'img', 'object']],
  'cellpadding'    => [1, 'a', ['table']],
  'cellspacing'    => [1, 'a', [ 'table']],
  'char'           => [1, 'a', ['col', 'colgroup', 'tbody', 'td', 'tfoot',
                                'th', 'thead', 'tr']],
  'charoff'        => [1, 'a', ['col', 'colgroup', 'tbody', 'td',
                                'tfoot', 'th', 'thead', 'tr']
  ],
  'charset'        => [0, 'a', ['a', 'link', 'script']],
  'checked'        => [0, 'a', ['input']],
  'cite'           => [1, 'a', ['blockquote', 'q', 'del', 'ins']],
  'class'          => [0, 'd', ['base', 'basefont', 'head', 'html',
                                'meta', 'param', 'script', 'style', 'title']],
  'classid'        => [0, 'a', ['object']],
  'clear'          => [1, 'a', ['br']],
  'code'           => [0, 'a', ['applet']],
  'codebase'       => [0, 'a', ['object', 'applet']],
  'codetype'       => [0, 'a', ['object']],
  'color'          => [1, 'a', ['basefont', 'font']],
  'cols'           => [0, 'a', ['frameset', 'cols']],
  'colspan'        => [1, 'a', ['td', 'th']],
  'compact'        => [1, 'a', ['dir', 'dl', 'menu', 'ol', 'ul']],
  'content'        => [0, 'a', ['meta']],
  'coords'         => [0, 'a', ['area', 'a']],
  'data'           => [0, 'a', ['object']],
  'datetime'       => [1, 'a', ['del', 'ins']],
  'declare'        => [0, 'a', ['object']],
  'defer'          => [0, 'a', ['script']],
  'dir'            => [0, 'd', ['applet', 'base', 'basefont', 'bdo',
                                'br', 'frame', 'frameset', 'iframe',
                                'param', 'script']],
  'disabled'       => [0, 'a', ['button', 'input', 'optgroup',
                                'option', 'select', 'textarea']],
  'enctype'        => [0, 'a', ['form']],
  'face'           => [1, 'a', ['basefont', 'font']],
  'for'            => [0, 'a', ['label']],
  'frame'          => [1, 'a', ['table']],
  'frameborder'    => [1, 'a', ['frame', 'iframe']],
  'headers'        => [1, 'a', ['td', 'th']],
  'height'         => [1, 'a', ['iframe', 'td', 'th', 'img', 'object',
                                'applet']],
  'href'           => [1, 'a', ['a', 'area', 'link', 'base']],
  'hreflang'       => [0, 'a', ['a', 'link']],
  'hspace'         => [0, 'a', ['applet', 'img', 'object']],
  'http-equiv'     => [0, 'a', ['meta']],
  'id'             => [0, 'd', ['base', 'head', 'html', 'meta', 'script',
                                'style', 'title']],
  'ismap'          => [0, 'a', ['img', 'input']],
  'label'          => [0, 'a', ['option', 'optgroup']],
  'lang'           => [0, 'd', ['applet', 'base', 'basefont', 'br', 'frame',
                                'frameset', 'iframe', 'param', 'script']],
  'language'       => [0, 'a', ['script']],
  'link'           => [0, 'a', ['body']],
  'longdesc'       => [0, 'a', ['img', 'frame', 'iframe']],
  'marginheight'   => [0, 'a', ['frame', 'iframe']],
  'marginwidth'    => [0, 'a', ['frame', 'iframe']],
  'maxlength'      => [0, 'a', ['input']],
  'media'          => [0, 'a', ['style', 'link']],
  'method'         => [0, 'a', ['form']],
  'multiple'       => [0, 'a', ['select']],
  'name'           => [0, 'a', ['button', 'textarea', 'applet', 'select',
                                'form', 'frame', 'iframe', 'img', 'a',
                                'input', 'object', 'map', 'param', 'meta']],
  'nohref'         => [0, 'a', ['area']],
  'noresize'       => [0, 'a', ['frame']],
  'noshade'        => [1, 'a', ['hr']],
  'nowrap'         => [1, 'a', ['td', 'th']],
  'object'         => [0, 'a', ['applet']],
  'onblur'         => [0, 'a', ['a', 'area', 'button', 'input', 'label',
                                'select', 'textarea']],
  'onchange'       => [0, 'a', ['input', 'select', 'textarea']],
  'onclick'        => [0, 'd', ['applet', 'base', 'basefont', 'bdo', 'br',
                                'font', 'frame', 'frameset', 'head', 'html',
                                'iframe', 'isindex', 'meta', 'param',
                                'script', 'style', 'title']],
  'ondblclick'     => [0, 'd', ['applet', 'base', 'basefont', 'bdo', 'br',
                                'font', 'frame', 'frameset', 'head',
                                'html', 'iframe', 'isindex', 'meta',
                                'param', 'script', 'style', 'title']],
  'onfocus'        => [0, 'a', ['a', 'area', 'button', 'input', 'label',
                                'select', 'textarea']],
  'onkeydown'      => [0, 'd', ['applet', 'base', 'basefont', 'bdo', 'br',
                                'font', 'frame', 'frameset', 'head', 'html',
                                'iframe', 'isindex', 'meta', 'param',
                                'script', 'style', 'title']],
  'onkeypress'     => [0, 'd', ['applet', 'base', 'basefont', 'bdo', 'br',
                                'font', 'frame', 'frameset', 'head', 'html',
                                'iframe', 'isindex', 'meta',
                                'param', 'script', 'style', 'title']],
  'onkeyup'        => [0, 'd', ['applet', 'base', 'basefont', 'bdo', 'br',
                                'font', 'frame', 'frameset', 'head', 'html',
                                'iframe', 'isindex', 'meta', 'param',
                                'script', 'style', 'title']],
  'onload'         => [0, 'a', ['frameset', 'body']],
  'onmousedown'    => [0, 'd', ['applet', 'base', 'basefont', 'bdo', 'br',
                                'font', 'frame', 'frameset', 'head', 'html',
                                'iframe', 'isindex', 'meta',
                                'param', 'script', 'style', 'title']],
  'onmousemove'    => [0, 'd', ['applet', 'base', 'basefont', 'bdo', 'br',
                                'font', 'frame', 'frameset', 'head', 'html',
                                'iframe', 'isindex', 'meta',
                                'param', 'script', 'style', 'title']],
  'onmouseout'     => [0, 'd', ['applet', 'base', 'basefont', 'bdo', 'br',
                                'font', 'frame', 'frameset', 'head', 'html',
                                'iframe', 'isindex', 'meta', 'param',
                                'script', 'style', 'title']],
  'onmouseover'    => [0, 'd', ['applet', 'base', 'basefont', 'bdo', 'br',
                                'font', 'frame', 'frameset', 'head', 'html',
                                'iframe', 'isindex', 'meta', 'param',
                                'script', 'style', 'title']],
  'onmouseup'      => [0, 'd', ['applet', 'base', 'basefont', 'bdo', 'br',
                                'font', 'frame', 'frameset', 'head', 'html',
                                'iframe', 'isindex', 'meta', 'param',
                                'script', 'style', 'title']],
  'onreset'        => [0, 'a', ['form']],
  'onselect'       => [0, 'a', ['input', 'textarea']],
  'onsubmit'       => [0, 'a', ['form']],
  'onunload'       => [0, 'a', ['frameset']],
  'onunload'       => [0, 'a', ['body']],
  'profile'        => [0, 'a', ['head']],
  'prompt'         => [0, 'a', ['isindex']],
  'readonly'       => [0, 'a', ['textarea', 'input']],
  'rel'            => [0, 'a', ['a', 'link']],
  'rev'            => [0, 'a', ['a', 'link']],
  'rows'           => [0, 'a', ['frameset', 'textarea']],
  'rowspan'        => [1, 'a', ['td', 'th']],
  'rules'          => [1, 'a', ['table']],
  'scheme'         => [0, 'a', ['meta']],
  'scope'          => [1, 'a', ['td', 'th']],
  'scrolling'      => [0, 'a', ['frame', 'iframe']],
  'selected'       => [0, 'a', ['option']],
  'shape'          => [0, 'a', ['area', 'a']],
  'size'           => [1, 'a', ['hr', 'font', 'input', 'basefont', 'select']],
  'span'           => [1, 'a', ['col', 'colgroup']],
  'src'            => [1, 'a', ['script', 'input', 'frame', 'iframe', 'img']],
  'standby'        => [0, 'a', ['object']],
  'start'          => [1, 'a', ['ol']],
  'style'          => [1, 'd', ['base', 'basefont', 'head', 'html', 'meta',
                                'param', 'script', 'style', 'title']],
  'summary'        => [1, 'a', ['table']],
  'tabindex'       => [0, 'a', ['a', 'area', 'button', 'input', 'object',
                                'select', 'textarea']],
  'target'         => [1, 'a', ['a', 'area', 'base', 'form', 'link']],
  'text'           => [0, 'a', ['body']],
  'title'          => [1, 'd', ['base', 'basefont', 'head', 'html', 'meta',
                                'param', 'script', 'title']],
  'type'           => [1, 'a', ['a', 'link', 'object', 'param', 'script',
                                'style', 'input', 'li', 'ol', 'ul',
                                'button']],
  'usemap'         => [0, 'a', ['img', 'input', 'object']],
  'valign'         => [1, 'a', ['col', 'colgroup', 'tbody', 'td', 'tfoot',
                                'th', 'thead', 'tr']],
  'value'          => [1, 'a', ['input', 'option', 'param', 'button', 'li']],
  'valuetype'      => [0, 'a', ['param']],
  'version'        => [0, 'a', ['html']],
  'vlink'          => [0, 'a', ['body']],
  'vspace'         => [1, 'a', ['applet', 'img', 'object']],
  'width'          => [1, 'a', ['hr', 'iframe', 'img', 'object', 'table',
                                'td', 'th', 'applet', 'col', 'colgroup',
                                'pre']]
};
$_->[2] = { map {($_=>1)} @{$_->[2]} } for values %$att_infos;

# match HTML attribute
my $re_htmlatt = qr(
        (?: ([a-zA-Z]+) \s* = \s*
                (?: \' ([^\'<>\r\n\f]*) \' |
                        \" ([^\"<>\r\n\f]*) \"
                )
        )
)xs;
# match HTML tag
my $re_htmltag = qr{
        <   [a-zA-Z]+
                (?: \s $re_htmlatt?)*
                /? >
        |
        < / [a-zA-Z]+ \s* >
}xs;
# match HTML entity reference
my $re_htmlentity = qr{
        \& (?: ( [a-zA-Z]+ ) |
                      \# ( \d{1,5} ) |
                      \# [xX] ( [0-9a-fA-F]{1,4} )
              )
        ;
}xs;

#IMPROVE-handling if params are undef when features are disabled.
sub new
{
    my $class = shift;
    my %params = validate(@_, {
        texpath        => {type => SCALAR, default => '/cgi-bin/texserver'},
        eulermbpath    => {type => SCALAR, default => '/emb'},
        features       => {type => SCALAR, default => 'ALL'},
        meid           => {type => SCALAR | UNDEF, default => undef},
        transaction_id => {type => SCALAR | UNDEF, default => undef},
        javapics       => {type => SCALAR | UNDEF, default => ''},
        thread_link    => {type => CODEREF | UNDEF, default => undef}
    });
    #IMPROVE-possibly
    die 'ASSERT' if defined($params{transaction_id}) &&
                    $params{transaction_id} !~ /^\d{1,20}$/;
    die 'ASSERT' if defined($params{meid}) &&
                    $params{meid} !~ /^\d{1,9}$/;
    die 'ASSERT' if defined($params{javapics}) &&
                    (length($params{javapics}) > 1E6 ||
                     $params{javapics} !~ /^[a-zA-Z0-9_;,\*]*$/);
    my $self = bless {
        texpath        => $params{texpath},
        eulermbpath    => $params{eulermbpath},
        features       => undef,
        meid           => $params{meid},
        transaction_id => $params{transaction_id},
        javapics       => $params{javapics},
        thread_link    => $params{thread_link}
    }, $class;
    $self->features($params{features});

    return $self;
}

sub clone
{
    my $self = shift;
    my %params = validate(@_, {
        texpath        => {type => SCALAR, optional => 1},
        eulermbpath    => {type => SCALAR, optional => 1},
        features       => {type => SCALAR, optional => 1},
        meid           => {type => SCALAR, optional => 1},
        transaction_id => {type => SCALAR, optional => 1},
        javapics       => {type => SCALAR, optional => 1},
        thread_link    => {type => CODEREF, optional => 1}
    });
    # copy
    my $copy = bless {
        (map { ($_ => $self->{$_}) } grep {$_ ne 'features'} keys %$self),
        features => {%{$self->{features}}}
    }, __PACKAGE__;
    # override
    for (keys %params) {
        if($_ eq 'features') { $copy->features($params{$_}); }
        else                 { $copy->{$_} = $params{$_};   }
    }

    return $copy;
}

sub features
{
    my $self = shift;

    if(@_ == 0) { # read
        return %{$self->{features}};
    }
    else { # write
        my $features;
        if(@_ == 1) {
            die 'ASSERT' if ref($_[0]) ne '';
            if($_[0] eq 'ALL') {
                $features = 'H,HE,E,ET,EH,EO,C,R,B';
            }
            else {
                $features = $_[0];
            }
            $features = {map {($_=>1)} split /,/, $features};
        }
        else {
            die 'ASSERT' if @_ % 2;
            $features = {@_};
        }
        for(keys %$features) {
            die 'ASSERT' if $_ !~ /^(H|HE|E|ET|EH|EO|C|R|B)$/;
        }
        $self->{features} = $features;
    }
}

sub gen_lex_embtext
{
    my $self = shift;
    my($textref) = validate_pos(@_, {type => SCALARREF});

    # note: "pos($$textref)+=0" is used to cause a certain reset in
    # the regex engine.  Generally, the regex engine will fail a match
    # if the match length would be zero a second time when using /gc.
    # This behavior is typically used to force a bump-along to prevent
    # an infinite loop (Frield, 131), but it is not desired here, and
    # this reset prevents it.

    # Although Coro::Cont is not currently used, here are a few
    # observations on it:
    # - Cont routine many not call die.
    # - $1...$9 not preserved over yield.
    # - how can the coroutine be reset?
    # - Why is is called Coro::Cont (i.e. continuation)? Arn't
    #   continuations more general?

    my $textpos;
    my $len = length($$textref);
    pos($$textref) = 0 if ! defined pos($$textref);
    my $pos1;
    my @script_stack;

    my $features = $self->{features};

    my $gen_text = sub {
        COBEGIN();
        YIELD('T', substr($$textref, $textpos, $pos1-$textpos));
        $textpos = undef;
    };

    # locals
    my $one;
    my $two;
    my $three;
    my $name;
    my $end_tag;
    my $endpos;
    my $match;

    my $coroutine = coroutine
{
    COBEGIN();

    pos($$textref) = pos($$textref); # reset state

    until($$textref =~ /\G\z/gc) {
        $pos1 = pos($$textref);
#        _trace "DEBUG:lex_embtext," . &_trace_pos($$textref)

        # HTML tag or end tag
        if($features->{H} && $$textref =~ /\G($re_htmltag)/gc) {
            $one = $1;
            CALL($gen_text) if defined $textpos;
            YIELD('H', $one); # tag_id, all
        }
        # HTML entity reference
        elsif($features->{HE} && $$textref =~ /\G($re_htmlentity)/gc) {
            $one = $1;
            $two = defined($2) ? $2 : defined($3) ? $3 : hex($4);
            CALL($gen_text) if defined $textpos;
            YIELD('N', $one, $two);
        }
        # LaTeX environment
        elsif($features->{E} && $$textref =~ /
                        \G ( \\ (?: ([\[\(]) |
                                                begin \{ ([^\}]*) \}
                                        )
                              )
                /xgc)
        {
            $one = $1;
            $two = $2;
            $name = $2 || $3;

            CALL($gen_text) if defined $textpos;
            YIELD('E', $one, $name);

            $end_tag = $one;
            if(defined $two) {
                $end_tag =~ tr/[(/])/;
            }
            else {
                $end_tag =~ s{begin}{end};
            }
            if($features->{ET} && $name eq 'rawtext') {
                $$textref =~ /\G(.*?)(?=\Q$end_tag\E|\z)/gcs or
                    die 'ASSERT';
                pos($$textref)+=0;
                YIELD('T', $1);
            }
            elsif($features->{EH} && $name eq 'rawhtml') {
                until($$textref =~ /\G(?=\z|\Q$end_tag\E)/gc) {
                    $pos1 = pos($$textref); # note--changed

                    # HTML tag or end tag
                    if($$textref =~ /\G($re_htmltag)/gc) {
                        $one = $1;
                        CALL($gen_text) if defined $textpos;
                        YIELD('H', $one); # tag_id, all
                    }
                    # HTML entity reference
                    elsif($features->{HE} &&
                          $$textref =~ /\G($re_htmlentity)/gc)
                    {
                        $one = $1;
                        $two = defined($2) ? $2 : defined($3) ? $3 : hex($4);
                        CALL($gen_text) if defined $textpos;
                        YIELD('N', $one, $two);
                    }
                    else {
                        $textpos = $pos1 if ! defined $textpos;
                        $$textref =~ /\G.[^<\\]*/gcs;
                    }
                }
                $pos1 = pos($$textref);
                pos($$textref)+=0;;
                CALL($gen_text) if defined $textpos;
            }
            elsif($features->{EO}) {
                $endpos = pos($$textref);
                pos($$textref) = $pos1;
                $match = $self->match_latex($textref);
                if($match) {
                    pos($$textref) = $match->[2];
                }
                else {
                    pos($$textref) = $len;
                }
                YIELD(
                    'I', # tag_id
                    substr($$textref, $endpos, pos($$textref)-$endpos) # inner
                );
            }
            else { # unrecognized, so return text
                pos($$textref) = $len;

                YIELD('X', substr($$textref, $endpos),
                    "Unrecognized environment $name.", 'E1', $end_tag);
            }
            if($$textref =~ /\G\Q$end_tag\E/gc) {
                YIELD('F', $end_tag);
            }
            else {
                YIELD('X', '', "Missing $end_tag", 'E2', $end_tag);
            }
        }
        # LaTeX-like tag processed by EulerMB
        elsif($features->{C} &&
              $$textref =~ /\G ( \\ (\w+) \{ ([^\}]*) \} )/xgc)
        {
            $one = $1;
            $name = $2;
            $three = $3;
            CALL($gen_text) if defined $textpos;
            YIELD('C', $one, $name, $three); # tag_id, all, name, body
        }
        # Super- or subscript marker
        elsif($features->{R} && $$textref =~ /\G(([_^])\{)/gc) {
            $one = $1;
            $two = $2;
            CALL($gen_text) if defined $textpos;
            YIELD('R', $one, $two); # tag_id, all, name
            push @script_stack, $two;
        }
        # Super- or subscript end marker
        elsif($features->{R} && @script_stack > 0 && $$textref =~ /\G(\})/gc) {
            $one = $1;
            CALL($gen_text) if defined $textpos;
            $name = pop @script_stack;
            YIELD('S', $one, $name); # tag_id, all, name
        }
        # EulerMB block-style tag, e.g. [sqrt]
        elsif($features->{B} && $$textref =~ /\G(\[([a-zA-Z0-9_]+)\])/gc) {
            $one = $1;
            $two = $2;
            CALL($gen_text) if defined $textpos; L18:
            YIELD('B', $one, $two);
        }
        # text (partial)
        else {
            $textpos = $pos1 if ! defined $textpos;
            $$textref =~ /\G.[^<\&\\_^\[\}]*/gcs;
        }
    }
    $pos1 = pos($$textref);
    CALL($gen_text) if defined $textpos;
};

    return $coroutine->wrap();

};

sub match_latex
{
    my($self, $textref) = @_;

    my $pos0 = pos($$textref);
    if($$textref =~ m/
                \G ( \\ (?: ( [\[\(] ) |
                                            begin \s* \{ ( [^\}]* ) \}
                                )
                      )
        /xgc)
    {
        my $all = $1;
        my $name = defined($2) ? $2 : $3;
        my($beg, $end);
        if($name eq '[') {
            $beg = qr/\[/;
            $end = qr/\]/;
        }
        elsif($name eq '(') {
            $beg = qr/\(/;
            $end = qr/\)/;
        }
        else {
            $beg = qr/begin \s* \{ \Q$name\E \}/x;
            $end = qr/end   \s* \{ \Q$name\E \}/x; 
        }

        my $pos_inner_start = pos($$textref);
        my $pos_inner_end;

        my $depth = 1;
        while($depth > 0 && $$textref !~ /\G\z/gc) {
            my $pos = pos($$textref);
            if($$textref =~ /\G\\$beg/gc) {
                $depth++;
            }
            elsif($$textref =~ /\G\\$end/gc) {
                $depth--;
                $pos_inner_end = $pos if $depth == 0;
            }
            elsif($$textref =~ /\G\\/gc) {
            }
            else {
                $$textref =~ /\G[^\\]+/gcs;
            }
        }
        if($depth > 0) {
            pos($$textref) = $pos0;
            return;
        }
        return [$name, $pos_inner_start, $pos_inner_end];
    }
    else {
        return;
    }
}

sub match_htmltag
{
    my($tag) = @_;

    if($tag =~ /^<([a-zA-Z]+)/gc) {
        my $name = lc($1);
        my %attr;
        while($tag =~ /\G\s+$re_htmlatt/gcs) {
            my $val = defined($2) ? $2 : $3;
            if(defined $attr{$1}) {
                return "Duplicate attribute $1";
            }
            else {
                $attr{lc($1)} = $val;
            }
        }
        if($tag =~ m{\G\s*(/)?>$}gc) {
            my $id = $1 ? 'SE' : 'S';
            return [$id, $name, \%attr];
        }
        else {
            return;
        }
    }
    elsif($tag =~ m{^</([a-zA-Z]+)\s*>$}s) {
        return ['E', lc($1)];
    }
    else {
        return;
    }
}

sub gen_token_iterator
{
    my $self = shift;
    my($source) = validate_pos(@_, {type => ARRAYREF});
    my $idx = 0;
    my $iterator = sub {
        if($idx < @$source) {
            return @{$source->[$idx++]};
        }
        else { return; }
    };
    return $iterator;
}

sub gen_validate_embtext
{
    my $self = shift;
    my($source) = validate_pos(@_, {type => CODEREF | ARRAYREF});

    $source = $self->gen_token_iterator($source)
        if ref($source) eq 'ARRAY';

    my @token;
    my $nexttok = sub {
        @token = $source->();
#        _trace 'DEBUG:VAL_TOKEN', \@token;
    };

    my $eat_token;

    #-- locals
    my %tag_counts;
    # tokens to inject to fix mismatched start/end tags
    my @extra_tokens;

    my $check_html = sub {
        COBEGIN();

        my $tag = &match_htmltag($token[1]);
        if(! defined $tag) { die 'ASSERT'; }
       
        eval {

        if(! ref($tag)) {
            die ['X', $token[1], $tag, 'E3'];
        }
        elsif($tag->[0] =~ /^S/) { # start tag
            my($type, $tagname, $atts) = @$tag;
            my $ele_info = $element_infos->{$tagname};

            if(!defined($ele_info)) {
                die ['X', $token[1], "HTML tag $token[1] unknown.", 'E4'];
            }

            my($ele_allow, $ele_hasend) = @$ele_info;

            #IMPROVE:check only if safehtml set.
            if(!$ele_allow) {
                die ['X', $token[1], "HTML tag $token[1] not allowed.", 'E10'];
            }

            if($tagname eq 'a') {
                if(defined $atts->{href}) {
                    my $bad = ($atts->{href} =~ /
                                                javascript\: |
                                                \&\#\d
                                        /xi);

                    die ['X', $token[1],
                        "Attribute '$atts->{href}' on tag $token[1] not allowed.", 'E5']
                        if $bad;
                }
            }

            foreach my $at (keys %$atts) {
                my $att_info = $att_infos->{$at};
                if(!defined($att_info)) {
                    die ['X', $token[1],
                        "Attribute $at on HTML tag $token[1] unknown.", 'E6'];
                }
                elsif($att_info->[0] == 0) {
                    die ['X', $token[1],
                        "Attribute $at on HTML tag $token[1] not allowed.", 'E7'];
                }
                elsif($att_info->[1] eq 'a' && !$att_info->[2]{$tagname}) {
                    die ['X', $token[1],
                         "Attribute $at on HTML tag $token[1] not valid.", 'E8'];
                }
                elsif($att_info->[1] eq 'd' && $att_info->[2]{$tagname}) {
                    die ['X', $token[1],
                         "Attribute $at on HTML tag $token[1] not valid.", 'E9'];
                }

                if($at eq 'style') {
                    my $val = $atts->{$at};
                    my $bad = ($val =~ /
                                                \bmargin[a-z\-]*\s*:\s*\-? |
                                                \burl |
                                                \bexpression |
                                                \bposition\s*: |
                                                \&\#\d
                                        /xi);

                    die ['X', $token[1],
                        "Style '$val' on tag $token[1] not allowed.", 'E11']
                        if $bad;
                }
            } # for attributes

            # if not removed...
            ++$tag_counts{$tagname} if $ele_hasend;
        } # if start tag

        if($tag->[0] =~ /E/) { # end tag
            my($type, $tagname) = @$tag;

            my $ele_info = $element_infos->{$tagname};

            if(!defined($ele_info)) {
                die ['X', $token[1], "HTML tag $token[1] unknown.", 'E4'];
            }
            my($ele_allow, $ele_hasend) = @$ele_info;
                if(!$ele_allow) {
                    die ['X', $token[1], "HTML tag $token[1] not allowed", 'E17'];
                }

            if($tag->[0] =~ /S/) { # start-end tag
                if($ele_hasend) {
                    $tag_counts{$tagname}--;
                }
            }
            else {
                if(defined $tag_counts{$tagname} && $tag_counts{$tagname} > 0) {
                    $tag_counts{$tagname}--;
                }
                else {
                    die ['X', $token[1], "Mismatched HTML tag $token[1].", 
                         'E13'];
                }
            }

        } # if end tag

        }; # eval
        if($@) {
            die $@ if ref($@) ne 'ARRAY';
            YIELD(@{$@});
        }
        else {
            YIELD(@token);
        }
    }; # sub

    return coroutine
{
    COBEGIN();

    $eat_token = 0;

    $nexttok->();

    while(@token != 0) {
#        _trace "VALIDATE_LOOP", \@token;
        if($token[0] eq 'X') {
            if($token[3] eq 'E2') { # missing end tag
                YIELD('F', $token[4]); # inject missing tag
            }
            YIELD(@token);
        }
        #elsif($token[0] eq 'E' && $token[2] eq 'rawhtml') {
        #}
        elsif($token[0] eq 'H') {
            CALL($check_html);
        }
        elsif($token[0] eq 'N') {
            my $val = $token[2];
            if($val =~ /^\d+$/ && $val == 0) { # '\0'
                YIELD('X', $token[1],
                    "Invalid HTML entity reference $token[1]", 'E17');
            }
            else {
                YIELD(@token);
            }
        }
        elsif($token[0] eq 'R') { #IMPROVE
            my $name = $token[2];
            $tag_counts{$name}++;
            YIELD(@token);
        }
        elsif($token[0] eq 'S') {
            my $name = $token[2];
            if($tag_counts{$name} >= 0) {
                $tag_counts{$name}--;
                YIELD(@token);
            }
            else {
                YIELD('X', $token[1], "Mismatched $token[1]", 'E14');
            }
        }
        else {
            YIELD(@token);
        }
        $nexttok->();
    }

    for my $id (keys %tag_counts) {
        next if $tag_counts{$id} == 0;
        for(1..$tag_counts{$id}) {
            if($id =~ /^[_^]$/) {
                push @extra_tokens, ['S', '}', $id];
                push @extra_tokens,
                    ['X', '', "Missing } to match $id\{", 'E16'];
            }
            else {
                my $end_tag = "</$id>";

                push @extra_tokens, ['H', $end_tag];
                push @extra_tokens, ['X', '', "Missing $end_tag", 'E15'];
            }
        }
    }
    while(my $token = shift @extra_tokens) {
        YIELD(@$token);
    }


}->wrap();
}

sub validate_embtext
{
    my $self = shift;
    my($textref) = validate_pos(@_, {type => SCALARREF | CODEREF | ARRAYREF});

    # coroutine pipeline
    my $source = $textref;
    $source = $self->gen_lex_embtext($textref)
        if ref($source) eq 'SCALAR';
    $source  = $self->gen_validate_embtext($source);

    my @parts;
    while(my @token = $source->()) {
        push @parts, \@token;
    }

    return \@parts;
}

sub validate_tokens
{
    my $self = shift;
    my($source) = @_;

}

sub lex_embtext
{
    my $self = shift;
    my($textref) = validate_pos(@_, {type => SCALARREF});

    my $source = $self->gen_lex_embtext($textref);
    my @parts;
    while(my @token = $source->()) {
        push @parts, \@token;
    }

    return \@parts;
}

sub htmlize
{
    my $self = shift;
    my($source) = validate_pos(@_, {type => CODEREF | ARRAYREF});
    $source = $self->gen_token_iterator($source)
        if ref($source) eq 'ARRAY';

    my @warnings;
    my @token;
    my $out = '';

    my $nexttok = sub {
        @token = $source->();
#        _trace 'DEBUG:HTML_TOKEN', @token;
    };

    $nexttok->();

    while(@token != 0) {
        if($token[0] eq 'E') {
            my $environment = $token[2];
            if($environment =~ /^(?:rawtext|rawhtml)$/s) {
              $nexttok->();
              while(1) {
                if(@token == 0) {
                    print "ASSERT\n";
                    $out .= "[[Expected \\end{$environment}]]";
                    last;
                }
                elsif($token[0] eq 'H') {
                    $out .= $token[1];
                    $nexttok->();
                }
                elsif($token[0] eq 'T') {
                    $out .= ($environment eq 'rawhtml') ?
                        CGI::escapeHTML($token[1]) : &text_to_html($token[1]);
                    $nexttok->();
                }
                elsif($token[0] eq 'F') {
                    $nexttok->();
                    last;
                }
                else {print "UNEXPECTED TOKEN\n"; die; }
              }
            }
            elsif($environment =~ /^(?:
                                \[ | \( |
                                align|alignat|displaymath|equation|flalign|math|gather|
                                multiline|split
                        )$/xs)
            {
                my $tex = $token[1];
                $nexttok->();
                if(@token == 0) {
                    $out .= "[[error: missing content]]";
                }
                elsif($token[0] eq 'I') {
                    $tex .= $token[1];
                    $nexttok->();
                    if(@token == 0) {
                        $out .= "[[error: missing end token]]";
                    }
                    elsif($token[0] eq 'F') {
                        $tex .= $token[1];
                        $nexttok->();

                        my $img = $self->make_latex_image($tex);
                        # simulate centering for environments that should
                        # be centered.
                        if($environment =~ /^(displaymath|\[)$/) {
                            $img =
qq(<div style="text-align:center">$img</div>);
                        }
                        $out .= $img;

                    }
                    else {
                        die "ASSERT:unexpected token";
                    }

                }
                else {
                    $nexttok->();
                    die "ASSERT:unexpected token";
                }

            }
            else {
                $nexttok->();
                $out .= "[[Unrecognized environment $environment]]";
                if(@token == 0) {
                    $out .= "[[Missing \\end{$environment}]]";
                }
                else {
                    $out .= "[[IGNORED:" . &text_to_html($token[1]) . "]]";
                }
                $nexttok->();
                if(@token == 0) {
                    $out .= "[[Missing \\end{$environment}..]]";
                }
                elsif($token[0] eq 'F') { # good
                    $nexttok->();
                }
                else {
                    $out .= "[[Missing \\end{$environment}..]]";
                    $out .= "[[IGNORED:" . &text_to_html($token[1]) . "]]";
                }
            }
        }
        elsif($token[0] eq 'H') {
            $out .= $token[1];
            $nexttok->();
        }
        elsif($token[0] eq 'C') {
            my $name = $token[2];
            my $param = $token[3];
            my $more = $self->make_embtex($name, $param);
            $more = &text_to_html($token[1]) if ! defined $more;
            $out .= $more;
            $nexttok->();
        }
        elsif($token[0] eq 'R') {
            my $name = $token[2];
            $out .= ($name eq '_') ? q(<sub>) : q(<sup>);
            $nexttok->();
        }
        elsif($token[0] eq 'S') {
            my $name = $token[2];
            $out .= ($name eq '_') ? q(</sub>) : q(</sup>);
            $nexttok->();
        }
        elsif($token[0] eq 'B') {
            my $name = $token[2];
            my $more = $self->make_symbol($name);
            $more = $self->make_javapicture($name) if ! defined $more;
            $more = "[$name]" if ! defined $more;
            $out .= $more;
            $nexttok->();
        }
        elsif($token[0] eq 'N') {
            $out .= $token[1];
            $nexttok->();
        }
        elsif($token[0] eq 'T') {
            $out .= &text_to_html($token[1]);
            $nexttok->();
        }
        elsif($token[0] eq 'X') {
            my $msg = $token[2];
            push @warnings, $msg;
            my $msg_html = CGI::escapeHTML($msg);
            $out .=
qq(<span style="background-color:#800000;color:white">) .
qq([[error: $msg_html]]</span>);
            $nexttok->();
        }
        else {
            die "ASSERT token $token[0]";
            $nexttok->();
        }
    }

    $self->{warnings} = \@warnings;
    return $out;
}

sub textize
{
    my $self = shift;
    my($source) = validate_pos(@_, {type => CODEREF | ARRAYREF});
    $source = $self->gen_token_iterator($source)
        if ref($source) eq 'ARRAY';

    my @warnings;
    my @token;
    my $out = '';

    my $nexttok = sub {
        @token = $source->();
#        _trace 'DEBUG:TEXT_TOKEN', @token;
    };

    $nexttok->();
    while(@token != 0) {
        if($token[0] eq 'X') {
            $out .= $token[1];
            my $msg = $token[2];
            push @warnings, $msg;
        }
        else {
            $out .= $token[1];
        }
        $nexttok->();
    }

    $self->{warnings} = \@warnings;
    return $out;
}

sub embtext_to_html
{
    my $self = shift;
    my $textref = shift;
    my %params = validate(@_, {
        validate => {type => SCALAR, default => 1}
    });

    # coroutine pipeline
    my $source = $self->gen_lex_embtext($textref);
    if($params{validate}) {
        $source  = $self->gen_validate_embtext($source);
    }

    my $html = $self->htmlize($source);

    return $html;
}

sub warnings
{
    my $self = shift;
    return $self->{warnings}; # note: no copy
}

sub text_to_html
{
    my($s) = @_;
    $s =~ s/\&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    $s = &nbsp($s);
    $s =~ s/\n/\n<br>/g;
    return $s;
};

sub make_latex_image
{
    my($self, $tex) = @_;

    my $tex_escaped = uri_escape($tex, "^A-Za-z0-9");
    my $alt = $tex;
    $alt =~ s/\"/\\\"/g;
    $alt =~ s/[\r\n]/ /g;
    $alt =~ s/ +/ /g;

    my $img =
qq(<img src="$self->{texpath}?tex=$tex_escaped" ) .
qq(style="vertical-align:middle" border="0" alt="$alt" title="$alt" />);

    return $img;
}

sub make_symbol
{
    my($self, $name) = @_;

    my $val = $symbols{$name};
    my $img = defined($val) ?
        qq(<img src="$self->{eulermbpath}/symbols/$val") .
        qq( style="vertical-align:middle; border-style:none") .
        qq( alt="[$name]" title="[$name]" />) : undef;

    return $img;
}

sub make_embtex
{
    my($self, $name, $param) = @_;
    my $html;
    if($name eq 'thread') {
        if($param =~ /^(\d+)$/s) {
            if(defined $self->{thread_link}) {
                $html = $self->{thread_link}->($1);
            }
            else {
                $html =
qq([<a href="$self->{eulermbpath}/thread/$1">$1</a>]);
            }
        }
        else {
            $html ='[[---BROKEN thread tag---]]'; #IMPROVE?
        }
    }
    elsif($name eq 'object') {
        if($param =~ /^[a-zA-Z0-9_\.]+$/s) {
            $html = $self->make_attachment($param);
        }
        else {
            $html = '[[---BROKEN object tag---]]'; #IMPROVE?
        }
    }
    #old: elsif($name eq 'special') { #old
    #     $html = qq(<img src="/img/special/$id.png" />);
    #     }
    else {} # unrecognized, return undef.
    return $html;
}

sub make_javapicture
{
    my($self, $name) = @_;
    $name =~ tr/a-zA-Z0-9_//cd; $name = substr($name, 0, 100); # IMPROVE

    my $pics = $self->{javapics};
    my @pics = split /;/, $pics;
    my %pics;

    foreach my $pic (@pics) {
        my($pic_name, $data) = split /,/, $pic;
        $pic_name =~ tr/a-zA-Z0-9_//cd;
        $pics{$pic_name} = $data;
    }
    my $img_tag;
    if(defined $pics{$name}) {
        my $data = $pics{$name};
        #note: ignore bytes 0-2 (info "Z" tag and version #)
        #parse: hex LSB->MSB
        my $width = hex(&_reverse_string(substr($data, 3, 4)));
        my $height = hex(&_reverse_string(substr($data, 7, 4)));

        #old: <applet code="ShowPict.class"
        #      codebase="$self->{javapath}" width="$width"
        #      height="$height"
        #     > <param name="bytes" value="$data">
        #     </applet>
        if(defined $self->{meid}) {
            $img_tag =
qq(<img src="$self->{eulermbpath}/javaimage/$self->{meid}/$name" ) .
qq(width="$width" height="$height" alt="[$name]" title="[$name]" />);
        }
        elsif(defined $self->{transaction_id}) {
            $img_tag =
qq(<img src="$self->{eulermbpath}/javaimage/tmp/$self->{transaction_id}/$name").
qq(width="$width" height="$height" alt="[$name]" title="[$name]" />);
        }
        else { } # nothing
    }
    else { } #nothing

    return $img_tag;
}

sub make_attachment
{
    my($self, $name) = @_;

    #IMPROVE:duplicated code
    if($name !~ /^[a-zA-Z0-9_\.]+$/) {
        return '[[Invalid attachment name format]]';
    }
    #IMPROVE--validate attachment name against those available

    my $base = defined($self->{meid}) ? $self->{meid} :
               defined($self->{transaction_id}) ? "N$self->{transaction_id}" :
               'ASSERT'; # IMPROVE?
    my $tag =
        qq(<img src="$self->{eulermbpath}/image/$base/$name") .
        qq( alt="[$name]" title="[$name]" />);

    return $tag;
}

sub nbsp
{
    my($s) = @_;
    # IMRPOVE: what about tabs?
    $s =~ s/(^| )( +)/$1 . '&nbsp;' x length($2)/gme;
    return $s;
}

# [internal]
# Reverse characters in string.
# (e.g. "asdf" -> "fdsa").
#   $string = &reverse($string);
sub _reverse_string
{
    my($str) = @_;
    my $out = join '', reverse split //, $str;
    return $out;
}

# [internal]
# print chars in string starting as pos() location.
#   my $s = &_trace_pos($textref);
sub _trace_pos
{
    my($textref) = @_;
    my $pos = pos($$textref);
    my $snip = substr($$textref, pos($$textref), 10);
    my $s = qq([pos=$pos,snip=$snip]);
    return $s;
}

sub _trace(@)
{
    print STDERR Data::Dumper::Dumper(\@_);
}


1
__END__