EulerMB::Content - Parses, validates, and formats (as HTML) EulerMB Text
use EulerMB::Content;
my $content = new EulerMB::Content(
texpath => '/cgi-bin/texserver',
eulermbpath => '/emb',
features => 'ALL',
meid => undef,
transaction_id => undef,
javapics => undef,
thread_link => sub { ... }
);
my $html = $content->embtext_to_html(\$embtext, validate => 1);
my $warnings = $content->warnings();
Example content:
a_{n}[sqrt]2 + e^{x+1} = y.
\object{graph.png} [myjavaimage]
4 6
--- + ---
5 7
x < y > z.
<b>text</b><span style="color:blue">tést</span>
\begin{rawtext}
How to input: <b>text</b><span style="color:blue">tést</span>
\end{rawtext}
\begin{rawhtml}
<table><tr>
<td>123</td><td>234</td>
</tr></table>
\end{rawhtml}
\( \int_2^3 \frac{x}{x+1} dx \)
\( \sqrt{2} \) \[ \sqrt{2} \] \begin{math} \sqrt{2} \end{math}
This module parses, validates, and formats (as HTML) EulerMB Text (Embtex). Embtext is a text language for composing messages within the EulerMB message board software (http://math2.org/eulermb/), which the Math Message Board (http://math2.org/mmb/) uses.
Embtext is a mix of HTML, LaTeX, plain text, and other formatting. It is intended to be written by people (rather than computers). In particular it has support for math notation and graphics as it was originally developed for a math message board. This module supports environments where the Embtext input is untrusted (e.g. message boards). The HTML and LaTeX notation is validated and sanitized before display.
With a few exceptions, LaTeX support is limited to math notation and is primarily used for inserting professionally formatted math notation into documents. The LaTeX math support requires the texserver, which is an external component (not currently publicly available), which in turn requires LaTeX. Some additional non-math primitives are available using a LaTeX-like notation (e.g. \thread{123}), but these are processed by this module itself without calling into LaTeX.
A few additional notations are supported for convenience. Block tags, e.g. [sqrt], can be used instead of LaTeX to insert simple math symbols. Other block tags display certain images inline drawing using the JavaDraw Java applet drawing tool (not currently publically available). A super- and subscript notation, e^{x^{2}+1}+a_{b}, is converted to HTML SUP/SUB tags.
This module can be used in a variety of applications, such as for message boards, as an HTML compiler, or as a dynamic page filter on a web server. The module is intended to be reasonably fast (for Perl), mature and robust (with an extensive test suite), and well designed. In the design, the lexing, validating, and rendering are implemented as a pipeline of separate coroutines, which allows selected combinations of these steps to be used or your own processes injected in the middle. For example, to display a message board message as HTML, the validation step may be omitted if message has already been validated and accepted.
(1) Robustness Principle: ``Be conservative in what you do, be liberal in what you accept from others.'' http://www.ietf.org/rfc/rfc793.txt.
This mandates that while we may accept Embtext input that is not fully valid HTML (e.g. <u>></u>), the HTML output generated by this module must be valid HTML (e.g. <u>></u>).
However, the meaning of input not always initially clear (e.g. ``x<y and y>z''). Here, the author's intention is that this math expression be recognized as text rather than as containing the HTML tag <y and y>, where the attribute values are missing. Therefore, this module recognizes strings as HTML only when the interpretation is clear (e.g. the attribute values are present and quoted). So, this example is escaped as ``x<y and y>z''.
(2) Secure on untrusted input
This module shall accept untrusted input (e.g. from arbitrary users on the web) and securely convert it to HTML that can be securely displayed on other users' web browsers. Some features of HTML (e.g. JavaScript) are prevented or filtered out of the user input.
(3) Reasonably terse syntax, easy to read in source code format.
Embtext is designed to be created by people rather than computers. It may be authored in an text-only environment, such as HTML TEXTAREA form element, not a realtime graphical ``what-you-see-it-what-you-get'' (WYSIWYG) one. Therefore, the syntax must be clear and terse. We do not encourage verbose syntax such as XML and especially MathML. Rather, we prefer LaTeX-like notations, simple HTML, or notations similar to that used by Wikipedia (http://www.wikipedia.org) or Epytext (http://epydoc.sourceforge.net/epytext.html).
(4) Represent structure not style.
This is a goal, at least in part, of HTML4, LaTeX, and others (e.g. DocBook, http://www.docbook.org). The idea is that the text represents structure, and the specific style (e.g. font, colors, positioning) can be easily adjusted by someone other than the author.
(5) Sufficiently flexible
Embtext should be able to extend to support new features. Currently, new commands can be added using LaTeX-like notation (e.g. \thread{123}).
(1) The lexing, parsing, and HTML rendering steps are independent.
The conversion from Embtext to HTML is done in a pipeline implemented with semi-coroutines provided by the the EulerMB::Coroutine manpage module. Selected combinations of these steps can be used or omitted from the pipeline, or your own steps can be added at any position. For example, when generating HTML, the validation step may be omitted if the content was already validated.
(2) Conversion is O(1) in memory usage.
The module lexes, parses, and translated to HTML token-by-token. This is a relatively minor goal, but the module achieves it without much added complexity by using semi-coroutines. The issues here are well described in ``Coroutines in C'' (http://www.chiark.greenend.org.uk/~sgtatham/coroutines.html).
The following constructs are supported in the language.
Block-style tags for symbols or JavaDraw images. e.g. [sqrt].
Superscripts and subscripts. e.g. e^{x^{2}+1} + log_{b} x.
Embedded images uploaded. e.g. \object{myfunction.png}.
Limited set of HTML. e.g. <b>This is bold.</b> <span style=``color:blue''>test</a>.
Special LaTeX-like tags. \thread{123} - hyperlink to EulerMB discussion thread with ID 123
ASCII text graphics. ***** ****** * ** *****
Raw HTML (to prevent extra whitespace being inserted as above): \begin{rawhtml} <table> <tr><td>123</td><td>234</td></tr> </table> \end{rawhtml}
Raw text (to prevent characters being rendered as HTML): \begin{rawtext} How to write HTML: <html><body>...</body></html> \end{rawtext}
LaTeX math (when the texserver is installed).
\[ \frac{x+1}{x-\sqrt{y}} + \int_a^b f(x) \, dx. \]
\( ... \)
\begin{math}...\end{math}
The Embtext grammar is roughly as follows:
embtext := latex_environment | latex_command | block_tag | html_tag | html_entity | text | script
latex_environment := begin_environment latex_content end_environment
latex_content := math_content | rawtext_content | rawhtml_content | other_content
rawhtml_content := (html_tag | html_entity | text)*
script := script_begin embtext script_end
begin_environment := E end_environment := F latex_command := C math_content := I rawtext_content := T other_content := I html_tag := H html_entity := N script_begin := R script_end := S block_tag := B text := T
newmy $content = new EulerMB::Content(%params);
where %params is key-value pairs:
texpath default '/cgi-bin/texserver' eulermbpath default '/emb' features default 'ALL' meid default undef transaction_id default undef javapics default undef thread_link default undef (coderef)
texpath is an absolute URL to the texserver. The generated HTML may contain links to the texserver in order to display LaTeX math. This parameter is only needed if HTMLization is performed and LaTeX math is not disabled.
``eulermbpath'' is an absolute URL to the EulerMB static files. These include images that get embedded into the generated HTML. This parameter is only needed if HTMLization is performed and certain features that would result in such links (e.g. [sqrt] tags) are not disabled.
``features'' identifies which types of tokens shall be recognized in the lexing, validation, and rendering. See the features methods for details.
``meid'' is the message identifier. This parameter is only needed if HTMLization is performed and certain features (e.g. file attachments) are not disabled. The generated HTML may contain links that reference this identifier.
``transaction_id'' is the transaction identifier. This parameter is only needed if HTMLization is performed and certain features (e.g. file attachments) are not disabled. This parameter is complimentary to meid and is typically used in previewing a message in the composition stage (when neither the message nor its message identifier yet exist).
``javapics'' is any JavaDraw data associated with the current message. This parameter is only needed if HTMLization is performed and the JavaDraw feaure is not disabled. This parameter is used to properly generate the image tags in the HTML (e.g. width and height parameters).
``thread_link'' is code reference that will handle \thread{...} tags. This parameter is only needed if HTMLization if performed and \thread{...} tags are not disabled. If absent, default rendering is applied. Often, default rendering is not sufficient because additional information is needed (e.g. the title of the thread), and such information may require querying a database.
clonefeatures
$self->features(H => 1, R => 1);
$self->features('H,E,ET,EH,EO,C,R,B');
$self->features('ALL');
my %features = $self->features();
Possible features:
H - enable HTML tags
HE - enable HTML entity references
E - enable LaTeX-like environments
ET - enable rawtext environments
EH - enable rawhtml environments
EO - enable math environments
C - enable LaTeX-like commands (e.g. \thread{123})
R - enable super- and subscripts notation ^{...} and _{...}.
B - enable block tag (e.g. [sqrt] or [myimage])
gen_lex_embtext
my $source = $self->lex_embtext(\$embtext);
while(my @token = $source->()) {
...
}
$embtext is a string.
The format of each returned token varies by type, but generally it is of the form
($token_type, $raw_text, @additional_data)
$token_type is
H - HTML tag N - HTML entity reference E - LaTeX environment begin tag F - LaTeX environment end tag I - LaTeX inside environment C - LaTeX simple tag R - Super/subscript begin S - Super/subscript end B - Block tag T - text or raw text X - error (exception)
$raw_text is the raw input text.
@additional data varies by token type. For example,
'X' - ($error_message, $error_id, @more_data)
gen_validate_embtext
my $source = $self->gen_lex_embtex(\$embtext);
$source2 = $self->gen_validate_embtex($source);
while(my @token = $source2->()) {
...
}
$source is either a token iterator or a list of tokens.
This coroutine performs higher-level validation (e.g. matching HTML start end-tags) than the validation performed by the lexer. This coroutine may insert ``X'' type tokens into the input stream as well as add or remove tokens to resolve formatting errors (e.g. add missing HTML end tags).
my $source = $content->gen_token_iterator(\@tokens);
while(my $token = $source->()) { ... }
$source is either a token iterator or a list of tokens.
validate_embtext
my @tokens = $self->validate_embtext($source);
while(my @token = $source->()) {
...
}
$source is a string reference, a token iterator, or a list of tokens.
lex_embtextmy @tokens = $self->lex_embtext(\$embtext);
textizemy $text = $content->textize($source); my $warnings = $content->warnings();
$source is either a token iterator or a list of tokens.OA
htmlizemy $html = $content->htmlize($source); my $warnings = $content->warnings();
$source is either a token iterator or a list of tokens.
embtext_to_htmlmy $html = $content->embtext_to_html(\$embtext, validate => $validate); my $warnings = $content->warnings();
where $embtext is a string of Embtext. $validate is a boolean indicating whether validation should be performed on the input. (optional--default 1) Any warnings triggered by the conversion can be retrieved with the warnings method.
text_to_html$html = EulerMB::Content::text_to_html($text);
match_latexmy $info = $content->match_latex(\$text);
$text is a string to match on. matching is performed
at pos($text).
On success,
$info is [$name, $pos_inner_start, $pos_inner_end], where
$name is the environment name, $pos_inner_start and $pos_inner_end
are 0-based character indicies indicated here before the
``*'' character here: ``\begin{math}*...\*end{math}''.
pos($text) then points to the character immediately following the match.
On error, return undef and leave pos($text) unchanged.
Generate HTML for LaTeX math image.
=item make_latex_image
Generate HTML link for LaTeX image.
my $html = $content->make_latex_image($tex);
$tex is string containin math LaTeX. $html is returned HTML.
make_symbolmy $html = $content->make_symbol($name);
$name is the symbol identifier string. $html is the return HTML string. Returns undef on symbol not found.
make_embtexmy $html = $content->make_embtext($name, $param);
$name is string containing command name. $param is string containing parameter $html is string containing returned HTML.
nbsp$html = EulerMB::Content::nbsp($text);
make_javapicturemy $html = $content->make_javapicture($name);
$name is the identifier for the picture. Data for the picture is given in the javapics property of the content object. $html is the returned HTML as a string. Returns undef if picture not found or JavaDraw disabled.
make_attachmentmy $html = $content->make_attachment($name);
$name is the identifier for the attachment. $html is the returned HTML.
match_htmltagmy $info = EulerMB::Content::match_htmltag($tag);
$tag is the HTML as a string. return ['S', $tagname, \%attr] for each start tag <tag> ['SE', $tagname, \%attr] for each start/end tag <tag/> ['E', $tagname] for each end tag </tag> $error on parsing error (e.g. duplicate attribute name) undef on no match where $tagname - string - tag name %attr - hash of key-value attribute pairs $error - string - error
warningsmy $warnings = $self->warnings();
Special characters < and > are not permitted within HTML tag attribute values. The presence of these does not cause an error but rather causes the string to be recognized as text. (possible change: returning an error).
HTML attribute values must be present on attributes and must be quoted, else the string will be recognized as text. HTML: <a href=``http://www.google.com''>. Text: <a href=http://www.google.com>. Text: <a href>. The HTML specification discourages unquoted attributes, and it is not conformant with XHTML. A decision to permit unquoted attributes could cause input such as ``a<b and x=2 and z>b'' to be recognized as HTML rather than text. The ambiguity would not occur in the rawhtml environment though.
HTML start tags must have a matching end tag if and only if an end tag is required. Good: <br>. Bad: <br></br>. Good: <br />. Bad: <p>. Good: <p></p>. Good: <p/>. Missing end tags are added and flagged. Extraneous end tags are removed and flagged.
HTML start and end tags can overlap. Good: <b><i></i></b>. OK: <b><i></b></i> (IMPROVE: subject to change--the latter should be made an error).
HTML element and attribute names may be upper- or lower-case. Lower-case is preferred. Good: <a href=``...''>. OK: <A HREF=``...''>. XHTML conformance requires lower-case. (Possible change: allow upper case as input but internally convert to lower-case either in the validation or HTML rendering stage.)
Unrecognized HTML elements and attributes are not permitted. These tags are removed and flagged as errors. BAD: <asdf sdfg=``dfgh''>.
HTML forms, JavaScript, form elements, and embedded objects (e.g. applet, object tags) are not permitted. Such tags are removed and flagged as errors. Additional checks are made such as ``javascript:'' not being permitted in the href attribute of an ``a'' tag.
Various HTML tags and attributes that are possibly insecure are not permitted. These are removed and flagged as errors. Bad: <img>, <script>, <frame>, <form>, <p onclick=``...''>, ....
Numeric entity references, url(), and expression() are not permitted
in CSS.
Various representations of entity references for ASCII '\0' are nor permitted. These are removed and flagged.
The generated HTML is well-formed regardless of the Embtext input, at least when validation is enabled. This is an important design criteria.
Special characters <, >, and & not recognized as part of a valid HTML tag are escaped.
Errors (type ``X'') are rendered in-line, where they occur. The error message is HTML escaped.
Possibly incorporate spell checker into the processing pipeline (validation step?) instead of having it be independent as it is now.
Possibly support new notations:
- standard notation for quotes referring to text in other messages.
- lists
- links to popular web sites (e.g. Mathworld or Wikipedia). This is partly an extension to the \thread{...} notation.
- Simplified math notation. Translating maybe \m{(x+1)/(x-1)+y/sqrt(z)} to \(\frac{x+1}{x-1}+\frac{y}{\sqrt{x}}\) and having it rendered by LaTeX as an image.
- standard notation for the inclusion of source code, possibly syntax highlighted (e.g. GNU enscript, http://people.ssh.fi/mtr/genscript/).
To validate Embtext, and alter the token stream right before the validation step:
my $tokens = $content->lex_embtext(\$text);
for my $token (@$tokens) {
if($token->[0] eq 'H') { # HTML
my $original = $token->[1];
$token->[1] = &normalize_html_tag($original);
}
}
$tokens = $content->gen_validate_embtext($tokens);
my $text = $content->textize($tokens);
my $warnings = $content->warnings();
Such a method was used in a database cleanup routine.
the EulerMB::Coroutine manpage - for semi-coroutine language construct
URI::Escape, CGI - for HTML generation, proper escaping
Params::Validate - to validate parameters to methods
(1) Jeffrey E. F. Freidl, Mastering Regular Expressions, 2nd ed, O'Reilly & Associates, 2002.
(c) 1998-2004, David Manura. http://math2.org/david/contact. This module is licensed under the same terms as Perl itself.
package EulerMB::Content; use EulerMB::Coroutine qw(:all); use strict; use URI::Escape; use CGI; use Params::Validate qw(:all); #use Data::Dumper; # DEBUG # note: for portability/robustness, use EulerMB::Coroutine not Coro #use Coro::Cont; # REMOVED our $VERSION = '2.00'; sub _trace(@); # Mapping: symbol names in block tags --> image file names. # This is used to render symbols (e.g. "[sqrt]") as images. # Case of first character matters. my %symbols = ( 'd_dx' => 'd-dx.gif', # FIX: was "d/dx". "/" not recognized 'inf' => 'inf.gif', 'integral'=> 'integral.gif', 'pder' => 'pder.gif', 'product' => 'product.gif', 'sqrt' => 'sqrt.gif', 'sum' => 'sum.gif', 'alpha' => 'alpha-l.gif', 'Alpha' => 'alpha-u.gif', 'beta' => 'beta-l.gif', 'Beta' => 'beta-u.gif', 'gamma' => 'gamma-l.gif', 'Gamma' => 'gamma-u.gif', 'delta' => 'delta-l.gif', 'Delta' => 'delta-u.gif', 'epsilon' => 'epsilon-l.gif', 'Epsilon' => 'epsilon-u.gif', 'zeta' => 'zeta-l.gif', 'Zeta' => 'zeta-u.gif', 'eta' => 'eta-l.gif', 'Eta' => 'eta-u.gif', 'theta' => 'theta-l.gif', 'Theta' => 'theta-u.gif', 'iota' => 'iota-l.gif', 'Iota' => 'iota-u.gif', 'kappa' => 'kappa-l.gif', 'Kappa' => 'kappa-u.gif', 'lambda' => 'lambda-l.gif', 'Lambda' => 'lambda-u.gif', 'mu' => 'mu-l.gif', 'Mu' => 'mu-u.gif', 'nu' => 'nu-l.gif', 'Nu' => 'nu-u.gif', 'xi' => 'xi-l.gif', 'Xi' => 'xi-u.gif', 'omicron' => 'omicron-l.gif', 'Omicron' => 'omicron-u.gif', 'pi' => 'pi-l.gif', 'Pi' => 'pi-u.gif', 'rho' => 'rho-l.gif', 'Rho' => 'rho-u.gif', 'sigma' => 'sigma-l.gif', 'Sigma' => 'sigma-u.gif', 'tau' => 'tau-l.gif', 'Tau' => 'tau-u.gif', 'upsilon' => 'upsilon-l.gif', 'Upsilon' => 'upsilon-u.gif', 'phi' => 'phi-l.gif', 'Phi' => 'phi-u.gif', 'chi' => 'chi-l.gif', 'Chi' => 'chi-u.gif', 'psi' => 'psi-l.gif', 'Psi' => 'psi-u.gif', 'omega' => 'omega-l.gif', 'Omega' => 'omega-u.gif' ); # HTML element information used for validation. # From http://www.w3.org/TR/1999/REC-html401-19991224/index/elements.html # Format: name => [allow, has_end_tag]. # allow - whether the tag is allowed # has_end_tag - whether the tag can contain content and has matching # start and end tags (e.g. "<p>...</p>" v.s. "<br>". # IMPROVE: should depreciated tags be allowed? # (e.g. HTML strict v.s. transitional) # D = depreciated my $element_infos = { 'a' => [1, 1], 'abbr' => [1, 1], 'acronym' => [1, 1], 'address' => [1, 1], 'applet' => [0, 1], # D 'area' => [0, 0], 'b' => [1, 1], 'base' => [0, 0], 'basefont' => [0, 0], # D 'bdo' => [0, 1], 'big' => [1, 1], 'blockquote' => [1, 1], 'body' => [0, 1], 'br' => [1, 0], 'button' => [0, 1], 'caption' => [1, 1], 'center' => [1, 1], # D 'cite' => [1, 1], 'code' => [1, 1], 'col' => [1, 0], 'colgroup' => [1, 1], 'dd' => [1, 1], 'del' => [1, 1], 'dfn' => [1, 1], 'dir' => [1, 1], # D 'div' => [1, 1], 'dl' => [1, 1], 'dt' => [1, 1], 'em' => [1, 1], 'fieldset' => [0, 1], 'font' => [1, 1], # D 'form' => [0, 1], 'frame' => [0, 0], 'frameset' => [0, 1], 'h1' => [1, 1], 'h2' => [1, 1], 'h3' => [1, 1], 'h4' => [1, 1], 'h5' => [1, 1], 'h6' => [1, 1], 'head' => [0, 1], 'hr' => [1, 0], 'html' => [0, 1], 'i' => [1, 1], 'iframe' => [0, 1], 'img' => [0, 0], 'input' => [0, 0], 'ins' => [1, 1], 'isindex' => [0, 0], # D 'kbd' => [0, 1], 'label' => [0, 1], 'legend' => [0, 1], 'li' => [1, 1], 'link' => [0, 0], 'map' => [0, 1], 'menu' => [1, 1], # D 'meta' => [0, 0], 'noframes' => [0, 1], 'noscript' => [0, 1], 'object' => [0, 1], 'ol' => [1, 1], 'optgroup' => [0, 1], 'option' => [0, 1], 'p' => [1, 1], 'param' => [0, 0], 'pre' => [1, 1], 'q' => [1, 1], 's' => [1, 1], # S 'samp' => [1, 1], 'script' => [0, 1], 'select' => [0, 1], 'small' => [1, 1], 'span' => [1, 1], 'strike' => [1, 1], # S 'strong' => [1, 1], 'style' => [0, 1], 'sub' => [1, 1], 'sup' => [1, 1], 'table' => [1, 1], 'tbody' => [1, 1], 'td' => [1, 1], 'textarea' => [0, 1], 'tfoot' => [1, 1], 'th' => [1, 1], 'thead' => [1, 1], 'title' => [0, 1], 'tr' => [1, 1], 'tt' => [1, 1], 'u' => [1, 1], 'ul' => [1, 1], 'var' => [1, 1] }; # HTML attribute info used for validation. # From http://www.w3.org/TR/1999/REC-html401-19991224/index/attributes.html # Format: tag_name => [allow, modifier, atts] # allow - whether attribute is allowed at all # modifier - # 'a' - accept list follows # 'd' - deny list follows # atts - accepted or denied elements associated with this attribute. # This is a listref transformed into a hashref at runtime. # Note: a number of these are depreciated either globally or on # certain elements. my $att_infos = { 'abbr' => [1, 'a', ['td', 'th']], 'accept-charset' => [0, 'a', ['form']], 'accept' => [0, 'a', ['form', 'input']], 'accesskey' => [0, 'a', ['a', 'area', 'button', 'input', 'label', 'legend', 'textarea']], 'action' => [0, 'a', ['form']], 'align' => [1, 'a', ['caption', 'applet', 'iframe', 'img', 'input', 'object', 'legend', 'table', 'hr', 'div', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'p', 'col', 'colgroup', 'tbody', 'td', 'tfoot', 'th', 'thead', 'tr']], 'alink' => [0, 'a', ['body']], 'alt' => [0, 'a', ['applet', 'area', 'img', 'input']], 'archive' => [0, 'a', ['applet', 'object']], 'axis' => [1, 'a', ['td', 'th']], 'background' => [0, 'a', ['body']], 'bgcolor' => [1, 'a', ['table', 'tr', 'td', 'th', 'body']], 'border' => [1, 'a', ['table', 'img', 'object']], 'cellpadding' => [1, 'a', ['table']], 'cellspacing' => [1, 'a', [ 'table']], 'char' => [1, 'a', ['col', 'colgroup', 'tbody', 'td', 'tfoot', 'th', 'thead', 'tr']], 'charoff' => [1, 'a', ['col', 'colgroup', 'tbody', 'td', 'tfoot', 'th', 'thead', 'tr'] ], 'charset' => [0, 'a', ['a', 'link', 'script']], 'checked' => [0, 'a', ['input']], 'cite' => [1, 'a', ['blockquote', 'q', 'del', 'ins']], 'class' => [0, 'd', ['base', 'basefont', 'head', 'html', 'meta', 'param', 'script', 'style', 'title']], 'classid' => [0, 'a', ['object']], 'clear' => [1, 'a', ['br']], 'code' => [0, 'a', ['applet']], 'codebase' => [0, 'a', ['object', 'applet']], 'codetype' => [0, 'a', ['object']], 'color' => [1, 'a', ['basefont', 'font']], 'cols' => [0, 'a', ['frameset', 'cols']], 'colspan' => [1, 'a', ['td', 'th']], 'compact' => [1, 'a', ['dir', 'dl', 'menu', 'ol', 'ul']], 'content' => [0, 'a', ['meta']], 'coords' => [0, 'a', ['area', 'a']], 'data' => [0, 'a', ['object']], 'datetime' => [1, 'a', ['del', 'ins']], 'declare' => [0, 'a', ['object']], 'defer' => [0, 'a', ['script']], 'dir' => [0, 'd', ['applet', 'base', 'basefont', 'bdo', 'br', 'frame', 'frameset', 'iframe', 'param', 'script']], 'disabled' => [0, 'a', ['button', 'input', 'optgroup', 'option', 'select', 'textarea']], 'enctype' => [0, 'a', ['form']], 'face' => [1, 'a', ['basefont', 'font']], 'for' => [0, 'a', ['label']], 'frame' => [1, 'a', ['table']], 'frameborder' => [1, 'a', ['frame', 'iframe']], 'headers' => [1, 'a', ['td', 'th']], 'height' => [1, 'a', ['iframe', 'td', 'th', 'img', 'object', 'applet']], 'href' => [1, 'a', ['a', 'area', 'link', 'base']], 'hreflang' => [0, 'a', ['a', 'link']], 'hspace' => [0, 'a', ['applet', 'img', 'object']], 'http-equiv' => [0, 'a', ['meta']], 'id' => [0, 'd', ['base', 'head', 'html', 'meta', 'script', 'style', 'title']], 'ismap' => [0, 'a', ['img', 'input']], 'label' => [0, 'a', ['option', 'optgroup']], 'lang' => [0, 'd', ['applet', 'base', 'basefont', 'br', 'frame', 'frameset', 'iframe', 'param', 'script']], 'language' => [0, 'a', ['script']], 'link' => [0, 'a', ['body']], 'longdesc' => [0, 'a', ['img', 'frame', 'iframe']], 'marginheight' => [0, 'a', ['frame', 'iframe']], 'marginwidth' => [0, 'a', ['frame', 'iframe']], 'maxlength' => [0, 'a', ['input']], 'media' => [0, 'a', ['style', 'link']], 'method' => [0, 'a', ['form']], 'multiple' => [0, 'a', ['select']], 'name' => [0, 'a', ['button', 'textarea', 'applet', 'select', 'form', 'frame', 'iframe', 'img', 'a', 'input', 'object', 'map', 'param', 'meta']], 'nohref' => [0, 'a', ['area']], 'noresize' => [0, 'a', ['frame']], 'noshade' => [1, 'a', ['hr']], 'nowrap' => [1, 'a', ['td', 'th']], 'object' => [0, 'a', ['applet']], 'onblur' => [0, 'a', ['a', 'area', 'button', 'input', 'label', 'select', 'textarea']], 'onchange' => [0, 'a', ['input', 'select', 'textarea']], 'onclick' => [0, 'd', ['applet', 'base', 'basefont', 'bdo', 'br', 'font', 'frame', 'frameset', 'head', 'html', 'iframe', 'isindex', 'meta', 'param', 'script', 'style', 'title']], 'ondblclick' => [0, 'd', ['applet', 'base', 'basefont', 'bdo', 'br', 'font', 'frame', 'frameset', 'head', 'html', 'iframe', 'isindex', 'meta', 'param', 'script', 'style', 'title']], 'onfocus' => [0, 'a', ['a', 'area', 'button', 'input', 'label', 'select', 'textarea']], 'onkeydown' => [0, 'd', ['applet', 'base', 'basefont', 'bdo', 'br', 'font', 'frame', 'frameset', 'head', 'html', 'iframe', 'isindex', 'meta', 'param', 'script', 'style', 'title']], 'onkeypress' => [0, 'd', ['applet', 'base', 'basefont', 'bdo', 'br', 'font', 'frame', 'frameset', 'head', 'html', 'iframe', 'isindex', 'meta', 'param', 'script', 'style', 'title']], 'onkeyup' => [0, 'd', ['applet', 'base', 'basefont', 'bdo', 'br', 'font', 'frame', 'frameset', 'head', 'html', 'iframe', 'isindex', 'meta', 'param', 'script', 'style', 'title']], 'onload' => [0, 'a', ['frameset', 'body']], 'onmousedown' => [0, 'd', ['applet', 'base', 'basefont', 'bdo', 'br', 'font', 'frame', 'frameset', 'head', 'html', 'iframe', 'isindex', 'meta', 'param', 'script', 'style', 'title']], 'onmousemove' => [0, 'd', ['applet', 'base', 'basefont', 'bdo', 'br', 'font', 'frame', 'frameset', 'head', 'html', 'iframe', 'isindex', 'meta', 'param', 'script', 'style', 'title']], 'onmouseout' => [0, 'd', ['applet', 'base', 'basefont', 'bdo', 'br', 'font', 'frame', 'frameset', 'head', 'html', 'iframe', 'isindex', 'meta', 'param', 'script', 'style', 'title']], 'onmouseover' => [0, 'd', ['applet', 'base', 'basefont', 'bdo', 'br', 'font', 'frame', 'frameset', 'head', 'html', 'iframe', 'isindex', 'meta', 'param', 'script', 'style', 'title']], 'onmouseup' => [0, 'd', ['applet', 'base', 'basefont', 'bdo', 'br', 'font', 'frame', 'frameset', 'head', 'html', 'iframe', 'isindex', 'meta', 'param', 'script', 'style', 'title']], 'onreset' => [0, 'a', ['form']], 'onselect' => [0, 'a', ['input', 'textarea']], 'onsubmit' => [0, 'a', ['form']], 'onunload' => [0, 'a', ['frameset']], 'onunload' => [0, 'a', ['body']], 'profile' => [0, 'a', ['head']], 'prompt' => [0, 'a', ['isindex']], 'readonly' => [0, 'a', ['textarea', 'input']], 'rel' => [0, 'a', ['a', 'link']], 'rev' => [0, 'a', ['a', 'link']], 'rows' => [0, 'a', ['frameset', 'textarea']], 'rowspan' => [1, 'a', ['td', 'th']], 'rules' => [1, 'a', ['table']], 'scheme' => [0, 'a', ['meta']], 'scope' => [1, 'a', ['td', 'th']], 'scrolling' => [0, 'a', ['frame', 'iframe']], 'selected' => [0, 'a', ['option']], 'shape' => [0, 'a', ['area', 'a']], 'size' => [1, 'a', ['hr', 'font', 'input', 'basefont', 'select']], 'span' => [1, 'a', ['col', 'colgroup']], 'src' => [1, 'a', ['script', 'input', 'frame', 'iframe', 'img']], 'standby' => [0, 'a', ['object']], 'start' => [1, 'a', ['ol']], 'style' => [1, 'd', ['base', 'basefont', 'head', 'html', 'meta', 'param', 'script', 'style', 'title']], 'summary' => [1, 'a', ['table']], 'tabindex' => [0, 'a', ['a', 'area', 'button', 'input', 'object', 'select', 'textarea']], 'target' => [1, 'a', ['a', 'area', 'base', 'form', 'link']], 'text' => [0, 'a', ['body']], 'title' => [1, 'd', ['base', 'basefont', 'head', 'html', 'meta', 'param', 'script', 'title']], 'type' => [1, 'a', ['a', 'link', 'object', 'param', 'script', 'style', 'input', 'li', 'ol', 'ul', 'button']], 'usemap' => [0, 'a', ['img', 'input', 'object']], 'valign' => [1, 'a', ['col', 'colgroup', 'tbody', 'td', 'tfoot', 'th', 'thead', 'tr']], 'value' => [1, 'a', ['input', 'option', 'param', 'button', 'li']], 'valuetype' => [0, 'a', ['param']], 'version' => [0, 'a', ['html']], 'vlink' => [0, 'a', ['body']], 'vspace' => [1, 'a', ['applet', 'img', 'object']], 'width' => [1, 'a', ['hr', 'iframe', 'img', 'object', 'table', 'td', 'th', 'applet', 'col', 'colgroup', 'pre']] }; $_->[2] = { map {($_=>1)} @{$_->[2]} } for values %$att_infos; # match HTML attribute my $re_htmlatt = qr( (?: ([a-zA-Z]+) \s* = \s* (?: \' ([^\'<>\r\n\f]*) \' | \" ([^\"<>\r\n\f]*) \" ) ) )xs; # match HTML tag my $re_htmltag = qr{ < [a-zA-Z]+ (?: \s $re_htmlatt?)* /? > | < / [a-zA-Z]+ \s* > }xs; # match HTML entity reference my $re_htmlentity = qr{ \& (?: ( [a-zA-Z]+ ) | \# ( \d{1,5} ) | \# [xX] ( [0-9a-fA-F]{1,4} ) ) ; }xs; #IMPROVE-handling if params are undef when features are disabled. sub new { my $class = shift; my %params = validate(@_, { texpath => {type => SCALAR, default => '/cgi-bin/texserver'}, eulermbpath => {type => SCALAR, default => '/emb'}, features => {type => SCALAR, default => 'ALL'}, meid => {type => SCALAR | UNDEF, default => undef}, transaction_id => {type => SCALAR | UNDEF, default => undef}, javapics => {type => SCALAR | UNDEF, default => ''}, thread_link => {type => CODEREF | UNDEF, default => undef} }); #IMPROVE-possibly die 'ASSERT' if defined($params{transaction_id}) && $params{transaction_id} !~ /^\d{1,20}$/; die 'ASSERT' if defined($params{meid}) && $params{meid} !~ /^\d{1,9}$/; die 'ASSERT' if defined($params{javapics}) && (length($params{javapics}) > 1E6 || $params{javapics} !~ /^[a-zA-Z0-9_;,\*]*$/); my $self = bless { texpath => $params{texpath}, eulermbpath => $params{eulermbpath}, features => undef, meid => $params{meid}, transaction_id => $params{transaction_id}, javapics => $params{javapics}, thread_link => $params{thread_link} }, $class; $self->features($params{features}); return $self; } sub clone { my $self = shift; my %params = validate(@_, { texpath => {type => SCALAR, optional => 1}, eulermbpath => {type => SCALAR, optional => 1}, features => {type => SCALAR, optional => 1}, meid => {type => SCALAR, optional => 1}, transaction_id => {type => SCALAR, optional => 1}, javapics => {type => SCALAR, optional => 1}, thread_link => {type => CODEREF, optional => 1} }); # copy my $copy = bless { (map { ($_ => $self->{$_}) } grep {$_ ne 'features'} keys %$self), features => {%{$self->{features}}} }, __PACKAGE__; # override for (keys %params) { if($_ eq 'features') { $copy->features($params{$_}); } else { $copy->{$_} = $params{$_}; } } return $copy; } sub features { my $self = shift; if(@_ == 0) { # read return %{$self->{features}}; } else { # write my $features; if(@_ == 1) { die 'ASSERT' if ref($_[0]) ne ''; if($_[0] eq 'ALL') { $features = 'H,HE,E,ET,EH,EO,C,R,B'; } else { $features = $_[0]; } $features = {map {($_=>1)} split /,/, $features}; } else { die 'ASSERT' if @_ % 2; $features = {@_}; } for(keys %$features) { die 'ASSERT' if $_ !~ /^(H|HE|E|ET|EH|EO|C|R|B)$/; } $self->{features} = $features; } } sub gen_lex_embtext { my $self = shift; my($textref) = validate_pos(@_, {type => SCALARREF}); # note: "pos($$textref)+=0" is used to cause a certain reset in # the regex engine. Generally, the regex engine will fail a match # if the match length would be zero a second time when using /gc. # This behavior is typically used to force a bump-along to prevent # an infinite loop (Frield, 131), but it is not desired here, and # this reset prevents it. # Although Coro::Cont is not currently used, here are a few # observations on it: # - Cont routine many not call die. # - $1...$9 not preserved over yield. # - how can the coroutine be reset? # - Why is is called Coro::Cont (i.e. continuation)? Arn't # continuations more general? my $textpos; my $len = length($$textref); pos($$textref) = 0 if ! defined pos($$textref); my $pos1; my @script_stack; my $features = $self->{features}; my $gen_text = sub { COBEGIN(); YIELD('T', substr($$textref, $textpos, $pos1-$textpos)); $textpos = undef; }; # locals my $one; my $two; my $three; my $name; my $end_tag; my $endpos; my $match; my $coroutine = coroutine { COBEGIN(); pos($$textref) = pos($$textref); # reset state until($$textref =~ /\G\z/gc) { $pos1 = pos($$textref); # _trace "DEBUG:lex_embtext," . &_trace_pos($$textref) # HTML tag or end tag if($features->{H} && $$textref =~ /\G($re_htmltag)/gc) { $one = $1; CALL($gen_text) if defined $textpos; YIELD('H', $one); # tag_id, all } # HTML entity reference elsif($features->{HE} && $$textref =~ /\G($re_htmlentity)/gc) { $one = $1; $two = defined($2) ? $2 : defined($3) ? $3 : hex($4); CALL($gen_text) if defined $textpos; YIELD('N', $one, $two); } # LaTeX environment elsif($features->{E} && $$textref =~ / \G ( \\ (?: ([\[\(]) | begin \{ ([^\}]*) \} ) ) /xgc) { $one = $1; $two = $2; $name = $2 || $3; CALL($gen_text) if defined $textpos; YIELD('E', $one, $name); $end_tag = $one; if(defined $two) { $end_tag =~ tr/[(/])/; } else { $end_tag =~ s{begin}{end}; } if($features->{ET} && $name eq 'rawtext') { $$textref =~ /\G(.*?)(?=\Q$end_tag\E|\z)/gcs or die 'ASSERT'; pos($$textref)+=0; YIELD('T', $1); } elsif($features->{EH} && $name eq 'rawhtml') { until($$textref =~ /\G(?=\z|\Q$end_tag\E)/gc) { $pos1 = pos($$textref); # note--changed # HTML tag or end tag if($$textref =~ /\G($re_htmltag)/gc) { $one = $1; CALL($gen_text) if defined $textpos; YIELD('H', $one); # tag_id, all } # HTML entity reference elsif($features->{HE} && $$textref =~ /\G($re_htmlentity)/gc) { $one = $1; $two = defined($2) ? $2 : defined($3) ? $3 : hex($4); CALL($gen_text) if defined $textpos; YIELD('N', $one, $two); } else { $textpos = $pos1 if ! defined $textpos; $$textref =~ /\G.[^<\\]*/gcs; } } $pos1 = pos($$textref); pos($$textref)+=0;; CALL($gen_text) if defined $textpos; } elsif($features->{EO}) { $endpos = pos($$textref); pos($$textref) = $pos1; $match = $self->match_latex($textref); if($match) { pos($$textref) = $match->[2]; } else { pos($$textref) = $len; } YIELD( 'I', # tag_id substr($$textref, $endpos, pos($$textref)-$endpos) # inner ); } else { # unrecognized, so return text pos($$textref) = $len; YIELD('X', substr($$textref, $endpos), "Unrecognized environment $name.", 'E1', $end_tag); } if($$textref =~ /\G\Q$end_tag\E/gc) { YIELD('F', $end_tag); } else { YIELD('X', '', "Missing $end_tag", 'E2', $end_tag); } } # LaTeX-like tag processed by EulerMB elsif($features->{C} && $$textref =~ /\G ( \\ (\w+) \{ ([^\}]*) \} )/xgc) { $one = $1; $name = $2; $three = $3; CALL($gen_text) if defined $textpos; YIELD('C', $one, $name, $three); # tag_id, all, name, body } # Super- or subscript marker elsif($features->{R} && $$textref =~ /\G(([_^])\{)/gc) { $one = $1; $two = $2; CALL($gen_text) if defined $textpos; YIELD('R', $one, $two); # tag_id, all, name push @script_stack, $two; } # Super- or subscript end marker elsif($features->{R} && @script_stack > 0 && $$textref =~ /\G(\})/gc) { $one = $1; CALL($gen_text) if defined $textpos; $name = pop @script_stack; YIELD('S', $one, $name); # tag_id, all, name } # EulerMB block-style tag, e.g. [sqrt] elsif($features->{B} && $$textref =~ /\G(\[([a-zA-Z0-9_]+)\])/gc) { $one = $1; $two = $2; CALL($gen_text) if defined $textpos; L18: YIELD('B', $one, $two); } # text (partial) else { $textpos = $pos1 if ! defined $textpos; $$textref =~ /\G.[^<\&\\_^\[\}]*/gcs; } } $pos1 = pos($$textref); CALL($gen_text) if defined $textpos; }; return $coroutine->wrap(); }; sub match_latex { my($self, $textref) = @_; my $pos0 = pos($$textref); if($$textref =~ m/ \G ( \\ (?: ( [\[\(] ) | begin \s* \{ ( [^\}]* ) \} ) ) /xgc) { my $all = $1; my $name = defined($2) ? $2 : $3; my($beg, $end); if($name eq '[') { $beg = qr/\[/; $end = qr/\]/; } elsif($name eq '(') { $beg = qr/\(/; $end = qr/\)/; } else { $beg = qr/begin \s* \{ \Q$name\E \}/x; $end = qr/end \s* \{ \Q$name\E \}/x; } my $pos_inner_start = pos($$textref); my $pos_inner_end; my $depth = 1; while($depth > 0 && $$textref !~ /\G\z/gc) { my $pos = pos($$textref); if($$textref =~ /\G\\$beg/gc) { $depth++; } elsif($$textref =~ /\G\\$end/gc) { $depth--; $pos_inner_end = $pos if $depth == 0; } elsif($$textref =~ /\G\\/gc) { } else { $$textref =~ /\G[^\\]+/gcs; } } if($depth > 0) { pos($$textref) = $pos0; return; } return [$name, $pos_inner_start, $pos_inner_end]; } else { return; } } sub match_htmltag { my($tag) = @_; if($tag =~ /^<([a-zA-Z]+)/gc) { my $name = lc($1); my %attr; while($tag =~ /\G\s+$re_htmlatt/gcs) { my $val = defined($2) ? $2 : $3; if(defined $attr{$1}) { return "Duplicate attribute $1"; } else { $attr{lc($1)} = $val; } } if($tag =~ m{\G\s*(/)?>$}gc) { my $id = $1 ? 'SE' : 'S'; return [$id, $name, \%attr]; } else { return; } } elsif($tag =~ m{^</([a-zA-Z]+)\s*>$}s) { return ['E', lc($1)]; } else { return; } } sub gen_token_iterator { my $self = shift; my($source) = validate_pos(@_, {type => ARRAYREF}); my $idx = 0; my $iterator = sub { if($idx < @$source) { return @{$source->[$idx++]}; } else { return; } }; return $iterator; } sub gen_validate_embtext { my $self = shift; my($source) = validate_pos(@_, {type => CODEREF | ARRAYREF}); $source = $self->gen_token_iterator($source) if ref($source) eq 'ARRAY'; my @token; my $nexttok = sub { @token = $source->(); # _trace 'DEBUG:VAL_TOKEN', \@token; }; my $eat_token; #-- locals my %tag_counts; # tokens to inject to fix mismatched start/end tags my @extra_tokens; my $check_html = sub { COBEGIN(); my $tag = &match_htmltag($token[1]); if(! defined $tag) { die 'ASSERT'; } eval { if(! ref($tag)) { die ['X', $token[1], $tag, 'E3']; } elsif($tag->[0] =~ /^S/) { # start tag my($type, $tagname, $atts) = @$tag; my $ele_info = $element_infos->{$tagname}; if(!defined($ele_info)) { die ['X', $token[1], "HTML tag $token[1] unknown.", 'E4']; } my($ele_allow, $ele_hasend) = @$ele_info; #IMPROVE:check only if safehtml set. if(!$ele_allow) { die ['X', $token[1], "HTML tag $token[1] not allowed.", 'E10']; } if($tagname eq 'a') { if(defined $atts->{href}) { my $bad = ($atts->{href} =~ / javascript\: | \&\#\d /xi); die ['X', $token[1], "Attribute '$atts->{href}' on tag $token[1] not allowed.", 'E5'] if $bad; } } foreach my $at (keys %$atts) { my $att_info = $att_infos->{$at}; if(!defined($att_info)) { die ['X', $token[1], "Attribute $at on HTML tag $token[1] unknown.", 'E6']; } elsif($att_info->[0] == 0) { die ['X', $token[1], "Attribute $at on HTML tag $token[1] not allowed.", 'E7']; } elsif($att_info->[1] eq 'a' && !$att_info->[2]{$tagname}) { die ['X', $token[1], "Attribute $at on HTML tag $token[1] not valid.", 'E8']; } elsif($att_info->[1] eq 'd' && $att_info->[2]{$tagname}) { die ['X', $token[1], "Attribute $at on HTML tag $token[1] not valid.", 'E9']; } if($at eq 'style') { my $val = $atts->{$at}; my $bad = ($val =~ / \bmargin[a-z\-]*\s*:\s*\-? | \burl | \bexpression | \bposition\s*: | \&\#\d /xi); die ['X', $token[1], "Style '$val' on tag $token[1] not allowed.", 'E11'] if $bad; } } # for attributes # if not removed... ++$tag_counts{$tagname} if $ele_hasend; } # if start tag if($tag->[0] =~ /E/) { # end tag my($type, $tagname) = @$tag; my $ele_info = $element_infos->{$tagname}; if(!defined($ele_info)) { die ['X', $token[1], "HTML tag $token[1] unknown.", 'E4']; } my($ele_allow, $ele_hasend) = @$ele_info; if(!$ele_allow) { die ['X', $token[1], "HTML tag $token[1] not allowed", 'E17']; } if($tag->[0] =~ /S/) { # start-end tag if($ele_hasend) { $tag_counts{$tagname}--; } } else { if(defined $tag_counts{$tagname} && $tag_counts{$tagname} > 0) { $tag_counts{$tagname}--; } else { die ['X', $token[1], "Mismatched HTML tag $token[1].", 'E13']; } } } # if end tag }; # eval if($@) { die $@ if ref($@) ne 'ARRAY'; YIELD(@{$@}); } else { YIELD(@token); } }; # sub return coroutine { COBEGIN(); $eat_token = 0; $nexttok->(); while(@token != 0) { # _trace "VALIDATE_LOOP", \@token; if($token[0] eq 'X') { if($token[3] eq 'E2') { # missing end tag YIELD('F', $token[4]); # inject missing tag } YIELD(@token); } #elsif($token[0] eq 'E' && $token[2] eq 'rawhtml') { #} elsif($token[0] eq 'H') { CALL($check_html); } elsif($token[0] eq 'N') { my $val = $token[2]; if($val =~ /^\d+$/ && $val == 0) { # '\0' YIELD('X', $token[1], "Invalid HTML entity reference $token[1]", 'E17'); } else { YIELD(@token); } } elsif($token[0] eq 'R') { #IMPROVE my $name = $token[2]; $tag_counts{$name}++; YIELD(@token); } elsif($token[0] eq 'S') { my $name = $token[2]; if($tag_counts{$name} >= 0) { $tag_counts{$name}--; YIELD(@token); } else { YIELD('X', $token[1], "Mismatched $token[1]", 'E14'); } } else { YIELD(@token); } $nexttok->(); } for my $id (keys %tag_counts) { next if $tag_counts{$id} == 0; for(1..$tag_counts{$id}) { if($id =~ /^[_^]$/) { push @extra_tokens, ['S', '}', $id]; push @extra_tokens, ['X', '', "Missing } to match $id\{", 'E16']; } else { my $end_tag = "</$id>"; push @extra_tokens, ['H', $end_tag]; push @extra_tokens, ['X', '', "Missing $end_tag", 'E15']; } } } while(my $token = shift @extra_tokens) { YIELD(@$token); } }->wrap(); } sub validate_embtext { my $self = shift; my($textref) = validate_pos(@_, {type => SCALARREF | CODEREF | ARRAYREF}); # coroutine pipeline my $source = $textref; $source = $self->gen_lex_embtext($textref) if ref($source) eq 'SCALAR'; $source = $self->gen_validate_embtext($source); my @parts; while(my @token = $source->()) { push @parts, \@token; } return \@parts; } sub validate_tokens { my $self = shift; my($source) = @_; } sub lex_embtext { my $self = shift; my($textref) = validate_pos(@_, {type => SCALARREF}); my $source = $self->gen_lex_embtext($textref); my @parts; while(my @token = $source->()) { push @parts, \@token; } return \@parts; } sub htmlize { my $self = shift; my($source) = validate_pos(@_, {type => CODEREF | ARRAYREF}); $source = $self->gen_token_iterator($source) if ref($source) eq 'ARRAY'; my @warnings; my @token; my $out = ''; my $nexttok = sub { @token = $source->(); # _trace 'DEBUG:HTML_TOKEN', @token; }; $nexttok->(); while(@token != 0) { if($token[0] eq 'E') { my $environment = $token[2]; if($environment =~ /^(?:rawtext|rawhtml)$/s) { $nexttok->(); while(1) { if(@token == 0) { print "ASSERT\n"; $out .= "[[Expected \\end{$environment}]]"; last; } elsif($token[0] eq 'H') { $out .= $token[1]; $nexttok->(); } elsif($token[0] eq 'T') { $out .= ($environment eq 'rawhtml') ? CGI::escapeHTML($token[1]) : &text_to_html($token[1]); $nexttok->(); } elsif($token[0] eq 'F') { $nexttok->(); last; } else {print "UNEXPECTED TOKEN\n"; die; } } } elsif($environment =~ /^(?: \[ | \( | align|alignat|displaymath|equation|flalign|math|gather| multiline|split )$/xs) { my $tex = $token[1]; $nexttok->(); if(@token == 0) { $out .= "[[error: missing content]]"; } elsif($token[0] eq 'I') { $tex .= $token[1]; $nexttok->(); if(@token == 0) { $out .= "[[error: missing end token]]"; } elsif($token[0] eq 'F') { $tex .= $token[1]; $nexttok->(); my $img = $self->make_latex_image($tex); # simulate centering for environments that should # be centered. if($environment =~ /^(displaymath|\[)$/) { $img = qq(<div style="text-align:center">$img</div>); } $out .= $img; } else { die "ASSERT:unexpected token"; } } else { $nexttok->(); die "ASSERT:unexpected token"; } } else { $nexttok->(); $out .= "[[Unrecognized environment $environment]]"; if(@token == 0) { $out .= "[[Missing \\end{$environment}]]"; } else { $out .= "[[IGNORED:" . &text_to_html($token[1]) . "]]"; } $nexttok->(); if(@token == 0) { $out .= "[[Missing \\end{$environment}..]]"; } elsif($token[0] eq 'F') { # good $nexttok->(); } else { $out .= "[[Missing \\end{$environment}..]]"; $out .= "[[IGNORED:" . &text_to_html($token[1]) . "]]"; } } } elsif($token[0] eq 'H') { $out .= $token[1]; $nexttok->(); } elsif($token[0] eq 'C') { my $name = $token[2]; my $param = $token[3]; my $more = $self->make_embtex($name, $param); $more = &text_to_html($token[1]) if ! defined $more; $out .= $more; $nexttok->(); } elsif($token[0] eq 'R') { my $name = $token[2]; $out .= ($name eq '_') ? q(<sub>) : q(<sup>); $nexttok->(); } elsif($token[0] eq 'S') { my $name = $token[2]; $out .= ($name eq '_') ? q(</sub>) : q(</sup>); $nexttok->(); } elsif($token[0] eq 'B') { my $name = $token[2]; my $more = $self->make_symbol($name); $more = $self->make_javapicture($name) if ! defined $more; $more = "[$name]" if ! defined $more; $out .= $more; $nexttok->(); } elsif($token[0] eq 'N') { $out .= $token[1]; $nexttok->(); } elsif($token[0] eq 'T') { $out .= &text_to_html($token[1]); $nexttok->(); } elsif($token[0] eq 'X') { my $msg = $token[2]; push @warnings, $msg; my $msg_html = CGI::escapeHTML($msg); $out .= qq(<span style="background-color:#800000;color:white">) . qq([[error: $msg_html]]</span>); $nexttok->(); } else { die "ASSERT token $token[0]"; $nexttok->(); } } $self->{warnings} = \@warnings; return $out; } sub textize { my $self = shift; my($source) = validate_pos(@_, {type => CODEREF | ARRAYREF}); $source = $self->gen_token_iterator($source) if ref($source) eq 'ARRAY'; my @warnings; my @token; my $out = ''; my $nexttok = sub { @token = $source->(); # _trace 'DEBUG:TEXT_TOKEN', @token; }; $nexttok->(); while(@token != 0) { if($token[0] eq 'X') { $out .= $token[1]; my $msg = $token[2]; push @warnings, $msg; } else { $out .= $token[1]; } $nexttok->(); } $self->{warnings} = \@warnings; return $out; } sub embtext_to_html { my $self = shift; my $textref = shift; my %params = validate(@_, { validate => {type => SCALAR, default => 1} }); # coroutine pipeline my $source = $self->gen_lex_embtext($textref); if($params{validate}) { $source = $self->gen_validate_embtext($source); } my $html = $self->htmlize($source); return $html; } sub warnings { my $self = shift; return $self->{warnings}; # note: no copy } sub text_to_html { my($s) = @_; $s =~ s/\&/&/g; $s =~ s/</</g; $s =~ s/>/>/g; $s =  ($s); $s =~ s/\n/\n<br>/g; return $s; }; sub make_latex_image { my($self, $tex) = @_; my $tex_escaped = uri_escape($tex, "^A-Za-z0-9"); my $alt = $tex; $alt =~ s/\"/\\\"/g; $alt =~ s/[\r\n]/ /g; $alt =~ s/ +/ /g; my $img = qq(<img src="$self->{texpath}?tex=$tex_escaped" ) . qq(style="vertical-align:middle" border="0" alt="$alt" title="$alt" />); return $img; } sub make_symbol { my($self, $name) = @_; my $val = $symbols{$name}; my $img = defined($val) ? qq(<img src="$self->{eulermbpath}/symbols/$val") . qq( style="vertical-align:middle; border-style:none") . qq( alt="[$name]" title="[$name]" />) : undef; return $img; } sub make_embtex { my($self, $name, $param) = @_; my $html; if($name eq 'thread') { if($param =~ /^(\d+)$/s) { if(defined $self->{thread_link}) { $html = $self->{thread_link}->($1); } else { $html = qq([<a href="$self->{eulermbpath}/thread/$1">$1</a>]); } } else { $html ='[[---BROKEN thread tag---]]'; #IMPROVE? } } elsif($name eq 'object') { if($param =~ /^[a-zA-Z0-9_\.]+$/s) { $html = $self->make_attachment($param); } else { $html = '[[---BROKEN object tag---]]'; #IMPROVE? } } #old: elsif($name eq 'special') { #old # $html = qq(<img src="/img/special/$id.png" />); # } else {} # unrecognized, return undef. return $html; } sub make_javapicture { my($self, $name) = @_; $name =~ tr/a-zA-Z0-9_//cd; $name = substr($name, 0, 100); # IMPROVE my $pics = $self->{javapics}; my @pics = split /;/, $pics; my %pics; foreach my $pic (@pics) { my($pic_name, $data) = split /,/, $pic; $pic_name =~ tr/a-zA-Z0-9_//cd; $pics{$pic_name} = $data; } my $img_tag; if(defined $pics{$name}) { my $data = $pics{$name}; #note: ignore bytes 0-2 (info "Z" tag and version #) #parse: hex LSB->MSB my $width = hex(&_reverse_string(substr($data, 3, 4))); my $height = hex(&_reverse_string(substr($data, 7, 4))); #old: <applet code="ShowPict.class" # codebase="$self->{javapath}" width="$width" # height="$height" # > <param name="bytes" value="$data"> # </applet> if(defined $self->{meid}) { $img_tag = qq(<img src="$self->{eulermbpath}/javaimage/$self->{meid}/$name" ) . qq(width="$width" height="$height" alt="[$name]" title="[$name]" />); } elsif(defined $self->{transaction_id}) { $img_tag = qq(<img src="$self->{eulermbpath}/javaimage/tmp/$self->{transaction_id}/$name"). qq(width="$width" height="$height" alt="[$name]" title="[$name]" />); } else { } # nothing } else { } #nothing return $img_tag; } sub make_attachment { my($self, $name) = @_; #IMPROVE:duplicated code if($name !~ /^[a-zA-Z0-9_\.]+$/) { return '[[Invalid attachment name format]]'; } #IMPROVE--validate attachment name against those available my $base = defined($self->{meid}) ? $self->{meid} : defined($self->{transaction_id}) ? "N$self->{transaction_id}" : 'ASSERT'; # IMPROVE? my $tag = qq(<img src="$self->{eulermbpath}/image/$base/$name") . qq( alt="[$name]" title="[$name]" />); return $tag; } sub nbsp { my($s) = @_; # IMRPOVE: what about tabs? $s =~ s/(^| )( +)/$1 . ' ' x length($2)/gme; return $s; } # [internal] # Reverse characters in string. # (e.g. "asdf" -> "fdsa"). # $string = &reverse($string); sub _reverse_string { my($str) = @_; my $out = join '', reverse split //, $str; return $out; } # [internal] # print chars in string starting as pos() location. # my $s = &_trace_pos($textref); sub _trace_pos { my($textref) = @_; my $pos = pos($$textref); my $snip = substr($$textref, pos($$textref), 10); my $s = qq([pos=$pos,snip=$snip]); return $s; } sub _trace(@) { print STDERR Data::Dumper::Dumper(\@_); } 1 __END__