Commits

matt chisholm committed 577201e Draft

import most recent Codeville changeset into Mercurial:

Change d419 (6d0f) by matt on Fri Feb 27 23:22:57 2009

  • Participants
  • Branches default
  • Tags 1.1.4


Files changed (20)

+.cdv/
+.*.pyc
+test-mosuki.py
+Copyright (c) 2007, Matt Chisholm
+
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
+
+    * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
+    * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
+    * Neither the name of the PottyMouth nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+include readme.html
+include LICENSE.txt
+include setup.py
+include test.py
+include web.py
+include debian/changelog
+include debian/control
+include debian/copyright
+include debian/pyversions
+include debian/rules

File PottyMouth.py

+#!/usr/bin/env python
+import re
+
+short_line_length = 50
+encoding = 'utf8' # Default output encoding
+
+
+class TokenMatcher(object):
+
+    def __init__(self, name, pattern, replace=None):
+        self.name = name
+        self.pattern = re.compile(pattern, re.IGNORECASE)
+        self.replace = replace
+
+    def match(self, string):
+        return self.pattern.match(string)
+
+
+protocol_pattern = re.compile(r'^\w+://', re.IGNORECASE)
+
+domain_pattern = r"([-\w]+\.)+\w\w+"
+
+_URI_pattern = ("(("                                    +
+                r"(https?|webcal|feed|ftp|news|nntp)://" + # protocol
+                r"([-\w]+(:[-\w]+)?@)?"                  + # authentication
+                r")|www\.)"                              + # or just www.
+                domain_pattern                           + # domain
+                r"(/[-\w$\.+!*'(),;:@%&=?/~#]*)?"          # path
+                )
+_URI_end_punctuation = r"(?<![\]\.}>\)\s,?!;:\"'])"  # end punctuation
+
+URI_pattern = _URI_pattern + _URI_end_punctuation
+
+email_pattern = r'([^()<>@,;:\"\[\]\s]+@' + domain_pattern + ')'
+
+image_pattern = _URI_pattern + r'\.(jpe?g|png|gif)' + _URI_end_punctuation
+
+# youtube_pattern matches:
+#  http://www.youtube.com/watch?v=KKTDRqQtPO8 and
+#  http://www.youtube.com/v/KKTDRqQtPO8       and
+#  http://youtube.com/watch?v=KKTDRqQtPO8     and
+#  http://youtube.com/v/KKTDRqQtPO8
+youtube_pattern = r'http://(?:www\.)?youtube\.com/(?:watch\?)?v=?/?([\w\-]{11})'
+youtube_matcher = re.compile(youtube_pattern, re.IGNORECASE)
+
+token_order = (
+    TokenMatcher('NEW_LINE'   , r'(\r?\n)'), #fuck you, Microsoft!
+    TokenMatcher('YOUTUBE'    , '('+youtube_pattern+')'),
+    TokenMatcher('IMAGE'      , '('+image_pattern  +')'),
+    TokenMatcher('URL'        , '('+URI_pattern    +')'),
+    TokenMatcher('EMAIL'      , email_pattern ),
+
+    TokenMatcher('HASH'       ,  r'([\t ]*#[\t ]+)'     ),
+    TokenMatcher('DASH'       ,  r'([\t ]*-[\t ]+)'     ),
+    TokenMatcher('NUMBERDOT'  ,  r'([\t ]*\d+\.[\t ]+)' ),
+    TokenMatcher('ITEMSTAR'   ,  r'([\t ]*\*[\t ]+)'    ),
+    TokenMatcher('BULLET'     , ur'([\t ]*\u2022[\t ]+)'),
+
+    TokenMatcher('UNDERSCORE' , r'(_)' ),
+    TokenMatcher('STAR'       , r'(\*)'),
+
+    TokenMatcher('RIGHT_ANGLE', r'(>[\t ]*(?:>[\t ]*)*)'),
+
+    # The following are simple, context-independent replacement tokens
+    TokenMatcher('EMDASH'  , r'(--)'    , replace=unichr(8212)),
+    # No way to reliably distinguish Endash from Hyphen, Dash & Minus, 
+    # so we don't.  See: http://www.alistapart.com/articles/emen/
+
+    TokenMatcher('ELLIPSIS', r'(\.\.\.)', replace=unichr(8230)),
+    #TokenMatcher('SMILEY' , r'(:\))'   , replace=unichr(9786)), # smiley face, not in HTML 4.01, doesn't work in IE
+    )
+
+
+
+# The "Replacers" are context sensitive replacements, therefore they
+# must be applied in-line to the string as a whole before tokenizing.
+# Another option would be to keep track of previous context when
+# applying tokens.
+class Replacer(object):
+
+    def __init__(self, pattern, replace):
+        self.pattern = re.compile(pattern)
+        self.replace = replace
+
+    def sub(self, string):
+        return self.pattern.sub(self.replace, string)
+
+
+replace_list = [
+    Replacer(r'(``)', unichr(8220)),
+    Replacer(r"('')", unichr(8221)),
+
+    # First we look for inter-word " and ' 
+    Replacer(r'(\b"\b)', unichr(34)), # double prime
+    Replacer(r"(\b'\b)", unichr(8217)), # apostrophe
+    # Then we look for opening or closing " and ' 
+    Replacer(r'(\b"\B)', unichr(8221)), # close double quote 
+    Replacer(r'(\B"\b)', unichr(8220)), # open double quote
+    Replacer(r"(\b'\B)", unichr(8217)), # close single quote
+    Replacer(r"(\B'\b)", unichr(8216)), # open single quote
+
+    # Then we look for space-padded opening or closing " and ' 
+    Replacer(r'(")(\s)', unichr(8221)+r'\2'), # close double quote
+    Replacer(r'(\s)(")', r'\1'+unichr(8220)), # open double quote 
+    Replacer(r"(')(\s)", unichr(8217)+r'\2'), # close single quote
+    Replacer(r"(\s)(')", r'\1'+unichr(8216)), # open single quote
+
+    # Then we gobble up stand-alone ones
+    Replacer(r'(`)', unichr(8216)),
+    #Replacer(r'(")', unichr(8221)),
+    #Replacer(r"(')", unichr(8217)),
+    ]
+
+
+
+class Token(unicode):
+
+    def __new__(cls, name, content=''):
+        self = unicode.__new__(cls, content)
+        self.name = name
+        return self
+
+    def __repr__(self):
+        return '%s{%s}'%(self.name, self)
+
+    def __add__(self, extra):
+        return Token(self.name, unicode.__add__(self, extra))
+
+    def __str__(self):
+        return self.encode(encoding, 'xmlcharrefreplace')
+
+
+
+# escape() is made available to calling code in case it needs to
+# escape the content of PottyMouth Nodes before converting it to
+# another tree object that does not automatically escape these
+# disallowed HTML characters.
+def escape(string):
+    out = string.replace('&', '&amp;')
+    out =    out.replace('<', '&lt;' )
+    out =    out.replace('>', '&gt;' )
+    return out
+
+
+
+class Line(list):
+
+    def __init__(self):
+        self.depth = 0
+
+
+    def __repr__(self):
+        return 'Line[' + ''.join(map(repr, self)) + '(' + str(self.depth) + ')]'
+
+
+    def __len__(self):
+        return sum([len(x) for x in self])
+
+
+
+class Node(list):
+
+    def __new__(cls, name, *contents, **kw):
+        self = list.__new__(cls)
+        return self
+
+
+    def __init__(self, name, *contents, **kw):
+        super(Node, self).__init__()
+        self.name = name.lower()
+        self.extend(contents)
+        self._attributes = kw.get('attributes', {})
+
+
+    def node_children(self):
+        for n in self:
+            if isinstance(n, Node):
+                return True
+        return False
+
+
+    def __str__(self):
+        if self.name in ('br','img'): # Also <hr>
+            # <br></br> causes double-newlines, so we do this
+            return '<%s%s />' % (self.name, self._attribute_string())
+        else:
+            content = ''
+            for c in self:
+                if isinstance(c, Node):
+                    content += str(c)
+                else:
+                    content += escape(c).encode(encoding, 'xmlcharrefreplace')
+                content += '\n'
+            content = content.rstrip('\n')
+            content = content.replace('\n', '\n  ')
+
+            interpolate = {'name'   :self.name               ,
+                           'attrs'  :self._attribute_string(),
+                           'content':content                 ,}
+
+            if self.node_children():
+                return '<%(name)s%(attrs)s>\n  %(content)s\n</%(name)s>' % interpolate
+            elif self.name == 'span':
+                return content
+            else:
+                return '<%(name)s%(attrs)s>%(content)s</%(name)s>' % interpolate
+
+
+    def _attribute_string(self):
+        content = ''
+        if self._attributes:
+            for k, v in self._attributes.items():
+                content += ' %s="%s"' % (k, escape(v).encode(encoding, 'xmlcharrefreplace'))
+        return content
+
+
+
+class URLNode(Node):
+
+    def __new__(cls, content, internal=False):
+        self = Node.__new__(cls, 'a', content)
+        return self
+
+
+    def __init__(self, content, internal=False):
+        attributes = {'href':content}
+        if not internal:
+            attributes['class'] = 'external'
+
+        if content.startswith('http://'):
+            content = content[7:]
+
+        Node.__init__(self, 'a', content, attributes=attributes)
+
+
+
+class LinkNode(URLNode):
+
+    pass
+
+
+
+class EmailNode(URLNode):
+
+    def __init__(self, content, internal=False):
+        attributes = {'href':'mailto:'+content}
+        if not internal:
+            attributes['class'] = 'external'
+
+        Node.__init__(self, 'a', content, attributes=attributes)
+
+
+
+class ImageNode(Node):
+
+    def __new__(cls, content):
+        self = Node.__new__(cls, 'img', content)
+        return self
+
+    def __init__(self, content):
+        Node.__init__(self, 'img', '', attributes={'src':content})
+
+
+
+class YouTubeNode(Node):
+
+    def __new__(cls, content):
+        self = Node.__new__(cls, 'object', content)
+        return self
+
+    def __init__(self, content):
+        Node.__init__(self, 'object', attributes={'width':'425', 'height':'350',})
+
+        ytid = youtube_matcher.match(content).groups()[0]
+        url = 'http://www.youtube.com/v/'+ytid
+
+        self.append(Node(name='param',
+                         attributes={'name':'movie', 'value':url,}))
+        self.append(Node('param',
+                         attributes={'name':'wmode', 'value':'transparent',}))
+        self.append(Node('embed',
+                         attributes={'type':'application/x-shockwave-flash',
+                                     'wmode':'transparent','src':url,
+                                     'width':'425', 'height':'350',}))
+
+
+
+class PottyMouth(object):
+
+    def __init__(self, url_check_domains=(), url_white_lists=(),
+                 all_links=True,      # set False to disable all URL hyperlinking
+                 image=True,          # set False to disable <img> tags for image URLs
+                 youtube=True,        # set False to disable YouTube embedding
+                 email=True,          # set False to disable mailto:email@site.com URLs
+                 all_lists=True,      # set False to disable all lists (<ol> and <ul>)
+                 unordered_list=True, # set False to disable unordered lists (<ul>)
+                 ordered_list=True,   # set False to disable ordered lists (<ol>)
+                 numbered_list=True,  # set False to disable '\d+\.' lists
+                 blockquote=True,     # set False to disable '>' <blockquote>s
+                 bold=True,           # set False to disable *bold*
+                 italic=True,         # set False to disable _italics_
+                 emdash=True,         # set False to disable -- emdash
+                 ellipsis=True,       # set False to disable ... ellipsis
+                 smart_quotes=True,   # set False to disable smart quotes
+                 ):
+
+        self._url_check_domain = None
+        if url_check_domains:
+            self._url_check_domain = re.compile('(\w+://)?((' + ')|('.join(url_check_domains) + '))',
+                                                flags=re.I)
+
+        self._url_white_lists  = [re.compile(w) for w in url_white_lists]
+        self.smart_quotes = smart_quotes
+
+        self.token_list = []
+        for t in token_order:
+            n = t.name
+            if n in ('URL','IMAGE','YOUTUBE','EMAIL') and not all_links:
+                continue
+            elif n == 'IMAGE' and not image:                           continue
+            elif n == 'YOUTUBE' and not youtube:                       continue
+            elif n == 'EMAIL' and not email:                           continue
+            elif n in ('HASH','DASH','NUMBERDOT','ITEMSTAR','BULLET') and not all_lists:
+                continue 
+            elif n in ('DASH','ITEMSTAR','BULLET') and not unordered_list:
+                continue
+            elif n in ('HASH','NUMBERDOT') and not ordered_list:
+                continue
+            elif n == 'NUMBERDOT' and not numbered_list:               continue
+            elif n == 'STAR' and not bold:                             continue
+            elif n == 'UNDERSCORE' and not italic:                     continue
+            elif n == 'RIGHT_ANGLE' and not blockquote:                continue
+            elif n == 'EMDASH' and not emdash:                         continue
+            elif n == 'ELLIPSIS' and not ellipsis:                     continue
+
+            self.token_list.append(t)
+
+
+    def debug(self, *s):
+        return # comment out this return to enable debug output
+        print ' '.join(map(str, s))
+
+
+    def tokenize(self, string):
+        p = 0
+        found_tokens = []
+        unmatched_collection = ''
+        while p < len(string):
+            found_token = False
+            for tm in self.token_list:
+                m = tm.match(string[p:])
+                if m:
+                    found_token = True
+                    content = m.groups()[0]
+                    p += len(content)
+
+                    if tm.replace is not None: 
+                        unmatched_collection += tm.replace
+                        break
+
+                    if unmatched_collection:
+                        try:
+                            found_tokens.append(Token('TEXT', unmatched_collection))
+                        except UnicodeDecodeError:
+                            found_tokens.append(Token('TEXT', unmatched_collection.decode('utf8')))
+                        except:
+                            raise
+
+                    unmatched_collection = ''
+
+                    if tm.name == 'NEW_LINE':
+                        if found_tokens and found_tokens[-1].name == 'TEXT':
+                            found_tokens[-1] += ' '
+                        content=' '
+
+                    found_tokens.append(Token(tm.name, content))
+                    break
+
+            if not found_token:
+                # Pull one character off the string and continue looking for tokens
+                unmatched_collection += string[p]
+                p += 1
+
+        if unmatched_collection:
+            found_tokens.append(Token('TEXT', unmatched_collection))
+
+        return found_tokens
+
+
+    def _find_blocks(self, tokens):
+        finished = []
+
+        current_line = Line()
+
+        stack = []
+
+        old_depth = 0
+
+        for t in tokens:
+            self.debug(t)
+
+            if t.name == 'NEW_LINE':
+                if current_line:
+                    if current_line.depth == 0 and old_depth != 0:
+                        # figure out whether we're closing >> or * here and collapse the stack accordingly
+                        self.debug('\tneed to collapse the stack by' + str(old_depth))
+                        top = None
+                        for i in range(old_depth):
+                            if stack and stack[-1].name == 'p':
+                                top = stack.pop()
+                            if stack and stack[-1].name == 'blockquote':
+                                top = stack.pop()
+
+                        if not stack:
+                            if top is not None:
+                                finished.append(top)
+                            stack.append(Node('p'))
+                        self.debug('\tclosing out the stack')
+                        old_depth = 0
+                    self.debug('\tappending line to top of stack')
+                    if not stack:
+                        stack.append(Node('p'))
+                    stack[-1].append( current_line )
+                    current_line = Line()
+
+                elif stack:
+                    if stack[-1].name in ('p','li'):
+                        top = stack.pop() # the p or li
+                        self.debug('\tpopped off because saw a blank line')
+
+                        while stack:
+                            if stack[-1].name in ('blockquote','ul','ol','li'):
+                                top = stack.pop()
+                            else:
+                                break
+                        if not stack:
+                            finished.append(top)
+
+            elif t.name in ('HASH','NUMBERDOT','ITEMSTAR','BULLET','DASH') and not(current_line):
+                if stack and stack[-1].name == 'p':
+                    top = stack.pop()
+                    if current_line.depth < old_depth:
+                        # pop off <blockquote> and <li> or <p> so we can append the new <li> in the right node
+                        for i in range(old_depth - current_line.depth):
+                            top = stack.pop() # the <blockquote>
+                            top = stack.pop() # the previous <li> or <p>
+                    if not stack:
+                        finished.append(top)
+
+                if stack and stack[-1].name == 'li':
+                    stack.pop() # the previous li
+                elif stack and stack[-1].name in ('ul', 'ol'):
+                    pass
+                else:
+                    if t.name in ('HASH','NUMBERDOT'):
+                        newl = Node('ol')
+                    elif t.name in ('ITEMSTAR','BULLET','DASH'):
+                        newl = Node('ul')
+                    if stack:
+                        stack[-1].append(newl)
+                    stack.append(newl)
+
+                newli = Node('li')
+                stack[-1].append(newli)
+                stack.append(newli)
+
+            elif t.name == 'RIGHT_ANGLE' and not(current_line):
+                new_depth = t.count('>')
+                old_depth = 0
+
+                for n in stack[::-1]:
+                    if n.name == 'blockquote':
+                        old_depth += 1
+                    elif n.name in ('p', 'li', 'ul', 'ol'):
+                        pass
+                    else:
+                        break
+
+                current_line.depth = new_depth
+                if new_depth == old_depth:
+                    # same level, do nothing
+                    self.debug('\tsame level, do nothing')
+                    pass
+                elif new_depth > old_depth:
+                    # current_line is empty, so we just make some new nodes
+                    for i in range(new_depth - old_depth):
+                        if not stack:
+                            newp = Node('p')
+                            stack.append(newp)
+                        elif stack[-1].name not in ('p', 'li'):
+                            newp = Node('p')
+                            stack[-1].append(newp)
+                            stack.append(newp)
+                        newq = Node('blockquote')
+                        stack[-1].append(newq)
+                        stack.append(newq)
+
+                elif new_depth < old_depth:
+                    # current line is empty, so we just pop off the existing nodes
+                    for i in range(old_depth - new_depth):
+                        stack.pop() # the p
+                        stack.pop() # the blockquote
+                old_depth = new_depth
+
+            else:
+                if stack and stack[-1].name == 'blockquote':
+                    newp = Node('p')
+                    stack[-1].append(newp)
+                    stack.append(newp)
+
+                if t.name == 'URL':
+                    self._handle_url(t, current_line)
+                elif t.name == 'YOUTUBE':
+                    self._handle_youtube(t, current_line)
+                elif t.name == 'IMAGE':
+                    self._handle_image(t, current_line)
+                elif t.name == 'EMAIL':
+                    self._handle_email(t, current_line)
+                elif current_line and t.strip('\t\n\r'):
+                    self.debug('\tadding (possibly empty space) text token to current line')
+                    current_line.append(t)
+                elif t.strip():
+                    self.debug('\tadding non-empty text token to current line')
+                    current_line.append(t)
+
+        if current_line:
+            if not stack:
+                stack.append(Node('p'))
+            stack[-1].append(current_line)
+
+        while stack:
+            top = stack.pop()
+            if stack and top in stack[-1]:
+                pass
+            else:
+                finished.append(top)
+            
+        return finished
+
+
+    def _handle_email(self, email, current_line):
+        current_line.append( EmailNode(email) )
+
+
+    def _handle_url(self, anchor, current_line):
+        self.debug('handling', anchor)
+
+        if not protocol_pattern.match(anchor):
+            anchor = Token(anchor.name, 'http://' + anchor)
+
+        if self._url_check_domain and self._url_check_domain.findall(anchor):
+            self.debug('\tchecking urls for this domain', len(self._url_white_lists))
+            for w in self._url_white_lists:
+                self.debug('\t\tchecking against', str(w))
+                if w.match(anchor):
+                    self.debug('\t\tmatches the white lists')
+                    a = self._handle_link(anchor, internal=True)
+                    current_line.append(a)
+                    return
+            self.debug('\tdidn\'t match any white lists, making text')
+            current_line.append(anchor)
+        else:
+            a = self._handle_link(anchor)
+            current_line.append(a)
+
+
+    def _handle_link(self, anchor, internal=False):
+        return LinkNode(anchor, internal=internal)
+
+
+    def _handle_youtube(self, t, current_line):
+        ytn = YouTubeNode(t)
+        current_line.append(ytn)
+
+
+    def _handle_image(self, t, current_line):
+        i = ImageNode(t)
+        current_line.append(i)
+
+
+    def _create_spans(self, sub_line):
+        new_sub_line = []
+        current_span = None
+        for t in sub_line:
+            if isinstance(t, Node):
+                if current_span is not None:
+                    new_sub_line.append(current_span)
+                    current_span = None
+                new_sub_line.append(t)
+            else:
+                if current_span is None:
+                    current_span = Node('span')
+                current_span.append(t)
+        if current_span is not None:
+            new_sub_line.append(current_span)
+
+        return new_sub_line
+
+
+    def _parse_line(self, line):
+        """Parse bold and italic and other balanced items"""
+        stack = []
+        finished = []
+        
+        last_bold_idx = -1
+        last_ital_idx = -1
+
+        leading_space_pad = False
+
+        def _reduce_balanced(name, last_idx, stack):
+            n = Node(name)
+            sub_line = self._create_spans( stack[last_idx+1:] )
+
+            for i in range(last_idx, len(stack)):
+                stack.pop()
+
+            if sub_line:
+                n.extend(sub_line)
+                stack.append(n)
+
+        for i, t in enumerate(line):
+            if isinstance(t, URLNode):
+                # URL nodes can go inside balanced syntax
+                stack.append(t)
+            elif isinstance(t, Node):
+                if stack:
+                    # reduce stack, close out dangling * and _
+                    sub_line = self._create_spans(stack)
+                    finished.extend(sub_line)
+                    last_bold_idx = -1
+                    last_ital_idx = -1
+                    stack = []
+                # add node to new_line
+                finished.append(t)
+            elif isinstance(t, Token):
+                if t.name == 'UNDERSCORE':
+                    if last_ital_idx == -1:
+                        last_ital_idx = len(stack)
+                        stack.append(t)
+                    else:
+                        _reduce_balanced('i', last_ital_idx, stack)
+                        if last_ital_idx <= last_bold_idx:
+                            last_bold_idx = -1
+                        last_ital_idx = -1
+                elif t.name in ('STAR', 'ITEMSTAR'):
+                    if t.name == 'ITEMSTAR':
+                        # Because ITEMSTAR gobbles up following space, we have to space-pad the next (text) token
+                        leading_space_pad = True
+                    if last_bold_idx == -1:
+                        last_bold_idx = len(stack)
+                        stack.append(t)
+                    else:
+                        _reduce_balanced('b', last_bold_idx, stack)
+                        if last_bold_idx <= last_ital_idx:
+                            last_ital_idx = -1
+                        last_bold_idx = -1
+                else:
+                    if leading_space_pad:
+                        # Because ITEMSTAR gobbled up the following space, we have to space-pad this (text) token
+                        t = Token(t.name, ' '+t)
+                        leading_space_pad = False
+                    stack.append(t)
+            else:
+                raise TypeError(str(type(t)) + ':' + str(t))
+
+        if stack:
+            # reduce stack, close out dangling * and _
+            sub_line = self._create_spans(stack)
+            finished.extend(sub_line)
+
+        return finished
+
+
+    def _parse_block(self, block):
+        new_block = Node(block.name)
+        current_line = None
+
+        ppll = -1 # previous previous line length
+        pll  = -1 # previous line length
+
+        for i, item in enumerate(block):
+            # collapse lines together into single lines
+            if isinstance(item, Node):
+                if current_line is not None:
+                    # all these lines should be dealt with together
+                    parsed_line = self._parse_line(current_line)
+                    new_block.extend(parsed_line)
+
+                parsed_block = self._parse_block(item)
+                new_block.append(parsed_block)
+                current_line = None
+                ppll = -1
+                pll  = -1
+
+            elif isinstance(item, Line):
+                if current_line is not None:
+                    if len(item) < short_line_length:
+                        # Identify short lines
+                        if 0 < pll < short_line_length:
+                            current_line.append(Node('BR'))
+                        elif (len(block) > i+1                       and   # still items on the stack
+                              isinstance(block[i+1], Line)           and   # next item is a line
+                              0 < len(block[i+1]) < short_line_length   ): # next line is short
+                            # the next line is short and so is this one
+                            current_line.append(Node('BR'))
+                    elif 0 < pll < short_line_length and 0 < ppll < short_line_length:
+                        # long line at the end of a sequence of short lines
+                        current_line.append(Node('BR'))
+                    current_line.extend(item)
+                    ppll = pll
+                    pll = len(item)
+                else:
+                    current_line = item
+                    ppll = -1
+                    pll = len(item)
+
+        if current_line is not None:
+            parsed_line = self._parse_line(current_line)
+            new_block.extend(parsed_line)
+
+        return new_block
+
+
+    def pre_replace(self, string):
+        for r in replace_list:
+            string = r.sub(string)
+        return string
+
+
+    def parse(self, string):
+        if self.smart_quotes:
+            string = self.pre_replace(string)
+        tokens = self.tokenize(string)
+        blocks = self._find_blocks(tokens)
+        parsed_blocks = Node('div')
+        for b in blocks:
+            nb = self._parse_block(b)
+            parsed_blocks.append(nb)
+        
+        return parsed_blocks
+
+
+
+if __name__ == '__main__':
+    import sys
+    w = PottyMouth(url_check_domains=('www.mysite.com', 'mysite.com'),
+                   url_white_lists=('https?://www\.mysite\.com/allowed/url\?id=\d+',),
+                   )
+    while True:
+        print 'input (end with Ctrl-D)>>'
+        try:
+            text = sys.stdin.read()
+        except KeyboardInterrupt:
+            break
+        if text:
+            blocks = w.parse(text)
+            for b in blocks:
+                print b
+            print '=' * 70
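The scanning strategy of tokenize() above — probe each matcher in priority order at the current position, and sweep unmatched characters into TEXT tokens — can be sketched standalone. This is an illustrative Python 3 re-rendering with a three-matcher subset (the module itself is Python 2 and uses the full token_order table), not the module's own API:

```python
import re

# Illustrative subset of the token_order table; names follow the original.
MATCHERS = [
    ('NEW_LINE', re.compile(r'(\r?\n)')),
    ('EMDASH',   re.compile(r'(--)')),
    ('ELLIPSIS', re.compile(r'(\.\.\.)')),
]

def tokenize(text):
    """Scan left to right; first matcher wins; unmatched chars pool into TEXT."""
    tokens, pos, pending = [], 0, ''
    while pos < len(text):
        for name, pattern in MATCHERS:
            m = pattern.match(text, pos)   # anchored probe at pos, no slicing
            if m:
                if pending:                # flush accumulated plain text first
                    tokens.append(('TEXT', pending))
                    pending = ''
                tokens.append((name, m.group(1)))
                pos = m.end()
                break
        else:                              # no matcher fired at this position
            pending += text[pos]
            pos += 1
    if pending:
        tokens.append(('TEXT', pending))
    return tokens
```

One design note: probing with `Pattern.match(text, pos)` keeps each step O(1) in string length, whereas the original's `string[p:]` slicing copies the tail of the string on every iteration.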

File PottyMouth.rb

+#!/usr/bin/env ruby1.9
+if RUBY_VERSION < '1.9'
+  puts "Ruby 1.9 or greater required. You are using Ruby #{RUBY_VERSION}."
+  exit()
+end
+$KCODE = 'UTF-8'
+
+module PottyMouth
+
+  ShortLineLength = 50
+
+  class TokenMatcher
+    attr_reader :replace, :name
+
+    def initialize(name, pattern, replace=nil)
+      @name = name
+      @pattern = pattern
+      @replace = replace
+    end
+
+
+    def match(string)
+      @pattern.match(string)
+    end
+
+  end
+
+  ProtocolPattern = /^\w+:\/\//i
+
+  domain_pattern = '([-\w]+\.)+\w\w+'
+
+  base_uri_pattern = ('(('                                    + 
+		      '(https?|webcal|feed|ftp|news|nntp)://' + # protocol
+		      '([-\w]+(:[-\w]+)?@)?'                  + # authentication
+		      ')|www\.)'                              + # or just www.
+		      domain_pattern                          + # domain
+		      '(/[-\w$\.+!*\'(),;:@%&=?/~#]*)?'       ) # path
+
+  uri_end_punctuation = '(?<![\]\.}>\)\s,?!;:"\'])'  # end punctuation
+
+  URIPattern = Regexp.new('^(' + base_uri_pattern + uri_end_punctuation + ')', 
+			  Regexp::IGNORECASE)
+
+  EmailPattern = Regexp.new('^([^()<>@,;:"\[\]\s]+@' + domain_pattern + ')')
+
+  ImagePattern = Regexp.new('^(' + base_uri_pattern + 
+			    '\.(jpe?g|png|gif)' + uri_end_punctuation + ')', 
+			    Regexp::IGNORECASE)
+
+  # YouTubePattern matches:
+  #  http://www.youtube.com/watch?v=KKTDRqQtPO8 and
+  #  http://www.youtube.com/v/KKTDRqQtPO8       and
+  #  http://youtube.com/watch?v=KKTDRqQtPO8     and
+  #  http://youtube.com/v/KKTDRqQtPO8
+  YouTubePattern = /^(http:\/\/(?:www\.)?youtube\.com\/(?:watch\?)?v=?\/?([\w\-]{11}))/i
+
+
+  TokenList = [
+    TokenMatcher.new(:NEW_LINE    , /^(\r?\n)/), #fuck you, Microsoft!
+    TokenMatcher.new(:YOUTUBE     , YouTubePattern),
+    TokenMatcher.new(:IMAGE       , ImagePattern  ),
+    TokenMatcher.new(:URL         , URIPattern    ),
+    TokenMatcher.new(:EMAIL       , EmailPattern  ),
+
+    TokenMatcher.new(:HASH        , /^([\t ]*#[\t ]+)/     ),
+    TokenMatcher.new(:DASH        , /^([\t ]*-[\t ]+)/     ),
+    TokenMatcher.new(:NUMBERDOT   , /^([\t ]*\d+\.[\t ]+)/ ),
+    TokenMatcher.new(:ITEMSTAR    , /^([\t ]*\*[\t ]+)/    ),
+    TokenMatcher.new(:BULLET      , /^([\t ]*•[\t ]+)/     ),
+
+    TokenMatcher.new(:UNDERSCORE  , /^(_)/ ),
+    TokenMatcher.new(:STAR        , /^(\*)/),
+
+    TokenMatcher.new(:RIGHT_ANGLE , /^(>[\t ]*(?:>[\t ]*)*)/),
+
+    # The following are simple, context-independent replacement tokens
+    TokenMatcher.new(:EMDASH, /^(--)/, '—'),
+    # There is no way to reliably distinguish an en dash from a hyphen,
+    # dash, or minus, so we don't try.  See: http://www.alistapart.com/articles/emen/
+
+    TokenMatcher.new(:ELLIPSIS , /^(\.\.\.)/, '…'),
+    #TokenMatcher.new(:SMILEY  , /^(:\))/   , '?'), # smiley face, not in HTML 4.01, doesn't work in IE
+  ]
+
+
+  # The "Replacers" are context-sensitive replacements, so they must be
+  # applied in-line to the string as a whole before tokenizing.  (The
+  # alternative would be to keep track of previous context while
+  # applying tokens.)
+  class Replacer
+
+    def initialize(pattern, replace)
+      @pattern = pattern
+      @replace = replace
+    end
+
+
+    def sub(string)
+      return string.gsub!(@pattern, @replace)
+    end
+
+  end
+
+
+  ReplaceList = [
+    Replacer.new(/(``)/, '“'),
+    Replacer.new(/('')/, '”'),
+
+    # First we look for inter-word " and ' 
+    Replacer.new(/(\b"\b)/, '″'), #" double prime
+    Replacer.new(/(\b'\b)/, '’'),  #' apostrophe
+    # Then we look for opening or closing " and ' 
+    Replacer.new(/(\b"\B)/, '”'), #" close double quote
+    Replacer.new(/(\B"\b)/, '“'), #" open double quote
+    Replacer.new(/(\b'\B)/, '’'), #' close single quote
+    Replacer.new(/(\B'\b)/, '‘'), #' open single quote
+
+    # Then we look for space-padded opening or closing " and ' 
+    Replacer.new(/(")(\s)/, '”\2'), #" close double quote
+    Replacer.new(/(\s)(")/, '\1“'), #" open double quote
+    Replacer.new(/(')(\s)/, '’\2'), #' close single quote
+    Replacer.new(/(\s)(')/, '\1‘'), #' open double quote
+
+    # Then we gobble up stand-alone ones
+    Replacer.new(/(`)/, '‘'), #`
+    #Replacer.new(/(")/, '”'),
+    #Replacer.new(/(')/, '’'),
+  ]
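The context sensitivity can be seen in a minimal standalone sketch using three of the rules above (not the full ReplaceList), applied in order as pre_replace does:

```ruby
# Subset of the ReplaceList rules: inter-word apostrophe, then
# space-padded opening and closing single quotes, applied in order.
rules = [
  [/(\b'\b)/, '’'],    # apostrophe between word characters
  [/(\s)(')/, '\1‘'],  # opening quote after whitespace
  [/(')(\s)/, '’\2'],  # closing quote before whitespace
]

text = "it's a 'word' ".dup
rules.each { |pattern, replacement| text.gsub!(pattern, replacement) }
```

The same straight quote character becomes an apostrophe, an opening quote, or a closing quote depending on what surrounds it.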
+
+
+  class Token < String
+    attr_reader :name
+
+    def initialize(name, obj='')
+      super(obj)
+      @name = name
+    end
+
+
+    def to_s 
+      # For debugging
+      super() + '{' + @name.to_s + '}'
+    end
+
+
+    def +(extra)
+      Token.new(@name, super(extra))
+    end
+
+  end
+
+
+  def Token::escape(string)
+    # Work on a copy: a destructive gsub! here would mutate the caller's
+    # string, and escaping the same string twice double-escapes '&'
+    # (LinkNode#to_str escapes self[0] twice).
+    string = string.gsub('&', '&amp;')
+    string.gsub!('<', '&lt;' )
+    string.gsub!('>', '&gt;' )
+    # I prefer Python's .encode('ascii', 'xmlcharrefreplace'). 
+    # Call me traditional.
+    string.unpack("U*").collect {|s| (s > 127 ? "&##{s};" : s.chr) }.join("")
+  end
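The escaping above can be exercised in isolation. This is a standalone sketch of the same logic (hypothetical helper name; note that `"` is not escaped, mirroring the original):

```ruby
# Standalone sketch of the Token::escape logic above.
def xml_escape(string)
  s = string.gsub('&', '&amp;')  # '&' must be escaped first so &lt;/&gt; survive
  s.gsub!('<', '&lt;')
  s.gsub!('>', '&gt;')
  # Numeric character references for anything outside ASCII
  s.unpack('U*').map { |c| c > 127 ? "&##{c};" : c.chr }.join
end

escaped = xml_escape('café & <b>')
```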
+
+
+  class Line < Array
+    attr_accessor :depth
+
+    def initialize(obj=[])
+      super(obj)
+      @depth = 0
+    end
+
+
+    def to_s
+      # For debugging:
+      'Line[' + self.join('') + '(' + @depth.to_s + ')]'
+    end
+
+
+    def length
+      # Total character count of the line, not the number of tokens.
+      inject(0) do |sum, x|
+	sum += x.length
+      end
+    end
+
+  end
+
+
+
+  class Node < Array
+    attr_reader :name
+
+    def initialize(name, children=[], hashlist={})
+      super(children)
+      @name = name.downcase
+      @attributes = hashlist
+    end
+
+
+    def node_children?
+      each do |n|
+	return true if n.is_a? Node
+      end
+      return false
+    end
+
+
+    def to_str
+      if @name == :br or @name == :hr # self-closing tags
+	return "<#{@name} />"
+      else
+	content = (map {|x| x.to_str}).join("\n")
+	content.gsub!("\n", "\n  ")
+
+	if node_children?
+	  return "<#{@name}#{attribute_string()}>\n  #{content}\n</#{@name}>"
+	else
+	  return "<#{@name}#{attribute_string()}>#{content}</#{@name}>"
+	end
+      end
+    end
+
+
+    protected 
+    def attribute_string
+      x = ""  
+      if @attributes
+	@attributes.each {|k,v| x+=" #{k}=\"#{v}\""}
+      end
+      x
+    end
+
+  end
+
+
+
+  class URLNode < Node
+
+    attr_reader :internal
+
+    def initialize(content, internal=false)
+      super(:a, [content,])
+      @internal = internal
+    end
+
+  end
+
+
+
+  class LinkNode < URLNode
+
+    def to_str
+      class_attribute = @internal ? '' : ' class="external"'
+
+      displayed_url = self[0]
+      displayed_url = self[0]
+      if self[0][0...7] == 'http://'
+	displayed_url = self[0][7..-1]
+      end
+
+      return "<#{@name} href=\"#{Token.escape(self[0])}\"#{class_attribute}>#{Token.escape(displayed_url)}</#{@name}>"
+    end
+
+  end
+
+
+
+  class EmailNode < URLNode
+
+    def to_str
+      return "<#{@name} href=\"mailto:#{Token.escape(self[0])}\">#{Token.escape(self[0])}</#{@name}>"
+    end
+
+  end
+
+
+
+  class ImageNode < Node
+
+    def initialize(content='')
+      super(:img, [content,])
+    end
+
+
+    def to_str
+      return "<#{@name} src=\"#{Token.escape(self[0])}\"/>"
+    end
+
+  end
+
+
+
+  class YouTubeNode < Node
+
+    def initialize(url)
+      super(:object, [], {:width=>'425',:height=>'350'})
+
+      ytid = YouTubePattern.match(url)[2]
+      url = 'http://www.youtube.com/v/'+ytid
+
+      push(Node.new(:param, [], {:name=>'movie', :value=>url          }))
+      push(Node.new(:param, [], {:name=>'wmode', :value=>'transparent'}))
+      push(Node.new(:embed, [], {
+		      :type  =>'application/x-shockwave-flash',
+		      :wmode =>'transparent',
+		      :src   =>url,
+		      :width =>'425',
+		      :height=>'350',
+		    }))
+    end
+
+  end
+
+
+
+  class PottyMouth
+
+    def initialize(url_check_domains=[], url_white_lists=[], allow_media=false)
+      @url_check_domain = nil
+      if url_check_domains and url_check_domains.length > 0
+	@url_check_domain = Regexp.new('(\w+://)?((' + 
+				       url_check_domains.join(')|(') + 
+				       '))',
+				       Regexp::IGNORECASE)
+      end
+      @url_white_lists = url_white_lists
+      @allow_media = allow_media
+    end
+
+
+    def to_s
+      # For debugging
+      s = "allow_media=#{@allow_media};"
+      if @url_check_domain
+	s += "\nchecking: #{@url_check_domain.source}\nallowed URLs:\n\t#{(@url_white_lists.collect {|w| w.source}).join('\n\t')}"
+      else
+	s += "\nWARNING: hyperlinking ALL URLs."
+      end
+      s
+    end
+
+
+    protected 
+    def debug(*strings)
+      puts strings.join(' ')
+    end
+
+
+    def tokenize(string)
+      p = 0
+      found_tokens = []
+      unmatched_collection = ''
+      while p < string.length
+	#debug(string[p..-1])
+	found_token = false
+	for tm in TokenList
+	  m = tm.match(string[p..-1])
+	  if m and m.offset(0)[0] == 0
+	    # HACK ^^^^^^^^^^^^^ Ruby's ^ anchors at the start of every
+	    # line, not just the string, so we check the match offset to
+	    # ensure the token matched at the very beginning.
+	    found_token = true
+	    content = m[0]
+	    #debug("Found ", tm.name, " at ", p, ":", content, " against:", string[p..-1])
+	    p += content.length
+
+	    if tm.replace != nil 
+	      unmatched_collection += tm.replace
+	      break
+	    end
+
+	    if unmatched_collection.length > 0
+	      # BUG what if this isn't unicode? the python version decodes from UTF-8
+	      found_tokens.push(Token.new(:TEXT, unmatched_collection)) 
+	      #debug("adding token " + found_tokens[-1].to_s)
+	    end
+
+	    unmatched_collection = ''
+
+	    if tm.name == :NEW_LINE
+	      if found_tokens.length > 0 and found_tokens[-1].name == :TEXT
+		found_tokens[-1] += ' '
+	      end
+	      content=' '
+	    end
+
+	    found_tokens.push(Token.new(tm.name, content))
+	    #debug("adding token " + found_tokens[-1].to_s)
+	    break
+	  end
+	end
+
+	if not found_token
+	  # Pull one character off the string and continue looking for tokens
+	  unmatched_collection += string[p]
+	  #debug(unmatched_collection)
+	  p += 1
+	end
+      end
+
+      if unmatched_collection.length > 0
+	found_tokens.push(Token.new(:TEXT, unmatched_collection))
+      end
+
+      #debug(found_tokens)
+      return found_tokens
+    end
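The anchor workaround in tokenize above can be seen in isolation (a standalone sketch): in Ruby, `^` matches at the start of every line, so a `^`-anchored pattern can succeed mid-string unless the match offset is checked (or `\A` is used instead):

```ruby
# /^/ anchors at any line start, so this matches at offset 5, not 0
m = /^(--)/.match("text\n--")
offset = m.offset(0)[0]

# \A anchors at the start of the whole string, so it does not match here
no_match = /\A(--)/.match("text\n--")
```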
+
+
+    def find_blocks(tokens)
+      finished = []
+
+      current_line = Line.new()
+
+      stack = []
+
+      old_depth = 0
+
+      for t in tokens
+	#debug(t)
+
+	if t.name == :NEW_LINE
+	  if current_line.length > 0
+	    #debug("current_line.depth ", current_line.depth, "; old_depth ", old_depth)
+	    if current_line.depth == 0 and old_depth != 0
+	      # figure out whether we're closing >> or * here and collapse the stack accordingly
+	      #debug('need to collapse the stack by' + old_depth.to_s)
+	      top = nil
+	      for i in 0...old_depth
+		if stack.length > 0 and stack[-1].name == :p
+		  top = stack.pop()
+		end
+		if stack.length > 0 and stack[-1].name == :blockquote
+		  top = stack.pop()
+		end
+	      end
+
+	      if stack.length == 0
+		if top != nil
+		  finished.push(top)
+		end
+		stack.push(Node.new(:p))
+		#debug("added ", stack[-1].to_str)
+	      end
+	      #debug('closing out the stack')
+	      old_depth = 0
+	    end
+	    #debug('appending line to top of stack')
+	    if stack.length == 0
+	      stack.push(Node.new(:p))
+	      #debug("added ", stack[-1].to_str)
+	    end
+	    stack[-1].push( current_line )
+	    current_line = Line.new()
+
+	  elsif stack.length > 0
+	    if [:p,:li].index(stack[-1].name)
+	      top = stack.pop() # the p or li
+	      #debug('\tpopped off because saw a blank line')
+
+	      while stack.length > 0
+		if [:blockquote,:ul,:ol,:li].index(stack[-1].name)
+		  top = stack.pop()
+		else
+		  break
+		end
+	      end
+	      if stack.length == 0
+		finished.push(top)
+	      end
+	    end
+	  end
+	elsif [:HASH,:NUMBERDOT,:ITEMSTAR,:BULLET,:DASH].index(t.name) and current_line.length == 0
+	  if stack.length > 0 and stack[-1].name == :p
+	    top = stack.pop()
+	    if current_line.depth < old_depth
+	      # pop off <blockquote> and <li> or <p> so we can append the new <li> in the right node
+	      for i in 0...(old_depth - current_line.depth)
+		top = stack.pop() # the <blockquote>
+		top = stack.pop() # the previous <li> or <p>
+	      end
+	    end
+	    if stack.length == 0
+	      finished.push(top)
+	    end
+	  end
+
+	  if stack.length > 0 and stack[-1].name == :li
+	    stack.pop() # the previous li
+	  elsif stack.length > 0 and [:ul,:ol].index(stack[-1].name)
+	    # do nothing
+	  else
+	    if [:HASH,:NUMBERDOT].index(t.name)
+	      newl = Node.new(:ol)
+	    elsif [:ITEMSTAR,:BULLET,:DASH].index(t.name)
+	      newl = Node.new(:ul)
+	    end
+	    if stack.length > 0
+	      stack[-1].push(newl)
+	    end
+	    stack.push(newl)
+	    #debug("added ", stack[-1].to_str)
+	  end
+
+	  newli = Node.new(:li)
+	  stack[-1].push(newli)
+	  stack.push(newli)
+	  #debug("added ", stack[-1].to_str)
+
+	elsif t.name == :RIGHT_ANGLE and current_line.length == 0
+	  new_depth = t.count('>')
+	  old_depth = 0
+
+	  for n in stack.reverse
+	    if n.name == :blockquote
+	      old_depth += 1
+	    elsif [:p,:li,:ul,:ol].index(n.name)
+	      # do nothing
+	    else
+	      break
+	    end
+	  end
+	  #debug("nd:", new_depth, "; od:", old_depth)
+	  current_line.depth = new_depth
+	  if new_depth == old_depth
+	    # same level, do nothing
+	    #debug('\tsame level, do nothing')
+
+	  elsif new_depth > old_depth
+	    # current_line is empty, so we just make some new nodes
+	    for i in 0...new_depth - old_depth
+	      if stack.length == 0
+		newp = Node.new(:p)
+		stack.push(newp)
+		#debug("added ", stack[-1].to_str)
+	      elsif not [:p,:li].index(stack[-1].name)
+		newp = Node.new(:p)
+		stack[-1].push(newp)
+		stack.push(newp)
+		#debug("added ", stack[-1].to_str)
+	      end
+	      newq = Node.new(:blockquote)
+	      stack[-1].push(newq)
+	      stack.push(newq)
+	      #debug("added ", stack[-1].to_str)
+	    end
+
+	  elsif new_depth < old_depth
+	    # current line is empty, so we just pop off the existing nodes
+	    for i in 0...old_depth - new_depth
+	      stack.pop() # the p
+	      stack.pop() # the blockquote
+	    end
+	  end
+	  old_depth = new_depth
+
+	else
+	  if stack.length > 0 and stack[-1].name == :blockquote
+	    newp = Node.new(:p)
+	    stack[-1].push(newp)
+	    stack.push(newp)
+	    #debug("added ", stack[-1].to_str)
+	  end
+
+	  if t.name == :URL
+	    handle_url(t, current_line)
+	  elsif t.name == :YOUTUBE
+	    if @allow_media
+	      handle_youtube(t, current_line)
+	    else
+	      handle_url(t, current_line)
+	    end
+	  elsif t.name == :IMAGE
+	    if @allow_media
+	      handle_image(t, current_line)
+	    else
+	      handle_url(t, current_line)
+	    end
+	  elsif t.name == :EMAIL
+	    handle_email(t, current_line)
+	  elsif current_line.length > 0 and t.length > 0
+	    #debug('\tadding (possibly whitespace-only) text token to current line')
+	    current_line.push(t)
+	  elsif t.strip().length > 0
+	    #debug('\tadding non-empty text token to current line')
+	    current_line.push(t)
+	  end
+	end
+      end
+
+      if current_line.length > 0
+	if stack.length == 0
+	  stack.push(Node.new(:p))
+	  #debug("added ", stack[-1].to_str)
+	end
+	stack[-1].push(current_line)
+      end
+
+      while stack.length > 0
+	top = stack.pop()
+	if stack.length > 0 and stack[-1].index(top)
+	  # skip
+	else
+	  finished.push(top)
+	end
+      end
+      
+      return finished
+    end
+
+
+    def handle_email(email, current_line)
+      current_line.push(EmailNode.new(email))
+    end
+
+
+    def handle_url(anchor, current_line)
+      if not ProtocolPattern.match(anchor)
+	anchor = Token.new(anchor.name, 'http://' + anchor)
+      end
+
+      if @url_check_domain and @url_check_domain.match(anchor)
+	#debug(anchor + " in check domains")
+	for w in @url_white_lists
+	  #debug("checking " + anchor + " against " + w.source)
+	  if w.match(anchor)
+	    a = handle_link(anchor, true) # internal link
+	    current_line.push(a)
+	    return
+	  end
+	end
+	#debug(anchor + " did not match any URL whitelists")
+	current_line.push(anchor)
+      else
+	a = handle_link(anchor, false) # external link
+	current_line.push(a)
+      end
+    end
+
+
+    def handle_link(anchor, internal=false)
+      return LinkNode.new(anchor, internal)
+    end
+
+
+    def handle_youtube(youtube, current_line)
+      ytn = YouTubeNode.new(youtube)
+      current_line.push(ytn)
+    end
+
+
+    def handle_image(image, current_line)
+      i = ImageNode.new(image)
+      current_line.push(i)
+    end
+
+
+    def create_spans(sub_line)
+      new_sub_line = []
+      current_span = nil
+      sub_line.each do |t|
+	if t.is_a? Node
+	  if current_span != nil
+	    new_sub_line.push(current_span)
+	    current_span = nil
+	  end
+	  new_sub_line.push(t)
+	else
+	  if current_span == nil
+	    current_span = Node.new(:span)
+	  end
+	  et = Token.escape(t)
+	  current_span.push(et)
+	end
+      end
+      if current_span != nil
+	new_sub_line.push(current_span)
+      end
+
+      return new_sub_line
+    end
+
+
+    def reduce_balanced(name, last_idx, stack)
+      n = Node.new(name)
+      sub_line = create_spans(stack[last_idx+1..-1])
+
+      (last_idx...stack.length).each do |i|
+	stack.pop()
+      end
+
+      if sub_line.length > 0
+	n.push(*sub_line)
+	stack.push(n)
+      end
+    end
+
+
+    def parse_line(line)
+      # Parse bold and italic and other balanced items
+      stack = []
+      finished = []
+      
+      last_bold_idx = -1
+      last_ital_idx = -1
+
+      leading_space_pad = false
+
+      (0...line.length).each do |i|
+	t = line[i]
+	if t.is_a? URLNode
+	  # URL nodes can go inside balanced syntax
+	  #debug("pushing", t.to_s, "onto", stack.length)
+	  stack.push(t)
+	elsif t.is_a? Node
+	  if stack.length > 0
+	    # reduce stack, close out dangling * and _
+	    sub_line = create_spans(stack)
+	    finished += sub_line
+	    last_bold_idx = -1
+	    last_ital_idx = -1
+	    stack = []
+	  end
+	  # add node to new_line
+	  #debug("appending ", t.to_s, " to ", finished.to_s)
+	  finished.push(t)
+	elsif t.is_a? Token
+	  if t.name == :UNDERSCORE
+	    if last_ital_idx == -1
+	      last_ital_idx = stack.length
+	      #debug("pushing", t.to_s, "onto", stack.to_s)
+	      stack.push(t)
+	    else
+	      reduce_balanced(:i, last_ital_idx, stack)
+	      if last_ital_idx <= last_bold_idx
+		last_bold_idx = -1
+	      end
+	      last_ital_idx = -1
+	    end
+	  elsif [:STAR, :ITEMSTAR].index(t.name)
+	    if t.name == :ITEMSTAR
+	      # Because ITEMSTAR gobbles up following space, we have to space-pad the next (text) token
+	      leading_space_pad = true
+	    end
+	    if last_bold_idx == -1
+	      last_bold_idx = stack.length
+	      #debug("pushing", t.to_s, "onto", stack.to_s)
+	      stack.push(t)
+	    else
+	      reduce_balanced(:b, last_bold_idx, stack)
+	      if last_bold_idx <= last_ital_idx
+		last_ital_idx = -1
+	      end
+	      last_bold_idx = -1
+	    end
+	  else
+	    if leading_space_pad
+	      # Because ITEMSTAR gobbled up the following space, we have to space-pad this (text) token
+	      t = Token.new(t.name, ' '+t)
+	      leading_space_pad = false
+	    end
+	    #debug("pushing", t.to_s, "onto", stack.to_s)
+	    stack.push(t)
+	  end
+	elsif t == nil
+	  # Line#length counts characters, so the loop index can run past
+	  # the last element; skip the resulting nils
+	else
+	  raise "Unknown object in Line: " + t.class.to_s + ':' + t.to_s
+	end
+      end
+
+      if stack.length > 0
+	# reduce stack, close out dangling * and _
+	#debug("stack: ", stack.length)
+	sub_line = create_spans(stack)
+	#debug("sub_line: ", sub_line.length)
+	finished += sub_line
+      end
+      #debug(finished.collect {|x| x.name })
+      return finished
+    end
+
+
+    def parse_block(block)
+      new_block = Node.new(block.name)
+      current_line = nil
+
+      ppll = -1 # previous previous line length
+      pll  = -1 # previous line length
+
+      (0...block.length).each do |i|
+	item = block[i]
+	# collapse lines together into single lines
+	if item.is_a? Node
+	  if current_line != nil
+	    # all these lines should be dealt with together
+	    parsed_line = parse_line(current_line)
+	    new_block.push(*parsed_line)
+	  end
+
+	  parsed_block = parse_block(item)
+	  new_block.push(parsed_block)
+	  current_line = nil
+	  ppll = -1
+	  pll  = -1
+
+	elsif item.is_a? Line
+	  if current_line != nil
+	    if item.length < ShortLineLength
+	      # Identify short lines
+	      if 0 < pll and pll < ShortLineLength
+		current_line.push(Node.new(:br))
+	      elsif (block.length > i+1                  and   # still items on the stack
+		     block[i+1].is_a? Line               and   # next item is a line
+		     0 < block[i+1].length and block[i+1].length < ShortLineLength) # next line is short
+		# the next line is short and so is this one
+		current_line.push(Node.new(:br))
+	      end
+	    elsif 0 < pll and pll < ShortLineLength and 0 < ppll and ppll < ShortLineLength
+	      # long line at the end of a sequence of short lines
+	      current_line.push(Node.new(:br))
+	    end
+	    current_line.push(*item)
+	    ppll = pll
+	    pll = item.length
+	  else
+	    current_line = item
+	    ppll = -1
+	    pll = item.length
+	  end
+	end
+      end
+
+      if current_line != nil
+	parsed_line = parse_line(current_line)
+	new_block.push(*parsed_line)
+      end
+
+      return new_block
+    end
+
+
+    def pre_replace(string)
+      ReplaceList.each do |r|
+	r.sub(string)
+      end
+    end
+
+
+    public
+    def parse(string)
+      pre_replace(string)
+      tokens = tokenize(string)
+      blocks = find_blocks(tokens)
+      parsed_blocks = Node.new(:div)
+      blocks.each do |b|
+	nb = parse_block(b)
+	parsed_blocks.push(nb)
+      end
+      return parsed_blocks
+    end
+
+  end
+
+end
+
+if __FILE__ == $0
+  pm = PottyMouth.new(["www.mysite.com", "mysite.com"],                       # url_check_domains
+		      [/https?:\/\/www\.mysite\.com\/allowed\/url\?id=\d+/,], # url_white_lists
+		      true)                                                   # allow_media
+
+  puts pm.to_s
+  while true
+    puts "input (end with Ctrl-D)>>"
+    text = $stdin.read
+    break if text.nil? or text.empty? # $stdin.read returns '' after EOF
+
+    blocks = pm.parse(text)
+    puts blocks.to_str
+    puts '=' * 70
+  end
+end

File build_deb.sh

+#!/bin/sh
+
+rm ../python-pottymouth_* -f
+debuild -uc -us

File build_gem.sh

+#!/bin/sh
+
+gem build pottymouth.gemspec

File debian/changelog

+python-pottymouth (1.1.4-0) jaunty; urgency=low
+
+  * output UTF-8, not ASCII, xHTML
+
+ -- matt <matt@mosuki.com>  Fri, 27 Feb 2009 22:56:26 -0800
+
+python-pottymouth (1.1.3-0) jaunty; urgency=low
+
+  * don't create span nodes unless they are required.
+
+ -- matt <matt@mosuki.com>  Sat, 14 Feb 2009 12:30:28 -0800
+
+python-pottymouth (1.1.2-0) intrepid; urgency=low
+
+  * do all escaping and encoding at __str__ time; internal objects store pure unicode
+
+ -- matt <matt@mosuki.com>  Wed, 24 Sep 2008 00:09:03 -0700
+
+python-pottymouth (1.1.1-0) intrepid; urgency=low
+
+  * restructuring to allow easier iteration over returned Node objects
+  * __repr__ no longer overridden to return Unicode objects
+
+ -- matt <matt@mosuki.com>  Mon, 22 Sep 2008 23:15:33 -0700
+
+python-pottymouth (1.1.0-0) gutsy; urgency=low
+
+  * added syntax configuration
+
+ -- matt <matt@mosuki.com>  Fri, 11 Apr 2008 17:29:44 -0700
+
+python-pottymouth (1.0.2-0) gutsy; urgency=low
+
+  * some minor structural changes inspired by the port to Ruby
+
+ -- matt <matt@mosuki.com>  Sun, 06 Apr 2008 11:10:16 -0700
+
+python-pottymouth (1.0.1-0) gutsy; urgency=low
+
+  * fixing security hole where bare www.* URLs were treated as external and not checked against whitelist
+
+ -- matt <matt@mosuki.com>  Thu, 14 Feb 2008 13:38:59 -0800
+
+python-pottymouth (1.0.0-0) feisty; urgency=low
+
+  * Added support for literal lists
+  * This is now ready for 1.0
+
+ -- matt <matt@mosuki.com>  Tue, 30 Oct 2007 13:49:10 -0700
+
+python-pottymouth (0.9.11-0) feisty; urgency=low
+
+  * make image and YouTube into first-class tokens
+  * image pattern is now usable by external code
+
+ -- matt <matt@mosuki.com>  Tue, 23 Oct 2007 17:18:09 -0700
+
+python-pottymouth (0.9.10) feisty; urgency=low
+
+  * fix subtle missing space between two adjacent non-text inline tokens
+  * replace --, ..., and :) with context-insensitive Token() instead of 
+    context-sensitive Replacer
+  * YouTube URLs can have - in them
+  * better interactive prompt
+  * removed legacy code that was supposed to be removed in 0.9.8
+  * <hr> comment
+  * fixed some code formatting
+
+ -- matt <matt@mosuki.com>  Fri,  7 Sep 2007 16:55:11 -0700
+
+python-pottymouth (0.9.9-0) feisty; urgency=low
+
+  * identify URLs that begin with just www. , put http:// on the front,
+    and treat them like ordinary URLs
+  * url detection is now case-insensitive
+
+ -- matt <matt@mosuki.com>  Tue, 14 Aug 2007 19:21:32 -0700
+
+python-pottymouth (0.9.8-0.0) unstable; urgency=low
+
+  * two extremely minor bug fixes to email and youtube embedding
+
+ -- matt <matt@mosuki.com>  Tue, 15 May 2007 18:06:44 -0700
+
+python-pottymouth (0.9.7-0.1) unstable; urgency=low
+
+  * Non-maintainer upload.
+  * Built against latest head: b8de
+
+ -- Jeremy Avnet <brainsik@devsuki.com>  Mon, 14 May 2007 19:15:03 -0700
+
+python-pottymouth (0.9.7-0) unstable; urgency=low
+
+  * Fixed buggy output of <br/> tags
+  * Fixed subtle * and _ nesting bugs 
+  * Removed ~devsuki1 from versions
+  * Added readme.html and demo web script 
+  * Handle backticks, doubled backticks and doubled single quotes
+
+ -- matt <matt@mosuki.com>  Mon, 14 May 2007 18:03:37 -0700
+
+python-pottymouth (0.9.6-0) unstable; urgency=low
+
+  * Added smart identification of quotes, ellipsis, and emdash
+  * Fixed bug with hyperlinks inside bold/italic
+
+ -- matt <matt@mosuki.com>  Sat, 12 May 2007 12:37:19 -0700
+
+python-pottymouth (0.9.5-0.1) unstable; urgency=low
+
+  * Set pyversions to "2.4-" to allow installing with Python > v2.4.
+
+ -- Jeremy Avnet <brainsik@devsuki.com>  Tue, 10 Apr 2007 17:56:23 -0700
+
+python-pottymouth (0.9.5-0) unstable; urgency=low
+
+  * allow YouTube video embedding
+
+ -- matt <matt@mosuki.com>  Tue,  3 Apr 2007 12:22:03 -0700
+
+python-pottymouth (0.9.4-0) unstable; urgency=low
+
+  * bug fix for ImageNode object;  allow jpEg images too
+
+ -- matt <matt@mosuki.com>  Fri, 16 Mar 2007 16:38:37 -0700
+
+python-pottymouth (0.9.3-0) unstable; urgency=low
+
+  * optionally generate <img> tags for png, jpg, gif URLs
+
+ -- matt <matt@mosuki.com>  Fri, 16 Mar 2007 16:03:39 -0700
+
+python-pottymouth (0.9.2-0) unstable; urgency=low
+
+  * generate all HTML nodes as lowercase to be compatible with XHTML
+
+ -- matt <matt@mosuki.com>  Fri, 16 Mar 2007 15:24:45 -0700
+
+python-pottymouth (0.9.1-0) unstable; urgency=low
+
+  * Token.escape is a staticmethod for easier escaping
+
+ -- Matt Chisholm <matt@mosuki.com>  Tue, 20 Feb 2007 15:27:32 -0800
+
+python-pottymouth (0.9-0) unstable; urgency=low
+
+  * Initial release.
+
+ -- Matt Chisholm <matt@mosuki.com>  Thu, 15 Feb 2007 15:26:44 -0800

File debian/control

+Source: python-pottymouth
+Section: python
+Priority: optional
+Maintainer: Matt Chisholm <matt@mosuki.com>
+Uploaders: Matt Chisholm <matt@mosuki.com>
+Build-Depends: cdbs (>= 0.4.46), python-dev (>= 2.4.3-11), debhelper (>= 5.0.37.3), python-support (>= 0.5.3)
+Standards-Version: 3.7.2.0
+
+
+Package: python-pottymouth
+Architecture: all
+Depends: ${python:Depends}, ${misc:Depends}, ${shlibs:Depends}
+Provides: ${python:Provides}
+Description: transform unstructured, untrusted text to safe, valid XHTML
+ PottyMouth transforms completely unstructured and untrusted 
+ text to valid, nice-looking, completely safe XHTML.
+ .
+ PottyMouth is designed to handle input text from non-technical,
+ potentially careless or malicious users. It produces HTML that is
+ completely safe, programmatically and visually, to include on any web
+ page. And you don't need to make your users read any instructions
+ before they start typing. They don't even need to know that
+ PottyMouth is being used.

File debian/copyright

+Copyright (c) 2007, Matt Chisholm
+
+You are free to distribute this software under the terms of the BSD License.
+On Debian systems, the complete text of the BSD License can be found in
+/usr/share/common-licenses/BSD.
+

File debian/pyversions

+2.4-

File debian/rules

+#!/usr/bin/make -f
+# -*- makefile -*-
+# Debian rules file for python-pottymouth
+# Uncomment this to turn on verbose mode.
+#export DH_VERBOSE=1
+
+# Python
+DEB_PYTHON_SYSTEM=pysupport
+
+################################################################################
+# CDBS File Inclusions and Variable Declarations
+################################################################################
+include /usr/share/cdbs/1/rules/debhelper.mk
+include /usr/share/cdbs/1/rules/simple-patchsys.mk
+include /usr/share/cdbs/1/class/python-distutils.mk
+
+# Careful, fails to recognize epochs
+UPSTREAM_VERSION=$(shell dpkg-parsechangelog|sed -n -e 's/^Version: \(.*\)-.*/\1/ p')
+
+# Pure python library for site-python directory
+#DEB_PYTHON_INSTALL_ARGS_ALL = --no-compile -O0 --install-purelib=/usr/lib/site-python
+
+clean::
+	-find -name \*.pyc -exec rm \{\} \;
+# To bump the version: DEBEMAIL='matt@mosuki.com' dch -v 1.1.3-0

File pottymouth.gemspec

+Gem::Specification.new do |spec|
+  spec.name = "PottyMouth"
+  spec.version = "1.0.2.1"
+  spec.homepage = "http://devsuki.com/pottymouth"
+  spec.author = "Matt Chisholm"
+  spec.email = "matt@mosuki.com"
+  spec.summary = "transform unstructured, untrusted text to safe, valid XHTML"
+  spec.description = <<-EOF
+PottyMouth transforms completely unstructured and untrusted text to valid, nice-looking, completely safe XHTML.
+
+PottyMouth is designed to handle input text from non-technical, potentially careless or malicious users. It produces HTML that is completely safe, programmatically and visually, to include on any web page. And you don't need to make your users read any instructions before they start typing. They don't even need to know that PottyMouth is being used.
+EOF
+
+  spec.files = ["PottyMouth.rb",]
+  spec.test_file = "test.rb"
+  spec.extra_rdoc_files = ['readme.html', 'LICENSE.txt']
+  spec.required_ruby_version = '>= 1.9.0' 
+end

File pypi-index.html

+<?xml version="1.0" encoding="utf-8"?>
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
+<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
+	<head>
+	<meta http-equiv="refresh" content="0;url=http://devsuki.com/pottymouth" />
+	</head>
+	<body>
+		<a href="http://devsuki.com/pottymouth">PottyMouth documentation can be found at http://devsuki.com/pottymouth</a>
+	</body>
+</html>
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
+<html>
+  <head>
+    <title>PottyMouth</title>
+  </head>
+
+  <body>
+    <style type="text/css" media="screen">
+      body {
+        background-color:#e0e0ef;
+        font-family:Verdana, 'Bitstream Vera Sans', sans-serif;
+        color:#444;
+      }
+      h1, h2, h3, h4 { 
+        color:#000; 
+      }
+      a { color:#113; }
+      a:visited { color:#224; }
+      a:hover   { color:#446; }
+      a:active  { color:#557; }
+      blockquote {
+        border-left-color:#303000;
+      }
+      code, pre, #potty_input{
+        color:#202020;
+        background-color:#eee;
+      }
+      div.head {
+        color:#000;
+      }
+      .potty, #potty_output{
+        color:#303000;
+        background-color:#efefe0;
+      }
+      div.main {
+        background-color:#fff;
+        width:expression('40em'); /* IE doesn't support max/min width */
+        min-width:30em;
+        max-width:50em;
+        padding:1ex 2ex;
+      }
+    </style>
+    <style type="text/css" media="print">
+      body { font-family:Palatino, Times, serif; }
+      a { color:#000000; text-decoration:none; }
+      div.nav { display:none; }
+    </style>
+    <style type="text/css" media="all">
+      body {
+        text-align:center;
+        margin:0px;
+        line-height:1.5em;
+      }
+      h1, h2, h3, h4 {
+        font-variant:small-caps;
+      }
+      code, pre{
+        font-size:large;
+        line-height:1.25em;
+      }
+      code, span.potty{
+        padding:1px;
+      }
+      pre, div.potty > p{
+        padding:4px;
+      }
+      div.potty, pre, #potty_input, #potty_output {
+        width:85%;
+      }
+      ul li{
+        list-style-type:disc;
+      }
+      li { 
+        line-height:1.25em;
+        margin-bottom:0.25em;
+      }
+      blockquote {
+        border-left-width:2px;
+        border-left-style:solid;
+        margin-left:1.5em;
+        padding-left:0.5em;
+      }
+      div.head {
+        text-align:center;
+        font-size:medium;
+      }
+      div.main {
+        text-align:left;
+        margin:auto;
+      }
+      .potty {
+        font-family:serif;
+      }
+      acronym {
+        font-variant:small-caps;
+      }
+    </style>
+    <style>
+      ul.nav {
+        float:right;
+        margin:0;
+      }
+      ul.nav li{
+	list-style-type:none;
+      }
+      ul.nav li a{
+        background:#e0e0ef;
+        padding:2px 4px;
+        margin:4px 0px;
+        display:block;
+        font-size:small;
+      }
+      ul.nav li a:hover {
+        background:#f0f0ff;
+      }
+    </style>
+    <div class="main">
+
+      <ul class="nav">
+	<li><a href="#">introduction</a></li>
+        <!--
+	<li><a href="#do"></a></li>
+	<li><a href="#for"></a></li>
+	<li><a href="#notfor"></a></li>
+	<li><a href="#unstructured"></a></li>
+	<li><a href="#safeHTML"></a></li>
+	<li><a href="#untrusted"></a></li>
+	<li><a href="#secureHTML"></a></li>
+	<li><a href="#prevent"></a></li>
+	-->
+	<li><a href="#syntax">syntax</a></li>
+	<!--
+	<li><a href="#lines"></a></li>
+	<li><a href="#lists"></a></li>
+	<li><a href="#quotes"></a></li>
+	<li><a href="#links"></a></li>
+	<li><a href="#media"></a></li>
+	<li><a href="#bold"></a></li>
+	<li><a href="#italic"></a></li>
+	<li><a href="#characters"></a></li>
+	-->
+	<li><a href="#usage">usage</a></li>
+	<li><a href="#download">download</a></li>
+	<li><a href="#demo">demonstration</a></li>
+      </ul>
+	
+      <h1>PottyMouth</h1>
+      
+      <p style="font-size:small;line-height:1em;margin-top:0;">&copy; 2007-2009 <a href="http://glyphobet.net/">Matt Chisholm</a>
+         <br/>
+         <tt>matt dash pottymouth at <a href="http://mosuki.com">mosuki dot com</a></tt></p>
+
+      <h3 id="do">What does it <i>do</i>?</h3>
+
+      <p>PottyMouth transforms completely unstructured and untrusted text to valid, nice-looking, completely safe XHTML.</p>
+
+      <p>PottyMouth is designed to handle input text from non-technical, potentially careless or malicious users. It produces HTML that is completely safe, programmatically and visually, to include on any web page.  And you don&#8217;t need to make your users read any instructions before they start typing.  They don&#8217;t even need to know that PottyMouth is being used.</p>
+
+
+      <h3 id="for">What is it <i>for</i>?</h3>
+
+      <p>PottyMouth is ideal for displaying blog comments, text email bodies in a web mail application or mailing list web archive, or any text fields on any site with user input text, such as a social networking, dating, or community site. In short, it suits any text that is entered by a non-technical and/or untrusted user and displayed as HTML. It has been in use on <a href="http://mosuki.com">mosuki.com</a> since January 2007, and on <a href="http://spydentify.com">spydentify.com</a> since January 2008.</p>
+
+
+      <h3 id="notfor">What is it <i>not</i> for?</h3>
+
+      <p>PottyMouth is not intended for HTML page generation, such as writing blog entries, where the author is an authorized and trusted user who may want to exert more control over the content of his or her post. <a href="http://daringfireball.net/projects/markdown/">Markdown</a> and <a href="http://daringfireball.net/projects/smartypants/">SmartyPants</a>, or <a href="http://www.textism.com/tools/textile/">Textism</a> are good solutions for trusted HTML authoring.</p>
+
+      <p>PottyMouth is also not intended for wikis, where the text is more heavily structured and where poorly formatted or malicious input can be quickly corrected by another user. There are <a href="http://en.wikipedia.org/wiki/Comparison_of_wiki_software">many</a> <a href="http://www.mediawiki.org/wiki/MediaWiki">good</a> <a href="http://moinmoin.wikiwikiweb.de/">wiki</a> <a href="http://freshmeat.net/search/?q=wiki&amp;section=projects">packages</a> out there; this is not one of them.</p>
+
+      <h3 id="care">Why should I care about&#8230;?</h3>
+
+      <h4 id="unstructured">&#8230;unstructured text input?</h4>
+
+      <p>The average, non-technical user doesn&#8217;t care about formatting syntax and won&#8217;t take the time to learn it.  PottyMouth lets your website display any user input without having to make your users learn <b>anything.</b> The only &#8220;syntax&#8221; that PottyMouth relies on consists of conventions that are ubiquitous on-line. If your site displays text input from external programs, third-party sites, or other sources like email, you can&#8217;t rely on your users to know about your site&#8217;s text formatting conventions.</p>
+
+
+      <h4 id="safeHTML">&#8230;layout-safe HTML?</h4>
+
+      <p>You want to allow your users the freedom to put whatever they want on your site.  But you don&#8217;t want badly formatted text to make that text look ugly, or to screw up the layout of other elements on the page.</p>
+
+
+      <h4 id="untrusted">&#8230;untrusted text input?</h4>
+
+      <p>If it&#8217;s possible for an untrusted or anonymous user to input text that gets inserted in HTML on your site, you need to process that text to make sure it cannot cause problems for other visitors. If your site displays text input from external programs, third-party sites, or other sources like email, you can&#8217;t control or check that text until you are displaying it.</p>
+
+
+      <h4 id="secureHTML">&#8230;secure HTML?</h4>
+
+      <p>Allowing anyone to insert raw or even limited HTML into your site is dangerous. An attacker who can insert JavaScript, media, or malicious links can make a visitor&#8217;s browser perform malicious actions or send spam, on your site or on third-party sites, and can insert DHTML id attributes or JavaScript that break your DHTML/JavaScript application. An attacker who can insert CSS can hide or override advertisements, warnings, or instructions with his or her own content.</p>
+
+
+      <h3 id="prevent">What does it prevent?</h3>
+
+      <p>PottyMouth protects against a wide range of potential problems:</p>
+      
+      <ul>
+	<li>no JavaScript or HTML insertion via <code>&lt;iframe&gt;</code> tags</li>
+	<li>no JavaScript insertion via <code>&lt;script&gt;</code> tags</li>
+	<li>no JavaScript insertion via event handler attributes on tags</li>
+	<li>no JavaScript insertion via <code>javascript:</code> hyperlinks</li>
+	<li>no JavaScript insertion via CSS <code>expression()</code></li>
+	<li>no overriding of site CSS via <code>&lt;style&gt;</code> tags</li>
+	<li>no attacks via malicious <code>href</code> attributes in <code>&lt;a&gt;</code> or <code>src</code> attributes in <code>&lt;img&gt;</code>, <code>&lt;embed&gt;</code> or other media tags</li> 
+	<li>no damage to site layout via inserted CSS or <code>width</code>, <code>height</code>, or other HTML attributes</li>
+	<li>no ability to break or compromise JavaScript applications by generating HTML tags with identifiers that collide with existing DOM identifiers</li>
+      </ul>
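+      <p>The <code>href</code> and <code>src</code> checks above amount to a scheme white-list. The following is a minimal, self-contained sketch of that idea in Python; it is not PottyMouth&#8217;s actual code, and the function name and white-list contents are illustrative assumptions:</p>

```python
from urllib.parse import urlparse

# Schemes considered safe for href/src attributes; "" covers relative URLs.
# This white-list is an illustrative assumption, not PottyMouth's own list.
ALLOWED_SCHEMES = {"http", "https", "ftp", "mailto", ""}

def is_safe_url(url: str) -> bool:
    # Rejecting any scheme not on the white-list blocks javascript:,
    # data:, vbscript:, and similar script-injection vectors, regardless
    # of the attacker's capitalization (urlparse normalizes the scheme).
    scheme = urlparse(url.strip()).scheme.lower()
    return scheme in ALLOWED_SCHEMES

print(is_safe_url("http://example.com/"))   # a normal link passes
print(is_safe_url("javascript:alert(1)"))   # a script URL is rejected
```

A white-list is preferable to a black-list here: any scheme the filter has never heard of is rejected by default instead of slipping through.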
+
+      <p>Although the problems above could be avoided by allowing only a short white-list of HTML tags and no HTML attributes whatsoever, inserting raw HTML tags is a feature that non-technical users don&#8217;t need. And PottyMouth automatically detects most of the instances where the average user would want HTML tags.</p>
+
+      <ul class="nav">
+	<li><a href="#">introduction</a></li>
+	<li><a href="#syntax">syntax</a></li>
+	<li><a href="#usage">usage</a></li>
+	<li><a href="#download">download</a></li>
+	<li><a href="#demo">demonstration</a></li>
+      </ul>
+
+
+      <h2 id="syntax">PottyMouth syntax</h2>
+
+      <p>Although PottyMouth has no syntax that users must learn, it does parse input text to transform it to HTML.  It relies on some ubiquitous text formatting conventions to do the best formatting job possible.</p> 
+      
+
+      <h4 id="lines">Paragraphs, newlines, and ad-hoc lists</h4>
+      
+      <p>PottyMouth intelligently identifies paragraph breaks, newlines, and ad-hoc lists.  A sequence of more than one blank line is turned into a paragraph break.  Within a single paragraph, PottyMouth distinguishes between &#8220;short&#8221; and &#8220;long&#8221; lines and treats them differently.  </p>
+      
+      <ul>
+	<li>A sequence of long lines is treated as a single, unbroken line, without newlines. </li>
+	
+	<li>A single short line between two long lines is also treated as part of the single, long line, and does not insert a newline either.  This ensures that text that has been hard wrapped more than once at decreasing line lengths is repaired, and rendered as a single unbroken paragraph.</li>
+	
+	<li>Two or more consecutive short lines are treated as an ad-ho