Parsing an HTML Text Editor's Content Using Python and BeautifulSoup

A 35-minute piece by Jared Nagle

Any software developer who has attempted to sift through the output of a rich text editor will have come across the nightmare of parsing an arbitrarily redundant, attribute-heavy DOM of wonders. Even in today's advanced Internet society, browsers still differ in their treatment of rich text. Until recently, however, I was only regurgitating the comments of my colleagues and superiors, as I had never actually needed to attempt this task myself.

My recent assignment involved parsing the content produced by such an editor. The objective was to export this data into a Word document (.docx). Easy, right? Granted, this was my first ever attempt at parsing a large DOM chunk in its entirety and, having no prior experience, my hopes were (naively?) high. I'd already learned about the comedic reactions to those who had attempted to regex arbitrary HTML, so I went in cautious. The editor in question is a restricted version of the Google Closure UI Editor. Google Closure is a JavaScript toolkit that provides a cross-browser library along with compilation/minification (see Google Closure). It took a few refactoring iterations, but I believe I've arrived at a palatable solution, at least for those lucky enough to be coding in Python. That being said, the solution is generic enough that if you happen to come across a similar library for objectifying HTML into native code, you'll probably have an easy time porting this code over.

BeautifulSoup is an HTML parsing library that does a fantastic job of translating any HTML tree into a Python tree of just two types: Tag and NavigableString.
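To make the two types concrete, here is a minimal sketch. The HTML fragment is invented for illustration, and it uses Python's bundled 'html.parser' rather than the lxml parser used later in this article, so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Illustrative fragment (not taken from the real editor output).
soup = BeautifulSoup("<p>Hello <b>world</b>!</p>", "html.parser")
paragraph = soup.p

# Every child of a Tag is either another Tag or a NavigableString.
kinds = [type(child).__name__ for child in paragraph.contents]
print(kinds)  # ['NavigableString', 'Tag', 'NavigableString']

for child in paragraph.contents:
    if isinstance(child, NavigableString):
        print("string:", repr(str(child)))
    elif isinstance(child, Tag):
        print("tag:", child.name)
```

Special nodes such as HTML comments arrive as subclasses of NavigableString, so an isinstance check against NavigableString still catches them.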

A lot of trust has been placed in the ability of BeautifulSoup to parse the HTML correctly, so if you're using another language, make sure the reputation of your library is solid, or deal with a subset of HTML that you know your parser will understand.

Thanks to the foresight of our team, the editor itself was a restricted version that only supported a few element types: numbered and bullet lists, as well as underlined, bold and italic text.

After a few error-prone implementations and some fumbling, I have arrived at an elegant solution that provides a flexible interface for traversing an arbitrary HTML structure.

The Traversal Interface

# rich_text_parser.py

...

class RichTextTraversal(object):
    def handle_string(self, navigable_string):
        """
        handle the actual contents of a tag. A tag may have zero to many of these.
        :param navigable_string: for all intents and purposes, this may be treated as a native python string
        :type navigable_string: ``NavigableString``
        """
        pass

    def start_tag(self, tag):
        """
        logic when entering a tag
        :param tag: the Tag object provided by the BeautifulSoup library
        :type tag: ``Tag``
        """
        pass

    def end_tag(self, tag):
        """
        logic when leaving a tag
        :param tag: the Tag object provided by the BeautifulSoup library
        :type tag: ``Tag``
        """
        pass

The object I was working with was a tree of tags that may contain strings, so a traverser only needs to know what to do when entering and exiting a tag, and what to do when it encounters a string. Any context is maintained by the traversal implementation itself, which need not know anything about the parser.

The Parser

# rich_text_parser.py

from bs4 import NavigableString, BeautifulSoup
from bs4 import Tag

class RichTextParser(object):
    """
    Assists in the parsing of html markup
    """

    def __init__(self, markup):
        """
        takes either a beautiful soup object or a markup string and attempts to parse it into a ``BeautifulSoup``
        :param markup: semantic html markup string or a pre-parsed BeautifulSoup object
        :type markup: ``str`` | ``BeautifulSoup``
        """
        if isinstance(markup, str):
            markup = BeautifulSoup(markup, 'lxml')
        self._soup = markup
        self._traverser = None  # type: ``RichTextTraversal``

    def traverse(self, rich_text_traversal):
        """
        traverses the parsed structure, calling the traversal's handlers along the way, and returns the traversal
        :param rich_text_traversal: the traversal implementation
        :type rich_text_traversal: T <= ``RichTextTraversal``
        :rtype: T
        """
        self._traverser = rich_text_traversal
        self._tag(self._soup)
        return rich_text_traversal

    def _tag(self, tag):
        """
        navigates the traversal's logic for each tag and their contents recursively
        :param tag: the string or tag object
        :type tag: ``NavigableString`` | ``Tag``
        """
        if isinstance(tag, NavigableString):
            self._traverser.handle_string(tag)
        else:
            self._traverser.start_tag(tag)
            for child_tag in tag.contents:
                self._tag(child_tag)
            self._traverser.end_tag(tag)

...

This implementation follows the structure of the Traversal interface. The traverse function accepts a RichTextTraversal implementation and hands off control to the _tag(self, tag) function. The logic is simple:

If a string is encountered anywhere, it cannot have contents, so we can safely call self._traverser.handle_string(tag) and continue. Otherwise the node is a tag: we call self._traverser.start_tag(tag), recurse into each item of tag.contents, and finish with self._traverser.end_tag(tag).
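The order in which the callbacks fire can be seen with a small logging traversal. This is only a sketch: EventLogTraversal and walk are hypothetical names, walk is a condensed stand-in for RichTextParser._tag, and 'html.parser' is used instead of lxml to keep the example dependency-free.

```python
from bs4 import BeautifulSoup, NavigableString


class EventLogTraversal(object):
    """Hypothetical traversal that records every callback it receives."""

    def __init__(self):
        self.events = []

    def handle_string(self, navigable_string):
        self.events.append(("string", str(navigable_string)))

    def start_tag(self, tag):
        self.events.append(("start", tag.name))

    def end_tag(self, tag):
        self.events.append(("end", tag.name))


def walk(node, traverser):
    """Condensed stand-in for RichTextParser._tag: same depth-first order."""
    if isinstance(node, NavigableString):
        traverser.handle_string(node)
    else:
        traverser.start_tag(node)
        for child in node.contents:
            walk(child, traverser)
        traverser.end_tag(node)


soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")
log = EventLogTraversal()
walk(soup, log)
for event in log.events:
    print(event)
# ('start', '[document]')
# ('start', 'p')
# ('string', 'Hello ')
# ('start', 'b')
# ('string', 'world')
# ('end', 'b')
# ('end', 'p')
# ('end', '[document]')
```

The outermost '[document]' entry is the BeautifulSoup object itself, which is visited like any other tag. With the lxml parser, a fragment like this is additionally wrapped in html and body tags, so a traversal should simply ignore tag names it does not care about.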

The Traversal Implementation

from bs4 import NavigableString
from bs4 import Tag
from docx import Document
from docx.blkcntnr import BlockItemContainer
from docx.text.paragraph import Paragraph
from docx.text.run import Run

from st.soup import RichTextTraversal


class FillDocumentRichTextTraversal(RichTextTraversal):
    def __init__(self, container=None, append=True):
        """
        create a docx from a soup traversal
        :param container: the document container to append to
        :type container: ``BlockItemContainer``
        """
        if container is None:
            container = Document()
        self._overwrite_current_paragraph = not append
        self.rootDocumentContainer = container  # type: BlockItemContainer
        self._current_paragraph = None  # type: Paragraph
        self._current_indent = 0
        self._overflow_indent = 0
        self._current_run = None  # type: Run
        self._format_context = {"b": 0, "u": 0, "i": 0}
        self._start_tag_handlers = {
            'ol': lambda _: self._handle_list_start(),
            'ul': lambda _: self._handle_list_start(),
            'li': self._handle_li_start,
            'br': self._handle_br,
            'b': lambda _: self._up_format("b"),
            'u': lambda _: self._up_format("u"),
            'i': lambda _: self._up_format("i")
        }  # type: dict[str, Tag -> None]
        self._end_tag_handlers = {
            'b': lambda _: self._down_format("b"),
            'u': lambda _: self._down_format("u"),
            'i': lambda _: self._down_format("i"),
            'ul': lambda _: self._handle_list_end(),
            'ol': lambda _: self._handle_list_end()
        }  # type: dict[str, Tag -> None]
        self._block_tags = {
            'p', 'div', 'ol', 'ul', 'br'
        }
        self._inline_tags = {
            'span', 'b', 'u', 'i'
        }

    def handle_string(self, navigable_string):
        run = self.get_or_add_current_run()
        run.text += navigable_string
        if self._format_context['b'] > 0:
            run.bold = True
        if self._format_context['i'] > 0:
            run.italic = True
        if self._format_context['u'] > 0:
            run.underline = True
        self._current_run = None


    def start_tag(self, tag):
        if tag.name in self._block_tags:
            self._start_block_tag()
        if tag.name in self._start_tag_handlers:
            self._start_tag_handlers[tag.name](tag)

    def end_tag(self, tag):
        if tag.name in self._end_tag_handlers:
            self._end_tag_handlers[tag.name](tag)

    def _handle_li_start(self, list_item):
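        # NOTE: `is not NavigableString` compares the child with the class itself, so that
        # clause is always true; an isinstance() check was probably intended.
        if len(list_item.contents) == 1 and list_item.contents[0] is not NavigableString \
                and list_item.contents[0].name != 'br' or list_item.next_sibling: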
            self.get_or_add_paragraph(force_add=True)
        self.get_or_add_paragraph().style = "List {style}{number}".format(
            style="Number" if list_item.parent.name == 'ol' else "Bullet",
            number=" {}".format(self._current_indent) if self._current_indent > 1 else "")

    def _handle_br(self, br):
        inside_list = br.find_parent('ol') or br.find_parent('ul')
        if not inside_list:
            self.get_or_add_paragraph(force_add=True)
        elif br.find_parent('li').next_sibling and br.previous_sibling:
            self.get_or_add_paragraph(force_add=True).style = 'List{}'.format(
                " {}".format(self._current_indent) if self._current_indent > 1 else "")

    def get_or_add_current_run(self, force_add=False):
        if self._current_run is None or self._current_paragraph is None:
            current_paragraph = self.get_or_add_paragraph()
            self._current_run = current_paragraph.add_run()
        return self._current_run

    def get_or_add_paragraph(self, force_add=False):
        if self._overwrite_current_paragraph and len(self.rootDocumentContainer.paragraphs) > 0:
            self._overwrite_current_paragraph = False
            self._current_paragraph = self.rootDocumentContainer.paragraphs[-1]
        elif self._overwrite_current_paragraph:
            # we don't want to overwrite the next paragraph if there wasn't any to begin with
            self._overwrite_current_paragraph = False
        elif self._current_paragraph is None or force_add:
            self._current_paragraph = self.rootDocumentContainer.add_paragraph()
        return self._current_paragraph

    def _start_block_tag(self):
        self._current_paragraph = None

    def _up_format(self, format_type):
        self._format_context.update({format_type: self._format_context[format_type] + 1})

    def _down_format(self, format_type):
        self._format_context.update({format_type: max(self._format_context[format_type] - 1, 0)})

    def _handle_list_start(self):
        if self._overflow_indent < 1 and self._current_indent < 3:
            self._current_indent += 1
        else:
            self._overflow_indent += 1

    def _handle_list_end(self):
        if self._overflow_indent > 0:
            self._overflow_indent -= 1
        else:
            self._current_indent -= 1

The main idea in this implementation is that no actual content is created until a NavigableString is encountered. When entering a tag, such as a new <ul>/<ol> or a string format tag (<b>, <i>, <u>), a context (such as the indent level, or whether the next string is bold or italic) is updated accordingly.
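The indent part of that context can be isolated into a few lines. The sketch below mirrors the logic of _handle_list_start and _handle_list_end in a hypothetical IndentTracker class; the style names follow the "List Bullet", "List Bullet 2", "List Bullet 3" pattern built in _handle_li_start, and the cap of three levels is presumably there because those are the list styles available in the document template.

```python
class IndentTracker(object):
    """Hypothetical helper mirroring _handle_list_start / _handle_list_end."""

    MAX_STYLED_DEPTH = 3  # the code above stops increasing the indent at 3

    def __init__(self):
        self.current = 0   # depth used to pick the paragraph style
        self.overflow = 0  # nesting levels beyond the styled maximum

    def enter_list(self):
        if self.overflow < 1 and self.current < self.MAX_STYLED_DEPTH:
            self.current += 1
        else:
            self.overflow += 1

    def leave_list(self):
        if self.overflow > 0:
            self.overflow -= 1
        else:
            self.current -= 1

    def style_name(self, ordered):
        base = "List Number" if ordered else "List Bullet"
        if self.current > 1:
            return "{} {}".format(base, self.current)
        return base


tracker = IndentTracker()
names = []
for _ in range(5):  # five nested <ul> elements
    tracker.enter_list()
    names.append(tracker.style_name(ordered=False))
print(names)
# ['List Bullet', 'List Bullet 2', 'List Bullet 3', 'List Bullet 3', 'List Bullet 3']

for _ in range(2):  # closing the two deepest lists only drains the overflow
    tracker.leave_list()
print(tracker.current, tracker.overflow)  # 3 0
```

Lists nested deeper than three levels are therefore rendered at the third level rather than failing, and the overflow counter keeps the depth correct once the deeper lists close.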

Once the string is encountered, the correct context is available to apply all the formatting we need to place it in the document. The current run or paragraph is lazily retrieved so as to always have a non-None context for appending text to.
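The same idea can be shown without python-docx. In the sketch below, FormatLogTraversal and walk are hypothetical names: the traversal keeps the per-tag counters described above and, instead of creating a docx run, records each string together with the formats active at that moment.

```python
from bs4 import BeautifulSoup, NavigableString

FORMAT_TAGS = ("b", "i", "u")


class FormatLogTraversal(object):
    """Hypothetical traversal: records (text, active formats) per string."""

    def __init__(self):
        self.depth = {name: 0 for name in FORMAT_TAGS}
        self.runs = []

    def start_tag(self, tag):
        if tag.name in self.depth:
            self.depth[tag.name] += 1

    def end_tag(self, tag):
        if tag.name in self.depth:
            self.depth[tag.name] = max(self.depth[tag.name] - 1, 0)

    def handle_string(self, navigable_string):
        # Content is only produced here, using whatever context is active.
        active = tuple(name for name in FORMAT_TAGS if self.depth[name] > 0)
        self.runs.append((str(navigable_string), active))


def walk(node, traverser):
    """Condensed stand-in for RichTextParser._tag."""
    if isinstance(node, NavigableString):
        traverser.handle_string(node)
    else:
        traverser.start_tag(node)
        for child in node.contents:
            walk(child, traverser)
        traverser.end_tag(node)


def collect_runs(markup):
    traversal = FormatLogTraversal()
    walk(BeautifulSoup(markup, "html.parser"), traversal)
    return traversal.runs


runs = collect_runs("<p>plain <b>bold <i>both</i></b></p>")
for run in runs:
    print(run)
# ('plain ', ())
# ('bold ', ('b',))
# ('both', ('b', 'i'))
```

Counters are used rather than booleans so that redundant nesting, which rich text editors produce freely, does not switch a format off too early: in `<b><b>x</b>y</b>` the string y is still bold after the inner tag closes.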

Parsing markup from a rich text editor is never a trivial task and, if at all possible, reducing the abilities of your editor reduces the problem set you will have to deal with. As I write these concluding remarks, my heart saddens, as the parser I wrote is no longer a viable solution for our project: for the aforementioned reasons, new issues keep manifesting as more data is pushed through it. We have resorted to producing a PDF instead, using the very capable WeasyPrint library. Even though this is an article about parsing markup, I would still professionally recommend using something other than rich text editors in the first place. The maintainer of WeasyPrint also happens to be one of the maintainers of the official CSS3 specifications, so producing the PDF from markup was quite trivial compared to this task. Anyhow, I do hope this helps anyone else who may be under the pump to produce a similar result.
