Statamic

Importing HTML into Statamic’s Bard fieldtype

This article covers importing HTML in Statamic 3.3 and lower with the html-to-prosemirror package. For Statamic 3.4 and higher, check out the updated version that uses tiptap-php.

Statamic’s Bard fieldtype stores values as ProseMirror documents, so if you’re importing existing HTML with a PHP script it makes sense to convert it to the nodes and marks Bard expects¹. This guide outlines how to convert HTML to ProseMirror, a couple of gotchas, and finally how to handle sets.

Converting HTML to ProseMirror
Potential problems
- Fixing invalid node data
- Excluding disabled node types
Converting elements to sets

Converting HTML to ProseMirror

Getting started is very easy. Statamic includes the html-to-prosemirror package, and all we need to do is create a new renderer instance and pass the HTML to it:

use HtmlToProseMirror\Renderer;
 
$value = (new Renderer)->render($html)['content'];

The renderer returns a full document node, but Bard only needs that node's content array, which is why we grab the 'content' key.

The html-to-prosemirror renderer expects UTF-8 encoded data. If your HTML is in a different encoding you’ll need to convert it first:

// Set $sourceEncoding to the encoding of your HTML, e.g. 'ISO-8859-1'
$html = mb_convert_encoding($html, 'UTF-8', $sourceEncoding);
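
If you don't know the source encoding up front, it can be guessed before converting. The following is a minimal sketch using PHP's mbstring functions. The toUtf8 helper name and the list of candidate encodings are assumptions for illustration, and detection is a heuristic, so spot-check the results on your own content:

```php
<?php

/**
 * Hypothetical helper: return $html as UTF-8.
 * The candidate list is an assumption; adjust it to the encodings you expect.
 */
function toUtf8(string $html, array $candidates = ['UTF-8', 'Windows-1252', 'ISO-8859-1']): string
{
    // Already valid UTF-8? Leave it untouched.
    if (mb_check_encoding($html, 'UTF-8')) {
        return $html;
    }

    // Strict detection returns false when no candidate fits.
    $source = mb_detect_encoding($html, $candidates, true);

    if ($source === false) {
        throw new InvalidArgumentException('Could not detect the source encoding.');
    }

    return mb_convert_encoding($html, 'UTF-8', $source);
}
```

Call toUtf8($html) before passing the HTML to the renderer.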

Potential problems

Fixing invalid node data

Unfortunately things aren’t always that simple. ProseMirror uses a schema that defines which nodes and marks are valid where, but html-to-prosemirror doesn't actually enforce it. Each element is simply converted to the nearest ProseMirror equivalent, maintaining the original hierarchy.

One common issue is text nodes at the root of the returned value, which you’ll find can’t be edited through the control panel. This can happen if your HTML contains text that’s not wrapped in a paragraph or other block element.

We can fix these nodes by wrapping them in paragraphs after conversion:

$value = collect($value)
    // Wrap root-level text nodes in a paragraph
    ->map(function ($node) {
        return $node['type'] === 'text' ? [
            'type' => 'paragraph',
            'content' => [$node],
        ] : $node;
    })
    // Drop root-level hard breaks
    ->filter(function ($node) {
        return $node['type'] !== 'hard_break';
    })
    // Reindex the remaining nodes
    ->values()
    ->all();

Root hard break (<br>) nodes will cause the same problem, but as we’re introducing paragraphs those can simply be filtered out (remember this is only affecting the root nodes).
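
The snippet above gives every root text node its own paragraph. If your HTML has runs of adjacent root text (plain text followed by bold text, for example), you may prefer to keep each run together. Here's a sketch of one way to do that in plain PHP. The wrapRootInlineNodes helper is hypothetical, and treating root hard breaks as paragraph boundaries is an assumption rather than anything html-to-prosemirror or Bard prescribes:

```php
<?php

/**
 * Hypothetical helper: wrap runs of root-level text nodes in paragraphs.
 * Root hard breaks end the current paragraph and are dropped.
 */
function wrapRootInlineNodes(array $nodes): array
{
    $result = [];
    $run = [];

    // Turn any collected text nodes into a single paragraph.
    $flush = function () use (&$result, &$run) {
        if ($run !== []) {
            $result[] = ['type' => 'paragraph', 'content' => $run];
            $run = [];
        }
    };

    foreach ($nodes as $node) {
        if ($node['type'] === 'text') {
            // Collect adjacent text nodes (with or without marks).
            $run[] = $node;
        } elseif ($node['type'] === 'hard_break') {
            // Assumption: a root hard break starts a new paragraph.
            $flush();
        } else {
            // Any other node is kept as it is.
            $flush();
            $result[] = $node;
        }
    }

    $flush();

    return $result;
}
```

It works on the same array the renderer returns, so it could be used in place of the collection pipeline.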

Excluding disabled node types

If your HTML contains elements that you haven’t enabled in the Bard field you’ll run into invalid data errors in the control panel. Excluding these nodes and marks is as simple as not including the relevant extensions when initialising the renderer:

use HtmlToProseMirror\Nodes;
use HtmlToProseMirror\Marks;
 
$value = (new Renderer)
    ->withNodes([
        Nodes\Blockquote::class,
        Nodes\BulletList::class,
        Nodes\CodeBlock::class,
        Nodes\CodeBlockWrapper::class,
        Nodes\HardBreak::class,
        Nodes\Heading::class,
        Nodes\HorizontalRule::class,
        Nodes\Image::class,
        Nodes\ListItem::class,
        Nodes\OrderedList::class,
        Nodes\Paragraph::class,
        // Nodes\Table::class,
        // Nodes\TableCell::class,
        // Nodes\TableHeader::class,
        // Nodes\TableRow::class,
        // Nodes\TableWrapper::class,
        Nodes\Text::class,
        Nodes\User::class,
    ])
    ->withMarks([
        Marks\Bold::class,
        Marks\Code::class,
        Marks\Italic::class,
        // Marks\Link::class,
        Marks\Strike::class,
        Marks\Subscript::class,
        Marks\Superscript::class,
        Marks\Underline::class,
    ])
    ->render($html)['content'];

When html-to-prosemirror can't match an element it will skip it, but it will still process its children. Therefore if you exclude the link extension your value will still include the inner link text.

Converting elements to sets

The built-in html-to-prosemirror extensions do a great job of converting common HTML elements to standard nodes and marks, but what about Statamic sets? Say your HTML contains images wrapped in figure elements, and you want to convert those to sets instead of image nodes.

Something like this:

<figure class="image">
    <img src="pizza.jpg" alt="Pizza slice" />
    <figcaption>A tasty slice of Pizza!</figcaption>
</figure>

To handle that you’ll need a custom extension. Extensions receive a DOMNode object, and must implement a matching method and a data method. The matching method should check whether the DOMNode matches the elements you’re looking for, and the data method should return the equivalent ProseMirror node data. The same rules apply to marks.

For the HTML above the following extension will do what we need:

use HtmlToProseMirror\Nodes\Node;
 
class ImageSet extends Node
{
    public function matching()
    {
        return $this->DOMNode->nodeName === 'figure'
            && $this->DOMNode->getAttribute('class') === 'image';
    }
 
    public function data()
    {
        $img = $this->DOMNode->getElementsByTagName('img')->item(0);
        $caption = $this->DOMNode->getElementsByTagName('figcaption')->item(0);
 
        while ($this->DOMNode->hasChildNodes()) {
            $this->DOMNode->removeChild($this->DOMNode->firstChild);
        }
 
        return [
            'type' => 'set',
            'attrs' => [
                'values' => [
                    'type' => 'image',
                    'image' => $img->getAttribute('src'),
                    'alt' => $img->getAttribute('alt'),
                    'caption' => $caption->textContent,
                ],
            ],
        ];
    }
}

$value = (new Renderer)
    ->withNodes([
        ImageSet::class,
        // ...
    ])
    ->render($html)['content'];

We have to remove all child nodes once we're done with them, otherwise they’ll be processed by the renderer as well, resulting in duplicate content.
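
Real-world HTML is rarely as tidy as the example: a figure might have extra classes or no caption at all, in which case the exact class comparison and the caption lookup in ImageSet would fail. The sketch below shows more forgiving versions of those lookups using only PHP's DOM extension. The hasClass and figureValues helpers are hypothetical and not part of html-to-prosemirror; you could call something similar from your extension's matching and data methods:

```php
<?php

/**
 * Hypothetical helper: true when the class attribute contains $class,
 * so class="image wide" still matches.
 */
function hasClass(DOMElement $element, string $class): bool
{
    $classes = preg_split('/\s+/', trim($element->getAttribute('class')));

    return in_array($class, $classes, true);
}

/**
 * Hypothetical helper: pull the set values out of a <figure> element,
 * or return null when it has no <img>.
 */
function figureValues(DOMElement $figure): ?array
{
    $img = $figure->getElementsByTagName('img')->item(0);
    $caption = $figure->getElementsByTagName('figcaption')->item(0);

    if ($img === null) {
        return null;
    }

    return [
        'type' => 'image',
        'image' => $img->getAttribute('src'),
        'alt' => $img->getAttribute('alt'),
        // A missing <figcaption> becomes an empty caption.
        'caption' => $caption !== null ? trim($caption->textContent) : '',
    ];
}

// Usage: libxml warns about HTML5 tags like <figure>, so collect errors quietly.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML('<figure class="image wide"><img src="pizza.jpg" alt="Pizza slice"></figure>');
$figure = $dom->getElementsByTagName('figure')->item(0);

$matches = hasClass($figure, 'image');
$values = figureValues($figure);
```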

Running the example HTML through the ImageSet extension results in this output:

-
    type: set
    attrs:
        values:
            type: image
            image: pizza.jpg
            alt: 'Pizza slice'
            caption: 'A tasty slice of Pizza!'

Extensions are matched in the order they’re added, so it’s always best to put your custom extensions first in the list. Another thing to bear in mind is that sets are only valid at the root of the value.
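
Because the renderer maintains the original hierarchy, a matching figure inside a blockquote or list item would produce a nested set. One way to catch those before saving is to walk the converted value and report any set that isn't at the root. This is only a sketch: the nestedSetPaths helper and its dotted path format are made up for illustration, and what you do with the results (move the set to the root, or fall back to an image node) is up to you:

```php
<?php

/**
 * Hypothetical helper: list the paths of set nodes nested below the root.
 * $nodes is a Bard value (an array of ProseMirror nodes).
 */
function nestedSetPaths(array $nodes, string $path = ''): array
{
    $paths = [];

    foreach ($nodes as $index => $node) {
        $current = $path === '' ? (string) $index : $path . '.content.' . $index;

        // A set anywhere other than the root level is invalid.
        if ($path !== '' && ($node['type'] ?? null) === 'set') {
            $paths[] = $current;
        }

        // Recurse into child nodes, if there are any.
        if (!empty($node['content'])) {
            $paths = array_merge($paths, nestedSetPaths($node['content'], $current));
        }
    }

    return $paths;
}
```

An empty array means every set in the value is at the root.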

¹ You can just copy the HTML over and let the control panel figure it out when the entry is next edited. However, converting during import not only ensures everything is in the right format from the start, but it also allows you to fix any issues and control exactly how it's converted.

If you'd like some help putting together an HTML to ProseMirror conversion tool specific to your needs I'm available on a freelance basis. Feel free to get in touch to discuss your project.

9th January 2023
