Statamic

Importing HTML into Statamic’s Bard fieldtype

This article covers importing HTML in Statamic 3.4 and higher with the tiptap-php package. For Statamic 3.3 and lower, check out the previous version that uses html-to-prosemirror.

Statamic’s Bard fieldtype stores values as ProseMirror documents, so if you’re importing existing HTML with a PHP script, it makes sense to convert it to the nodes and marks Bard expects¹. This guide outlines how to convert HTML to ProseMirror, a couple of gotchas, and finally how to handle sets.

Converting HTML to ProseMirror
Potential problems
- Fixing invalid node data
- Excluding disabled node types
Converting elements to sets

Converting HTML to ProseMirror

Getting started is very easy. Statamic includes the tiptap-php package, and all we need to do is create a new editor instance and pass the HTML to it:

use Tiptap\Editor;

$value = (new Editor())->setContent($html)->getDocument()['content'];

The editor will return a full document node, but Bard only needs the content.

The tiptap-php editor expects UTF-8 encoded data. If your HTML is in a different encoding you’ll need to convert it first:

// $sourceEncoding is the current encoding of your HTML, e.g. 'ISO-8859-1'
$html = mb_convert_encoding($html, 'UTF-8', $sourceEncoding);
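
If an import mixes files that are already UTF-8 with files that aren't, converting everything blindly would double-encode the good ones. Here's a minimal sketch of a guard around the call above. The toUtf8 helper name and the ISO-8859-1 sample are assumptions made up for illustration, not part of Statamic or tiptap-php:

```php
<?php

/**
 * Hypothetical helper: return $html as UTF-8, converting only when needed.
 */
function toUtf8(string $html, string $sourceEncoding): string
{
    // Already valid UTF-8? Leave it alone to avoid double-encoding.
    if (mb_check_encoding($html, 'UTF-8')) {
        return $html;
    }

    return mb_convert_encoding($html, 'UTF-8', $sourceEncoding);
}

// "café" encoded as ISO-8859-1: the é is the single byte 0xE9,
// which is not a valid UTF-8 sequence on its own.
$latin1 = "<p>caf\xE9</p>";

echo toUtf8($latin1, 'ISO-8859-1'), PHP_EOL;
```

The check is a heuristic: a non-UTF-8 string can occasionally happen to be valid UTF-8, so if you know the source encoding for certain it's safer to convert unconditionally.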

By default the editor loads an extension called StarterKit, which only includes basic text elements. To convert other elements such as images, links and tables, you will need to include the necessary extensions when initializing your editor:

use Tiptap\Editor;
use Tiptap\Extensions;
use Tiptap\Marks;
use Tiptap\Nodes;

$value = (new Editor([
    'extensions' => [
        new Extensions\StarterKit,
        new Nodes\Image,
        new Nodes\Table,
        new Nodes\TableCell,
        new Nodes\TableHeader,
        new Nodes\TableRow,
        new Marks\Link,
    ],
]))->setContent($html)->getDocument()['content'];

You can find the full list of node and mark extensions in the repo.

You can also convert HTML to ProseMirror with the Bard Augmentor class's renderHtmlToProsemirror method. However, using tiptap-php directly allows you to specify the exact extensions to use and configure them however you want.

Potential problems

Fixing invalid node data

Unfortunately things aren’t always that simple. ProseMirror uses a schema that defines which nodes and marks are valid where, but tiptap-php doesn't actually enforce it. Each element is simply converted to the nearest ProseMirror equivalent, maintaining the original hierarchy.

One common issue is text nodes at the root of the returned value, which you’ll find can’t be edited through the control panel. This can happen if your HTML contains text that’s not wrapped in a paragraph or other block element.

We can fix these nodes by wrapping them in paragraphs after conversion:

$value = collect($value)
    // Wrap root-level text nodes in a paragraph so they can be edited.
    ->map(function ($node) {
        return $node['type'] === 'text' ? [
            'type' => 'paragraph',
            'content' => [$node],
        ] : $node;
    })
    // Drop root-level hard breaks. Both spellings are checked: camelCase
    // as in Tiptap 2, snake_case as in older ProseMirror data.
    ->filter(function ($node) {
        return ! in_array($node['type'], ['hardBreak', 'hard_break'], true);
    })
    ->values()
    ->all();

Root hard break (<br>) nodes will cause the same problem, but as we’re introducing paragraphs those can simply be filtered out (remember this is only affecting the root nodes).
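
To make the effect concrete, here's a plain-PHP sketch of the same clean-up (no Laravel collections) run against made-up sample data. The fixRootNodes name and the sample nodes are assumptions for illustration only, and the hard break check accepts both the camelCase and snake_case spellings rather than assuming which one your converter emits:

```php
<?php

/**
 * Hypothetical helper: wrap root-level text nodes in paragraphs and
 * drop root-level hard breaks. Only the top level is touched.
 */
function fixRootNodes(array $nodes): array
{
    $fixed = [];

    foreach ($nodes as $node) {
        // Assumption: the hard break node may use either spelling.
        if (in_array($node['type'], ['hardBreak', 'hard_break'], true)) {
            continue;
        }

        $fixed[] = $node['type'] === 'text'
            ? ['type' => 'paragraph', 'content' => [$node]]
            : $node;
    }

    return $fixed;
}

// Made-up sample: the kind of value "Intro<br><p>Body</p>" might produce.
$value = [
    ['type' => 'text', 'text' => 'Intro'],
    ['type' => 'hardBreak'],
    ['type' => 'paragraph', 'content' => [['type' => 'text', 'text' => 'Body']]],
];

$fixed = fixRootNodes($value);

echo json_encode($fixed, JSON_PRETTY_PRINT), PHP_EOL;
```

The loose "Intro" text ends up in its own paragraph, the break disappears, and the existing paragraph passes through unchanged.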

Excluding disabled node types

If your HTML contains elements that you haven’t enabled in the Bard field, you’ll run into invalid data errors in the control panel. To exclude these you can either skip the StarterKit extension and manually list the extensions you need, or just tell the StarterKit which default extensions to switch off:

$value = (new Editor([
    'extensions' => [
        new Extensions\StarterKit([
            'heading' => false,
            'horizontalRule' => false,
        ]),
        // ...
    ],
]))->setContent($html)->getDocument()['content'];

When tiptap-php can't match an element it will skip it, but it'll still process its children. Therefore, if you exclude the heading extension, your value will still include the text inside each heading.

Converting elements to sets

The built-in tiptap-php extensions do a great job of converting common HTML elements to standard nodes and marks, but what about Statamic sets? Say your HTML contains images wrapped in figure elements, and you want to convert those to sets instead of image nodes.

Something like this:

<figure class="image">
    <img src="pizza.jpg" alt="Pizza slice" />
    <figcaption>A tasty slice of Pizza!</figcaption>
</figure>

To handle that you’ll need a custom extension. An extension's parseHTML method should specify the element types you’re looking for, and its addAttributes method should list the attributes the node has. If the attributes don't directly map to HTML attributes you can specify a parseHTML callback within the attributes array, which will receive the DOMNode object. Statamic sets store all their values in a single values attribute.

For the HTML above the following extension will do what we need:

use Tiptap\Core\Node;

class Set extends Node
{
    public static $name = 'set';

    public static $priority = 200;

    public function parseHTML()
    {
        return [
            [
                'tag' => 'figure[class="image"]',
            ],
        ];
    }

    public function addAttributes()
    {
        return ['values' => [
            'parseHTML' => function ($DOMNode) {
                $img = $DOMNode->getElementsByTagName('img')->item(0);
                $caption = $DOMNode->getElementsByTagName('figcaption')->item(0);

                while ($DOMNode->hasChildNodes()) {
                    $DOMNode->removeChild($DOMNode->firstChild);
                }

                return [
                    'type' => 'image',
                    'image' => $img->getAttribute('src'),
                    'alt' => $img->getAttribute('alt'),
                    'caption' => $caption->textContent,
                ];
            },
        ]];
    }
}

$value = (new Editor([
    'extensions' => [
        new Set,
        // ...
    ],
]))->setContent($html)->getDocument()['content'];

We have to remove all child nodes once we're done with them, otherwise they’ll be processed by the editor as well, resulting in duplicate content.
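
Because the callback only deals with a DOM node, its logic can be tried out on its own with PHP's DOM extension before wiring it into the editor. The sketch below is a stand-alone variant, not the extension itself: the figureToValues name is made up, and the null guards for a missing img or figcaption are an addition on the assumption that real-world HTML won't always contain both.

```php
<?php

/**
 * Hypothetical stand-alone version of the attribute callback, with guards
 * for a missing <img> or <figcaption>.
 */
function figureToValues(DOMElement $figure): array
{
    $img = $figure->getElementsByTagName('img')->item(0);
    $caption = $figure->getElementsByTagName('figcaption')->item(0);

    // Read everything we need before the children are removed.
    $values = [
        'type' => 'image',
        'image' => $img ? $img->getAttribute('src') : null,
        'alt' => $img ? $img->getAttribute('alt') : null,
        'caption' => $caption ? trim($caption->textContent) : null,
    ];

    // Empty the element so nothing inside it is converted a second time.
    while ($figure->hasChildNodes()) {
        $figure->removeChild($figure->firstChild);
    }

    return $values;
}

$html = '<figure class="image"><img src="pizza.jpg" alt="Pizza slice" />'
    . '<figcaption>A tasty slice of Pizza!</figcaption></figure>';

// libxml's HTML parser warns about HTML5 tags such as <figure>; collect
// those warnings internally instead of emitting them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

$figure = $dom->getElementsByTagName('figure')->item(0);
$values = figureToValues($figure);

print_r($values);
var_dump($figure->hasChildNodes()); // bool(false)
```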

Using this extension on the example above results in this output:

-
    type: set
    attrs:
        values:
            type: image
            image: pizza.jpg
            alt: 'Pizza slice'
            caption: 'A tasty slice of Pizza!'

There can only be one set extension, so if you want to convert different elements to different set types you'll need to implement them all within one extension. The parseHTML method can return multiple element types to match.
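
One way to structure that is to have parseHTML return one entry per element type (say figure[class="image"] and figure[class="video"]) and let the values callback branch on the element it receives. Here's a rough sketch of just the branching part, runnable with PHP's DOM extension alone. The setValuesFor name, the video set and its url field are all hypothetical, so swap in the set handles and fields from your own Bard config:

```php
<?php

/**
 * Hypothetical helper: build the values for whichever set type an element
 * represents. The "image" and "video" handles and their fields are made up.
 */
function setValuesFor(DOMElement $element): array
{
    $class = $element->getAttribute('class');

    if ($class === 'image') {
        $img = $element->getElementsByTagName('img')->item(0);
        $values = ['type' => 'image', 'image' => $img ? $img->getAttribute('src') : null];
    } elseif ($class === 'video') {
        $iframe = $element->getElementsByTagName('iframe')->item(0);
        $values = ['type' => 'video', 'url' => $iframe ? $iframe->getAttribute('src') : null];
    } else {
        throw new InvalidArgumentException("No set type for class \"{$class}\"");
    }

    // Empty the element so its children aren't converted again.
    while ($element->hasChildNodes()) {
        $element->removeChild($element->firstChild);
    }

    return $values;
}

libxml_use_internal_errors(true); // silence warnings about HTML5 tags

$dom = new DOMDocument();
$dom->loadHTML(
    '<figure class="image"><img src="pizza.jpg"></figure>'
    . '<figure class="video"><iframe src="https://example.com/embed/1"></iframe></figure>'
);

$sets = [];
foreach (iterator_to_array($dom->getElementsByTagName('figure')) as $figure) {
    $sets[] = ['type' => 'set', 'attrs' => ['values' => setValuesFor($figure)]];
}

print_r($sets);
```

Inside the extension, the values callback would simply return setValuesFor($DOMNode).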

Extensions are matched in the order they’re added, but you can add a priority to your extension if you want it to be matched earlier. Most default extensions have a priority of 100. Another thing to bear in mind is that sets are only valid at the root of the value.

¹ You can just copy the HTML over and let the control panel figure it out when the entry is next edited. However, converting during import not only ensures everything is in the right format from the start, but it also allows you to fix any issues and control exactly how it's converted.

If you'd like some help putting together an HTML to ProseMirror conversion tool specific to your needs I'm available on a freelance basis. Feel free to get in touch to discuss your project.

20th March 2023
