Jack Sleight .DEV

Importing HTML into Statamic’s Bard fieldtype

Statamic’s Bard fieldtype stores values as ProseMirror documents, so if you’re importing existing HTML with a PHP script it makes sense to convert it to the nodes and marks Bard expects [1]. This guide outlines how to convert HTML to ProseMirror, a couple of gotchas, and finally how to handle sets and other custom content.

Converting HTML to ProseMirror

Statamic 3.4 will introduce Bard 2, which uses Tiptap 2 and the new tiptap-php package. The examples below will not work with tiptap-php. I plan to publish an updated version of this guide once Statamic 3.4 is released.

Getting started is very easy. Statamic includes the html-to-prosemirror package, and all we need to do is create a new renderer instance and pass the HTML to it:

use HtmlToProseMirror\Renderer;
 
$value = (new Renderer)->render($html)['content'];

The renderer will return a full document node, but Bard only needs the content.

The html-to-prosemirror renderer expects UTF-8 encoded data. If your HTML is in a different encoding you’ll need to convert it first:

$html = mb_convert_encoding($html, 'utf-8', [source-encoding]);
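
If you’re importing from several sources you may not want to convert content that’s already UTF-8. Below is a minimal sketch of a guard around the conversion. The ensureUtf8 name is hypothetical, and it assumes you know the legacy encoding of your source. Bear in mind that a legacy-encoded string can occasionally also be valid UTF-8, so treat this as a convenience rather than reliable detection:

```php
/**
 * Convert $html to UTF-8 unless it already is valid UTF-8.
 * (Hypothetical helper for illustration.)
 */
function ensureUtf8(string $html, string $sourceEncoding): string
{
    // Leave the string alone if it is already well-formed UTF-8.
    if (mb_check_encoding($html, 'UTF-8')) {
        return $html;
    }

    return mb_convert_encoding($html, 'UTF-8', $sourceEncoding);
}

// "caf\xE9" is "café" encoded as ISO-8859-1.
$converted = ensureUtf8("caf\xE9", 'ISO-8859-1');

// Already UTF-8, so it is returned unchanged.
$untouched = ensureUtf8("caf\xC3\xA9", 'ISO-8859-1');
```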

Potential problems

Fixing invalid node data

Unfortunately things aren’t always that simple. ProseMirror uses a schema that defines which nodes and marks are valid where, but html-to-prosemirror doesn't actually enforce it. Each element is simply converted to the nearest ProseMirror equivalent, maintaining the original hierarchy.

One common issue is text nodes at the root of the returned value, which you’ll find can’t be edited through the control panel. This can happen if your HTML contains text that’s not wrapped in a paragraph or other block element.

We can fix these nodes by wrapping them in paragraphs after conversion:

$value = collect($value)
    ->map(function ($node) {
        return $node['type'] === 'text' ? [
            'type' => 'paragraph',
            'content' => [$node],
        ] : $node;
    })
    ->filter(function ($node) {
        return $node['type'] !== 'hard_break';
    })
    ->values()
    ->all();

Root hard break (<br>) nodes will cause the same problem, but as we’re introducing paragraphs those can simply be filtered out (remember this is only affecting the root nodes).
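
If you’d rather keep the line breaks, an alternative is to group each run of consecutive root text and hard break nodes into a single paragraph. This is only a sketch of that idea in plain PHP, not something html-to-prosemirror provides, and the wrapRootInlineNodes name is hypothetical. Note that a run made up only of hard breaks would end up as a paragraph containing just breaks, which you may want to filter out:

```php
/**
 * Wrap each run of consecutive root-level inline nodes in one paragraph.
 * (Hypothetical helper for illustration.)
 */
function wrapRootInlineNodes(array $nodes): array
{
    $inlineTypes = ['text', 'hard_break'];
    $result = [];
    $buffer = [];

    // Turn any buffered inline nodes into a paragraph and reset the buffer.
    $flush = function () use (&$result, &$buffer) {
        if ($buffer !== []) {
            $result[] = ['type' => 'paragraph', 'content' => $buffer];
            $buffer = [];
        }
    };

    foreach ($nodes as $node) {
        if (in_array($node['type'], $inlineTypes, true)) {
            $buffer[] = $node;
        } else {
            $flush();
            $result[] = $node;
        }
    }

    $flush();

    return $result;
}

// Sample root nodes, hand-written for this example.
$value = [
    ['type' => 'text', 'text' => 'First line'],
    ['type' => 'hard_break'],
    ['type' => 'text', 'text' => 'Second line'],
    ['type' => 'heading', 'attrs' => ['level' => 2], 'content' => [['type' => 'text', 'text' => 'Title']]],
    ['type' => 'text', 'text' => 'Trailing text'],
];

$fixed = wrapRootInlineNodes($value);
```

With the sample above, the first three nodes become one paragraph, the heading is left untouched, and the trailing text gets a paragraph of its own.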

Excluding disabled node types

If your HTML contains elements that you haven’t enabled in the Bard field, you’ll run into invalid data errors in the control panel. Excluding these nodes and marks is as simple as not including the relevant extensions when initialising the renderer:

use HtmlToProseMirror\Nodes;
use HtmlToProseMirror\Marks;

$value = (new Renderer)
    ->withNodes([
        Nodes\Blockquote::class,
        Nodes\BulletList::class,
        Nodes\CodeBlock::class,
        Nodes\CodeBlockWrapper::class,
        Nodes\HardBreak::class,
        Nodes\Heading::class,
        Nodes\HorizontalRule::class,
        Nodes\Image::class,
        Nodes\ListItem::class,
        Nodes\OrderedList::class,
        Nodes\Paragraph::class,
        // Nodes\Table::class,
        // Nodes\TableCell::class,
        // Nodes\TableHeader::class,
        // Nodes\TableRow::class,
        // Nodes\TableWrapper::class,
        Nodes\Text::class,
        Nodes\User::class,
    ])
    ->withMarks([
        Marks\Bold::class,
        Marks\Code::class,
        Marks\Italic::class,
        // Marks\Link::class,
        Marks\Strike::class,
        Marks\Subscript::class,
        Marks\Superscript::class,
        Marks\Underline::class,
    ])
    ->render($html)['content'];

When html-to-prosemirror can't match an element it will skip it, but it will still process its children. Therefore if you exclude the link extension your value will still include the inner link text.
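
To double check the result before saving, it can help to list the node and mark types that actually ended up in the converted value and compare them with what your Bard field allows. The following is a rough sketch in plain PHP. The collectTypes name, the sample value and the allowed list are all made up for this example:

```php
/**
 * Recursively collect the node and mark types used in a ProseMirror value.
 * (Hypothetical helper for illustration.)
 */
function collectTypes(array $nodes, array $found = ['nodes' => [], 'marks' => []]): array
{
    foreach ($nodes as $node) {
        $found['nodes'][$node['type']] = true;

        foreach ($node['marks'] ?? [] as $mark) {
            $found['marks'][$mark['type']] = true;
        }

        // Descend into child nodes, if there are any.
        if (! empty($node['content'])) {
            $found = collectTypes($node['content'], $found);
        }
    }

    return $found;
}

// Sample value, hand-written for this example.
$value = [
    ['type' => 'paragraph', 'content' => [
        ['type' => 'text', 'text' => 'Hello '],
        ['type' => 'text', 'text' => 'world', 'marks' => [
            ['type' => 'link', 'attrs' => ['href' => 'https://example.com']],
            ['type' => 'bold'],
        ]],
    ]],
];

$found = collectTypes($value);
$nodeTypes = array_keys($found['nodes']);
$markTypes = array_keys($found['marks']);

// Marks present in the value that this (assumed) field configuration doesn't allow.
$unexpected = array_values(array_diff($markTypes, ['bold', 'italic']));
```

If $unexpected isn’t empty, you know which extensions still need to be removed from the renderer or enabled on the field.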

Handling custom content

Converting elements to sets

The built-in html-to-prosemirror extensions do a great job of converting common HTML elements to standard nodes and marks, but what about Statamic sets? Say your HTML contains images wrapped in figure elements, and you want to convert those to sets instead of image nodes.

Something like this:

<figure class="image">
    <img src="pizza.jpg" alt="Pizza slice" />
    <figcaption>A tasty slice of Pizza!</figcaption>
</figure>

To handle that you’ll need a custom extension. Extensions receive a DOMNode object, and must implement a matching method and a data method. The matching method should check whether the DOMNode matches the elements you’re looking for, and the data method should return the equivalent ProseMirror node data. The same rules apply to marks.

For the HTML above the following extension will do what we need:

use HtmlToProseMirror\Nodes\Node;

class ImageSet extends Node
{
    public function matching()
    {
        // Only match <figure class="image"> elements.
        return $this->DOMNode->nodeName === 'figure'
            && $this->DOMNode->getAttribute('class') === 'image';
    }

    public function data()
    {
        // Grab the first <img> and <figcaption> inside the figure.
        $img = $this->DOMNode->getElementsByTagName('img')->item(0);
        $caption = $this->DOMNode->getElementsByTagName('figcaption')->item(0);

        // Remove the children so the renderer doesn't process them again.
        while ($this->DOMNode->hasChildNodes()) {
            $this->DOMNode->removeChild($this->DOMNode->firstChild);
        }

        return [
            'type' => 'set',
            'attrs' => [
                'values' => [
                    'type' => 'image',
                    'image' => $img->getAttribute('src'),
                    'alt' => $img->getAttribute('alt'),
                    'caption' => $caption->textContent,
                ],
            ],
        ];
    }
}

$value = (new Renderer)
    ->withNodes([
        ImageSet::class,
        // ...
    ])
    ->render($html)['content'];

We have to remove all child nodes once we're done with them, otherwise they’ll be processed by the renderer as well, resulting in duplicate content.

Using this extension on the example above results in this output:

-
  type: set
  attrs:
    values:
      type: image
      image: pizza.jpg
      alt: 'Pizza slice'
      caption: 'A tasty slice of Pizza!'
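
The data method above assumes every matched figure contains both an img and a figcaption. Real content is rarely that tidy, so here is a sketch of how the values could be extracted more defensively. It’s a standalone function using PHP’s DOM classes rather than part of the extension, and the figureToSetValues name is hypothetical:

```php
/**
 * Build the set values for a figure, or return null if it has no image.
 * (Hypothetical helper for illustration.)
 */
function figureToSetValues(DOMElement $figure): ?array
{
    $img = $figure->getElementsByTagName('img')->item(0);

    if (! $img instanceof DOMElement) {
        return null;
    }

    $caption = $figure->getElementsByTagName('figcaption')->item(0);

    return [
        'type' => 'image',
        'image' => $img->getAttribute('src'),
        'alt' => $img->getAttribute('alt'),
        // A missing figcaption becomes null instead of causing an error.
        'caption' => $caption !== null ? trim($caption->textContent) : null,
    ];
}

// libxml may warn about HTML5 tags such as <figure>, so collect errors quietly.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML(
    '<figure class="image"><img src="pizza.jpg" alt="Pizza slice" />'
    . '<figcaption>A tasty slice of Pizza!</figcaption></figure>'
    . '<figure class="image"><img src="plain.jpg" /></figure>'
);

$figures = $doc->getElementsByTagName('figure');
$withCaption = figureToSetValues($figures->item(0));
$withoutCaption = figureToSetValues($figures->item(1));
```

Inside the extension you could call something like this from the data method, before the child nodes are removed.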

Extensions are matched in the order they’re added, so it’s always best to put your custom extensions first in the list. Another thing to bear in mind is that sets are only valid at the root of the value.
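
If your figures can appear inside other elements, such as a blockquote, the resulting sets will be nested and therefore invalid. One possible fix is to move nested sets up to the root after conversion. This is only a sketch in plain PHP with hypothetical function names. It places each set directly after the root node it was found in, and it only drops a root node that is left completely empty, so deeper empty wrappers would need extra handling:

```php
/**
 * Remove set nodes from anywhere inside $node, collecting them in $sets.
 * (Hypothetical helper for illustration.)
 */
function extractSets(array $node, array &$sets): array
{
    if (empty($node['content'])) {
        return $node;
    }

    $kept = [];

    foreach ($node['content'] as $child) {
        if ($child['type'] === 'set') {
            $sets[] = $child;
        } else {
            $kept[] = extractSets($child, $sets);
        }
    }

    $node['content'] = $kept;

    return $node;
}

/**
 * Move nested sets to the root, directly after the root node that held them.
 * (Hypothetical helper for illustration.)
 */
function hoistSets(array $nodes): array
{
    $result = [];

    foreach ($nodes as $node) {
        $sets = [];
        $hadContent = ! empty($node['content']);
        $node = extractSets($node, $sets);

        // Skip a root node that contained nothing but sets.
        if (! ($hadContent && $node['content'] === [])) {
            $result[] = $node;
        }

        foreach ($sets as $set) {
            $result[] = $set;
        }
    }

    return $result;
}

// Sample value, hand-written for this example.
$value = [
    ['type' => 'blockquote', 'content' => [
        ['type' => 'paragraph', 'content' => [['type' => 'text', 'text' => 'Quote']]],
        ['type' => 'set', 'attrs' => ['values' => ['type' => 'image', 'image' => 'pizza.jpg']]],
    ]],
    ['type' => 'set', 'attrs' => ['values' => ['type' => 'image', 'image' => 'pasta.jpg']]],
];

$fixed = hoistSets($value);
```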

Skipping unwanted elements

If your source HTML contains things that you don’t want to import, a simple way to filter them out is with another custom extension.

Below is an example extension that will cause the renderer to skip empty paragraphs and paragraphs that only contain a non-breaking space [2]. Returning null from the data method will prevent it from producing a node of its own, but the child nodes will still be processed. To exclude the child nodes as well we need to remove them at the same time:

use HtmlToProseMirror\Nodes\Node;

class Cleaner extends Node
{
    public function matching()
    {
        // Match paragraphs with no text, or only a UTF-8 non-breaking space.
        return $this->DOMNode->nodeName === 'p'
            && ($this->DOMNode->textContent === "" || $this->DOMNode->textContent === "\xc2\xa0");
    }

    public function data()
    {
        // Remove the children so the renderer doesn't process them either.
        while ($this->DOMNode->hasChildNodes()) {
            $this->DOMNode->removeChild($this->DOMNode->firstChild);
        }

        // Returning null means no node is produced for this element.
        return null;
    }
}

$value = (new Renderer)
    ->withNodes([
        Cleaner::class,
        // ...
    ])
    ->render($html)['content'];

The matching method can easily be expanded to check for other unwanted elements.
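
As an illustration, here’s a sketch of a broader check written as a standalone function that a matching method could delegate to. The isUnwanted name and the choice of elements are assumptions for this example. One thing it guards against is that a paragraph containing only an image also has empty text content, so a text-only check would remove the image along with the paragraph:

```php
/**
 * Decide whether a DOM node should be skipped during import.
 * (Hypothetical helper for illustration.)
 */
function isUnwanted(DOMNode $node): bool
{
    // Never import scripts or inline styles.
    if (in_array($node->nodeName, ['script', 'style'], true)) {
        return true;
    }

    if ($node->nodeName === 'p') {
        // Treat UTF-8 non-breaking spaces like ordinary whitespace.
        $text = trim(str_replace("\xc2\xa0", ' ', $node->textContent));

        // Keep paragraphs that hold an image even though they have no text.
        $hasImage = $node instanceof DOMElement
            && $node->getElementsByTagName('img')->length > 0;

        return $text === '' && ! $hasImage;
    }

    return false;
}

$doc = new DOMDocument();
$doc->loadHTML(
    '<p>&nbsp; </p><p><img src="pizza.jpg"></p><p>Hello</p><script>track()</script>'
);

$paragraphs = $doc->getElementsByTagName('p');
$blank = isUnwanted($paragraphs->item(0));
$imageOnly = isUnwanted($paragraphs->item(1));
$withText = isUnwanted($paragraphs->item(2));
$script = isUnwanted($doc->getElementsByTagName('script')->item(0));
```

In the Cleaner extension the matching method would then simply return isUnwanted($this->DOMNode).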


  1. You can just copy the HTML over and let the control panel figure it out when the entry is next edited. However, converting during import not only ensures everything is in the right format from the start, but it also allows you to fix any issues and control exactly how it's converted. 

  2. This isn't really a proper node extension as it's not creating any nodes, but html-to-prosemirror doesn't support generic extensions so this will do as a workaround. 


9th January 2023