Importing HTML into Statamic’s Bard fieldtype
Statamic’s Bard fieldtype stores values as ProseMirror documents, so if you’re importing existing HTML with a PHP script it makes sense to convert it to the nodes and marks Bard expects [1]. This guide outlines how to convert HTML to ProseMirror, a couple of gotchas, and finally how to handle sets and other custom content.
Converting HTML to ProseMirror
Statamic 3.4 will introduce Bard 2, which uses Tiptap 2 and the new tiptap-php package. The examples below will not work with tiptap-php. I plan to publish an updated version of this guide once Statamic 3.4 is released.
Getting started is very easy. Statamic includes the html-to-prosemirror package, and all we need to do is create a new renderer instance and pass the HTML to it:

    use HtmlToProseMirror\Renderer;

    $value = (new Renderer)->render($html)['content'];
The renderer will return a full document node, but Bard only needs the content.
The html-to-prosemirror renderer expects UTF-8 encoded data. If your HTML is in a different encoding you’ll need to convert it first:

    // $sourceEncoding is the encoding of your source HTML, e.g. 'ISO-8859-1'.
    $html = mb_convert_encoding($html, 'UTF-8', $sourceEncoding);
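If you are not sure whether every source document needs converting, a small guard can help. The sketch below is only an illustration: toUtf8 is a hypothetical helper name, and the Windows-1252 fallback is an assumption about the source data that you should replace with whatever your old system actually used.

```php
<?php

// Hypothetical helper: leave valid UTF-8 untouched, convert everything else.
// The fallback encoding is an assumption about the source data.
function toUtf8(string $html, string $fallback = 'Windows-1252'): string
{
    // mb_check_encoding() reports whether the bytes form valid UTF-8.
    if (mb_check_encoding($html, 'UTF-8')) {
        return $html;
    }

    return mb_convert_encoding($html, 'UTF-8', $fallback);
}

$html = toUtf8("<p>caf\xe9</p>");
```

Checking first avoids double-encoding content that was already stored as UTF-8.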
Potential problems
Fixing invalid node data
Unfortunately things aren’t always that simple. ProseMirror uses a schema that defines which nodes and marks are valid where, but html-to-prosemirror doesn't actually enforce it. Each element is simply converted to the nearest ProseMirror equivalent, maintaining the original hierarchy.
One common issue is text nodes at the root of the returned value, which you’ll find can’t be edited through the control panel. This can happen if your HTML contains text that’s not wrapped in a paragraph or other block element.
We can fix these nodes by wrapping them in paragraphs after conversion:
    $value = collect($value)
        ->map(function ($node) {
            // Wrap root-level text nodes in a paragraph.
            return $node['type'] === 'text'
                ? ['type' => 'paragraph', 'content' => [$node]]
                : $node;
        })
        ->filter(function ($node) {
            // Drop root-level hard breaks.
            return $node['type'] !== 'hard_break';
        })
        ->values()
        ->all();
Root hard break (<br>) nodes will cause the same problem, but as we’re introducing paragraphs those can simply be filtered out (remember that this only affects the root nodes).
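To make the effect concrete, here is the same clean-up as a plain PHP function applied to a hand-written value. This is a sketch only: normaliseRootNodes is a hypothetical name, and the sample array is made up for illustration rather than taken from real renderer output.

```php
<?php

// Hypothetical helper mirroring the collection pipeline in plain PHP.
function normaliseRootNodes(array $nodes): array
{
    $result = [];

    foreach ($nodes as $node) {
        // Root-level hard breaks are dropped entirely.
        if ($node['type'] === 'hard_break') {
            continue;
        }

        // Root-level text nodes are wrapped in a paragraph of their own.
        $result[] = $node['type'] === 'text'
            ? ['type' => 'paragraph', 'content' => [$node]]
            : $node;
    }

    return $result;
}

// Hand-written sample value, not real renderer output.
$value = normaliseRootNodes([
    ['type' => 'text', 'text' => 'Loose text'],
    ['type' => 'hard_break'],
    ['type' => 'paragraph', 'content' => [['type' => 'text', 'text' => 'Wrapped text']]],
]);
```

After this runs, $value holds two paragraphs: the loose text wrapped in a new paragraph, followed by the paragraph that was already valid.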
Excluding disabled node types
If your HTML contains elements that you haven’t enabled in the Bard field you’ll run into invalid data errors in the control panel. Excluding these nodes and marks is as simple as not including the relevant extensions when initialising the renderer:
    use HtmlToProseMirror\Marks;
    use HtmlToProseMirror\Nodes;
    use HtmlToProseMirror\Renderer;

    $value = (new Renderer)
        ->withNodes([
            Nodes\Blockquote::class,
            Nodes\BulletList::class,
            Nodes\CodeBlock::class,
            Nodes\CodeBlockWrapper::class,
            Nodes\HardBreak::class,
            Nodes\Heading::class,
            Nodes\HorizontalRule::class,
            Nodes\Image::class,
            Nodes\ListItem::class,
            Nodes\OrderedList::class,
            Nodes\Paragraph::class,
            // Nodes\Table::class,
            // Nodes\TableCell::class,
            // Nodes\TableHeader::class,
            // Nodes\TableRow::class,
            // Nodes\TableWrapper::class,
            Nodes\Text::class,
            Nodes\User::class,
        ])
        ->withMarks([
            Marks\Bold::class,
            Marks\Code::class,
            Marks\Italic::class,
            // Marks\Link::class,
            Marks\Strike::class,
            Marks\Subscript::class,
            Marks\Superscript::class,
            Marks\Underline::class,
        ])
        ->render($html)['content'];
When html-to-prosemirror can't match an element it will skip it, but it will still process its children. Therefore, if you exclude the link extension your value will still include the inner link text.
Handling custom content
Converting elements to sets
The built-in html-to-prosemirror extensions do a great job of converting common HTML elements to standard nodes and marks, but what about Statamic sets? Say your HTML contains images wrapped in figure elements, and you want to convert those to sets instead of image nodes.
Something like this:
    <figure class="image">
        <img src="pizza.jpg" alt="Pizza slice" />
        <figcaption>A tasty slice of Pizza!</figcaption>
    </figure>
To handle that you’ll need a custom extension. Extensions receive a DOMNode object, and must implement a matching method and a data method. The matching method should check whether the DOMNode matches the elements you’re looking for, and the data method should return the equivalent ProseMirror node data. The same rules apply to marks.
For the HTML above the following extension will do what we need:
    use HtmlToProseMirror\Nodes\Node;

    class ImageSet extends Node
    {
        public function matching()
        {
            return $this->DOMNode->nodeName === 'figure'
                && $this->DOMNode->getAttribute('class') === 'image';
        }

        public function data()
        {
            $img = $this->DOMNode->getElementsByTagName('img')->item(0);
            $caption = $this->DOMNode->getElementsByTagName('figcaption')->item(0);

            // Remove the children so the renderer doesn't process them again.
            while ($this->DOMNode->hasChildNodes()) {
                $this->DOMNode->removeChild($this->DOMNode->firstChild);
            }

            return [
                'type' => 'set',
                'attrs' => [
                    'values' => [
                        'type' => 'image',
                        'image' => $img->getAttribute('src'),
                        'alt' => $img->getAttribute('alt'),
                        // A figure may not have a caption.
                        'caption' => $caption ? $caption->textContent : null,
                    ],
                ],
            ];
        }
    }

    $value = (new Renderer)
        ->withNodes([
            ImageSet::class,
            // ...
        ])
        ->render($html)['content'];
We have to remove all child nodes once we're done with them, otherwise they’ll be processed by the renderer as well, resulting in duplicate content.
Using this extension on the example above results in this output:
    -
      type: set
      attrs:
        values:
          type: image
          image: pizza.jpg
          alt: 'Pizza slice'
          caption: 'A tasty slice of Pizza!'
Extensions are matched in the order they’re added, so it’s always best to put your custom extensions first in the list. Another thing to bear in mind is that sets are only valid at the root of the value.
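Because the renderer keeps the original hierarchy, a figure that sits inside another element in the source HTML would produce a set nested inside that element's node. One possible way to deal with that is a post-processing pass that moves such sets up to the root. The sketch below is an assumption-laden illustration, not part of Statamic or html-to-prosemirror: hoistSets is a hypothetical helper, it only looks one level deep, and it places hoisted sets after their former parent.

```php
<?php

// Hypothetical helper: move sets nested one level deep up to the root.
function hoistSets(array $nodes): array
{
    $result = [];

    foreach ($nodes as $node) {
        $kept = [];
        $sets = [];

        // Split the node's children into sets and everything else.
        foreach ($node['content'] ?? [] as $child) {
            if (($child['type'] ?? null) === 'set') {
                $sets[] = $child;
            } else {
                $kept[] = $child;
            }
        }

        // Nothing to hoist, so keep the node exactly as it was.
        if ($sets === []) {
            $result[] = $node;
            continue;
        }

        // Keep the parent only if it still has content of its own.
        if ($kept !== []) {
            $node['content'] = $kept;
            $result[] = $node;
        }

        // Assumption: hoisted sets are placed after their former parent.
        foreach ($sets as $set) {
            $result[] = $set;
        }
    }

    return $result;
}

// Hand-written sample value, not real renderer output.
$value = hoistSets([
    ['type' => 'paragraph', 'content' => [
        ['type' => 'text', 'text' => 'Before the image'],
        ['type' => 'set', 'attrs' => ['values' => ['type' => 'image', 'image' => 'pizza.jpg']]],
    ]],
]);
```

Here the paragraph keeps its text and the set becomes a sibling at the root. Sets nested more deeply, for example inside list items, would need a recursive version.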
Skipping unwanted elements
If your source HTML contains things that you don’t want to import, a simple way to filter them out is with another custom extension.
Below is an example extension that will cause the renderer to skip empty paragraphs and paragraphs that only contain a non-breaking space [2]. Returning null from the data method will prevent it from producing a node of its own, but the child nodes will still be processed. To exclude the child nodes as well we need to remove them at the same time:

    use HtmlToProseMirror\Nodes\Node;

    class Cleaner extends Node
    {
        public function matching()
        {
            // "\xc2\xa0" is a UTF-8 encoded non-breaking space.
            return $this->DOMNode->nodeName === 'p'
                && ($this->DOMNode->textContent === ""
                    || $this->DOMNode->textContent === "\xc2\xa0");
        }

        public function data()
        {
            // Remove the children so they aren't processed either.
            while ($this->DOMNode->hasChildNodes()) {
                $this->DOMNode->removeChild($this->DOMNode->firstChild);
            }

            return null;
        }
    }

    $value = (new Renderer)
        ->withNodes([
            Cleaner::class,
            // ...
        ])
        ->render($html)['content'];
The matching method can easily be expanded to check for other unwanted elements.
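As a rough idea of what an expanded check could look like, the sketch below puts the logic in a standalone function that a matching method could delegate to. Everything here is illustrative: isUnwanted is a hypothetical name, and the list of element names is an assumption about what you might want to drop. Note that textContent is also empty for a paragraph that contains only an image, so this version explicitly keeps those.

```php
<?php

// Hypothetical predicate that a Cleaner-style matching() method could call.
function isUnwanted(DOMNode $node): bool
{
    // Assumption: these elements should never be imported.
    if (in_array($node->nodeName, ['script', 'style', 'iframe'], true)) {
        return true;
    }

    if ($node->nodeName !== 'p' || ! $node instanceof DOMElement) {
        return false;
    }

    // textContent is empty for image-only paragraphs too, so keep those.
    if ($node->getElementsByTagName('img')->length > 0) {
        return false;
    }

    // Treat non-breaking spaces (UTF-8 bytes C2 A0) as ordinary whitespace.
    return trim(str_replace("\xc2\xa0", ' ', $node->textContent)) === '';
}
```

Inside the extension, the matching method would then simply return isUnwanted($this->DOMNode).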
[1] You can just copy the HTML over and let the control panel figure it out when the entry is next edited. However, converting during import not only ensures everything is in the right format from the start, but it also allows you to fix any issues and control exactly how it's converted.
[2] This isn't really a proper node extension as it's not creating any nodes, but html-to-prosemirror doesn't support generic extensions so this will do as a workaround.