Importing HTML into Statamic’s Bard fieldtype
This article covers importing HTML in Statamic 3.3 and lower with the html-to-prosemirror package. For Statamic 3.4 and higher, check out the updated version that uses tiptap-php.
Statamic’s Bard fieldtype stores values as ProseMirror documents, so if you’re importing existing HTML with a PHP script it makes sense to convert it to the nodes and marks Bard expects [1]. This guide outlines how to convert HTML to ProseMirror, covers a couple of gotchas, and finally shows how to handle sets.
Converting HTML to ProseMirror
Getting started is very easy. Statamic includes the html-to-prosemirror package, and all we need to do is create a new renderer instance and pass the HTML to it:
use HtmlToProseMirror\Renderer;

$value = (new Renderer)->render($html)['content'];
The renderer will return a full document node, but Bard only needs the content.
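For reference, the rendered document has roughly the shape below. This is a hand-written sketch rather than captured renderer output, and the sample paragraph is made up. It only shows that Bard wants the content array, not the surrounding doc node.

```php
<?php

// Hand-written illustration of a ProseMirror document as a PHP array.
// The sample paragraph is invented; real output depends on your HTML.
$document = [
    'type' => 'doc',
    'content' => [
        [
            'type' => 'paragraph',
            'content' => [
                ['type' => 'text', 'text' => 'Hello world'],
            ],
        ],
    ],
];

// Bard stores only the list of root nodes.
$value = $document['content'];
```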
The html-to-prosemirror renderer expects UTF-8 encoded data. If your HTML is in a different encoding you’ll need to convert it first:
$html = mb_convert_encoding($html, 'utf-8', [source-encoding]);
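If you’re not sure whether every record needs converting, a small guard can help. The sketch below is one possible approach, not part of the package: the toUtf8 helper name and the ISO-8859-1 sample are assumptions for illustration. Note that mb_check_encoding only tells you the bytes are valid UTF-8, so text in another encoding that happens to pass the check will be left untouched.

```php
<?php

/**
 * Convert HTML to UTF-8 unless it is already valid UTF-8.
 * $sourceEncoding is whatever your legacy system used, e.g. 'ISO-8859-1'.
 */
function toUtf8(string $html, string $sourceEncoding): string
{
    // Valid UTF-8 is returned untouched to avoid double encoding.
    if (mb_check_encoding($html, 'UTF-8')) {
        return $html;
    }

    return mb_convert_encoding($html, 'UTF-8', $sourceEncoding);
}

// "é" stored as a single ISO-8859-1 byte (0xE9).
$legacy = "<p>Caf\xE9</p>";
$html = toUtf8($legacy, 'ISO-8859-1');
```

Running the helper a second time on its own output returns the string unchanged, so it is safe to apply to a mixed data set.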
Potential problems
Fixing invalid node data
Unfortunately things aren’t always that simple. ProseMirror uses a schema that defines which nodes and marks are valid where, but html-to-prosemirror doesn't actually enforce it. Each element is simply converted to the nearest ProseMirror equivalent, maintaining the original hierarchy.
One common issue is text nodes at the root of the returned value, which you’ll find can’t be edited through the control panel. This can happen if your HTML contains text that’s not wrapped in a paragraph or other block element.
We can fix these nodes by wrapping them in paragraphs after conversion:
$value = collect($value)
    ->map(function ($node) {
        return $node['type'] === 'text' ? [
            'type' => 'paragraph',
            'content' => [$node],
        ] : $node;
    })
    ->filter(function ($node) {
        return $node['type'] !== 'hard_break';
    })
    ->values()
    ->all();
Root hard break (<br>) nodes will cause the same problem, but as we’re introducing paragraphs those can simply be filtered out (remember this is only affecting the root nodes).
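To make the effect concrete, here is the same logic written with plain arrays and run against a small sample. The fixRootNodes name and the input value are hypothetical; the input is what I’d expect for loose text followed by a <br> and a paragraph, not captured output.

```php
<?php

/**
 * Wrap root-level text nodes in paragraphs and drop root-level hard breaks.
 * Mirrors the collection pipeline above using a plain loop.
 */
function fixRootNodes(array $nodes): array
{
    $fixed = [];

    foreach ($nodes as $node) {
        if ($node['type'] === 'hard_break') {
            continue; // the new paragraphs already separate the content
        }

        $fixed[] = $node['type'] === 'text'
            ? ['type' => 'paragraph', 'content' => [$node]]
            : $node;
    }

    return $fixed;
}

// Hypothetical converted value for: Loose text<br><p>Wrapped text</p>
$value = fixRootNodes([
    ['type' => 'text', 'text' => 'Loose text'],
    ['type' => 'hard_break'],
    ['type' => 'paragraph', 'content' => [['type' => 'text', 'text' => 'Wrapped text']]],
]);
```

The result contains two paragraphs: the loose text is now wrapped, the hard break is gone, and the existing paragraph is untouched.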
Excluding disabled node types
If your HTML contains elements that you haven’t enabled in the Bard field you’ll run into invalid data errors in the control panel. Excluding these nodes and marks is as simple as not including the relevant extensions when initialising the renderer:
use HtmlToProseMirror\Nodes;
use HtmlToProseMirror\Marks;

$value = (new Renderer)
    ->withNodes([
        Nodes\Blockquote::class,
        Nodes\BulletList::class,
        Nodes\CodeBlock::class,
        Nodes\CodeBlockWrapper::class,
        Nodes\HardBreak::class,
        Nodes\Heading::class,
        Nodes\HorizontalRule::class,
        Nodes\Image::class,
        Nodes\ListItem::class,
        Nodes\OrderedList::class,
        Nodes\Paragraph::class,
        // Nodes\Table::class,
        // Nodes\TableCell::class,
        // Nodes\TableHeader::class,
        // Nodes\TableRow::class,
        // Nodes\TableWrapper::class,
        Nodes\Text::class,
        Nodes\User::class,
    ])
    ->withMarks([
        Marks\Bold::class,
        Marks\Code::class,
        Marks\Italic::class,
        // Marks\Link::class,
        Marks\Strike::class,
        Marks\Subscript::class,
        Marks\Superscript::class,
        Marks\Underline::class,
    ])
    ->render($html)['content'];
When html-to-prosemirror can't match an element it will skip it, but it will still process its children. Therefore if you exclude the link extension your value will still include the inner link text.
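When deciding what to exclude, it can help to list the node and mark types a converted value actually contains and compare them with the buttons enabled on the field. The helper below is a sketch of my own, not something the package provides. It assumes the usual ProseMirror JSON layout, where child nodes live under content and marks under marks.

```php
<?php

/**
 * Recursively collect the node and mark types used in a converted value.
 * Returns ['nodes' => [...], 'marks' => [...]] with unique, sorted names.
 */
function collectTypes(array $nodes): array
{
    $found = ['nodes' => [], 'marks' => []];

    foreach ($nodes as $node) {
        $found['nodes'][] = $node['type'];

        foreach ($node['marks'] ?? [] as $mark) {
            $found['marks'][] = $mark['type'];
        }

        // Descend into child nodes, if any.
        $children = collectTypes($node['content'] ?? []);
        $found['nodes'] = array_merge($found['nodes'], $children['nodes']);
        $found['marks'] = array_merge($found['marks'], $children['marks']);
    }

    $found['nodes'] = array_values(array_unique($found['nodes']));
    $found['marks'] = array_values(array_unique($found['marks']));
    sort($found['nodes']);
    sort($found['marks']);

    return $found;
}

// Hypothetical value; in practice pass the renderer output.
$types = collectTypes([
    ['type' => 'paragraph', 'content' => [
        ['type' => 'text', 'text' => 'Read '],
        ['type' => 'text', 'text' => 'more', 'marks' => [
            ['type' => 'link', 'attrs' => ['href' => '/more']],
        ]],
    ]],
]);
```

Anything reported here that isn’t enabled in your Bard field is a candidate for leaving out of the withNodes or withMarks lists.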
Converting elements to sets
The built-in html-to-prosemirror extensions do a great job of converting common HTML elements to standard nodes and marks, but what about Statamic sets? Say your HTML contains images wrapped in figure elements, and you want to convert those to sets instead of image nodes.
Something like this:
<figure class="image">
    <img src="pizza.jpg" alt="Pizza slice" />
    <figcaption>A tasty slice of Pizza!</figcaption>
</figure>
To handle that you’ll need a custom extension. Extensions receive a DOMNode object, and must implement a matching method and a data method. The matching method should check whether the DOMNode matches the elements you’re looking for, and the data method should return the equivalent ProseMirror node data. The same rules apply to marks.
For the HTML above the following extension will do what we need:
use HtmlToProseMirror\Nodes\Node;

class ImageSet extends Node
{
    public function matching()
    {
        // Only match <figure class="image"> elements.
        return $this->DOMNode->nodeName === 'figure'
            && $this->DOMNode->getAttribute('class') === 'image';
    }

    public function data()
    {
        $img = $this->DOMNode->getElementsByTagName('img')->item(0);
        $caption = $this->DOMNode->getElementsByTagName('figcaption')->item(0);

        // Remove the children so the renderer doesn't process them again.
        while ($this->DOMNode->hasChildNodes()) {
            $this->DOMNode->removeChild($this->DOMNode->firstChild);
        }

        return [
            'type' => 'set',
            'attrs' => [
                'values' => [
                    'type' => 'image',
                    'image' => $img->getAttribute('src'),
                    'alt' => $img->getAttribute('alt'),
                    // A figure without a figcaption gets a null caption.
                    'caption' => $caption ? $caption->textContent : null,
                ],
            ],
        ];
    }
}
$value = (new Renderer)
    ->withNodes([
        ImageSet::class,
        // ...
    ])
    ->render($html)['content'];
We have to remove all child nodes once we're done with them, otherwise they’ll be processed by the renderer as well, resulting in duplicate content.
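The extract-then-empty step can be tried on its own with PHP’s DOM extension, without the renderer. The snippet below is a standalone sketch that parses the example figure, reads the values, and then removes the children. The variable names are mine, and it stands in for what happens inside the data method rather than reproducing the package’s internals.

```php
<?php

// HTML5 tags such as <figure> trigger libxml warnings, so collect them quietly.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML(
    '<figure class="image"><img src="pizza.jpg" alt="Pizza slice" />'
    . '<figcaption>A tasty slice of Pizza!</figcaption></figure>'
);
libxml_clear_errors();

$figure = $dom->getElementsByTagName('figure')->item(0);

// Read everything we need before touching the children.
$img = $figure->getElementsByTagName('img')->item(0);
$caption = $figure->getElementsByTagName('figcaption')->item(0);
$values = [
    'type' => 'image',
    'image' => $img->getAttribute('src'),
    'alt' => $img->getAttribute('alt'),
    'caption' => $caption->textContent,
];

// Empty the element so nothing is left for a renderer to walk into.
while ($figure->hasChildNodes()) {
    $figure->removeChild($figure->firstChild);
}
```

The order matters: the attribute and text values are copied into $values first, so they survive once the child elements are detached.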
Using this extension on the example above results in this output:
-
  type: set
  attrs:
    values:
      type: image
      image: pizza.jpg
      alt: 'Pizza slice'
      caption: 'A tasty slice of Pizza!'
Extensions are matched in the order they’re added, so it’s always best to put your custom extensions first in the list. Another thing to bear in mind is that sets are only valid at the root of the value.
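Because the renderer keeps the original hierarchy, a figure nested inside another element (a blockquote, say) would produce a set below the root. A quick check after conversion can flag those entries for manual attention. The countNestedSets helper below is a hypothetical sketch, assuming child nodes are stored under content.

```php
<?php

/**
 * Count set nodes that appear anywhere below the root level.
 */
function countNestedSets(array $nodes, bool $isRoot = true): int
{
    $count = 0;

    foreach ($nodes as $node) {
        if (! $isRoot && $node['type'] === 'set') {
            $count++;
        }

        // Everything inside a node's content is below the root.
        $count += countNestedSets($node['content'] ?? [], false);
    }

    return $count;
}

// Hypothetical value: one valid root set, one set nested in a blockquote.
$nested = countNestedSets([
    ['type' => 'set', 'attrs' => ['values' => ['type' => 'image']]],
    ['type' => 'blockquote', 'content' => [
        ['type' => 'set', 'attrs' => ['values' => ['type' => 'image']]],
    ]],
]);
```

A non-zero count means the value needs restructuring before the sets can be edited in the control panel.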
[1] You can just copy the HTML over and let the control panel figure it out when the entry is next edited. However, converting during import not only ensures everything is in the right format from the start, it also allows you to fix any issues and control exactly how it's converted.