Generate Simplified DOM
Type: generate_simplified_dom
The DOM of a web page contains a lot of data that can be discarded if you are only interested in the page's elements or want to export the page to an LLM.
The generate_simplified_dom output format processes the HTML in the following way:
- Removes all links in the head
- Removes all script nodes and links to scripts
- Removes all style nodes
- Removes style attributes from all elements
- Removes all links to stylesheets
- Removes all noscript elements outside of the body
- Finds all hrefs with query strings and removes the query strings
- Removes all class attributes
- Keeps important meta tags and removes all others
- Removes all alternate links
- Removes all SVG paths
- Removes empty text nodes and excessive spacing
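To make the processing steps above concrete, here is a minimal local sketch of the same kind of clean-up, written with Python's standard library only. It is an illustration, not the service's implementation. The names simplify_dom, Simplifier and KEPT_META are hypothetical. The list of "important" meta tags is an assumption, since this page does not say which ones are kept, and dropping every link element is a simplification that covers head links, stylesheet links and alternate links in one rule.

```python
import re
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlsplit, urlunsplit

# Assumption: this page does not define which meta tags are "important".
KEPT_META = {"description", "keywords", "viewport"}
# Void elements never get a closing tag in the output.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


def strip_query(url: str) -> str:
    """Return the URL without its query string (fragment is kept)."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", parts.fragment))


class Simplifier(HTMLParser):
    """Re-emits HTML while dropping the nodes and attributes listed above."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_tag = None   # tag name of the element currently being dropped
        self.skip_depth = 0    # nesting depth inside that dropped element
        self.in_body = False

    def handle_starttag(self, tag, attrs):
        if self.skip_tag:
            if tag == self.skip_tag:
                self.skip_depth += 1
            return
        if tag == "body":
            self.in_body = True
        # Drop script, style and SVG path nodes, and noscript outside the body.
        if tag in ("script", "style", "path") or (tag == "noscript" and not self.in_body):
            self.skip_tag, self.skip_depth = tag, 1
            return
        if tag == "link":  # simplification: drop every <link> element
            return
        if tag == "meta":
            meta = dict(attrs)
            if "charset" not in meta and (meta.get("name") or "").lower() not in KEPT_META:
                return
        kept = []
        for name, value in attrs:
            if name in ("class", "style"):
                continue
            if name == "href" and value:
                value = strip_query(value)
            kept.append(name if value is None else f'{name}="{escape(value)}"')
        self.out.append("<" + " ".join([tag] + kept) + ">")

    def handle_endtag(self, tag):
        if self.skip_tag:
            if tag == self.skip_tag:
                self.skip_depth -= 1
                if self.skip_depth == 0:
                    self.skip_tag = None
            return
        if tag == "body":
            self.in_body = False
        if tag not in VOID:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_tag:
            return
        text = re.sub(r"\s+", " ", data)  # collapse excessive spacing
        if text.strip():                  # drop empty text nodes
            self.out.append(escape(text, quote=False))


def simplify_dom(html: str) -> str:
    parser = Simplifier()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    page = """<html><head>
<meta charset="utf-8"><meta name="generator" content="x">
<link rel="stylesheet" href="a.css"><script src="a.js"></script>
</head><body>
<p class="lead" style="color:red">Hi <a href="/docs?utm=1#top">docs</a></p>
</body></html>"""
    print(simplify_dom(page))
```

The sketch streams through the markup once instead of building a tree, which keeps it short but also means comments and the doctype are silently dropped. A real implementation would likely work on the rendered DOM rather than on the raw HTML source.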
Parameters
See universal parameters.
Usage
The following JSON captures the DOM of the page and simplifies it.
We are actively working to improve this output format and to make the process more configurable. Let us know if there's something you think we can improve.
Example Output