HtmlCleaner is an HTML parser written in Java that converts HTML to well-formed XML and allows you to use the DOM and XPath to locate particular elements. Jsoup is a different parser that parses HTML more as browsers do; it allows you to use the DOM and XPath, as well as a querySelector-like CSS selector filter, to find elements.

The following table shows how to convert common DOM / XPath functions from HtmlCleaner into Jsoup.

| HtmlCleaner | Jsoup | Notes |
| --- | --- | --- |
| new HtmlCleaner().clean | Jsoup.parse | |
| TagNode | Element | |
| TagNode.getName() | Element.tagName() | |
| TagNode.getAllChildren() | Element.childNodes() | children() returns only Elements, not text or comment nodes |
| TagNode.getAttributeByName(X) | Element.attr(X) | returns an empty string instead of null when the attribute is missing; use hasAttr to tell a missing attribute from an empty one |
| TagNode.getAttributes() | Element.attributes() | not a map, but can be iterated over |
| TagNode.getElementListByName("X", true) | Element.children().select("X") | need children() to avoid matching the current element |
| TagNode.getElementListByName("X", false) | Element.select(">X") | |
| TagNode.getElementsByName("X", true) | Element.children().select("X") | need children() to avoid matching the current element; use .size() instead of .length |
| TagNode.getElementsByName("X", false) | Element.select(">X") | use .size() instead of .length |
| TagNode.getChildTags() | Element.children() | |
| TagNode.getText().toString() | Element.wholeText() | text() collapses whitespace |
| TagNode.getParent() | Element.parent() | |
| TagNode.getParent().removeChild(X) | X.remove() | |
| TagNode.findElementByName("X", true) | Element.children().select("X").first() | |
| TagNode.findElementByName("X", false) | Element.selectFirst(">X") | |
| TagNode.findElementByAttValue("X", "Y", true, false) | Element.children().select("[X=Y]").first() | |
| TagNode.findElementByAttValue("X", "Y", false, false) | Element.selectFirst(">[X=Y]") | |
| TagNode.evaluateXPath(X) | Element.selectXpath(X, Node.class) | see the XPath section below |
| BaseToken | Node | |
| BaseToken.getName() | Node.nodeName() | if you’re operating on Nodes instead of Elements |
| ContentNode | TextNode | |
| ContentNode.getContent() | TextNode.getWholeText() | text() collapses whitespace |
| CommentNode | Comment | |
| CommentNode.getContent() | Comment.getData() | |
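
To make a few of these mappings concrete, here is a minimal sketch of the Jsoup side. The class name, helper methods, and sample markup are invented for illustration; only the Jsoup calls themselves (Jsoup.parse, children(), select, selectFirst, attr, hasAttr, tagName) come from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class MigrationSketch {
    // Sample markup invented for this illustration.
    static final String HTML =
        "<div id=\"outer\"><p>one</p><div><p>two</p></div><input disabled></div>";

    // Jsoup.parse(...) replaces new HtmlCleaner().clean(...); a TagNode becomes an Element.
    static Element outer() {
        return Jsoup.parse(HTML).getElementById("outer");
    }

    // TagNode.getElementListByName(tag, true): descendants only, not the element itself.
    static int descendants(String tag) {
        return outer().children().select(tag).size();
    }

    // TagNode.getElementListByName(tag, false): direct children only.
    static int directChildren(String tag) {
        return outer().select("> " + tag).size();
    }

    // A null check on TagNode.getAttributeByName(name) becomes a hasAttr(name) check.
    static boolean inputHasAttr(String name) {
        return outer().selectFirst("> input").hasAttr(name);
    }

    public static void main(String[] args) {
        System.out.println(outer().tagName());        // div
        System.out.println(descendants("p"));         // 2
        System.out.println(descendants("div"));       // 1 (the nested div; outer itself is skipped)
        System.out.println(directChildren("p"));      // 1
        // attr() returns "" for both an empty and a missing attribute...
        System.out.println(outer().selectFirst("> input").attr("checked").isEmpty()); // true
        // ...so hasAttr() is what tells them apart.
        System.out.println(inputHasAttr("disabled")); // true
        System.out.println(inputHasAttr("checked"));  // false
    }
}
```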

XPath

HtmlCleaner accepts some invalid XPath queries. For example, //label/input[@type='checkbox']@checked is treated the same as //label/input[@type='checkbox']/@checked (the valid form, with a slash between ] and the last @). In general, it looks like the query is split at non-alphabetic characters, so adjacent punctuation characters don’t need to be separated by a slash. Jsoup is stricter, so if XPath queries are exposed to the user anywhere, user-defined functions that relied on this leniency may break.

Jsoup’s XPath implementation can only return Jsoup nodes (elements, text nodes, and so on), so queries that select attributes, like //@value, won’t work. You can get around this by selecting the element and then calling .attr("value"), but it’s less convenient.
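
One way to emulate //@value is sketched below: select every element that carries the attribute, then read the attribute from each. The class name, method name, and sample markup are made up for this example; selectXpath and eachAttr are Jsoup methods.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class XpathAttrSketch {
    // Emulates //@value: match the elements that have the attribute, then read it.
    static List<String> values(String html) {
        Document doc = Jsoup.parse(html);
        return doc.selectXpath("//*[@value]").eachAttr("value");
    }

    public static void main(String[] args) {
        // Sample markup invented for this illustration.
        String html = "<form><input value=\"a\"><input value=\"b\"><input></form>";
        System.out.println(values(html)); // [a, b]
    }
}
```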

Jsoup also seems to run faster with its CSS selectors than with its XPath selector, but either way it is slower than HtmlCleaner’s XPath.
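
If you want to try the CSS route, many simple XPath queries translate directly. The sketch below runs one query both ways and only compares the results; it makes no timing claims. The class name, method names, and sample markup are assumptions for illustration.

```java
import org.jsoup.Jsoup;

public class SelectorSketch {
    // XPath form of the query.
    static int viaXpath(String html) {
        return Jsoup.parse(html).selectXpath("//label/input[@type='checkbox']").size();
    }

    // Equivalent CSS selector: "label > input" is the child combinator, like label/input.
    static int viaCss(String html) {
        return Jsoup.parse(html).select("label > input[type=checkbox]").size();
    }

    public static void main(String[] args) {
        // Sample markup invented for this illustration.
        String html = "<label><input type=\"checkbox\" checked></label>"
            + "<label><input type=\"text\"></label>"
            + "<input type=\"checkbox\">";
        System.out.println(viaXpath(html)); // 1
        System.out.println(viaCss(html));   // 1
    }
}
```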

Attribute escaping

HtmlCleaner prefers to escape characters as entities where possible (e.g. an apostrophe is written as an entity such as &apos;). Jsoup prefers not to escape characters where it isn’t necessary (e.g. an apostrophe stays as '). This means that if you’re looking for a particular string (e.g. with contains or a regex), you may need to change the search.
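
One way to make such searches less sensitive to escaping is to match against the decoded attribute value instead of the serialized markup. The sketch below assumes that approach; the class name, method name, and sample markup are invented for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class EscapingSketch {
    // Sample markup invented for this illustration; the apostrophe arrives as an entity.
    static final String HTML = "<a title=\"it&#39;s\">link</a>";

    // attr() returns the decoded value, whatever escaping the source used.
    static String decodedTitle() {
        Element link = Jsoup.parse(HTML).selectFirst("a");
        return link.attr("title");
    }

    public static void main(String[] args) {
        System.out.println(decodedTitle());                  // it's
        System.out.println(decodedTitle().contains("it's")); // true
        // Print the serialized form to see how Jsoup writes the attribute back out.
        System.out.println(Jsoup.parse(HTML).selectFirst("a").outerHtml());
    }
}
```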