htmlutil

package
v0.0.11
Published: Feb 27, 2026 License: MIT Imports: 6 Imported by: 0

Documentation

Overview

Package htmlutil provides HTML form and field extraction utilities.

Constants

This section is empty.

Variables

This section is empty.

Functions

func FindLabel

func FindLabel(form *goquery.Selection, elem *goquery.Selection) *goquery.Selection

FindLabel finds the <label> element associated with a form field. It looks for a label whose for attribute matches the field's id (label[for=id]) or for an ancestor <label>.

func GetAllFormText

func GetAllFormText(form *goquery.Selection) string

GetAllFormText returns all text content inside the form.

func GetBodyText added in v0.0.3

func GetBodyText(doc *goquery.Document, maxLen int) string

GetBodyText returns visible text from the body, truncated for feature extraction.
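
The truncation step can be pictured with the sketch below. It is not the package's implementation: truncateText is a hypothetical helper, and both collapsing whitespace and measuring maxLen in runes rather than bytes are assumptions, since the documentation does not say how the limit is counted.

```go
package main

import (
	"fmt"
	"strings"
)

// truncateText is a hypothetical helper, not part of htmlutil.
// It collapses runs of whitespace into single spaces and keeps at most
// maxLen runes (assumption: the limit is measured in runes, not bytes).
func truncateText(s string, maxLen int) string {
	s = strings.Join(strings.Fields(s), " ")
	if maxLen < 0 {
		maxLen = 0
	}
	runes := []rune(s)
	if len(runes) > maxLen {
		runes = runes[:maxLen]
	}
	return string(runes)
}

func main() {
	fmt.Println(truncateText("  Page   not\n found  ", 8)) // prints "Page not"
}
```

Cutting on rune boundaries keeps multi-byte characters intact, which matters if the truncated text is later tokenized for feature extraction.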

func GetErrorIndicators added in v0.0.3

func GetErrorIndicators(doc *goquery.Document) map[string]any

GetErrorIndicators returns features for detecting error/soft-404/special pages.

func GetFieldsToAnnotate

func GetFieldsToAnnotate(form *goquery.Selection) []*goquery.Selection

GetFieldsToAnnotate returns visible fields with a non-empty name attribute.

func GetFormAction

func GetFormAction(form *goquery.Selection) string

GetFormAction returns the form's action attribute.

func GetFormCSS

func GetFormCSS(form *goquery.Selection) string

GetFormCSS returns the form's class and id attributes.

func GetFormMethod

func GetFormMethod(form *goquery.Selection) string

GetFormMethod returns the form's method attribute, lowercased.

func GetForms

func GetForms(doc *goquery.Document) []*goquery.Selection

GetForms returns all <form> elements in the document.

func GetH1Text added in v0.0.3

func GetH1Text(doc *goquery.Document) string

GetH1Text returns concatenated text of all <h1> elements.

func GetHeadings added in v0.0.3

func GetHeadings(doc *goquery.Document) string

GetHeadings returns concatenated text of all h1-h6 elements.

func GetInputCSS

func GetInputCSS(form *goquery.Selection) string

GetInputCSS returns CSS classes and IDs of non-hidden input elements.

func GetInputCount

func GetInputCount(form *goquery.Selection) int

GetInputCount returns the number of named input elements (matching lxml's form.inputs.keys()).

func GetInputNames

func GetInputNames(form *goquery.Selection) string

GetInputNames returns names of all non-hidden <input> elements, cleaned up.

func GetInputTitles

func GetInputTitles(form *goquery.Selection) string

GetInputTitles returns title attributes of non-hidden input elements.

func GetLabelText

func GetLabelText(form *goquery.Selection) string

GetLabelText returns text of all <label> elements in the form.

func GetLinksText

func GetLinksText(form *goquery.Selection) string

GetLinksText returns text of all links inside the form.

func GetMetaDescription added in v0.0.3

func GetMetaDescription(doc *goquery.Document) string

GetMetaDescription returns the content of <meta name="description">.

func GetMetaKeywords added in v0.0.3

func GetMetaKeywords(doc *goquery.Document) string

GetMetaKeywords returns the content of <meta name="keywords">.

func GetMetaRobots added in v0.0.3

func GetMetaRobots(doc *goquery.Document) string

GetMetaRobots returns the content of <meta name="robots">.

func GetNavText added in v0.0.3

func GetNavText(doc *goquery.Document) string

GetNavText returns concatenated text of all <nav> elements.

func GetPageCSS added in v0.0.3

func GetPageCSS(doc *goquery.Document) string

GetPageCSS returns class and id attributes from <body> and <main> elements.

func GetPageLinkTexts added in v0.0.3

func GetPageLinkTexts(doc *goquery.Document) string

GetPageLinkTexts returns concatenated text of all <a> elements.

func GetPageStructure added in v0.0.3

func GetPageStructure(doc *goquery.Document) map[string]any

GetPageStructure returns structural boolean features and counts about the page.
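
Because the result is a map[string]any, callers need type assertions to read values back. The sketch below shows one defensive way to do that. The key names "has_nav" and "form_count" are invented for illustration and are not documented keys, and boolFeature and countFeature are hypothetical helpers rather than part of htmlutil.

```go
package main

import "fmt"

// boolFeature reads a boolean feature. It returns false when the key is
// missing or holds a value of another type.
func boolFeature(m map[string]any, key string) bool {
	v, ok := m[key].(bool)
	return ok && v
}

// countFeature reads a numeric feature as an int. It accepts the numeric
// types a count is most likely to be stored as and returns 0 otherwise.
func countFeature(m map[string]any, key string) int {
	switch v := m[key].(type) {
	case int:
		return v
	case int64:
		return int(v)
	case float64:
		return int(v)
	default:
		return 0
	}
}

func main() {
	// Stand-in for a returned feature map; the keys are hypothetical.
	features := map[string]any{"has_nav": true, "form_count": 2}
	fmt.Println(boolFeature(features, "has_nav"), countFeature(features, "form_count")) // prints "true 2"
}
```

The same pattern would apply to the map returned by GetErrorIndicators, which has the same type.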

func GetPageTitle added in v0.0.3

func GetPageTitle(doc *goquery.Document) string

GetPageTitle returns the <title> text content.

func GetSubmitTexts

func GetSubmitTexts(form *goquery.Selection) string

GetSubmitTexts returns the values of all <input type="submit"> elements.

func GetTypeCounts

func GetTypeCounts(form *goquery.Selection) map[string]int

GetTypeCounts returns counts of different input types in a form.
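
As a rough picture of the kind of tally this produces, the sketch below counts type attribute values collected from a form. It is an illustration under assumptions, not the package's code: countTypes and sortedKeys are hypothetical helpers, and lowercasing values and counting a missing type as "text" (the HTML default for <input>) are choices made here.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// countTypes is a hypothetical helper, not part of htmlutil. It tallies
// input type values, normalizing case and surrounding whitespace.
func countTypes(types []string) map[string]int {
	counts := make(map[string]int)
	for _, t := range types {
		t = strings.ToLower(strings.TrimSpace(t))
		if t == "" {
			t = "text" // assumption: treat a missing type as the HTML default
		}
		counts[t]++
	}
	return counts
}

// sortedKeys returns the map keys in ascending order so that output is
// deterministic; Go map iteration order is not.
func sortedKeys(counts map[string]int) []string {
	keys := make([]string, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	counts := countTypes([]string{"text", "PASSWORD", "", "submit"})
	for _, k := range sortedKeys(counts) {
		fmt.Printf("%s=%d\n", k, counts[k])
	}
	// prints:
	// password=1
	// submit=1
	// text=2
}
```

Sorting the keys before use is worth doing whenever a map[string]int is turned into a feature vector or log line, since iteration order varies between runs.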

func GetVisibleFields

func GetVisibleFields(form *goquery.Selection) []*goquery.Selection

GetVisibleFields returns visible form fields (textarea, select, button, non-hidden inputs).

func LoadHTML

func LoadHTML(r io.Reader) (*goquery.Document, error)

LoadHTML parses HTML read from r into a goquery Document.

func LoadHTMLString

func LoadHTMLString(htmlStr string) (*goquery.Document, error)

LoadHTMLString parses an HTML string into a goquery Document.

Types

type TextAround

type TextAround struct {
	Before map[*goquery.Selection]string
	After  map[*goquery.Selection]string
}

TextAround holds text before and after each element.

func GetTextAroundElems

func GetTextAroundElems(root *goquery.Selection, elems []*goquery.Selection) TextAround

GetTextAroundElems returns text before and after each specified element, matching lxml's text/tail walk behavior from Formasaurus.
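
Both maps in TextAround are keyed by pointer, and Go compares pointer keys by address. A lookup therefore only succeeds with the very same pointer that was used as the key, presumably one of the values passed in elems. The sketch below demonstrates this with a stand-in type so that it needs only the standard library: elem, textAround, and lookupBefore are hypothetical and merely mirror the shape of the real struct.

```go
package main

import "fmt"

// elem stands in for goquery.Selection (hypothetical, for illustration).
type elem struct{ name string }

// textAround mirrors the shape of TextAround, keyed by *elem.
type textAround struct {
	Before map[*elem]string
	After  map[*elem]string
}

// lookupBefore returns the text recorded before e and whether e is a key.
func lookupBefore(ta textAround, e *elem) (string, bool) {
	s, ok := ta.Before[e]
	return s, ok
}

func main() {
	user := &elem{name: "user"}
	ta := textAround{
		Before: map[*elem]string{user: "Username:"},
		After:  map[*elem]string{user: "(required)"},
	}

	s, ok := lookupBefore(ta, user)
	fmt.Println(s, ok) // prints "Username: true"

	// A distinct pointer with identical contents is a different key.
	_, ok = lookupBefore(ta, &elem{name: "user"})
	fmt.Println(ok) // prints "false"
}
```

In practice this suggests keeping the original slice of selections around and indexing the maps with its elements, rather than re-querying the document for equivalent selections.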
