crawler

package
v0.0.0-...-82d9017
Published: Feb 20, 2023 License: BSD-3-Clause Imports: 17 Imported by: 0

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func Crawl

func Crawl(mainContext context.Context, config Config) error

func Process

func Process(db *crawldatabase.Database[Page], processList ...interface{ Process(*Page) }) error

Process sequentially calls each processor in processList with every Page from the database that has HTML content.

Types

type Config

type Config struct {
	// Opens the database; returns the previously stored URLs and the database.
	DBopener func(logger *slog.Logger, base string, logStatistics bool) ([]*url.URL, *crawldatabase.Database[Page], error)
	// The base path of the database, passed to DBopener.
	DBbase string

	// Root URLs at which crawling begins.
	Input []*url.URL

	// Filter by URL or by the page (for example by language).
	// Return true to reject the page.
	// The files "/robots.txt" and "/favicon.ico" are not filtered.
	FilterURL  []func(*url.URL) bool
	FilterPage []func(*htmlnode.Root) bool

	// The maximum size of the HTML page.
	// Google's limit is 15 MB: https://developers.google.com/search/docs/crawling-indexing/googlebot#how-googlebot-accesses-your-site
	MaxLength int64

	// Maximum number of crawl goroutines.
	MaxGo int

	// The minimum and maximum crawl delay.
	// The value actually used is determined by the robots.txt.
	// Must satisfy: MinCrawlDelay < MaxCrawlDelay.
	MinCrawlDelay, MaxCrawlDelay time.Duration

	// Logger used by the crawler and the database.
	Logger *slog.Logger

	// Used to fetch all HTTP resources.
	RoundTripper http.RoundTripper
}
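A sketch of filling Config and starting a crawl. The module's import path is truncated on this page, so the crawler import is left as a placeholder, and DBopener is omitted since the crawldatabase constructor is not documented here; all values shown are illustrative, not recommendations.

```go
// Sketch only: import path for the crawler package is truncated on
// this page; DBopener must also be set from the crawldatabase package.
root, _ := url.Parse("https://example.com/")
cfg := crawler.Config{
	DBbase: "./db",
	Input:  []*url.URL{root},
	FilterURL: []func(*url.URL) bool{
		func(u *url.URL) bool {
			// Return true to reject: stay on the start host.
			return u.Host != root.Host
		},
	},
	MaxLength:     15_000_000, // Google's documented 15 MB limit
	MaxGo:         4,
	MinCrawlDelay: time.Second,
	MaxCrawlDelay: 10 * time.Second,
	Logger:        slog.Default(),
	RoundTripper:  http.DefaultTransport,
}
if err := crawler.Crawl(context.Background(), cfg); err != nil {
	log.Fatal(err)
}
```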

type Page

type Page struct {
	URL url.URL

	// Content: exactly one of the following fields is set.
	Html   *htmlnode.Root
	Robots *robotstxt.File
}

func (*Page) GetURLs

func (page *Page) GetURLs() map[keys.Key]*url.URL

GetURLs returns all URLs found in the page.

type ProcessFunc

type ProcessFunc func(page *Page)

A function that can be used as a processor by the Process function.

func (ProcessFunc) Process

func (process ProcessFunc) Process(page *Page)

Directories

Path Synopsis
Parse robots.txt and match URLs against its rules.
