httpsyet

README

A Crawler

Use Go Channels to Build a Crawler

@jorinvo aims to provide "an implementation that abstracts the coordination using channels" in his original post.

Having looked at it a while ago and having read the explanations and justifications, I thought it might be a nice use case for pipe/s.

A refactoring

Inspired by Jorin's "qvl.io/httpsyet/httpsyet", which I stumbled upon (via GoLangWeekly).

So, as a "real life" example, the original got refactored with the focus solely on the aspects of concurrency; intentionally and respectfully, all code related to the actual crawling was left as untouched as possible.

Please feel free to compare the refactored crawler.go with the untouched original crawler.go.ori.304 (304 LoC), or see crawler.go.mini.224, where the parts that became obsolete are not commented out but entirely removed.

Overview

The original Crawler "is used as configuration for Run" only, and this limited, focused purpose deserves respect.

In order to give a home to the data structures needed during crawling, a new type crawling struct (in new crawling.go - see below) represents a crawling Crawler.

Crawler (the config) and a *sync.WaitGroup are embedded anonymously; thus crawling inherits all their respective methods.

Further, crawling becomes the new home for the (remaining) channels involved.

(Note: The original implementation uses four hand-made channels and very cleverly orchestrates their handling. Too clever, maybe.)

Two channels become obsolete:

  • queue becomes obsolete as we feed back directly into c.sites.
  • wait becomes a *sync.WaitGroup (to keep track of the traffic inside the circular net).

The remaining two channels, sites and results, get a new home in the new crawling.
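
For orientation, a minimal sketch of what crawling could look like - field names and types are assumptions, not quotes from crawling.go:

	// crawling represents a crawling Crawler (sketch only).
	type crawling struct {
		Crawler         // the configuration - all its methods are promoted
		*sync.WaitGroup // tracks the traffic inside the circular net
		sites   chan site   // the (circular) feed of sites to be crawled
		results chan string // crawled results - reported via c.report
	}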

Using functions generated from pipe/s (see genny.go below), the actual concurrent network to process the sites channel becomes:

	sites, seen := ForkSiteSeenAttr(c.sites, site.attr)
	for _, inp := range ScatterSite(sites, size) {
		DoneSiteFunc(inp, c.crawl) // sites leave inside crawler's crawl
	}
	DoneSite(PipeSiteLeave(seen, c)) // seen leave without further processing

Simple, is it not? ;-)

Please note: we do not need any sync.WaitGroup around the parallel processes (as the original did). Also, we may safely discard the results of Done....

Our *sync.WaitGroup (embedded in crawling) controls the traffic:

  • crawling.add registers entering urls (synchronously, and in parallel!)
  • PipeSiteLeave decrements the "I've seen your url before"-sites
  • crawling.crawl decrements every crawled site
  • crawling.wait patiently awaits its Wait()

crawling.go

  • defines type crawling to represent a crawling Crawler.

  • Crawler.crawling instantiates a new crawling and calls its crawling.crawling (please forgive the pun), which

    • builds the process network (see above)
    • feeds the initial urls (using the original func queueURLs)
    • launches the closer (who simply does a crawling.Wait() before he closes the channels owned by crawling) - see the sketch below
    • and returns a signal channel to receive a signal upon close of results (after each has gone through crawling's c.report)
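
A hypothetical sketch of these last two steps - DoneStringFunc is documented below, everything else (field and method names) is assumed:

	go func() { // the closer
		c.Wait()         // await the embedded *sync.WaitGroup
		close(c.sites)   // then close the channels owned by crawling
		close(c.results)
	}()
	return DoneStringFunc(c.results, c.report) // signals upon close of results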

crawler_test.go

As we feed sites back into the crawling in parallel (which did not happen originally, due to the use of channel queue), the visited map needs to become a guarded map (defined at the end of the source file); a sketch follows below. Feel free to compare with crawler_test.go.ori.
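
A guarded map in this spirit might look as follows - identifiers are illustrative, not the test file's actual names:

	// guardedMap counts visits per URL; the mutex guards concurrent writers.
	type guardedMap struct {
		sync.Mutex
		m map[string]int
	}

	func newGuardedMap() *guardedMap {
		return &guardedMap{m: make(map[string]int)}
	}

	func (g *guardedMap) inc(key string) {
		g.Lock()
		defer g.Unlock()
		g.m[key]++
	}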

genny.go

Just contains the go:generate comments for genny to generate what we need from the pipe/s library.
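
Such directives typically look like the following - the template path, output names, and type parameter are placeholders, not the actual content of genny.go:

	//go:generate genny -in=path/to/pipe/template.go -out=pipe_site.go gen "Type=site"
	//go:generate genny -in=path/to/pipe/template.go -out=pipe_string.go gen "Type=string"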

Changes to crawler.go

  • func (c Crawler) Run() error

    • typo corrected: "Run the cralwer." => "// Run the crawler."
    • ca 30 LoC after initial validation removed,
    • finish with <-c.crawling(urls) instead - wait for crawling to finish
  • func makeQueue()

    • completely removed - no need
    • ca 35 LoC
  • func (c Crawler) worker

    • remove the for s := range sites loop
    • becomes a method of crawling: func (c *crawling) crawlSite(s site) (urls []*url.URL)
    • sends into c.results (instead of results)
  • func queueURLs

    • is now launched in func (c *crawling) add

Thus, ca 80 LoC are removed / deactivated, and:

  • no channel is created
  • no goroutine is launched
  • only two sends remain:
    • c.results <- ... from crawlSite(s site)
    • queue <- site from queueURLs, now called with c.sites as argument queue (from func (c *crawling) add - see the sketch below).
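
A hypothetical sketch of func (c *crawling) add - names and signatures here are assumptions, not copied from the source:

	func (c *crawling) add(urls []*url.URL, sourceURL *url.URL, depth int) {
		c.Add(len(urls))                              // register entering urls - synchronously!
		go queueURLs(c.sites, urls, sourceURL, depth) // feed back into c.sites - in parallel
	}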

Documentation

Overview

Package httpsyet provides the configuration and execution for crawling a list of sites for links that can be updated to HTTPS.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func ChanSite

func ChanSite(inp ...site) (out <-chan site)

ChanSite returns a channel to receive all inputs before close.

func ChanSiteFuncErr

func ChanSiteFuncErr(gen func() (site, error)) (out <-chan site)

ChanSiteFuncErr returns a channel to receive all results of generator `gen` until `err != nil` before close.

func ChanSiteFuncNok

func ChanSiteFuncNok(gen func() (site, bool)) (out <-chan site)

ChanSiteFuncNok returns a channel to receive all results of generator `gen` until `!ok` before close.

func ChanSiteSlice

func ChanSiteSlice(inp ...[]site) (out <-chan site)

ChanSiteSlice returns a channel to receive all inputs before close.

func ChanString

func ChanString(inp ...string) (out <-chan string)

ChanString returns a channel to receive all inputs before close.

func ChanStringFuncErr

func ChanStringFuncErr(gen func() (string, error)) (out <-chan string)

ChanStringFuncErr returns a channel to receive all results of generator `gen` until `err != nil` before close.

func ChanStringFuncNok

func ChanStringFuncNok(gen func() (string, bool)) (out <-chan string)

ChanStringFuncNok returns a channel to receive all results of generator `gen` until `!ok` before close.

func ChanStringSlice

func ChanStringSlice(inp ...[]string) (out <-chan string)

ChanStringSlice returns a channel to receive all inputs before close.

func DoneSite

func DoneSite(inp <-chan site) (done <-chan struct{})

DoneSite returns a channel to receive one signal before close after `inp` has been drained.

func DoneSiteFunc

func DoneSiteFunc(inp <-chan site, act func(a site)) (done <-chan struct{})

DoneSiteFunc returns a channel to receive one signal after `act` has been applied to every `inp` before close.

func DoneSiteSlice

func DoneSiteSlice(inp <-chan site) (done <-chan []site)

DoneSiteSlice returns a channel to receive a slice with every site received on `inp` before close.

Note: Unlike DoneSite, DoneSiteSlice sends the fully accumulated slice, not just an event, once upon close of inp.

func DoneString

func DoneString(inp <-chan string) (done <-chan struct{})

DoneString returns a channel to receive one signal before close after `inp` has been drained.

func DoneStringFunc

func DoneStringFunc(inp <-chan string, act func(a string)) (done <-chan struct{})

DoneStringFunc returns a channel to receive one signal after `act` has been applied to every `inp` before close.

func DoneStringSlice

func DoneStringSlice(inp <-chan string) (done <-chan []string)

DoneStringSlice returns a channel to receive a slice with every string received on `inp` before close.

Note: Unlike DoneString, DoneStringSlice sends the fully accumulated slice, not just an event, once upon close of inp.
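
As a minimal usage sketch (assuming the functions are in scope), the Chan and Done helpers pair up like this:

	in := ChanString("a", "b", "c") // sends all inputs, then closes
	all := <-DoneStringSlice(in)    // blocks until `in` is closed and drained
	fmt.Println(all)                // [a b c]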

func FanIn2Site

func FanIn2Site(inp1, inp2 <-chan site) (out <-chan site)

FanIn2Site returns a channel to receive all from both `inp1` and `inp2` before close.

func FanIn2String

func FanIn2String(inp1, inp2 <-chan string) (out <-chan string)

FanIn2String returns a channel to receive all from both `inp1` and `inp2` before close.

func FiniSite

func FiniSite() func(inp <-chan site) (done <-chan struct{})

FiniSite returns a closure around `DoneSite(_)`.

func FiniSiteFunc

func FiniSiteFunc(act func(a site)) func(inp <-chan site) (done <-chan struct{})

FiniSiteFunc returns a closure around `DoneSiteFunc(_, act)`.

func FiniSiteSlice

func FiniSiteSlice() func(inp <-chan site) (done <-chan []site)

FiniSiteSlice returns a closure around `DoneSiteSlice(_)`.

func FiniString

func FiniString() func(inp <-chan string) (done <-chan struct{})

FiniString returns a closure around `DoneString(_)`.

func FiniStringFunc

func FiniStringFunc(act func(a string)) func(inp <-chan string) (done <-chan struct{})

FiniStringFunc returns a closure around `DoneStringFunc(_, act)`.

func FiniStringSlice

func FiniStringSlice() func(inp <-chan string) (done <-chan []string)

FiniStringSlice returns a closure around `DoneStringSlice(_)`.

func ForkSite

func ForkSite(inp <-chan site) (out1, out2 <-chan site)

ForkSite returns two channels either of which is to receive every result of inp before close.

func ForkSiteSeen

func ForkSiteSeen(inp <-chan site) (new, old <-chan site)

ForkSiteSeen returns two channels, `new` and `old`, where `new` is to receive all `inp` not seen before, and `old` all `inp` seen before (internally growing a `sync.Map` to discriminate) until close.

func ForkSiteSeenAttr

func ForkSiteSeenAttr(inp <-chan site, attr func(a site) interface{}) (new, old <-chan site)

ForkSiteSeenAttr returns two channels, `new` and `old`, where `new` is to receive all `inp` whose attribute `attr` has not been seen before, and `old` all `inp` seen before (internally growing a `sync.Map` to discriminate) until close.

func ForkString

func ForkString(inp <-chan string) (out1, out2 <-chan string)

ForkString returns two channels either of which is to receive every result of inp before close.

func MakeSiteChan

func MakeSiteChan() (out chan site)

MakeSiteChan returns a new open channel (simply a 'chan site', that is). Note: No 'site-producer' is launched here yet! (as is done in all the other functions).

This is useful to easily create corresponding variables such as:

	mysitePipelineStartsHere := MakeSiteChan()
	// ... lots of code to design and build Your favourite "mysiteWorkflowPipeline"
	// ...
	// ... *before* You start pouring data into it, e.g. simply via:
	for drop := range water {
		mysitePipelineStartsHere <- drop
	}
	close(mysitePipelineStartsHere)

Hint: especially helpful if Your piping library operates on some hidden (non-exported) type (or on a type imported from elsewhere - and You don't want/need or should(!) have to care).

Note: as always (except for PipeSiteBuffer) the channel is unbuffered.

func MakeStringChan

func MakeStringChan() (out chan string)

MakeStringChan returns a new open channel (simply a 'chan string', that is). Note: No 'string-producer' is launched here yet! (as is done in all the other functions).

This is useful to easily create corresponding variables such as:

	mystringPipelineStartsHere := MakeStringChan()
	// ... lots of code to design and build Your favourite "mystringWorkflowPipeline"
	// ...
	// ... *before* You start pouring data into it, e.g. simply via:
	for drop := range water {
		mystringPipelineStartsHere <- drop
	}
	close(mystringPipelineStartsHere)

Hint: especially helpful if Your piping library operates on some hidden (non-exported) type (or on a type imported from elsewhere - and You don't want/need or should(!) have to care).

Note: as always (except for PipeStringBuffer) the channel is unbuffered.

func PairSite

func PairSite(inp <-chan site) (out1, out2 <-chan site)

PairSite returns a pair of channels to receive every result of inp before close.

Note: Yes, it is a VERY simple fanout - but sometimes all You need.

func PairString

func PairString(inp <-chan string) (out1, out2 <-chan string)

PairString returns a pair of channels to receive every result of inp before close.

Note: Yes, it is a VERY simple fanout - but sometimes all You need.

func PipeSiteBuffer

func PipeSiteBuffer(inp <-chan site, cap int) (out <-chan site)

PipeSiteBuffer returns a buffered channel with capacity `cap` to receive all `inp` before close.

func PipeSiteEnter

func PipeSiteEnter(inp <-chan site, wg SiteWaiter) (out <-chan site)

PipeSiteEnter returns a channel to receive all `inp` and registers throughput as arrival on the given `sync.WaitGroup` until close.

func PipeSiteFunc

func PipeSiteFunc(inp <-chan site, act func(a site) site) (out <-chan site)

PipeSiteFunc returns a channel to receive every result of action `act` applied to `inp` before close. Note: it 'could' be PipeSiteMap for functional people, but 'map' has a very different meaning in Go.

func PipeSiteLeave

func PipeSiteLeave(inp <-chan site, wg SiteWaiter) (out <-chan site)

PipeSiteLeave returns a channel to receive all `inp` and registers throughput as departure on the given `sync.WaitGroup` until close.

func PipeSiteSeen

func PipeSiteSeen(inp <-chan site) (out <-chan site)

PipeSiteSeen returns a channel to receive all `inp` not seen before, while silently dropping everything seen before (internally growing a `sync.Map` to discriminate) until close. Note: PipeSiteFilterNotSeenYet might be a better name, but is fairly long.

func PipeSiteSeenAttr

func PipeSiteSeenAttr(inp <-chan site, attr func(a site) interface{}) (out <-chan site)

PipeSiteSeenAttr returns a channel to receive all `inp` whose attribute `attr` has not been seen before, while silently dropping everything seen before (internally growing a `sync.Map` to discriminate) until close. Note: PipeSiteFilterAttrNotSeenYet might be a better name, but is fairly long.

func PipeStringBuffer

func PipeStringBuffer(inp <-chan string, cap int) (out <-chan string)

PipeStringBuffer returns a buffered channel with capacity `cap` to receive all `inp` before close.
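
A usage sketch - a buffer of capacity 8 decouples a bursty producer from a slower consumer (assuming the functions are in scope):

	out := PipeStringBuffer(ChanString("a", "b", "c"), 8)
	for s := range out {
		fmt.Println(s)
	}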

func PipeStringFunc

func PipeStringFunc(inp <-chan string, act func(a string) string) (out <-chan string)

PipeStringFunc returns a channel to receive every result of action `act` applied to `inp` before close. Note: it 'could' be PipeStringMap for functional people, but 'map' has a very different meaning in Go.
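
For example, to apply strings.ToUpper to every item flowing through (a sketch, assuming the functions are in scope):

	upper := PipeStringFunc(ChanString("go", "channels"), strings.ToUpper)
	for s := range upper {
		fmt.Println(s) // GO, then CHANNELS
	}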

func ScatterSite

func ScatterSite(inp <-chan site, size int) (outS [](<-chan site))

ScatterSite returns a slice (of length `size`) of channels, one of which shall receive each `inp` before close.

func TubeSiteBuffer

func TubeSiteBuffer(cap int) (tube func(inp <-chan site) (out <-chan site))

TubeSiteBuffer returns a closure around PipeSiteBuffer(_, cap).

func TubeSiteEnter

func TubeSiteEnter(wg SiteWaiter) (tube func(inp <-chan site) (out <-chan site))

TubeSiteEnter returns a closure around PipeSiteEnter(_, wg), registering throughput on the given `sync.WaitGroup` as arrival.

func TubeSiteFunc

func TubeSiteFunc(act func(a site) site) (tube func(inp <-chan site) (out <-chan site))

TubeSiteFunc returns a closure around PipeSiteFunc(_, act).

func TubeSiteLeave

func TubeSiteLeave(wg SiteWaiter) (tube func(inp <-chan site) (out <-chan site))

TubeSiteLeave returns a closure around PipeSiteLeave(_, wg), registering throughput on the given `sync.WaitGroup` as departure.

func TubeSiteSeen

func TubeSiteSeen() (tube func(inp <-chan site) (out <-chan site))

TubeSiteSeen returns a closure around PipeSiteSeen() (silently dropping every site seen before).

func TubeSiteSeenAttr

func TubeSiteSeenAttr(attr func(a site) interface{}) (tube func(inp <-chan site) (out <-chan site))

TubeSiteSeenAttr returns a closure around PipeSiteSeenAttr() (silently dropping every site whose attribute `attr` was seen before).

func TubeStringBuffer

func TubeStringBuffer(cap int) (tube func(inp <-chan string) (out <-chan string))

TubeStringBuffer returns a closure around PipeStringBuffer(_, cap).

func TubeStringFunc

func TubeStringFunc(act func(a string) string) (tube func(inp <-chan string) (out <-chan string))

TubeStringFunc returns a closure around PipeStringFunc(_, act).
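
Tubes are thus pre-configured pipe segments which compose naturally - a usage sketch (assuming the functions are in scope):

	trim := TubeStringFunc(strings.TrimSpace)
	shout := TubeStringFunc(strings.ToUpper)
	out := shout(trim(ChanString("  hi  ")))
	fmt.Println(<-out) // HI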

Types

type Crawler

type Crawler struct {
	Sites    []string                             // At least one URL.
	Out      io.Writer                            // Required. Writes one detected site per line.
	Log      *log.Logger                          // Required. Errors are reported here.
	Depth    int                                  // Optional. Limit depth. Set to >= 1.
	Parallel int                                  // Optional. Set how many sites to crawl in parallel.
	Delay    time.Duration                        // Optional. Set delay between crawls.
	Get      func(string) (*http.Response, error) // Optional. Defaults to http.Get.
	Verbose  bool                                 // Optional. If set, status updates are written to logger.
}

Crawler is used as configuration for Run. It is validated in Run().

func (Crawler) Run

func (c Crawler) Run() error

Run the crawler. Can return validation errors. All crawling errors are reported via logger. Output is written to writer. Crawls sites recursively and reports all external links that can be changed to HTTPS. Also reports broken links via error logger.
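
A minimal configuration might look like this (a sketch; the import path is assumed):

	c := httpsyet.Crawler{
		Sites: []string{"http://example.com/"}, // at least one URL
		Out:   os.Stdout,                       // one detected site per line
		Log:   log.New(os.Stderr, "", log.LstdFlags),
	}
	if err := c.Run(); err != nil {
		log.Fatal(err) // validation errors end up here
	}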

type SiteWaiter

type SiteWaiter interface {
	Add(delta int)
	Done()
}

SiteWaiter - as implemented by `*sync.WaitGroup` - attends Flapdoors and keeps track of how many enter and how many leave.

Use Your provided `*sync.WaitGroup.Wait()` to know when to close the facilities.

Just make sure to have _all_ entrances and exits attended, and don't `wg.Wait()` before You've flooded the facilities.
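
Since `*sync.WaitGroup` offers both Add(delta int) and Done(), it satisfies the interface - a one-line compile-time check:

	var _ SiteWaiter = new(sync.WaitGroup) // *sync.WaitGroup is a SiteWaiter

PipeSiteEnter and PipeSiteLeave (above) are the attended entrance and exit, respectively.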
