User’s Guide¶
Configuration¶
tcpaddress¶
Default: Optional parameter
Host and port combination that the telnet console binds to, e.g. localhost:7654
httpaddress¶
Default: Optional parameter
Host and port combination that the HTTP API server binds to, e.g. localhost:5555
redisaddress¶
Default: Optional parameter
Host and port combination of the Redis server, which is required for both the HTTP API frontend and storage.
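All three address parameters share the same host:port form, so a value can be checked before it is used. The following is a minimal sketch using only the Go standard library. The helper name splitAddress is hypothetical and not part of Gotana, which may validate addresses differently or not at all.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// splitAddress is a hypothetical helper (not part of Gotana). It splits a
// "host:port" value such as "localhost:7654" and checks that the port is a
// number in the valid TCP range.
func splitAddress(addr string) (string, int, error) {
	host, portStr, err := net.SplitHostPort(addr)
	if err != nil {
		return "", 0, err
	}
	port, err := strconv.Atoi(portStr)
	if err != nil {
		return "", 0, fmt.Errorf("invalid port %q: %w", portStr, err)
	}
	if port < 1 || port > 65535 {
		return "", 0, fmt.Errorf("port %d out of range", port)
	}
	return host, port, nil
}

func main() {
	// The addresses are the example values used in this guide.
	for _, addr := range []string{"localhost:7654", "localhost:5555", "localhost:6379"} {
		host, port, err := splitAddress(addr)
		if err != nil {
			fmt.Println("invalid address:", err)
			continue
		}
		fmt.Printf("host=%s port=%d\n", host, port)
	}
}
```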
Scraper Configuration¶
patterns¶
Default: Optional parameter
List of patterns that each URL being scraped is validated against. See Patterns Configuration.
extractor¶
Default: Optional parameter
Short name of the extractor struct that implements the Extractable interface. By default, LinkExtractor (link) is used.
Patterns Configuration¶
type¶
Default: This parameter is mandatory
Either contains or regexp. The former uses plain string matching; the latter relies on a regular expression.
pattern¶
Default: This parameter is mandatory
Value used either as the string to match against or as the regular expression, depending on the pattern type.
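The two pattern types can be illustrated with a short Go sketch. The Pattern struct and its Matches method below are hypothetical and only mirror the two documented fields; Gotana's actual types and matching rules may differ. In particular, the sketch assumes that contains means substring matching, and it does not say how several patterns are combined.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Pattern is a hypothetical illustration of the documented "type" and
// "pattern" fields; it is not Gotana's own type.
type Pattern struct {
	Type    string // "contains" or "regexp"
	Pattern string // substring or regular expression, depending on Type
}

// Matches reports whether url satisfies the pattern.
func (p Pattern) Matches(url string) (bool, error) {
	switch p.Type {
	case "contains":
		// Assumption: "contains" is a plain substring check.
		return strings.Contains(url, p.Pattern), nil
	case "regexp":
		re, err := regexp.Compile(p.Pattern)
		if err != nil {
			return false, err
		}
		return re.MatchString(url), nil
	default:
		return false, fmt.Errorf("unknown pattern type %q", p.Type)
	}
}

func main() {
	patterns := []Pattern{
		{Type: "contains", Pattern: "/issues"},
		{Type: "regexp", Pattern: `/issues/\d+$`},
	}
	url := "http://golangweekly.com/issues/123"
	for _, p := range patterns {
		ok, err := p.Matches(url)
		fmt.Println(p.Type, p.Pattern, ok, err)
	}
}
```

Compiling the regular expression on every call keeps the sketch short; code that checks many URLs would normally compile it once and reuse it.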
Example configuration¶
project: test
tcpaddress: localhost:7654
redisaddress: localhost:6379
httpaddress: localhost:5555
scrapers:
  - name: golang
    url: http://golangweekly.com
    requestlimit: 200
    patterns:
      - type: contains
        pattern: /issues
  - name: scrapinghub
    url: https://blog.scrapinghub.com
    requestlimit: 200
Exports¶
Extensions¶
Items¶
Middleware¶
Patterns¶
TCP Server¶
Gotana offers a telnet console for inspecting crawlers and controlling the engine. The idea behind this service is to provide simple remote control over the scrapers.
Available commands¶
Name | Description |
---|---|
HELP | Displays the list of available commands |
LIST | Displays the list of available scrapers |
STATS | Displays statistics of currently running scrapers |
STOP | Stops the running scrapers |
MIDDLEWARE | Displays installed middleware |
EXTENSIONS | Displays installed extensions |
Usage¶
telnet localhost 7654
HELP
-------------------------------------
Available commands: LIST, STATS, HELP, STOP
STATS
-------------------------------------
Total scrapers: 1. Total requests: 45
-------------------------------------
------------------------------------------------------------------------------------------
<Scraper: golangweekly.com>. Crawled: 45, successful: 44, failed: 1. Scraped: 44, saved: 9
------------------------------------------------------------------------------------------
--------------------------------------------------------
Currently fetching: http://golangweekly.com/rss/14p9ef33
--------------------------------------------------------
LIST
------------------------
Running scrapers: golang
STOP
--------------------
Stopping scrapers...
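If the STATS output needs to be consumed by a script, the per-scraper line from the session above can be parsed with a regular expression. The sketch below assumes the line always has exactly the layout shown in that transcript; the ScraperStats type and parseStatsLine function are hypothetical names and not part of Gotana.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// ScraperStats holds the counters from one per-scraper STATS line.
// Hypothetical type, introduced only for this example.
type ScraperStats struct {
	Name       string
	Crawled    int
	Successful int
	Failed     int
	Scraped    int
	Saved      int
}

// statsLine matches a line of the form shown in the usage transcript.
var statsLine = regexp.MustCompile(
	`^<Scraper: ([^>]+)>\. Crawled: (\d+), successful: (\d+), failed: (\d+)\. Scraped: (\d+), saved: (\d+)$`)

// parseStatsLine extracts the scraper name and counters from a single line.
func parseStatsLine(line string) (ScraperStats, error) {
	m := statsLine.FindStringSubmatch(line)
	if m == nil {
		return ScraperStats{}, fmt.Errorf("unrecognized stats line: %q", line)
	}
	// m[1] is the name; m[2] through m[6] are the five counters.
	nums := make([]int, 5)
	for i := range nums {
		n, err := strconv.Atoi(m[i+2])
		if err != nil {
			return ScraperStats{}, err
		}
		nums[i] = n
	}
	return ScraperStats{
		Name:       m[1],
		Crawled:    nums[0],
		Successful: nums[1],
		Failed:     nums[2],
		Scraped:    nums[3],
		Saved:      nums[4],
	}, nil
}

func main() {
	line := "<Scraper: golangweekly.com>. Crawled: 45, successful: 44, failed: 1. Scraped: 44, saved: 9"
	stats, err := parseStatsLine(line)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%+v\n", stats)
}
```

The separator lines made of dashes do not match the expression, so a caller can skip any line for which parseStatsLine returns an error.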