User’s Guide

Configuration

project

Default: This parameter is mandatory

Name of the project used internally by the engine

tcpaddress

Default: Optional parameter

Host and port combination that the telnet console binds to, e.g. localhost:7654

httpaddress

Default: Optional parameter

Host and port combination that the HTTP API server binds to, e.g. localhost:5555

redisaddress

Default: Optional parameter

Host and port combination of the Redis server, which is required for the HTTP API frontend as well as for storage.

scrapers

Default: This parameter is mandatory

List of scrapers that will be executed by the engine
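
The address parameters above all take a host:port value. As a minimal sketch of how such a value could be checked before it is put into the configuration, the following uses Go's standard net.SplitHostPort. The helper name splitAddress is hypothetical and is not part of Gotana.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// splitAddress is a hypothetical helper (not part of Gotana) that splits a
// "host:port" value such as "localhost:7654" and checks the port range.
func splitAddress(addr string) (string, int, error) {
	host, portStr, err := net.SplitHostPort(addr)
	if err != nil {
		return "", 0, err
	}
	port, err := strconv.Atoi(portStr)
	if err != nil {
		return "", 0, fmt.Errorf("invalid port %q: %w", portStr, err)
	}
	if port < 1 || port > 65535 {
		return "", 0, fmt.Errorf("port %d out of range", port)
	}
	return host, port, nil
}

func main() {
	// The last value lacks a port and is reported as invalid.
	for _, addr := range []string{"localhost:7654", "localhost:5555", "localhost"} {
		host, port, err := splitAddress(addr)
		if err != nil {
			fmt.Printf("%s: invalid (%v)\n", addr, err)
			continue
		}
		fmt.Printf("%s: host=%s port=%d\n", addr, host, port)
	}
}
```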

Scraper Configuration

name

Default: This parameter is mandatory

Internal name of the scraper

url

Default: This parameter is mandatory

Base URL from which crawling starts

requestlimit

Default: 1 millisecond

Number of milliseconds to wait between requests

patterns

Default: Optional parameter

List of patterns that the URL currently being scraped is validated against. See Patterns Configuration.

extractor

Default: Optional parameter

Short name of the extractor struct that implements the Extractable interface. By default, LinkExtractor (link) is used.

Patterns Configuration

type

Default: This parameter is mandatory

Either contains or regexp. The former uses plain string matching; the latter relies on a regular expression.

pattern

Default: This parameter is mandatory

Value used either as the string to match against or as the regular expression, depending on the pattern type.
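
The difference between the two pattern types can be sketched with Go's standard strings and regexp packages. The Pattern type and its Matches method below are hypothetical and only mirror the type and pattern parameters described above. How Gotana combines several patterns is not covered here.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Pattern is a hypothetical type mirroring one entry of the patterns list.
type Pattern struct {
	Type    string // "contains" or "regexp"
	Pattern string // substring or regular expression, depending on Type
}

// Matches reports whether the URL satisfies the pattern.
func (p Pattern) Matches(url string) (bool, error) {
	switch p.Type {
	case "contains":
		// Plain substring match.
		return strings.Contains(url, p.Pattern), nil
	case "regexp":
		// Regular expression match; an invalid expression is an error.
		re, err := regexp.Compile(p.Pattern)
		if err != nil {
			return false, err
		}
		return re.MatchString(url), nil
	default:
		return false, fmt.Errorf("unknown pattern type %q", p.Type)
	}
}

func main() {
	url := "http://golangweekly.com/issues/1" // illustrative URL
	patterns := []Pattern{
		{Type: "contains", Pattern: "/issues"},
		{Type: "regexp", Pattern: `/issues/\d+$`},
	}
	for _, p := range patterns {
		ok, err := p.Matches(url)
		fmt.Println(p.Type, p.Pattern, ok, err)
	}
}
```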

Example configuration

project: test
tcpaddress: localhost:7654
redisaddress: localhost:6379
httpaddress: localhost:5555
scrapers:
- name: golang
  url: http://golangweekly.com
  requestlimit: 200
  patterns:
  - type: contains
    pattern: /issues
- name: scrapinghub
  url: https://blog.scrapinghub.com
  requestlimit: 200
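
The rules described in the configuration sections (mandatory parameters, the requestlimit default, the two pattern types) can be expressed as a small validation routine. The sketch below is an illustration only. The struct and method names are hypothetical and are not Gotana's own types. The yaml struct tags only document the key names, and no YAML parsing is performed.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical types mirroring the documented configuration keys.
type PatternConfig struct {
	Type    string `yaml:"type"`
	Pattern string `yaml:"pattern"`
}

type ScraperConfig struct {
	Name         string          `yaml:"name"`
	URL          string          `yaml:"url"`
	RequestLimit int             `yaml:"requestlimit"` // milliseconds
	Patterns     []PatternConfig `yaml:"patterns"`
	Extractor    string          `yaml:"extractor"`
}

type Config struct {
	Project      string          `yaml:"project"`
	TCPAddress   string          `yaml:"tcpaddress"`
	HTTPAddress  string          `yaml:"httpaddress"`
	RedisAddress string          `yaml:"redisaddress"`
	Scrapers     []ScraperConfig `yaml:"scrapers"`
}

// Validate checks the mandatory parameters and fills in the documented
// requestlimit default of 1 millisecond when the value is unset.
func (c *Config) Validate() error {
	if c.Project == "" {
		return errors.New("project is mandatory")
	}
	if len(c.Scrapers) == 0 {
		return errors.New("scrapers is mandatory")
	}
	for i := range c.Scrapers {
		s := &c.Scrapers[i]
		if s.Name == "" || s.URL == "" {
			return fmt.Errorf("scraper %d: name and url are mandatory", i)
		}
		if s.RequestLimit == 0 {
			s.RequestLimit = 1
		}
		for _, p := range s.Patterns {
			if p.Type != "contains" && p.Type != "regexp" {
				return fmt.Errorf("scraper %s: unknown pattern type %q", s.Name, p.Type)
			}
			if p.Pattern == "" {
				return fmt.Errorf("scraper %s: pattern is mandatory", s.Name)
			}
		}
	}
	return nil
}

func main() {
	// The same values as in the example configuration, built by hand.
	cfg := Config{
		Project:    "test",
		TCPAddress: "localhost:7654",
		Scrapers: []ScraperConfig{
			{Name: "golang", URL: "http://golangweekly.com", RequestLimit: 200,
				Patterns: []PatternConfig{{Type: "contains", Pattern: "/issues"}}},
			{Name: "scrapinghub", URL: "https://blog.scrapinghub.com", RequestLimit: 200},
		},
	}
	fmt.Println("validation error:", cfg.Validate())
}
```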

Exports

Extensions

Items

Middleware

Patterns

TCP Server

Gotana offers a telnet console for inspecting crawlers and controlling the engine. The idea behind this service is to offer simple remote control over the scrapers.

Available commands

Name        Description
HELP        Displays the list of available commands
LIST        Displays the list of available scrapers
STATS       Displays statistics of currently running scrapers
STOP        Stops the running scrapers
MIDDLEWARE  Displays installed middleware
EXTENSIONS  Displays installed extensions

Usage

telnet localhost 7654
HELP
-------------------------------------
Available commands: LIST, STATS, HELP, STOP
STATS
-------------------------------------
Total scrapers: 1. Total requests: 45
-------------------------------------
------------------------------------------------------------------------------------------
<Scraper: golangweekly.com>. Crawled: 45, successful: 44, failed: 1. Scraped: 44, saved: 9
------------------------------------------------------------------------------------------
--------------------------------------------------------
Currently fetching: http://golangweekly.com/rss/14p9ef33
--------------------------------------------------------
LIST
------------------------
Running scrapers: golang
STOP
--------------------
Stopping scrapers...

Configuration

tcpaddress

Default: Optional parameter

Host and port combination that the telnet console binds to, e.g. localhost:7654

Tutorial

This tutorial introduces Gotana's core concepts by example.

Getting started

Basic usage

Info

Additional