Yaml Crawler
Steps
- Decide which URL you want to crawl, for example
https://www.xxx.com/aaa.apex
- Create a page POJO, org.example.business.page.MainPage, to describe that page.
Then create a YAML file named root-pages.yaml with the following content:
    - '@class': "org.example.business.page.MainPage"
      url: "https://www.xxx.com/aaa.apex"
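As a rough illustration, the page POJO behind that entry could look like the sketch below. The framework's real base type and required fields are not shown in this README, so the url field, the constructors, and the accessors are assumptions inferred from the url key in root-pages.yaml. The package declaration is left out to keep the sketch self-contained.

```java
// Hypothetical sketch of org.example.business.page.MainPage.
// Assumption: the crawler fills the object from root-pages.yaml,
// so it gets a no-arg constructor and one setter per YAML key.
class MainPage {
    private String url;

    MainPage() {
    }

    MainPage(String url) {
        this.url = url;
    }

    public String getUrl() {
        return url;
    }

    public void setUrl(String url) {
        this.url = url;
    }
}
```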
- Then define a process flow YAML file that tells the crawler how to process each page it encounters:
    processorChain:
    - '@class': "org.example.crawler.core.processor.decorator.ExceptionRecord"
      processor:
        '@class': "org.example.crawler.core.processor.decorator.RetryControl"
        processor:
          '@class': "org.example.crawler.core.processor.decorator.SpeedControl"
          processor:
            '@class': "org.example.business.hs.code.MainPageProcessor"
            application: "app-name"
          time: 100
          unit: "MILLISECONDS"
        retryTimes: 1
    - '@class': "org.example.crawler.core.processor.decorator.ExceptionRecord"
      processor:
        '@class': "org.example.crawler.core.processor.decorator.RetryControl"
        processor:
          '@class': "org.example.crawler.core.processor.decorator.SpeedControl"
          processor:
            '@class': "org.example.crawler.core.processor.download.DownloadProcessor"
            pagePersist:
              '@class': "org.example.business.persist.DownloadPageDatabasePersist"
              downloadPageRepositoryBeanName: "downloadPageRepository"
            downloadPageTransformer:
              '@class': "org.example.crawler.download.DefaultDownloadPageTransformer"
            skipExists:
              '@class': "org.example.crawler.download.SkipExistsById"
          time: 1
          unit: "SECONDS"
        retryTimes: 1
    nThreads: 1
    pollWaitingTime: 30
    pollWaitingTimeUnit: "SECONDS"
    waitFinishedTimeout: 180
    waitFinishedTimeUnit: "SECONDS"
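Each processorChain entry reads like a decorator chain: ExceptionRecord wraps RetryControl, which wraps SpeedControl, which wraps the processor doing the real work. The sketch below shows that wrapping idea in plain Java. It is not the framework's code: the Processor interface, the class names ending in Sketch, and the reading of retryTimes as "extra attempts after the first" are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical contract: takes a page URL, returns URLs of follow-up pages.
interface Processor {
    List<String> process(String pageUrl) throws Exception;
}

// Waits a fixed delay before delegating (compare time and unit in the YAML).
class SpeedControlSketch implements Processor {
    private final Processor processor;
    private final long time;
    private final TimeUnit unit;

    SpeedControlSketch(Processor processor, long time, TimeUnit unit) {
        this.processor = processor;
        this.time = time;
        this.unit = unit;
    }

    @Override
    public List<String> process(String pageUrl) throws Exception {
        unit.sleep(time);
        return processor.process(pageUrl);
    }
}

// Retries the wrapped processor on failure.
// Assumption: retryTimes counts extra attempts after the first one.
class RetryControlSketch implements Processor {
    private final Processor processor;
    private final int retryTimes;

    RetryControlSketch(Processor processor, int retryTimes) {
        this.processor = processor;
        this.retryTimes = Math.max(0, retryTimes);
    }

    @Override
    public List<String> process(String pageUrl) throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt <= retryTimes; attempt++) {
            try {
                return processor.process(pageUrl);
            } catch (Exception e) {
                last = e; // remember the failure and try again
            }
        }
        throw last; // every attempt failed
    }
}

// Outermost layer: records any exception instead of letting it escape.
class ExceptionRecordSketch implements Processor {
    private final Processor processor;
    final List<Exception> recorded = new ArrayList<>();

    ExceptionRecordSketch(Processor processor) {
        this.processor = processor;
    }

    @Override
    public List<String> process(String pageUrl) {
        try {
            return processor.process(pageUrl);
        } catch (Exception e) {
            recorded.add(e);
            return new ArrayList<>(); // a failed page yields no follow-up pages
        }
    }
}
```

Building the chain by hand mirrors the YAML nesting from the outside in, for example new ExceptionRecordSketch(new RetryControlSketch(new SpeedControlSketch(inner, 100, TimeUnit.MILLISECONDS), 1)), where inner is your own processor.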
ExceptionRecord, RetryControl, and SpeedControl are provided by the YAML crawler itself, so you do not need to write them. You only need to implement how your own page, MainPage, is processed, for example by defining a MainPageProcessor. Each processor produces a set of other pages or DownloadPage objects. A DownloadPage is like a ship carrying the information you need; the framework processes each DownloadPage for you and downloads or persists it.
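To make the "other pages or DownloadPage" idea concrete, here is a minimal sketch of what a page processor might produce. The Page and DownloadPage classes below are hypothetical stand-ins, not the framework's types, and the rule that links ending in .pdf become DownloadPage objects is an assumption chosen only for illustration. Fetching and parsing the page are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a page the crawler should visit next.
class Page {
    final String url;

    Page(String url) {
        this.url = url;
    }
}

// Hypothetical stand-in for DownloadPage: the "ship" carrying data to download or persist.
// Assumption: it is a kind of Page and carries an id plus some extracted content.
class DownloadPage extends Page {
    final String id;
    final String content;

    DownloadPage(String id, String url, String content) {
        super(url);
        this.id = id;
        this.content = content;
    }
}

class MainPageProcessorSketch {
    // linkUrls stands for links already extracted from the fetched main page.
    List<Page> process(Page mainPage, List<String> linkUrls) {
        List<Page> produced = new ArrayList<>();
        for (String link : linkUrls) {
            if (link.endsWith(".pdf")) {
                // Assumption: files are handed over as DownloadPage objects.
                produced.add(new DownloadPage(link, link, "found on " + mainPage.url));
            } else {
                // Everything else is queued as another page to crawl.
                produced.add(new Page(link));
            }
        }
        return produced;
    }
}
```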
- Voilà, now run your crawler.