How Juicer's focused crawling works
At Headrun, we believe that a crawling sub-system is defined not only by how effectively and accurately it retrieves data, but also by how robust, fault-tolerant and self-healing its underlying architecture is. We factored this in right from the framework level, and we firmly believe it is what gives us our technological edge.
The web-scale Juicer spiders, built on our in-house framework, continuously crawl public sites for data while strictly maintaining polite polling standards. A combination of adaptive algorithms embedded in the sub-system determines, and automatically adjusts, how frequently each site is crawled and which data is retrieved. The crawled data is then segregated by type and pushed into a master database. Feeds from our master database can be pushed to any storage structure as required.
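
To make the idea of an adaptive crawl frequency more concrete, here is a minimal Python sketch. It is an illustration only and not Juicer's actual implementation: the names SiteSchedule and segregate_by_type, the halve/double rule, and the interval bounds are all assumptions introduced for this example. The sketch revisits a site sooner when its content changed on the last visit, backs off when nothing changed, never polls faster than an assumed politeness floor, and groups crawled records by type before they would be stored.

```python
from collections import defaultdict
from dataclasses import dataclass

# Assumed bounds, chosen only for illustration.
MIN_INTERVAL_S = 60.0       # politeness floor: never poll a site more often than this
MAX_INTERVAL_S = 86_400.0   # ceiling: revisit every site at least once a day


@dataclass
class SiteSchedule:
    """Hypothetical per-site schedule holding the current revisit interval."""
    interval_s: float = 3600.0

    def update(self, content_changed: bool) -> float:
        # Halve the interval when the last crawl found new content,
        # double it when nothing changed, then clamp to the bounds.
        factor = 0.5 if content_changed else 2.0
        self.interval_s = min(MAX_INTERVAL_S,
                              max(MIN_INTERVAL_S, self.interval_s * factor))
        return self.interval_s


def segregate_by_type(records):
    """Group crawled records (dicts) by their 'type' field (hypothetical schema)."""
    buckets = defaultdict(list)
    for record in records:
        buckets[record.get("type", "unknown")].append(record)
    return dict(buckets)


if __name__ == "__main__":
    schedule = SiteSchedule()
    for changed in (True, True, False):
        print("next crawl in", schedule.update(changed), "seconds")

    crawled = [
        {"type": "article", "url": "https://example.com/a"},
        {"type": "review", "url": "https://example.com/b"},
        {"type": "article", "url": "https://example.com/c"},
    ]
    print(segregate_by_type(crawled))
```

A multiplicative rule like this is only one possible choice; a production system would likely also take the site's robots.txt directives and server response times into account when setting the interval.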