If you are a programmer, scrapy[0] is a good bet. It handles robots.txt, per-IP and per-domain request throttling, proxies, and the other common nitty-gritty of crawling. The main drawback is pure-JavaScript sites: for those you have to dig into the site's API manually, or invoke a headless browser from within the Scrapy handler.

Scrapy can also pause and resume crawls [1], run crawlers distributed across multiple machines [2], etc. It is my go-to option.

[0] https://scrapy.org/

[1] https://doc.scrapy.org/en/latest/topics/jobs.html

[2] https://github.com/rmax/scrapy-redis

Haven't tried this[0] yet, but Scrapy should be able to handle JavaScript sites via Splash[1], a JavaScript rendering service; scrapy-splash[2] is the plugin that integrates the two.

[0] https://blog.scrapinghub.com/2015/03/02/handling-javascript-...

[1] https://splash.readthedocs.io/en/stable/index.html

[2] https://github.com/scrapy-plugins/scrapy-splash