I wanted to do some larger distributed scraping jobs recently, and although it was easy to get everything running on one machine (with different tools, including Scrapy), I was surprised how hard it was to do at scale. The open source options I could find were hard or impossible to get working, overly complex, badly documented, and so on.

The hosted services I found were reasonably priced for small jobs, but at scale they quickly become vastly more expensive than setting things up yourself, especially when you need to run these jobs every month or so. That holds even if you have to write some code to make the open source solutions actually work.

AWS Lambda is an easy way to get scheduled scraping jobs running.

I use AWS's Python-based Chalice framework (https://github.com/aws/chalice), which lets you put a function on a schedule just by adding a decorator:

  @app.schedule(Rate(30, unit=Rate.MINUTES)) 
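
For context, a minimal Chalice app with a scheduled handler looks roughly like this; the app name and the handler body are just placeholders for whatever your scraping logic is:

  from chalice import Chalice, Rate

  app = Chalice(app_name='scraper')  # example app name

  # Runs every 30 minutes; `event` is the CloudWatch event that
  # triggered the Lambda invocation.
  @app.schedule(Rate(30, unit=Rate.MINUTES))
  def run_scrape(event):
      # Placeholder for the actual scraping work.
      print("starting scheduled scrape")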
It's also a breeze to deploy:

  chalice deploy
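
The end-to-end workflow is roughly the following, assuming you have the Chalice CLI installed and AWS credentials configured:

  pip install chalice
  chalice new-project scraper   # scaffolds app.py and requirements.txt
  cd scraper
  # edit app.py to add the scheduled handler
  chalice deploy                # packages the code and creates the Lambda plus its schedule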