Always fascinated by how diverse the discussion and answers are for HN threads on web-scraping. Goes to show that "web-scraping" has a ton of connotations, everything from automated fetching of URLs via wget or cURL to data management via something like Scrapy.

Scrapy is a whole framework that may be worthwhile, but if I were just starting out for a specific task, I would use the following (quick sketch after the list):

- requests http://docs.python-requests.org/en/master/

- lxml http://lxml.de/

- cssselect https://cssselect.readthedocs.io/en/latest/
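Something like this covers a lot of one-off jobs (just a sketch; the URL and the CSS selector are placeholders, not anything specific):

    import requests
    import lxml.html

    # Placeholder URL and selector -- swap in whatever you're actually scraping
    url = 'https://example.com/articles'
    resp = requests.get(url)
    resp.raise_for_status()

    doc = lxml.html.fromstring(resp.text)
    # .cssselect() is available on lxml elements once cssselect is installed
    for link in doc.cssselect('a.article-title'):
        print(link.get('href'), link.text_content().strip())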

Python 3, AFAIK, doesn't have anything as handy as Ruby/Perl's Mechanize. But using the web developer tools you can usually figure out the requests made by the browser and then use the Session object in the Requests library to deal with stateful requests:

http://docs.python-requests.org/en/master/user/advanced/
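For example, something like this (sketch only; the login URL, form fields, and header values are made-up stand-ins for whatever you see in the browser's network tab):

    import requests

    session = requests.Session()
    # Copy whatever headers the browser sends; this value is just an example
    session.headers.update({'User-Agent': 'Mozilla/5.0 (my-scraper)'})

    # Hypothetical login form -- the URL and field names would come from
    # inspecting the request the browser actually makes
    session.post('https://example.com/login',
                 data={'username': 'me', 'password': 'secret'})

    # The session carries cookies forward, so this request is "logged in"
    resp = session.get('https://example.com/account/export.csv')
    print(resp.status_code)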

I usually just download pages/data/files as raw files and worry about parsing/collating them later. I try to focus on the HTTP mechanics and, if needed, the HTML parsing, before worrying about data extraction.
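In code that's roughly this (another sketch, with an arbitrary directory layout and made-up URLs):

    import os
    import requests

    # Made-up seed URLs; in practice these come from a listing page or sitemap
    urls = [
        'https://example.com/page/1',
        'https://example.com/page/2',
    ]

    os.makedirs('raw', exist_ok=True)
    for i, url in enumerate(urls):
        resp = requests.get(url)
        # Save the raw bytes untouched; parsing/collating is a separate pass
        with open(os.path.join('raw', '{:04d}.html'.format(i)), 'wb') as f:
            f.write(resp.content)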

lxml can be hit-or-miss on HTML5 docs. I've had greater success with a modified version of gumbo-parser.

Ah, very cool. I'd seen various Python libraries for HTML5 parsing, but not gumbo (or at least I hadn't starred it).

https://github.com/google/gumbo-parser

Is the modified version you use a personal version or a well-known fork?