What does HackerNews think of goquery?

A little like that j-thing, only in Go.

Language: Go

Your comment was posted 4 minutes ago. That means you still have enough time to edit your comment to change it so it contains real URLs that link to the project repos for the packages mentioned:

<https://github.com/PuerkitoBio/goquery>

<https://github.com/dop251/goja>

(Please do not reply to this comment of mine—if you do, I won't be able to delete it once the previous post is fixed, because the existence of the replies will prevent that.)

It really depends on how big your scraping operation is going to be. These days there are a lot of "managed" providers that give you headless browsers / proxy rotators through an easy API, so it's relatively easy to plug them into your code. Examples would be https://www.browserless.io or https://www.scrapingbee.com for headless browsers to render JS.

From my experience working on a large scraping stack with thousands of integrations, I can say that we are very happy with our own custom framework, written in Go (using https://github.com/PuerkitoBio/goquery for HTML parsing) with headless Chrome for JS rendering.

I had a similar problem that I solved with goquery and otto. You can use goquery to traverse the DOM and otto to execute the script fragment. Then just grab the data from otto's VM.

Your scraping being slow and using Chrome might be a blessing in disguise though. If you aren't careful you can get detected as a bot and banned from the site.

https://github.com/PuerkitoBio/goquery https://github.com/robertkrimen/otto

I haven't found anything that matches Beautiful Soup's ability to parse completely borked HTML.

For HTML that's at least mostly valid, I like goquery: https://github.com/PuerkitoBio/goquery

For DOM parsing I cannot imagine that there could be anything better than https://github.com/PuerkitoBio/goquery.
Very nice, thanks for posting =).

Can people suggest any additional resources/reading on scraping/crawling as well?

I was hoping to experiment with it in Go, but there doesn't seem to be much on crawling/scraping with Go, except for goquery (https://github.com/PuerkitoBio/goquery).

I've written a couple of "polite" crawlers in Go (i.e. they obey robots.txt and delay between requests to the same host).

- Fetchbot: https://github.com/PuerkitoBio/fetchbot

Flexible, similar API to net/http (uses a Handler interface with a simple mux provided, supports middleware, etc.)

- gocrawl: https://github.com/PuerkitoBio/gocrawl

Higher-level, more framework than library.

Coupled with goquery (https://github.com/PuerkitoBio/goquery) to scrape the DOM (well, the net/html nodes), this makes custom scrapers trivial to write.

(sorry for the self-promoting comment, but this is quite on topic)

edit: polite crawlers, not scrapers.