What does HackerNews think of goquery?

Show HN: Flyscrape – A standalone and scriptable web scraper in Go | Nov 2023

Your comment was posted 4 minutes ago. That means you still have enough time to edit your comment to change it so it contains real URLs that link to the project repos for the packages mentioned:

<https://github.com/PuerkitoBio/goquery>

<https://github.com/dop251/goja>

(Please do not reply to this comment of mine—if you do, I won't be able to delete it once the previous post is fixed, because the existence of the replies will prevent that.)

Ask HN: What weird technical scene are you fond/part of? | Nov 2022

Really depends on how big your scraping operation is going to be. These days there are a lot of "managed" providers that give you headless browsers / proxy rotators through an easy API, so it's relatively easy to plug them into your code. Examples of these would be https://www.browserless.io or https://www.scrapingbee.com for headless browsers to render JS.

From my experience working on a large scraping stack with thousands of integrations, I can say that we are very happy with our own custom framework, written in Go (https://github.com/PuerkitoBio/goquery for HTML parsing) and using headless Chrome for JS rendering.

I Need to Find an Apartment | Apr 2022

I had a similar problem that I solved with goquery and otto. You can use goquery to traverse the DOM and otto to execute the script fragment. Then just grab the data from otto's VM.

Your scraping being slow and using Chrome might be a blessing in disguise though. If you aren't careful you can get detected as a bot and banned from the site.

https://github.com/PuerkitoBio/goquery https://github.com/robertkrimen/otto

On Learning Rust and Go: Migrating Away from Python | Mar 2019

I haven't found anything that matches Beautiful Soup's ability to parse completely borked HTML.

For HTML that's at least mostly valid, I like goquery: https://github.com/PuerkitoBio/goquery

Colly – Scraping Framework for Golang | Dec 2017

For DOM parsing I cannot imagine that there could be anything better than https://github.com/PuerkitoBio/goquery.

Introduction to web scraping with Python | Oct 2017

goquery [1] is pretty nice.

[1] - https://github.com/PuerkitoBio/goquery

Scrape: A simple, higher level interface for Go web scraping | May 2015

I like goquery[1] for doing this type of thing.

[1] https://github.com/PuerkitoBio/goquery

Python web scraping resources | Aug 2014

Very nice, thanks for posting =).

Can people suggest any additional resources/reading on scraping/crawling as well?

I was hoping to experiment with it in GoLang, but there doesn't seem to be much on crawling/scraping with GoLang, except for GoQuery (https://github.com/PuerkitoBio/goquery)

Import.io – Structured Web Data Scraping | Apr 2014

I've written a couple of "polite" crawlers in Go (i.e. they obey robots.txt and delay between requests to the same host).

- Fetchbot: https://github.com/PuerkitoBio/fetchbot

Flexible, similar API to net/http (uses a Handler interface with a simple mux provided, supports middleware, etc.)

- gocrawl: https://github.com/PuerkitoBio/gocrawl

Higher-level, more framework than library.

Coupled with goquery (https://github.com/PuerkitoBio/goquery) to scrape the DOM (well, the net/html nodes), this makes custom scrapers trivial to write.

(sorry for the self-promoting comment, but this is quite on topic)

edit: polite crawlers, not scrapers.