Yeah, there are certainly more problems these days. For one, the web is a lot bigger, and more of it is spam, so pure PageRank struggles: you have to detect networks of sites that heavily link to each other just to inflate their rank.
Important sites have a bunch of anti-crawling detection set up (especially news sites). Worse, the best user-generated content is locked away in walled gardens: Facebook groups, Slack channels, Quora threads, etc...
The rest of the good sites are JavaScript-heavy, and you often have to run headless Chrome to render the page and find the content - but that is detectable, so you end up renting IPs from mobile proxy farms or trying to build your own 4G network.
On the upside, https://commoncrawl.org/ now exists and makes the prototype crawling work much easier. It's not the full internet, but it gives you plenty to work with and test against, so you can skip ahead to figuring out whether you can produce anything useful before you commit to crawling the whole internet yourself.
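As a rough illustration, here's a minimal Python sketch of pulling one captured page out of Common Crawl: look the URL up in the public CDX index, then fetch just that record's byte range from the WARC file. The crawl label (CC-MAIN-2024-10) and the use of the `requests` library are just assumptions for the example; pick whatever recent crawl is listed on https://index.commoncrawl.org/.

```python
# Minimal sketch: query the Common Crawl URL index, then pull one captured
# page out of the referenced WARC file with an HTTP range request.
# CC-MAIN-2024-10 is just an example crawl; see https://index.commoncrawl.org/
import gzip
import json
import requests

INDEX = "https://index.commoncrawl.org/CC-MAIN-2024-10-index"

resp = requests.get(INDEX, params={"url": "example.com/*", "output": "json"})
records = [json.loads(line) for line in resp.text.splitlines() if line]

rec = records[0]
start = int(rec["offset"])
end = start + int(rec["length"]) - 1

# Each index record points into a ~1 GB WARC file; fetch only the slice we need.
warc = requests.get(
    "https://data.commoncrawl.org/" + rec["filename"],
    headers={"Range": f"bytes={start}-{end}"},
)
# The slice is an individually gzipped WARC record: WARC + HTTP headers + body.
print(gzip.decompress(warc.content).decode("utf-8", errors="replace")[:500])
```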
- Tries (patricia, radix, etc...)
- Trees (b-trees, b+trees, merkle trees, log-structured merge-tree, etc..)
- Consensus (raft, paxos, etc..)
- Block storage (disk block size optimizations, mmap files, delta storage, etc..)
- Probabilistic filters (HyperLogLog, Bloom filters, etc...; sketch below)
- Binary Search (sstables, sorted inverted indexes; sketch below)
- Ranking (pagerank, tf/idf, bm25, etc...; sketch below)
- NLP (stemming, POS tagging, subject identification, etc...)
- HTML (document parsing/lexing)
- Images (exif extraction, removal, resizing / proxying, etc...)
- Queues (SQS, NATS, Apollo, etc...)
- Clustering (k-means, density-based, hierarchical, Gaussian mixture models, etc...)
- Rate limiting (leaky bucket, windowed, etc...; sketch below)
- Text processing (unicode normalization, slugify, sanitization, lossless and lossy hashing like metaphone and document fingerprinting; sketch below)
- etc...
I'm sure there is plenty more I've missed. There are also lots of generic structures involved, like hashes, linked lists, skip lists, heaps, and priority queues, and all of this is just to get to 2000s-level basic tech.
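To make a couple of the bullets above concrete: for the probabilistic filters, here is a toy Bloom filter of the kind you would put in front of a crawl frontier to ask "have I already seen this URL?". The bit-array size and the hash-with-a-counter scheme are arbitrary choices for the sketch, not what any particular engine does.

```python
# Toy Bloom filter: k hash probes into a fixed bit array.
# False positives are possible, false negatives are not.
import hashlib

class BloomFilter:
    def __init__(self, m_bits=1 << 20, k=5):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item):
        # Derive k bit positions by hashing the item with a different prefix each time.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

seen = BloomFilter()
seen.add("https://example.com/")
print("https://example.com/" in seen)  # True
print("https://example.org/" in seen)  # almost certainly False
```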
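For the sorted-inverted-index bullet, the whole point is that postings are kept as sorted runs, so lookups and intersections become binary searches and merges rather than hash probes. A tiny in-memory version (the doc ids are made up, and a real engine would keep these runs in SSTables on disk):

```python
# Sketch of a sorted inverted index: term -> sorted list of doc ids.
from bisect import bisect_left

postings = {
    "rust":   [2, 5, 9, 14, 21],
    "search": [1, 2, 3, 9, 21, 40],
}

def contains(sorted_ids, doc_id):
    i = bisect_left(sorted_ids, doc_id)
    return i < len(sorted_ids) and sorted_ids[i] == doc_id

def intersect(a, b):
    # Walk the shorter list and binary-search the longer one.
    short, long = (a, b) if len(a) <= len(b) else (b, a)
    return [d for d in short if contains(long, d)]

print(intersect(postings["rust"], postings["search"]))  # docs matching both terms
```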
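For ranking, a bare-bones BM25 scorer over a toy three-document corpus. The k1 and b values are the usual textbook defaults; a real engine layers field weights, link-based signals, and so on over this.

```python
# Rough sketch of BM25 scoring over an in-memory index of tokenized docs.
import math
from collections import Counter

docs = {
    1: "rust search engine library".split(),
    2: "building a web crawler in rust".split(),
    3: "postgres full text search tips".split(),
}

N = len(docs)
avgdl = sum(len(d) for d in docs.values()) / N
df = Counter(term for d in docs.values() for term in set(d))  # document frequency

def bm25(query, doc_id, k1=1.2, b=0.75):
    doc = docs[doc_id]
    tf = Counter(doc)
    score = 0.0
    for term in query:
        if term not in tf:
            continue
        idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
        norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
        score += idf * norm
    return score

query = "rust search".split()
print(sorted(docs, key=lambda d: bm25(query, d), reverse=True))  # best match first
```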
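For rate limiting, a leaky-bucket sketch of the kind you hang off every domain so your crawler stays polite (the rate and capacity numbers are placeholders):

```python
# Leaky bucket: allow a burst of `capacity` requests, draining at `rate` per second.
import time

class LeakyBucket:
    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity
        self.level = 0.0
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Drain the bucket by however much time has passed, then try to add one drop.
        self.level = max(0.0, self.level - (now - self.last) * self.rate)
        self.last = now
        if self.level + 1 <= self.capacity:
            self.level += 1
            return True
        return False

bucket = LeakyBucket(rate=2, capacity=5)   # ~2 requests/second, burst of 5
print([bucket.allow() for _ in range(8)])  # first 5 pass, the rest are dropped
```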
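And for the text-processing bullet, Unicode normalization, a naive slugify, and a crude shingle fingerprint for near-duplicate detection. Production systems reach for SimHash/MinHash and phonetic hashes like Metaphone; this only shows the shape of the problem.

```python
# Text-cleanup sketch: normalization, slugify, and a toy shingle fingerprint.
import hashlib
import re
import unicodedata

def slugify(title):
    # Normalize, drop accents, lowercase, collapse everything else to dashes.
    text = unicodedata.normalize("NFKD", title)
    text = text.encode("ascii", "ignore").decode()
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

def fingerprint(text, shingle=4):
    # Hash every 4-word window; similar documents share most of their shingles.
    words = re.findall(r"\w+", text.lower())
    grams = {" ".join(words[i:i + shingle]) for i in range(len(words) - shingle + 1)}
    return {hashlib.md5(g.encode()).hexdigest()[:8] for g in grams}

print(slugify("Čo je nové? Search Engines in 2024!"))
a = fingerprint("the quick brown fox jumps over the lazy dog near the river")
b = fingerprint("a quick brown fox jumps over the lazy dog near the river bank")
print(f"jaccard-ish overlap: {len(a & b) / len(a | b):.2f}")
```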
- https://github.com/quickwit-oss/tantivy
- https://github.com/valeriansaliou/sonic
- https://github.com/mosuka/phalanx
- https://github.com/meilisearch/MeiliSearch
- https://github.com/blevesearch/bleve
- https://github.com/thomasjungblut/go-sstables
A lot of people new to this space mistakenly think you can just throw Elasticsearch or Postgres full-text search in front of terabytes of records and have something decent. That might work for something small, like a curated collection of a few hundred sites.