Hey all, Patrick here & one of the mentors for the Mozilla Builders "fix-the-internet" incubator https://builders.mozilla.community/.

We have 3 different offerings for this Summer. All incubator style, meaning you meet weekly or biweekly with mentors and we really try to help drive you from point A to point C.

1. $75k investment in a startup. MUST be serious about wanting to build something awesome and put in the hard work it takes to do so.

2. $16k funding in a much earlier stage project (idea stage / MVP stage). MUST be serious about the commitment it takes to get to launch.

3. OPEN LABS: these are open to the entire community and you have access to the mentors. 10 min checkins each week & peer sessions. We've had TONS of amazing projects for our Open Labs in the Spring and we hope to see TONS more for the Summer.

In terms of MISSION and what we're looking for:

We started this new incubator out of Mozilla in order to work with & invest in developers, startups, and technology enthusiasts who are building things that will shape the internet and have a positive impact without needing to hyper focus on the bottom line. We call this our "fix-the-internet" incubator.

Here's my "fix the internet" idea: build a search engine that is itself ad-free, and searches over only the ad-free segment of the web. More options: allow users to exclude sites with ads, sites with ecommerce, sites with tracking, or simply allow users to build and share lists of sites to exclude. Rationale: the unmonetized or under-monetized web was awesome, a lot of it still exists under the radar now, and it would be good to reify it as a tangible thing. Bonus 1: competitors probably won't copy your features. Bonus 2: spam won't be a big problem, as most of it contains ads.

That's an idea I've had for a few years now. I started some motions [0], but progress was slow, because of life. I wanted to start by going through the Common Crawl [1] data, both for testing purposes and to calculate a rough percentage of sites that are uBlock Origin clean.

I think that such sites would be in the ballpark of a few ‰. That would enable me to offer the contentless index for download. With delta updates and torrents for distribution it might not be that expensive, but it's something I could charge for.
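A minimal sketch of what a delta update could look like, in Python for brevity. It assumes the contentless index is just a list of clean URLs, which the post does not specify; `make_delta` and `apply_delta` are hypothetical names.

```python
def make_delta(old_urls, new_urls):
    """Describe how to get from the old snapshot to the new one."""
    old, new = set(old_urls), set(new_urls)
    return {"added": sorted(new - old), "removed": sorted(old - new)}


def apply_delta(urls, delta):
    """Apply a delta to a locally downloaded snapshot."""
    return sorted((set(urls) - set(delta["removed"])) | set(delta["added"]))


# Made-up snapshots, for illustration only.
old_snapshot = ["https://a.example/", "https://b.example/"]
new_snapshot = ["https://b.example/", "https://c.example/"]
delta = make_delta(old_snapshot, new_snapshot)
updated = apply_delta(old_snapshot, delta)
```

Only the delta would need to be shipped between releases, so its size tracks how much the clean set changed rather than how large the whole index is.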

My intention is to use AdBlock rules like EasyList to check whether or not a page is indeed clean.
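A rough sketch of that check, in Python for brevity rather than the Go or Rust mentioned in this thread. It understands only the simplest rule shape, `||domain^`, and skips everything else (options, exceptions, cosmetic rules), so it is far from a real EasyList engine. The function names and sample rules are made up for illustration.

```python
from urllib.parse import urlparse


def parse_domain_rules(lines):
    """Collect domains from plain `||domain^` rules; ignore all other rules."""
    domains = set()
    for line in lines:
        line = line.strip()
        # Skip blanks, comments, list headers, exceptions and cosmetic rules.
        if not line or line.startswith(("!", "[", "@@")) or "##" in line:
            continue
        if line.startswith("||") and line.endswith("^"):
            domain = line[2:-1].lower()
            if "/" not in domain:  # path rules are out of scope here
                domains.add(domain)
    return domains


def is_blocked(url, domains):
    """True if the URL's host, or any parent domain of it, is listed."""
    host = (urlparse(url).hostname or "").lower()
    parts = host.split(".")
    return any(".".join(parts[i:]) in domains for i in range(len(parts)))


def page_is_clean(resource_urls, domains):
    """A page counts as clean if none of its resources hit a rule."""
    return not any(is_blocked(u, domains) for u in resource_urls)


# Made-up rules in EasyList syntax, not taken from the real list.
RULES = ["! comment", "||ads.example^", "||tracker.example^", "example.org##.banner"]
domains = parse_domain_rules(RULES)
```

Relative URLs have no host and are never blocked by this sketch, which matches the intent: only third-party ad and tracking hosts should disqualify a page.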

My initial code in Go is fine, but I've lost enthusiasm for Go lately, and career-wise it's not a good fit for me (I don't have much time to learn something that isn't as useful to me professionally). So I started to rewrite it in Rust while learning it; you can laugh now (Rust Evangelism Strike Force el oh el). Rust has the advantage of a ready-to-use rules parser from Brave [2] and a presumably high-quality tokenizer from html5ever [3].

I want to use a tokenizer instead of a full parser so that I can do stream processing, which brings costs down.
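To illustrate the streaming idea, here is a sketch using Python's standard `html.parser` as a stand-in for the html5ever tokenizer. It is event-based and builds no DOM, and it accepts input in arbitrary chunks. `ResourceCollector` and `collect_urls` are hypothetical names, and the set of watched tags is an assumption.

```python
from html.parser import HTMLParser


class ResourceCollector(HTMLParser):
    """Records src/href of tags that load external resources; builds no tree."""

    # Tag name -> attribute holding the resource URL (an assumed, partial list).
    WATCHED = {"script": "src", "img": "src", "iframe": "src", "link": "href"}

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        wanted = self.WATCHED.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                self.urls.append(value)


def collect_urls(chunks):
    """Feed the document piece by piece, as it would arrive from a stream."""
    parser = ResourceCollector()
    for chunk in chunks:
        parser.feed(chunk)  # incomplete tags are buffered until the next chunk
    parser.close()
    return parser.urls


# A made-up page, split mid-attribute to show that chunk boundaries don't matter.
chunks = [
    '<html><script src="https://ads.ex',
    'ample/a.js"></script><img src="/logo.png">',
    "</html>",
]
found = collect_urls(chunks)
```

Memory use stays proportional to the largest unfinished token rather than to the page, which is the property that makes a tokenizer attractive for crawl-scale processing.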

Common Crawl data lives on S3, so the initial processing has to be done on EC2 to keep costs low.

[0] Current Go code: https://github.com/hadrianw/abracabra

[1] https://commoncrawl.org/

[2] https://github.com/brave/adblock-rust

[3] https://docs.rs/html5ever/0.25.1/html5ever/tokenizer/index.h...

EDIT:

Also, for the search part I want to use something more standalone than Elasticsearch, so that I can offer desktop search over a downloaded index. When I started with Go I wanted to use Bleve [4]; now I'm not sure, but I think Bleve is getting mature enough. I will worry about it once I have some data to search through.

One of the challenges with this whole enterprise is a small need for JavaScript parsing. There is a common pattern, used for example by Google Analytics, where a snippet of JavaScript inserts the actual script tag. But those snippets are very short, so I think they may not need a full JS VM; maybe even a tokenizer would be good enough. Browser ad blockers can rely on the page having already executed its JavaScript.
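A sketch of the "tokenizer is good enough" guess, in Python: pull string literals out of an inline script and keep the ones that look like URLs, without executing anything. The regexes, the `script_urls` name and the sample loader are all made up for illustration. It would miss URLs assembled by string concatenation and can be confused by quotes inside comments.

```python
import re

# Single- or double-quoted literals without escapes or newlines (a simplification).
SINGLE = r"'([^'\\\n]*)'"
DOUBLE = r'"([^"\\\n]*)"'
STRING_LITERAL = re.compile(SINGLE + "|" + DOUBLE)

# Absolute or protocol-relative URL with a dot somewhere after the scheme.
URL_LIKE = re.compile(r"^(?:https?:)?//[^\s/]+\.[^\s]+$")


def script_urls(js_source):
    """Return URL-looking string literals found in a piece of JavaScript."""
    urls = []
    for match in STRING_LITERAL.finditer(js_source):
        text = match.group(1) if match.group(1) is not None else match.group(2)
        if URL_LIKE.match(text):
            urls.append(text)
    return urls


# A made-up loader in the style of analytics snippets; the host is hypothetical.
SNIPPET = """
(function(w,d,s,u){var e=d.createElement(s);e.async=1;e.src=u;
d.head.appendChild(e);})(window,document,'script','https://tracker.example/t.js');
"""
loaded = script_urls(SNIPPET)
```

The extracted URLs could then go through the same rule check as ordinary `src` attributes.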

[4] https://blevesearch.com/