What does HackerNews think of readability?
A standalone version of the readability lib
For the sake of completeness: Mozilla's Readability [1] is obviously a reference in the JS world.
https://github.com/mozilla/readability
Btw, Readability is also available in a few other languages like Kotlin:
And also please note that Mozilla has their algorithm in the open here: https://github.com/mozilla/readability
- Readability (https://github.com/mozilla/readability) to strip down the page's HTML to a bare minimum.
- Turndown.js (https://github.com/mixmark-io/turndown) to convert the plain HTML to a markdown format with the GFM plugins enabled.
- Puppeteer (https://github.com/puppeteer/puppeteer) to download the page.
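Putting those three together, a minimal sketch of that pipeline in Node might look like the following (the URL and options here are placeholder assumptions, not necessarily what the author actually runs):

```javascript
// Sketch of the pipeline above: Puppeteer fetches the page, Readability strips
// it down, Turndown (with the GFM plugin) converts the result to Markdown.
// npm install puppeteer jsdom @mozilla/readability turndown turndown-plugin-gfm
const puppeteer = require("puppeteer");
const { JSDOM } = require("jsdom");
const { Readability } = require("@mozilla/readability");
const TurndownService = require("turndown");
const { gfm } = require("turndown-plugin-gfm");

async function pageToMarkdown(url) {
  // 1. Download the rendered page.
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: "networkidle2" });
  const html = await page.content();
  await browser.close();

  // 2. Strip the HTML down to the bare article.
  const dom = new JSDOM(html, { url });
  const article = new Readability(dom.window.document).parse();
  if (!article) throw new Error("Readability could not extract an article");

  // 3. Convert the cleaned HTML to GitHub-flavored Markdown.
  const turndown = new TurndownService({ headingStyle: "atx" });
  turndown.use(gfm);
  return turndown.turndown(article.content);
}

// Example (placeholder URL):
// pageToMarkdown("https://example.com/some-article").then(console.log);
```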
It costs me only a few cents to parse an entire page, and I think OP can make some money out of this if they get the pricing right.
Also, some unsolicited feedback on the API:
- An option to enable/disable JavaScript would be great, since not all pages actually need it enabled to be parsable.
- You can probably tweak the headers of the headless browser to bypass the paywalls of some sites. Some are as simple as setting the user agent to a crawler bot (like `googlebot`); see the sketch after this list.
- Maybe an option to fill in the front matter (https://jekyllrb.com/docs/front-matter/) with metadata given in the payload?
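For the first two suggestions, a rough sketch of what that looks like on the Puppeteer side (the user-agent string and URL are purely illustrative):

```javascript
// Sketch: disable JavaScript and/or spoof the user agent before loading a page.
const puppeteer = require("puppeteer");

async function fetchHtml(url, { javascript = true, userAgent = null } = {}) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Not every page needs JS to be parsable; skipping it is faster and cheaper.
  await page.setJavaScriptEnabled(javascript);

  // Some paywalls relax for crawler user agents (illustrative string only).
  if (userAgent) await page.setUserAgent(userAgent);

  await page.goto(url, { waitUntil: "domcontentloaded" });
  const html = await page.content();
  await browser.close();
  return html;
}

// e.g. fetchHtml("https://example.com/article", {
//   javascript: false,
//   userAgent: "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
// });
```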
I've been thinking about creating this, plus adding https://github.com/mozilla/readability to grab the links that are text articles and present them in-page (cleaned up: just the text, images, and the like, with all the sidebars, popups, etc. removed) instead of having to go to a 3rd-party website with all the popups and such.
It'd have to be either a personal website or a browser extension like yours, since I wouldn't be able to host a given article for anyone to read (for copyright reasons), but I can have a modified browser that loads a 3rd-party article differently.
[0] https://github.com/mozilla/readability/#isprobablyreaderable...
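For reference, a rough sketch of that in-page replacement idea as an extension content script, using the isProbablyReaderable check linked above (assumes @mozilla/readability is bundled with the extension; thresholds are left at their defaults):

```javascript
// Sketch: if the current page looks like an article, replace it in place
// with the cleaned-up version (text, images, etc.), dropping sidebars/popups.
import { Readability, isProbablyReaderable } from "@mozilla/readability";

if (isProbablyReaderable(document)) {
  // Readability mutates the DOM it is given, so parse a clone.
  const article = new Readability(document.cloneNode(true)).parse();
  if (article) {
    document.title = article.title;
    document.body.innerHTML = article.content; // cleaned article HTML
  }
}
```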
There are several open source projects for extracting web content. However, there are three extractors I've worked with that give good results:
- readability.js[1], the web extractor by Mozilla that's used in Firefox.
- dom-distiller[2], the web extractor from the Chromium team, written in Java.
- trafilatura[3], a Python package by Adrien Barbaresi from BBAW[4].
First, readability.js is, as expected, the most famous extractor. It's a single-file JavaScript library with a modest 2,000+ lines of code, released under the Apache license. Since it's in JS, you can use it wherever you want, either in a web page via a `script` tag or in a Node project.
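For the Node case, the basic usage (roughly what the README shows, with jsdom supplying the DOM; the HTML and URL here are just placeholders) is only a few lines:

```javascript
// Minimal sketch: jsdom provides a DOM, Readability parses it.
const { JSDOM } = require("jsdom");
const { Readability } = require("@mozilla/readability");

// Placeholder; in practice this is a full page you've already fetched.
const html = "<html><body><article><p>Hello world</p></article></body></html>";
const dom = new JSDOM(html, { url: "https://example.com/post" }); // url lets relative links resolve
const article = new Readability(dom.window.document).parse();

// On success, `article` has fields like title, byline, excerpt,
// content (the cleaned HTML) and textContent (plain text); on failure it's null.
```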
Next, DomDistiller is the extractor used in Chromium. It's written in Java, with a whopping 14,000+ lines of code, and can only be used as part of the Chromium browser, so you can't exactly use it as a standalone library or CLI.
Finally, Trafilatura is a Python package released under the GPLv3 license. Created in order to build a text database[5] for NLP research, it was mainly intended for German web pages. However, as development continued, it has come to work really well with other languages too. It's a bit slow compared to Readability.js, though.
All three work in a similar way: extract metadata, remove unneeded content, and finally return the cleaned-up content. The differences (that I remember) are:
- In Readability, they insist on having no special rules for any website, while DomDistiller and Trafilatura make a small exception for popular sites like Wikipedia. Because of this, if you use Readability.js on Wikipedia pages, it will show `[edit]` buttons throughout the extracted content.
- Readability has a small function to detect whether a web page can be converted to reader mode. While it's not really accurate, it's quite convenient to have.
- In DomDistiller, the metadata extraction is more thorough than in the others. It supports OpenGraph, Schema.org, and even the old IE Reading View markup tags.
- Since DomDistiller is only usable within Chromium, it has the advantage of being able to use CSS styling to determine whether an element is important. If an element is styled to be invisible (e.g. `display: none`), it is deemed unimportant. However, according to research[6], this step doesn't really affect the extraction result.
- DomDistiller also has an experimental feature to find and extract the next page on sites that split an article across several partial pages.
- For Trafilatura, since it was created for collecting web corpora, its main strength is extracting the text and the publication date of a web page. For the latter, they've created a Python package named htmldate[7] whose only purpose is to extract the publication or modification date of a web page.
- Trafilatura also has an experimental feature to remove elements that are repeated too often. The idea is that if an element occurs too often, it's not important to the reader.
I've found a benchmark[8] that compares the performance of these extractors, and it says Trafilatura has the best accuracy of the three. However, before you rush to use Trafilatura, remember that it's intended for gathering web corpora, so it's really good at extracting text content but, IIRC, not as good as Readability.js and DomDistiller at extracting a proper article with images and embedded iframes (depending on how you look at it, that could be a feature).
By the way, if you are using Go and need a web extractor, I've already ported all three of them to Go[9][10][11], including their dependencies[12][13], so have fun with them.
[1]: https://github.com/mozilla/readability
[2]: https://github.com/chromium/dom-distiller
[3]: https://github.com/adbar/trafilatura
[5]: https://www.dwds.de/d/k-web
[6]: https://arxiv.org/abs/1811.03661
[7]: https://github.com/adbar/htmldate
[8]: https://github.com/scrapinghub/article-extraction-benchmark
[9]: https://github.com/go-shiori/go-readability
[10]: https://github.com/markusmobius/go-domdistiller
[11]: https://github.com/markusmobius/go-trafilatura
Afaik https://github.com/mozilla/readability is what Firefox uses, and that's where you can report websites that don't parse well.
My general suspicion is that adhering to a simple HTML5 document structure, and possibly using microformats (https://microformats.io/), goes a long way.
Update: there's some discussion here: https://news.ycombinator.com/item?id=28301113
I'm currently trying to bring the PHP port up to speed here: https://github.com/fivefilters/readability.php
We use an older version as part of our article extraction for Push to Kindle: https://www.fivefilters.org/push-to-kindle/
A good search engine for "my stuff" and "stuff I've seen before" is not available for most people in my experience. Pinboard and similar sites fill some of that role, but only for things that you bookmark (and I'm not sure they do full-text search of the documents).
---
Two things I'd mention are:
1. Digital footprint usually means your info on other sites, not the things you've merely accessed. If I read a blog, that's not part of my footprint, but if I leave a comment on that blog, that comment is part of it. The term is also mostly used in a tracking and negative context (although there are exceptions), so you might want to change that: https://en.wikipedia.org/wiki/Digital_footprint
2. I don't really get what makes it UNIX-style (or what exactly do you mean by that? There seem to be many definitions), and the readme does not seem to clarify much, besides expecting me to notice it myself.
I work on a service called Full-Text RSS[2] that uses a PHP port of Readability, coupled with site-specific extraction rules[3], to identify and extract article content from each feed item. It then produces a full-text version of the given feed. The idea is that you subscribe to the full-text version in whichever feed reader you use and it gives you full-text articles where you had partial content before.
[1] https://github.com/mozilla/readability
I had no idea that Readability.js was available as a standalone library. That’s awesome!
Congrats on getting started.
I agree with Obsidian - I think that most people forget the maintenance time it takes to build a lifelong Knowledge Management System.
I like your idea - document similarity is a well known area in ML.
Feel free to take my Chrome Extension and use the parts where it tracks key paragraphs in an article (using a user's click/hover/attention behaviour), and use that as the corpus for your ML similarity models.
Intuitively, it makes more sense to run document similarity on key points/paragraphs than on the whole web page.
If you want the whole web page though, there's code in the Chrome Extension that uses Mozilla's Readability lib (https://github.com/mozilla/readability) to purify the web content.
FF uses this for its reader view: https://github.com/mozilla/readability
https://github.com/mozilla/readability
I use it heavily along with Puppeteer in Node and it's an awesome building block for web scraping!
I'd expect the Safari feature to work quite similarly.
There is also a JS library called readability, which is what Firefox's reader mode uses.
Check out their Github repo here - https://github.com/mozilla/readability
> You need at least one `<p>` tag around the text you want to see in Reader View and at least 516 characters in 7 words inside the text.
Source - https://stackoverflow.com/questions/30661650/how-does-firefo...
- Pretend they're a crawler such as Google and pull down the HTML, potentially executing javascript
- Once it's pulled down, clean it up using open source code such as readability https://github.com/mozilla/readability
- Store that result as a document in a nosql database
Once they have pulled the article down once they don't need to get it again.
https://github.com/buriy/python-readability
https://github.com/goose3/goose3
https://github.com/codelucas/newspaper
For a complete solution, you can also look at Wallabag, which can be self-hosted.
(I don't know how the FF reader mode is coded.)