What does HackerNews think of readability?

A standalone version of the readability lib

Language: JavaScript

Concerning tooling, I'd say you have two different worlds, JavaScript and Python, each with a series of tools to tackle such tasks. It's not easy to compare them directly because of varying software environments, and I haven't had a chance to test the JS tools thoroughly.

For the sake of completeness: Mozilla's Readability [1] is obviously a reference in the JS world.

[1]: https://github.com/mozilla/readability

I wonder if Firefox's "reader mode as a utility" might be a viable alternative for Pinboard-like "content oriented" archiving?

https://github.com/mozilla/readability

Depending upon the type of content, one might want to look into using Readability (the browser's reader view) to parse the webpage. It will give you all the useful info without the junk. Then you can put it in the DB as needed.

https://github.com/mozilla/readability

Btw, Readability is also available in a few other languages, like Kotlin:

https://github.com/dankito/Readability4J

Playing the devil's advocate: Why is Firefox Reader mode not good enough?

And also please note that Mozilla has their algorithm in the open here: https://github.com/mozilla/readability

I maintain something similar today, and I'm guessing that the OP uses some combination of the following libraries too (?):

- Readability (https://github.com/mozilla/readability) to strip down the page's HTML to a bare minimum.

- Turndown.js (https://github.com/mixmark-io/turndown) to convert the plain HTML to a markdown format with the GFM plugins enabled.

- Puppeteer (https://github.com/puppeteer/puppeteer) to download the page.
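
If that guess is right, the glue between the three might look roughly like this. This is a hypothetical sketch, not OP's actual code; it assumes the npm packages `puppeteer`, `jsdom`, `@mozilla/readability`, `turndown`, and `turndown-plugin-gfm`:

```js
const puppeteer = require('puppeteer');
const { JSDOM } = require('jsdom');
const { Readability } = require('@mozilla/readability');
const TurndownService = require('turndown');
const { gfm } = require('turndown-plugin-gfm');

async function pageToMarkdown(url) {
  // 1. Download the page (with JS executed) using Puppeteer.
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle2' });
  const html = await page.content();
  await browser.close();

  // 2. Strip the HTML down to the bare article with Readability.
  const dom = new JSDOM(html, { url });
  const article = new Readability(dom.window.document).parse();
  if (!article) throw new Error('no article content found');

  // 3. Convert the cleaned HTML to markdown with the GFM rules enabled.
  const turndown = new TurndownService();
  turndown.use(gfm);
  return turndown.turndown(article.content);
}
```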

It costs me only a few cents to parse an entire page, and I think OP could make some money out of this if they get the pricing right.

Also, some unsolicited feedback on the API:

- An option to enable/disable JavaScript would be great, since not all pages actually need it enabled to be parsable.

- You can probably tweak the headers of the headless browser to bypass some sites' paywalls. Some are as simple as setting the user agent to a crawler bot (like `googlebot`); see the snippet after this list.

- Maybe an option to fill in the front matter (https://jekyllrb.com/docs/front-matter/) with metadata given in the payload?
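
For the user-agent idea above, a snippet (hypothetical; it assumes a Puppeteer `page` is in scope, and whether it works depends entirely on the site):

```js
// Hypothetical: some paywalls serve full content to crawler user agents.
// Run this before page.goto(); `page` is a Puppeteer Page instance.
await page.setUserAgent(
  'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
);
```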

Do you accept feature requests/think this might be a good idea?

I've been thinking about creating this + adding https://github.com/mozilla/readability to grab the links that are text articles and present them in-page (and cleaned up, just the text+images+similar, removing all the sidebars, popups, etc) instead of having to go to a 3rd party website with all the popups and such.

It'd have to be either a personal website or a browser extension like yours, since I wouldn't be able to host a given article for anyone to read (for copyright reasons), but I can have a modified browser that loads a 3rd party article differently.
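
A rough content-script sketch of that idea (the `inlineArticle` helper is hypothetical; it assumes Readability is bundled into the extension and that the extension has host permissions for cross-origin fetches):

```js
// Fetch a linked article and render the cleaned-up version in-page.
// DOMParser avoids needing jsdom inside a browser/extension context.
async function inlineArticle(link) {
  const html = await (await fetch(link.href)).text();
  const doc = new DOMParser().parseFromString(html, 'text/html');
  const article = new Readability(doc).parse(); // null if no article found
  if (article) {
    const pane = document.createElement('div');
    pane.innerHTML = article.content; // text + images, minus sidebars/popups
    link.insertAdjacentElement('afterend', pane);
  }
}
```

One caveat: relative image URLs in the fetched article won't resolve against the original site with this approach, so a real version would need to rewrite them.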

It's a feature of Reader, actually. The library they use has an `isProbablyReaderable()` function to determine if a page can be simplified, and their extension uses that to decide whether Reader mode should be available.

[0] https://github.com/mozilla/readability/#isprobablyreaderable...
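
A minimal sketch of that gate (assuming the `@mozilla/readability` npm package and a DOM `document` in scope, e.g. in a browser bundle):

```js
// isProbablyReaderable runs cheap heuristics (node counts, text length)
// without doing a full parse, so it's safe to call on every page load.
const { isProbablyReaderable } = require('@mozilla/readability');

if (isProbablyReaderable(document)) {
  showReaderModeButton(); // hypothetical UI hook
}
```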

Because Reader Mode uses heuristics to extract the content of the article, and those can fail if the article's HTML is weirdly formatted. You can read more about this in the repo for Mozilla's Readability, which is what Firefox uses under the hood: https://github.com/mozilla/readability

I've been working on several web extractor projects, so I think I can share some of my findings from working on them. Granted, it's been several months since I worked on them, so I might be forgetting some things.

There are several open source projects for extracting web content. However, there are three extractors that I've worked with that give good results:

- readability.js[1], the web extractor by Mozilla that's used in Firefox.

- dom-distiller[2], the web extractor by the Chromium team, written in Java.

- trafilatura[3], a Python package by Adrien Barbaresi from BBAW[4].

First, readability.js, as expected, is the most famous extractor. It's a single-file JavaScript library with a modest 2,000+ lines of code, released under the Apache license. Since it's in JS, you can use it wherever you want, either in a web page using a `script` tag or in a Node project.
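
For illustration, a minimal browser-side sketch; loading readability.js via a `<script>` tag exposes a `Readability` constructor (my assumption here is a plain-page setup, not a bundler):

```js
// Parse a clone of the document, since Readability mutates the DOM
// it is given; parse() returns null when no article content is found.
const article = new Readability(document.cloneNode(true)).parse();
if (article) {
  console.log(article.title);       // extracted title
  console.log(article.textContent); // cleaned-up plain text
}
```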

Next, DomDistiller is the extractor used in Chromium. It's written in Java, with a whopping 14,000+ lines of code, and can only be used as part of the Chromium browser, so you can't exactly use it as a standalone library or CLI.

Finally, Trafilatura is a Python package released under the GPLv3 license. Created in order to build text databases[5] for NLP research, it was mainly intended for German web pages. However, as development continued, it has come to work really well with other languages. It's a bit slow compared to Readability.js, though.

All three work in a similar way: extract the metadata, remove unneeded content, and finally return the cleaned-up content. Their differences (the ones I remember) are:

- In Readability, they insist on making no special rules for any website, while DomDistiller and Trafilatura make a small exception for popular sites like Wikipedia. Because of this, if you use Readability.js on Wikipedia pages, it will show `[edit]` buttons throughout the extracted content.

- Readability has a small function to detect whether a web page can be converted to reader mode. While it's not really accurate, it's quite convenient to have.

- In DomDistiller, the metadata extraction is more thorough than in the others. It supports OpenGraph, Schema.org, and even the old IE Reading View markup tags.

- Since DomDistiller is only usable within Chromium, it has the advantage of being able to use CSS styling to determine whether an element is important. If an element is styled to be invisible (e.g. `display: none`), it will be deemed unimportant. However, according to research[6], this step doesn't really affect the extraction result.

- DomDistiller also has an experimental feature to find and extract the next page on sites that split their articles into several partial pages.

- As for Trafilatura, since it was created for collecting a web corpus, its main strength is extracting the text and publication date of a web page. For the latter, they've created a Python package named htmldate[7] whose sole purpose is to extract the publication or modification date of a web page.

- Trafilatura also has an experimental feature to remove elements that are repeated too often. The idea is that if an element occurs too often, it's not important to the reader.

I've found a benchmark[8] that compares the performance of the extractors, and it says that Trafilatura has the best accuracy of the three. However, before you rush to use Trafilatura, remember that it is intended for gathering a web corpus, so while it's really great at extracting text content, IIRC it's not as good as Readability.js and DomDistiller at extracting a proper article with images and embedded iframes (depending on how you look at it, that could be a feature though).

By the way, if you are using Go and need a web extractor, I've already ported all three of them to Go[9][10][11], including their dependencies[12][13], so have fun with them.

[1]: https://github.com/mozilla/readability

[2]: https://github.com/chromium/dom-distiller

[3]: https://github.com/adbar/trafilatura

[4]: https://www.bbaw.de/en/

[5]: https://www.dwds.de/d/k-web

[6]: https://arxiv.org/abs/1811.03661

[7]: https://github.com/adbar/htmldate

[8]: https://github.com/scrapinghub/article-extraction-benchmark

[9]: https://github.com/go-shiori/go-readability

[10]: https://github.com/markusmobius/go-domdistiller

[11]: https://github.com/markusmobius/go-trafilatura

[12]: https://github.com/markusmobius/go-htmldate

[13]: https://github.com/markusmobius/go-dateparser

I've had a lot of success running HTML pages through Mozilla's Readability[0] tool (actually the Go port of it[1]) before indexing them.

[0]: https://github.com/mozilla/readability

[1]: https://github.com/go-shiori/go-readability

https://outline.com/

AFAIK https://github.com/mozilla/readability is what Firefox uses, and that's where you can report websites that don't parse well.

I've just poked through both the GetPocket site (https://getpocket.com/publisher/) and Mozilla's Readability Library GitHub page (https://github.com/mozilla/readability) without seeing obvious guidelines.

My general suspicion is that adhering to a simple HTML5 document structure, and possibly using microformats (https://microformats.io/), goes a long way.

Update: there's some discussion here: https://news.ycombinator.com/item?id=28301113

Any developers who'd like to contribute to improving how article content is extracted from web pages should check out Mozilla's Readability repository: https://github.com/mozilla/readability

I'm currently trying to bring the PHP port up to speed here: https://github.com/fivefilters/readability.php

We use an older version as part of our article extraction for Push to Kindle: https://www.fivefilters.org/push-to-kindle/

I think this[0] comes close to what is used to extract text from an HTML document. Fetching can be done via any HTTP client. You'll need jsdom to convert the text to a DOM before feeding it to Readability.

[0]: https://github.com/mozilla/readability
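
A minimal sketch of that flow (assuming Node 18+ for global `fetch`, plus the `jsdom` and `@mozilla/readability` npm packages):

```js
const { JSDOM } = require('jsdom');
const { Readability } = require('@mozilla/readability');

async function extract(url) {
  const html = await (await fetch(url)).text(); // any HTTP client works here
  const dom = new JSDOM(html, { url }); // url lets relative links resolve
  return new Readability(dom.window.document).parse(); // null if no article
}
```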

Looks very much like one of the ideas I've been thinking of building! The way I planned to do it was to use a similar approach to rga for files ( https://github.com/phiresky/ripgrep-all ): have a webextension pull all webpages I visit (filtered via something like https://github.com/mozilla/readability ), then dump that into either SQLite with FTS5 or Postgres with FTS for search.
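
The storage/search half of that could be quite small. A sketch assuming Node with the `better-sqlite3` package (the `pages` schema is made up for illustration):

```js
const Database = require('better-sqlite3');
const db = new Database('pages.db');

// One FTS5 virtual table holds everything the webextension sends over.
db.exec(`CREATE VIRTUAL TABLE IF NOT EXISTS pages
         USING fts5(url, title, body)`);

// Called with the Readability output for each visited page.
function indexPage(url, article) {
  db.prepare('INSERT INTO pages VALUES (?, ?, ?)')
    .run(url, article.title, article.textContent);
}

// Full-text search across everything you've seen.
const hits = db
  .prepare('SELECT url, title FROM pages WHERE pages MATCH ?')
  .all('some search terms');
```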

A good search engine for "my stuff" and "stuff I've seen before" is not available for most people in my experience. Pinboard and similar sites fill some of that role, but only for things that you bookmark (and I'm not sure they do full-text search of the documents).

---

Two things I'd mention are:

1. Digital footprint usually means your info on other sites, not just things you've accessed. If I read a blog, that is not part of my footprint, but if I leave a comment on that blog, that comment is part of it. The term is also mostly used in a tracking and negative context (although there are exceptions), so you might want to change that: https://en.wikipedia.org/wiki/Digital_footprint

2. I don't really get what makes it UNIX-style (or what exactly you mean by that? There seem to be many definitions), and the readme does not seem to clarify much beyond expecting me to notice it by myself.

If you're trying to build one yourself, have a look at the open source Readability code[1]. It was originally developed by Arc90 and is now used by Apple and Mozilla in their browser reader views. The code has been ported to a number of different languages.

I work on a service called Full-Text RSS[2] that uses a PHP port of Readability, coupled with site-specific extraction rules[3], to identify and extract article content from each feed item. It then produces a full-text version of the given feed. The idea is that you subscribe to the full-text version in whichever feed reader you use, and it will give you full-text articles where you had partial content before.

[1] https://github.com/mozilla/readability

[2] https://www.fivefilters.org/full-text-rss/

[3] https://github.com/fivefilters/ftr-site-config

https://github.com/mozilla/readability

I had no idea that Readability.js was available as a standalone library. That’s awesome!

On a related note, does anyone know how the Safari Reader Mode works and where one can get the code? I'm able to get some of the code from the inspector, but it seems like some of the core functionality is built into the Safari engine. The alternatives out there (like Firefox's implementation, readability.js[0]) don't seem to be as good.

[0] https://github.com/mozilla/readability

Vaporware is better than going nowhere! (Get it...noware...haha).

Congrats on getting started.

I agree with Obsidian - I think that most people forget the maintenance time it takes to build a lifelong Knowledge Management System.

I like your idea - document similarity is a well known area in ML.

Feel free to take my Chrome Extension and use the parts where it tracks key paragraphs in an article (using a user's click/hover/attention behaviour) as the corpus for your ML similarity models.

Intuitively, it makes more sense to run document similarity on key points/paragraphs than on the whole web page.

If you want the whole web page though, there's code in the Chrome Extension that uses Mozilla's readability lib (https://github.com/mozilla/readability) to purify the web content.

I suspect it might be something mentioned in https://github.com/masukomi/arc90-readability. I believe it's what powers the "readability" view in modern browsers (Firefox: https://github.com/mozilla/readability, Safari).
Apparently it's a pretty complicated, holistic system for pulling out just the main text.

FF uses this for its reader view: https://github.com/mozilla/readability

Not only is Firefox's readability mode awesome, but they've also open-sourced it separately, in case you feel you can submit a PR for persistence of some kind or create your own extension that makes use of it.

https://github.com/mozilla/readability

I use it extensively along with Puppeteer in Node, and it's an awesome building block for web scraping!

If you have examples of sites where Firefox's reading mode doesn't work, you can file bugs here: https://github.com/mozilla/readability

I've no idea about Safari's reader mode, but you can have a look at how the readability feature in Firefox works: https://github.com/mozilla/readability

I'd expect the Safari feature to work quite similarly.

There is a Python library called Newspaper that is designed to do that. I believe this is what outline.com uses.

There is also a JS library called Readability, which is what Firefox's reader mode uses.

https://newspaper.readthedocs.io/en/latest/

https://github.com/mozilla/readability

Firefox's reader view is able to do that.

Check out the GitHub repo here - https://github.com/mozilla/readability

>You need at least one `<p>` tag around the text you want to see in Reader View, and at least 516 characters in 7 words inside the text.

Source - https://stackoverflow.com/questions/30661650/how-does-firefo...

Just took a look at this, here's my guess.

- Pretend they're a crawler such as Google and pull down the HTML, potentially executing JavaScript

- Once it's pulled down, clean it up using open source code such as readability https://github.com/mozilla/readability

- Store that result as a document in a NoSQL database

Once they've pulled the article down, they don't need to get it again.

I couldn't find a reference for Readability? Is the source available?

(I don't know how the FF reader mode is coded.)

Edit: https://github.com/mozilla/readability

Also, Mozilla's Readability library[0] should help you extract only the relevant content (it's what's behind Firefox's reading mode). So the only semi-difficult part is the NLP.

[0] https://github.com/mozilla/readability

Mozilla maintains a fork of Readability for Firefox's reader mode here:

https://github.com/mozilla/readability

Not exactly, though it was inspired by (and reacting to) that approach. The reader view is based on this: https://github.com/mozilla/readability

Firefox's reader mode is a JS library that parses the page. It does not share any code with Pocket. https://github.com/mozilla/readability

We've put the source we're using up on GitHub ( https://github.com/mozilla/readability ) and plan to use the same library on all platforms. Any contributions are welcome.