What does HackerNews think of readability?
A standalone version of the readability lib
For the sake of completeness: Mozilla's Readability [1] is obviously a reference in the JS world.
https://github.com/mozilla/readability
Btw, Readability is also available in a few other languages like Kotlin:
And also please note that Mozilla has their algorithm in the open here: https://github.com/mozilla/readability
- Readability (https://github.com/mozilla/readability) to strip down the page's HTML to a bare minimum.
- Turndown.js (https://github.com/mixmark-io/turndown) to convert the plain HTML to a markdown format with the GFM plugins enabled.
- Puppeteer (https://github.com/puppeteer/puppeteer) to download the page.
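Putting those three together, a minimal sketch of that pipeline in Node might look like the following (the URL and options here are placeholder assumptions, not necessarily what the author actually runs):

```javascript
// Sketch of the pipeline above: Puppeteer fetches the page, Readability strips
// it down, Turndown (with the GFM plugin) converts the result to Markdown.
// npm install puppeteer jsdom @mozilla/readability turndown turndown-plugin-gfm
const puppeteer = require("puppeteer");
const { JSDOM } = require("jsdom");
const { Readability } = require("@mozilla/readability");
const TurndownService = require("turndown");
const { gfm } = require("turndown-plugin-gfm");

async function pageToMarkdown(url) {
  // 1. Download the rendered page.
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: "networkidle2" });
  const html = await page.content();
  await browser.close();

  // 2. Strip the HTML down to the bare article.
  const dom = new JSDOM(html, { url });
  const article = new Readability(dom.window.document).parse();
  if (!article) throw new Error("Readability could not extract an article");

  // 3. Convert the cleaned HTML to GitHub-flavored Markdown.
  const turndown = new TurndownService({ headingStyle: "atx" });
  turndown.use(gfm);
  return turndown.turndown(article.content);
}

// Example (placeholder URL):
// pageToMarkdown("https://example.com/some-article").then(console.log);
```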
It costs me only a few cents to parse an entire page, and I think OP can make some money out of this if they get the pricing right.
Also, some unsolicited feedback on the API:
- An option to enable/disable JavaScript would be great, since not all pages actually need it enabled to be parsable.
- You can probably tweak the headers of the headless browser to bypass the paywalls of some sites. Some are as simple as setting the user agent to a crawler bot (like `googlebot`); see the sketch after this list.
- Maybe an option to fill in the front matter (https://jekyllrb.com/docs/front-matter/) with metadata given in the payload?
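For the first two suggestions, a rough sketch of what that looks like on the Puppeteer side (the user-agent string and URL are purely illustrative):

```javascript
// Sketch: disable JavaScript and/or spoof the user agent before loading a page.
const puppeteer = require("puppeteer");

async function fetchHtml(url, { javascript = true, userAgent = null } = {}) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Not every page needs JS to be parsable; skipping it is faster and cheaper.
  await page.setJavaScriptEnabled(javascript);

  // Some paywalls relax for crawler user agents (illustrative string only).
  if (userAgent) await page.setUserAgent(userAgent);

  await page.goto(url, { waitUntil: "domcontentloaded" });
  const html = await page.content();
  await browser.close();
  return html;
}

// e.g. fetchHtml("https://example.com/article", {
//   javascript: false,
//   userAgent: "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
// });
```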
I've been thinking about creating this, plus adding https://github.com/mozilla/readability to grab the links that are text articles and present them in-page (cleaned up: just the text, images, and the like, with all the sidebars, popups, etc. removed) instead of having to go to a 3rd-party website with all the popups and such.
It'd have to be either a personal website or a browser extension like yours, since I wouldn't be able to host a given article for anyone to read (for copyright reasons), but I can have a modified browser that loads a 3rd-party article differently.
[0] https://github.com/mozilla/readability/#isprobablyreaderable...
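For reference, a rough sketch of that in-page replacement idea as an extension content script, using the isProbablyReaderable check linked above (assumes @mozilla/readability is bundled with the extension; thresholds are left at their defaults):

```javascript
// Sketch: if the current page looks like an article, replace it in place
// with the cleaned-up version (text, images, etc.), dropping sidebars/popups.
import { Readability, isProbablyReaderable } from "@mozilla/readability";

if (isProbablyReaderable(document)) {
  // Readability mutates the DOM it is given, so parse a clone.
  const article = new Readability(document.cloneNode(true)).parse();
  if (article) {
    document.title = article.title;
    document.body.innerHTML = article.content; // cleaned article HTML
  }
}
```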
There are several open source projects for extracting web content. However, there are three extractors I've worked with that give good results:
- readability.js[1], the web extractor by Mozilla that's used in Firefox.
- dom-distiller[2], the web extractor from the Chromium team, written in Java.
- trafilatura[3], a Python package by Adrien Barbaresi from BBAW[4].
First, readability.js is, as expected, the most famous extractor. It's a single-file JavaScript library with a modest 2,000+ lines of code, released under the Apache license. Since it's in JS, you can use it wherever you want, either in a web page via a `script` tag or in a Node project.
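For the Node case, the basic usage (roughly what the README shows, with jsdom supplying the DOM; the HTML and URL here are just placeholders) is only a few lines:

```javascript
// Minimal sketch: jsdom provides a DOM, Readability parses it.
const { JSDOM } = require("jsdom");
const { Readability } = require("@mozilla/readability");

// Placeholder; in practice this is a full page you've already fetched.
const html = "<html><body><article><p>Hello world</p></article></body></html>";
const dom = new JSDOM(html, { url: "https://example.com/post" }); // url lets relative links resolve
const article = new Readability(dom.window.document).parse();

// On success, `article` has fields like title, byline, excerpt,
// content (the cleaned HTML) and textContent (plain text); on failure it's null.
```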
Next, DomDistiller is the extractor used in Chromium. It's written in Java, with a whopping 14,000+ lines of code, and can only be used as part of the Chromium browser, so you can't exactly use it as a standalone library or CLI.
Finally, Trafilatura is a Python package released under the GPLv3 license. Created in order to build a text database[5] for NLP research, it was mainly intended for German web pages. However, as development continued, it has come to work really well with other languages too. It's a bit slow compared to Readability.js, though.
All three work in a similar way: extract metadata, remove unneeded content, and finally return the cleaned-up content. The differences (that I remember) are:
- In Readability, they insist on having no special rules for any website, while DomDistiller and Trafilatura make a small exception for popular sites like Wikipedia. Because of this, if you use Readability.js on Wikipedia pages, it will show `[edit]` buttons throughout the extracted content.
- Readability has a small function to detect whether a web page can be converted to reader mode. While it's not really accurate, it's quite convenient to have.
- In DomDistiller, the metadata extraction is more thorough than in the others. It supports OpenGraph, Schema.org, and even the old IE Reading View markup tags.
- Since DomDistiller is only usable within Chromium, it has the advantage of being able to use CSS styling to determine whether an element is important. If an element is styled to be invisible (e.g. `display: none`), it is deemed unimportant. However, according to research[6], this step doesn't really affect the extraction result.
- DomDistiller also has an experimental feature to find and extract the next page on sites that split an article across several partial pages.
- For Trafilatura, since it was created for collecting web corpora, its main strength is extracting the text and the publication date of a web page. For the latter, they've created a Python package named htmldate[7] whose only purpose is to extract the publication or modification date of a web page.
- Trafilatura also has an experimental feature to remove elements that are repeated too often. The idea is that if an element occurs too often, it's not important to the reader.
I've found a benchmark[8] that compares the performance of these extractors, and it says Trafilatura has the best accuracy of the three. However, before you rush to use Trafilatura, remember that it's intended for gathering web corpora, so it's really good at extracting text content but, IIRC, not as good as Readability.js and DomDistiller at extracting a proper article with images and embedded iframes (depending on how you look at it, that could be a feature).
By the way, if you are using Go and need a web extractor, I've already ported all three of them to Go[9][10][11], including their dependencies[12][13], so have fun with them.
[1]: https://github.com/mozilla/readability
[2]: https://github.com/chromium/dom-distiller
[3]: https://github.com/adbar/trafilatura
[5]: https://www.dwds.de/d/k-web
[6]: https://arxiv.org/abs/1811.03661
[7]: https://github.com/adbar/htmldate
[8]: https://github.com/scrapinghub/article-extraction-benchmark
[9]: https://github.com/go-shiori/go-readability
[10]: https://github.com/markusmobius/go-domdistiller
[11]: https://github.com/markusmobius/go-trafilatura
Afaik https://github.com/mozilla/readability is what Firefox uses, and that's where you can report websites that don't parse well.
My general suspicion is that adhering to a simple HTML5 document structure, and possibly using microformats (https://microformats.io/), goes a long way.
Update: there's some discussion here: https://news.ycombinator.com/item?id=28301113
I'm currently trying to bring the PHP port up to speed here: https://github.com/fivefilters/readability.php
We use an older version as part of our article extraction for Push to Kindle: https://www.fivefilters.org/push-to-kindle/
A good search engine for "my stuff" and "stuff I've seen before" is not available for most people in my experience. Pinboard and similar sites fill some of that role, but only for things that you bookmark (and I'm not sure they do full-text search of the documents).
---
Two things I'd mention are:
1. Digital footprint usually means your info on other sites, not the things you've merely accessed. If I read a blog, that's not part of my footprint, but if I leave a comment on that blog, that comment is part of it. The term is also mostly used in a tracking and negative context (although there are exceptions), so you might want to change that: https://en.wikipedia.org/wiki/Digital_footprint
2. I don't really get what makes it UNIX-style (or what exactly do you mean by that? There seem to be many definitions), and the readme does not seem to clarify much, besides expecting me to notice it myself.
I work on a service called Full-Text RSS[2] that uses a PHP port of Readability, coupled with site-specific extraction rules[3], to identify and extract article content from each feed item. It then produces a full-text version of the given feed. The idea is that you subscribe to the full-text version in whichever feed reader you use and it gives you full-text articles where you had partial content before.
[1] https://github.com/mozilla/readability
I had no idea that Readability.js was available as a standalone library. That’s awesome!
Congrats on getting started.
I agree with Obsidian - I think that most people forget the maintenance time it takes to build a lifelong Knowledge Management System.
I like your idea - document similarity is a well known area in ML.
Feel free to take my Chrome Extension and use the parts where it tracks key paragraphs in an article (using a user's click/hover/attention behaviour), and use that as the corpus for your ML similarity models.
Intuitively, it makes more sense to run document similarity on key points/paragraphs than on the whole web page.
If you want the whole web page though, there's code in the Chrome Extension that uses Mozilla's Readability lib (https://github.com/mozilla/readability) to purify the web content.
FF uses this for its reader view: https://github.com/mozilla/readability
https://github.com/mozilla/readability
I use it heavily along with Puppeteer in Node and it's an awesome building block for web scraping!
I'd expect the Safari feature to work quite similarly.
There is also a JS library called readability, which is what Firefox's reader mode uses.
Check out their Github repo here - https://github.com/mozilla/readability
> You need at least one `<p>` tag around the text you want to see in Reader View and at least 516 characters in 7 words inside the text.
Source - https://stackoverflow.com/questions/30661650/how-does-firefo...
- Pretend they're a crawler such as Google and pull down the HTML, potentially executing javascript
- Once it's pulled down, clean it up using open source code such as readability https://github.com/mozilla/readability
- Store that result as a document in a nosql database
Once they have pulled the article down once they don't need to get it again.
https://github.com/buriy/python-readability
https://github.com/goose3/goose3
https://github.com/codelucas/newspaper
For a complete solution, you can also look at Wallabag, which can be self-hosted.
(I don't know how the FF reader mode is coded.)