What does HackerNews think of gumbo-parser?

An HTML5 parsing library in pure C99

Language: HTML

It uses libcurl and gumbo (https://github.com/google/gumbo-parser). Gumbo is apparently written in pure C99 (interestingly, Curl is written in the even older C89 standard). It would have been more amusing if the article had taken that into account and used C99.
oops... I saw a markup parser and automatically thought XML, but you are right! HTML is actually a whole different beast!

As it turns out, Nim also seems to have an HTML parser [1]. I'm guessing something like Google's gumbo [2] could be more reliable, but you would have to write Nim bindings for it.

1: https://nim-lang.org/docs/htmlparser.html

2: https://github.com/google/gumbo-parser

Think of error cases and malicious input instead of well-written APIs.

When I was testing Gumbo [1] on all the HTML files in Google's index, I ran into one that had 48,000 levels of nested HTML tags. (Ironically, Gumbo itself handled this fine, but the test script I used to verify the output died.) I posted it to Memegen with a link to the original source URL, and then got called out for crashing Chrome. Apparently I wasn't the only one who didn't think about recursion limits in a parser. (Both bugs have since been fixed.)

What was the wayward HTML file? It was an XML file served with the wrong content type. The file contained 48,000 self-closing tags. When you parse an unknown element with a self-closing tag in HTML, it ignores the trailing />, treats it as an ordinary unknown element, and happily keeps generating a deeper and deeper DOM.

A stack overflow that results in a segfault is a pretty serious DoS vulnerability in a JSON parser. You could probably take down a good portion of the Internet by sending JSON requests to their AJAX endpoints that consist of 1M of {.
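The recursion-limit point above can be sketched in Python. This is a minimal illustration, not Gumbo's actual code: the toy tokenizer and the names `max_depth_iterative` and `MAX_DEPTH` are hypothetical. It mimics only the rule described above (a self-closing unknown element is treated as an ordinary open tag) and tracks depth with a counter rather than recursion, so deeply nested input cannot overflow the call stack.

```python
import re

# Hypothetical limit; a real parser would choose this from its own constraints.
MAX_DEPTH = 100_000

# Toy tokenizer: matches <name ...> and </name>. A trailing "/" before ">"
# is swallowed by [^>]* and ignored, as for unknown elements in HTML.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")

def max_depth_iterative(markup, limit=MAX_DEPTH):
    """Return the deepest nesting level, using a counter instead of recursion."""
    depth = 0
    deepest = 0
    for match in TAG_RE.finditer(markup):
        is_close = match.group(1) == "/"
        if is_close:
            # Toy behavior: any close tag pops one level, names are not matched.
            depth = max(depth - 1, 0)
        else:
            depth += 1
            if depth > limit:
                raise ValueError("nesting exceeds limit")
            deepest = max(deepest, depth)
    return deepest
```

Calling `max_depth_iterative("<item/>" * 48000)` returns 48000 without touching Python's recursion limit, while passing a small `limit` raises `ValueError` instead of letting the depth grow unbounded.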

[1] https://github.com/google/gumbo-parser

Ah very cool, had seen various python libraries about HTML5, but not gumbo (or at least I had starred it).

https://github.com/google/gumbo-parser

Is the modified version you use a personal version or a well-known fork?

"He developed the well respected Reva Forth, used by hobbyists around the world. "

Reva Forth always seemed nice but I could never get it working on anything other than Windows.

As such, I have little faith that "8th" will place any value on portability either, e.g., that it will be ported to BSD, Plan 9 or another RPi-compatible OS.

Meanwhile, there are plenty of more portable, open source Forths to choose from.

Example:

    ftp -4o cforthu.zip https://codeload.github.com/pahihu/cforthu/zip/master

Some HN commenters are questioning the peculiar security claims.

The author discloses that 8th depends on a number of third party libraries. Would this mean that each of those third parties would also have to make similar security claims to 8th?

For example, 8th uses an HTML5 parsing library from Google called gumbo-parser.

"Non-goals:

Security. Gumbo was initially designed for a product that worked with trusted input files only. We're working to harden this and make sure that it behaves as expected even on malicious input, but for now, Gumbo should only be run on trusted input or within a sandbox. Gumbo underwent a number of security fixes and passed Google's security review as of version 0.9.1."

source: https://github.com/google/gumbo-parser

It may be the case the only input parsed by 8th is trusted or "within a sandbox" but without the source code how would this be verified?

The best library I've found for this sort of thing is gumbo. https://github.com/google/gumbo-parser

With its help I've created scrapers and crawlers that digest even the most disgusting HTML.

The main reason is so you can statically analyze the template source code. For example, pair Gumbo [1] with an HTML-based templating language and you can identify every image referenced in the HTML and automatically coalesce them into a sprite sheet. You can replace dynamically-generated images with data: URLs and load the data at the end of the page, minimizing latency. You can validate your templates to make sure they never generate invalid HTML. You can run any inline scripts found through Closure Compiler or Babel to minify them (or, in Babel's case, support ES6 directly in templates). You can do CSS renaming automatically on all elements.
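As a rough sketch of the image-collection step described above, here is a version that uses Python's standard-library `html.parser` as a stand-in for Gumbo (an assumption made for illustration; it is not an HTML5-conformant parser). The class name `ImageCollector`, the helper `find_images`, and the sample template are hypothetical.

```python
from html.parser import HTMLParser

class ImageCollector(HTMLParser):
    """Collect the src attribute of every <img> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute names
        # arrive lowercased.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)

def find_images(template_html):
    """Return the image URLs referenced in the template, in document order."""
    collector = ImageCollector()
    collector.feed(template_html)
    collector.close()
    return collector.sources

# Hypothetical template fragment.
sample = '<div><img src="logo.png"><p>Hi <IMG SRC="icons/star.png" alt="*"></p></div>'
print(find_images(sample))  # prints ['logo.png', 'icons/star.png']
```

With the list of sources in hand, a build step could decide which images to merge into a sprite sheet; that part is left out here.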

A secondary reason is to constrain your users so they don't end up writing constructs that will be impossible to manipulate with tooling. Early in Bazel's [2] history at Google, BUILD files were just ordinary Python, and you could use any Python code you wanted, and all that mattered were the calls you made into the build rules. This worked great, with people writing all sorts of list comprehensions and conditionals to minimize code duplication. And then tools came out to reformat BUILD files, and automatically query for dependencies, and detect dead binaries, and all sorts of other cool stuff. The tools were a lot more useful than the ability to include arbitrary code, but (because of the halting problem) they would choke whenever they encountered arbitrary flow constructs. So now arbitrary code is limited to the portions of the build system that explicitly allow it.
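To illustrate why a restricted file format is easier on tooling, here is a minimal sketch using Python's `ast` module. It is not how Bazel works: the rule name `cc_binary` in the samples, the helpers `check_build_file` and `rule_names`, and the list of rejected constructs are all hypothetical choices for illustration.

```python
import ast

# Hypothetical policy: constructs that a simple tool cannot easily reason about.
FORBIDDEN = (ast.For, ast.While, ast.If, ast.ListComp, ast.FunctionDef, ast.Lambda)

def check_build_file(source):
    """Return sorted (line, construct) pairs for each disallowed construct."""
    problems = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, FORBIDDEN):
            problems.append((node.lineno, type(node).__name__))
    return sorted(problems)

def rule_names(source):
    """List the function names called as top-level statements."""
    names = []
    for stmt in ast.parse(source).body:
        if isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Call):
            if isinstance(stmt.value.func, ast.Name):
                names.append(stmt.value.func.id)
    return names

# A purely declarative file: every rule is visible without running anything.
declarative = 'cc_binary(name = "app", srcs = ["main.cc"])\n'

# A file with control flow: the rules only exist once the loop is executed.
dynamic = 'for n in ["a", "b"]:\n    cc_binary(name = n, srcs = [n + ".cc"])\n'

print(rule_names(declarative), check_build_file(declarative))
print(rule_names(dynamic), check_build_file(dynamic))
```

For the declarative sample the tool sees one `cc_binary` call and no problems; for the loop-based sample it sees no top-level rules at all and flags the `For` on line 1, which is the situation where reformatters and dependency queries start to choke.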

The EDSL approach (where you just use a regular programming language and a bunch of calls into the engine) can be a good idea, but you give up a lot in tooling. Usually it's most useful in the early stages of a project, where you don't have manpower to write the tooling anyway, and then becomes a legacy hassle as you assign more engineers, write more code, and get more leverage from writing tools.

[1] https://github.com/google/gumbo-parser

[2] http://bazel.io/

Yeah, one of the compilers I wrote just used JSON as the AST, with it being generated by a GUI interface. Another used HTML with annotations (although go figure, I wrote an HTML parser [1] for it, because there weren't any C++ options at the time that didn't bring along a browser engine). A third had a custom front-end but then emitted Java source code as the back-end.

The interesting thing is that the more experience you get, the more alternatives you find to writing your own language. Could you use Ruby or Python as the front-end, much like Rails [2], Rake [3], or Bazel [4]? Could you build up a data-structure to express the computation, and then walk that data-structure with the Interpreter pattern? [5] Could you get away with a class library or framework, much like how Sawzall has been replaced by Flume [6] and Go libraries within Google?
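The "data structure plus Interpreter pattern" alternative can be sketched in a few lines of Python. The node classes (`Num`, `Var`, `Add`, `Mul`) and the `evaluate` function are hypothetical names chosen for illustration, not taken from any of the projects mentioned.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Num:
    value: float

@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class Add:
    left: object
    right: object

@dataclass(frozen=True)
class Mul:
    left: object
    right: object

def evaluate(node, env):
    """Walk the expression tree and compute its value in the given environment."""
    if isinstance(node, Num):
        return node.value
    if isinstance(node, Var):
        return env[node.name]
    if isinstance(node, Add):
        return evaluate(node.left, env) + evaluate(node.right, env)
    if isinstance(node, Mul):
        return evaluate(node.left, env) * evaluate(node.right, env)
    raise TypeError(f"unknown node: {node!r}")

# (x + 2) * 3, built as plain data rather than parsed from a custom language.
expr = Mul(Add(Var("x"), Num(2)), Num(3))
print(evaluate(expr, {"x": 4}))  # prints 18
```

Because the computation is ordinary data, the host language supplies the "front-end" for free, and other walkers (a pretty-printer, a validator) can be added without writing a parser.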

In general, you want to use the tool with the least power that actually accomplishes your goals, because every increase in power is usually accompanied by an increase in complexity. There are a bunch of solutions with less power than a full programming language that can still get you most of the way there.

[1] https://github.com/google/gumbo-parser

[2] http://rubyonrails.org/

[3] http://rake.rubyforge.org/

[4] http://bazel.io/

[5] https://en.wikipedia.org/?title=Interpreter_pattern

[6] http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/F...

Gumbo HTML parser:

https://github.com/google/gumbo-parser

It's one of the most conformant (if not the most conformant - 0.10.0 passes all html5lib-trunk tests) standalone HTML parsing libraries out there. It has third-party bindings in nearly a dozen different languages. The API is simple, the code is robust and well-tested, and being written in C, it's often a fair bit faster than alternatives.

According to that page, Rakudo is still a preview release in development. The Python 3.0-3.4 series have been full production releases.

Now, I know that open-source projects are sometimes quite conservative with versioning, and one project's 0.9 may be more stable than another's 2.0. (I maintain an HTML parser [1] that's still on 0.9.3 and yet is more robust and better tested than one that is on 3.8.2.) Is this actually the case with Rakudo, though? You can write real production software with Python 3.4; can you with Rakudo #86?

[1] https://github.com/google/gumbo-parser

It's because at some point, your code will encounter low-tech technology that doesn't let you set the tabstop. This might be a terminal app, or an actual dumb terminal, or a cut & paste into a textbox on a webpage, or a third-party tool that expects spaces only, or a refactoring tool that you wrote yourself but don't have time to get working with tabs, or a parser that can count characters but has no logic for tabs. When this happens, your tab-based formatting will get messed up, while your spaces-based formatting will look exactly the way you wrote it.

A lot of mysteries about why people don't use clearly-superior technological solutions are solved by understanding that ubiquity is a feature, in many cases the most important feature. I don't use vim because it's the best text editor; I use it because I can be reasonably sure every single UNIX-like system I ever log into will have it, and once it's in my fingers from having to learn it on a remote server, I might as well use it for daily programming. I don't use HTTP because it's an efficient protocol; I use it because every single device, library, and language speaks it. I didn't write Gumbo [1] in C because I like the language; I did it because every modern language can bind to C libraries, and so this lets a maximum number of people use it.
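The point about binding to C libraries can be sketched with Python's `ctypes`. To stay self-contained, this example calls `strlen` from the system C library rather than Gumbo; it assumes a POSIX-like system where the C library can be located, and the wrapper name `c_strlen` is hypothetical.

```python
import ctypes
import ctypes.util

# Locate the C standard library. If find_library returns None, CDLL(None)
# falls back to the symbols already loaded into the process (POSIX only).
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

def c_strlen(text):
    """Return the byte length of text as computed by C's strlen."""
    return libc.strlen(text.encode("utf-8"))

print(c_strlen("gumbo"))  # prints 5
```

A binding for a larger C library follows the same shape: load the shared object, declare argument and return types, then wrap the raw calls in functions that feel native to the host language.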

[1] https://github.com/google/gumbo-parser

The "move" started a couple years ago. I open-sourced a Google project about 18 months ago and the recommended hosting solution was GitHub:

https://github.com/google/gumbo-parser

(When it was posted on Hacker News, one of the first comments noted how ironic it was that it was hosted on GitHub and not Google Code:

https://news.ycombinator.com/item?id=6210282)

The rules for building a successful open-source project are pretty similar to those for building a successful startup: make something people want.

Glancing at the project description, the biggest problem I see is that pretty much the only value provided is "my own opinions". I have my own opinions; I don't need yours. I don't even use the widely-used boilerplate packages like Twitter Bootstrap because invariably they do something contrary to the needs of my site, and then ripping them out is more effort than not using them to begin with.

I've got a couple moderately-successful open-source projects with Write Yourself a Scheme in 48 Hours [1] and Gumbo HTML5 Parser [2], and what they have in common is that they solve a problem that people have that they're too lazy to fix themselves, in a way that takes less effort than diving into the problem would. In Write Yourself a Scheme, that problem is "I want to learn Haskell", and the laziness is "but I don't want to have to butt my head against these annoying monad things, and specific API calls, and undocumented type-system corners. I want all that explained to me." For Gumbo, the problem is "I want to parse HTML", and the laziness is "but I don't want to spend my time implementing the 400+ clauses of the spec".

If you think in terms of "What can I do for other people that they don't want to do for themselves?", you will end up with many more users. Projects that do all the fun stuff but none of the hard stuff end up fun, but useless. Projects that do all the hard stuff and leave the fun stuff to other people get used.

[1] http://en.wikibooks.org/wiki/Write_Yourself_a_Scheme_in_48_H...

[2] https://github.com/google/gumbo-parser

> There isn't an obvious way to parse broken HTML, and every HTML parser does it differently

At least with HTML 5 we have both a spec (http://www.whatwg.org/specs/web-apps/current-work/multipage/) and a library to parse it (https://github.com/google/gumbo-parser)

1) 20% time still exists. I used it to write an HTML parser [1] that's had some modest success, and I have coworkers that have 20%ed on robotics, quantum computation, elementary school education, Project Loon, Flu Trends, and a variety of other interesting things. Google Now came out of a 20% project. It is something that you have to take a lot of initiative on to pull off successfully, but the opportunity is still available.

[1] https://github.com/google/gumbo-parser

BeautifulSoup is great, as long as you're using the open source HTML5 parser from Google. https://github.com/google/gumbo-parser