As it turns out, Nim also seems to have an HTML parser [1]. I'm guessing something like Google's Gumbo [2] would be more reliable, but you would have to write Nim bindings for it.
When I was testing Gumbo [1] on all the HTML files in Google's index, I ran into one that had 48,000 levels of nested HTML tags. (Ironically, Gumbo itself handled this fine, but the test script I used to verify the output died.) I posted it to Memegen with a link to the original source URL, and then got called out for crashing Chrome. Apparently I wasn't the only one who didn't think about recursion limits in a parser. (Both bugs have since been fixed.)
What was the wayward HTML file? It was an XML file served with the wrong content type. The file contained 48,000 self-closing tags. When an HTML parser encounters an unknown element written with self-closing syntax, it ignores the trailing />, treats the tag as an ordinary start tag, and happily keeps generating a deeper and deeper DOM.
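To make the effect concrete, here is a toy model of the two interpretations. It is only a sketch: the regex tokenizer and the `max_depth` helper are made up for illustration and are not how Gumbo (or any real HTML5 parser) is implemented. It just counts how deep the open-element stack would get if `/>` is honored (XML-style) versus ignored (HTML-style for unknown elements).

```python
import re

# Toy tokenizer (illustrative only): matches <name>, <name/> and </name>.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)\s*(/?)>")


def max_depth(markup: str, honor_self_closing: bool) -> int:
    """Return the deepest nesting reached while scanning the markup."""
    depth = 0
    deepest = 0
    for match in TAG.finditer(markup):
        closing, _name, self_closing = match.groups()
        if closing:
            # End tag: pop one level (never below zero in this toy model).
            depth = max(depth - 1, 0)
        elif self_closing and honor_self_closing:
            # XML-style: the element opens and closes immediately.
            deepest = max(deepest, depth + 1)
        else:
            # HTML-style for unknown elements: "/>" is ignored,
            # so the element stays open and the next one nests inside it.
            depth += 1
            deepest = max(deepest, depth)
    return deepest


doc = "<item/>" * 48000
xml_depth = max_depth(doc, honor_self_closing=True)
html_depth = max_depth(doc, honor_self_closing=False)
```

Read as XML the document is flat; read as HTML every tag nests inside the previous one, which is where a recursive tree walker runs out of stack.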
A stack overflow that results in a segfault is a pretty serious DoS vulnerability in a JSON parser. You could probably take down a good portion of the Internet by sending JSON requests to sites' AJAX endpoints that consist of 1 MB of {.
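One common defense is to reject overly deep input before handing it to a recursive parser. The sketch below assumes the underlying parser recurses once per nesting level; `nesting_depth`, `guarded_loads`, and the limit of 64 are hypothetical names and values chosen for illustration, not part of any particular library.

```python
import json


def nesting_depth(text: str) -> int:
    """Scan iteratively and return the deepest bracket nesting outside strings."""
    depth = 0
    deepest = 0
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False      # character after a backslash is literal
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "[{":
            depth += 1
            deepest = max(deepest, depth)
        elif ch in "]}":
            depth -= 1
    return deepest


def guarded_loads(text: str, max_depth: int = 64):
    """Parse JSON only if its nesting stays within max_depth (arbitrary limit)."""
    if nesting_depth(text) > max_depth:
        raise ValueError("JSON nesting exceeds max_depth")
    return json.loads(text)
```

The pre-scan uses a loop and a counter, so it cannot itself overflow the stack no matter how many opening braces the request contains.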
https://github.com/google/gumbo-parser
Is the modified version you use a personal version or a well-known fork?
Reva Forth always seemed nice but I could never get it working on anything other than Windows.
As such, I have little faith that "8th" will place any value on portability either, e.g., that it will be ported to BSD, Plan 9, or any other RPi-compatible OS.
Meanwhile, there are plenty of more portable, open source Forths to choose from.
Example:
ftp -4o cforthu.zip https://codeload.github.com/pahihu/cforthu/zip/master
Some HN commenters are questioning the peculiar security claims. The author discloses that 8th depends on a number of third-party libraries. Would this mean that each of those third parties would also have to make similar security claims to 8th?
For example, 8th uses an HTML5 parsing library from Google called gumbo-parser.
"Non-goals:
Security. Gumbo was initially designed for a product that worked with trusted input files only. We're working to harden this and make sure that it behaves as expected even on malicious input, but for now, Gumbo should only be run on trusted input or within a sandbox. Gumbo underwent a number of security fixes and passed Google's security review as of version 0.9.1."
source: https://github.com/google/gumbo-parser
It may be the case that the only input parsed by 8th is trusted or "within a sandbox", but without the source code, how would this be verified?
With its help I've created scrapers and crawlers that digest even the most disgusting HTML.
A secondary reason is to constrain your users so they don't end up writing constructs that will be impossible to manipulate with tooling. Early in Bazel's [2] history at Google, BUILD files were just ordinary Python, and you could use any Python code you wanted, and all that mattered were the calls you made into the build rules. This worked great, with people writing all sorts of list comprehensions and conditionals to minimize code duplication. And then tools came out to reformat BUILD files, and automatically query for dependencies, and detect dead binaries, and all sorts of other cool stuff. The tools were a lot more useful than the ability to include arbitrary code, but (because of the halting problem) they would choke whenever they encountered arbitrary flow constructs. So now arbitrary code is limited to the portions of the build system that explicitly allow it.
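As a rough illustration of why the restriction helps tooling, here is a sketch of a dependency query that works only when the file is a flat list of rule calls with literal arguments. The rule names, the `collect_deps` helper, and the sample inputs are all made up for this example; this is not how Bazel's own tools are implemented.

```python
import ast


def collect_deps(build_source: str) -> dict[str, list[str]]:
    """Map each rule's name to its deps, accepting only literal rule calls."""
    deps = {}
    for stmt in ast.parse(build_source).body:
        # Accept only bare top-level calls such as cc_library(name=..., deps=[...]).
        if not (isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Call)):
            raise ValueError(f"line {stmt.lineno}: only rule calls are allowed")
        try:
            # literal_eval refuses anything that needs to be executed.
            kwargs = {kw.arg: ast.literal_eval(kw.value)
                      for kw in stmt.value.keywords}
        except ValueError:
            raise ValueError(f"line {stmt.lineno}: arguments must be literals") from None
        if "name" not in kwargs:
            raise ValueError(f"line {stmt.lineno}: rule has no name")
        deps[kwargs["name"]] = kwargs.get("deps", [])
    return deps


DECLARATIVE = '''
cc_library(name = "parser", deps = [":tokenizer", ":utf8"])
cc_binary(name = "demo", deps = [":parser"])
'''

PROGRAMMATIC = '''
for n in ["a", "b"]:
    cc_library(name = n, deps = [":" + d for d in COMMON])
'''

graph = collect_deps(DECLARATIVE)
```

The declarative file can be answered by reading the syntax tree alone. The loop version cannot: the tool would have to run the code to know which rules exist, so the sketch simply rejects it.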
The EDSL approach (where you just use a regular programming language and a bunch of calls into the engine) can be a good idea, but you give up a lot in tooling. Usually it's most useful in the early stages of a project, where you don't have manpower to write the tooling anyway, and then becomes a legacy hassle as you assign more engineers, write more code, and get more leverage from writing tools.
[1] https://github.com/google/gumbo-parser
[2] http://bazel.io/
The interesting thing is that the more experience you get, the more alternatives you find to writing your own language. Could you use Ruby or Python as the front-end, much like Rails [2], Rake [3], or Bazel [4]? Could you build up a data structure to express the computation, and then walk that data structure with the Interpreter pattern? [5] Could you get away with a class library or framework, much like how Sawzall has been replaced by Flume [6] and Go libraries within Google?
In general, you want to use the tool with the least power that actually accomplishes your goals, because every increase in power is usually accompanied by an increase in complexity. There are a bunch of solutions with less power than a full programming language that can still get you most of the way there.
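For the "build a data structure and walk it" option, a minimal sketch might look like the following. The node classes (`Num`, `Var`, `Add`, `Mul`) and the `evaluate` function are hypothetical and only meant to show the shape of the Interpreter pattern on a tiny arithmetic language.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Num:
    value: float


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


@dataclass(frozen=True)
class Mul:
    left: "Expr"
    right: "Expr"


Expr = Union[Num, Var, Add, Mul]


def evaluate(expr: Expr, env: dict) -> float:
    """Walk the expression tree and compute its value under env."""
    if isinstance(expr, Num):
        return expr.value
    if isinstance(expr, Var):
        return env[expr.name]
    if isinstance(expr, Add):
        return evaluate(expr.left, env) + evaluate(expr.right, env)
    if isinstance(expr, Mul):
        return evaluate(expr.left, env) * evaluate(expr.right, env)
    raise TypeError(f"unknown node: {expr!r}")


# 2 + x * 3, built as plain data rather than parsed from a custom syntax.
program = Add(Num(2), Mul(Var("x"), Num(3)))
result = evaluate(program, {"x": 4})
```

Because the program is ordinary data, the same tree can be printed, optimized, or checked by other walkers without writing a parser for a new language.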
[1] https://github.com/google/gumbo-parser
[3] http://rake.rubyforge.org/
[4] http://bazel.io/
[5] https://en.wikipedia.org/?title=Interpreter_pattern
[6] http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/F...
https://github.com/google/gumbo-parser
It's one of the most conformant (if not the most conformant - 0.10.0 passes all html5lib-trunk tests) standalone HTML parsing libraries out there. It has third-party bindings in nearly a dozen different languages. The API is simple, the code is robust and well-tested, and being written in C, it's often a fair bit faster than alternatives.
Now, I know that open-source projects are sometimes quite conservative with versioning, and one project's 0.9 may be more stable than another's 2.0. (I maintain an HTML parser [1] that's still on 0.9.3 and yet is more robust and better tested than one that is on 3.8.2.) Is this actually the case with Rakudo, though? You can write real production software with Python 3.4; can you with Rakudo #86?
A lot of mysteries about why people don't use clearly-superior technological solutions are solved by understanding that ubiquity is a feature, in many cases the most important feature. I don't use vim because it's the best text editor; I use it because I can be reasonably sure every single UNIX-like system I ever log into will have it, and once it's in my fingers from having to learn it on a remote server, I might as well use it for daily programming. I don't use HTTP because it's an efficient protocol; I use it because every single device, library, and language speaks it. I didn't write Gumbo [1] in C because I like the language; I did it because every modern language can bind to C libraries, and so this lets a maximum number of people use it.
https://github.com/google/gumbo-parser
(Ironically, when it was posted on Hacker News, one of the first comments noted that it was hosted on GitHub and not Google Code:
Glancing at the project description, the biggest problem I see is that pretty much the only value provided is "my own opinions". I have my own opinions; I don't need yours. I don't even use the widely-used boilerplate packages like Twitter Bootstrap because invariably they do something contrary to the needs of my site, and then ripping them out is more effort than not using them to begin with.
I've got a couple of moderately successful open-source projects with Write Yourself a Scheme in 48 Hours [1] and Gumbo HTML5 Parser [2], and what they have in common is that they solve a problem that people have that they're too lazy to fix themselves, in a way that takes less effort than diving into the problem would. In Write Yourself a Scheme, that problem is "I want to learn Haskell", and the laziness is "but I don't want to have to butt my head against these annoying monad things, and specific API calls, and undocumented type-system corners. I want all that explained to me." For Gumbo, the problem is "I want to parse HTML", and the laziness is "but I don't want to spend my time implementing the 400+ clauses of the spec".
If you think in terms of "What can I do for other people that they don't want to do for themselves?", you will end up with many more users. Projects that do all the fun stuff but none of the hard stuff end up fun, but useless. Projects that do all the hard stuff and leave the fun stuff to other people get used.
[1] http://en.wikibooks.org/wiki/Write_Yourself_a_Scheme_in_48_H...
At least with HTML5 we have both a spec (http://www.whatwg.org/specs/web-apps/current-work/multipage/) and a library to parse it (https://github.com/google/gumbo-parser).