
Howdy! It is definitely the case that most production languages use hand-built parsers. In talking with these compiler developers, I've found they are obsessed with control (speed, error reporting, ...) and want very specific data structures built during the parse. They also tend to use existing lexer infrastructure that is baked into their development environments. I commented more on this topic here: http://stackoverflow.com/questions/16503217/antlr-for-commer...

The thing to remember is that the vast majority of parsers out there are not for these production language compilers. Compare the number of people you know that have built parsers for a DSL, data, documents, or whatever to the number of people you know that have built compilers. ANTLR's niche is your everyday parsing needs. It generates fast ALL(*) parsers in a multitude of languages and accepts all grammars without complaint, with the minor constraint that it cannot handle indirect left recursion. (Direct left recursion, as in expression rules, is totally okay.) For a speed shootout with other tools, see the OOPSLA paper http://www.antlr.org/papers/allstar-techreport.pdf
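For instance, a directly left-recursive expression rule like this is accepted as-is by ANTLR 4, which rewrites it internally (a minimal illustrative grammar, not taken from any real project; precedence falls out of the order of the alternatives):

```
grammar Expr;
expr : expr '*' expr   // direct left recursion: fine in ANTLR 4
     | expr '+' expr
     | INT
     ;
INT  : [0-9]+ ;
WS   : [ \t\r\n]+ -> skip ;
```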

Excellent discussion!


I've heard many compilers started with yacc/bison, but later went hand-written for better error messages.

Currently, jq (pretty much awk for JSON: https://stedolan.github.io/jq/) uses bison, but it sometimes produces unhelpful syntax error messages. I'm not the maintainer or anything, but trying to make its error messages more precise (i.e., better diagnoses) was a nightmare.

While a generator can't match a custom parser for control, is it plausible to have arbitrarily precise hooks in ANTLR for diagnostic messages?

BTW, unfortunately that's a dead link: http://stackoverflow.com/questions/16503217/antlr-for-commer...


> Compare the number of people you know that have built parsers for a DSL, data, documents, or whatever

The vast majority of those people that I know are not even aware of tools like ANTLR, and usually resort to regexp-based abominations instead, unless they happen to know about bison/yacc (slightly more likely than ANTLR).

I agree those are cases where a tool like ANTLR might very well be preferable to someone hacking together a regexp-driven mess, but my personal experience is that the people most likely to know about these tools are also the people most likely to know how to handwrite reasonably clean parsers.

Some subset certainly will opt for the tools some of the time (I have done that myself, even though I much prefer handwritten parsers; I've also written parser-generation tools myself, out of frustration with the existing ones, yet I'm still frustrated enough by my own tools to end up handwriting parsers... (EDIT: Clarified; I could use a good parser to check my English sometimes)), but the parser-generation field is a peculiar one.

I have no doubt that sooner or later someone will crack this nut (maybe ANTLR will be that tool one day) and handwriting parsers will go the way of handwriting assembler, becoming something you do only in rare cases, but we're still far from there.

Note that I'm not saying ANTLR is a bad tool. I'm saying this is a space where a large part of the potential audience is very picky and difficult to convince (myself included...), and where another large part doesn't know the field at all and is unaware of the tools.


Thanks for the response. Yes I think we are agreeing -- I have 2 of your books, and have read many of your ANTLR-related papers, and they were definitely the thing that taught me the most about top-down parsing. (I also used to share an office with Guido van Rossum and I remember he had your books too.)

I ported the POSIX shell grammar to both ANTLR v3 and v4 (which basically meant changing yacc-style BNF to EBNF). But as mentioned, I discovered that the grammar only covers about 1/4 of the language. bash generates code from that same grammar using yacc, but fills in the rest with hand-written code. Every other shell I've encountered uses a hand-written parser. The bash maintainers say they regret using yacc:

http://www.aosabook.org/en/bash.html

I agree with you that there is a Pareto or long tail distribution in parser use cases. Most languages CAN use something like ANTLR or bison. But the parent was making a different claim:

> What an odd statement, in light of the innumerable deployments of Bison / ANTLR parsers you certainly use at least once a day (if you spend any time at all in a terminal).

I would say that is FALSE, because most of the parsers your keystrokes pass through are HAND-WRITTEN, because of the Pareto distribution. 99% of anyone's usage goes through probably a dozen or so parsers, and they are either hand-written or generated by custom code generators, not general-purpose tools like ANTLR or yacc.

-----

As feedback from a user of parsing tools, you might also be interested in my article here:

https://news.ycombinator.com/item?id=13628412

Someone is asking if there are any parsing tools that generate a "lossless syntax tree".

Also, based on my experience with ANTLR v3 vs. v4, I asked the question: why use a concrete syntax tree at all? Nobody answered that question in the comments. I don't understand why it is a good representation, other than that you might not want to clutter your grammar with semantic actions ("pure declarative syntax").

To me, the parse tree / CST seems resource-heavy while containing unnecessary information, yet it lacks some crucial information, like where the whitespace and comments are.

To summarize my article, I'm researching code representations in the wild for both style-preserving source translation (like go fix, lib2to3 in Python) and auto-formatting (like go fmt).

It's definitely possible I misunderstood something since my experience was relatively limited, but I have read a lot of the docs and bought the books.


> why use a concrete syntax tree at all?

Do you mean instead of an AST? I find the syntax tree better for non-compiler applications like translators.


My claim is that neither the parse tree/CST nor the AST is good for applications like translators. Instead I defined another term, Lossless Syntax Tree, which is the data structure I want:

http://www.oilshell.org/blog/2017/02/11.html

I researched "production" implementations, and found that they use something like a Lossless Syntax Tree (not an AST or CST):

https://github.com/oilshell/oil/wiki/Lossless-Syntax-Tree-Pa...

Examples: Clang, Microsoft's Roslyn platform, RedBaron/lib2to3 for Python, scalameta, and Go. The defining property of the LST is that it can be round-tripped back to the original source. This is called out in this C# design doc, along with some conventions for associating whitespace with syntax tree nodes:

https://github.com/dotnet/roslyn/wiki/Roslyn-Overview
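To make the round-trip property concrete, here's a small Python sketch (my own illustration, not code from any of the projects above): the stdlib ast module discards comments and spacing, while the raw token stream keeps enough trivia to reconstruct the source exactly:

```python
import ast
import io
import tokenize

src = "x  =  1  # the answer\n"

# An AST is lossy: comments and the extra spaces are gone.
print(ast.unparse(ast.parse(src)))          # x = 1

# The token stream (with positions) round-trips losslessly.
tokens = list(tokenize.generate_tokens(io.StringIO(src).readline))
print(tokenize.untokenize(tokens) == src)   # True
```

An LST is essentially a parse tree built over that kind of lossless token stream, so edits can preserve everything the author typed.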

What do you think of that claim? (If you prefer not to use this deep comment thread, feel free to contact me by e-mail instead at andychup@gmail.com.)


An interesting idea. Not sure anyone is working on that.


Interesting, though that framework is very similar to previous work using Box combinators. Here, we don't require any work from a language expert. We simply sniff your project and then make new files look like those. Handling a new language requires no coding.


A small follow-up to Jurgen's post: Python's indentation is meaningful, so any change to it would mean we changed the program. In that sense, Python is not a good target for this tool. Also, as a simple implementation expedient for this version, I assume that '\n' is not significant.


The hard part of building a code formatter by hand is coding all the formatting rules, not the parsing. All formatters are based upon parsers so that is a constant across them. Creating a grammar from exemplars is still an unsolved problem.


Yep, there's no definition of "good style". The tool simply makes new files look like the rest of your project.


I should add Adaptive LL(*), ALL(*), of ANTLR 4 to the mix here. It handles any grammar you give it and generates a correct parser, with one small caveat: no indirect left recursion. It's the culmination of 25 years of focused effort to take LL-based parsing to the limit of power while maintaining simplicity and efficiency. See the tool shootout in the OOPSLA '14 paper I just presented: http://www.antlr.org/papers/allstar-techreport.pdf Until we change the requirements of a parser generator, I'm done. :)

