Decision trees are the fundamental building block of gradient boosting machines and Random Forests™, probably the two most popular machine learning models for structured data. Visualizing decision trees is a tremendous aid when learning how these models work and when interpreting models. Unfortunately, current visualization packages are rudimentary and not immediately helpful to the novice. So, we've created a general package called animl for scikit-learn decision tree visualization and model interpretation.
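For anyone who wants to poke at this before installing animl, scikit-learn itself ships a rudimentary text renderer for trees; a minimal sketch (dataset and depth chosen arbitrarily for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a shallow tree on the classic iris dataset (chosen purely for illustration).
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# export_text renders the tree as ASCII: internal lines show the split
# feature/threshold (decision nodes), leaf lines show the predicted class.
text = export_text(clf, feature_names=list(iris.feature_names))
print(text)
```

This is exactly the kind of bare-bones output the package aims to improve on.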
This is cool. I like it, and will probably use it in my work, but it feels like there’s a lot going on. I don’t like how some of the final leaf nodes seem to be shown differently than the nodes higher up. Sometimes different chart types, sometimes reversed axes. I would also recommend using swarm plots for your regression scatter plots. Swarm plots are sexy, but not in the laughably uncomfortable way of the very similar violin plot.
Yep, the leaves are predictor nodes whereas internal nodes are decision nodes. They are doing different things so we figured we should show them using different visualizations.
Wow, I wondered why you put a TM on Random Forests. I guess it is a trademark of Salford Systems, which is kind of weird. Maybe we can just call them random forests and ignore that.
> Although owners of trademarked names may suggest otherwise, publishers are not obligated to denote the trademark status of a name when that name is mentioned in text. Authors representing trademark owners frequently feel obligated to use the trademark or registered-trademark symbol (™ or ®) after the first mention of their product names but often do not use these symbols consistently to indicate the trademark status of other names not owned by their particular sponsor or employer.
The people who own the trademark may feel obligated to use those marks, but nobody else ever is.
There's a lot of "folk law" (that is, urban legends repeated by the ignorant) surrounding this concept, so if you think I'm wrong, please do yourself and the rest of us a favor and research good cites to show that there's actual law saying I'm wrong. Thanks.
I'm often guilty of this too - but we really should put the (tm) there. It's nice that they made code of the algorithm publicly available and all they ask is that we respect their trademark in return. I think that's more than fair. :)
(I discussed this a few years ago with the co-inventor of random forests, Adele Cutler, and she confirmed that this is something she wants to see happen.)
Not the answer to your question, but in case it helps anyone: trademarks are unrelated to patents. You can use a random forest, but you cannot call it a “random forest”. “Aleatory jungle” is fine, though.
"stochastic treeset". Sounds way more scientific, which can be required to convince a pointy-haired boss. "Random" forest sounds... well, I can flip a coin too, how is that going to solve my problem?
For the same reason, "naive" Bayes classifiers are very hard to sell, to the point that I stopped naming them and now just say "a very fast machine learning algorithm", unless specifically asked.
Well, as the author of ANTLR, I will disagree with your assessment of the tool as you can imagine. It never claimed to be a general tool. Specifically, it cannot handle indirect left-recursion; hence, not a general context free grammar parser. It is however the sweet spot between generality and speed. That quote was having a bit of fun using a tiny bit of hyperbole. If you take a look at our academic paper you will see that the general parsers are like walking a minefield. One never knows when they will hit a landmine and the parser takes overnight to parse a file. It took me 30 years to find the sweet spot in power and speed (with the help of Sam Harwell). I welcome the introduction of new tools, but I'm not sure your assessment is accurate nor is your understanding of the parsing landscape.
Also as the paper says, "GLR return multiple parse trees (forests) for ambiguous grammars because they were designed to handle natural language grammars, which are often intentionally ambiguous. For computer languages, ambiguity is almost always an error." When was the last time you wanted to parse a language where the same syntax meant two different things? C++ and a few other cases, sure, but most languages are specifically designed to be unambiguous.
You are welcome to use a tool that handles all context free grammars, but the speed and ambiguity issues are not something I care to deal with.
You also mischaracterize ANTLR's handling of left recursion. It handles immediate left recursion automatically through a rewrite, such as for expression rules. It does not handle indirect left recursion. You may not have seen ANTLR 4 and are basing your assessment on ANTLR 3? There is no aborting at compile time for left recursion, except in the unusual case of indirect left recursion.
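For readers unfamiliar with the rewrite being discussed: a directly left-recursive rule like `expr : expr '-' NUM | NUM ;` can be transformed into the loop form `expr : NUM ('-' NUM)* ;`, which is the kind of rewrite ANTLR 4 applies for immediate left recursion. A hand-rolled sketch of the rewritten rule (the token format here is my own invention for illustration):

```python
def eval_expr(tokens):
    """Parse/evaluate the rewritten rule  expr : NUM ('-' NUM)*  left-associatively.

    tokens alternates numbers and '-' strings, e.g. [5, '-', 2, '-', 1].
    The loop replaces the left-recursive call, so there is no infinite descent,
    and left associativity is preserved: the running value folds in from the left.
    """
    it = iter(tokens)
    value = next(it)            # first NUM
    for op in it:               # each '-' folds the next NUM in from the left
        assert op == '-'
        value -= next(it)
    return value

print(eval_expr([5, '-', 2, '-', 1]))  # (5 - 2) - 1 = 2, not 5 - (2 - 1)
```

The indirect case (`a : b ... ; b : a ... ;`) has no such local rewrite, which is why it's excluded.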
> That quote was […] using a tiny bit of hyperbole.
Then you certainly won't have a problem changing the wording to make it true, right? The Owl author explains the limitations of his software up-front, why don't you?
> It never claimed to be a general tool.
The documentation does give the impression.
> If you take a look at our academic paper
Consider that HN is populated by craftsmen and businessmen, not scientists. We judge the tools on how well they work in practical terms.
> One never knows when they will hit a landmine and the parser takes overnight to parse a file.
No wonder, I see it's O(n⁴) for ALL(*)/ANTLR. You need to keep up with the current state of the art, which is O(n³).
> It took me 30 years
Time and effort do not count for anything; the quality of the end result does.
> I'm not sure […] your understanding of the parsing landscape [is accurate]
I can see the practical results of the various software I tried. Does the software get the task done? Yes, fine, that's a candidate for further consideration. No, fine, I can eliminate it. Documentation does not say…? Now one has to come up with experiments to find out the flaws and limitations, multiplied by every single user. That's a waste of people's time.
> When was the last time you wanted to parse a language where the same syntax meant two different things? […] ambiguity is almost always an error.
That was 2014, and I did not design that language. It does contain ambiguities, it can't be helped. No one gains anything by simply wishing it weren't so; there was a task to be done. There exists software that can cope, you should take that as an incentive to improve yours.
> but the speed and ambiguity issues are not something I care to deal with.
… or don't improve. But why would you refuse to? Is not your reputation at stake?
> You also mischaracterize ANTLR's handling of left recursion. […] It does not handle indirect left recursion.
That's what I meant. I was incorrect when I wrote "any grammar that's left-recursive".
It always amazes me that so many "famous" people still take time to read HN and respond to questions (even insulting ones)! I used Antlr4 in the past and have to say it's a well-designed system, though at the time (2016) the Python runtime, which I was most interested in, was not very stable.
I've implemented several parser generators myself (nothing as polished as Antlr) and worked on a "rapid prototyping" parser generator tool a while ago (https://github.com/adewes/parsejoy). The idea was a tool that could create parsers on the fly without requiring compilation of any code, and let the user control and modify parsing logic using Lua if necessary. I started writing it in Go but switched to C++ later in order to get more predictable performance.
In the tool I implemented several parsing strategies, notably "parsing expression grammars" (PEGs) as well as a GLR parser (based on the Elkhound paper). And while I find GLR parsing powerful, it's not without problems, as e.g. errors are hard to debug. That said, it's pretty amazing that you can throw almost any grammar at a GLR parser and it will (usually) be able to parse it. As you said, though, writing a "bad" grammar can yield exponential time complexity on some input strings.
PEGs are also nice in general but require you to put more work into the grammar, as you are basically required to resolve ambiguities yourself using the "and" or "not" operators. Also, a naive implementation of a PEG parser will have horrid performance, as it basically evaluates all possible alternatives via backtracking until it hits a valid one, and for real-world languages (I've implemented e.g. a Python parser for testing) this will ruin your performance due to the large nesting depth of the rules (Python has around 8-12 levels of rules, I think). Using a prefix tree can alleviate this, as can packrat parsing, though memoizing parse results comes at its own cost in memory allocation.
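To make the packrat point concrete, here is a toy memoized PEG-style recognizer (grammar invented for illustration). The `lru_cache` on each rule, keyed by input position, is the packrat memo table: it trades memory for the elimination of repeated backtracking over the same (rule, position) pairs.

```python
from functools import lru_cache

def parse(text):
    """Recognize:  Expr <- Term ('+' Term)* ;  Term <- '(' Expr ')' / [0-9]+

    Each rule takes a start position and returns the end position on
    success, or None on failure. Memoizing these results is packrat parsing.
    """
    @lru_cache(maxsize=None)
    def expr(i):
        j = term(i)
        if j is None:
            return None
        while j < len(text) and text[j] == '+':
            k = term(j + 1)
            if k is None:
                break
            j = k
        return j

    @lru_cache(maxsize=None)
    def term(i):
        if i < len(text) and text[i] == '(':
            j = expr(i + 1)
            if j is not None and j < len(text) and text[j] == ')':
                return j + 1
            return None          # ordered choice: '(' alternative failed
        j = i
        while j < len(text) and text[j].isdigit():
            j += 1
        return j if j > i else None

    return expr(0) == len(text)

print(parse("(1+2)+3"))  # True
print(parse("(1+"))      # False
```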
Anyway, in terms of popularity and number of actually implemented grammars, Antlr4 is probably by far the most relevant OS parser generator out there. Thanks for writing it :)
One suggestion I have is to consider improving the tooling for tokenization, as it's often a problem that can be quite challenging in itself. For example, tokenizing Python code is not easy, as it's not context-free (due to the indentation that indicates nesting), and there are several kinds of newlines to distinguish (those that occur inside bracketed expressions vs. those that occur outside of them).
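Python's own standard library illustrates both points: its `tokenize` module tracks indentation state to synthesize INDENT/DEDENT tokens, and tracks bracket nesting to classify newlines. A quick look:

```python
import io
import tokenize

src = "if x:\n    y = (1 +\n         2)\n"

# generate_tokens tracks indentation and bracket nesting: the indented block
# produces INDENT/DEDENT tokens, and the newline inside the parentheses is
# emitted as NL (non-logical), not NEWLINE.
names = [tokenize.tok_name[tok.type]
         for tok in tokenize.generate_tokens(io.StringIO(src).readline)]
print(names)
```

None of this bookkeeping falls out of a plain regular-expression lexer, which is why the tooling gap is felt.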
Howdy. Agreed. It'd be nice to have a simpler "match x if NOT followed by y" and something to handle context-sensitive lexical stuff like Python's indentation. I often just send all characters to the parser and do scannerless parsing. :)
Gradient boosting machines (GBMs) are currently very popular and so it's a good idea for machine learning practitioners to understand how GBMs work. The problem is that understanding all of the mathematical machinery is tricky and, unfortunately, these details are needed to tune the hyper-parameters. (Tuning the hyper-parameters is required to get a decent GBM model unlike, say, Random Forests.) Our goal in this article is to explain the intuition behind gradient boosting, provide visualizations for model construction, explain the mathematics as simply as possible, and answer thorny questions such as why GBM is performing “gradient descent in function space.” We've split the discussion into three morsels and a FAQ for easier digestion. Written by Terence Parr and Jeremy Howard.
>"The problem is that understanding all of the mathematical machinery is tricky and, unfortunately, these details are needed to tune the hyper-parameters."
You don't need to understand anything about the math to run a random, or grid, or Bayesian optimization, or whatever search of the hyperparameter space.
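For instance, a plain grid search with scikit-learn requires no insight into the boosting math at all; a sketch with a tiny synthetic dataset and an arbitrarily chosen parameter grid:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, purely for illustration.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Cross-validated exhaustive search over a toy hyper-parameter grid.
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={"max_depth": [2, 3], "learning_rate": [0.05, 0.1]},
    cv=3,
).fit(X, y)

print(search.best_params_)
```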
True, people use a grid search, but I am always very uncomfortable using things as black boxes. How does tree depth affect generality etc...? Effectively using a model means understanding your tools, in my view, but easy to get started w/o the math as you say!
Besides having a basic understanding of what the parameters do (this is depth, this is learning rate, etc) I don't see what insights are to be gained. The optimal parameters depend on the particulars of your dataset, that's why everyone just does a search.
Maybe I am wrong, does this tutorial contain a derivation from the math that shows something like "if your data has these properties then you should set maximum depth to be this value and learning rate to be this value, and dropout to be that value"?
Complex ML models can behave very counterintuitively in response to simple hyperparameter changes, which is why it's more pragmatic to check a lot of combinations (e.g. grid search). CPU time is cheap, research time isn't.
I'd think this is much more useful than anything about the math. How much of what's described there can you deduce from the math; isn't this all just figured out by playing around with it?
The main point of this article is really to explain how gradient boosting works and why. The math is really there to show what the algorithm looks like in its general form. The discussion of parameters was really just a bit of motivation. Think of this as a good explanation of why it is performing gradient descent in function space. That tends to be very hard to explain.
Ahh it’s because of people like you that I’ll always have a job. He’s right, folks: please don’t try to understand the math behind it; actually, the less you know the better.
Please don't get personal or swipey in HN comments. Rather, the idea is: if you have a substantive point to make, make it thoughtfully; if you don't, please don't comment until you do.
Where did I say not to try to understand the math behind it? I say it's not required to tune hyperparameters.
To me, it's doubtful that someone whose first impulse is to generate strawman arguments is doing a good job of analyzing data. That's the basis for NHST, which ML tools like this are trying to get away from...
I’m not sure about the connection to category theory. This is mostly an attempt to explain why this model works, that it is performing gradient descent in a particular space. We find that extremely challenging to explain to students. I would be interested to know if you feel the article helps in that regard. Thanks
My intuition from reading the explanations is you are performing gradient descent on gradient descent. If you understand gradient descent I don't see how this is challenging to explain to students. GBM just refines the problem space to a smaller context, but it's still the same mathematical operation of approximation. I think it gets confusing because of that 'recursive' nature. Disconnecting the explanation from the math that it's based on might be simpler to explain.
The key insight seems to be that chasing residuals (for MSE) or sign vectors (for MAE) is chasing a vector (ie direction not just magnitude) and that vector is also a gradient. So chasing residual is performing gradient descent.
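That correspondence is easy to see in code: for squared error, the negative gradient of the loss with respect to the current prediction is exactly the residual, so fitting each new weak learner to the residuals takes one gradient step in "prediction space." A from-scratch sketch with regression stumps (all details here are illustrative, not the article's implementation):

```python
import numpy as np

def fit_stump(x, r):
    """Fit a one-split regression stump to targets r; return a predict function."""
    best = None
    for t in np.unique(x)[:-1]:                    # candidate thresholds
        left, right = r[x <= t], r[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lv, rv = best
    return lambda z: np.where(z <= t, lv, rv)

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = np.sin(x) + rng.normal(0, 0.1, 200)

F = np.full_like(y, y.mean())                      # F_0: the best constant model
for _ in range(50):
    residuals = y - F                              # = -dL/dF for L = (y - F)^2 / 2
    F = F + 0.3 * fit_stump(x, residuals)(x)       # gradient step with rate 0.3

print(((y - F) ** 2).mean())                       # training MSE, well below var(y)
```

Each round nudges the prediction vector F in the direction of steepest loss decrease, which is the "gradient descent in function space" claim in miniature.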
Having more than one way of explaining it can be helpful. Sign vector, direction - these things can be described in terms of orders of sets, monotonicity of compositional functions yielding recursive structures. I haven't done the work, so take this with a handful of salt.
I know that as a student who has taken machine learning classes, half the time it feels like the way I'm being taught is also like this ridiculous way of speaking to me to describe how my own learning is going. Do I understand the problem correctly, do I have to jump ahead, or take a step back. The language can sometimes matter, so the only reason I recommend category theory is that it helps take a step outside of the space that is describing all these successive optimizations. It allows one to describe the overall structure of how each mathematical operation interacts in relation to the others - in terms of sets of both numbers and functions that are either partially ordered or totally ordered, and from that, I would think more things could then be said about the relation of each independent optimization function to the context (problem space) it is contained in. Something confusing seems to be that the actual calculation can have an unknown effect on the resulting function - so being able to think relationally about an individual calculation relative to the full computation - I dunno - that just seems very interesting to me.
Again, handful of salt, I'm no machine learning expert nor am I an expert in category theory, and I'm sure I'm not being as precise as I'd want to be if this was something I did career wise (I just code stuff). Hobbyist interest that is a remnant of a time I once believed I could work on a PhD.
Point is, I'm just making sure you are just as well aware as I am of myself, that I just see an interesting connection in how machine learning is growing, and I like category theory. It's the math of math, or the logic of math, something like that. At the very least, having more than one way of seeing the problem can't hurt, can it? If they both yield the same statements, that's at least saying something slightly more concrete?
I'm not sure if I have any key insights to offer, just that the balance of each individual calculation versus the whole direction of machine learning seems to be something of profound importance, at least from my perspective. Being able to generalize and say 'there exists an ordered structure or there does not' on top of it - that seems like something I vaguely identify as important. Allowing to at least differentiate between computable function spaces versus ones that cycle, which could tie all that back into the computation from which it came, which I suppose the ideal is, program programs that program programs?
Just rambling though, thanks for humoring me!
I don't know what you mean by chasing residual but I'm assuming it's that little tiny margin of error you just can't seem to catch up to. I don't know whether this is possible, but so much of my insight on that is based on real life. The only places I've found reason to use any machine learning techniques is to describe things about computable functions.
I'll definitely look further into what you've said in the future in a precision sense, it does certainly seem interesting. I personally am just never quite sure what I understand and what I don't.
I was also surprised when I saw that there was no standard notation for Jacobian matrices. We use the numerator notation in the article, but point out that there are papers that use the denominator notation. I think I remember from engineering school that we used numerator notation so we stuck with that.
This is what index notation is good for, and I encourage everyone to learn it. Jacobians are dx_a/dx_b, two indices, and clearly b belongs to the derivative. Whether it's rows or columns is an implementation detail of how you're storing these numbers.
Index notation also seems natural for programming: an element A[i,j] or a slice Z[3,4,:] are precisely this.
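A numerical sketch of that convention (forward differences, toy function chosen for illustration): J[a, b] holds df_a/dx_b, with a indexing outputs and b indexing inputs, regardless of which layout a paper calls "numerator" or "denominator."

```python
import numpy as np

def jacobian(f, x, eps=1e-6):
    """Forward-difference Jacobian with J[a, b] = d f_a / d x_b."""
    y = f(x)
    J = np.zeros((y.size, x.size))
    for b in range(x.size):            # column b: perturb input x_b
        dx = np.zeros_like(x)
        dx[b] = eps
        J[:, b] = (f(x + dx) - y) / eps
    return J

f = lambda x: np.array([x[0] * x[1], x[0] + 2.0 * x[1]])
J = jacobian(f, np.array([3.0, 4.0]))
print(J)   # analytically [[x1, x0], [1, 2]] = [[4, 3], [1, 2]]
```

Whether J is stored row-major or transposed is, as the parent says, an implementation detail; the indices carry the meaning.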