I understand your concern, but I don't really think the problem lies with Clojure people: when I started explaining the structure to other people and mentioned that most operations have O(log n) runtime, people dismissed Clojure as too slow for practical purposes. Part of this is because programmers in general, not Clojure devs in particular, associate O(log n) with "much slower than O(1)". However, this data structure is far faster than the "O(log n)" people are generally used to.
And this is the problem. As you yourself say, most programmers usually associate/assume "Big-O notation is intended to describe how the time required for an operation grows as the number of items being handled gets large".
But formally, this is not the case at all. O(g(x)) tells you that the worst-case input to the algorithm will require fewer operations (usually operations, sometimes memory or other resources) than the function g multiplied by some positive constant. g is usually determined by the size of the input. (This definition is also somewhat simplified, but it illustrates the problem.)
We can say that ArrayList.add has O(n) runtime and that quicksort has O(n²) runtime, but they both also have O(n³) and O(n!) runtime, because Big-O only gives an upper bound.
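Written out, the standard definition behind this (still slightly simplified) reads:

```latex
f(n) \in O(g(n)) \iff \exists\, c > 0,\ \exists\, n_0 \ge 0 \ \text{such that}\ f(n) \le c \cdot g(n) \ \text{for all}\ n \ge n_0
```

Since this is only an upper bound, any g growing at least as fast as f qualifies, which is exactly why those absurdly loose bounds are formally true.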
In addition, Big-O does not describe how the algorithm runs in the average case, or how frequent the worst case is. In and of itself, Big-O notation tells us almost nothing about an algorithm – you can barely conclude anything from it. I guess that is part of why it has become such a simplified term with a rather wibbly-wobbly definition from time to time.
So when you say that calling it "practically O(1)" is lying, I agree if this were an academic context. But for the average programmer (regardless of language!), O(g(x)) is usually thought of as "the worst-case runtime of an algorithm". Since the actual runtime is – well – effectively constant, I considered it okay.
---
That being said, I have formally explained the persistent vector if you're interested in the academic version. The background chapter in http://hypirion.com/thesis.pdf should cover it.
I understand that log-time algorithms have a marketing problem. I think the correct way to address that is the one I already suggested: give benchmark numbers for lookup, relative to ArrayList, for a typical collection size (or maybe for a few sample sizes -- 32, 1024, and 32768, perhaps). These will clearly show both that lookup is fast on small seqs and that it slows down logarithmically.
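As a rough sketch of the kind of comparison I mean (using Scala's immutable Vector, which uses the same 32-way trie design, as a stand-in for Clojure's persistent vector; the timing loop is illustrative, not a rigorous benchmark):

```scala
import scala.collection.immutable.Vector

object LookupBench {
  // Crude timing helper: returns elapsed nanoseconds for the body.
  def time[A](body: => A): Long = {
    val t0 = System.nanoTime(); body; System.nanoTime() - t0
  }

  def main(args: Array[String]): Unit = {
    for (size <- Seq(32, 1024, 32768)) {
      val arr = new java.util.ArrayList[Int](size)
      (0 until size).foreach(i => arr.add(i))
      val vec = Vector.range(0, size)
      // One million random lookups per structure, same index sequence.
      val idx = Array.fill(1000000)(scala.util.Random.nextInt(size))
      var sink = 0
      val tArr = time { idx.foreach(i => sink += arr.get(i)) }
      val tVec = time { idx.foreach(i => sink += vec(i)) }
      println(f"n=$size%6d  ArrayList: ${tArr / idx.length}%3d ns/op  Vector: ${tVec / idx.length}%3d ns/op")
    }
  }
}
```

(A real benchmark would need warmup and something like JMH to be trustworthy, but even this shape of output makes the logarithmic slowdown, and its small absolute size, visible.)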
I would be more sympathetic to your argument that O(log n) is "practically" O(1) if you would put it in context. It's certainly true that how much the difference matters depends on the application. For someone doing real-time work (HFT, for example) it could be critical. For the average web app, it matters extremely little; the programmer productivity gains offered by functional collections far outweigh the small cost in CPU cycles. I totally agree with that argument, and have made it myself many times, but I still wouldn't say that log time is "practically" constant time for all applications.
Agreed on the benchmarks – that's the next (and last) part on the list I'm going to do with this series.
And yes, of course, at some point constant factors break into applications. Would it be better if I added a short comment saying "practically O(1) for non-HPC applications", or would you phrase it differently to give it better context?
Ah, good -- numbers will clarify the situation considerably.
Personally my preference would be to argue that for applications that aren't terribly performance-sensitive, log time in general just isn't that bad. Even binary trees are quite usable for many purposes. (I believe the Linux kernel uses red-black trees for various things.) For a general-purpose sequence data structure that will be used in a variety of ways, a very good case can be made that it's worth accepting log-time lookup in order to get log-time insertion, subsequence, and concatenation -- operations that all take linear time on an ArrayList (and linear time will kill you far sooner than log time!).
On top of that, Hickey, following Bagwell, has done a great job at keeping the constant factors down, as (I presume) the benchmark numbers will show.
Another point I would make is that true random access into a seq is not the only way elements are retrieved. Often this is done via an iterator. I presume the Clojure seq iterator takes amortized constant time per element (it should); if so, I think it would be worth pointing that out.
All that said, I guess I could live with "practically O(1) for non-HPC applications".
Sorry I was so combative. I had not looked closely at Clojure seqs, so I do appreciate this series of posts you have written.
Instead of telling people that it's O(log n), you should just show benchmarks.
Hitting L1 cache takes about 1 ns but main memory takes about 100 ns. So although you can say accessing an element in an array is always O(1) = O(100), that hides a factor of 100 depending on how it is accessed. Having a log in the big O usually means the algorithm is jumping around a lot and is no longer cache-friendly, so that worries me more than whether it is log₂(x) or log₂(x)/5.
That's a red herring: you're asking why the majority of software is written in imperative languages instead of functional ones, which is a very complex question with very many factors. It is not an argument which invalidates or counters the author's claim, namely that "mostly functional languages" is infeasible and only increases the bugs caused by side effects.
"Constant time in practice" is notably different from plain "constant time".
I can always sort a 32-bit int array in worst-case linear time using counting sort, which in theory beats quicksort's worst-case runtime of O(n²). Is it better in practice? Of course not: the constant factor is just too high, both in time and in memory.
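A sketch of the idea, restricted to small non-negative keys (a full 32-bit range would need a 2³²-entry count table, which is exactly the constant-factor problem):

```scala
// Counting sort: O(n + k) time and O(k) extra space for keys in [0, k).
// Linear in n, but the hidden cost is the k-sized table you must
// allocate and scan regardless of how small n is.
def countingSort(xs: Array[Int], k: Int): Array[Int] = {
  val counts = new Array[Int](k)
  xs.foreach(x => counts(x) += 1)       // tally each key
  val out = new Array[Int](xs.length)
  var i = 0
  for (v <- 0 until k; _ <- 0 until counts(v)) { out(i) = v; i += 1 }
  out
}
```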
I've managed to do the opposite: Scala didn't stick, Clojure did. I guess there's some subjective reason why Clojure or Scala does or doesn't stick with someone.
However, it's false that Clojure doesn't have a company behind it: Cognitect is certainly backing Clojure by developing ClojureScript and Clojure itself, offering Clojure consulting, and building the Datomic database and the Pedestal "framework".
And if you ask me, it's just so good that we haven't needed a revision in 20 years.
What about threading and/or concurrency? CLTL doesn't mention it, yet it is something which is now considered a requirement for real, general purpose programming languages.
Could you elaborate on how generics make code less safe? In my experience, the lack of generics is much more type-unsafe, as you often have to do casts at runtime (and handle their failures) when you implement your own data structures. In contrast, languages with generics check those at compile time.
Just for reference, a single goroutine has an 8 KB stack. Having one million of these things in memory at once would require 8 GB of memory. Obviously the scheduler kicks in, but 8 KB chunks of memory still have to be allocated and freed for every goroutine started and stopped.
Using val King = 13; val Queen = 12; in Scala is not equivalent to using Clojure keywords, as their printed form will be 13 and 12, not :king and :queen. Reading debugging output or logs would force you to manually convert those values back.
Ordering for free is valuable, I guess, but it sort of depends on the situation. Sometimes face cards are worth 10, other times they are worth 11, 12 and 13. If you use val King = 10;, then it suddenly becomes impossible to distinguish between face cards and tens.
Right, if you wanted to have an ordering around, you would use an enum. The enum's values would print as Ace, 2, 3, 4, ..., Jack, Queen, King, and then when you wanted to implement a particular game the game could define the mapping from rank -> set of possible values. You wouldn't want to map each rank to a single value, since in commonly played games aces are worth both 1 and 11 or both 1 and 14.
If you didn't want the ordering (or the compiler warning when your blackjack implementation omits queens), you could use Scala's symbols, which are more or less the same as Clojure's keywords:
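Something along these lines (a sketch for Scala 2, where Symbol literals like 'king were still available):

```scala
// Symbols are interned by name, much like Clojure keywords: two symbols
// with the same name compare equal, and they print as their name.
val king  = Symbol("king")
val queen = Symbol("queen")

println(king.name)                  // prints king
println(king == Symbol("king"))    // prints true
println(king == queen)             // prints false
```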
No, your first sentence is wrong. The following definition stores the ranks as unboxed integers, but prints them differently:
case class Rank private (rank: Int) extends AnyVal {
  override def toString() = rank match {
    case 1  => "A"
    case 11 => "J"
    case 12 => "Q"
    case 13 => "K"
    case _  => rank.toString
  }
}

object Rank {
  val Ace = Rank(1)
  val Two = Rank(2)
  // ...
  val King = Rank(13)
}
Author here: I think you talk about the persistent hashmaps, not the persistent vectors. I've been looking for papers explaining Clojure's persistent collections, and Bagwell seems to cover the hash maps and hash sets quite well. However, I've not seen a paper on the persistent vectors, which was quite a bummer, and that was the reason I started explaining them in the first place.
If you have a reference to a paper explaining something similar (or the actual implementation), I'd love to put it in the post for others.
Looking at the source, the persistent vectors are virtually identical to Bagwell's paper. Rich did add a couple of tweaks, namely moving the bitvector that indicates which slots of a node are occupied from being a word in the node object to being embedded in the 64-bit integers stored in each node slot. When a node is filled enough to span two cache lines (around 9 slots on typical hardware with 64-byte lines) and the next desired index fragment is the 9th slot or higher, this avoids touching the first cache line, potentially saving a cache miss. This is why the nodes are 32-way: 32 bits for the bitvector and 32 bits for the offset into the underlying storage array fit in one 64-bit word, which can be written atomically (inside a transient, obviously). Rich goes through this in one of his talks, but I don't recall which.
The modification to go from mutable to immutable isn't an invention either. Anyone who's read any of the functional data structure literature will be familiar with path copying being one of the two general ways of making any data structure persistent.
From the perspective of these data structures there's little difference between a vector with integer indexes and a hashmap. The hashmap just requires a preliminary step of hashing the key to an integer.
"The immutable vector data structure as pioneered by the programming language Clojure [4] strikes a good balance between read and write performance and supports many commonly used programming patterns in an efficient manner. In Clojure, immutable vectors are an essential part of the language implementation design. Ideal Hash Tries (HAMTs) [1] were used as a basis for immutable hash maps and the same structure, 32-way branching trees, was used for immutable vectors."
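The 32-way branching makes lookup a short chain of radix steps, each level consuming 5 bits of the index. A minimal sketch of that lookup (node layout simplified; the real implementations also keep a tail buffer and other optimizations):

```scala
// Simplified persistent-vector trie: an internal node holds up to 32
// children; a leaf holds up to 32 values.
sealed trait Node[A]
case class Internal[A](children: Array[Node[A]]) extends Node[A]
case class Leaf[A](values: Array[A]) extends Node[A]

// Walk from the root using 5-bit slices of the index, most significant
// first. `shift` is 5 * (tree height - 1), so a depth-2 tree uses 5.
def lookup[A](root: Node[A], index: Int, shift: Int): A = {
  var node = root
  var level = shift
  while (level > 0) {
    node = node.asInstanceOf[Internal[A]].children((index >>> level) & 31)
    level -= 5
  }
  node.asInstanceOf[Leaf[A]].values(index & 31)
}
```

With 32-way branching the depth stays tiny (a billion elements still only need six levels), which is why the "O(log₃₂ n)" cost is so close to constant in practice.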
I'm pretty sure they picked the word pioneered for a reason. If Rich Hickey didn't invent them, then Tiark & Bagwell didn't invent RRB-Trees.
Well, it's arguable either way IMHO. I'd give priority to Bagwell because he first published his work academically in 2000. At the time he worked for Odersky, the author of the Scala language. So these structures were in Scala's implementation first, then adapted and improved for Clojure.
Phil Bagwell was loosely associated with my group in 2000 but did not work for me then. His work at the time was theoretical; the first practical implementation is Clojure's. Scala's implementations only appeared in version 2.8, in 2010.
Since Rich Hickey isn't in academia, but instead a working programmer, using publication dates doesn't feel like it is the best criterion for determining who did what when.
I do agree that the structure is influenced by the paper, so I've added that in. I don't agree that they are "virtually identical" however, because there are significant differences leading to a lot of performance improvements (never any hash collisions, insertion and removal only at the end, resulting in perfect balancing, etc) and, as swanodette mentioned, Bagwell himself mentioned that they were pioneered by Hickey.
> Rich goes through this in one of his talks but I don't recall which.
There's no bitvector in clojure persistent vectors. The difference between vectors and a hashmap without the hashing stage is that a persistent vector can't be sparse!
"In Clojure, immutable vectors are an essential part
of the language implementation design. Ideal Hash Tries (HAMTs)
[1] were used as a basis for immutable hash maps and the same
structure, 32-way branching trees, was used for immutable vectors."