The direct counter-argument to "worst representation" is usually "representation with the fewest assumptions", and the waveform setup described here gets close. Recording environment, equipment, and how the sound actually gets digitized also come into play, but relatively few assumptions remain beyond those.
I would say that in the neural network literature at large, and in audio modeling in particular, there is a continual back and forth between pushing DSP-based knowledge into neural nets (on the architecture side or the data side) and going "raw-er" to force models to learn their own versions of DSP-style transforms. It has been, and will continue to be, a see-saw as we try to find what works best, driven by performance on benchmarks with certain goals in mind.
These push-pull movements also dominate computer vision (where many of the "correct" DSP approaches fell away to less-rigid, learned proxies) and language modeling (tokenization is hardly "raw", and byte-based approaches to date lag behind smart tokenization strategies), and I think every field that approaches learning from data will see similar swings over time.
CCD bitstreams are also not "raw", so people will continue to push down in representation while making bigger datasets and models, and the rollercoaster will continue.
I very much enjoy the observation that LLMs appear to function best when trained on "tokens" rather than the pure unfiltered stream of characters. I think I am ultimately trying to express an analogous belief: the individual audio samples here are as meaningless as individual letters are to an LLM.
Instead of "representation with the fewest assumptions" I would maybe suggest that the optimal input for a model may be the representation where the data is broken apart as far as it can be while still remaining meaningful. I have suggested in other replies that this is perhaps achieved with quadrature samples or even perhaps with something such as a granular decomposition -- something akin to a "token" of audio instead of language.
On the loops / sampling front: I always thought RAVE [0][1][2] was a very interesting approach that really embraces latent spaces and sample/stretch type approaches in the waveform space.
Research into "pure" unconditional generation can often lead to gains in the conditional setting. See literally any GAN research, VQ-VAE, VAE, diffusion, etc.; pretty much all of them started from the "unconditional / low information" setting. Both directly (in terms of modeling) and indirectly (by forcing you to really reason about what conditioning is telling you about the modeling, and what's in the data), these approaches really force you to think about what it means to just "make music".
Also, I think artistic uses (such as Dadabots, who heavily used SampleRNN) show clearly that "musicians" like interesting tools, even if uncontrolled in some cases. Tools to exactly execute an idea are important (DAW-like), but so are novelty-generating machines, which (many) unconditional generators end up being. Jukebox is another nice example of this.
On the "good for elevator music" comment - the stuff I've heard from these models is rarely relaxing enough to be in any elevator I would ride. But there are snippets of inspiration in there for sure.
Generally, I do favor controllable models with lots of input knobs and conditioning for direct use, but there's space for many different approaches in pushing the research forward.
Different creators will work all kinds of odd models into their workflows, even things that are objectively less "high quality" and not really controllable. To me, that's a great thing and reason enough to keep pushing unsupervised learning forward.
This work is another classic in the "neural nets meet spreadsheets" genre [0]. Really helps visualize what is going on in (at least some) latent spaces.
This is so awesome!! I'm going to have to show this in the embeddings video I'm working on when I discuss non-text embeddings and CLIP.
While I created spreadsheet-are-all-you-need.ai as a teaching tool, as I've been playing with it I've had a growing suspicion that the spreadsheet interface for AI might be useful beyond teaching, either as a power-user control interface or for interpretability. For example, making simple changes to the architecture of GPT and observing how they change the model's behavior can be as simple as cloning a tab and a few spreadsheet functions. Of course, you can do the same in Python as well, so it remains to be seen.
It's actually available in beta. It was announced while I was working on this project, but I kept going with pure Excel functions because I wanted to illustrate the transformer without abstractions getting in the way. It would make many aspects easier, but also make it easier to hide a lot.
That being said, Python+Excel makes a ton of sense in general. And in this project, it would help in the tutorials. For example, in the embeddings tutorial I'm working on I wanted to use PCA plots and SVD to illustrate the workings of embeddings, but neither is natively supported in Excel without paid plug-ins. Both are easy in Python.
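For what it's worth, the PCA-via-SVD route really is only a few lines in Python. This is a toy sketch with random data, just to show the shape of the computation:

```python
import numpy as np

# Toy "embeddings": 6 points in 4 dimensions (random, for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
Xc = X - X.mean(axis=0)  # PCA requires mean-centering first

# PCA via SVD: rows of Vt are the principal axes, U * S are projected coords
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pca_2d = (U * S)[:, :2]  # keep the top 2 components for a 2-D plot

print(pca_2d.shape)  # (6, 2)
```

From there, `pca_2d` is exactly what you would scatter-plot to visualize the embedding space.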
This same technique, extended, can work well for detecting plagiarism from the underlying corpus as well, by tracking a trie of "good" completions in the n-gram sense, and a longer trie of "no-good" completions. This technique was (to my knowledge) first shown in [0], and [1] in particular is a really interesting video discussing these topics around max-order grams, even in a Markovian setting. I used this technique a bit in symbolic music generation, was quite pleased with the results, and always planned to work it into whatever models came next.
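A minimal sketch of the idea, using a set of n-gram tuples as a stand-in for the trie (the corpus and the order 4 are made up for illustration):

```python
def build_ngram_set(corpus_tokens, n):
    """Collect every length-n gram that appears verbatim in the corpus.
    A trie would share prefixes; a set of tuples is the simplest stand-in."""
    return {tuple(corpus_tokens[i:i + n])
            for i in range(len(corpus_tokens) - n + 1)}

def is_plagiarized(candidate, banned, n):
    """True if the candidate contains any 'no-good' (too-long) corpus match."""
    return any(tuple(candidate[i:i + n]) in banned
               for i in range(len(candidate) - n + 1))

corpus = "a b c d e f g".split()
banned = build_ngram_set(corpus, 4)  # copying 4+ tokens verbatim is "no-good"
print(is_plagiarized("x a b c d y".split(), banned, 4))  # True
print(is_plagiarized("a b x c d".split(), banned, 4))    # False
```

In a sampler you would run the check incrementally as tokens are generated, rejecting continuations that would extend a verbatim corpus match past the max order.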
I think a lot of methods from these older Markovian setups could be employed in the output samplers of modern models, along with structured searches and so on. Parts of deep learning have always focused on structured output search, but the LLM-style generative setting historically has not employed these approaches (though beam search in generative settings needs tweaking, it usually works pretty well for me on smaller-scale problems).
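For concreteness, here is a bare-bones beam search over a toy bigram table; the table and beam width are invented, just to show the mechanics that a Markovian sampler and an LLM sampler would share:

```python
import math

# Hypothetical bigram log-prob table standing in for a learned model
LOGP = {
    ("<s>", "the"): math.log(0.6), ("<s>", "a"): math.log(0.4),
    ("the", "cat"): math.log(0.7), ("the", "dog"): math.log(0.3),
    ("a", "cat"): math.log(0.2),   ("a", "dog"): math.log(0.8),
}

def beam_search(start, steps, beam_width=2):
    """Keep the beam_width highest-scoring partial sequences at each step."""
    beams = [([start], 0.0)]
    for _ in range(steps):
        candidates = [
            (seq + [nxt], score + lp)
            for seq, score in beams
            for (prev, nxt), lp in LOGP.items()
            if prev == seq[-1]
        ]
        beams = sorted(candidates, key=lambda b: -b[1])[:beam_width]
    return beams

best, score = beam_search("<s>", 2)[0]
print(best)  # ['<s>', 'the', 'cat']
```

The tweaking mentioned above usually lives in the scoring function (length normalization, diversity penalties) rather than in the search loop itself.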
There was a really nice post on doing this kind of thing with CRF back in 2015 [0]. Open source data, and code on github. Also a nice tutorial on structured prediction using CRF type models.
Would be interesting if you could prompt, LoRA distill, or use modern LLM tricks against a well-labeled and curated set, similar to how other tagging problems are handled with modern pretrained models.
What since Adam? Learning rate scales / schedules? I cannot think of many big massive changes since ~2014, most of the setups from that era (grad clip + medium-ish LR, some ramp up or roll-off at the end) work fine today for me.
(Note: There are many, many great optimization papers since 2014 - I just don't see them show up in general recipes in open source too often)
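That recipe is short enough to write out. Here is a sketch of the ramp-up / roll-off schedule and global-norm clipping; the function names and default values are my own illustration, not from any particular codebase:

```python
import math

def lr_at(step, total_steps, base_lr=3e-4, warmup=100):
    """Linear ramp-up for `warmup` steps, then cosine roll-off to zero."""
    if step < warmup:
        return base_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a flat list of gradient values so the global L2 norm <= max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (norm + 1e-12))
    return [g * scale for g in grads]

print(lr_at(0, 1000), lr_at(1000, 1000))  # ramping up; rolled off to ~0
print(clip_by_global_norm([3.0, 4.0]))    # norm 5 -> rescaled toward norm 1
```

Feed `lr_at(step, total_steps)` to Adam each step and clip gradients before the update; that is essentially the whole 2014-era recipe.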
I disagree with this. Binarized MNIST samples of any reasonable quality are (still) tricky to get right without a hierarchical system (read: VQ-VAE tokens or some such encoder space). The same goes for really solid CIFAR-10. "Scaling down" is a different problem than scaling up; not everything transfers, and saying "everything works on MNIST / CIFAR-10" in generative modeling is a bit glib.
I would much prefer to see early work with solid small-scale results on arXiv than have people hold concepts for another 6 months while scaling up. Let that be for a v2; if you cannot put early but concrete results on arXiv, where else is there?
Recall that a lot of nice papers were mostly MNIST / CIFAR-10 level results at first, followed by scale (I'm thinking of VQ-VAE, PixelCNN / RNN, PerceiverAR, and many others that worked well at scale later). That doesn't mean every result will scale up, but we have a lot of tricks to scale "small-scale" models using pretrained latent spaces and so on. The first diffusion results were also pretty small scale... a different time, but I don't think things are so different today.
That said, I can agree that you need to be a bit in the weeds on the research side to be diving deep on this - but I expect lots of followup clarifications or blog posts on this type of work.
Previously TortoiseTTS was associated with PlayHT in some way, although the exact connection is a bit vague [0].
From the descriptions here it sounds a lot like AudioLM / SPEAR TTS / some of Meta's recent multilingual TTS approaches, although those models are not open source; it sounds like PlayHT's approach is in a similar spirit. The discussion of "mel tokens" is closer to what I would call the classic TTS pipeline in many ways... PlayHT has generally been kind of closed about what they use, so it would be interesting to know more.
If you are interested in some recent open to sample-from work pushing on this kind of random expressiveness (sometimes at the expense of typical "quality" in terms of TTS), Bark is pretty interesting [1]. Though the audio quality suffers a bit from how they realize sequences -> waveforms, the prosody and timing is really interesting.
I assume the key factor here is high quality, emotive audio with good data cleaning processes. Probably not even a lot of data, at least in the scale of "a lot" in speech, e.g. ASR (millions of hours) or TTS (hundreds to thousands). As opposed to some radically new architectural piece never before seen in the literature, there are lots of really nice tools for emotive and expressive TTS buried in recent years of publications.
Tacotron 2 is perfectly capable of this type of stuff as well, as shown by Dessa [2] a few years ago (this writeup is a nice intro to TTS concepts). The limit is largely that, at some point, the model hasn't heard certain phonetic sounds in a given voice, and you need to do something to get plausible outputs for new voices.
Maybe 'Image Quilting for Texture Synthesis and Transfer', Efros and Freeman [0]?
There are some neural / patch blends from 2016 that I always thought were interesting (CNN-MRF) [1], and I think there's a renaissance in those approaches recently (combined with other generators / prompts, etc.). You can also argue ViT is "patch based" in a major sense... I am still a big believer in patch + combination + warping (non-parametric synthesis) generally; there's some cool older work from Apple on that in speech land [2].
I would go as far as arguing that BPE / wordpiece / sentencepiece / tokenizers in general are key for modern approaches (as word vocab selections were in the earlier days of NMT), because they find 'good enough' patches (tokens) for a higher-level model to stitch together while still leaving some creativity / generalization available... but in publications we often focus on the model details rather than the importance of the tokenizer (and the tokenizer's distribution).
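To illustrate what that "patch finding" looks like at its simplest, here is one BPE-style merge step on a toy corpus (the corpus and the `_` word separator are invented for illustration):

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Most common adjacent pair -- the next BPE merge candidate."""
    return Counter(zip(tokens, tokens[1:])).most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with one merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

toks = list("low lower lowest".replace(" ", "_"))
pair = most_frequent_pair(toks)  # ('l', 'o'); ties break by first occurrence
merged = merge_pair(toks, pair)
print(merged[:3])  # ['lo', 'w', '_']
```

Repeating this merge loop until a target vocab size is exactly how the frequent "patches" get promoted to tokens.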