It is an undergrad course, though it is cross-listed for master's students as well. At CMU, the prerequisite chain looks like this: 15-122 (intro imperative programming, zero background assumed, taken by first-semester CS undergrads) -> 15-213 (intro systems programming, typically taken by the end of the second year) -> 15-445 (intro to database systems, typically taken in the third or fourth year). So in theory, it's about one year of prerequisite material away from zero experience.
Check out the full version of Towards Scalable Dataframe Systems from VLDB 2020 [0]. They propose an algebra for dataframes and section 4.4's example succinctly describes the pivot operator.
Pretty much. Plus, from my perspective - if a company is willing to screw over your advisor/professor, you know that they won't hesitate to screw you over too.
> ... you should not recruit the university department/group/students against your peers ...
As a student who chose to stay at CMU for a PhD because of this group, I'd say the situation is quite the opposite; you may also misunderstand the nature of the "ban" (students can still apply directly to the company).
From the student perspective, we benefit from knowing the reputation of potential employers. For example: CompanyX went back on their promises, so don't trust them unless they give it to you right away; CompanyY has a culture of being stingy; the people who went to CompanyZ love it there; and so on.
So it's more like (1) providing additional data about the company's past behavior, and (2) not actively giving the company a platform. I personally find this great for students.
You're correct, but for additional context, this paper will actually be presented at VLDB 2024 [0].
> All papers published in this issue will be presented at the 50th International Conference on Very Large Data Bases, Guangzhou, China, 2024.
And that's because in the submission guidelines [1],
> The last three revision deadlines will be May 15, June 1, and July 15, 2023. Note that the June deadline is on the 1st instead of the 15th, and it is the final revision deadline for consideration to present at VLDB 2023; submissions received after this deadline will roll over to VLDB 2024.
So whether it is (2023) or (2024) is a little ambiguous.
For an eh in the other direction: I overpaid PA state taxes in 2020 by a decent chunk. The last time I called, they said they were still processing amended returns from 2019 (which you can verify by going to their "Where's my refund" page and looking at the year dropdown).
[first author here] I'm not sure why this is on the front page. Speaking only on my own behalf, I like to think of this as a paper motivated by problems that I kept running into while re-implementing papers from self-driving database systems [0] research.
My TLDR would be: existing research has focused on trying to develop better models of database system behavior, but look at recent trends in modeling. Transformers, foundation models, AutoML -- modeling is increasingly "solved", as long as you have the right training data. Training data is the bottleneck now. How can we optimize the training data collection pipeline? Can we engineer training data that generalizes better? What opportunities arise when you control the entire pipeline?
Elaborating on that, I think you can abstract existing training data collection pipelines into these four modules:
- [Synthesizer]: The field has standardized on the use of various synthetic workloads (e.g., TPC-C, TPC-H, DSB) and common workload trace formats for real-world workloads (e.g., postgres_log, MySQL's general query log). Research on workload forecasting and dataset scaling exists. In 2023, why can't I say "assuming trends hold, show me what my workload and database state will look like 3 months from now"? (There's a toy sketch of this after the list.)
- [Trainer]: Given a workload and state (e.g., from the Synthesizer), existing research executes the workload on the state to produce training data. But executing workloads in real time kind of sucks. Maybe you have a workload trace that's one month long; well, I don't want to wait one month for training data. But I can't just smash all the queries together either; that wouldn't be representative of actual deployment conditions. So right now, I'm intrigued by the idea of executing workloads faster than real time. Think of the fast-forward button on a physics simulator, where you can reduce simulation fidelity in exchange for speed. Can we do that for databases? (Sketch after the list.) I'm also interested in playing tricks to help the training data generalize across different hardware, and in general, there seems to be a lot of unexplored opportunity here. Actively working on this!
- [Planner]: Given the training data (e.g., from the Trainer) and an objective function (e.g., latency, throughput), you might consider a set of tuning actions that improve the objective (e.g., build some indexes, change some knob settings). But how should you represent these actions? For example, a number of papers one-hot encode the possible set of indexes, but (1) you can't actually do this in practice because there are far too many candidate indexes, and (2) you lose the notion of "distance" between your actions (e.g., indexes on the same table should probably be considered "related" in some way; see the toy example after the list). Our research group is currently exploring some ideas here.
- [Decider]: Finally, once you're done applying all this domain-specific stuff to encode the states and actions, you're solidly in the realm of "learning to pick the best action" and can probably hand it off to an ML library. Why reinvent the wheel? :P That said, you can still do interesting work here (e.g., UDO is intelligent about batched action evaluation), but it's not something that I'm currently that interested in (relative to the other stuff above, which is more uncharted territory).
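To make the Synthesizer point concrete, here's a minimal sketch of the "show me my workload 3 months from now" question, reduced to the simplest possible form: extrapolating query volume from a historical trace. The trace format and the linear-trend assumption are mine for illustration; a real forecaster would also have to project the query mix and the database state, not just the volume.

    # Toy forecast: fit a linear trend to per-day query counts and extrapolate.
    # (Illustrative only; assumes daily volume is the quantity you care about.)
    import numpy as np

    def forecast_daily_volume(daily_counts, horizon_days=90):
        days = np.arange(len(daily_counts))
        slope, intercept = np.polyfit(days, daily_counts, deg=1)
        future = np.arange(len(daily_counts), len(daily_counts) + horizon_days)
        return np.maximum(slope * future + intercept, 0)

    history = [1000 * 1.02 ** d for d in range(60)]   # a workload growing ~2%/day
    print(forecast_daily_volume(history)[-1])         # projected daily volume in 3 months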
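For the Trainer, the "fast-forward button" in its crudest form is just replaying a timestamped trace with the idle time between queries compressed by some factor. This sketch assumes a (timestamp_seconds, sql) trace and a caller-supplied run_query function; it deliberately skips the hard part (keeping the compressed replay representative of real contention), which is exactly where the research is.

    # Toy faster-than-real-time replay: preserve query order, shrink idle gaps.
    import time

    def replay(trace, run_query, speedup=720.0):
        # trace: list of (timestamp_seconds, sql), sorted by timestamp.
        # At speedup=720, a one-month trace replays in roughly one hour.
        start_wall = time.monotonic()
        start_trace = trace[0][0]
        for ts, sql in trace:
            target = (ts - start_trace) / speedup
            sleep_for = target - (time.monotonic() - start_wall)
            if sleep_for > 0:
                time.sleep(sleep_for)
            run_query(sql)

    # usage: replay(parsed_postgres_log, run_query=cursor.execute, speedup=720)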
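And for the Planner, here's a toy illustration of the "distance between actions" problem: with one-hot encoding, an index on orders(o_custkey) is exactly as far from orders(o_custkey, o_orderdate) as it is from lineitem(l_shipdate), whereas even a crude (table, columns) featurization recovers that structure. The schema and featurization below are made up for the example; they're not what our group actually uses.

    # Toy index-action featurization: one-hot the table, multi-hot the columns.
    import numpy as np

    TABLES = ["orders", "lineitem", "customer"]
    COLUMNS = {"orders":   ["o_orderkey", "o_custkey", "o_orderdate"],
               "lineitem": ["l_orderkey", "l_partkey", "l_shipdate"],
               "customer": ["c_custkey", "c_name", "c_nationkey"]}

    def featurize_index(table, cols):
        vec = np.zeros(len(TABLES) + max(len(c) for c in COLUMNS.values()))
        vec[TABLES.index(table)] = 1.0
        for c in cols:
            vec[len(TABLES) + COLUMNS[table].index(c)] = 1.0
        return vec

    a = featurize_index("orders", ["o_custkey"])
    b = featurize_index("orders", ["o_custkey", "o_orderdate"])
    c = featurize_index("lineitem", ["l_shipdate"])
    print(np.linalg.norm(a - b), np.linalg.norm(a - c))  # a is closer to b than to c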
If anyone is at SIGMOD this week, I'm happy to chat! :)