
Well, this depends.

If you're looking for generic things similar to what CLIP was trained on, this would work. But say you're interested in specific physical security metrics, or monitoring defective parts, or specific things about traffic, etc. CLIP might just say "people walking" or "car in intersection" or "part on conveyor belt", which isn't meaningful enough when all your images are exactly that, just with other small differences.
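
To make that concrete, here's roughly what zero-shot CLIP scoring of a frame looks like (a sketch using the Hugging Face CLIP port; the file name and prompts are placeholders). When every frame already matches the coarse prompt, the scores on the fine-grained prompts tend to be too close together to be useful:

    from PIL import Image
    import torch
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("frame.jpg")  # placeholder frame from your footage
    prompts = [
        "a part on a conveyor belt",            # coarse: matches every frame
        "a defective part on a conveyor belt",  # the distinction you actually care about
        "a scratched part on a conveyor belt",
    ]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
    for prompt, p in zip(prompts, probs.tolist()):
        print(f"{p:.3f}  {prompt}")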

Another important aspect of this is the number of frames you need to process. Running CLIP even on 27 million frames (1 day of footage) is super expensive. We've built some infra that makes processing video efficient (forms of parallelization + filtering), without you having to think about it.
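
Rough back-of-envelope for that figure (the frame rate, stream count, and per-frame latency here are assumptions, not measurements):

    fps = 30                      # assumed per-camera frame rate
    streams = 10                  # assumed number of camera streams
    frames_per_day = fps * streams * 24 * 60 * 60
    print(frames_per_day)         # 25,920,000 -- roughly the 27M above

    ms_per_frame = 10             # assumed CLIP inference time per frame on one GPU
    gpu_hours = frames_per_day * ms_per_frame / 1000 / 3600
    print(round(gpu_hours))       # ~72 GPU-hours per day, before any filtering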



> Running CLIP even on 27 million frames (1 day of footage) is super expensive. We've built some infra that makes processing video efficient (forms of parallelization + filtering), without you having to think about it.

With your massive parallel infra you're still processing 27 million frames, right?

> If you're looking for generic things similar to what CLIP was trained on, this would work. But say you're interested in specific physical security metrics, or monitoring defective parts, or specific things about traffic, etc. CLIP might just say "people walking" or "car in intersection" or "part on conveyor belt", which isn't meaningful enough when all your images are exactly that, just with other small differences.

That's exactly what transfer learning is for. I'm saying to do it with CLIP because you automatically get a pretrained model which can do much more than that. Imagine doing a search like "a person wearing a red cap and a yellow handbag walking toward the exit" or "a person wearing a shirt with 'mark' written on it". Can your system do it right now?
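
Something like this is what I have in mind: embed the frames once with a pretrained CLIP, then score any free-text query against them (a sketch with the Hugging Face port; the frame paths and query are placeholders):

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    # Embed frames once, up front.
    frames = [Image.open(p) for p in ["f1.jpg", "f2.jpg", "f3.jpg"]]
    with torch.no_grad():
        img_emb = model.get_image_features(**processor(images=frames, return_tensors="pt"))
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)

    # Any free-text query, at search time.
    query = "a person wearing a red cap and a yellow handbag walking toward the exit"
    with torch.no_grad():
        txt_emb = model.get_text_features(**processor(text=[query], return_tensors="pt", padding=True))
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)

    scores = (img_emb @ txt_emb.T).squeeze(1)        # cosine similarity per frame
    print(scores.topk(min(3, len(frames))).indices)  # indices of best-matching frames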


> With your massive parallel infra you're still processing 27 million frames, right?

No, we're not. We first run a cheap filter, like a motion detector, on all of the video. We then stack other, more expensive filters on top of this depending on the use case, and eventually run the most expensive metadata-generating models at the end. We also don't do this on every single frame; we can interpolate information using surrounding frames. Our parallel infra speeds this up further.
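
A toy version of that cascade (OpenCV background subtraction as the cheap filter; the path, thresholds, sampling rate, and run_expensive_models stub are placeholders, not our production code):

    import cv2

    def run_expensive_models(frame):
        # stand-in for the detectors / metadata models that only see surviving frames
        pass

    cap = cv2.VideoCapture("camera.mp4")        # placeholder video file
    backsub = cv2.createBackgroundSubtractorMOG2()
    SAMPLE_EVERY = 5        # only look at every Nth frame; neighbors fill in the rest
    MOTION_PIXELS = 500     # assumed threshold for "something actually moved"

    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame_idx += 1
        if frame_idx % SAMPLE_EVERY:
            continue                            # skipped frames inherit nearby metadata
        mask = backsub.apply(frame)             # cheap CPU-only motion mask
        if cv2.countNonZero(mask) < MOTION_PIXELS:
            continue                            # static frame: never reaches the GPU
        run_expensive_models(frame)             # expensive models run on survivors only
    cap.release()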

> That's exactly what transfer learning is for. I'm saying to do it with CLIP because you automatically get a pretrained model which can do much more than that. Imagine doing a search like "a person wearing a red cap and a yellow handbag walking toward the exit" or "a person wearing a shirt with 'mark' written on it". Can your system do it right now?

The issue is that there are very few text-image pair datasets out there, and building a good one is difficult. We constantly use transfer learning in-house when working with different customer data and typical classifier / detector models, but we haven't yet had success doing so with CLIP. Our system can't semantically search through video just yet; we're still exploring the most feasible ways of doing this. There's some interesting work on this that we've been reading recently:

https://ddkang.github.io/papers/2022/tasti-paper.pdf

https://vcg.ece.ucr.edu/sites/g/files/rcwecm2661/files/2021-...



