I looked very closely at the videos for a while and managed to find some minor continuity errors (like different numbers of buttons on people's button-down shirts at different times, or different sizes or styles of earrings, or arguably different interpretations of which finger is which in an intermittently-obscured hand). I also think that the cycling woman's shorts appear to cover more of her left leg than her right leg, although that's not physically impossible, and the bear seemingly has a differently-sized canine tooth at different times.
But I guess it took me multiple minutes to find these problems, watching each video clip many times, rather than having any of them jump out at me. So, it's not like literally full consistent object persistence, but at a casual viewing it was very persuasive.
Maybe people who shoot or edit video frequently would notice some of these problems more quickly, because they're more attuned to looking for continuity problems?
I think it's fascinating to watch what the issues/complaints are. I'm in no way saying you're complaining, but I think looking at what people point out as the issues is a great measure of progress.
Here we're looking at video with high-quality individual frames, where the inconsistencies are sometimes obvious and sometimes not. Compare that to Craiyon (around the time of DALL-E): https://i.ytimg.com/vi/lcoitxKbw_0/maxresdefault.jpg it's wild how much has changed. And even that capability was a vast improvement over what came before (at least for tasks that weren't narrowly fixed goals; the GAN approach to faces in headshots was already very lifelike before this).
> But I guess it took me multiple minutes to find these problems
I’m no video editor, but I noticed straight away that the characters’ eyes and hair tend to change, sometimes dramatically, as they turn their heads. Also, the head movement tends to be jerky or abrupt, especially in the middle of the turn.
There are lots of inconsistencies in these clips of a type you would never find even in a hastily put-together amateur film. I wonder how you would even add continuity support to a generative video model. It's got its training data, its model, its algorithms for generating data... but could you say "make sure this shirt always has 6 buttons in this scene"? Does it even understand what a button is? Or a shirt? Or a thing?
It seems to me that eventually these systems are going to have to be grounded in some hard truths about our world. Like, there are things called objects, objects can be distinct, objects can have relationships with other objects, etc. Then the generative network would have to generate data around these priors. Or maybe they already have that, I don't know how they work.
Hopefully continuity (of relevant features) will eventually be a result of the training process. In videos from the wild, the number of buttons on a shirt basically never changes during a scene, so that information is already in the training data. It is therefore theoretically possible for a model to learn which properties should stay consistent, in contrast to properties of the shirt that legitimately change, like lighting or pose. But we are still in the very early days of kinda-working video generation, and especially of temporal consistency.
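As a very rough illustration of what "learning to keep features consistent" could look like mechanically, here is a minimal PyTorch-style sketch of a temporal-consistency penalty on per-frame features. The frame encoder, the weighting, and the way it combines with the main objective are all my assumptions, not anything from this paper or from Sora.

    import torch

    def temporal_consistency_loss(frame_feats: torch.Tensor) -> torch.Tensor:
        """Penalize frame-to-frame drift in per-frame features.

        frame_feats: (batch, time, dim) features from some frozen per-frame
        encoder applied to the generated frames. Purely illustrative.
        """
        diffs = frame_feats[:, 1:] - frame_feats[:, :-1]  # consecutive-frame deltas
        return diffs.pow(2).mean()

    # Hypothetical usage inside a training step:
    #   loss = main_generation_loss + lambda_tc * temporal_consistency_loss(feats)
    #   where feats comes from, e.g., a frozen image encoder run on each frame.

As far as I know, most current video models instead try to get this "for free" from attention across frames rather than from an explicit penalty, but the sketch shows the kind of signal that is already sitting in the training data.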
Did you miss the fish?[0] You should see that error on a first viewing.
What about the woman with glasses? Her face literally "jumps".[1] Same with this guy's hands.[2]
Interestingly, we notice that [1] has "sora" in the name, though I think it is a reference to the main image on Sora.[3]
Not sure if the gallery is weird to anyone else, but it doesn't exactly show new images and the position indicator is wonky.
The thing that makes me most suspicious is seeing the numbers on these demos: 1, 2, 4 (terrifying to me), 5, 65, 66, 68, 72, 73, 83, 85, 86 (is this Simone Giertz? Vic Michaelis?). The tough part about evaluating generative models is the cherry-picking for demonstrations. You have to do it or people tear your work apart, but in doing so you give a false impression of what your work can actually do.
IMO it has gotten out of hand and is not benefiting anyone. It makes these papers more akin to advertising than to communication of research. We talk about the integrity of the research community and why we argue over borderline works, but come on: if you can get a better review by generating more samples to cherry-pick from, then you can get better reviews by paying for more compute, not by doing better work. A pay-to-play system is far worse for the integrity of ML (or any science) than arguing over borderline works.
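Rough back-of-the-envelope, with made-up numbers, of why a bigger sampling budget alone buys better-looking demos: assume each generated clip independently turns out demo-quality with some probability p.

    # Chance of getting at least one demo-quality clip out of n attempts,
    # assuming each attempt independently succeeds with probability p.
    # The p = 0.05 below is invented purely for illustration.
    def p_at_least_one_good(p: float, n: int) -> float:
        return 1 - (1 - p) ** n

    for n in (10, 100, 1000):
        print(n, round(p_at_least_one_good(0.05, n), 3))
    # 10   -> 0.401
    # 100  -> 0.994
    # 1000 -> 1.0   (more compute, prettier cherry-picks)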
Edit: I think it is also a bit problematic that this is posted BEFORE the arXiv link or the GitHub repo goes live. I'd appeal to the HN community not to upvote these kinds of works until at least the paper is live.
You can see this in Sora videos too if you look closely at things like the leaves of trees; you can tell some sort of temporal bucketing is going on, even in SOTA models.
I mean, at the end of the day, standard video production isn't perfectly consistent either. How many times have we all found inconsistencies in TV shows, or random water bottles showing up and disappearing between scenes... I imagine diffusion video creation will end up similar: eventually just funny anecdotes about what we spotted that one time in LOTR 10.
I am super picky when it comes to art and I think these look like complete shit when compared to what I have seen from Sora.
Not even in the same ballpark. Even when things are wrong in Sora it seems like the imagery is still very crisp. If I watched these videos for 5 minutes I know I would get a headache.