I looked very closely at the videos for a while and managed to find some minor continuity errors (like different numbers of buttons on people's button-down shirts at different times, or different sizes or styles of earrings, or arguably different interpretations of which finger is which in an intermittently-obscured hand). I also think that the cycling woman's shorts appear to cover more of her left leg than her right leg, although that's not physically impossible, and the bear seemingly has a differently-sized canine tooth at different times.
But I guess it took me multiple minutes to find these problems, watching each video clip many times, rather than having any of them jump out at me. So, it's not like literally full consistent object persistence, but at a casual viewing it was very persuasive.
Maybe people who shoot or edit video frequently would notice some of these problems more quickly, because they're more attuned to looking for continuity problems?
I think it's fascinating to watch what the issues/complaints are. I'm in no way saying you're complaining, but I think looking at what people point out as the issues is a great measure of progress.
Here we're looking at video with high-quality individual frames, where the inconsistencies are maybe clear and maybe not. But compared to Craiyon (around the time of DALL-E): https://i.ytimg.com/vi/lcoitxKbw_0/maxresdefault.jpg it's wild how much that has changed. And even that capability was a vast improvement over what came before (at least over systems that weren't aimed at a fixed goal; the GAN approach to faces in headshots was very lifelike before this).
> But I guess it took me multiple minutes to find these problems
I’m no video editor, but I noticed straight away that the characters’ eyes and hair tend to change, sometimes dramatically, as they turn their heads. Also, the head movement tends to be jerky or abrupt, especially in the middle of the turn.
There are lots of inconsistencies in these clips of the type you would never find even in a hastily put together amateur film. I wonder how you would even add continuity support into a generative video model. It's got its training data, its model, its algorithms for generating data... but could you say "make sure this shirt always has 6 buttons in this scene"? Does it even understand what a button is? Or a shirt? Or a thing?
It seems to me that eventually these systems are going to have to be grounded in some hard truths about our world. Like, there are things called objects, objects can be distinct, objects can have relationships with other objects, etc. Then the generative network would have to generate data around these priors. Or maybe they already have that, I don't know how they work.
Hopefully continuity (of relevant features) will eventually be a result of the training process. In videos from the wild, the number of buttons on a shirt basically never changes during a scene. That kind of information is already in the training data. So it is theoretically possible for a model to learn that this should stay consistent, in contrast to other properties of the shirt, like lighting or pose, which do change. But we are still in the very early days of kinda-working video generation, and certainly of temporal consistency.
Did you miss the fish?[0] You should see the error on first viewing.
What about the woman with glasses? Her face literally "jumps".[1] Same with this guy's hands.[2]
Interestingly, we notice that [1] has "sora" in the name, though I think it is a reference to the main image on Sora.[3]
Not sure if the gallery is weird to anyone else, but it doesn't exactly show new images and the position indicator is wonky.
The thing that makes me most suspicious is seeing the numbers on these demos: 1, 2, 4 (terrifying to me), 5, 65, 66, 68, 72, 73, 83, 85, 86 (is this Simone Giertz? Vic Michaelis?). The part that is tough about evaluating generative models is the cherry-picking for demonstrations. You have to do it or people tear your work apart, but in doing so you give a false impression of what your work can actually do.
IMO it has gotten out of hand and is not benefiting anyone. It makes these papers more akin to advertising than communication of research. We talk about the integrity of the research community and why we argue over borderline works, but come on: if you can get a better review by generating more samples to cherry-pick from, you can get better reviews by paying more for compute, not by doing better work. A pay-to-play system is far worse for the integrity of ML (or any science) than arguing over borderline works.
Edit: I think it is also a bit problematic that this is posted BEFORE the arxiv link or GitHub goes live. I'd appeal to the HN community to not upvote these kinds of works until at least the paper is live.
You can see this in Sora videos too: if you look closely at things like the leaves of trees, you can tell some sort of temporal bucketing is going on, even in SOTA models.
I mean, at the end of the day, standard video editing isn't perfectly consistent either; how many times have we all found inconsistencies in TV shows, or random water bottles showing up and disappearing between scenes... I imagine diffusion video creation will eventually be similar: funny anecdotes about what we saw that time in LOTR 10.
I am super picky when it comes to art and I think these look like complete shit when compared to what I have seen from Sora.
Not even in the same ballpark. Even when things are wrong in Sora it seems like the imagery is still very crisp. If I watched these videos for 5 minutes I know I would get a headache.
Normally I don't mind spelling errors - and there are plenty in the examples - but my question is, did the system really produce "lunch" when the prompt was "they have launch at restraunt" (verbatim from the sample)? I would imagine it got restaurant right, but I would have expected it to produce something like a rocket launch image instead of figuring out the author meant lunch.
Transformers / attention are very robust against typos, as they take the entire context into account just like we do. Launch any free LLM and ask it questions with typos that you would notice and auto-correct, and you'll see that the models just don't care and understand them. Actually, they are so resilient that they understand very garbled text without breaking a sweat.
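A minimal sketch of one part of why that works, assuming the Hugging Face transformers package is installed (the GPT-2 tokenizer here is just an example choice): a typo'd word isn't "unknown" to the model, it simply splits into different subword pieces, and attention over the surrounding context can still resolve what was meant.

    # Sketch: compare the subword pieces a correct and a typo'd spelling produce.
    # Assumes `pip install transformers`; gpt2 is just one example subword tokenizer.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")

    for text in ["they have lunch at a restaurant",
                 "they have launch at restraunt"]:
        # Both spellings decompose into known subword tokens, so nothing is
        # out-of-vocabulary; the surrounding context does the rest.
        print(text, "->", tok.tokenize(text))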
I often use ChatGPT for learning Spanish; I find it's great for explaining distinctions between words with similar meanings, where a dictionary isn't always a lot of help.
I am constantly surprised by how well it copes with my typos, grammatical errors and generally poor spelling.
There's honestly something uncanny about how well they do.
In the "early days" of GPT-4 I tried testing it as a way to get around poor transcription for an in-car voice assistant. It managed: "I'm how dew yew say... Freud?" => Turn up the temperature... which was nonsense most people would stare at for a long time before making any sense of.
The model can only attend to context that is part of the input. Most likely they created the image grid by independently feeding the model each prompt together with the reference image. (And the point is to show off that the model output remains consistent despite this independent generation process.)
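If that's right, the process is roughly the loop sketched below. This is purely hypothetical pseudocode for illustration; generate_panel and build_grid are made-up names, not the authors' API.

    from typing import Any, List

    def generate_panel(prompt: str, reference: Any) -> Any:
        # Placeholder for a single, independent image generation conditioned on
        # one text prompt plus the shared reference image (not a real API).
        raise NotImplementedError

    def build_grid(reference_image: Any, prompts: List[str]) -> List[Any]:
        # Each panel is generated on its own; the only thing shared between
        # panels is the reference image, so any cross-panel consistency has to
        # come from conditioning on that image rather than from panels
        # attending to each other.
        return [generate_panel(p, reference_image) for p in prompts]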
"He felt very frightened and run", "There is a huge amount of treasure in the house!"
I suspect that some grammar and spelling issues may come from the authors themselves. For example, "A Asian Man": using "a" instead of "an" is a common mistake for speakers of many Asian languages, since their languages don't have similar article forms. So given the consistent article errors, I expect this to be an issue with the authors' prompts. Not sure about the "M" capitalization. Similar things with "The man have breakfast", "They have launch at restaurant", "They play in (the) amusement part."
Considering the comics have similar types of errors (the squirrel one is clearer), I'd chalk it up to a language barrier rather than the process. Though LeCun is not wearing gloves on the moon, and well...
This is unbelievably good. Seems better than Sora even in terms of natural look and motion in videos.
The video of two girls talking seems so natural. There are some artifacts but the movement is so natural and clothes and other things around are not continuously changing.
I hope it does become open source, which I suspect it won't, because it's coming from ByteDance.
I don't know if that's true; there's a massive flicker in the guy's hair (the one that's mostly black background and black shirt). Halfway through, it completely loses tracking on his hair and it just snap-changes.
If you compare this with the current state of openly available video models (assuming this will be open too), this is still a leap. If it is going to be closed like Sora, then it's comparable; Sora has a different kind of artifacts.
These artifacts are an improvement over the current state.
The GitHub link is broken, and I honestly find it frustrating that the only link to code is the page theme's source and credits?? Is it really that important to give the static page theme that much real estate instead of an actual code release for the project?
Progress comes in spurts. Due to the negative reactions to AI by some (artists), the system wants it to appear that nothing is happening so that the next wave of AI can be created in relative peace, at which time it will be too late to stop it.
We have been conditioned to only react to hype and "news", rather than analyze reality and see the danger.
The global capitalist system, or the emergent behaviour that comes out of a mass of humanity addicted to technological development through wealth accumulation.
It's a term I use for emergent behaviour. And some philosophers of technology would disagree with you, such as the panpsychists. We are just a bag of cells and yet we speak of "wanting" things even though we might just be deterministic bags of blood.
What are you talking about? ChatGPT came out less than 4 years ago, and Stable Diffusion's first version around then too. In less than 4 years we went from nothing to making janky but believable video clips. This is not fast enough for you?
I’m just saying that compared to only a few months ago, everything seems to have stagnated. There used to be a lot more news and things getting released left and right.
One day we won't have 3D engines or GPUs but AI chips that generate the scenes without calculating a single triangle or loading a single texture. We just stream in a scene; IP asset seeds provide the characters, plot, and story. But even those can be generated in real time. Video games, movies, anything will be on demand. No one will act. No one will draw. We will just sit and ask for more. Strange times.
Love how under "Multiple Characters Generation" the white guy is "A Man," whereas someone else is "An Asian Man." Reminds me of Daryl Gates and the "normal people" quote, and thence patrol cars being called "black and normals."
The Moon in the sky seen from the surface of the Moon is wrong? Poetic? Funny? Recursive? A demonstration that these models don't understand anything? Add to the list.
Hi guys, thanks for your interest. The paper and the code are now released: https://github.com/HVision-NKU/StoryDiffusion. Currently, only the comics-related code is public. We are waiting for the company's assessment before releasing the video-related code.
It's really challenging to think of positive, constructive uses for this technology without thinking of the myriad life- and society-affecting uses for it. Just interpersonally, the use of this technology is heavily weighted toward destruction and deception. I don't know where this ends or where the researchers who release this technology think it will go, but I can't imagine it's going anywhere good for all of us.
I went to buy an air fryer. There were several specific-air-fryer-model recipe books available. But they were all garbage auto-generated stuff.
I complained to Amazon, and they said since I hadn't purchased the book they couldn't do anything. So I bought the book, complained, and returned it.
The chapters devoted to the details of the specific air fryer model were either very general (almost quotes of the product description on Amazon) or just plain wrong.
What I thought I would get would be like the Magic Lantern books about specific camera models. Instead it was auto-generated pages of nonsense.
I don’t think this form of generative AI needs to become a source of spam; carefully designed platforms can let people enjoy their niche content without making them feel isolated.
Not really useful to give up the fight in the infancy of something with as much surface area as generative AI.
"Is being used to create spam" is not the same as "needs to be spam", and we mostly just need platforms that leverage generative AI natively to bridge the gap.
My users don't find what these tools generate to be spam. They're enjoying a classic format with a novel level of flexibility and (understandably) find that very fun.
There is a video of two girls. One girl seems to be sticking out her tongue and then blowing a kiss, but the tongue reappears mid-kiss. Very arousing stuff, I'll say. Keep up the good work, Microsoft or Google or whoever made it.