Oddly, it still uses TensorFlow like the original GPT-2 release despite OpenAI's declared switch to PyTorch, and its dependencies are a mess, so it's not easy to create a wrapper tool for it.
Since it's still the GPT-2 architecture, it might be possible to port the weights to Huggingface Transformers (for the RGB generation), and then write a wrapper to extend it for the image rendering. (filed an issue here: https://github.com/huggingface/transformers/issues/5088 )
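A port like that mostly comes down to renaming checkpoint variables. As a rough sketch (the exact key layout on the Hugging Face side should be checked against `GPT2Model.state_dict()`, so treat the target names as assumptions), the TF-style names from the original GPT-2 release map to HF-style keys something like this:

```python
import re

def tf_to_hf_key(tf_name: str) -> str:
    """Map a GPT-2-style TF variable name to a Transformers-style key,
    e.g. 'model/h3/attn/c_attn/w' -> 'h.3.attn.c_attn.weight'."""
    name = tf_name.replace("model/", "")
    name = re.sub(r"h(\d+)", r"h.\1", name)   # layer index: h3 -> h.3
    name = name.replace("/", ".")
    name = re.sub(r"\.w$", ".weight", name)   # trailing w/b/g suffixes
    name = re.sub(r"\.b$", ".bias", name)
    name = re.sub(r"\.g$", ".weight", name)   # layer-norm gain -> weight
    return name
```

The iGPT-specific pieces (the 9-bit palette embedding and image rendering) would still need the extra wrapper code the parent comment mentions.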
> Oddly, it still uses TensorFlow like the original GPT-2 release despite OpenAI's declared switch to PyTorch
Note the footnote in the paper mentioning that evaluation was interrupted by the move to MS Azure. This is relatively old work, and since it's literally GPT-2 but for images, it's no surprise they didn't bother to rewrite it in PyTorch. If their next big from-scratch thing (presumably the multimodal, GPT-like, one-model-to-rule-them-all research I've been looking forward to ever since the TR article) is in TensorFlow, then I'll be surprised.
Unfortunately, there probably won’t be a next big-from-scratch thing unless it ties into their API somehow. It feels like we’ve seen the last of OpenAI open sourcing anything but toy models.
I'm still learning from deep learning papers and videos before dipping my toes in myself, is there a summary of why PyTorch vs TensorFlow? Does it matter for me?
Nowadays, there isn't a huge practical difference in terms of performance/tooling aside from edge cases and deployment options. It mostly depends on your syntax preference. (although there are flame wars from both sides)
I noted the TensorFlow usage because the original GPT-2 release was TensorFlow 1.X, which led to issues when TensorFlow 2.0 was released soon after.
For model training, I strongly recommend using the higher-level Keras API for TensorFlow or the PyTorch Lightning API for PyTorch rather than the respective base frameworks.
There is a huge difference in developer experience still. When TF fails it typically offers little to no clue as to what you can do to rectify the situation. PyTorch is a lot more helpful most of the time. PyTorch is worth choosing on this basis alone, unless you intend to deploy to mobile hardware, where TFLite is the only real viable option.
TF can compile to optimized C++. When you use PyTorch with TPU, it'll internally use the TF XLA backend. If you intend to deploy on mobile, TF is easy and PyTorch is pain. But TF has obscure bugs and error messages.
Disclosure: I work on Google Cloud (and know the OpenAI folks).
Fwiw, I think 2500 "V100-days" here is an extrapolation from roughly one day on a TPUv3-2048 pod. So approximately $8/hr * 24 hours * 2048 cores => ~$400k, if, you know, you could make use of that TPU pod for much of the rest of the year :).
But yes, this is still the realm of "Do you have millions of dollars of ML infrastructure" (if you want it quickly). I'm kind of hoping gwern et al. will use Colab to slowly train a variant of this for free :).
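Spelling out the arithmetic from the comment above (the $8/hr per-core rate and the one-day-on-2048-cores extrapolation are that comment's assumptions, not figures from the post):

```python
# Back-of-envelope TPU pod cost, using the comment's assumed numbers.
tpu_cores = 2048
price_per_core_hour = 8.0   # assumed $/core/hr
hours = 24                  # ~1 day of training

cost = tpu_cores * price_per_core_hour * hours  # 393,216 -> "~$400k"
```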
We've moved up in the world! These days we'd train on one of our TFRC TPU pods. Colab is a lot of trouble and better avoided...
(We actually can create a TPU-2048 pod. For a short while, anyway, before it preempts. But we wouldn't need to if we used lucidrain's efficient attention implementations I mention in my other comment.)
Is there a way to guarantee a specific GPU for Colab, and also have it run in the background without the runtime disconnecting? Even Colab Pro doesn't seem to guarantee a specific resource without the disconnects.
The problem with superresolution is that by design, they focus on details and textures for upscaling, and avoid adding in any large-scale structure. So if you train it separately (ie there is no backpropagation from the final highres 224px image through the superresolution model all the way back through iGPT), you cripple the original model.
There is only so much about objects and the world you can learn from 64px thumbnails, and they mention that this is probably a reason iGPT wins on the tiny images like CIFAR but loses to the semi-supervised CNNs on ImageNet: because the CNNs are compute-efficient enough that they can train at the standard 224px and see all of the details and structure that disappears at iGPT's 64px, and learn end-to-end.
If electricity cost $0.1/Wh, a cup of tea would cost ~$10 just to run the kettle. You've overestimated the cost of electricity by roughly a thousand-fold.
Residential electricity prices in California are more like 0.19 USD/kWh, which is 0.00019 USD/Wh.
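A quick sanity check of the kettle arithmetic (the 2 kW kettle and 3-minute boil are assumed illustrative numbers):

```python
# Cost of boiling water for one cup of tea, at the mistaken rate
# vs. a typical California residential rate.
kettle_kw = 2.0               # assumed kettle power draw
minutes = 3                   # assumed time to boil one cup
kwh_used = kettle_kw * minutes / 60          # 0.1 kWh

wrong_rate_per_wh = 0.10      # the mistaken $0.10/Wh ($100/kWh)
ca_rate_per_kwh = 0.19        # ~CA residential $/kWh

cost_wrong = kwh_used * 1000 * wrong_rate_per_wh   # ~$10 per cup
cost_real = kwh_used * ca_rate_per_kwh             # ~$0.02 per cup
```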
I wonder what the cost is when taking second order effects into the equation. Leverage hides the true cost, burying the reality further and further from the obvious.
I’ll have to do a writeup sometime explaining why these numbers aren’t as surprising as they seem. They’re also not as wasteful as they seem.
Roughly, the boxes would be turned on anyway. Might as well put them to work.
And yeah, it draws more power when in use (by a lot). But the data center probably isn’t paying a huge premium on top of what they would already pay for electricity.
So all that’s left is feeling generally queasy about using so much electricity, rather than a practical feeling of “this is bad because it costs a lot.”
Economies of scale have some dramatic effects at the high end: I doubt the datacenter's costs are anywhere near linear in the number of Wh they're consuming. And even if they are, it's still a rounding error considering the profitability of GoogSoftBook.
At some point it becomes linear, though; there is a base cost to electricity. Google builds datacenters near coal power plants specifically to get (literally) dirt-cheap electricity.
This is true. But researchers are divorced from those realities. The day-to-day researchers almost never have to think like "If I run this, will it cost a ton?" Those kinds of agreements are usually made on a company basis, i.e. there's no cost to just leave a model running and see what happens. (Which is a huge advantage, by the way.)
Just a correction here: we build datacenters near large sources of renewable / hydro plants (I consider hydro a borderline case). We buy renewable credits for everything and are even trying to timeshift load now to match renewable generation peaks.
I appreciate them buying credits, but as nice as those windmills and solar panels are, when the sun doesn't shine, and the wind doesn't blow, what gets burned is coal. Anyway I suppose that's more a policy issue of The Netherlands, I think it's quite shameful we still have coal plants. Those credits hide the real cost of renewable energy.
In case anyone looking through the linked article is also wondering why the images look odd, it's because (buried in the middle):
> motivated by early color display palettes, we create our own 9-bit color palette to represent pixels. Using this palette yields an input sequence length 3 times shorter than the standard (R, G, B) palette, while still encoding color faithfully.
Makes total sense now. In fact, those images remind me so much of how photos looked on the early internet, since many were palettized.
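For intuition: the paper builds its 512-color palette by clustering (R, G, B) values with k-means, but the crudest version of a 9-bit palette is just 3 bits per channel. Either way, each pixel becomes one token in 0..511 instead of three separate R, G, B tokens, which is where the 3x sequence-length saving comes from. A minimal sketch of the uniform-quantization variant (not the paper's k-means method):

```python
def to_9bit_token(r: int, g: int, b: int) -> int:
    """Quantize an 8-bit RGB pixel to a single 9-bit palette index,
    keeping the top 3 bits of each channel."""
    return ((r >> 5) << 6) | ((g >> 5) << 3) | (b >> 5)

def from_9bit_token(token: int) -> tuple:
    """Map a palette index back to (approximate) 8-bit RGB."""
    r = ((token >> 6) & 0b111) << 5
    g = ((token >> 3) & 0b111) << 5
    b = (token & 0b111) << 5
    return (r, g, b)
```

The coarse 32-level steps per channel are exactly why the samples have that early-web, palettized look.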
I'd be surprised if this architecture scales to larger resolutions, but any move towards "general learning" is really the interesting next step to me, not scaling up an inefficient architecture.
Can they train the same GPT model on both text and images tasks at once, and would either task benefit at all from training on the other task?
Even GPT-3 seems to have trouble with world-modeling, it writes convincing text that has all the signs and form of good prose, but the output repeatedly violates physics and common sense in funny ways.
I know just enough about machine learning to have dangerously unrealistic expectations, but I'd like if I could reasonably hope to see signs of a shared representation or shared knowledge between, say, image labeling and language modeling. This looks like a very concrete data point to take if you care about generality. Maybe then we can seriously talk about world-modeling.
Of course it's not going to scale much past this, it's quadratic and already hitting painful compute levels.
However, if you were starting this research today, you'd use any of half-a-dozen different self-attention variants which are roughly linear, including OA's own Sparse Transformers (which they did use to generate images, just on a far smaller scale which wouldn't be adequate to show competitive performance with SimCLR etc). With those, it's perfectly possible to do self-attention over whole images at 256px or higher.
(As a matter of fact, Aydao has been working on a StyleGAN which just uses self-attention every layer instead of convolutions, using one of the new attentions; you can see some generated image samples from Flowers here: https://github.com/tensorfork/tensorfork/issues/31 GAN loss, not autoregressive pixel likelihood, but it makes the point.)
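To put numbers on the quadratic blow-up: with one token per pixel (as with the 9-bit palette), dense self-attention cost grows with the square of the pixel count, so a modest resolution bump is brutal:

```python
def attn_pairs(resolution: int) -> int:
    """Number of token pairs dense self-attention must score
    for a square image with one token per pixel."""
    n_tokens = resolution * resolution
    return n_tokens ** 2

# 64px  -> 4,096 tokens  -> ~16.8M pairs
# 256px -> 65,536 tokens -> ~4.3B pairs
# i.e. 16x the tokens costs 256x the attention compute,
# which is what the roughly-linear attention variants avoid.
```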
It's been done. For example it is possible to take an image through a CNN, generate box proposals and make a list of embeddings extracted from the boxes. That would be a bag of visual tokens. Then continue the sequence with text tokens and train it to solve a reasoning task.
This is the question I want answered. I feel that in the human brain, many words evoke imagery. For instance, when I hear the word bleak, I think of a grey setting
Upscaling NNs already exist, but that's beside the point. This is a demonstration that no domain-specific knowledge needs to be encoded in your architecture. Manually wiring different NNs together to split a task into separate subtasks would be the opposite.
Far more interesting would be general training optimizations that could train GPT faster on any kind of task.
What an excellent video. I know very little about this field but I was able to make sense of the things you've explained. I need to learn some more fundamentals I guess, but I'll surely be revisiting your video after that.
Attention is all you need is the title of the 2017 paper that introduced the Transformer architecture.
It reflects the evolution of NLP models: (RNN) —> (RNN + attention) —> (attention) and the idea that you don’t need the recurrent component. Just look at all elements of the sequence at once, applying varying weights (attention).
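The core operation from that paper, with no recurrence anywhere, is scaled dot-product attention: every position looks at every other position at once, weighted by a softmax over similarities. A minimal NumPy sketch:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention over a whole sequence at once."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                                 # weighted mix of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))   # 5 sequence positions, 8-dim queries
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 8))
out = attention(Q, K, V)      # one output vector per position
```

A real Transformer adds learned projections, multiple heads, and feed-forward layers around this, but the "look at everything at once" part is just these few lines.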
"As further proof, features from the model achieve state-of-the-art performance on a number of classification datasets and near state-of-the-art unsupervised accuracy on ImageNet."
Impressive stuff! This performs well even without domain-specific architecture choices.
One more step for ML. It used to be that we needed hand designed image features. Now we can learn even the image priors (spatial locality and translation invariance) from data.
Transformers are basically learning relations between pairs of input tokens, moving the problem to a more abstract level than predicting directly on tokens. While CNNs excel at benefiting from those two forms of invariance, transformers have permutation invariance, they can predict on sets, graphs and non-euclidean spaces.
> Now we can learn even the image priors (spatial locality and translation invariance) from data.
Right. The attention layers even learn attention patterns which look like convolution layer kernels! But better, presumably: https://arxiv.org/abs/1911.03584
The post states, "When we train GPT-2 on images unrolled into long sequences of pixels, which we call iGPT, we find that the model appears to understand 2-D image characteristics such as object appearance and category." This, to me (former philosophy student) feels like a very low bar for "understanding." Could the model _explain_ 2D image characteristics, or can it only generate them? I'm sure this debate will rage on for a while, but when it comes to intelligence, I believe we ought to be more rigorous with our use of the word.
Author here. You're absolutely right that "understanding" is a fuzzy word. As you pointed out, part of the reason we hold this belief is that the model can generate diverse samples and successfully complete out-of-distribution inputs. But the other part is that the model learns (without labels) useful features for classifying objects. Would be very interesting to test it on a broader set of datasets which measure other 2D image characteristics.
I work in computer vision; "understanding" here is meant to mean there is a statistical understanding. The model better represents the distribution of things that look like real images. I must say these examples are incredibly good at generating coherent scenes compared to any similar attempts I've seen. I do think it represents an advance in that kind of understanding.
I think your definition of "understanding" is more the realm of AI, and why so many people in my field sigh when the term "AI" is used when we're really just talking about ML. But then again, you could absolutely in the current day train an ML model to look at images and then produce a natural language "explanation" of the image. It might not be able to make leaps in deductive logic, but it would be able to explain more than just a list of what is in the image. Is this "understanding"? Maybe that question is philosophy.
A convolutional net learns kernels that most people would say display some understanding over the domain of images as well. At points they might even serve as explanations. I think taking "understanding" to be "able to explain rigorously" is the highest bar possible here. A lot of people understand concepts that they are unable to explain. Animals understand a lot but lack a way of explaining things as well.
This may be my imagination, or maybe it’s just because the images are so small, but these seem to have fewer of the slightly disturbing artifacts you see in CNN generated images.
There's some definite overfitting apparent in the completion of the bottom half of the cat (blue bkg) photo. Every completion has a funny index card covering the bottom, including the "original"; autocompletion should be more robust than just recreating the original photo.
You might take a look at The Bitter Lesson [1], it's referenced by the article and linked around on this thread.
> One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning.
> The second general point to be learned from the bitter lesson is that the actual contents of minds are tremendously, irredeemably complex; we should stop trying to find simple ways to think about the contents of minds, such as simple ways to think about space, objects, multiple agents, or symmetries.
I did take a look at the article before writing my comment, and I disagree with its premise as well as its conclusion. The human mind has a variety of specialized functions. Our visual cortex does a sort of convolution on a 2D field fed by three kinds of color sensors and one kind of luminance sensor (cones and rods). If specialization were less powerful than generality, why isn't our brain one giant lobe with no diversity in neuron topography?
The idea that specialization is not as powerful as computation fails the most basic test of a proactive, rather than retroactive, theory. Can you make proactive claims about what works in any given domain? Is the solution to take the hungriest algorithm and apply it? What about feature engineering, cleaning, parameter tuning, analysis, etc.? Is the most power hungry solution still the most effective? In my opinion, part of the reason humans aren’t just giant computation blobs is that we thrive on constraints (physical, sexual, emotional).
There’s a difference between creativity and elegance! Elegance is more about the beauty of the solution rather than the value of the outcome. The power of a blunt instrument is impotent at best, destructive at worst.
Well, there is a whole separate line of research concerning the topic of these input perturbations, ranging from PGD to just Gaussian noise. This model does not claim to defend against any of those.
Author here. I ran some early experiments a while ago, and it looked like adversarial examples for convnet classifiers didn't transfer to transformer classifiers and vice versa. Definitely worth looking more into!
I didn't notice any obvious visual differences, but I'm also not an expert on adversarial examples. The transformer models were similarly susceptible to attacks, but while adversarial examples transferred well within a model class (~40%), they did not across model classes (~5%). These are rough numbers from memory, don't hold me accountable!
We now better understand the capabilities of the transformer, which has been the hottest thing in AI for the last 2 years. The transformer apparently can learn images even if nobody tells it about the properties of space. CNNs, on the other hand, rely heavily on these properties (spatial locality and translation invariance).
I can imagine that in 15-25 years, consuming too much content, esp. "AI"-generated content, will have guidelines around it. The possibilities with this stuff are just starting; it could 'overload' YouTube and other hosting services once 'creators' begin to normalize its power (10 yrs from now there's an easy Photoshop/Adobe plugin to generate random video/image scenes with actors and generated voices/movement (animated rigs a la Mixamo)).
By extracting features from the image (that is, the encoding, as in transfer learning in computer vision or natural language processing with, for instance, VGG or BERT respectively) and feeding those to the classifier.
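The recipe above (a "linear probe") can be sketched without any real pretrained model: freeze an encoder, use it only to map inputs to feature vectors, and fit a simple classifier on top. Here `frozen_encoder` is a stand-in (a fixed random projection), not VGG/BERT/iGPT, and the toy data and nearest-centroid classifier are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
W_frozen = rng.normal(size=(32, 8))     # pretend pretrained weights, never trained here

def frozen_encoder(x):
    """Stand-in for the frozen feature extractor."""
    return np.tanh(x @ W_frozen)

# Two toy classes, well separated in input space.
x0 = rng.normal(loc=-1.0, scale=0.1, size=(50, 32))
x1 = rng.normal(loc=+1.0, scale=0.1, size=(50, 32))
f0, f1 = frozen_encoder(x0), frozen_encoder(x1)

# The "classifier on top": nearest class centroid in feature space.
c0, c1 = f0.mean(axis=0), f1.mean(axis=0)

def classify(x):
    f = frozen_encoder(x)
    return int(np.linalg.norm(f - c1) < np.linalg.norm(f - c0))
```

Swapping the random projection for a real pretrained network and the centroid rule for logistic regression gives the standard transfer-learning setup the comment describes.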