Oddly, it still uses TensorFlow like the original GPT-2 release despite OpenAI's declared switch to PyTorch, and its dependencies are a mess, so it's not easy to create a wrapper tool for it.
Since it's still the GPT-2 architecture, it might be possible to port the weights to Huggingface Transformers (for the RGB generation), and then write a wrapper to extend it for the image rendering. (filed an issue here: https://github.com/huggingface/transformers/issues/5088 )
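A port like that mostly comes down to renaming checkpoint variables. As a rough sketch (the exact key layout on the Hugging Face side should be checked against `GPT2Model.state_dict()`, so treat the target names as assumptions), the TF-style names from the original GPT-2 release map to HF-style keys something like this:

```python
import re

def tf_to_hf_key(tf_name: str) -> str:
    """Map a GPT-2-style TF variable name to a Transformers-style key,
    e.g. 'model/h3/attn/c_attn/w' -> 'h.3.attn.c_attn.weight'."""
    name = tf_name.replace("model/", "")
    name = re.sub(r"h(\d+)", r"h.\1", name)   # layer index: h3 -> h.3
    name = name.replace("/", ".")
    name = re.sub(r"\.w$", ".weight", name)   # trailing w/b/g suffixes
    name = re.sub(r"\.b$", ".bias", name)
    name = re.sub(r"\.g$", ".weight", name)   # layer-norm gain -> weight
    return name
```

The iGPT-specific pieces (the 9-bit palette embedding and image rendering) would still need the extra wrapper code the parent comment mentions.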
> Oddly, it still uses TensorFlow like the original GPT-2 release despite OpenAI's declared switch to PyTorch
Note the footnote in the paper mentioning that evaluation was interrupted by the move to MS Azure. This is relatively old work, and since it's literally GPT-2 but for images, it's no surprise they didn't bother to rewrite it in PyTorch. If their next big from-scratch thing (presumably the multimodal, GPT-like, one-model-to-rule-them-all research I've been looking forward to ever since the TR article) is in TensorFlow, then I'll be surprised.
Unfortunately, there probably won’t be a next big-from-scratch thing unless it ties into their API somehow. It feels like we’ve seen the last of OpenAI open sourcing anything but toy models.
I'm still learning from deep learning papers and videos before dipping my toes in myself, is there a summary of why PyTorch vs TensorFlow? Does it matter for me?
Nowadays, there isn't a huge practical difference in terms of performance/tooling aside from edge cases and deployment options. It mostly depends on your syntax preference. (although there are flame wars from both sides)
I noted the TensorFlow usage because the original GPT-2 release was TensorFlow 1.X, which led to issues when TensorFlow 2.0 was released soon after.
For model training, I strongly recommend using the higher-level Keras API for TensorFlow or the PyTorch Lightning API for PyTorch rather than the respective base frameworks.
There is a huge difference in developer experience still. When TF fails it typically offers little to no clue as to what you can do to rectify the situation. PyTorch is a lot more helpful most of the time. PyTorch is worth choosing on this basis alone, unless you intend to deploy to mobile hardware, where TFLite is the only real viable option.
TF can compile to optimized C++. When you use PyTorch with TPU, it'll internally use the TF XLA backend. If you intend to deploy on mobile, TF is easy and PyTorch is pain. But TF has obscure bugs and error messages.
Disclosure: I work on Google Cloud (and know the OpenAI folks).
Fwiw, I think 2500 "V100-days" here is an extrapolation from roughly one day on a TPUv3-2048 pod. So approximately $8/hr * 24 hours * 2048 cores => ~$400k, if, you know, you could make use of that TPU pod for much of the rest of the year :).
But yes, this is still the realm of "Do you have millions of dollars of ML infrastructure" (if you want it quickly). I'm kind of hoping gwern et al. will use Colab to slowly train a variant of this for free :).
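Spelling out the arithmetic from the comment above (the $8/hr per-core rate and the one-day-on-2048-cores extrapolation are that comment's assumptions, not figures from the post):

```python
# Back-of-envelope TPU pod cost, using the comment's assumed numbers.
tpu_cores = 2048
price_per_core_hour = 8.0   # assumed $/core/hr
hours = 24                  # ~1 day of training

cost = tpu_cores * price_per_core_hour * hours  # 393,216 -> "~$400k"
```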
We've moved up in the world! These days we'd train on one of our TFRC TPU pods. Colab is a lot of trouble and better avoided...
(We actually can create a TPU-2048 pod. For a short while, anyway, before it preempts. But we wouldn't need to if we used lucidrain's efficient attention implementations I mention in my other comment.)
Is there a way to guarantee a specific GPU for Colab, and also have it run in the background without the runtime disconnecting? Even Colab Pro doesn't seem to guarantee a specific resource without the disconnects.
The problem with superresolution is that by design, they focus on details and textures for upscaling, and avoid adding in any large-scale structure. So if you train it separately (ie there is no backpropagation from the final highres 224px image through the superresolution model all the way back through iGPT), you cripple the original model.
There is only so much about objects and the world you can learn from 64px thumbnails, and they mention that this is probably a reason iGPT wins on the tiny images like CIFAR but loses to the semi-supervised CNNs on ImageNet: because the CNNs are compute-efficient enough that they can train at the standard 224px and see all of the details and structure that disappears at iGPT's 64px, and learn end-to-end.
If electricity cost $0.1/Wh, a cup of tea would cost ~$10 just to run the kettle. You've overestimated the cost of electricity by roughly a thousand-fold.
Residential electricity prices in California are more like 0.19 USD/kWh, which is 0.00019 USD/Wh.
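A quick sanity check of the kettle arithmetic (the 2 kW kettle and 3-minute boil are assumed illustrative numbers):

```python
# Cost of boiling water for one cup of tea, at the mistaken rate
# vs. a typical California residential rate.
kettle_kw = 2.0               # assumed kettle power draw
minutes = 3                   # assumed time to boil one cup
kwh_used = kettle_kw * minutes / 60          # 0.1 kWh

wrong_rate_per_wh = 0.10      # the mistaken $0.10/Wh ($100/kWh)
ca_rate_per_kwh = 0.19        # ~CA residential $/kWh

cost_wrong = kwh_used * 1000 * wrong_rate_per_wh   # ~$10 per cup
cost_real = kwh_used * ca_rate_per_kwh             # ~$0.02 per cup
```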
I wonder what the cost is when taking second order effects into the equation. Leverage hides the true cost, burying the reality further and further from the obvious.
I’ll have to do a writeup sometime explaining why these numbers aren’t as surprising as they seem. They’re also not as wasteful as they seem.
Roughly, the boxes would be turned on anyway. Might as well put them to work.
And yeah, it draws more power when in use (by a lot). But the data center probably isn’t paying a huge premium on top of what they would already pay for electricity.
So all that’s left is feeling generally queasy about using so much electricity, rather than a practical feeling of “this is bad because it costs a lot.”
Economies of scale have some dramatic effects at the high end: I doubt the datacenter's costs are anywhere near linear in the number of Wh they're consuming. And even if they are, it's still a rounding error considering the profitability of GoogSoftBook.
At some point it becomes linear, though; there is a base cost to electricity. Google builds datacenters near coal power plants specifically to get (literally) dirt-cheap electricity.
This is true. But researchers are divorced from those realities. The day-to-day researchers almost never have to think like "If I run this, will it cost a ton?" Those kinds of agreements are usually made on a company basis, i.e. there's no cost to just leave a model running and see what happens. (Which is a huge advantage, by the way.)
Just a correction here: we build datacenters near large sources of renewable / hydro plants (I consider hydro a borderline case). We buy renewable credits for everything and are even trying to timeshift load now to match renewable generation peaks.
I appreciate them buying credits, but as nice as those windmills and solar panels are, when the sun doesn't shine, and the wind doesn't blow, what gets burned is coal. Anyway I suppose that's more a policy issue of The Netherlands, I think it's quite shameful we still have coal plants. Those credits hide the real cost of renewable energy.
In case anyone looking through the linked article is also wondering why the images look odd, it's because (buried in the middle):
> motivated by early color display palettes, we create our own 9-bit color palette to represent pixels. Using this palette yields an input sequence length 3 times shorter than the standard (R, G, B) palette, while still encoding color faithfully.
Makes total sense now. In fact, those images remind me so much of how photos looked on the early internet, since many were palettized.
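For intuition: the paper builds its 512-color palette by clustering (R, G, B) values with k-means, but the crudest version of a 9-bit palette is just 3 bits per channel. Either way, each pixel becomes one token in 0..511 instead of three separate R, G, B tokens, which is where the 3x sequence-length saving comes from. A minimal sketch of the uniform-quantization variant (not the paper's k-means method):

```python
def to_9bit_token(r: int, g: int, b: int) -> int:
    """Quantize an 8-bit RGB pixel to a single 9-bit palette index,
    keeping the top 3 bits of each channel."""
    return ((r >> 5) << 6) | ((g >> 5) << 3) | (b >> 5)

def from_9bit_token(token: int) -> tuple:
    """Map a palette index back to (approximate) 8-bit RGB."""
    r = ((token >> 6) & 0b111) << 5
    g = ((token >> 3) & 0b111) << 5
    b = (token & 0b111) << 5
    return (r, g, b)
```

The coarse 32-level steps per channel are exactly why the samples have that early-web, palettized look.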
I'd be surprised if this architecture scales to larger resolutions, but any move towards "general learning" is really the interesting next step to me, not scaling up an inefficient architecture.
Can they train the same GPT model on both text and images tasks at once, and would either task benefit at all from training on the other task?
Even GPT-3 seems to have trouble with world-modeling, it writes convincing text that has all the signs and form of good prose, but the output repeatedly violates physics and common sense in funny ways.
I know just enough about machine learning to have dangerously unrealistic expectations, but I'd like if I could reasonably hope to see signs of a shared representation or shared knowledge between, say, image labeling and language modeling. This looks like a very concrete data point to take if you care about generality. Maybe then we can seriously talk about world-modeling.
Of course it's not going to scale much past this, it's quadratic and already hitting painful compute levels.
However, if you were starting this research today, you'd use any of half-a-dozen different self-attention variants which are roughly linear, including OA's own Sparse Transformers (which they did use to generate images, just on a far smaller scale which wouldn't be adequate to show competitive performance with SimCLR etc). With those, it's perfectly possible to do self-attention over whole images at 256px or higher.
(As a matter of fact, Aydao has been working on a StyleGAN which just uses self-attention every layer instead of convolutions, using one of the new attentions; you can see some generated image samples from Flowers here: https://github.com/tensorfork/tensorfork/issues/31 GAN loss, not autoregressive pixel likelihood, but it makes the point.)
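To put numbers on the quadratic blow-up: with one token per pixel (as with the 9-bit palette), dense self-attention cost grows with the square of the pixel count, so a modest resolution bump is brutal:

```python
def attn_pairs(resolution: int) -> int:
    """Number of token pairs dense self-attention must score
    for a square image with one token per pixel."""
    n_tokens = resolution * resolution
    return n_tokens ** 2

# 64px  -> 4,096 tokens  -> ~16.8M pairs
# 256px -> 65,536 tokens -> ~4.3B pairs
# i.e. 16x the tokens costs 256x the attention compute,
# which is what the roughly-linear attention variants avoid.
```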
It's been done. For example it is possible to take an image through a CNN, generate box proposals and make a list of embeddings extracted from the boxes. That would be a bag of visual tokens. Then continue the sequence with text tokens and train it to solve a reasoning task.
This is the question I want answered. I feel that in the human brain, many words evoke imagery. For instance, when I hear the word bleak, I think of a grey setting
Upscaling NNs already exist, but that's beside the point. This is a demonstration that no domain-specific knowledge needs to be encoded in your architecture. Manually wiring different NNs together to split a task into separate subtasks would be the opposite.
Far more interesting would be general training optimizations that could train GPT faster on any kind of task.
What an excellent video. I know very little about this field but I was able to make sense of the things you've explained. I need to learn some more fundamentals I guess, but I'll surely be revisiting your video after that.
Attention is all you need is the title of the 2017 paper that introduced the Transformer architecture.
It reflects the evolution of NLP models: (RNN) —> (RNN + attention) —> (attention) and the idea that you don’t need the recurrent component. Just look at all elements of the sequence at once, applying varying weights (attention).
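The core operation from that paper, with no recurrence anywhere, is scaled dot-product attention: every position looks at every other position at once, weighted by a softmax over similarities. A minimal NumPy sketch:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention over a whole sequence at once."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                                 # weighted mix of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))   # 5 sequence positions, 8-dim queries
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 8))
out = attention(Q, K, V)      # one output vector per position
```

A real Transformer adds learned projections, multiple heads, and feed-forward layers around this, but the "look at everything at once" part is just these few lines.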
"As further proof, features from the model achieve state-of-the-art performance on a number of classification datasets and near state-of-the-art unsupervised accuracy on ImageNet."
Impressive stuff! This performs well even without domain-specific architecture choices.
One more step for ML. It used to be that we needed hand designed image features. Now we can learn even the image priors (spatial locality and translation invariance) from data.
Transformers are basically learning relations between pairs of input tokens, moving the problem to a more abstract level than predicting directly on tokens. While CNNs excel at benefiting from those two forms of invariance, transformers have permutation invariance, they can predict on sets, graphs and non-euclidean spaces.
> Now we can learn even the image priors (spatial locality and translation invariance) from data.
Right. The attention layers even learn attention patterns which look like convolution layer kernels! But better, presumably: https://arxiv.org/abs/1911.03584
The post states, "When we train GPT-2 on images unrolled into long sequences of pixels, which we call iGPT, we find that the model appears to understand 2-D image characteristics such as object appearance and category." This, to me (former philosophy student) feels like a very low bar for "understanding." Could the model _explain_ 2D image characteristics, or can it only generate them? I'm sure this debate will rage on for a while, but when it comes to intelligence, I believe we ought to be more rigorous with our use of the word.
Author here. You're absolutely right that "understanding" is a fuzzy word. As you pointed out, part of the reason we hold this belief is that the model can generate diverse samples and successfully complete out-of-distribution inputs. But the other part is that the model learns (without labels) useful features for classifying objects. Would be very interesting to test it on a broader set of datasets which measure other 2D image characteristics.
I work in computer vision; "understanding" here is meant to mean there is a statistical understanding. The model better represents the distribution of things that look like real images. I must say these examples are incredibly good at generating coherent scenes compared to any similar attempts I've seen. I do think it represents an advance in that kind of understanding.
I think your definition of "understanding" is more the realm of AI, and why so many people in my field sigh when the term "AI" is used when we're really just talking about ML. But then again, you could absolutely in the current day train an ML model to look at images and then produce a natural language "explanation" of the image. It might not be able to make leaps in deductive logic, but it would be able to explain more than just a list of what is in the image. Is this "understanding"? Maybe that question is philosophy.
A convolutional net learns kernels that most people would say display some understanding over the domain of images as well. At points they might even serve as explanations. I think taking "understanding" to be "able to explain rigorously" is the highest bar possible here. A lot of people understand concepts that they are unable to explain. Animals understand a lot but lack a way of explaining things as well.
This may be my imagination, or maybe it’s just because the images are so small, but these seem to have fewer of the slightly disturbing artifacts you see in CNN generated images.
There's some definite overfitting apparent in the completion of the bottom half of the cat (blue bkg) photo. Every completion has a funny index card covering the bottom, including the "original"; autocompletion should be more robust than just recreating the original photo.
You might take a look at The Bitter Lesson [1], it's referenced by the article and linked around on this thread.
> One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning.
> The second general point to be learned from the bitter lesson is that the actual contents of minds are tremendously, irredeemably complex; we should stop trying to find simple ways to think about the contents of minds, such as simple ways to think about space, objects, multiple agents, or symmetries.
I did take a look at the article before writing my comment, and I disagree with its premise as well as its conclusion. The human mind has a variety of specialized functions. Our visual cortex does a sort of convolution on a 2D field fed by three kinds of color sensors and one kind of luminance sensor (cones and rods). If specialization were less powerful than generality, why isn't our brain one giant lobe with no diversity in neuron topography?
The idea that specialization is not as powerful as computation fails the most basic test of a proactive, rather than retroactive, theory. Can you make proactive claims about what works in any given domain? Is the solution to take the hungriest algorithm and apply it? What about feature engineering, cleaning, parameter tuning, analysis, etc.? Is the most power hungry solution still the most effective? In my opinion, part of the reason humans aren’t just giant computation blobs is that we thrive on constraints (physical, sexual, emotional).
There’s a difference between creativity and elegance! Elegance is more about the beauty of the solution rather than the value of the outcome. The power of a blunt instrument is impotent at best, destructive at worst.
Well, there is a whole separate line of research concerning the topic of these input perturbations, ranging from PGD to just Gaussian noise. This model does not claim to defend against any of those.
Author here. I ran some early experiments a while ago, and it looked like adversarial examples for convnet classifiers didn't transfer to transformer classifiers and vice versa. Definitely worth looking more into!
I didn't notice any obvious visual differences, but I'm also not an expert on adversarial examples. The transformer models were similarly susceptible to attacks, but while adversarial examples transferred well within a model class (~40%), they did not across model classes (~5%). These are rough numbers from memory, don't hold me accountable!
We now better understand the capabilities of the transformer, which has been the hottest thing in AI for the last 2 years. The transformer apparently can learn images even if nobody tells it about the properties of space. CNNs, on the other hand, rely heavily on these properties (spatial locality and translation invariance).
I can imagine that in 15-25 years, consuming too much content, esp. "AI"-generated content, will have guidelines around it. The possibilities with this stuff are just starting; it could 'overload' YouTube and other hosting services once 'creators' begin to normalize its power (10 yrs from now there's an easy Photoshop/Adobe plugin to generate random video/image scenes with actors and generated voices/movement (animated rigs a la Mixamo)).
By extracting features from the image (that is, the encoding, as in transfer learning in computer vision or natural language processing with, for instance, VGG or BERT respectively) and feeding those to the classifier.
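The recipe above (a "linear probe") can be sketched without any real pretrained model: freeze an encoder, use it only to map inputs to feature vectors, and fit a simple classifier on top. Here `frozen_encoder` is a stand-in (a fixed random projection), not VGG/BERT/iGPT, and the toy data and nearest-centroid classifier are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
W_frozen = rng.normal(size=(32, 8))     # pretend pretrained weights, never trained here

def frozen_encoder(x):
    """Stand-in for the frozen feature extractor."""
    return np.tanh(x @ W_frozen)

# Two toy classes, well separated in input space.
x0 = rng.normal(loc=-1.0, scale=0.1, size=(50, 32))
x1 = rng.normal(loc=+1.0, scale=0.1, size=(50, 32))
f0, f1 = frozen_encoder(x0), frozen_encoder(x1)

# The "classifier on top": nearest class centroid in feature space.
c0, c1 = f0.mean(axis=0), f1.mean(axis=0)

def classify(x):
    f = frozen_encoder(x)
    return int(np.linalg.norm(f - c1) < np.linalg.norm(f - c0))
```

Swapping the random projection for a real pretrained network and the centroid rule for logistic regression gives the standard transfer-learning setup the comment describes.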